Data matching is complicated, time-consuming, and prone to mistakes. It involves navigating a myriad of variables, handling dirty data, and understanding the nuances of the dataset, which can cause even the most skilled data specialists undue stress and anxiety!
So if you want accurate results without losing your cool, here are some common fuzzy data match mistakes to watch out for. You can’t avoid every mistake, so don’t fret too much, but you can prevent most of them by following some best practices when using fuzzy matching techniques on your data.
Here goes.
Most businesses perform basic ETL (extract, transform, load) operations on their data before using the data for business purposes. But basic ETL processes don’t cut it anymore.
Modern data doesn’t come from a single source, nor does it arrive in a single, clean structure. You have data streaming in from web forms, point-of-sale systems, and third-party sources. All of these sources introduce human errors, different formats, and different structures. Most importantly, they lack data quality controls, and the people handling data entry are rarely trained in data quality.
The result? High levels of dirty data.
The mistake? Neglecting dirty data or treating it superficially.
Here’s what typically happens when a dataset is required for a business purpose like sales forecasting.
A junior IT resource is assigned the task. They export the data into Excel and perform basic ETL operations: standardize the lettering (for example, turn all UPPERCASE LETTERS into lowercase), look for exact duplicates in the records, and remove them. Once done, the file is saved and sent to business users. Job done!
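To make this concrete, here is a minimal sketch of what that superficial cleanup typically looks like in pandas (the file and column names are hypothetical):

```python
import pandas as pd

# Load the exported customer list (file and column names are hypothetical).
df = pd.read_csv("customers_export.csv")

# "Standardize" the text: lowercase and trim whitespace.
df["name"] = df["name"].str.lower().str.strip()

# Remove exact duplicates only.
df = df.drop_duplicates()

df.to_csv("customers_clean.csv", index=False)

# Note: "john smith" and "jon smithe" both survive this step, because
# drop_duplicates() removes byte-for-byte identical rows, not fuzzy duplicates.
```

This is exactly the trap: exact deduplication looks like a finished job while leaving every fuzzy duplicate untouched.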
When the business user gets this file, they will have a tough time corroborating the data with their own lists or records. From here, a tug-of-war begins. Eventually, business users accept what they’re given because they don’t know any better, and go ahead with the flawed data.
It’s only when this data causes a crisis (such as false insights) that senior IT resources get involved. At this point, there is an impulsive reaction to go in and make drastic fixes, or to outsource data matching to an expensive specialist.
You can avoid all of this if you simply follow a basic protocol: clean, deduplicate, and match your data before it is handed over for business use.
Remember, dirty data is the leading cause of false match results, which in turn lead to inaccurate insights and predictions. It also costs significant time and resources: outsourcing, fixing, and matching the data can take weeks, if not months. Additionally, sharing the data with an external party means you cannot control data privacy leaks if they happen. Using an on-premises data cleaning and data matching tool is a better way to avoid costly mistakes, clean your data, and make sure it is error-free before it is used for business purposes.
Another common mistake to avoid is failing to consider the data’s context. In an insightful webinar with the WinPure team, Libba Stanford, a renowned data strategist, says: ‘Companies push for more data without addressing problems. They don’t speak to frontline workers collecting the data and sometimes forget about the whole context of the data. Frontline workers collect and experience the data, and have that full picture of information that companies can question and pull answers from and better strategize and build out a plan. However, there is very little communication with frontline workers. Again, there is no problem-solving. The focus is merely on process or information improvement.’
Libba is absolutely correct! Let’s look at a scenario:
A junior data analyst may believe that using the full form of abbreviated company names (IBM, UPS, BMW) is better, and decide to “standardize” all of them without consulting the business team. However, if the HR corporate list stores this data as “IBM Inc.,” the two records will no longer match. The analyst may only realize the mistake when the two lists have to be compared for a business purpose, and by then it has become a critical data match bottleneck.
In data matching, context matters significantly. If a data analyst doesn’t know the context of the content they’re trying to match, they will not be able to detect false positives and negatives. In fact, they may even merge, purge, or create false golden records! This is a problem that eventually causes downstream applications to fail – often catastrophically.
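As a rough illustration using the open-source rapidfuzz library (a stand-in for whatever matcher you actually use), expanding the abbreviation moves the records further apart, not closer:

```python
from rapidfuzz import fuzz

hr_record = "IBM Inc."                          # as stored in HR's corporate list
expanded = "International Business Machines"   # the analyst's "standardized" value

print(fuzz.ratio("IBM", hr_record))       # moderate similarity
print(fuzz.ratio(expanded, hr_record))    # far lower: the match is now lost
```

The exact scores depend on the algorithm, but the direction is clear: a unilateral “standardization” made the same company look like two different entities.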
The next mistake: setting the wrong fuzzy match level. This may seem inconsequential, but in the grand scheme of a data matching exercise, it can derail the whole match.
A fuzzy match level is a threshold you set manually to determine how similar two strings must be to count as a match. In simple terms, if you want to know whether “Rob Jone” and “Robe Jons” are a match, you need to set the threshold to a certain percentage. Set the wrong threshold and you’re asking the algorithm to cluster records that aren’t duplicates but are simply similar at the string level (which is also why knowing the context of your data matters).
A lower threshold allows for more flexibility, potentially capturing more matches but also introducing more false positives. Conversely, a higher threshold increases precision but may miss some valid matches.
We suggest using a 90% threshold for fuzzy match projects, but it depends on your specific needs and goals.
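In code, the threshold is just a cutoff applied to a similarity score. Here is a minimal sketch using the open-source rapidfuzz library (the names and thresholds are illustrative):

```python
from rapidfuzz import fuzz

def is_match(a: str, b: str, threshold: float = 90.0) -> bool:
    """Treat two strings as a match when their similarity clears the threshold."""
    return fuzz.ratio(a, b) >= threshold

score = fuzz.ratio("Rob Jone", "Robe Jons")
print(f"similarity: {score:.1f}")             # a 0-100 score
print(is_match("Rob Jone", "Robe Jons", 80))  # permissive threshold: True
print(is_match("Rob Jone", "Robe Jons", 95))  # strict threshold: False
```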
Here are examples of what you might get when setting a fuzzy level at 85%, 90%, and 95%:
At 85%:
At a fuzzy level of 85%, the match process is more permissive, allowing for a broader range of similarity between records. This is good if you want to get an overview of your data and the level of duplication. If the results cluster records that are not duplicates but simply have similar strings, you can tighten the match threshold.
For example:
The algorithm might match “John Smith” in Set A with “Jon Smithe” in Set B, even though the spellings differ and one contains a typo. These two contacts may not be the same person, but the system still flags them as a match. Applied to a larger database with more variations, this threshold can produce thousands of false positives, derailing the match exercise completely.
At 90%:
This level strikes a balance between capturing valid matches and reducing false positives. You will still see false positives (you can never get rid of them entirely!), but fewer than with an 85% threshold. A 90% threshold works well for tasks with moderate variations, making it effective for matching data with common discrepancies such as typos, abbreviations, or minor spelling differences.
For example:
In a product catalog, you are matching product names from different suppliers. With a 90% fuzzy level, the algorithm matches “Samsung Galaxy S21” with “Samsung Galaxy S21 5G,” as it recognizes the high similarity but may not match “Samsung Galaxy S21” with “Samsung Z Fold” due to the level of “dissimilarity.”
At 95%:
Matches are more likely to be accurate, resulting in fewer false positives; however, the algorithm becomes sensitive to even minor differences. At 95%, you are close to exact matching, so the algorithm only flags strings that are nearly or exactly identical. For tasks where precision is critical, such as fraud detection, a higher threshold like 95% may be appropriate.
For example:
With a 95% fuzzy level, the algorithm may match “Robert Johnson” with “Robert Johson,” where the difference is minimal, but not “Robert Johnson” with “Rob Johnson,” where the difference is more significant.
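To see the effect end to end, here is a hedged sketch that scores the example pairs above at each threshold (exact scores will vary with the similarity algorithm your tool uses):

```python
from rapidfuzz import fuzz

pairs = [
    ("John Smith", "Jon Smithe"),
    ("Samsung Galaxy S21", "Samsung Galaxy S21 5G"),
    ("Samsung Galaxy S21", "Samsung Z Fold"),
    ("Robert Johnson", "Robert Johson"),
    ("Robert Johnson", "Rob Johnson"),
]

for a, b in pairs:
    score = fuzz.ratio(a, b)
    verdicts = {t: score >= t for t in (85, 90, 95)}
    print(f"{a!r} vs {b!r}: score {score:.1f}, match at 85/90/95 -> {verdicts}")
```

Running a sample like this against each candidate threshold is a cheap way to sanity-check the setting before committing to a full match run.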
However, the choice of threshold ultimately depends on the context of the data, your data match goals, and your specific use case. The WinPure data match solution sets the fuzzy match level to 90% by default, but users are free to tweak the thresholds and review the different match results. If you were to do this with hand-written Python fuzzy match scripts, you would spend a significant amount of time tweaking and tuning the algorithm, and be prone to mistakes along the way.
Right, so we covered data cleaning, data context, and setting a data match threshold. Next up, a classic mistake we see customers make is with configuration settings.
Let’s see this with a simple example:
| Record ID | Name | Address | Email | Preference |
|---|---|---|---|---|
| 101 | John Doe | 123 Main St. | johndoe@email.com | Music, Cooking |
| 101 | Jon Doe | 245 Main Street | jondoe@email.com | Music, Cooking |
| 102 | Jane Smith | 456 Maple Ave. | janesmith@email.com | Travel |
| 103 | John Doe | 1234 Elm St. | john.doe@email.com | Shopping |
| 104 | Jane Smyth | 456 Maple Avenue | jane.smith@email.com | Shopping |
There are multiple records and variations of a contact named John Doe. The first and second rows share Record ID 101 and represent a duplicate record, with small differences in name spelling, address, and email. In the fourth row (Record ID 103), we have another John Doe with a different email, address, and preference.
Now, assuming you want to cluster customers by their preference, how would you go about setting the configuration for this match?
In this case, you may want to first run a match on customer ID and email to consolidate the duplicate record. Once done, you can run another match to consolidate the data based on preferences. This way, you can remove duplicates, investigate why there are multiple addresses for John Doe (and retire obsolete data), and review the validity of each record.
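Here is a hedged two-pass sketch of that configuration in pandas and rapidfuzz, using the sample table above (the 90% email similarity threshold is illustrative):

```python
import pandas as pd
from rapidfuzz import fuzz

df = pd.DataFrame({
    "record_id": [101, 101, 102, 103, 104],
    "name": ["John Doe", "Jon Doe", "Jane Smith", "John Doe", "Jane Smyth"],
    "email": ["johndoe@email.com", "jondoe@email.com", "janesmith@email.com",
              "john.doe@email.com", "jane.smith@email.com"],
    "preference": ["Music, Cooking", "Music, Cooking", "Travel",
                   "Shopping", "Shopping"],
})

# Pass 1: within each record ID, collapse rows whose emails are near-identical.
def consolidate(group: pd.DataFrame) -> pd.DataFrame:
    anchor = group["email"].iloc[0]
    close = group["email"].apply(lambda e: fuzz.ratio(e, anchor) >= 90)
    # Keep one survivor of the near-identical rows, plus any genuinely distinct ones.
    return pd.concat([group[close].head(1), group[~close]])

deduped = df.groupby("record_id", group_keys=False).apply(consolidate)

# Pass 2: cluster the consolidated records by preference.
clusters = deduped.groupby("preference")["name"].apply(list)
print(clusters)
```

Even in this toy version, the order matters: clustering by preference before consolidating record 101 would count John Doe’s “Music, Cooking” preference twice.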
Setting the right configuration helps you create golden records and achieve a 360-degree customer view. With WinPure, you can tweak these configuration settings without writing a single line of code. Ask any fuzzy match specialist and they will tell you how much effort it takes to build configuration settings that get the job done right the first time. Most fail on the first try and have to re-tweak the algorithm until the matches come out right.
To conclude, fuzzy data matching is a challenging task, prone to mistakes, especially when you have to make so many decisions about techniques, data variations, match configurations, and so on.
Add in deadlines, urgent requirements, limited resources, and dirty data, and you have a full-blown crisis in your department. As you attempt to resolve these issues all at once, you risk all kinds of mistakes that lead to business failures. An insight pulled from raw data could be flawed. A report based on duplicate data could give the wrong information. A business application fed with dirty data can malfunction. There are 1001 ways to ruin a business with poor data.
That’s where a powerful fuzzy match solution like WinPure can help. With our fuzzy match technology, you can perform all of these complex tasks in minutes, without requiring trained specialists.
Download the free trial or book a call to see how we can solve your business problems with better, faster, and more accurate data match abilities.
We’re here to help you get the most from your data.
Download and try out our Award-Winning WinPure™ Clean & Match Data Cleansing and Matching Software Suite.