Discover the power of Fuzzy Matching today

Maximize your data efficiency

Start a Free Trial

4 Common Mistakes in Fuzzy Data Matching

Author photo

Farah Kim • March,2023

A data matching process is complicated, time-consuming, and prone to mistakes. It involves navigating through a myriad of variables, handling dirty data, and understanding the nuances of the dataset; causing even the most skilled data specialists undue stress and anxiety!

So if you want accurate results without losing your cool, here are some common fuzzy data match mistakes to watch out for. Remember, you can’t really avoid mistakes, so don’t fret about it too much. But, at the same time, you also want to prevent it from happening so it’s recommended to follow some best practices when using fuzzy matching techniques on your data.

Here goes.

Fuzzy Matching: Your ultimate solution for data cleansing!

Insufficient Data Cleaning

Most businesses perform basic ETL (extract, transform, load) operations on their data before using the data for business purposes. But basic ETL processes don’t cut it anymore.

 

Our modern world data doesn’t come from a single source, nor does it contain a basic structure. You have data streaming in from web forms, from POS salespeople, or from third-party sources. All these data sources have human errors, different formats, and structures. Most importantly, they lack data quality parameters; and the people handling data entry are not trained in data quality.

 

The result? High levels of dirty data.

 

The mistake? Neglecting dirty data or treating it superficially.

 

Here’s what typically happens when a dataset is required for a business purpose like sales forecasting.

 

A junior IT resource will be assigned the task. They need to export the data into Excel, and perform basic ETL operations. They just need to make sure letters are standardized, for example, turn all UPPERCASE LETTERS into lowercase, or just look for exact duplicates in the record and remove them. Once done, the file is saved and sent to business users. Job done!

 

When the business user gets this record, they will have a tough time corroborating the data with their own lists or records. And from here on starts a tug-of-war. Eventually, business users take what is given because they don’t know any better and go ahead with the flawed data.

 

It’s only when this data causes a crisis, (such as false insights) that senior IT resources are involved. At this point, there is an impulsive reaction to go in and make drastic fixes – or to outsource to an expensive specialist for data matching.

 

You can avoid all of this, if you simply follow a basic protocol:

 

  • Invest in data-cleaning automation tools that have pre-built components for data cleaning, standardization, and deduplication. These frameworks can streamline development and maintenance.

 

  • Go beyond superficial fixes; use advanced data cleansing techniques to address issues like missing values, duplicates, and outliers.

 

  • Train your team on the importance of data quality and the best practices for data management.

 

  • Implement data validation checks to ensure that cleaned data meets predefined quality criteria.

 

  • Schedule regular data cleaning and maintenance tasks to keep your dataset up to date.

 

Remember, dirty data is the leading cause of false match results, which eventually leads to inaccurate insights and predictions. You will also face significant challenges in terms of time and resources. The time spent in outsourcing, fixing, and matching the data can take weeks if not months. Additionally, you will have to share the data with an external party which means you cannot be in control of any data privacy leaks if they happen. Hence, using an on-premises data cleaning or data matching tool is a better way to avoid costly mistakes, clean your data, and make sure it is error-free before it is used for business purposes.

Overlooking Data Context

Another common mistake to avoid is failing to consider the data’s context. In an insightful webinar with the WinPure team, Libba Stanford, a renowned data strategist says, ‘Companies push for more data without addressing problems. They don’t speak to frontline workers collecting the data and sometimes forget about the whole context of the data. Frontline workers collect and experience the data, and have that full picture of information that companies can question and pull answers from and better strategize and build out a plan. However, there is very little communication with frontline workers. Again, there is no problem-solving. The focus is merely on process or information improvement.’ 

 

Libba is absolutely correct! Let’s share a scenario:

 

A junior data analyst may believe using the full form of abbreviated companies like (IBM, UPS, and BMW) is better and so decide to “standardize” all of it, without consulting with the business team. However if in the HR’s corporate list, this data is stored as “IBM Inc.,” the two records will not match. The analyst may only realize the mistake when they have to compare these two data lists for a business purpose. That’s when the mistake becomes a critical data match bottleneck.

 

In data matching, context matters significantly. If a data analyst doesn’t know the context of the content they’re trying to match, they will not be able to detect false positives and negatives. In fact, they may even merge, purge, or create false golden records! This is a problem that eventually causes downstream applications to fail – often catastrophically.

 

Cluster`s image

Find your perfect match: Fuzzy matching made easy!

Setting the Wrong Fuzzy Level

This may seem very inconsequential, but in the grand scheme of a data-matching exercise, setting the wrong fuzzy level can derail the whole match exercise.

 

A fuzzy match level is a threshold you manually set to determine the level of similarity between two strings. In simple terms, if you want to know whether Rob Jone and Robe Jons are a match, you will need to set the threshold level to a certain percentage. If you set the wrong threshold level, you’re asking the algorithm to cluster matches that aren’t duplicates, but are simply similar at a string level (also why being aware of the context of your data matters).

 

A lower threshold allows for more flexibility, potentially capturing more matches but also introducing more false positives. Conversely, a higher threshold increases precision but may miss some valid matches.

 

We suggest using a 90% threshold for fuzzy match projects, but it depends on your specific needs and goals.

 

Here are examples of what you might get when setting a fuzzy level at 85%, 90%, and 95%:

 

At 85%:

At a fuzzy level of 85%, the match process is more permissive, allowing for a broader range of similarity between records. This is good if you want to get an overview of your data and the level of duplication. If the results cluster fields that are not duplicates, but simply have similar strings, you can narrow the match threshold.

 

For example: 

 

The algorithm might match “John Smith” in Set A with “Jon Smithe” in Set B, even though the spellings have more variations and a typo. Both these contacts might not be the same, but the system flags this as a match. If this threshold is applied to a larger database with more variations, it can result in thousands of false positives, derailing the match exercise completely.

 

At 90%:

This level strikes a balance between capturing valid matches and reducing false positives. You will still have false positives (can’t ever get rid of them!), but at a lower level than with an 85% threshold. A 90% threshold is good for tasks where you want to handle moderate variations, making it effective for matching data with common discrepancies, such as typos, abbreviations, or minor variations.

 

For example: 

 

In a product catalog, you are matching product names from different suppliers. With a 90% fuzzy level, the algorithm matches “Samsung Galaxy S21” with “Samsung Galaxy S21 5G,” as it recognizes the high similarity but may not match “Samsung Galaxy S21” with “Samsung Z Fold” due to the level of “dissimilarity.”

 

At 95%:

Matches are more likely to be accurate, resulting in fewer false positives – however – the algorithm is sensitive even to minor differences. At 95%, it’s almost an exact match so the algorithm will look for strings that are nearly exact or are exact matches. For tasks where precision is critical, such as fraud detection, a higher threshold like 95% may be appropriate.

 

For example: 

 

With a 95% fuzzy level, the algorithm may only match “Robert Johnson” with “Robert Johson” if the differences are minimal. It may not match “Robert Johnson” with “Rob Johnson” due to more significant differences.

 

However, the choice of setting the threshold often depends on the context of the data, your data match goals, and your specific use case. The WinPure data match solution has the fuzzy match level set at 90% default, however, users are free to tweak the thresholds and review different match results. If you were to do this using Python fuzzy match scripts, you would be wasting a significant amount of time in tweaking and tuning the algorithm – not to mention, also being prone to mistakes along the way.

Incorrect Configuration Settings

Right, so we covered data cleaning, data context, and setting a data match threshold. Next up, a classic mistake we see customers make is with configuration settings. 

 

Let’s see this with a simple example: 

 

Record ID Name Address Email Preference
101 John Doe 123 Main St. johndoe@email.com Music, Cooking
101 Jon Doe 245 Main Street jondoe@email.com Music, Cooking
102 Jane Smith 456 Maple Ave. janesmith@email.com Travel
103 John Doe 1234 Elm St. john.doe@email.com Shopping
104 Jane Smyth 456 Maple Avenue jane.smith@email.com Shopping

 

There are multiple records and variations of a contact named John Doe. The first and second row indicates a duplicate record, with just a difference in address data. However, in the fourth row, we have another record of John Doe with different email IDs, addresses, and preferences.

 

Now, assuming you want to cluster customers by their preference, how would you go about setting the configuration for this match?

 

In this case, you may want to first run a match between customer ID and email to consolidate the duplicate record. Once done, you can then run another match to consolidate the data based on preferences. This way you can remove duplicates, investigate why you have multiple addresses of John Doe (maybe try to remove obsolete data), and review the validity of their data.

See how it’s done in this video 👇🏼

 

 

Setting the right configuration can help you create golden records and get a 360 customer view. With WinPure, you can tweak configuration settings to:

 

  • record common words in a knowledge base library so it doesn’t get matched
  • set different rules to match data across or within datasets
  • decide to merge or purge records
  • set and export master records

 

All of this – without a single line of code! Ask any fuzzy match specialist and they will tell you the amount of effort it takes to build configuration settings that can get the job done right in the first time. Most fail in the first try, and have to re-tweak the algorithm until they get the matches right.

 

 

Unlock the potential of your data with Fuzzy Matching!

Better Match Results, Less Mistakes with WinPure

To conclude, fuzzy data matching is a challenging task, prone to mistakes especially when you have to make so many decisions with techniques, variations in data, match configurations, and so on. 

 

Add in problems like deadlines, urgent requirements, limited resources, dirty data, and you have a full-blown crisis in your department. As you attempt to resolve these issues all at once, you risk all kinds of mistakes that lead to business failures. An insight pulled from raw data could be flawed. A report based on duplicate data could give the wrong information. A business application fed with dirty data can malfunction – there are 1001 ways to ruin a business with poor data.

 

That’s where a powerful fuzzy match solution like WinPure can help you. With our fuzzy match technology, you can perform all these complex tasks in minutes – without requiring trained specialists. Additionally, you can:

 

  • Perform advance data cleaning such as setting standards for abbreviations or short forms like Str for Street during the match process
  • Match within and across data sets using a combination of fuzzy, exact, and numeric match algorithms
  • Create golden records for business use
  • Reduce errors and mistakes with a professional team that works with you to resolve roadblocks in the match process

 

Download the free trial or book a call to see how we can solve your business problems with better, faster, and more accurate data match abilities.

 

 

[sharethis-inline-buttons]

Download Clean & Match Enterprise Free Trial

  • Hidden
  • * The download link will be emailed to you
  • windows

Author photo

Farah Kim

linkedin

Farah Kim is a human-centric product marketer and specializes in simplifying complex information into actionable insights for the WinPure audience. She holds a BS degree in Computer Science, followed by two post-grad degrees specializing in Linguistics and Media Communications. She works with the WinPure team to create awareness on a no-code solution for solving complex tasks like data matching, data deduplication, and MDM.

Any Questions?

We’re here to help you get the most from your data.

Download and try out our Award-Winning WinPure™ Clean & Match Data Cleansing and Matching Software Suite.

WinPure, a trusted innovator in Data Quality and Master Data Management Tools.
Join the thousands of customers who rely on WinPure to grow faster with better data.

McAfee Logo Deloitte logo vodafone HP logo