Fuzzy matching, when applied to your business rules, will help standardize your customer view for improved data quality.
In this fuzzy matching guide, we’ll walk you through creating a fuzzy matching strategy and how you can use a codeless fuzzy match solution for record linkage and data deduplication of a million records within minutes.
A 2020 Trends in Data Management report found that trust in organizational data quality remains low, at only 13.77%. At the same time, the Gartner Annual CMO Spend Survey reported increased demand for customer understanding and insight.
In 2022 there are many ways to gain the insight necessary for business growth. One of these is fuzzy matching: a powerful tool that transforms messy data into a standard customer view in line with your business rules.
For beginners: fuzzy matching is a type of data matching algorithm that calculates probabilities and weights to determine similarities and differences between business entities such as customers. This technique differs from deterministic data matching, which compares unique reference data, like name and birthday, for exact agreement.
Let’s imagine a typical scenario where fuzzy matching adds value to a business.
Say you entered 2022 down in sales due to the economy. You want to increase sales and get ready to launch a new marketing initiative in response.
So, you start to get together all your sales information to make a big splash with all your customers. You start with your customer relationship management (CRM) system and then move on to other marketing or product systems.
But each system contains slightly different information, resulting in messy data: duplicated and fragmented contacts, accounts, transactions, products, and addresses. You need to apply fuzzy matching algorithms in line with your business rules, standardize customer information, remove duplicate data, and reduce errors.
You need to turn messy data like the example below into clean, accurate, refined records.
Other business use cases and examples where fuzzy matching is required include:
1). Entity resolution for sales, marketing, and insights teams. Customer data is inherently messy, especially if it comes in from multiple data sources. Entity resolution refers to cleaning, standardizing, and merging all of these records to create a unified, 360-degree view of the customer.
2). Identity resolution for government agencies. It’s not uncommon to hear immigration authorities flagging the wrong individual because of something as small as a spelling difference. A robust data-matching solution is required to help governments and authorities curb false identity problems.
3). GDPR & sanctions compliance for businesses. Sanctions are only getting stricter as world powers collide and companies dealing in international trading and transactions are required to ensure they meet sanctions compliance. On the other hand, companies in the EU/UK are required to meet GDPR compliance and other data privacy and safety requirements. The matching engine that a company uses for identity resolution must be able to detect matches in near-real-time and be scalable to handle data from multiple domains.
Fuzzy matching techniques, also called probabilistic data matching, apply parameters that you choose and score data patterns mathematically. They compare sets of characters, numbers, strings, or other data types for similarities. When presented with the likelihood that customer entities match your fuzzy search, you decide whether to link records and combine data into a single customer view.
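To make the idea of weighted scoring concrete, here is a minimal Python sketch. The field names, the weights, and the sample records are assumptions made up for illustration, and the standard library's `difflib.SequenceMatcher` stands in for whichever similarity measure you would actually choose.

```python
from difflib import SequenceMatcher

# Hypothetical weights: how much each field contributes to the overall score.
WEIGHTS = {"name": 0.5, "email": 0.3, "city": 0.2}


def field_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two values, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def record_score(rec_a: dict, rec_b: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the per-field similarities."""
    total_weight = sum(weights.values())
    weighted = sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return weighted / total_weight


# Made-up sample records.
a = {"name": "Jon Smith", "email": "jon.smith@example.com", "city": "Boston"}
b = {"name": "John Smith", "email": "jon.smith@example.com", "city": "boston"}
c = {"name": "Maria Lopez", "email": "mlopez@example.org", "city": "Austin"}

print(round(record_score(a, b), 2))  # likely the same customer
print(round(record_score(a, c), 2))  # likely different customers
```

The score only tells you how alike two records look. Where you draw the line between "link" and "don't link" remains a business decision.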
Fuzzy logic operates on estimates or approximations. Unlike Boolean logic, it does not produce binary results. The answer is not a yes or a no but a degree of likelihood, with the highest score indicating the most likely match.
Boolean data matching, which uses exact spellings or characters to determine a match, is highly limited. It does not consider variations in text or numbers, which means it will always leave out potential matches. Fuzzy data matching finds similar strings rather than only identical ones. It determines similarity on the basis of a distance, a score, or a likelihood of similarity.
For example, it can use the edit distance (also called the Levenshtein distance) to determine similarity. Rise and Rice, for example, have a distance of 1, since S and C are the only letters that differ.
Fuzzy matching is not a new concept. Most programming languages provide a means to compare strings and programmers have been using fuzzy matching algorithms to implement a string comparison function that considers different characters and match options to return a true value.
The complication with a traditional fuzzy matching approach lies in setting up a match strategy. You will have to identify answers to questions like:
Generally, most fuzzy matching techniques and algorithms can be categorized into three types.
The character overlap approach looks for strings that share many of the same characters, which indicates a high level of similarity. The Jaccard measure and the Jaro-Winkler measure are two of the most common character overlap measures.
The Jaccard measure is one of the most basic methods for fuzzy string matching. It is the number of elements the two strings have in common, divided by the total number of elements found in either string.
For example, if Robert and Roberta are two strings, the Jaccard measure will report them as having about an 86% match, since they share 6 out of 7 letters.
Because the Jaccard measure depends entirely on which characters the strings share, it can also give false positives (that is, showing a match when there isn’t one).
For example, name strings sharing the same letters (anagrams), like ‘Abel’ and ‘Bela’, will be a match even though they are the records of two different people.
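The two examples above can be reproduced with a short Python sketch. It assumes one particular reading of the measure: letters are compared case-insensitively and repeated letters are counted (a multiset Jaccard), which is what makes Robert and Roberta share 6 of 7 letters. Other variants, such as comparing sets of distinct letters or of words, will give different numbers.

```python
from collections import Counter


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over letters, counting repeated letters (multiset form)."""
    count_a, count_b = Counter(a.lower()), Counter(b.lower())
    shared = sum((count_a & count_b).values())  # letters found in both strings
    total = sum((count_a | count_b).values())   # letters found in either string
    return shared / total if total else 1.0


print(round(jaccard_similarity("Robert", "Roberta"), 2))  # 0.86 (6 of 7 letters)
print(jaccard_similarity("Abel", "Bela"))                 # 1.0, an anagram false positive
```

The Jaccard distance is simply 1 minus this similarity, so the anagram pair has a distance of 0 even though the names are different.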
This leads to the two most feared consequences of poor fuzzy matching – false positives and false negatives.
In the Jaccard method, strings with the same characters in different orders are also considered a match based on the number of shared letters. The Jaro-Winkler distance addresses this problem in three ways: it counts only characters that match within a limited distance of each other’s position, it penalizes matched characters that appear in a different order (transpositions), and it adds a bonus based on the length of the common prefix at the start of the strings. The algorithm then returns a score between 0 and 1, where 0 means no similarity and 1 means an exact match.
String 1 = Crate
String 2 = Trace
These two strings would result in a match score of approximately 0.73 (using the Jaro-Winkler formula), even though the characters do not appear in the same order.
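If you want to check that figure yourself, the following Python sketch implements the commonly published Jaro-Winkler formula. The prefix weight of 0.1 and the maximum prefix length of 4 are the usual defaults, assumed here; a production library may handle edge cases differently.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, penalized for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity plus a bonus for a shared prefix (up to max_prefix characters)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


print(round(jaro_winkler("crate", "trace"), 2))  # 0.73
```

Crate and Trace share no prefix, so the prefix bonus adds nothing here and the score is the plain Jaro similarity.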
Character overlap approaches are not efficient, can be computationally expensive, and do not model character order accurately.
The edit distance approach measures similarity between two strings by defining the minimum number of changes required to convert String A into String B. Edit distances come in a variety of forms, but insertion, deletion, and substitution of characters are the most common types of operations to transform one string into another.
For example, transforming Maria into Mariam requires inserting one letter, giving an edit distance of 1. The algorithm detects a match based on this distance. In the simplest form, each edit operation is given equal weight; this is known as the Levenshtein distance.
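As a minimal sketch of how that distance is computed, here is the classic dynamic-programming version of the Levenshtein distance in Python, with insertions, deletions, and substitutions each costing 1. It is meant as an illustration rather than an optimized implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("Maria", "Mariam"))  # 1
print(levenshtein("Rise", "Rice"))     # 1
```

To turn a distance into a match decision, you would typically compare it against a threshold, often scaled by the length of the longer string.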
The edit distance method only works on single characters. One way to extend it is to capture multiple characters at a time, an approach also known as N-gram edit distance. It takes the idea of the Levenshtein distance and treats each n-gram as a character. The matching works by limiting potential matches to those that share one or more n-grams with a query string.
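The candidate-limiting step described above can be sketched in Python as follows. This shows only the filtering part: it builds an index from each n-gram to the names containing it, then returns the names that share at least one n-gram with the query. The choice of n = 3 and the sample names are assumptions for illustration.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 3) -> set:
    """All overlapping n-character substrings of text, lowercased."""
    text = text.lower()
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(names, n: int = 3) -> dict:
    """Map each n-gram to the set of names that contain it."""
    index = defaultdict(set)
    for name in names:
        for gram in ngrams(name, n):
            index[gram].add(name)
    return index


def candidates(query: str, index: dict, n: int = 3) -> set:
    """Names that share at least one n-gram with the query."""
    found = set()
    for gram in ngrams(query, n):
        found |= index.get(gram, set())
    return found


names = ["Maria", "Mariam", "Marian", "Robert", "Roberta"]
index = build_index(names)
print(sorted(candidates("Maria", index)))  # ['Maria', 'Mariam', 'Marian']
```

Only the surviving candidates then need a full similarity comparison, which is what keeps the approach workable on large data sets.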
These are just three of the most common types of fuzzy matching approaches. Within these approaches, you’ll find many different types of algorithms at work to match all types of data. Other fuzzy matching algorithms include:
|Algorithm|Description|
|---|---|
|Acronym|Determines whether a business name matches its acronym. For example, Advanced Micro Devices and its abbreviation AMD are considered a match, returning a score of 100.|
|Edit Distance|Determines the similarity between two strings based on the number of deletions, insertions, and character replacements needed to transform one string into the other. For example, VP Sales matches VP of Sales with a score of 73.|
|Initials|Determines the similarity of two sets of initials in personal names. For example, the first name Jonathan and its initial J match and return a score of 100.|
|Keyboard Distance|Determines the similarity between two strings based on the number of deletions, insertions, and character replacements needed to transform one string into the other, weighted by the position of the keys on the keyboard.|
|Kullback-Leibler Distance|Determines the similarity between two strings based on the percentage of words they have in common. This matching algorithm works well for company names.|
|Name Variant|Determines whether two names are a variation of each other. For example, Bob is a variation of Robert and returns a match score of 100. Bob is not a variation of Bill and returns a score of 0.|
|Syllable Alignment|Determines the similarity between two strings based on their sounds. First, the character strings are converted into syllable strings. Then the syllable strings are compared and scored using the Edit Distance algorithm. This matching algorithm works well for company names.|
|Metaphone 3|Determines the similarity between two strings based on their sounds. This algorithm attempts to account for the irregularities among languages and works well for first and last names. For example, Joseph matches Josef with a score of 100.|
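To give a feel for how one of the simpler entries in this table might work, here is a hedged Python sketch of an acronym check. It is not any vendor's actual algorithm: the function name is hypothetical, and it simply compares the first letter of each word against the candidate acronym, returning 100 for a match and 0 otherwise.

```python
def acronym_score(full_name: str, candidate: str) -> int:
    """Return 100 if candidate is the acronym of full_name, else 0 (illustrative only)."""
    initials = "".join(word[0] for word in full_name.split() if word[0].isalnum())
    cleaned = candidate.replace(".", "").strip()
    return 100 if initials.upper() == cleaned.upper() else 0


print(acronym_score("Advanced Micro Devices", "AMD"))  # 100
print(acronym_score("Advanced Micro Devices", "IBM"))  # 0
```

A real matcher would also need rules for stop words such as "of" and "and", which this sketch ignores.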
A fuzzy search uses several fuzzy matching techniques to filter and group customer data according to the set of user characteristics, likeness thresholds, and patterns you specify. In return, you get the potential matching customers of interest and the weight describing how likely one customer’s record resembles another.
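A fuzzy search of this kind can be sketched in a few lines of Python. The example below assumes a single similarity measure (the standard library's `difflib.SequenceMatcher`) and a hypothetical threshold of 0.8. The record names, other than the one quoted in this guide, are made up for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_search(query: str, records, threshold: float = 0.8):
    """Return (record, score) pairs scoring at or above the threshold, best first."""
    scored = []
    for record in records:
        score = SequenceMatcher(None, query.lower(), record.lower()).ratio()
        if score >= threshold:
            scored.append((record, round(score, 2)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up customer list for illustration.
customers = ["BHP Copper Inc", "BHP Copper Inc.", "Acme Metals Ltd"]

for name, score in fuzzy_search("BHP Copper Inc", customers):
    print(name, score)
```

Raising the threshold returns fewer, closer matches, while lowering it returns more candidates for you to review.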
Additional software lets you interact with fuzzy search results in a friendly user interface. You can locate less obvious relationships among hundreds of thousands of records and decide which records to link and which customers to combine. You can see fuzzy matching search results below.
You find a 95% similarity between two “BHP Copper Inc” records, indicating that you may wish to merge them. You then scan the other similar company records.
You drill down deeper to see each company and customer record. From there, you can profile your data, plan your data cleansing tasks, and meet your business rules designed to standardize each customer entity.
Fuzzy matching has traditionally been implemented in one of the following four languages or tools.
Python – One of the most common languages used in data science to build complex algorithms. The third-party FuzzyWuzzy library (now maintained as TheFuzz) provides common functions you can use to perform approximate string matching.
R – A popular language used by statisticians, data analysts, and researchers to retrieve, clean, analyze, and present data. It’s often compared with Python for cleaning and matching data.
Java – Often used with Python, Java is beneficial when you need to host business-critical data science applications. Spotify and Uber are companies that use both Java and Python to work with their data.
Excel – The good old Excel! Great for deterministic matching, cleaning up, and merging records. You have to know a bunch of Excel formulas to treat and match the data, and Excel is highly limited in terms of scale and flexibility.
Other than the knowledge of these languages, implementing a fuzzy matching process will require knowledge of:
Fuzzy matching’s reliability depends on suitable fuzzy search parameters and software to return a low number of false positives and negatives.
A false positive happens when the software retrieves two customer entities as a match when they are not. For example, “Joseph Mc Connell,” who works in Birmingham, does not match “Joseph Mc Donnell,” who works in San Francisco. They are separate customer entities.
A false negative occurs when the software does not pick up two customer records as a match even though they represent the same entity. For example, the algorithm does not pick up that “Ted Doe,” who works at “Oral Technology LTD,” is the same person as “Edward Doe,” who works at “Oral Technology.”
False positives lead to wasted time spent combing through irrelevant records. False negatives lead to duplicates and errors in customer information.
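Both kinds of error can be reproduced with a naive matcher. The Python sketch below assumes a single name-similarity score from `difflib.SequenceMatcher` and a hypothetical threshold of 0.85. With those choices, it flags the two different Josephs as a match and misses that Ted and Edward are the same person.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Naive rule: two names match if their similarity reaches the threshold."""
    return similarity(a, b) >= threshold


different_people = ("Joseph Mc Connell", "Joseph Mc Donnell")
same_person = ("Ted Doe", "Edward Doe")

print(is_match(*different_people))  # True: a false positive
print(is_match(*same_person))       # False: a false negative
```

No single threshold fixes both errors here. Comparing additional fields such as city or company, or applying a name-variant rule (Ted for Edward), is what separates these cases.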
To avoid false positives and negatives, you want to use reliable software to profile your data ahead of time. Next, you want to come up with the business rules and plans to clean the data. Then you want to use trustworthy automation to clean the data, meeting your goals.
With a reduced chance of false positives and negatives, you can be more confident your fuzzy matching software will meet your data cleaning needs.
In the modern world where data sources are complex, varied, and inherently messy, fuzzy matching is required to perform two critical tasks: remove duplicates and link multiple data sources to get a consolidated view of the entity – also known as record linkage.
Deduplication and record linkage tasks are highly time-consuming and demand highly accurate data matching to weed out similar records. A poorly written fuzzy matching script will produce more false positives and false negatives, making the entire process ineffective.
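As a rough illustration of what a hand-written deduplication script looks like, here is a minimal Python sketch. It assumes a greedy strategy (each record joins the first group whose first member it resembles), a hypothetical threshold of 0.85, and `difflib.SequenceMatcher` as the similarity measure. The sample records are made up, apart from names quoted elsewhere in this guide.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """True if two strings are at least `threshold` similar, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def deduplicate(records, threshold: float = 0.85):
    """Greedily group records that resemble the first member of an existing group."""
    groups = []
    for record in records:
        for group in groups:
            if similar(record, group[0], threshold):
                group.append(record)
                break
        else:
            groups.append([record])  # no similar group found, start a new one
    return groups


records = [
    "Oral Technology LTD",
    "Oral Technology",
    "oral technology ltd.",
    "BHP Copper Inc",
    "BHP Copper Inc.",
]

for group in deduplicate(records):
    print(group)
```

Even this small sketch compares every record against every existing group, so the work grows quickly as the data set grows, and the results depend on record order and on the threshold you pick.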
Traditionally, developers and data scientists use the following fuzzy matching approach:
Data scientists are hired to manually write code for mundane tasks instead of working on strategy.
You would need someone proficient in multiple programming languages, with a strong knowledge of Excel, to create fuzzy match algorithms that can catch variations across data sets.
Here’s a real-world scenario of how a simple record linkage task can take months.
You are required to merge records from marketing, sales, and customer service to get a 360-degree view of your customer.
Now, if we were to dig deeper into the data itself, you’ll have to spend more time fixing problems like poor address data as given in the example below.
Using a traditional fuzzy matching approach, it would take you:
3 months if you spend each working hour on the project to simply transform the data
Another month to script the matching code
Multiple iterations of scripting, testing, and measuring results
Expertise in at least two languages with full command in Excel
Total = Around 4 to 5 months on a simple 1,000-row data set from three departments.
This is too long to wait, especially if a business wants faster access to insights.
You and your employees need trustworthy information for business operations. Fragmented and duplicated customer information from multiple systems disguises similar customer entities and less obvious duplications, leading to messy data. Fuzzy matching algorithms and fuzzy searches retrieve like data elements typically missed manually.
Fuzzy searches retrieve similar records based on your parameters and thresholds. They give data sets scores to profile data and what to clean, based on your business rules. Use fuzzy matching software you trust to gather reliable information about potential matching customer entities.
Fuzzy matching helps you plan and enact your data cleansing projects, combining customer records into a single view. With better data quality, enabled by fuzzy matching, you will have successful marketing campaigns and a greater readiness to add machine learning for better insights.
We’re here to help you get the most from your data.
Download and try out our Award-Winning WinPure™ Clean & Match Data Cleansing and Matching Software Suite.