Data matching is a critical process in data transformation initiatives like record linkage, single customer view, and entity and identity resolution. In this data matching guide, we’ll explain what data matching is, how it works, its different use cases, and its many benefits.
What Is Data Matching?
Data matching (also known as record matching) compares two or more records according to given criteria to fulfill a given purpose. For example, the phone and address data of one or more individuals can be compared and matched to identify unique records and remove duplicate entries.
Data matching is commonly needed for:
Entity resolution: the process of identifying and merging records that represent the same real-world entity. For example, address, social media, sales, and billing data can be combined to get a consolidated record of a customer.
Data matching enables the match and merge of multiple identities from multiple data sources to get an accurate view of the individual as shown in the image below.
Master data management: creating reliable and accurate master records free of duplicates and errors. MDM records are the sole source of truth that companies rely on to make business decisions.
Data quality improvement: matching data to validate existing data or to remove duplicates and enhance the credibility, value, and accuracy of data.
Why Is Data Matching Important?
Data matching is one of the most vital steps in data management and serves many important purposes. Data matching can increase efficiency, validity, accuracy, and compliance within a wide range of businesses and industries.
Data matching improves customer service and customer retention, increases efficiency, and raises data quality. Most of all, it helps drive business growth, which is what makes it an essential part of data management.
How Does Data Matching Work?
Data matching makes use of several algorithms to match attribute data, reference data, and group data, also known as the three levels of data in an entity resolution (ER) system.
The Attribute Level: The lowest level of data matching takes place between two identity attributes, such as names, places, locations, numbers, etc. If the two attribute values are identical, the comparison is considered an exact match.
The Reference Level: Most ER systems use reference data for matching. This is a special subset of data that includes postal codes, currencies, transaction codes, hierarchies, and IP addresses or cookies.
Cluster/Group Level: Rather than attempt to compare all records at attribute or reference level, they are placed within groups. For example, collecting all identifiers with phone numbers starting with a specific country code into one group.
There is no one matching algorithm at work. At its highest level, data matching makes use of several algorithms to determine a match. However, almost all algorithmic approaches can be categorized into four groups.
Deterministic Matching Algorithms
Also known as ‘exact matching,’ deterministic matching algorithms match records based on attribute-level similarity. Each attribute is compared against the same attribute in another record to identify a match or no-match.
For example, a deterministic matching algorithm would compare a phone number in one record with a phone number in another record to identify if the two values are equal and are a match. This relative simplicity enables high performance and quicker match results.
The catch? Records must be accurate and have exact spelling and casing to be a match.
Deterministic matching algorithms have limited capabilities to handle anomalies like incomplete or blank fields, common misspellings of names (Catherine vs Kathryn), and typos among other complexities.
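To make the strictness concrete, here is a minimal sketch of a deterministic comparison. The field names, the sample records, and the `normalize_phone` helper are assumptions made up for illustration and are not taken from any particular tool.

```python
import re

def exact_match(a: dict, b: dict, field: str) -> bool:
    """Deterministic comparison: the two values must be identical."""
    return a.get(field) == b.get(field)

def normalize_phone(value: str) -> str:
    """Hypothetical normalizer: keep digits only."""
    return re.sub(r"\D", "", value or "")

# Illustrative records describing the same person with different formatting.
rec_a = {"name": "Catherine Jones", "phone": "(555) 010-2000"}
rec_b = {"name": "catherine jones", "phone": "555-010-2000"}

raw_phone_match = exact_match(rec_a, rec_b, "phone")  # formatting differs
raw_name_match = exact_match(rec_a, rec_b, "name")    # casing differs

# The same comparisons after light standardization.
clean_phone_match = normalize_phone(rec_a["phone"]) == normalize_phone(rec_b["phone"])
clean_name_match = rec_a["name"].casefold() == rec_b["name"].casefold()
```

Both raw comparisons fail even though the records describe the same person, while the standardized comparisons succeed. Standardization does not help with a true spelling variant such as Catherine vs Kathryn.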
Most advanced data matching tools are able to overcome this limitation by providing data cleaning and data prepping along with data matching in one single interface. This lets the user clean, standardize, match, and create final records without switching between tasks and platforms.
Fuzzy Matching Algorithms
Fuzzy matching algorithms complement deterministic matching by tolerating small differences between values. Fuzzy techniques include the Soundex algorithm, which handles phonetic matches, and the Levenshtein edit distance, which handles variation in string values caused by mistyping or data quality errors. For example, the edit distance between the strings Catherine and Katherine is “1” because only one edit operation, the substitution of K for C, is necessary to transform Catherine into Katherine. Fuzzy matching allows for easy matching of semi-structured data and records that cannot be matched using deterministic methods.
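The two techniques named above can be sketched in a few lines. This is an illustrative implementation only: `levenshtein` is the textbook dynamic-programming edit distance, and `soundex` follows the common American Soundex rules, which may differ from the variant a given product uses.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance where insertions, deletions, and substitutions cost 1."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

_SOUNDEX_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _SOUNDEX_CODES[_ch] = _digit

def soundex(name: str) -> str:
    """Four-character phonetic code: first letter plus up to three digits."""
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = _SOUNDEX_CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]

distance = levenshtein("Catherine", "Katherine")
codes = soundex("Robert"), soundex("Rupert")
```

One thing worth noticing: Soundex keeps the first letter, so Catherine and Katherine get different codes (C365 and K365) even though their edit distance is 1. That is one reason the two techniques are often used together.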
Probabilistic Matching Algorithms
Probabilistic matching relies on statistical probability theories to determine a match. It uses a wider set of data elements and weights to calculate match scores, and thresholds to determine whether two attributes are a match.
For example, in the English language, Will is understood as a common nickname for William, or Kath for Katherine. Most probabilistic matching schemes will incorporate nickname rules or make use of pre-defined rules to handle non-exact matches.
With social media data and digital identities becoming important data sources, most entity resolution matching rules will rely on probabilistic matching to determine a match. For example, William may list his full name as Will Spade on LinkedIn, but at the same time may keep his name as Willy or Bill on Facebook. Similarly, he may have a completely different email ID or a different name on his credit card, such as William John Cutler. In this case, probabilistic matching algorithms can determine if all these references belong to one person.
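One common way to turn weights and thresholds into a score is the Fellegi-Sunter style of record linkage, where each field adds a positive weight when it agrees and a negative weight when it disagrees. The sketch below only illustrates that idea. The nickname table, the m and u probabilities, and the threshold are made-up values, and nothing here describes how any specific product scores matches.

```python
from math import log2

# Hypothetical nickname rules (assumption for illustration).
NICKNAMES = {"will": "william", "willy": "william", "bill": "william",
             "kath": "katherine"}

# Hypothetical per-field probabilities:
# m = P(field agrees | same person), u = P(field agrees | different people).
FIELDS = {
    "first_name": (0.90, 0.05),
    "last_name": (0.95, 0.01),
    "city": (0.80, 0.20),
}

def canonical(field: str, value: str) -> str:
    """Lowercase the value and map known nicknames to a full first name."""
    value = (value or "").strip().lower()
    if field == "first_name":
        return NICKNAMES.get(value, value)
    return value

def match_score(a: dict, b: dict) -> float:
    """Sum log2 agreement or disagreement weights over all fields."""
    score = 0.0
    for field, (m, u) in FIELDS.items():
        if canonical(field, a.get(field)) == canonical(field, b.get(field)):
            score += log2(m / u)              # agreement weight (positive)
        else:
            score += log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return score

THRESHOLD = 6.0  # hypothetical cut-off above which a pair counts as a match

linkedin = {"first_name": "Will", "last_name": "Spade", "city": "Austin"}
facebook = {"first_name": "Bill", "last_name": "Spade", "city": "Austin"}
stranger = {"first_name": "Bill", "last_name": "Cutler", "city": "Austin"}

same_person = match_score(linkedin, facebook) >= THRESHOLD
different_person = match_score(linkedin, stranger) >= THRESHOLD
```

With these illustrative numbers, the Will and Bill records score well above the threshold because the nickname rule makes the first names agree, while the record with a different last name falls below it. In practice the probabilities are estimated from the data rather than chosen by hand.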
Hybrid Algorithms
Advanced matching solutions don’t just rely on one matching algorithm to perform complex matching processes. They make use of hybrid algorithms that combine deterministic, probabilistic, phonetic, fuzzy, and machine learning techniques to achieve higher matching rates. As new matching techniques become available, their implementations will likely be more hybrid as opposed to purely deterministic or probabilistic.
Regardless of how innovative matching algorithms are, in practice, there is no single technique that can satisfy the increasingly complex requirements of data matching at an organizational level. When it comes to entity resolution and multi-domain MDM, multiple matching algorithms are required to do the job well.
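A very small hybrid can be sketched as a cascade: try a cheap exact rule first and fall back to a fuzzy rule only when it fails. The rule order, the field names, and the 0.85 threshold below are assumptions chosen for illustration. The similarity measure is `difflib.SequenceMatcher` from the Python standard library, used here as a stand-in for whatever fuzzy measure a real engine applies.

```python
from difflib import SequenceMatcher

def hybrid_match(a: dict, b: dict, name_threshold: float = 0.85):
    """Return (is_match, rule_name): exact rule first, then a fuzzy fallback."""
    # Rule 1 (deterministic): identical, non-empty email addresses.
    if a.get("email") and a.get("email") == b.get("email"):
        return True, "exact_email"
    # Rule 2 (fuzzy): similar names AND the same postcode.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    if ratio >= name_threshold and a.get("postcode") == b.get("postcode"):
        return True, "fuzzy_name_postcode"
    return False, "no_rule"

# Illustrative records.
r1 = {"name": "Catherine Jones", "email": "cj@example.com", "postcode": "94704"}
r2 = {"name": "Katherine Jones", "email": "", "postcode": "94704"}
r3 = {"name": "C. Jones", "email": "cj@example.com", "postcode": "10001"}
r4 = {"name": "Robert Smith", "email": "rs@example.com", "postcode": "10001"}

results = [hybrid_match(r1, r2), hybrid_match(r1, r3), hybrid_match(r1, r4)]
```

Returning the name of the rule that fired makes later review easier, because a reviewer can see whether a pair was linked by an exact identifier or by a looser similarity rule.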
What is the Data Matching Process?
The data matching process in general terms encompasses key data management functions like cleaning/formatting the data, indexing the data, performing a data comparison, and creating views.
Here’s a brief breakdown.
1). Data Cleaning and Standardization
Most data sources contain incomplete, incorrect, and inaccurate data that affects the accuracy of a matching process. Failing to clean, format, and reduce errors before a match can result in harmful consequences. For example, if a bank customer’s address is incorrect or incomplete, the bank may mistakenly send sensitive information to the wrong individual, leading to court cases, penalties, or violation of privacy laws. Data cleaning, therefore, is the most important step of data matching.
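As a rough sketch of what standardization can look like before a match, the helpers below normalize names, phone numbers, and street abbreviations. The abbreviation table and the rules are simplified assumptions. Real address and name cleaning needs many more cases than are shown here.

```python
import re

# Hypothetical abbreviation table (assumption; "st" could also mean "saint").
ABBREVIATIONS = {"ave": "avenue", "st": "street", "rd": "road"}

def clean_name(name: str) -> str:
    """Collapse repeated whitespace and normalize casing."""
    return " ".join(name.split()).title()

def clean_phone(phone: str) -> str:
    """Keep digits only and rewrite a leading '00' prefix as '+'."""
    digits = re.sub(r"\D", "", phone)
    if phone.strip().startswith("+"):
        return "+" + digits
    if digits.startswith("00"):
        return "+" + digits[2:]
    return digits

def clean_address(address: str) -> str:
    """Lowercase, drop dots and commas, and expand known abbreviations."""
    words = re.sub(r"[.,]", " ", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

name = clean_name("  catherine   JONES ")
phones = clean_phone("+1 (555) 010-2000"), clean_phone("001 555 010 2000")
address = clean_address("12 Shattuck Ave.")
```

After cleaning, the two phone numbers written as +1 and 001 become the same string, so even a simple exact comparison can match them.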
2). Identifying Attributes to Match
Once the data is cleaned and standardized, each record from one database will be compared with all records in other databases to calculate similarities. While this seems easy, it’s incredibly hard to manage with large databases. For example, matching two databases with 1 million records each will result in 1 trillion record pair comparisons (1,000,000 × 1,000,000). To reduce unnecessary comparisons, fixed attributes (such as phone numbers) are picked to perform a match. Only records that share the same value for the filter criteria (such as only phone numbers of people within a certain town) are compared with each other.
3). Creating Comparison Blocks
Once records are indexed for matching, blocking techniques are applied to narrow down the match criteria. For example, only records from two databases that share the same value (such as entities with the same postcode) are inserted into the same block. Records in the same block are more likely to be a match, so this approach significantly reduces the workload of assessing and reviewing matches.
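Blocking on a shared key can be sketched as follows. The `postcode` key and the tiny sample databases are assumptions for illustration. The point is only to show how the number of candidate pairs shrinks compared with the full cross product.

```python
from collections import defaultdict

def candidate_pairs(db_a, db_b, key="postcode"):
    """Yield only pairs of records that share the same blocking-key value."""
    blocks = defaultdict(list)
    for rec in db_b:
        blocks[rec[key]].append(rec)
    for rec_a in db_a:
        for rec_b in blocks.get(rec_a[key], []):
            yield rec_a, rec_b

# Illustrative data.
db_a = [{"id": "a1", "postcode": "94704"},
        {"id": "a2", "postcode": "10001"},
        {"id": "a3", "postcode": "94704"}]
db_b = [{"id": "b1", "postcode": "94704"},
        {"id": "b2", "postcode": "60601"},
        {"id": "b3", "postcode": "10001"}]

all_pairs = len(db_a) * len(db_b)  # size of the full cross product
blocked = [(a["id"], b["id"]) for a, b in candidate_pairs(db_a, db_b)]
```

Here only 3 of the 9 possible pairs are compared. The trade-off is that a record with a wrong or missing postcode lands in the wrong block and is never compared with its true match, which is one more reason cleaning comes first.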
4). Record Pair Classification
In a data matching process, classification simply means marking your records as matches or non-matches. Matched records are kept for human review later, while non-matches are kept in a separate record. Some companies even prefer to keep a record of duplicated data for a thorough review later.
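In code, classification can be as simple as splitting scored pairs around a threshold. The scores, pair identifiers, and the 0.8 cut-off in this sketch are hypothetical. How the scores are produced depends on the matching algorithm in use.

```python
def classify(scored_pairs, threshold=0.8):
    """Split (pair_id, score) tuples into matches and non-matches."""
    matches, non_matches = [], []
    for pair_id, score in scored_pairs:
        if score >= threshold:
            matches.append(pair_id)      # kept for human review
        else:
            non_matches.append(pair_id)  # stored separately
    return matches, non_matches

# Illustrative scores for three candidate pairs.
scored = [("a1-b1", 0.93), ("a2-b3", 0.41), ("a3-b1", 0.80)]
matches, non_matches = classify(scored)
```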
5). Creating a Final Record for Review
A final master record or golden view is created which is stored as the ultimate source of truth. This treated and matched data is applied in downstream applications to fulfill business goals.
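How the surviving values are chosen varies from one organization to another. As one possible rule, the sketch below keeps the most common non-empty value for each field across a group of matched records. The rule, the field names, and the sample cluster are assumptions for illustration, not a description of how any product builds its master records.

```python
from collections import Counter

def golden_record(cluster, fields):
    """Merge matched records: per field, keep the most common non-empty value."""
    master = {}
    for field in fields:
        values = [rec[field] for rec in cluster if rec.get(field)]
        master[field] = Counter(values).most_common(1)[0][0] if values else None
    return master

# Illustrative cluster of records already classified as the same person.
cluster = [
    {"name": "Catherine Jones", "phone": "+15550102000", "email": ""},
    {"name": "Catherine Jones", "phone": "", "email": "cj@example.com"},
    {"name": "Cath Jones", "phone": "+15550102000", "email": "cj@example.com"},
]
master = golden_record(cluster, ["name", "phone", "email"])
```

No single source record is complete here, yet the merged record has a name, a phone number, and an email address. Other rules, such as preferring the most recently updated source, follow the same pattern.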
Data matching is a complicated process that most companies are wary of because they either attempt the process manually or they are unable to have good enough data to run a match.
Watching Out for Errors in Matching
There are two common errors in matching to watch out for. A false positive occurs when a record is erroneously matched to a master record it does not belong to. A false negative occurs when a record is not matched even though it belongs to the same entity.
For example, false positives can trigger a security alert for an innocent citizen if their data is erroneously matched with an individual marked as a criminal, as is frequently seen at immigration centers.
Using the same example, false negatives exclude the individual with a criminal record completely, thus causing a security concern.
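When a sample of known true matches is available, both error types can be counted directly. The sketch below assumes such a labeled sample exists. The pair identifiers are invented, and precision and recall are the standard summary measures for these two kinds of error.

```python
def evaluate(predicted, truth):
    """Compare predicted match pairs against known true pairs."""
    predicted, truth = set(predicted), set(truth)
    true_pos = predicted & truth
    false_pos = predicted - truth  # matched, but not the same entity
    false_neg = truth - predicted  # same entity, but not matched
    precision = len(true_pos) / len(predicted) if predicted else 0.0
    recall = len(true_pos) / len(truth) if truth else 0.0
    return false_pos, false_neg, precision, recall

# Illustrative labeled sample.
truth = {("a1", "b1"), ("a2", "b3"), ("a3", "b4")}
predicted = {("a1", "b1"), ("a2", "b3"), ("a5", "b9")}

false_pos, false_neg, precision, recall = evaluate(predicted, truth)
```

Tightening a match threshold usually reduces false positives at the cost of more false negatives, so tracking both numbers together shows whether a rule change actually helped.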
The chances for errors to occur significantly increase with poor data. Some common instances that cause false negatives and positives are:
- Invalid or incorrect data for multiple records – for example, customers filling in a company phone number instead of the personal number on a form (which could be shared by many people).
- Using multiple versions of the same name – for example, Cath, Kath, Catherine, Cathy, Kathy
- Incorrect spellings and addresses – for example, Berkeley as Berkley
- Using abbreviations instead of full spellings – Avenue vs Ave
- Non-standardized format of phone numbers and addresses – +1 vs 001
Errors in matching can be curbed with a data matching solution that allows for in-depth data profiling before the matching process. During the profile stage, users can get a bird’s-eye view of the errors affecting their data.
Here’s how data profiling in WinPure helps identify and rectify data quality issues before the match process.
Fortunately, data matching technologies like WinPure and many others now allow users to clean and match multi-million, multi-domain data sources in seconds, saving months of effort.
If your firm has not yet considered a data matching solution, the right time is now.
Why Firms Need to Invest in Data Matching Technologies
On a micro-level, teams still use Excel to perform exact data matching, a time-consuming and counter-productive effort. You would have to manually clean and transform data before you can run a comparison using Excel. As advanced as Excel’s capabilities are, it is not a tool designed for data matching.
On a larger level, most companies expect their data analysts or engineers to manually create, test, and run match algorithms only to fail and cause avoidable project delays.
Here’s a rough breakdown of expected timelines for matching employee records of a large organization with 500+ employees.
- 1 month to identify attributes and build a data matching plan
- 2 months to assess and resolve data quality issues (longer if the organization does not have a formal data management structure)
- 1 month to create and test different algorithms based on the complexity of the data.
- 1 month to finalize a match record along with reviewing false positive and false negative matches
Now imagine doing the same for larger, more complex, and diverse databases. You could spend years and still not be able to truly get a consolidated view of your customers. With a data matching solution though, you can clean, merge, dedupe, and create final records in just a few minutes without requiring any coding or mathematical knowledge.
The time and effort saved can be directed towards strategic tasks like implementing a data governance framework, resolving data quality problems at the root level, and so on. In a world that’s competing on data, the last thing you would want is your team of highly-paid professionals crossing every T and dotting every I when a tool could have done the same with better accuracy and speed.
10 Data Matching Benefits
When done right, here are the top 10 benefits of data matching that firms can experience. Data matching is a complicated process that requires the combination of strategic planning, an advanced matching engine, and clearly defined business goals to deliver on expectations.
1). Fraud detection and prevention. Financial institutions are under immense pressure to deal with increasingly complex fraudulent activities. From scams to fake identities, from money laundering to punitive regulatory compliances, firms need access to accurate, reliable, consolidated records. Data matching is used to compare the firm’s records against criminal and sanctions databases to identify details about the individual.
2). Identity verification. Government and law-enforcement departments benefit greatly from data matching as they can match records at an attribute and reference data level (such as registrations, driver’s licenses, social security numbers, etc.) across multiple databases to get an overall picture of the individual. Based on the match and analysis result, risk scores are calculated to identify whether the individual is a real person, a threat, or a synthetic identity.
3). Sanctions & GDPR Compliance: In 2019, violations resulted in fines of $10bn for non-compliance with AML, KYC, and sanctions regulations. Similarly, there have been over €359 million in major GDPR fines so far. Businesses are required to match their data against the data given in sanctions lists to ensure they are not accidentally engaging in trade with listed entities or individuals.
4). Better Public Programs: A CBPP report in the United States revealed that over 40% of eligible individuals missed out on a public nutritional program for women, infants, and children because of enrollment gaps that prevented them from actually getting the benefits. Data matching allowed four states to find enrollment gaps and to identify individuals for targeted outreach. Additionally, data matching reduced the applicant’s documentation burden and simplified the certification process for eligible individuals. Data matching has repeatedly been demonstrated as a necessary and beneficial technology to uplift, improve, and deliver effective public programs.
5). Targeted Campaigns: According to Salesforce, 70% of CRM data becomes obsolete while around 30% of records are duplicated. Yet companies will still send out emails, direct mail, and flyers to all existing database customers, only to be at the receiving end of the customer’s wrath. Data matching in this case becomes a dependent activity. Companies would need to regularly deduplicate records and ensure they are complete, updated, and accurate whenever they want to run a campaign. Most importantly, data matching helps with identifying household data (members of a family using the same service), which in turn enables marketing teams to build targeted lists. For example, an insurance company can cut down on the costs of direct mailing by up to 80% if they send out just one flyer to one household instead of five flyers to five members of the same household!
6). Improved Customer Service: Data matching can be used to consolidate scattered customer records giving companies a 360 view of their customer journey. This in turn helps marketing and customer service departments improve customer service. For example, an airline can use data matching to identify customers’ preferences and provide them with optional stays at exclusive hotels or BnBs.
7). Improved Customer Retention: For example, a gym can use data matching to offer exclusive discounts to members of the same family. Similarly, an insurance company can use data matching to offer specific health insurance plans for people with newborns or people with teenagers. This ability to predict customer needs and offer a relevant service or product greatly boosts customer retention.
8). Increased Organizational Efficiency: When teams have access to accurate and reliable data, they can make decisions faster and better. Companies that have invested in MDM and entity resolution processes have reported higher efficiency.
9). Remove Duplicates & Improve Data Quality: One of the biggest benefits of data matching is deduplication – the process of removing duplicates within a data source. Duplicate data is a critical data quality concern that most firms are ill-equipped to handle because of informal data matching processes that result in higher false negatives and positives.
10). Drive Business Growth: Efficient data matching is the backbone of entity resolution, which boosts growth factors like complete customer views, targeted marketing, better products and services, and so on. If businesses want to compete in today’s world, they need reliable data systems and processes.
Most firms we’ve worked with are well aware of the benefits of data matching but struggle with the implementation process. Our Clean and Match solution is designed to take the struggle away and help you and your team breeze through a complicated process with a point-and-click, user-friendly, no-code solution that ensures you meet your business goals fast.
What Makes WinPure a Market-Leading Data Matching Solution?
WinPure is a market-leading data matching solution that delivers the highest accuracy rate in the industry. Some of the key features that make the solution a best-in-class data matching solution are:
Match accuracy: Higher accuracy gives a more complete and integrated view of the entity and reduces the errors associated with false positives (for example, incorrectly associating an individual as an employee of a wrong organization) or false negatives (for example, missing the employee name completely). WinPure’s data matching has nearly 97% matching accuracy, higher than any other in the industry.
Match speed: It’s critical for a data match solution to deliver match results in real-time. WinPure offers the fastest and most intelligent data matching engine from proprietary phonetic and fuzzy matching algorithms, combined with sophisticated scoring and merging functions.
Master records: Once linking and matching are done, the results need to be stored as a master record for follow-up processing. WinPure allows you to create custom rules to define master records.
Scalability: A data matching engine has to be able to address scalability concerns. These include the number of records to be matched, the ability for easy integration with various data sources, the ability for user-defined matching rules, and more. WinPure allows for the matching of a million records and beyond, depending on the in-memory capacity of the user’s hardware.
Ease of Use: The data matching solution should provide an easy-to-use, intuitive interface for users to define and customize matching criteria and resolve uncertainties. WinPure’s data matching solution is loved for its ease of use wherein even business users can perform a data match activity with minimal training. An intuitive interface makes all the difference between user acceptance and rejection.
WinPure’s data matching engine is finely tuned to achieve high matching accuracy and preserves the integrity of its algorithmic techniques. However, there are situations where custom-defined matching rules and manual processes must be followed to achieve the desired business outcome. In such cases, WinPure’s data matching engine allows for users to create custom rules in conjunction with the match algorithms to deliver satisfactory results.
Data Matching Example: How Vodafone Used WinPure’s Fuzzy Matching Capabilities to Align Sales and Acquisitions
Vodafone, a British multinational telecommunications company, owns and operates networks in 22 countries and provides services to corporate clients in 150 countries. With such a large database, Vodafone needed a data matching engine that could match all their account names from one list and align them correctly to the same accounts on their master sales database.
WinPure’s solution not only helped with the match but also fixed spellings and other data quality issues to optimize the match results. This saved Vodafone months of effort and helped them improve their revenue in a timely manner.
Read about a data matching example of Vodafone’s experience using WinPure.
To Conclude
Data Matching is a critical function that demands Best-in-Class technology.
As data structures become more complicated, they require data matching technologies that can keep up. Whether it’s an enterprise project like matching multi-million records for a merger, or a small departmental project like combining marketing & sales data for insights, primitive data matching techniques such as exact comparisons via Excel formulas or manually coded algorithms are no longer enough or effective. Organizations need best-in-class data matching solutions that offer data preparation, data cleaning, and data transformation as part of the process. Moreover, an intelligent data matching technology must have multiple algorithms at work to tackle data nuances and reduce errors.
The true benefits of data matching can only be experienced when the matching process is fast, easy to implement, and delivers accurate matches. WinPure is the only data match tool in the industry that meets all the checkboxes of a modern, user-friendly, affordable data match solution for businesses of all types and sizes.
Want to speak to our data match solutions specialist & check out the demo? Get in touch!