An Overview of Data Matching


Farah Kim • December 2022

Data matching is an important function of data management that directly impacts business outcomes. The ability to connect data from multiple sources, to get singular customer views, to map source and target data, to meet GDPR/sanctions compliance, to achieve entity and identity resolution goals – all depend on the core function of fast, efficient, accurate data matching. 

In this guide, we discuss how data matching works, the different data matching techniques, how it is used in various industries, and how WinPure’s data matching software helps businesses achieve their data goals. 

What is Data Matching?

Real-world entities are recorded and identified in databases based on attribute values – which are descriptions such as names, addresses, phone numbers, and so on. When recording this data, either manually or via automatic means, there is always a chance of error.

 

For instance, a data entry officer may accidentally write a name as Catherine instead of Kathryn, a variation of the same name but with different spellings. Companies with multiple data sources, such as customer data from a third-party app, are bound to have duplication issues, and unresolved duplication can lead to unreliable insights, directly impacting business objectives. 

 

Data matching aims to identify duplicates and connect disparate data sources to create a single, golden customer view for analytics, reporting, or other business purposes.

What is the Traditional Data Matching Process?

The data matching process consists of five major steps, starting with cleaning and standardizing the data and ending with creating master or golden records.

 

Here’s an example: 

There are two database tables that are to be matched. Both contain names, addresses, dates of birth (DOBs), and phone numbers. The problem? They differ in terms of structure and content quality. Also, they have different unique IDs. You now have to match this data to determine which records refer to the same real-world people.

 

data matching table

There are five steps in a data matching process for table structures like these.

data matching process
 Source: Data Matching Process by Peter Christen

 

1. Data Cleaning and Standardization

Most data sources contain incomplete, incorrect, and inaccurate data that affect the accuracy of a matching process. Before initiating a data matching process, you must clean and format the data to reduce errors and avoid false matches. There are real-world consequences to incorrectly matched data, such as a bank mistakenly sending sensitive information to the wrong individual, leading to court cases, penalties, or violation of privacy laws.

Data cleaning, therefore, is the most critical step of data matching.

 

There are various causes of incorrect or dirty data attributes, from manual data entry (such as someone typing in data) to form fields on websites that do not have data quality checks in place (such as automatically filled country codes that prevent incorrect manual entries).

There are generally four types of data cleaning steps:

 

a). Removing stop words and irrelevant characters: This step aims to clean the input data so that only relevant information remains. For example, punctuation such as commas, periods, semicolons (;), and quotation marks is removed because it can clutter an analysis. In applications like WinPure, you can also remove common stop words such as articles, prepositions, pronouns, and conjunctions (or, and, but, the), which do not add any value to the information in the text.

 

 b). Fixing spellings & abbreviations: Along with typos, spelling differences are significant issues in data matching. For instance, Johnny is a common nickname for John. If you’ve collected John’s data from a third-party app such as Facebook, where he uses Johnny, you’ve got a duplicate and inconsistent entry.

This process often relies on look-up tables containing name variations, nicknames, or common variant spellings (along with their expanded versions) to ensure a standard set of values throughout your system. Standardization improves how well attributes match up because it reduces variation in names across different records and ensures consistency.

 

 c). Parsing fragmented attributes: This process deals with attributes that have fragmented pieces of information. For example, matching the street, postal code, and suburb columns of the first set with the text column of the second set is challenging. In such cases, you would need to parse the data through manual methods or a data matching solution that allows you to parse the data based on multiple probabilistic techniques automatically.

 

d). Verification of attribute values: The final step in the data cleaning process is to verify the correctness of addresses against available databases. Also called address verification, this step matches available addresses against a reliable external source like the US or UK postal database, where locations can be verified. For example, if the reference data contains no town called “ShallCruss” but does contain a “Shallcross Crescent,” the mismatch will prompt a check during the process. Popular data cleaning solutions include address verification as an essential module that lets users verify against government databases in the US, UK, and Canada.
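The first two cleaning steps above (removing irrelevant characters and stop words, then standardizing nicknames through a look-up table) can be sketched in a few lines of Python. This is a minimal illustration, not how WinPure implements it: the `NICKNAMES` and `STOP_WORDS` tables and both function names are hypothetical, and a real look-up table would be far larger.

```python
import re

# Hypothetical look-up tables; real ones would be much larger.
NICKNAMES = {"johnny": "john", "kathy": "katherine", "bill": "william"}
STOP_WORDS = {"or", "and", "but", "the"}


def clean_text(value: str) -> str:
    """Lowercase, replace punctuation with spaces, drop stop words."""
    value = value.lower()
    value = re.sub(r"[^\w\s]", " ", value)  # commas, periods, quotes, etc.
    tokens = [token for token in value.split() if token not in STOP_WORDS]
    return " ".join(tokens)


def standardize_first_name(name: str) -> str:
    """Map a nickname to its standard form if the look-up table knows it."""
    cleaned = clean_text(name)
    return NICKNAMES.get(cleaned, cleaned)


print(clean_text("The Smith & Sons, Ltd."))   # smith sons ltd
print(standardize_first_name(" Johnny. "))    # john
```

Names that are not in the look-up table pass through unchanged, so the table can grow over time without breaking existing records.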

 

 Here’s how the above database looks after a data cleaning process.

data matching table Image 3

 


Note: With WinPure, you can also add a column into the data. In this case, a Gender column was also added.

2. Preparing for Match with Indexing

Once the data is cleaned and standardized, each record from one database is compared with every record in the other database to calculate similarities. While this seems straightforward, it’s tough to manage with large databases.

 

For example, matching two databases with 1 million records each will result in 1 trillion pair comparisons (1,000,000 × 1,000,000). Such comparisons are computationally heavy and are not feasible for today’s large databases, which may contain millions of records. The number of candidate pairs grows with the product of the database sizes, and the vast majority of those pairs are non-matches, making the data matching process unnecessarily complex.

 

Indexing helps resolve the “match all pairs” problem by reducing the number of record pairs that need a comparison. Without indexing, matching a database of m records against one of n records requires m × n detailed comparisons, even though most of those pairs are clear non-matches. If each record matches at most one record in the other database, there can be no more than min(m, n) true matches.

 

Indexing techniques involve the creation of blocks, lists, or clusters. The key values on which these operations depend often come from one attribute but could include several different ones if needed.

 

For example, a postcode (or zip code) attribute could be used as a blocking key, so that all records with the same postcode are placed in the same block and only records within a block are compared with each other. Indexing is important for more than just matching data across databases. It also needs to be applied when deduplicating a single database. The same index can be reused when new databases need to be matched against yours, saving time on repetitive tasks.

 

data matching indexing image 5

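Although a useful technique, blocking must be applied with caution because it can place two records that refer to the same entity in different blocks, for example when the postcode in one of them contains a typo. In this case, a match goes undetected, meaning your data is skewed. You must have domain and data matching knowledge when defining blocking criteria.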
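Here is a small sketch of postcode blocking, including the caveat just described. The sample records, their IDs, and the function names are hypothetical and exist only to illustrate the idea.

```python
from collections import defaultdict
from itertools import product

# Hypothetical sample records: (record_id, name, postcode).
table_a = [
    (100, "catherine smith", "2600"),
    (101, "john miller", "2913"),
    (102, "peter jones", "2602"),
]
table_b = [
    (200, "katherine smith", "2600"),
    (201, "jon miller", "2913"),
    (202, "peter jones", "2620"),  # postcode typo: digits transposed
]


def block_by(records, key_index):
    """Group records into blocks keyed by one attribute value."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[key_index]].append(record)
    return blocks


def candidate_pairs(a, b, key_index=2):
    """Yield only the pairs whose blocking key values are identical."""
    blocks_a, blocks_b = block_by(a, key_index), block_by(b, key_index)
    for key in blocks_a.keys() & blocks_b.keys():
        yield from product(blocks_a[key], blocks_b[key])


pairs = sorted((ra[0], rb[0]) for ra, rb in candidate_pairs(table_a, table_b))
print(len(table_a) * len(table_b), "pairs without blocking")  # 9
print(len(pairs), "pairs with blocking:", pairs)  # 2: [(100, 200), (101, 201)]
```

Blocking cuts nine comparisons down to two, but the pair (102, 202) is never compared because of the postcode typo. That is exactly the kind of missed match that careless blocking criteria can cause.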

3. Record Pair Comparison

Indexing may generate potential matches, but more detail is required to define a true match. That’s where record pair comparison comes in. You attempt to find two records that share as many similarities as possible.

 

Generally, the more attributes two records share, the higher the degree of matching between them. Using the running example, we can find more potential matches by including the postal code and phone number columns to narrow the match.

 

The ability to find a match between two sets of data is only the first step in addressing incomplete matches. Even after all possible characters have been matched and standardized, there may still exist different attribute values that correspond with true matches. In this case, ‘100’ and ‘106’ likely refer to the same person, but their respective First Names differ just enough such that they would not be considered exact duplicates. Therefore, it is necessary to conduct some approximate comparison to measure similarity.

 

Different attributes contain different types of data, and this requires different comparison functions to be applied. For example, string values such as names or addresses cannot be compared using the same criteria as dates or numeric values. It is therefore advisable to group attributes by type and apply a suitable comparison function to each group.
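The idea of applying a different comparison function to each attribute type can be sketched as follows. This is an illustrative example only: the attribute names and sample records are hypothetical, and the string comparison uses the `SequenceMatcher` ratio from Python’s standard library as a stand-in for whichever similarity measure a matching tool actually applies.

```python
from datetime import date
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Approximate string similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def exact_similarity(a, b) -> float:
    """Exact comparison: 1.0 when the values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def compare_records(rec_a: dict, rec_b: dict) -> dict:
    """Build a comparison vector with one similarity score per attribute."""
    return {
        "first_name": string_similarity(rec_a["first_name"], rec_b["first_name"]),
        "dob": exact_similarity(rec_a["dob"], rec_b["dob"]),
        "postcode": exact_similarity(rec_a["postcode"], rec_b["postcode"]),
    }


# Hypothetical records that differ only in the spelling of the first name.
rec_a = {"first_name": "Catherine", "dob": date(1985, 3, 14), "postcode": "2600"}
rec_b = {"first_name": "Katherine", "dob": date(1985, 3, 14), "postcode": "2600"}

vector = compare_records(rec_a, rec_b)
print(vector)
```

The result is a comparison vector: the date of birth and postcode agree exactly, while the first name receives a high but not perfect score.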

4. Classifying as a Match or a Non-Match

In traditional data matching approaches, for example those based on probabilistic record linkage models, record pairs are classified into one of three classes: (i) matches, (ii) non-matches, or (iii) undecided cases.

 

The difference between a match and a non-match is based on how the records are related. A match is when the pair of records refers to the same real-world entity. A non-match is when the two records do not refer to the same entity. The third classification is a potential match that calls for a manual review.
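A simple way to picture the three classes is a score with two thresholds. The sketch below is a deliberate simplification: probabilistic record linkage models weight each attribute rather than taking a plain average, and the threshold values and function names here are hypothetical.

```python
def overall_score(comparison_vector: dict) -> float:
    """Average the per-attribute similarities into one score in [0, 1]."""
    return sum(comparison_vector.values()) / len(comparison_vector)


def classify(score: float, lower: float = 0.6, upper: float = 0.85) -> str:
    """Classify a record pair using two (assumed) thresholds."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "potential match"  # undecided: send to manual review


vector = {"first_name": 0.89, "dob": 1.0, "postcode": 1.0}
print(classify(overall_score(vector)))  # match
print(classify(0.7))                    # potential match
```

Widening the gap between the two thresholds sends more pairs to manual review, and narrowing it automates more decisions at the risk of more errors.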

5. Ensuring the Data Quality of the Matches 

Match quality is an essential measure of how well matches are identified and is greatly affected by how the data cleaning was performed. If you miss cleaning unnecessary characters (such as hyphens in phone numbers), you could be looking at inaccurate and false match results. Remember, though, that much of the assessment of match quality requires human intervention, from carrying out extensive investigations to double-checking individual records or contacting the individuals concerned to determine the truth about the match. This is one reason why a data matching solution helps: data analysts can focus on assessing the validity of a match instead of worrying about how to perform it. Today’s data matching solutions allow you to complete the data matching process automatically, in as little as five minutes.

Data Matching Techniques & Configuration

While there are multiple algorithms and methods to data matching, most solutions use three types of data matching techniques to match text, strings, and numeric data. 

 

a. Fuzzy Matching 

Fuzzy matching allows for easy matching of semi-structured data and records that do not have exact matching attributes. Text strings like names are matched using fuzzy techniques such as Soundex for same sounding names, or Levenshtein Edit Distance for differences in spellings. 

For example, the edit distance between the strings Catherine and Katherine is 1 because only one edit operation, replacing C with K, is necessary to transform Catherine into Katherine.
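The Levenshtein edit distance can be computed with a short dynamic-programming routine. The sketch below is a textbook implementation for illustration, not the code of any particular product. The `similarity` helper, which turns the distance into a 0 to 1 score, is an assumption about how a percentage-style fuzzy threshold could be derived.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the edit distance to a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("Catherine", "Katherine"))  # 1
print(round(similarity("Catherine", "Katherine"), 2))
```

Note that the comparison is case-sensitive as written, so values should be cleaned and standardized first.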

 

b. Exact Matching 

In this technique, you match text that you are 100% sure is correct or exact and doesn’t require any additional edits. 

For example, you know that every Catherine in your database is spelled the same way and there are no variations. When you are sure about the validity of your data, you can use exact matching to get exact matches.

 

c. Numeric Matching

Numeric matching deals only with numbers. It’s great for matching phone numbers or postal codes that contain only numbers.

You can choose what kind of matching configuration to use depending on the health of your data.
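As a rough illustration of numeric matching, the sketch below strips everything except digits before comparing, so formatting differences such as hyphens or brackets in phone numbers do not block a match. The function names are hypothetical and the phone numbers are made up.

```python
import re


def digits_only(value: str) -> str:
    """Remove every character that is not a digit."""
    return re.sub(r"\D", "", value)


def numeric_match(a: str, b: str) -> bool:
    """True when both values contain the same non-empty digit sequence."""
    digits_a, digits_b = digits_only(a), digits_only(b)
    return bool(digits_a) and digits_a == digits_b


print(numeric_match("(555) 010-1234", "555-010-1234"))  # True
print(numeric_match("2600", "2602"))                     # False
```

Two empty values are treated as a non-match here, on the assumption that missing data should not count as evidence that two records refer to the same entity.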

How to Perform Data Matching Using an Automated Solution

Data matching solutions allow you to pre-process, index, block, and compare within and across multiple data sets in an easy on/off premises platform. With WinPure, for example, you can match data in a few easy steps:

 

  • Integrate data sources from multiple data sets & file formats: Unlike a few decades ago, you no longer need to manually transform data to run a comparison. With easy integration functions, you can connect a CSV file or a MySQL database to the interface and begin a match process.


 

  • Advanced data cleaning functions: In the image given below, you will see how the tool profiles the data for inconsistencies and errors. So if you’re in the marketing department, you can see straight away that you’ve got empty email addresses and fields with punctuation and characters that add “noise” to the data. Traditionally, it would take hours or even weeks to clean this data, and if it’s not cleaned, it leads to unreliable insights that directly impact business and revenue.

data matching table image 7

  • Advanced cleaning with custom regular expressions: Sometimes you’ve got complex string data such as email IDs that contain numbers and text, such as winpure123@winpure.com. You can match these strings using regular expressions that are built into the tool or that you can get from a library of expressions to match complex data.
  • Standardizing and cleaning data by splitting the data: When you have multiple data sets, such as from marketing, product management, or sales, you can end up with inconsistencies in standards. For example, one person may write a date as dd/mm/yyyy and another as dd/mm/yy.

This seemingly small difference can cause discrepancies in match results. It’s always recommended to standardize issues such as capitalization, spaces between code numbers, abbreviations, and others.

You can resolve these issues on the WinPure platform by splitting the data and choosing options like Propercase, Uppercase etc to identify and resolve standardization problems. 

 

  • Building your own word library: Have specific words and abbreviations that you want to consider during the match process? WinPure lets you build a custom word library using Word Manager, a function that lets you build your own records about how you want your data to be corrected or standardized. For example, you prefer Limited over LTD, or Ltd. Whichever your preference, you can set the standard and the tool will correct it accordingly. 

data matching table image8

 

  • Matching within and across data sets: From the columns you’ve cleaned before, you can now match within data sets (such as matching data of Table A, then Table B). Once done, you can then match data across the tables (A x B) to weed out duplicates.
  • What to match? Obviously the data sets that you believe share unique traits. For example, matching Contact Numbers with Company Names. Two contact numbers for company A is a duplication. When choosing what to match, it’s better not to overcomplicate the decision and opt for attributes that you know have lower chances of similarity. For instance there could be multiple individuals with the name Johnny, but there’s only one company named Apple!
  • How to match? Fuzzy, exact, or numeric? Everyone does this section differently. You can select 90% fuzzy match to see how the results turn out or you can do an exact match for data you already know has been fixed and standardized. You can also do numeric matches for phone numbers and postal codes and exact matches for data that has been thoroughly processed. 

fuzzy matching winpure image 9

 

  • Assessing the match or creating master records: Once you get the match results, you can decide whether you want to manually assess the results (evaluation) or you’re ready to save them as a master record. Remember, master records become the most reliable version of the data, so always make sure your data is processed and deduped before you create a master record for use later on.

 

data matching results image 10

And there you go! Your match is done! Using an automated data matching solution can save you at least 4 weeks of manual work and improve efficiency by 70%!
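Two of the standardization steps in the list above, normalizing inconsistent date formats and applying a custom word library, can be sketched outside any tool as well. This is a hypothetical illustration and not WinPure’s Word Manager: `DATE_FORMATS`, `WORD_LIBRARY`, and both function names are assumptions made for the example.

```python
from datetime import datetime

# Hypothetical list of input formats seen across departments.
# Order matters: the four-digit-year format is tried first.
DATE_FORMATS = ("%d/%m/%Y", "%d/%m/%y", "%Y-%m-%d")

# Hypothetical word library: lowercase variant -> preferred form.
WORD_LIBRARY = {"ltd": "Limited", "ltd.": "Limited", "limited": "Limited"}


def standardize_date(value: str) -> str:
    """Return the date in ISO 8601 form (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {value!r}")


def standardize_company(name: str) -> str:
    """Proper-case each word, replacing variants found in the word library."""
    words = [WORD_LIBRARY.get(w.lower(), w.capitalize()) for w in name.split()]
    return " ".join(words)


print(standardize_date("25/12/2022"))    # 2022-12-25
print(standardize_date("25/12/22"))      # 2022-12-25
print(standardize_company("acme LTD"))   # Acme Limited
```

Two-digit years are inherently ambiguous. Python’s `%y` maps 00 to 68 onto 2000 to 2068 and 69 to 99 onto 1969 to 1999, which may not suit dates of birth, so the century rule is a decision to make per data set.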

Why Data is Difficult to Match - Common Challenges

Despite the most advanced technology, data matching can still be difficult especially if you don’t take care of pre-processing issues like: 

 

 Data entry issues: One reason is that the entry of data can be subjective. For example, when entering someone’s name into a database, there may be variations in how that name is spelled. 

 

Disparate data sources: It’s not uncommon for organizations to have data streaming in from multiple sources. Consider an airline for example – with data streaming in from dozens of third-party websites (such as booking.com) resulting in multiple versions of the same customer data, but stored in different formats. Unless there’s a strong data governance system in place, it is impossible to match data that has been recorded under different standards. 

 

Corrupted data: A software malfunction, a migration error, or even a virus can cause data to become corrupted, which can cause your information to become inaccurate and incomplete. As a result, it can be difficult to match it up with other data. 

 

Human error: Finally, human error can also cause data matching problems. For example, someone may enter the wrong information into a database or make a mistake while comparing data. 

 

Most of these challenges need to be resolved through a thorough implementation of a data quality and data governance strategy where source level problems are nipped in the bud. 

Why is Data Matching so Important?

Data matching is not a new phenomenon. For as long as we’ve had digitized data, we’ve had to perform matching between data sets to get a complete overview of customers. Today, as data analytics becomes a key component of data-driven organizations, we need consolidated information on internal and external entities. Data matching facilitates data consolidation from disparate data sources and reduces: 

Unnecessary costs

Every time a piece of mail is sent, there is a $10 cost to it. Printing, distribution, and marketing all add to the cost of every flyer that is sent to an address where the customer no longer lives, or to every member of a family even if it doesn’t relate to them. Poor marketing is expensive, and misdirected direct mail is an unnecessary cost that is often the result of poor data consolidation practices.

Various sources of customer information

Data matching can help businesses reduce the cost of customer information by consolidating various sources of information into a single repository. By matching data from different sources, businesses can eliminate the need to maintain multiple copies of customer data, reducing the amount of storage space and processing power required. 

Client Service Issues

Data matching can help a customer service representative match a customer’s contact information with their support history to better understand the customer’s needs. They won’t have to constantly ask the customer to reiterate information (a problem most commonly experienced by banking or finance app customers). 

Identity Resolution

Data matching is the backbone of an identity resolution process, where data about an individual from multiple sources is collected to create a unified record. Identity resolution benefits the organization in terms of targeted marketing, profiling for strategic growth, and understanding the customer journey.

Entity Resolution

Where identity resolution is about individuals, entity resolution creates a view of the different relationships between specific groups of people (or entities) and consolidates data from different reference points to create a single view of the entity. 

Benefits of Data Matching

When done right, here are the top benefits of data matching that firms can experience.

1. Fraud detection and prevention

Financial institutions are under immense pressure to deal with increasingly complex fraudulent activities. From scams to fake identities, from money laundering to punitive regulatory compliances, firms need access to accurate, reliable, consolidated records. Data matching is used to compare the firm’s records against criminal and sanctions databases to identify details about the individual.  

2. Identity verification

Identity resolution and verification rely on accurate data matching abilities. Government and law-enforcement departments benefit greatly from data matching as they can match records at an attribute and reference data level (such as registrations, driver’s licenses, social security numbers, etc.) across multiple databases to get an overall picture of the individual. Based on the match and analysis result, risk scores are calculated to identify whether the individual is a real person, a threat, or a synthetic identity.

3. Sanctions & GDPR Compliance

In 2019, sanctions violations resulted in fines of $10bn for non-compliance with AML, KYC, and sanctions regulations. Similarly, there have been over €359 million in major GDPR fines so far. Sanctions screening is a critical activity for businesses that are involved in cross-border trading. They are required to match their data against the data given in sanctions lists to ensure they are not accidentally engaging in trade with listed entities or individuals.

4. Better Public Programs

A CBPP report in the United States revealed that over 40% of eligible individuals missed out on a public nutritional program for women, infants, and children because of enrollment gaps that prevented them from actually getting the benefits. Data matching allowed four states to find enrollment gaps and to identify individuals for targeted outreach. Additionally, data matching reduced the applicants’ documentation burden and simplified the certification process for eligible individuals. Data matching has repeatedly been demonstrated as a necessary and beneficial technology to uplift, improve, and deliver effective public programs.

5. Targeted Campaigns

According to Salesforce, 70% of CRM data becomes obsolete while around 30% of records are duplicated. Yet companies will still send out emails, direct mail, and flyers to all existing database customers, only to be at the receiving end of the customer’s wrath. Data matching becomes a necessary activity.

Companies need to regularly deduplicate records and ensure they are complete, updated, and accurate whenever they want to run a campaign. Most importantly, data matching helps with identifying household data (members of a family using the same service), which in turn enables marketing teams to build targeted lists. For example, an insurance company can cut down on the costs of direct mailing by up to 80% if they send out just one flyer to one household instead of five flyers to five members of the same household!

6. Improved Customer Service

Data matching can be used to consolidate scattered customer records giving companies a 360 view of their customer journey. This in turn helps marketing and customer service departments improve customer service. For example, an airline can use data matching to identify customers’ preferences and provide them with optional stays at exclusive hotels or BnBs.  

7. Improved Customer Retention

For example, a gym can use data matching to offer exclusive discounts to members of the same family. Similarly, an insurance company can use data matching to offer specific health insurance plans for people with newborns or people with teenagers. This ability to predict customer needs and offer a relevant service or product greatly boosts customer retention.

8.  Increased Organizational Efficiency

When teams have access to accurate and reliable data, they can make decisions faster and better. Companies that have invested in MDM and entity resolution processes have reported higher efficiency of up to 80%! 

9. Remove Duplicates & Improve Data Quality

One of the biggest benefits of data matching is deduplication – the process of removing duplicates within a data source. Duplicate data is a critical data quality concern that most firms are ill-equipped to handle because of informal data matching processes that result in higher false negatives and positives.

10. Drive Business Growth

Efficient data matching is the backbone of entity resolution which boosts growth factors like complete customer views, targeted marketing, better product and services and so on. If businesses want to compete in today’s world, they need reliable data systems and processes.

Most firms we’ve worked with are well aware of the benefits of data matching but struggle with the implementation process. Our Clean and Match solution is designed to take the struggle away and help you and your team breeze through a complicated process with a point-and-click, user-friendly, no-code solution that ensures you meet your business goals fast.

Three Major Data Matching Challenges

Data matching can be done manually, and most companies do employ expensive data analysts to use Excel for data matching, but as datasets become larger and more complex, it is practically a nightmare to manually match and clean data. Ask any of your data engineers and they would likely call data matching one of the most mundane and toughest parts of the job.

Some of the key data matching challenges that arise when you try to do it manually include:

 

Variability in data: The fancy term for this is ‘data heterogeneity,’ which describes variability in data caused by factors such as differences in the data’s source, structure, or format. For example, a customer database may include customer name, address, and contact information, while a product database may include product name, SKU, and price. To match these two databases, the data must be standardized so that the same type of information is included in each record. This can be a time-consuming process and often requires human intervention.

 

Preventing data matching errors: False positives and false negatives are the two types of errors that make data matching a challenging process. A false positive occurs when a record is erroneously matched to a master record it does not belong to. A false negative occurs when records are not matched even though they refer to the same entity.
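When a manually verified set of true matches is available, false positives and false negatives can be counted and summarized as precision and recall. These two measures are standard in record linkage evaluation but are not named in this article, so treat the sketch below as supplementary. The record ID pairs are hypothetical.

```python
def match_quality(predicted: set, actual: set) -> dict:
    """Count error types and derive precision and recall.

    predicted: record pairs the matching process declared as matches
    actual:    record pairs known (for example, from manual review) to match
    """
    true_pos = len(predicted & actual)
    false_pos = len(predicted - actual)   # matched, but should not be
    false_neg = len(actual - predicted)   # should match, but were missed
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "precision": precision,
        "recall": recall,
    }


predicted = {(100, 200), (101, 201), (102, 205)}
actual = {(100, 200), (101, 201), (102, 202)}
print(match_quality(predicted, actual))
```

In this example one predicted pair is wrong and one true pair is missed, so both precision and recall come out at two thirds.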

 

Privacy and the lack of data: For many data matching projects, privacy laws make it difficult to obtain or share the data that is needed. There are numerous studies on how to solve data matching problems under privacy restrictions. One of the most popular approaches uses one-way hash encoding functions to convert a string value into a hash code (for example, ‘peter’ into ‘51dc3dc01ea0’), such that having access to only the hash code makes it nearly impossible with current computing technology to learn the original string value. Other approaches include secure multi-party computation (SMC), phonetic encoding, Bloom filters, and many more.
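The one-way hash encoding idea can be illustrated with Python’s standard `hmac` and `hashlib` modules. This is a minimal sketch under stated assumptions: the keyed SHA-256 construction and the `SECRET_KEY` value are choices made for the example, and the digest it produces will not equal the illustrative ‘51dc3dc01ea0’ value quoted above.

```python
import hashlib
import hmac

# Hypothetical secret shared by the parties exchanging encoded values.
# In practice it would be stored securely, never hard-coded.
SECRET_KEY = b"example-shared-secret"


def encode_value(value: str, key: bytes = SECRET_KEY) -> str:
    """Normalize a string and return its keyed one-way hash as hex."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


# Identical values (after normalization) produce identical hash codes,
# so two parties can match on the codes without revealing the names.
print(encode_value("Peter") == encode_value(" peter "))  # True
print(encode_value("peter") == encode_value("pete"))     # False
```

A limitation of this approach is that it supports exact matching only: a single typo yields a completely different hash code, which is why techniques such as Bloom filters are used when approximate matching is required.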

Data Matching Use Cases

Data matching is commonly used in financial services, healthcare, and retail. In financial services, data matching is used to identify duplicate accounts, fraud, and other money-related crimes. In healthcare, data matching is used to identify patients who have been seen by multiple doctors or received multiple prescriptions from different pharmacies. In retail, data matching is used to identify customers who have purchased the same product from multiple stores.

Financial Services

The goal with data matching in the financial service industry is to identify potential matches between customers and banks in order to assess risk and prevent fraud.

One example of data matching in financial services is when a bank reviews an application for a new credit card. The bank will compare the customer’s information (name, address, etc.) with its own records to see if the customer already has an account with the bank. If they do, the bank will likely not approve the new credit card application, as it would be considered high risk.

 

Another example is when a bank detects fraudulent activity on one of its accounts. The bank will compare the account information (account number, name on account, etc.) with its own records as well as with records from other banks. This helps the bank to identify any potential patterns in fraudulent activity and better protect its customers.

 

Data matching is also used for compliance purposes in financial services. For example, banks are required to keep track of all of their customers’ transactions over a certain period of time. This information is then compared against regulatory guidelines to ensure that the bank is in compliance.

 

The benefits of data matching for financial services include improved risk management, decreased fraud losses, and increased compliance efficiency. Data matching helps to identify potential risks before they become problems, which can save the bank money in terms of losses and fines. It also helps to streamline compliance processes by automating the comparison of data against regulatory guidelines.

Government and Public Sector

Let’s consider the example of a city government that wants to improve its decision making around budgeting. The city government has two datasets: one that contains information about the current budget and one that contains information about past budgets. The city government wants to use data matching to identify discrepancies between the two datasets.

By identifying these errors, the city government can identify accounts that have been tampered with or where numbers didn’t add up. Additionally, the matching process may also help with different information about one entity (multiple versions of a name).

Education Industry

In the education sector, data matching can be used to match data from different sources, such as student information systems and alumni databases for multiple purposes – from monitoring student progress to tracking benefits, to ensuring performance benchmarks, to providing mental health support on time.

In some cases data matching can help with identifying outdated data. For example, if a student changes their name or moves to a new school, the data in different systems may not be updated with the latest information.

 

Healthcare Sector

 

Data matching is a pivotal function in the healthcare sector, where patient records such as medical history, treatment history, and lab results are matched and consolidated to provide specialists with an overview of a patient’s health and the treatment they require. It is not uncommon for patients to suffer lethal and oftentimes tragic consequences of mismatched IDs (such as receiving the wrong medication or a misdiagnosis).

 

The use case below describes how healthcare sector organizations can use data matching to improve patient safety and care.

A patient is admitted to the hospital with a suspected infection. The hospital staff enter the patient’s information into the hospital’s electronic health record (EHR) system. The EHR system automatically compares the patient’s information against a database of patients with known infections. If the EHR system detects a potential match, the system notifies the hospital staff so that they can take appropriate action.

 

This use case illustrates how data matching can be used to improve patient safety and care in the healthcare sector. By automatically comparing patients’ information against a database of patients with known infections, hospitals can identify potential matches and take appropriate action. This helps to ensure that patients receive the correct treatment and avoid potential complications.

 

Marketing and Sales

In the marketing and sales domain, data matching is used to perform a plethora of activities – from identifying potential customers, to finding common interests among customers, to running targeted marketing campaigns and much more. In fact, of all the industries, marketing and sales benefits the most from data matching especially since it deals directly with customer data.

Take, for example, a company that wants to improve its marketing and sales efforts by matching customer data with potential new customers. It has a database of current customers and would like to find potential customers based on the attributes of the current ones. The first step is to gather and assess data such as customer names, addresses, and phone numbers, along with purchase history, reviews, and other relevant information. Once this data is gathered, it is cleaned and standardized, and a matching algorithm is configured to compare the data from both databases. Matching results provide insights into customers in certain areas who are more receptive to marketing offers. Based on this information, the company can run targeted marketing campaigns or find potential new customers. The data could also be used to improve sales efforts by identifying potential leads for sales representatives.

Data Matching is a critical function that demands Best-in-Class technology.

As data structures become more complicated, it requires data matching technologies that can keep up. Whether it’s an enterprise project like matching multi-million records for a merger, or a small departmental project like combining marketing & sales data for insights, primitive data matching techniques such as exact comparisons via Excel formulas or manually coded algorithms are no longer enough or effective. Organizations need best-in-class data matching solutions that offer data preparation, data cleaning, and data transformation as part of the process. Moreover, an intelligent data matching technology must have multiple algorithms at work to tackle data nuances and reduce errors.

Frequently Asked Questions

What is data matching used for?

Identifying duplicate records, verifying the accuracy of data, and consolidating data. 

What is matching in data quality?

Identifying and correcting inconsistencies between data sets. 

What is an example of data matching?

Comparing two records to identify if they are duplicates. 

What are the types of matching?

Three common types of matching are fuzzy, exact, and numeric, though many others exist.

What are data matching issues?

Incorrect or incomplete data, mismatches in data formats, and differences in coding schemes are common data matching issues. 

 


Farah Kim


Farah Kim is a human-centric product marketer and specializes in simplifying complex information into actionable insights for the WinPure audience. She holds a BS degree in Computer Science, followed by two post-grad degrees specializing in Linguistics and Media Communications. She works with the WinPure team to create awareness on a no-code solution for solving complex tasks like data matching, data deduplication, and MDM.
