Data Matching MS 03

Data match technology is the backbone of most data-dependent goals. Whether you’re looking to combine records from multiple sources or to remove duplicate records, whether you want to do a simple task like matching names in list A with names in list B or a complex task like unifying multiple versions of a customer record, you need a robust data match technology.

However, many teams and organizations still spend a significant amount of time manually cleaning and matching data using spreadsheet formulas like Excel’s VLookUps, which while effective for basic data-matching activities is time-consuming and counter-intuitive for complex projects. It gets even more complex when data analysts and developers have to build Python scripts to do the bare minimum. Scaling on large projects with these traditional methods can take months of effort – and perhaps millions in wasted resources and expenses.

It’s about time we replace manual data-matching processes with automated solutions that come with advanced data-match algorithms, making it easier for both tech and business users to get the most out of their data in the shortest amount of time possible.

How? Here’s a complete guide on data matching that highlights the basics of the process and how an automated solution like WinPure can solve a significant industry challenge with built-in data match algorithms.

Let’s roll.

WHAT IS DATA MATCHING AND WHY DOES IT MATTER?

Simply put, data matching is the process of comparing and linking data from different sources to identify and establish relationships between records. This could involve combining customer data from various databases to get insights or merging duplicate records to create a unified customer view.

Think of data match as a function that attempts to answer questions like:

👉 Is John Smith the same person as Jon Smiths? (identity resolution) 

👉 Is the name spelled as Mary Jones or Marie Jones? (typos)

👉 Do we have more than one record of Mary Jones across different data sets? (duplicate data)

👉 How many entries in the database point to Mary Jones? (record linkage)

A robust data match process enables users to connect, combine, and view relationships between records to answer these questions. For example, with the help of a fuzzy data match software, a user can identify all the different names and records of Mary Jones and Jon Smith and unify them under one block. Without data matching, it would be impossible to find an accurate answer to these questions – at least not without manually reviewing every row of a dataset!

In the next section, we’ll briefly go over how data matching works. If you’re a developer, you can skip this section and move on to the fourth section where we show you how to use a data match solution to find duplicates or merge records within minutes.

USING DATA MATCHING ALGORITHMS: FUZZY, EXACT, AND NUMERIC

Typically, data match on complex data sets is done by Python scripts wherein, a developer uses available Python libraries to code a custom fuzzy match script for record linkage. However, this is a time-consuming process and one that demands a specific skill set that is not easily available or affordable.  That’s where an automated solution comes in handy. Data matching software in the market, make use of three types of data match algorithm: fuzzy, exact, and numeric. Some others, like WinPure, also have proprietary algorithms built in to treat more advanced data issues such as non-phonetic matches (Sandler vs Sander), transposition errors (Mary vs Marie), and so on.

A quick overview:

a. Fuzzy Matching

Fuzzy matching allows for easy matching of semi-structured data and records that do not have exact matching attributes. Text strings like names and addresses use fuzzy techniques such as Soundex for same-sounding names, or Levenshtein Edit Distance for differences in spellings.

For example, the edit distance between the strings Catherine and Katherine is “1” because only one edit operation, the substitution of C for K is necessary to transform Catherine into Katherine.

The main problem with fuzzy data matching is that it can sometimes mistakenly identify things as matches (false positives) or miss real matches (false negatives). This happens because data can be similar or unclear, making it harder to match things accurately.

Therefore, careful consideration and validation are necessary when employing fuzzy data matching to ensure the reliability and accuracy of the results.

b. Exact Matching

In this technique, you want results that show exact matches. Unlike fuzzy matching, exact matching doesn’t take into account similarity, instead, it looks for cells with the exact characters.

For example, to match zip codes or postal codes between your database and the USPS database, use exact matching to identify duplicates. 

However, a problematic limitation of exact matching is its inability to handle data inconsistencies or variations. Since exact matching relies on strict criteria of identical values, even minor differences or errors can lead to missed matches. For example, a typographical error, a slight variation in formatting, or the use of abbreviations can result in failed matches, comprising the overall quality of a database.

c. Numeric Matching

Numeric matching deals only with numbers. It’s great for matching phone numbers or postal codes that contain only numbers.

Similar to exact matching, numeric data matching has precision issues. It relies heavily on the accuracy and consistency of numeric values. However, when dealing with large datasets or complex calculations, rounding errors or inconsistencies in decimal places can occur. These small discrepancies can lead to mismatches or inaccurate results.

Apart from the above, other data match algorithms include:

  • Soundex: A phonetic algorithm that encodes names and words into a four-character code based on their pronunciation. It is used for matching similar-sounding names.
  • Jaccard Index: measures the similarity between two sets by calculating the size of their intersection divided by the size of their union. It is used in text analysis and set matching.
  • Token-Based Matching: involves breaking text into tokens (e.g., words or n-grams) and comparing these tokens for similarity. It’s often used in text and string matching.
  • N-gram Matching: N-gram matching involves breaking text into overlapping sequences of N characters or words. It is used to find similarities in text data.
  • Smith-Waterman Algorithm: This algorithm is used for local sequence alignment of strings. It finds the optimal local alignment between two strings, taking into account gaps and substitutions.

peterchristenIf you’d like to get more details on data match algorithms, we recommend reading Peter Christen’s authoritative book on Data Matching: Concepts and Techniques.

The book gives a very easy-to-understand overview on:

  • The complete data-matching process including blocking & indexing techniques
  • Detailed step-by-step on how to clean and deduplicate data
  • Strategies for record linkage and entity resolution
  • Specialized topics like privacy and real-time matching

Enjoy the read!

 

WHAT IS THE DATA MATCHING PROCESS?

Understanding the basic process of data matching can help you decide on the type of results you want from a match exercise, and what kind of tool, or approach you would want to use to get the desired result.

As a basic overview, here’s a common data match process that most businesses use:

✅ Define the scope of the data matching project:

Like with most data-driven projects, you must first identify what you want from the data. Do you want to simply identify and remove duplicates in a customer database? Or want to gain valuable insights for a marketing campaign?

For example:

To identify your top 100 loyal customers over the past five years, you would match your customer database with your sales database to extract the information. You require names, addresses, email addresses, and phone numbers from both databases to match the data.

✅  Prepare the data with data cleaning activities:

Unless you’ve had a dedicated resource to keep your organizational data clean, chances are your data is dirty, messy, and has inconsistencies.

For example:

To match customer data, you must begin by standardizing contact names, removing odd characters from data fields, and ensuring data formats (such as naming a city as New York City instead of NYC) are uniform. Optimizing for uniformity and consistency improves match result outcomes and prevents false positives and negatives.

✅ Select a matching algorithm

As discussed above, there are a variety of data-matching algorithms available, each with its own strengths and weaknesses. The type of algorithm to use depends on the match goal.

For example:

To match first and last names, you can use a fuzzy match, and once you’ve resolved duplicate contacts. To identify duplicates by phone numbers, an exact match will be a better option as it will count exact characters.

✅ Review the match results

A person who knows the context of the data must review the match results to prevent false negatives and positives from affecting the interpretation of the match.

For example:

The system might flag two customer entries, ‘John Smith’ and ‘John S. Smith,‘ as duplicates because of similar names. However, a person with contextual knowledge would recognize that these are different individuals and should not be merged as duplicates, thereby, preserving the accuracy of the database.”

✅ Merge, Purge, or Set Master Records

This is the final stage of the data match process. Once you have the desired results, you can decide to merge two similar entries of one entity into a single record – for example, John Smith may have a work address and a home address that you would want to merge into a single record.

For Example: 

Name Age Email Phone Address
John Smith 35 john.smith@email.com 123-456-7890, 987-654-3210 123 Main St, Apt 4B

When it’s all done and classified as matches or non-matches, you can select the final records and export them as a master record!

With an automated data match solution, you could save up to 20 hours a week (a rough estimate we’ve collected from working closely with customers), with an automated solution as compared to using manual methods.

In the next section, we cover a step-by-step breakdown of how you can match data using an automated solution like WinPure and remove duplicates or merge data within minutes.

HOW TO USE WINPURE’S NO-CODE DATA MATCH TOOL

WinPure is a true no-code solution that lets you clean, transform, and match your data to achieve business goals. With a plug-and-play interface, and the ability to create a custom library, WinPure is a solution that saves time, improves efficiency – and most importantly – ensures accuracy of match results.

Watch a video of how our solution specialist uses the WinPure software to resolve for duplicates within minutes!

Here’s a quick breakdown of how to use WinPure to match data.

  • Integrate data sources from multiple data sets & file formats: Unlike a few decades ago, you no longer need to manually transform data to run a comparison. With easy integration functions, you can connect a CSV file or a MySQL file to the interface and begin a match process.
  • Advanced data cleaning functions: In the image given below, you will see how the tool profiles the data for inconsistencies and errors. So if you’re in the marketing department, you can see straight away you’ve got empty email addresses and fields with punctuations and characters that add “noise” to the data.
  • Advanced cleaning with custom regex expressions: Sometimes you’ve got complex string data such as email IDs that contain numbers and text such as [winpure123@winpure.com]. You can match these strings using advanced regex expressions built into the tool or create your library of expressions for future reference,
  • Standardizing and cleaning data by splitting the data: When you have multiple data sets, such as from market, product management, or sales, you can end up with inconsistencies in standards. For example, someone can write the data structure as dd//mm//yyyy and someone can write it as dd/mm/yy. 

This supposedly small discrepancy can affect the quality of match results and lead to a higher chance of false positives.

You can resolve these issues on the WinPure platform by splitting the data and choosing options like Propercase, Uppercase, and many more options to resolve standardization problems.

  • Building your own word library: Have specific words and abbreviations that you want to consider during the match process? WinPure lets you build a custom word library using Word Manager that prevents the system from flagging unnecessary matches. For example, you prefer Limited over LTD, or Ltd. 

data matching table image8

  • Matching within and across data sets: From the columns you’ve cleaned before, you can now match within data sets (such as matching data of Table A, then Table B). Once done, you can then match data across the tables (A x B) to weed out duplicates.
  • What to match? When choosing what to match, use:

Relevance: Choose attributes that are essential for identifying duplicates or similarities.
Data Quality: Prioritize attributes with accurate and consistent data.
Specificity: Opt for attributes that offer distinct and reliable matching criteria.

  • How to match data? It varies among users. You can choose 90% fuzzy matching for similar records, exact matching for identical values, or numeric matching for phone numbers and postal codes. Exact matching works well for well-processed data.

fuzzy matching winpure image 9

  • Assessing the match or creating master records: Once the match result is assessed, you can then decide to merge the records or save a new set of records as a master set.

data matching results image 10

…. And there you go! You now have a clean record, fit for business use!

According to feedback and reviews from our customers, WinPure’s no-code data matching has saved them considerable time and effort in cleaning and setting up master records.

BUSINESS BENEFITS OF DATA MATCHING

A few decades ago, data matching was simply a logical model used by database managers to match basic data sets. But today, as no-code data match solutions are on the rise, they have also empowered business users – and – businesses to achieve goals that go beyond database management. In fact, with the onset of AI/ML based applications, data matching has become a prominent technology that fuels data-driven goals like:

 Entity resolution: determining and linking different data entries that refer to the same real-world entity.

 Identity resolution: verifying and matching multiple attributes or identifiers to establish the true identity of an individual.

 Record linkage: linking information about one individual spread over multiple systems (such as a government benefits database)

 GDPR/sanctionsmatching a company’s database with government databases to ensure sanctions and privacy law compliance.

 Customer360 viewenabling teams to get a consolidated view of their customer data across systems.

These benefits demonstrate that data match technology is beyond an IT consideration. Instead, it shapes business decisions, which are implemented by business users. Therefore, it is essential for business users to actively participate in data match projects so that they can contribute to the effective implementation of a data-driven business strategy.

TO CONCLUDE: DATA MATCHING IS A KEY PROCESS TO BETTER DATA QUALITY

In the current business landscape, companies are drowning in data, yet resources are limited. Not every business can afford to hire a data analyst to address the challenges of cleaning, merging, and purging large datasets, nor can every business invest in a high-cost platform. However, neglecting these issues can disrupt the accuracy of their insights.

An automated data-matching solution offers a clear path out of this dilemma. It empowers both business and tech users to collaborate seamlessly, bridging potential gaps in data understanding and minimizing conflicts.

If you’d like to test WinPure’s data matching, please download the free trial form below to get started.

Written by Farah Kim

Farah Kim is a human-centric product marketer and specializes in simplifying complex information into actionable insights for the WinPure audience. She holds a BS degree in Computer Science, followed by two post-grad degrees specializing in Linguistics and Media Communications. She works with the WinPure team to create awareness on a no-code solution for solving complex tasks like data matching, entity resolution and Master Data Management.

Share this Post

Download the 30-Day Free Trial

and improve your data quality with no-code:

  • Data Profiling
  • Data Cleansing & Standardization
  • Data Matching
  • Data Deduplication
  • AI Entity Resolution
  • Address Verification

…. and much more!

"*" indicates required fields

Hidden
This field is for validation purposes and should be left unchanged.

Index