Entity Resolution with Data Matching

Farah Kim • January 2023

In a database, “entity” is a term used to describe a real-world object, a person, or a thing. For example, a person named Kathryn is an entity. When an organization holds multiple references to Kathryn in different datasets, such as a CRM, an employee dataset, or even a third-party database, it’s imperative to connect these references and confirm that they all refer to the same Kathryn. Known as “resolution,” this process identifies and merges records so that they represent the real-world entity in question.

The function that makes entity resolution possible? Data matching.

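This quick guide on entity resolution walks you through the key components of an entity, the basic goals and activities of entity resolution, the data matching process, the key challenges to address, and a recommended data matching strategy.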

Key Components of an Entity

An entity has three key components:

 

Identity Attributes: Characteristics such as a person’s name, age, address, or social security number that distinguish them from other entities. For product data, identity attributes include model numbers, manufacturer UPC codes, etc. Some of these attributes, such as a social security number, are unique to the individual. In entity resolution, the biggest challenge is often the lack of access to unique identity attributes due to data privacy restrictions. In such instances, analysts have to rely on non-unique identity attributes such as age, address, or phone number to resolve references to an entity.

 

Values: Information associated with the entity, such as a product’s price, color, or dimensions. Values help group entities into specific categories, for example all red phones or all residents living in a certain block.

 

References: A reference may seem like an attribute. However, where an attribute is a single field, a reference is a collection of attributes used to refer to an entity. In the context of entity resolution, a reference can include things like an email address, phone number, or physical address, all of which help identify an individual entity.
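
To make the three components concrete, here is a minimal Python sketch of how a reference and a resolved entity might be modeled. The class names, field names, and source names ("crm", "hr") are hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Reference:
    """One record in one source system that refers to an entity."""
    source: str                                  # hypothetical source name, e.g. "crm"
    identity: dict                               # identity attributes: name, address, phone
    values: dict = field(default_factory=dict)   # descriptive values: price, color, ...


@dataclass
class Entity:
    """A resolved real-world entity plus every reference that points to it."""
    entity_id: str
    references: list = field(default_factory=list)

    def sources(self):
        """Return the distinct source systems that mention this entity."""
        return sorted({ref.source for ref in self.references})


kathryn = Entity(
    entity_id="E1",
    references=[
        Reference("crm", {"name": "Kathryn Doe", "phone": "212-555-0100"}),
        Reference("hr", {"name": "Kathryn A. Doe", "address": "1 Main St"}),
    ],
)
```

Here `kathryn.sources()` lists the two systems whose records were resolved to the same person.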

 

By understanding these core components of an entity, you can better understand how data is stored and retrieved in databases and computer systems.

Basic Goals of Entity Resolution

The term “entity resolution” was coined around 2004 in articles published by the Stanford InfoLab and was mainly derived from the merge-purge process. In this process, if two references point to the same individual, for example by sharing the same name, address, or telephone number, they are either merged to form a complete reference or one of them is removed to avoid duplication. From here stems the need to collect all references and resolve them into a single dataset, hence the name ‘entity resolution.’

 

The process has two basic goals:

 

1. Reduce (that is, purge) the number of similar or equivalent references (duplicates) of an entity and create a single dataset containing the best, most accurate master record.

 

2. Link multiple references without purging them. This is also called record linkage and is usually performed when one database wants to link to the records of the entity in another database. For example, if a business has customer records stored in one database and order records stored in another, record linkage could be used to create a single master file containing all customer contact information as well as their order history. This would be accomplished by linking the two databases via a shared reference identifier such as an email address.
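
The customer and order example can be sketched in a few lines of Python. This is a minimal illustration, assuming both sources carry an email field; the sample data and the function names `link_key` and `link_records` are made up for the example.

```python
from collections import defaultdict

# Hypothetical sample data standing in for two separate databases.
customers = [
    {"email": "Kathryn@Example.com ", "name": "Kathryn Doe"},
    {"email": "sam@example.com", "name": "Sam Lee"},
]
orders = [
    {"order_id": 101, "email": "kathryn@example.com", "total": 40.0},
    {"order_id": 102, "email": "kathryn@example.com", "total": 15.5},
    {"order_id": 103, "email": "unknown@example.com", "total": 9.0},
]


def link_key(email):
    """Normalize the shared identifier so trivial differences do not block a link."""
    return email.strip().lower()


def link_records(customers, orders):
    """Attach each customer's orders without deleting or merging any record."""
    orders_by_key = defaultdict(list)
    for order in orders:
        orders_by_key[link_key(order["email"])].append(order)

    linked = []
    for customer in customers:
        key = link_key(customer["email"])
        linked.append({**customer, "orders": orders_by_key.get(key, [])})
    return linked


linked = link_records(customers, orders)
```

Nothing is purged here: both source lists stay intact, and the linked view simply shows which orders belong to which customer.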

 

Whether records are merged, purged, or linked, the eventual goal is to create a comprehensive, 360-degree view of the entity. Regardless of the type of data you’re handling, be it product data or individual data, it’s imperative to have records that give you complete, accurate, up-to-date, and reliable information – thus also fulfilling data quality requirements.

Five Basic Activities of Entity Resolution

Entity resolution encompasses a broad range of activities that go beyond just merge/purge or record linking. Five of the most common activities include:

 

Entity reference extraction: The process of locating and collecting entity references from unstructured information.

 

For example, a company may want to review large segments of data such as order numbers, refund history, or billing data to understand purchase behaviors during a recession. Companies may also use natural language processing techniques to extract key information, such as a contact’s name or address, from a document.

 

Entity reference preparation: Applying profiling, standardization, data matching, data cleansing, and other data quality techniques to data fields before resolving references.

 

For example, a company may need to prepare customer records by ensuring that all first names have been entered in the correct format before resolving any conflicts between two different customers with the same name. This often requires standardizing formatting conventions across all sources where customer data is collected.

 

Entity reference resolution using data matching: The process of deciding whether two references point to the same entity or to different entities. This activity often uses fuzzy matching techniques that compare attributes such as names or addresses at various levels of granularity (e.g., exact matches on every word or partial matches based on similarity scores) to determine whether two records refer to one person or business. Companies can also use measures such as Levenshtein distance and the Jaro-Winkler score when attempting to match records.

 

Entity identity management: Building and maintaining a persistent record of entity identity information over time. This can include collecting additional data points about a customer through surveys and questionnaires, as well as tracking changes in their contact information over time (e.g., new email addresses). Companies may also need to manage duplicate data and identities for customers who have multiple accounts.

 

Entity relationship analysis: Exploring the network of associations between different but related entities. This involves looking at how entities interact with each other in order to better understand the relationships between them – for instance, understanding user behavior by analyzing connections between customers who interacted with certain products or services, or uncovering market trends by recognizing purchase behaviors across different customer segments.
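
To illustrate the similarity scoring mentioned under entity reference resolution, here is a small, dependency-free Python sketch of Levenshtein distance, turned into a 0-to-1 similarity score. The `similarity` helper and its normalization by the longer string's length are one common convention assumed for this example, not a standard that every tool follows.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the inner row
    previous = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance into a score where 1.0 means identical strings."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, "Jon Smith" and "John Smith" differ by a single inserted letter, so their distance is 1 and their similarity under this convention is 0.9.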

 

Entity resolution is often required in organizations dealing with big data, where the entity’s relationship with the organization, their journey, and their interaction points are all collected and consolidated to get a comprehensive overview. As such, an entity resolution process requires a powerful data matching solution that can handle multiple data points and match millions of rows of data without the need for extensive coding.

What Is the Data Matching Process of Entity Resolution?

The matching process for entity resolution is a complex undertaking that involves identifying and comparing entities, often across different data sources. Here’s a basic overview of the key steps involved:

 

1. Defining your entity resolution problem: The first step is defining the goal you want to achieve or the problem you want to resolve with entity resolution. This means deciding whether to merge/purge duplicates and create master records, or simply to link existing master records to fulfill an organizational purpose. You will also need to determine which attributes should be used for matching and how those attributes should be collected or extracted from the data sources.

 

2. Pre-processing: Once you’ve identified the data sources to use, you will need to pre-process the data before it can be used for matching. For example, this means ensuring that values are consistent across different sources and reformatting data types (e.g., taking a date value expressed as “6/5/20” and converting it into a standardized format like “2020-06-05”). Additionally, this step may involve filling in missing values that can help with the matching process by providing additional contextual information.

 

3. Feature extraction: In this step, you decide which attributes are best suited for matching records together. This step can involve techniques such as tokenization (breaking text into words or other tokens), n-grams (overlapping sequences of n characters), phonetic encoding (replacing words with codes so that similar-sounding words share a code), and fuzzy logic (allowing for inexact matches). After feature extraction has been performed on each record/entity, different similarity metrics can be used to determine how closely two records match based on their features. If you’re using a data matching solution, these techniques come built in.

 

4. Blocking: Because datasets are large, you cannot realistically compare every record against every other record. You’ll need to use blocking techniques to reduce the number of comparison operations performed. Blocking categorizes records into groups (or blocks) based on a shared key such as ZIP code, then compares only the records within the same group. This approach can greatly reduce the time and processing power needed for comparison, as it eliminates unnecessary comparisons between unrelated data.

 

5. Clustering: This technique groups records by similarity. This could include grouping records that have similar values for a certain attribute (such as age) or records that have the same combination of values across multiple attributes (such as gender and race).

 

6. Handling false negatives and positives: Matching results will very likely contain some false negatives and false positives, so they need to be reviewed before master records are created.

 

7. Creating master records: You must create master records only after all known contingencies have been resolved. This could include verifying customer information, resolving false negatives/positives, or validating the data. Master records are the final, accurate, reliable copy of the data, so they must meet the information quality criteria of reliability, accuracy, timeliness, uniqueness, etc. to qualify as master records.
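
Steps 2, 4, and 5 above can be sketched with nothing but the Python standard library. This is a minimal illustration under stated assumptions: dates arrive either in ISO form or as US-style month/day/two-digit-year, ZIP code is the blocking key, and clustering means grouping records connected by matched pairs. The sample records and the function names `normalize_date`, `candidate_pairs`, and `cluster` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from itertools import combinations

# Hypothetical sample records; the "zip" and "dob" field names are assumptions.
records = [
    {"id": 1, "name": "John Smith", "zip": "10001", "dob": "6/5/20"},
    {"id": 2, "name": "Jon Smith", "zip": "10001", "dob": "2020-06-05"},
    {"id": 3, "name": "Kathryn Doe", "zip": "94105", "dob": "1/2/85"},
    {"id": 4, "name": "Jane Roe", "zip": "10001 ", "dob": "3/4/90"},
]


def normalize_date(value):
    """Step 2: return an ISO date string, or None if the format is unknown.
    Assumes slash-separated dates are month/day/two-digit-year."""
    value = value.strip()
    for fmt in ("%Y-%m-%d", "%m/%d/%y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def candidate_pairs(records, block_key="zip"):
    """Step 4: only pair up records that share the same blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[record[block_key].strip()].append(record["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


def cluster(ids, matched_pairs):
    """Step 5: group ids connected by matched pairs (simple union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in ids:
        groups[find(i)].append(i)
    return sorted(sorted(group) for group in groups.values())
```

With the four sample records, blocking on ZIP code yields three candidate pairs instead of the six that an all-pairs comparison would need. Which of those pairs actually count as matches is decided by the comparison step, and the matched pairs are what `cluster` takes as input.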

 

A false negative occurs when two records should be matched, but the algorithm fails to match them correctly due to a mismatch in attribute values. For example, if two records have different names (e.g. “John Smith” and “Jonathan Smith”), an algorithm may not be able to match them even though they are the same person.

 

A false positive occurs when two records are matched even though they refer to different entities. For example, if two records have names that are very similar (e.g. “John Smith” and “Jon Smith”), an algorithm may incorrectly match them even though they are not the same person.

 

You will need to keep a separate copy of the records that need human intervention to resolve false positives and negatives. Once these records are cleared, they can then be transformed into master records.
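
One common way to set aside records for human review is to use two thresholds instead of one. The sketch below assumes that approach; the threshold values are illustrative guesses, not recommendations, and Python's built-in `difflib.SequenceMatcher` is used as a stand-in scorer in place of a purpose-built matching algorithm.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.90    # assumed cut-off: at or above this, accept the match
REVIEW_THRESHOLD = 0.70   # assumed cut-off: between the two, ask a human


def name_score(a, b):
    """Similarity between two names, from 0.0 (nothing shared) to 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs):
    """Sort candidate pairs into auto-matches, a review queue, and non-matches."""
    result = {"match": [], "review": [], "non_match": []}
    for left, right in pairs:
        score = name_score(left, right)
        if score >= MATCH_THRESHOLD:
            bucket = "match"
        elif score >= REVIEW_THRESHOLD:
            bucket = "review"
        else:
            bucket = "non_match"
        result[bucket].append((left, right, round(score, 2)))
    return result


pairs = [
    ("John Smith", "Jon Smith"),
    ("John Smith", "Jonathan Smith"),
    ("John Smith", "Kathryn Doe"),
]
buckets = triage(pairs)
```

With these settings, “John Smith” and “Jon Smith” are matched automatically, which is exactly the false positive risk described above if they are different people, while “John Smith” and “Jonathan Smith” land in the review queue rather than being missed outright. Raising the match threshold or comparing additional attributes moves more borderline pairs into review.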

 

Traditionally, the entity resolution process is performed by IT professionals who are fluent in programming languages like C/C++, JavaScript, Python and R. They write scripts to extract and prepare the data, then build and test matching algorithms (such as Levenshtein edit distance, n-gram, Soundex, and other phonetic and fuzzy algorithms) to match the data. Some professionals even use advanced Excel formulas to clean and match data.
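
As an illustration of the kind of script this involves, here is a hand-rolled Python sketch of two of the algorithms named above: character n-grams compared with Jaccard similarity, and American Soundex. The function names are made up for the example, and pairing n-grams with Jaccard is one common choice among several.

```python
def char_ngrams(text, n=2):
    """Set of overlapping n-character slices of the lowercased text."""
    text = text.lower().strip()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Shared items divided by all items across both sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# American Soundex digit for each consonant; vowels, h, w and y get no digit.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def soundex(name):
    """Four-character phonetic code: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = SOUNDEX_CODES.get(letter, "")
        if code and code != previous:
            encoded += code
        if letter not in "hw":        # h and w do not separate repeated codes
            previous = code
    return (encoded + "000")[:4]
```

Soundex gives “John” and “Jon” the same code, so a phonetic comparison treats them as equal, while a bigram comparison of the same two names scores only 0.25. This is one reason matching scripts tend to combine several algorithms rather than rely on a single one.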

 

The problem with this approach is that it is time-consuming, computationally expensive, and resource intensive. Moreover, match accuracy is often under 80%, which is a poor rate. For match results to be truly effective in entity resolution, they need an accuracy rate of 80% or higher.

 

Another challenge with the manual approach is the time it takes to collect, review, normalize, match, and create records. It takes weeks if not months, which probably explains why over 85% of businesses are unable to ever truly achieve their entity resolution goals.

 

Instead of relying solely on manual processes or Excel formulas, teams can incorporate data matching solutions like WinPure to make the process faster and easier.

 

You can cut the time spent on pre-processing and matching by half or more. For example, if it takes you one week to pre-process data, a data matching tool that supports data preparation can bring that down to about two days. The time saved allows team members to focus on critical analysis, resolving false negatives/positives, and ensuring your master records meet the data quality framework.

Key Challenges of Entity Resolution to Address

While an entity resolution process helps with connecting the dots of your data, it has its own set of challenges that businesses often find difficult to overcome.

 

Before you involve the team in entity resolution, do watch out for some key challenges:

 

Unstructured Data: Entity resolution often requires dealing with unstructured data, which can be complex and difficult to manage. This can include webpages, text documents, emails, images, and other content that may not have a consistent format or structure. Unstructured data can be time consuming to process and requires specialized techniques such as tokenization and natural language processing to turn it into structured data. Remember, you cannot perform entity resolution or data matching on unstructured data – it must always be converted to structured or semi-structured data.

Inaccurate Data: Missing fields, noise in the data, typos, etc. are all examples of inaccurate data. Leaving this data untreated can lead to incorrect resolutions of entities due to errors in the source material. This issue is particularly prevalent when dealing with user-generated content, such as website forms without data entry guardrails, or when manual data entry is performed by employees. Effective entity resolution in this case starts by addressing and resolving the root causes of inaccurate data.

Variations in data: Entity resolution requires data to be standardized. Variations and aliases in names or numbers, such as one person known by multiple different names (John or Johnny), make it difficult to match and resolve the data. Fuzzy matching techniques work best for handling variations in data.

Scalability: As datasets grow larger, additional techniques are often needed to keep processing times efficient while still maintaining the accuracy levels the organization requires. Most ER initiatives start out great but falter when it comes to handling larger amounts of data. When implementing an ER strategy, it is imperative to prioritize scalability.

Evolutionary Changes: Organizational structures, people’s roles within companies, and relationships between entities change constantly, which makes entity resolution more challenging because new information has to be accounted for when resolving the identifiers associated with entities across datasets. Entity resolution systems must therefore be able to adapt quickly and update their internal databases so that they stay current with changes in the data sources and maintain the accuracy levels organizations require.

 

As you can see, an entity resolution process is not just a merge/purge process. It is, for the most part, a business decision with outcomes that affect business revenue.

WinPure’s Recommended Data Matching Strategy for Entity Resolution

When clients approach us for entity resolution, we not only help them with the linking and matching itself, but also guide them on the best strategic approach for entity resolution.

 

We recommend:

 

Use a test dataset to first evaluate and assess the problems affecting your data. This includes reviewing errors such as incomplete fields, attributes with missing information (such as missing last names) and so on. This assessment can easily be done using WinPure’s data profiling feature for free.
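
For readers who want to see what such an assessment measures, here is a minimal Python sketch that counts missing values per field in a test dataset. It is a generic illustration, not a description of how WinPure’s profiling feature works; the sample records and the `profile` function are hypothetical.

```python
def profile(records, fields):
    """Report how many records are missing each field, and the fill rate."""
    total = len(records)
    report = {}
    for name in fields:
        missing = 0
        for record in records:
            value = record.get(name)
            # Treat absent keys, None, and blank strings as missing.
            if value is None or str(value).strip() == "":
                missing += 1
        fill_rate = (total - missing) / total if total else 0.0
        report[name] = {"missing": missing, "fill_rate": fill_rate}
    return report


# Hypothetical test dataset with incomplete last names.
sample = [
    {"first_name": "Kathryn", "last_name": "Doe"},
    {"first_name": "John", "last_name": None},
    {"first_name": "Sam", "last_name": "  "},
    {"first_name": "Jane", "last_name": "Roe"},
]
report = profile(sample, ["first_name", "last_name"])
```

In this sample, half of the last names are missing, which is the kind of finding that tells you a last-name match rule will need cleanup or exclusion rules first.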

 

Once you have a clear picture of the problems with your data, you can begin cleaning and transforming it. Here, you might need human intervention to resolve information quality issues, such as ensuring that all last names for contacts are complete. If they aren’t, custom rules must be defined to exclude the records that do not have last names.

 

The matching process is where you’ll need to be strategic.

 

It’s better to use a combination of exact, numeric, and fuzzy linkage techniques.

 

You could start by identifying potential matches with exact matching techniques such as name-based matching and address-based matching. This helps identify potential duplicates within the dataset. Additionally, rules must be developed to determine how closely these criteria should match for two or more records to be considered the same individual or entity (e.g., a 90% match instead of 100%).

 

Next, you could use numeric match logic. This involves analyzing numerical fields such as zip codes, phone numbers, dates etc., and comparing them between records to find any potential matches. It also allows for more flexibility in terms of matches since it does not rely solely on exact string values but also on semantically meaningful numbers.

 

Finally, apply fuzzy match logic, which is useful for finding inexact matches between strings or records that may not have an exact or even close correspondence.
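
The three stages can be sketched as a simple cascade in Python. This is an illustrative outline only: the field names, the choice of phone number as the numeric field, the 0.9 threshold, and the use of `difflib.SequenceMatcher` for the fuzzy stage are all assumptions made for the example, not a description of WinPure’s algorithms.

```python
import re
from difflib import SequenceMatcher


def normalize_text(value):
    """Lowercase and collapse whitespace so formatting does not block a match."""
    return " ".join(value.lower().split())


def digits_only(value):
    """Keep only the digits, so '(212) 555-0100' equals '212.555.0100'."""
    return re.sub(r"\D", "", value)


def match_level(a, b, fuzzy_threshold=0.9):
    """Return which stage matched two records: exact, numeric, fuzzy, or None."""
    # Stage 1: exact match on normalized name and address.
    if (normalize_text(a["name"]) == normalize_text(b["name"])
            and normalize_text(a["address"]) == normalize_text(b["address"])):
        return "exact"
    # Stage 2: numeric match on the phone number's digits.
    phone_a, phone_b = digits_only(a["phone"]), digits_only(b["phone"])
    if phone_a and phone_a == phone_b:
        return "numeric"
    # Stage 3: fuzzy match on the name.
    ratio = SequenceMatcher(None, normalize_text(a["name"]),
                            normalize_text(b["name"])).ratio()
    if ratio >= fuzzy_threshold:
        return "fuzzy"
    return None


# Hypothetical records; the phone numbers are fictitious.
base = {"name": "John Smith", "address": "1 Main St", "phone": "(212) 555-0100"}
same = {"name": "john  smith", "address": "1 main st", "phone": "212-555-0100"}
moved = {"name": "J. Smith", "address": "1 Main Street", "phone": "212.555.0100"}
typo = {"name": "Jon Smith", "address": "5 Oak Ave", "phone": "6465550199"}
other = {"name": "Kathryn Doe", "address": "9 Elm Rd", "phone": "3105550123"}
```

Running the cheaper, stricter stages first means the fuzzy stage only has to decide the pairs that exact and numeric rules could not settle, and recording which stage produced a match makes later review easier.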

 

You may need to repeat the process on multiple data sets and iterate the match within or across datasets. You can save multiple copies of the data after duplicate data is merged and purged. Once you’re satisfied with the results, you can then proceed to create master records.

 

WinPure offers users all three matching algorithms along with a built-in library and options to create custom rules to meet the demands of your data. We highly recommend a combination of tools, critical analysis, and a carefully crafted strategy to perform entity resolution on organizational data.

 

You can always use the free version of WinPure to test out a segment of your data before making a purchase.

To Conclude

Technically, it’s easier to implement entity resolution today than it was a few decades ago. However, technical limitations are far from being the reason companies aren’t able to go through with an ER initiative. It comes down to strategic planning and investment in the right resources and solutions. Once you’ve crossed that bridge though, you’ll see an immediate impact on organizational efficiency, analytics, and insights.

Farah Kim is a human-centric product marketer and specializes in simplifying complex information into actionable insights for the WinPure audience. She holds a BS degree in Computer Science, followed by two post-grad degrees specializing in Linguistics and Media Communications. She works with the WinPure team to create awareness on a no-code solution for solving complex tasks like data matching, data deduplication, and MDM.
