Dirty data is a harsh reality most companies want to escape from. It’s like having a beautiful home, but with a plumbing system that causes leakage and affects the integrity, aesthetics, as well as functionality of your home. You could try covering up the problem with fancy hardware, rugs, sealants, or even paint, but unless the problem is nipped in the bud, your home will always remain at the mercy of the plumbing system. Such a small thing, but such a huge impact!
So how do you overcome this danger caused by dirty data and what are the critical steps companies can take today?
In our webinar, Navneet Makhni, Strategy Principal Director at Accenture, addresses these critical issues and shares strategic methods for overcoming dirty data challenges.
You can watch the complete webinar below. 👇
What is Dirty Data and Why are Companies Still Struggling to Cope?
We all understand data, and everybody likes to talk about data as the new oil. But would you live with bad-quality oil? You wouldn’t. Yet you live with bad-quality data. So yes, data is the new oil, but it is treated nowhere near as carefully as oil is in our economies!
Companies continue to deal with dirty data because they don’t understand what they need to fix. You had a data warehouse to begin with, fed from flat files and Excel sources. You thought that putting in a data warehouse would solve your issue, then you thought that moving the data into a data lake would solve it. However, using data lakes or building new data infrastructure does not solve the dirty data problem.
Most companies are slow in fixing data quality issues because the problem is not being recognized at the level that it should be. Navneet shares a good example to demonstrate a company’s inability to detect data quality issues.
A company may have the same vendor entered twice in the database.
- How was that information entered?
- Who was responsible?
- Which of the two pieces of information is correct?
These are questions that no one would know the answer to. Once this information is used by a business team, say the billing team sending an invoice, they would end up sending two invoices to the same vendor! At this point, the company would assume there’s something wrong with the invoice or the invoicing system, but few would trace it back to the original entry. This is a real example of the type of dirty data companies deal with on a daily basis, yet they are unable to identify what they truly need to fix to solve the problem.
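To make the vendor example concrete, here is a minimal Python sketch of how such duplicates could be surfaced before they reach the billing team. The vendor records, field names, and the `normalize` and `find_duplicates` helpers are all hypothetical and invented for illustration. Real matching usually needs fuzzier logic than this simple normalization.

```python
import re
from collections import defaultdict

# Hypothetical vendor records; names and IDs are invented for illustration.
vendors = [
    {"id": 1, "name": "Acme Supplies Ltd."},
    {"id": 2, "name": "ACME Supplies Ltd"},
    {"id": 3, "name": "Northwind Traders"},
]

def normalize(name: str) -> str:
    """Lowercase the name, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())

def find_duplicates(records):
    """Group record IDs by normalized name; keep groups with more than one ID."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["name"])].append(record["id"])
    return {name: ids for name, ids in groups.items() if len(ids) > 1}

duplicates = find_duplicates(vendors)
print(duplicates)  # {'acme supplies ltd': [1, 2]}
```

A check like this only flags the suspect pair. Deciding which of the two entries is correct still needs someone who knows the vendor.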
Similar to the plumbing problem in the house, you have no idea who caused the problem unless you hire a technical expert to assess the whole infrastructure and identify the root cause. For most companies, this opens a Pandora’s box of problems and fixes that would result in very high expenses. Companies choose to live with bad data because it’s easier than fixing it at the source.
What Are the Most Common Instances of Dirty Data?
The most common forms of bad data involve personal information such as names, addresses, ages, and even gender. Irrespective of the industry, the first problem data specialists are asked to solve is always the same: clean customer data. Companies want a sanitized, best-in-class list of their customers. Some of the most common challenges they face with this data are:
👉First names are blank, incorrect, or incomplete
👉Typos, culturally incorrect spellings and poor data entry (such as punctuation in a name)
👉Address data that is incomplete, non-standardized, and not validated
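The challenges listed above can be sketched as a handful of simple checks. The snippet below is only an illustration: the customer records, the field names, and the `NAME_PATTERN` rule (letters, spaces, hyphens, and apostrophes only) are assumptions. Real name and address validation is considerably more involved.

```python
import re

# Hypothetical customer records; field names are assumptions for illustration.
customers = [
    {"first_name": "Maria", "address": "12 High Street", "postcode": "AB1 2CD"},
    {"first_name": "", "address": "4 Elm Road", "postcode": "XY9 8ZW"},
    {"first_name": "J0hn!", "address": "", "postcode": ""},
]

# Assumed rule: letters (including accented ones), separated by a single
# space, hyphen, or apostrophe. Digits and other punctuation are rejected.
NAME_PATTERN = re.compile(r"^[^\W\d_]+(?:[ '\-][^\W\d_]+)*$")

def check_customer(record):
    """Return a list of human-readable issues found in one customer record."""
    issues = []
    name = record.get("first_name", "").strip()
    if not name:
        issues.append("first name is blank")
    elif not NAME_PATTERN.match(name):
        issues.append("first name contains unexpected characters")
    for field in ("address", "postcode"):
        if not record.get(field, "").strip():
            issues.append(f"{field} is missing")
    return issues

# Map each record's position in the list to the issues found in it.
report = {index: check_customer(c) for index, c in enumerate(customers)}
for index, issues in report.items():
    print(index, issues)
```

The first record passes, the second is flagged for a blank first name, and the third for an invalid name plus a missing address and postcode.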
Along with this, reference data also has severe challenges with data quality.
Reference data is data that helps categorize other data.
Examples include a list of countries, the list of cities in each country, and the possible values of currencies. There are international standards and lists of countries, cities, and currencies that most companies simply don’t follow. They create their own lists, only to run into problems later when they realize a country or city is missing, or when a currency is not assigned to the right country!
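As a rough illustration of checking records against standard reference data, the sketch below uses a small hand-typed excerpt of ISO 3166-1 alpha-2 country codes and their ISO 4217 currency codes. The records and the `validate` helper are hypothetical. In practice you would load the full standard lists rather than a six-entry dictionary.

```python
# Small excerpt of ISO 3166-1 alpha-2 country codes mapped to ISO 4217
# currency codes. A real check would load the complete standard lists.
COUNTRY_CURRENCY = {
    "US": "USD",
    "GB": "GBP",
    "DE": "EUR",
    "FR": "EUR",
    "IN": "INR",
    "JP": "JPY",
}

# Hypothetical customer records, invented for illustration.
records = [
    {"customer": "A", "country": "GB", "currency": "GBP"},
    {"customer": "B", "country": "DE", "currency": "GBP"},  # wrong currency
    {"customer": "C", "country": "UK", "currency": "GBP"},  # home-grown code; ISO uses GB
]

def validate(record):
    """Return a description of the problem, or None if the record is consistent."""
    expected = COUNTRY_CURRENCY.get(record["country"])
    if expected is None:
        return f"unknown country code {record['country']!r}"
    if record["currency"] != expected:
        return f"currency {record['currency']!r} does not match expected {expected!r}"
    return None

problems = {r["customer"]: validate(r) for r in records}
print(problems)
```

Customer A passes, B is flagged for a currency that does not belong to its country, and C is flagged because "UK" is a home-grown code rather than the standard one.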
What Actionable Steps Can Companies Take to Fix Dirty Data Challenges?
Using the plumbing analogy, Navneet explains a systematic approach to solving the dirty data challenge.
✅Identify the Scope:
To start with, you must first identify the scope of the plumbing/leakage problem. In terms of data, identify what it is you’re trying to fix. For example, are you trying to fix the data because it has been modeled incorrectly, which makes it a data modeling quality issue? Or because it is a master data issue, affecting the foundational information you want to have in your company? Identify what you are trying to fix. That is the first bridge to cross when attempting to fix any data quality issue.
✅Building the Long-term Plan:
There are two ways you can fix this problem: short-term or long-term. For the short-term plan, you can put some sealant on the leak for temporary relief. Once the temporary fix is in place, look into a long-term approach by identifying what caused the leakage. Could it be a housing problem? A man-made problem? Or was it caused by pest infestation? Understanding the cause helps you solve the problem at its root.
✅Implement Data Governance:
Once you’ve identified the short-term and long-term problems, you can then look into cleaning up your historical data (which is analogous to the water on the floor caused by the leakage). You would want to clean that up and ensure such leakage doesn’t occur again – or if it does, you should have a better response mechanism to it (analogous to having a data governance plan).
How Does Poor Data Quality Affect Downstream Applications?
Downstream applications such as analytics and AI/ML projects are the most affected by poor data. For example, if an analytics dashboard is fed 100 customer records that are duplicated, incorrect, or rubbish entries, it will generate rubbish results, following the Garbage In, Garbage Out phenomenon.
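The arithmetic behind Garbage In, Garbage Out is easy to see with a toy example. In the sketch below, the order records and amounts are made up, and one order has been loaded twice, so the dashboard total is overstated.

```python
# Hypothetical order records; the same order (A-100) has been loaded twice.
orders = [
    {"order_id": "A-100", "customer": "Acme", "amount": 500},
    {"order_id": "A-101", "customer": "Northwind", "amount": 300},
    {"order_id": "A-100", "customer": "Acme", "amount": 500},  # duplicate row
]

# What a dashboard would report if it simply summed every row.
raw_total = sum(order["amount"] for order in orders)

# Keep one row per order_id, then sum again.
unique_orders = {order["order_id"]: order for order in orders}
clean_total = sum(order["amount"] for order in unique_orders.values())

print(raw_total)    # 1300
print(clean_total)  # 800
```

A single duplicated row inflates the reported sales here by more than 60%, which is exactly the kind of skew that leads to wrong decisions downstream.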
There are plenty of real-life instances where a predictive model was fed flawed data, resulting in skewed forecasts. If a company is led to believe that it has doubled its sales metrics when the needle hasn’t moved at all, this can result in real business setbacks! Similarly, if a company is led to believe that it has suffered a decline in profitability, it can move to lay off staff, causing real damage.
When it comes to AI, the effect is very apparent – feed an AI model poor data, and it will deliver poor output. Again, there are plenty of instances where AI models were considered biased and racist because of the poor data that was fed to them.
Who is Responsible for Data Quality?
Finally, this is a critical question, with a straightforward answer, that most companies don’t usually get right.
When there is a dirty data problem, companies are inclined to hire an IT specialist with data management knowledge, or a data analyst to come in and “Fix” the data.
But contrary to popular practice, an IT person cannot fix this data, simply because they do not have the context of the data. Technology teams understand data as an attribute: a value in a column of a table, stored in a data model and connected through pipelines. And that is it.
For example, if 30% of a company’s customers are based in Europe, another 40% in the UK, and the remainder in the US, then currency is an important piece of reference data. If you call in an IT person, they might not even know this is a problem! Somebody needs to segregate this data at a country (or even city) level to include the currency factor.
Data quality is a contextual problem more than it is an attribute-level problem. Therefore, the ownership of quality lies with the business: whoever is accountable for that business process. It could be the CDO, the CEO, or the CMO; in short, the leader who oversees the business process. Data quality depends on context, and rules are driven by the purpose, function, and requirements of the business process itself. So for example, if a business process doesn’t require the use of first names or titles, then the quality of the names or titles is not relevant.
However, these are not decisions that can be made by IT or technology teams. These are decisions that can only be made by business teams that understand the context of the data and can determine what would constitute “poor” quality.
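One way to picture this context dependence is to treat quality rules as something each business process defines for itself. The sketch below is purely illustrative: the process names, required fields, and sample record are assumptions, not rules from any real system.

```python
# Hypothetical required fields per business process. In practice these
# would be defined by the business owner of each process, not by IT.
REQUIRED_FIELDS = {
    "billing": ["company_name", "billing_address", "currency"],
    "email_campaign": ["first_name", "email"],
}

# A sample record, invented for illustration. The first name is blank.
record = {
    "company_name": "Acme Supplies",
    "billing_address": "1 Main St",
    "currency": "USD",
    "first_name": "",
    "email": "ap@acme.example",
}

def missing_fields(record, process):
    """List the fields this process needs that are absent or blank in the record."""
    return [
        field
        for field in REQUIRED_FIELDS[process]
        if not str(record.get(field, "")).strip()
    ]

print(missing_fields(record, "billing"))         # []
print(missing_fields(record, "email_campaign"))  # ['first_name']
```

The same record is perfectly good for billing and poor for an email campaign, which is why only the owner of the process can say what "poor" quality means.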
Of course, this is not to say that CMOs or CDOs have to do the actual work. Instead, they are required to be accountable and lead the focus on data quality, ensuring that they understand the nature and challenges posed by the data.
Fixing Dirty Data Is a Lather, Rinse, Repeat Process
Dirty data is a prevalent problem. To solve it, you need to identify the scope, get the leadership involved, define accountability and ownership, fix processes, and commit to actual solutions instead of patch fixes. Even when you find the solution, though, you need to stick to a routine of cleaning and fixing. Data gets populated at a rapid pace, especially if you have integrations with multiple parties, which means that new data comes in with new challenges. Even if you fix the leakage at the root, you still need a routine check to make sure everything is in order. The same rule applies to data quality!
And as Navneet puts it:
Don’t be complacent. Of course, if there is an interim fix that you need to do, do it, considering the business urgency around it, but remember to chase it as an agenda item next time you speak with the organization. You may be given different excuses, but if you don’t solve the problem, the puddle remains, causing mould and eventually weakening the infrastructure of your apartment.
Data Quality Improvement with WinPure
WinPure is a no-code data quality solution that lets you clean, deduplicate, and standardize large data sets through a point-and-click interface. You set the rules. You set the cleaning matrix. You choose the standardization rules you want to use. All without requiring any additional code, technical expertise, or additional infrastructure. WinPure is an on-premise solution that can be used by your business users to clean and standardize business data.
Get in touch with us to see how we can help! 👇