As data sources and volumes grow year over year, interest in automated solutions for managing data quality is increasing. So much data is being generated and consolidated so quickly that traditional methods can no longer manage data quality successfully.
Data quality management has typically been a manual task that falls to data stewards, who can no longer keep up with the number of issues that arise. There are too many data sources, and the data is often inconsistent and disorganized. As a result of these ongoing data quality challenges, projects take longer to complete, decision-making is delayed or flawed, and resources are wasted. This is not sustainable.
Augmented data quality solutions that use artificial intelligence and machine learning offer a fast, effective way to understand and improve data quality while minimizing manual intervention by data stewards.
But getting to this point requires an honest assessment of your data quality situation and a realistic view of what AI and ML can do for you now. AI can dramatically improve your data quality productivity, but it cannot fully automate data quality management. In this article, we’ll look at the challenges of traditional data quality management and how you can get started with augmented data quality.
Machine Learning Models Identify And Correct Data Quality Issues
The goal of today’s machine learning-driven data quality solutions is to minimize the need for intervention by a data steward — not to eliminate the need entirely.
Augmented data quality can assess both categorical and numerical data. Categorical data, such as master data, consists of a list of distinct values. Here, the data quality task is to determine whether an incoming value matches a value already on the list, is a legitimately new value, or is a data quality issue that should be mapped to a valid value. For numerical data, such as fact data, ML uses statistical process control to identify, for example, the expected range of values, trends, and boundaries of the data feed.
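To make the numerical case concrete, here is a minimal sketch of one common form of statistical process control: three-sigma control limits computed from a trusted baseline sample. The function names, the sample values, and the choice of three standard deviations are illustrative assumptions, not a description of any particular product.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits from a trusted baseline sample.

    The three-sigma default is an assumption; a real feed may need
    different limits or a different chart type altogether.
    """
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center, center + sigmas * spread


def flag_out_of_control(values, limits):
    """Return (index, value) pairs that fall outside the control limits."""
    lower, _, upper = limits
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical fact data: a baseline period and a newly arrived feed.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
limits = control_limits(baseline)  # mean 100, std dev 2 -> limits 94 and 106

new_feed = [101, 99, 150, 100, 42]
flagged = flag_out_of_control(new_feed, limits)
print(flagged)  # [(2, 150), (4, 42)]
```

Values flagged this way are candidates for review, not confirmed errors; a genuine shift in the business can also push a feed outside its historical limits.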
Solutions that use machine learning for data quality essentially train models on how bad data has been corrected in the past and on how data stewards have authorized those corrections.
With these learnings, you might expect that in 80% of situations a machine learning model could achieve a high enough level of confidence to accurately identify a data quality problem and make the correct change to fix it, or at least to set up the change so a data steward can review and authorize it. How much the model does on its own depends on its confidence level:
• High Confidence: The model makes the change and asks the data steward to validate the change.
• Medium Confidence: The model makes a recommendation and asks the data steward to authorize the change.
• Low Confidence: The model displays the options explored with the confidence level shown and the data steward makes whatever change is necessary.
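The three confidence tiers above can be sketched as a simple routing function for categorical data. In this sketch, string similarity from Python's standard difflib module stands in for a trained model's confidence score, and the thresholds (0.90 and 0.70), the action names, and the country list are all hypothetical values chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; in practice these would be tuned with data stewards.
HIGH_CONFIDENCE = 0.90
MEDIUM_CONFIDENCE = 0.70

VALID_COUNTRIES = ["Germany", "France", "Canada"]  # illustrative master data


def score_candidates(value, valid_values):
    """Score each valid value against the input, best match first.

    String similarity is only a stand-in for a real model's confidence.
    """
    scored = [
        (SequenceMatcher(None, value.lower(), v.lower()).ratio(), v)
        for v in valid_values
    ]
    return sorted(scored, reverse=True)


def triage(value, valid_values):
    """Route an incoming value according to the confidence tier."""
    if value in valid_values:
        return {"action": "accept", "value": value}

    candidates = score_candidates(value, valid_values)
    confidence, best = candidates[0]

    if confidence >= HIGH_CONFIDENCE:
        # High: make the change, then ask the steward to validate it.
        return {"action": "apply_then_validate", "value": best,
                "confidence": confidence}
    if confidence >= MEDIUM_CONFIDENCE:
        # Medium: recommend the change and wait for authorization.
        return {"action": "recommend_for_authorization", "value": best,
                "confidence": confidence}
    # Low: show every option with its confidence; the steward decides.
    return {"action": "steward_decides", "options": candidates}


for raw in ["Canada", "Germny", "Frnc", "XYZ"]:
    print(raw, "->", triage(raw, VALID_COUNTRIES)["action"])
```

With these assumed thresholds, "Germny" is corrected to "Germany" and queued for validation, "Frnc" produces a recommendation of "France" that awaits authorization, and "XYZ" is handed to the steward along with the scored options.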
With a self-learning model, you could aim even higher. As data stewards authorize changes recommended by the model and choose options for dealing with low-confidence cases, the model learns how to make better decisions, which drives higher confidence, better results, and reduced data steward intervention.
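One way to picture this feedback loop is a small memory of steward decisions whose confidence rises as the same correction is authorized repeatedly. The sketch below uses smoothed frequency counts rather than a retrained machine learning model, so it is a deliberately simplified stand-in; the class name, method names, and the smoothing weight of 2 are hypothetical.

```python
from collections import Counter, defaultdict


class CorrectionMemory:
    """Remembers which corrections stewards approved for each raw value."""

    def __init__(self, prior_weight=2):
        # prior_weight is an assumed smoothing term: it keeps confidence
        # low until a correction has been approved several times.
        self.prior_weight = prior_weight
        self._history = defaultdict(Counter)

    def record(self, raw_value, approved_value):
        """Store one steward-authorized correction."""
        self._history[raw_value][approved_value] += 1

    def suggest(self, raw_value):
        """Return (suggested_value, confidence) based on past approvals."""
        seen = self._history.get(raw_value)
        if not seen:
            return None, 0.0
        best, count = seen.most_common(1)[0]
        total = sum(seen.values())
        return best, count / (total + self.prior_weight)


memory = CorrectionMemory()

memory.record("Germny", "Germany")
print(memory.suggest("Germny"))  # low confidence after a single approval

for _ in range(7):
    memory.record("Germny", "Germany")
print(memory.suggest("Germny"))  # higher confidence after eight approvals
```

After one approval the confidence is 1/3; after eight consistent approvals it reaches 0.8, so the same correction would need less steward attention over time. Conflicting steward decisions for the same raw value pull the confidence back down.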
Prerequisites: Data Governance And Stewardship
To build effective machine learning models for augmented data quality, an organization should have a track record with data governance and stewardship to indicate how data quality issues have been handled in the past.
This requires business data stewards who determine how data quality issues are handled. They understand the data the business needs, its semantic meaning, why it is needed, and how it will be used. This is why it is recommended that data stewards be part of the business organization. It may help to think of data quality management in terms of a RACI matrix:
• Responsible: IT makes the change requested by the data stewards.
• Accountable: Data stewards make sure the changes are identified and executed.
• Consulted: Business leadership is consulted.
• Informed: Users are informed.
Getting Started With Augmented Data Quality
If your organization doesn’t have a strong history of governance and stewardship, this should be your first priority. You cannot improve what you don’t measure; putting governance over your data quality will help you understand where you are on your data quality journey.
As you advance your data quality capabilities, keep in mind a few recommendations:
• Data stewards should reside in the business — they will have the best perspective regarding the need for data and its use in the business.
• IT should provide the support (people, process, and technology) that lets stewards maintain data quality easily.
• Executive ownership and business adoption are critical to success.
• 100% quality is 100% not necessary — there is a diminishing return on your investment.
• Get your master data in order first. It is the easiest data to control, and the business often has a clear need to understand it.
This article was originally published as The Journey To Augmented Data Quality on March 8, 2022, on Forbes.