24 Dec 2025
Written by Julia Cherashore
Effective and scalable use of AI is not as easy as it sounds.
Data provenance, or “the comprehensive documentation of a dataset's origin, transformation history, ownership chain, access controls, and usage patterns across its lifecycle,” is a tall order and should be a foundational requirement for responsible Artificial Intelligence (AI) governance. Beyond data provenance, effective data policy should rest on three interdependent components: effective governance, technical capacity, and high-quality data to support responsible and scalable AI development. These components build on one another, and each could fill a post of its own. For the sake of this blog, I will explore how one could use AI to address another fundamental of useful AI: data quality.
To be sure, delivering high-quality data at scale has been one of the biggest challenges facing data professionals. The good news is, with AI, the toolkit to tackle data quality has meaningfully expanded.
First, AI-powered standalone agentic data quality solutions, inaccessible just a couple of years ago, have emerged in the marketplace. An agentic AI data quality solution commonly refers to an interconnected network of AI agents – autonomous software systems – that can proactively monitor for and detect data discrepancies, learn data patterns and suggest data quality rules, and perform some automated remediation of data issues, among other tasks.
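To make the agentic pattern a bit more concrete, here is a minimal sketch, assuming a pandas workflow: a single monitoring agent learns baseline statistics from a trusted snapshot, flags incoming values that drift from that baseline, and proposes a candidate rule. The dataset, column names, thresholds, and rule format are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of one "monitoring" agent in an agentic data quality loop.
# The dataset, thresholds, and suggested-rule format are illustrative only.
import pandas as pd

def learn_baseline(df: pd.DataFrame) -> dict:
    """Learn simple per-column statistics from a trusted snapshot."""
    return {
        col: {"mean": df[col].mean(), "std": df[col].std(), "null_rate": df[col].isna().mean()}
        for col in df.select_dtypes("number").columns
    }

def detect_anomalies(df: pd.DataFrame, baseline: dict, z: float = 3.0) -> list[dict]:
    """Flag numeric values that drift far from the learned baseline and suggest a rule."""
    findings = []
    for col, stats in baseline.items():
        outliers = df[(df[col] - stats["mean"]).abs() > z * stats["std"]]
        if not outliers.empty:
            findings.append({
                "column": col,
                "rows": outliers.index.tolist(),
                "suggested_rule": f"abs({col} - {stats['mean']:.1f}) <= {z} * {stats['std']:.1f}",
            })
    return findings

trusted = pd.DataFrame({"order_amount": [120, 95, 110, 130, 105, 98]})
incoming = pd.DataFrame({"order_amount": [115, 102, 9999, 108]})  # 9999 is a likely fat-finger entry

baseline = learn_baseline(trusted)
for finding in detect_anomalies(incoming, baseline):
    print(finding)  # an orchestrating agent could route findings to remediation or human review
```

In a full agentic solution, this detection step would be one agent among several, coordinated with agents that suggest rules, apply fixes, and log the outcome.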
Second, existing data governance and data quality management solutions, which previously relied heavily on rules-based logic for anomaly detection, have been racing to embed AI at their core and deliver improved scalability and an enhanced set of features. As an example, a user of such a platform could ask a ChatGPT-style AI assistant specific questions about a particular dataset and assess its quality in mere minutes. These efficiencies open possibilities for data governance platforms to digest and analyze larger datasets and deliver on more complex tasks.
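As a rough illustration of that assistant-style interaction, the sketch below profiles a dataset with pandas and sends the profile to a chat model through the OpenAI Python SDK. The file name, model name, and prompt are placeholder assumptions; commercial platforms embed this kind of capability behind their own interfaces.

```python
# Illustrative sketch: summarize a dataset's quality profile and ask an AI assistant about it.
# The chat-completions backend, model name, and prompt are assumptions, not a specific platform's API.
import pandas as pd
from openai import OpenAI

df = pd.read_csv("customers.csv")  # hypothetical dataset
profile = df.describe(include="all").to_string()
null_rates = df.isna().mean().to_string()

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model choice
    messages=[
        {"role": "system", "content": "You are a data quality assistant."},
        {"role": "user", "content": f"Profile:\n{profile}\n\nNull rates:\n{null_rates}\n\n"
                                    "Which columns look unreliable, and why?"},
    ],
)
print(response.choices[0].message.content)
```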
Lastly, new open-source data quality solutions have blossomed for those looking to “try before you buy”, thereby removing barriers for organizations small and large to improve their data quality.
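pandera is one example of such an open-source library. The sketch below is a minimal, illustrative schema (the columns and rules are placeholders for an organization's actual data contracts) that validates a small DataFrame and reports every failing value.

```python
# A "try before you buy" example using the open-source pandera library.
# The schema is illustrative; real rules would reflect your own data contracts.
import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({
    "zip_code": pa.Column(str, pa.Check.str_matches(r"^\d{5}$")),  # five-digit US zip codes
    "order_amount": pa.Column(float, pa.Check.ge(0)),              # no negative amounts
    "email": pa.Column(str, pa.Check.str_contains("@")),           # crude email sanity check
})

df = pd.DataFrame({
    "zip_code": ["30301", "0210"],  # the second entry is missing a digit
    "order_amount": [42.50, 19.99],
    "email": ["a@example.com", "b@example.com"],
})

try:
    schema.validate(df, lazy=True)   # lazy=True collects all failures instead of stopping at the first
except pa.errors.SchemaErrors as err:
    print(err.failure_cases)         # a tidy table of every failing value
```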
While AI has already gained traction for data quality anomaly detection, it is not the only application of AI within the overall data quality framework that can help scale these efforts. Once data quality issues are identified, how does one prioritize data remediation in a consistent and quantifiable way? This is not a new problem for data practitioners, but it has become more salient. Scarce organizational resources and the creation of new data at an unprecedented pace add to the challenges for data professionals, but data quality remains a critical foundational step to get data AI-ready.
One way to approach prioritization is to use an established framework to identify which data has the most impact on the AI model's decisions. Quantifying the importance of each data attribute to the model can serve as a proxy for that attribute's priority and guide data quality remediation toward the most impactful attributes, thereby scaling data quality efforts.
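As a minimal sketch of that idea, the example below ranks attributes by permutation importance using scikit-learn: the more a model's accuracy drops when an attribute is shuffled, the more that attribute matters, and the higher its remediation priority. The synthetic data, feature names, and model are placeholder assumptions; an organization's established framework may use a different importance measure.

```python
# Sketch: rank data attributes by their influence on a model's decisions,
# then use that ranking to prioritize data quality remediation.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data and attribute names standing in for a real training set.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
feature_names = ["income", "tenure", "zip_code", "age", "channel", "segment"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling each attribute degrade the model?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

# Attributes that matter most to the model are remediated first.
priority = sorted(zip(feature_names, result.importances_mean), key=lambda x: x[1], reverse=True)
for name, score in priority:
    print(f"{name}: {score:.3f}")
```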
Traditional manual data quality remediation approaches have significant limitations when solving for data quality at scale. Just think of the times you have transposed numbers in your zip code or your name has been misspelled in an email – simple, common data entries prone to “fat finger” mistakes are magnified by orders of magnitude, making it nearly impossible for human review alone to keep up with the rate of data production now. Undoubtedly, progress has been made in automating data quality remediation using rules-based ‘if/then’ approaches (a minimal sketch of that pattern appears below). It's not perfect, but it is a start.
Whether through AI-powered standalone agentic data quality solutions, modern data governance platforms with embedded AI capabilities, or open-source data quality tools, innovation at the intersection of AI and data quality is now a reality, and this evolving space is exciting to watch. The novel applications of AI to scale data quality present compelling opportunities to accelerate data provenance and move from data hoarding to data strategy.
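To make the rules-based ‘if/then’ pattern concrete, here is a minimal sketch in pandas, echoing the zip code and email examples above; the rules and records are illustrative and far from exhaustive.

```python
# Sketch of rules-based 'if/then' remediation for common fat-finger data entry issues.
# The rules and records are illustrative, not a complete remediation pipeline.
import pandas as pd

df = pd.DataFrame({
    "zip_code": ["30301", "3030", "02115-", None],
    "email": ["  Alice@Example.COM", "bob@example.com ", "carol[at]example.com", None],
})

def remediate(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # IF the zip code contains stray characters, THEN strip them;
    # IF the result is not five digits, THEN flag the row for review.
    out["zip_code"] = out["zip_code"].str.replace(r"[^0-9]", "", regex=True)
    out["zip_needs_review"] = ~out["zip_code"].str.fullmatch(r"\d{5}", na=False)
    # IF the email has stray whitespace or capitalization, THEN normalize;
    # IF it has no '@', THEN flag the row for review.
    out["email"] = out["email"].str.strip().str.lower()
    out["email_needs_review"] = ~out["email"].str.contains("@", na=False)
    return out

print(remediate(df))
```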
Additional resources:
1) “Data Provenance in AI,” a Data Foundation white paper
2) “From Data Hoarding to Data Strategy: Building AI that Actually Works” by Nick Hart, Data Foundation President and CEO, for Forbes Technology Council