

Data Provenance in AI

This paper advocates for prioritizing data provenance—documenting a dataset's origin, transformation, ownership, access, and use throughout its lifecycle.
24 Sep 2025
Reports



Executive Summary

As artificial intelligence (AI) systems become increasingly central to government operations and private sector innovation, the United States faces a critical governance gap: the absence of comprehensive data provenance frameworks. This paper makes the case for elevating data provenance, meaning the comprehensive documentation of a dataset's origin, transformation history, ownership chain, access controls, and usage patterns across its lifecycle, to a foundational requirement for responsible AI governance.
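To make the definition concrete, the elements listed above can be sketched as a structured metadata record. This is an illustrative sketch only; the schema, field names, and hash-chained lineage scheme are assumptions for demonstration, not an established standard.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Illustrative provenance metadata for one dataset (not a standard schema)."""
    dataset_id: str
    origin: str           # where the data originally came from
    owner: str            # current steward in the ownership chain
    access_policy: str    # who may access or use the data
    transformations: list = field(default_factory=list)  # ordered lineage

    def record_transformation(self, description: str) -> str:
        """Append a transformation step, chained to the previous step's hash
        so the lineage is tamper-evident."""
        prev_hash = self.transformations[-1]["hash"] if self.transformations else ""
        step = {"description": description, "prev": prev_hash}
        step["hash"] = hashlib.sha256(
            json.dumps(step, sort_keys=True).encode()
        ).hexdigest()
        self.transformations.append(step)
        return step["hash"]


# Hypothetical example values:
record = ProvenanceRecord(
    dataset_id="census-2020-sample",
    origin="US Census Bureau public files",
    owner="Agency Data Engineering Team",
    access_policy="internal-research-only",
)
record.record_transformation("dropped rows with missing income")
record.record_transformation("normalized state codes to USPS abbreviations")
```

Chaining each step to the previous step's hash means any later edit to the lineage invalidates all subsequent hashes, which is one simple way a provenance framework can make transformation histories auditable.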

Without robust provenance mechanisms, AI systems operate as "black boxes" where the origins and transformations of both training and analyzed data remain invisible to users, regulators, and affected communities. As federal agencies deploy over 1,700 AI use cases and foundation models reshape entire sectors, the stakes of this provenance gap continue to rise.

Data provenance represents more than a compliance requirement; it is infrastructure for innovation. By establishing a clear data provenance framework, organizations can reduce redundancy, improve quality control, and build systems that earn public trust. For the United States, leadership in data provenance standards offers a pathway to shape international AI governance norms while maintaining competitive advantage in global markets increasingly defined by trustworthy AI deployment.

Comprehensive provenance frameworks can contribute to increased transparency and accountability, holding organizations responsible when AI systems cause harm. They may also help address security vulnerabilities, including data poisoning attacks and contamination, and expose fairness and representation gaps that perpetuate bias and undermine democratic participation.

This paper identifies three domains requiring immediate attention: government AI applications where accountability demands transparency; foundation models that require scalable provenance for massive datasets; and multi-stakeholder governance coalitions that can drive standards development across sectors.

Addressing the data provenance gap requires coordinated action across government, industry, academia, and civil society. Key priorities include establishing provenance requirements for federal AI systems, procurement, and funded research; developing scalable solutions for large-scale models and multi-party data pipelines; creating oversight mechanisms with clear authority and enforcement capabilities; and including affected communities in standards development and implementation.

Data provenance is not optional; it is essential infrastructure for AI systems that must be transparent, accountable, and trustworthy. The United States has a narrow window to establish data provenance standards before opaque AI systems become too entrenched to reform effectively. By treating provenance as a prerequisite rather than an afterthought, we can build an AI future that is powerful and principled, strengthening accountability while fostering innovation. 


DATA FOUNDATION
1100 13TH STREET NORTHWEST
SUITE 800, WASHINGTON, DC
20005, UNITED STATES

INFO@DATAFOUNDATION.ORG
