Why Data Standards Matter in the Age of AI

By Dean Ritz & Kurt Cagle

One way to tame a mess, whether it is a child's room, a woodshop, or a swamp of data, is to have a place to put everything. Data standards do this for data. More importantly, they can do this for generative AI systems, allowing us to address many of the legitimate concerns that have arisen since the widespread release of large language models (LLMs) and the generative AI (Gen AI) systems built on them, such as OpenAI's ChatGPT, Microsoft Copilot™, and Google Bard™.

Data standards represent a contract as to the meaning of data. For decades, governments and nonprofit organizations have produced data standards to streamline the exchange of information among public, private, and governmental entities. In fact, every piece of computer software has some standard for the structure and meaning of the data it operates upon, even if that standard exists only within that single program. There are great benefits to having a place for everything and everything in its place.
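To make the idea of a "contract as to the meaning of data" concrete, here is a minimal sketch in Python. The schema and all of its field names are invented for illustration; real data standards (such as those produced by standards bodies) are far richer, but the principle is the same: agree on the fields, their types, and their meanings, then check that data honors the agreement.

```python
# A hypothetical data standard expressed as a tiny schema: each entry fixes
# a field's name, its type, and (in the comment) its agreed meaning.
SCHEMA = {
    "report_date": str,    # ISO 8601 date, e.g. "2024-01-31"
    "entity_id": str,      # identifier assigned by a registry (invented here)
    "amount_usd": float,   # monetary amount, in US dollars
}

def conforms(record: dict) -> bool:
    """Return True if the record has exactly the agreed fields and types."""
    if set(record) != set(SCHEMA):
        return False
    return all(isinstance(record[key], typ) for key, typ in SCHEMA.items())

ok = conforms({"report_date": "2024-01-31", "entity_id": "X-100", "amount_usd": 12.5})
bad = conforms({"report_date": "2024-01-31", "entity_id": "X-100"})  # missing a field
```

A record that honors the contract passes (`ok` is true); one missing an agreed field fails (`bad` is false). Every party exchanging such records can rely on the same structure and meaning.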

As we step into the era of advanced artificial intelligence, the question arises: do we still need data standards? With the advent of Gen AI, some have speculated that AI systems can make sense of virtually anything, rendering data standards and tagging of data in compliance reports obsolete. This alluring notion, however, does not hold true. In fact, the Age of AI makes data standards more critical than ever before.

First, let's clarify some essential terms. Generative AI, or Gen AI, refers to AI systems capable of independently building knowledge models, even without human guidance. These models can encompass anything from recognizing a cat (or a tumor) in an image, to generating essays and even passing professional exams such as the bar exam.

What sets Gen AI apart is its approach to knowledge representation. Instead of relying solely on explicit probabilities, it employs models of probabilities. This seemingly subtle difference is what empowers Gen AI to learn, adapt, and express what appears to us as creativity. It's the LLM (Large Language Model) aspect of Gen AI, where "language" encompasses almost any form of input that can be fed into a computer. It's also a paradigm shift in scale. Simply put, the enormity of its data resources enables it to do things beyond the capabilities of any person. This is a world-changer. It's no wonder that policymakers are actively working to understand how it works and the need for federal policy regarding its use.
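The distinction between explicit probabilities and a model of probabilities can be illustrated with a deliberately tiny toy. This is only an analogy: real LLMs use neural networks, not the add-one smoothing shown here. But the contrast holds: a lookup table of observed word pairs can only report what it has seen, while even a crude model generalizes, assigning some probability to pairs it has never observed.

```python
from collections import Counter

# A three-sentence "corpus" and a table of observed word pairs.
corpus = ["the cat sat", "the cat ran", "the dog ran"]
pairs = Counter()
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        pairs[(a, b)] += 1

vocab = {w for s in corpus for w in s.split()}

def table_prob(a: str, b: str) -> float:
    """Explicit probabilities: pure lookup; unseen pairs get exactly zero."""
    total = sum(n for (x, _), n in pairs.items() if x == a)
    return pairs[(a, b)] / total if total else 0.0

def model_prob(a: str, b: str, alpha: float = 1.0) -> float:
    """A *model* of probabilities (add-alpha smoothing): it generalizes,
    assigning some probability mass even to pairs never observed."""
    total = sum(n for (x, _), n in pairs.items() if x == a)
    return (pairs[(a, b)] + alpha) / (total + alpha * len(vocab))

seen = table_prob("cat", "sat")     # observed pair: nonzero
unseen = table_prob("dog", "sat")   # never observed: exactly zero
modeled = model_prob("dog", "sat")  # the model still assigns it probability
```

The table is frozen at its data; the model extrapolates beyond it. Scaled up by many orders of magnitude, that capacity to extrapolate is what makes Gen AI both powerful and, as discussed below, in need of alignment.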

With such powerful capabilities, why, then, are data standards needed? The answer lies in the alignment problem of artificial intelligence. This problem revolves around ensuring that AI systems, particularly Gen AI, make decisions and behave in ways that align with human values and objectives. News stories about deepfake images and voices, misinformation customized at massive scale, and the creation of biological weapons all raise significant, legitimate concerns.

But even without "bad actors," AI systems can diverge from desired behavior because they generate recommendations rather than retrieve data from a database. Aligning Gen AI with societal needs goes beyond providing better training data; it involves embedding data standards directly into AI models (indeed, "embeddings" is a technical term of art in Gen AI for representing meaning inside a model).

Here is a concrete example. Consider self-driving cars. We don't permit these vehicles to veer off course, ignore stop signs, or collide with pedestrians. We don't expect them to "learn" these basics through trial and error. Instead, data standards consistent with relevant laws and practices are embedded into the Gen AI models. These standards declare the existence of stop signs, traffic signals, and pedestrians, along with specifications of their meanings and relationships. Applying data standards here means embedding them into the Gen AI models, which then learn to operate vehicles first in this conceptual space and, eventually, on the road with real cars, real traffic signals, and real pedestrians.
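A toy sketch can show what "declaring the existence of stop signs, traffic signals, and pedestrians, along with their meanings" might look like in code. This is not any real automotive or traffic standard; the concept names, definitions, and mandated actions below are invented for illustration. The point is that the declarations are stated up front rather than left for the system to discover by trial and error.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Concept:
    name: str
    meaning: str          # human-readable definition from the standard
    required_action: str  # the behavior the standard mandates

# A hypothetical mini-standard declaring traffic concepts and their meanings.
STANDARD = {
    "stop_sign": Concept("stop_sign", "octagonal regulatory sign", "come to a full stop"),
    "red_light": Concept("red_light", "red phase of a traffic signal", "stop and wait"),
    "pedestrian": Concept("pedestrian", "person on or near the roadway", "yield right of way"),
}

def required_action(observed: str) -> str:
    """Look up the mandated behavior; unknown objects demand caution."""
    concept = STANDARD.get(observed)
    return concept.required_action if concept else "slow down and reassess"

action = required_action("stop_sign")    # mandated: come to a full stop
fallback = required_action("tumbleweed") # not in the standard: be cautious
```

Because the stop sign's meaning and required behavior are declared, no amount of training data can redefine them; the model learns how to act within a conceptual space the standard has already fixed.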

Data standards have long played a role in many other regulated activities, including financial markets. Regulators define practices and criteria, while bodies like the Financial Accounting Standards Board (FASB) and the Securities and Exchange Commission (SEC) define the terms used to describe market activities. Data standards can play a vital role in aligning Gen AI with financial regulations, just as they do for self-driving cars.

In conclusion, data standards are not relics of the past, but rather indispensable tools for shaping the behavior of Gen AI. Data standards are essential for decision-makers, policymakers, and program administrators as federal AI policies are crafted and implemented in the months and years ahead.  

———————————

Dean Ritz is a Senior Research Fellow at the DATA Foundation, a nonprofit organization whose mission is to improve government, business, and society through open data and evidence-informed public policy.

Kurt Cagle is the Editor In Chief of The Cagle Report, and a domain expert in the development of data standards for enterprise data management, knowledge graphs, and generative AI applications.

Microsoft Copilot is a trademark of Microsoft Corporation. Bard is a trademark of Google LLC.