Most Context Layers Only Go Halfway: What AI Actually Needs
There's a hidden assumption embedded in almost every data catalog, governance framework, and metadata platform built over the last decade: that a human being would always be nearby to handle ambiguity. When a definition was unclear, the analyst tracked down the right answer. When "revenue" meant something different to finance than it did to sales, the analyst knew who to call. When a query surfaced something suspicious, the analyst noticed. Organizations built remarkable infrastructure: unified catalogs, lineage graphs, quality frameworks, and access policies. And the entire model depended on a person-in-the-loop who could interpret, question, and resolve.
That assumption is now wrong. AI agents are becoming the dominant consumer of enterprise data, and they don't ask clarifying questions. They infer. And when the inference is wrong, when "customer" means something different in the data warehouse than it does in the analytics layer, the model doesn't flag the uncertainty. It returns a confident answer built on a faulty premise.
That's an architecture problem, and solving it requires a layer most organizations haven't built yet.
What Does Your AI Actually Know About Your Business?
In most organizations, the path from a business question to a data answer has always run through people. A product manager who understands what "monthly active users" means for the business collaborates with a data engineer who knows which table actually stores it. A data scientist who gets an unexpected result sends a Slack message to the domain expert before publishing the analysis. This collaboration is invisible infrastructure. It doesn't appear in any architecture diagram, but it is doing an enormous amount of work: translating between business concepts and technical implementations, resolving ambiguity, and carrying institutional knowledge from person to person.
As Suresh Srinivas, co-founder of Collate and previously chief architect of data at Uber, put it at Data Summit Boston:
"AI doesn't ask you any questions. It makes its own assumptions and comes up with answers. How is the tribal knowledge — the things that human beings have in their head — how does it become accessible to AI? That is the biggest gap as we move towards AI-driven data." — Suresh Srinivas, Co-Founder, Collate
The tools organizations use today were built to support that human collaboration, not to replace it. A data catalog shows you what assets exist. Lineage tells you where the data came from. A business glossary captures some definitions in text. But none of these systems give an AI agent a formal, machine-readable model of what those concepts mean, how they relate to each other, or how to resolve conflicts between them. The agent arrives at your data estate with no one to call, no institutional memory, and no way to know which of the 20 tables named "customers" is the one you actually trust.
Why AI Amplifies Ambiguity Instead of Resolving It
The "revenue" problem is older than AI. In virtually every large organization, revenue has multiple definitions: finance calculates it one way, sales uses a broader number because commissions depend on it, and marketing attribution applies a third formula entirely. For years, human analysts navigated this by knowing the context, asking the right people, and documenting the disambiguation in a comment or a Confluence page.
AI has none of that institutional navigation. When it encounters an ambiguous concept, it doesn't pause. It picks a definition and proceeds.
"Revenue has different definitions depending on which domain you are coming from. For the finance team, the revenue could be different. For the sales team, revenue could be different because their commission gets calculated based on the revenue. When you ask what is revenue to AI — does it have the definition of revenue to begin with?" — Suresh Srinivas
The compounding problem: when AI models infer incorrectly, they tend to be more confident, not less. Research published in 2025 found that models are 34% more likely to use phrases like "definitely," "certainly," and "without doubt" when generating incorrect information. The organizations dealing with this aren't unlucky; they're running a system designed to return an answer even when the right answer isn't there.
As Suresh put it plainly: "With AI, the ambiguity that exists today in data... just will get amplified a hundred X."
This is why Gartner attributes 85% of AI project failures to poor data quality and readiness, and why MIT found that roughly 95% of generative AI pilots delivered zero measurable financial return in 2025. Retrieval-augmented generation helps, but if "revenue" isn't formally defined in the retrieval corpus, the model still guesses. Better retrieval doesn't resolve ambiguity; it delivers ambiguity faster, with higher confidence.
What the Catalog Era Got Right, and What It Left Unfinished
The data catalog era produced real advances. Before unified catalogs, metadata lived in siloed tool repositories: one system said the data owner was X, another said Y, one flagged PII, another didn't. Connecting those sources into a unified context graph gave organizations something genuinely new: a single place to see discovery, lineage, governance, quality, and observability together.
That was worth building, and it remains the foundation. But it answers a specific question: what is this data, where did it come from, who owns it, and can you trust it?
It doesn't answer the more fundamental question for AI: what does this data represent? Not its schema, not its lineage, not its quality score, but its meaning within the business.
"Context gives you understanding of data. But does it give you understanding of the business? Context is good for people; it is not sufficient for AI." — Suresh Srinivas
The failure mode is concrete. A data scientist queries a catalog with natural language: "Show me customer acquisition by region." The system returns 20 tables whose names contain "customer". The agent picks one based on proximity and keyword match. The query runs. The result looks plausible. Three weeks later, someone in finance notices the numbers don't match the system of record, and the debugging begins. The catalog did exactly what it was designed to do. The problem was that "customer" was never formally defined in a way the agent could reason with.
Data Is an Implementation Detail
The reframe Suresh offered at Data Summit is the one most data teams resist, because it inverts the order in which almost everyone has learned to build data infrastructure.
"Data is an implementation detail. The conceptual model is independent of any of these underlying tools and platforms." — Suresh Srinivas
This sounds abstract until you ground it in a specific example. At Uber, the conceptual model starts with a single sentence: Uber is a business that connects riders with drivers. Driver is a concept. Rider is a concept. A ride is the transaction between them. Those relationships, and the rules that govern them, form the ontology. The data systems that store trip records, driver ratings, and fare calculations are implementations of that model, not its source.
Most organizations have built this in reverse. They started with tables, added a catalog on top, wrote some glossary entries in text, and are now trying to infer meaning from the structure of the storage layer. When you ask AI to reason about that structure, it attempts to reconstruct the conceptual model from implementation details, roughly equivalent to reading a compiled binary and trying to understand the programmer's intent.
The semantic model needs to exist above the data layer, independently of it. It defines the core entities of the business, how they relate to each other, which transactions occur between them, and which rules govern those relationships. That model applies to the data warehouse, the data lake, the ETL pipelines, the API layer, and any AI system that touches any of them. Storage is where you put data. The semantic layer is where you say what it means.
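To make the inversion concrete, here is a minimal sketch of modeling concepts before storage. Everything in it, from the `Concept` class to the table paths, is illustrative rather than a reference to any real product or schema:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A business concept, defined independently of any storage system."""
    name: str
    definition: str                                      # the agreed business meaning
    relationships: dict = field(default_factory=dict)    # related concept -> predicate
    implementations: list = field(default_factory=list)  # where data lives, attached last

# The conceptual model comes first: riders, drivers, and the ride between them.
rider = Concept("Rider", "A person who requests transportation.")
driver = Concept("Driver", "A person who fulfills ride requests.")
ride = Concept(
    "Ride",
    "A transaction connecting one rider with one driver.",
    relationships={"Rider": "has-party", "Driver": "has-party"},
)

# Storage is an implementation detail, attached to the concept afterward.
ride.implementations += [
    "warehouse.trips.fact_trips",   # one implementation of the Ride concept
    "lake.raw.trip_events",         # another; the concept doesn't change
]
```

The point of the sketch is the direction of dependency: the concept owns its definition and relationships, and storage locations attach to it afterward, not the other way around.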
What the Missing Layer Actually Looks Like in Practice
The technical gap between a context graph and a semantic ontology is smaller than it sounds, but the architectural commitment is significant. Most data systems today store metadata as JSON schemas or YAML documents passed over REST APIs. Those formats describe structure adequately, but they weren't designed for reasoning.
The evolution Suresh describes moves from metadata schemas to context graph to ontology. The key enabler at the ontology layer is RDF: a knowledge representation standard that stores information as subject-predicate-object triples. Subject: Customer. Predicate: is-a. Object: Person. Once that triple exists, graph traversal produces inference: a customer is a person; persons have PII data; therefore, any table containing customer records is subject to PII governance. That inference doesn't require a rule to be written explicitly; it emerges from the structure of the graph.
"Once you have it in RDF, you can do graph queries and complex semantic search. Through graph traversal, you can infer new knowledge. An example: customer is a person. Person has PII data. These kinds of inferences become possible with RDF." — Suresh Srinivas
When an agent receives a natural-language query such as "How many customers did we have last month?", the path shifts from keyword search to concept resolution. Instead of matching "customer" against table names, the agent asks whether a formal concept for customer exists in the semantic layer, finds its governing definition, locates the canonical data source, and generates a storage-native query for exactly that system.
"AI says, 'Is there a concept for customer?' For this concept, where is the data stored? That's how you discover the data and come up with answers." — Suresh Srinivas
Because the agent knows it's working with a specific platform, it generates compatible syntax rather than generic SQL that may or may not run. Because lineage and quality metadata are attached to the concept, it can surface confidence signals alongside the answer. The output is still model-generated, but grounded in formal definitions rather than inferred from structure.
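A rough sketch of that resolution path, assuming a hypothetical in-memory semantic layer (the `SEMANTIC_LAYER` structure, table name, and quality score are all invented for illustration):

```python
# Hypothetical in-memory semantic layer: a concept resolves to a governed
# definition and a canonical source, not to whichever table name matches.
SEMANTIC_LAYER = {
    "customer": {
        "definition": "A person or org with at least one completed purchase.",
        "canonical_source": {"platform": "snowflake",
                             "table": "analytics.core.dim_customer"},
        "quality_score": 0.97,  # lineage/quality metadata attached to the concept
    },
}

def resolve_and_query(concept_name: str, measure: str, period: str) -> dict:
    concept = SEMANTIC_LAYER.get(concept_name)
    if concept is None:
        # No formal concept exists: surface the gap instead of guessing.
        return {"error": f"No governed definition for '{concept_name}'"}

    src = concept["canonical_source"]
    # The target platform is known, so the agent can emit storage-native
    # syntax for exactly that system instead of generic SQL.
    sql = f"SELECT {measure} FROM {src['table']} WHERE month = '{period}'"
    return {"sql": sql,
            "definition": concept["definition"],
            "confidence": concept["quality_score"]}

print(resolve_and_query("customer", "COUNT(DISTINCT customer_id)", "2025-10"))
```

The structural difference from keyword search is the early return: when no governed concept exists, the system says so rather than picking the closest table name and proceeding.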
Understanding Has to Be Designed, Not Learned
The most common response to all of this is some version of "the next generation of models will solve it": better training data, larger context windows, more sophisticated retrieval. The argument is that the semantic gap is a temporary limitation of the model.
"The understanding is not learned. It needs to be designed, defined, accepted within the organization." — Suresh Srinivas
Shared meaning is not a property of a model. It's a property of an organization. "Revenue" doesn't have one correct definition that a sufficiently capable model will eventually discover; it has the definition your organization has agreed to use, formalized, governed, and made machine-readable. That work cannot be outsourced to training data. It has to be done deliberately, by the people who understand both the business and the data.
This is organizational infrastructure work, not software work. The semantic model needs owners, versioning, and approval workflows. When an AI agent proposes a new definition or classification, a designated reviewer should approve it before it propagates. The agent handles the drafting; a human has to sign off. That human-in-the-loop isn't a concession to AI limitations; it's how shared meaning gets legitimized across a large organization, and it's how you prevent the semantic layer itself from becoming another source of inconsistency.
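For illustration, one minimal way such a workflow could be modeled in code; every name here (`DefinitionProposal`, the agent and reviewer identifiers) is hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DefinitionProposal:
    """An agent-drafted change to the semantic model, pending human sign-off."""
    concept: str
    proposed_definition: str
    proposed_by: str                  # e.g., an automated classification agent
    status: str = "pending"           # pending -> approved | rejected
    approved_by: str | None = None
    approved_at: datetime | None = None

    def approve(self, reviewer: str) -> None:
        # A designated human legitimizes the change before it propagates.
        self.status = "approved"
        self.approved_by = reviewer
        self.approved_at = datetime.now(timezone.utc)

proposal = DefinitionProposal(
    concept="revenue",
    proposed_definition="Recognized revenue net of refunds and credits.",
    proposed_by="metadata-agent",
)
proposal.approve(reviewer="finance-data-steward")  # nothing propagates until this
```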
According to a 2025 Futurum survey, 59% of enterprises are now investing in semantic layers as critical AI infrastructure. That number reflects practitioners who've seen the alternative: agents running on confidently wrong inferences, returning plausible-looking answers that quietly erode trust in AI across the organization.
The organizations that will run AI in production with any real degree of trust aren't the ones with the most sophisticated models or the most aggressive deployment timelines. They're the ones that did the prior work: defining shared meaning before asking AI to reason about it, treating the conceptual model of their business as infrastructure rather than documentation. Every ambiguity problem that took years to untangle in the data catalog era will replay, faster and with more confident wrong answers, for organizations that skip this step. The layer was always going to be necessary. The difference now is that the tools to build it exist, and the cost of not building it is no longer theoretical.
Request a demo to see how OpenMetadata's semantic context platform connects metadata, ontology, and AI governance in a single open-source foundation.