I attended a meetup with a panel discussion on data and AI last week, and the conversation kept coming back to semantic layers. Why?
Well, LLMs are comparatively terrible at writing SQL to answer business questions from enterprise data. This is in contrast to their quite good performance at understanding and writing code and answering questions from text. Aren’t these similar tasks? Why is SQL so hard?
Fortunately, I’ve been following David Jayatillake’s Substack for a few years, so this was not much of a surprise to me. David, founder of Delphi Labs and now VP AI at Cube, has written a lot about the importance of semantic layers for LLMs to effectively leverage data. If you want a deeper dive into the subject, this post would be a good place to start.
It turns out that the challenge AI is having is related to a human challenge that has existed for some time: context and consistent definitions.
LLMs can consume SQL just as readily as they have consumed all of the code on GitHub and the internet at large. But merely consuming SQL leaves LLMs lacking context. Data is messy.
Let’s put LLMs aside for a moment. If you drop a world-class data analyst into an unfamiliar enterprise Snowflake deployment, will they be able to answer arbitrary business questions? Probably not. They might have a hard time in their OWN Snowflake environment.
Successful SQL queries require that an analyst, or LLM, know what to look for and where to look for it. That sounds simple, but it turns out to be very complicated. For example, a request for the number of customers might mean something different coming from finance than from the sales team (perhaps finance cares only about parent organizations, but sales sometimes treats subsidiaries as different companies).
An analyst, human or AI, needs to know what a particular question actually means in the context of the person asking it AND needs to know where to look for the answer. There might be many tables with “customers” in the name. Which one are you going to query? Not that one; that’s Jim’s side project and will give you the wrong answer, and your CEO will yell at you, or worse yet, your CEO won’t realize it’s wrong and will present it to the board 😨.
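To make the finance vs. sales example concrete, here is a minimal sketch in Python using an in-memory SQLite database. Everything in it is an assumption for illustration: the `accounts` table, the `parent_id` column, the company names, and the two "definitions" of a customer are all made up, and a real warehouse would be far messier.

```python
import sqlite3

# Hypothetical schema: each account may point to a parent account.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        parent_id INTEGER REFERENCES accounts(id)  -- NULL for a parent organization
    );
    INSERT INTO accounts (id, name, parent_id) VALUES
        (1, 'Acme Holdings', NULL),
        (2, 'Acme Europe', 1),
        (3, 'Acme Asia', 1),
        (4, 'Globex', NULL);
""")

# Assumed finance definition: count only parent organizations.
finance_count = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE parent_id IS NULL"
).fetchone()[0]

# Assumed sales definition: count every account, subsidiaries included.
sales_count = conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]

print(f"finance: {finance_count}, sales: {sales_count}")
```

Both queries are valid SQL against the same table, and both answer "how many customers do we have?" Neither the analyst nor the LLM can tell which one is right without knowing who is asking.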
We talked last week about some of the problems that can arise with inconsistent definitions of customers and revenue 👀.
Semantic layers and some related concepts we’ll touch on later seek to solve this problem. Let’s revisit David’s article for a definition and some history.
A semantic layer is a business representation of corporate data that helps end users access data autonomously using common business terms managed through Business semantics management. A semantic layer maps complex data into familiar business terms such as product, customer, or revenue to offer a unified, consolidated view of data across the organization.
By using common business terms, rather than data language, to access, manipulate, and organize information, a semantic layer simplifies the complexity of business data. Business terms are stored as objects in a semantic layer, which are accessed through business views.
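As a rough illustration of that definition, here is a toy sketch in Python of business terms stored as objects and compiled into SQL. It is not how any particular product works; the `Metric` class, the `compile_query` function, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Metric:
    """One business term and the single agreed way to compute it."""
    name: str
    table: str
    expression: str
    where: Optional[str] = None


# Hypothetical definitions; the table and column names are invented.
SEMANTIC_LAYER = {
    m.name: m
    for m in [
        Metric("customers", "crm.accounts",
               "COUNT(DISTINCT account_id)", "parent_id IS NULL"),
        Metric("revenue", "finance.invoices",
               "SUM(amount_usd)", "status = 'paid'"),
    ]
}


def compile_query(term: str) -> str:
    """Translate a business term into SQL using the shared definition."""
    try:
        metric = SEMANTIC_LAYER[term]
    except KeyError:
        raise KeyError(f"No agreed definition for business term: {term!r}") from None
    sql = f"SELECT {metric.expression} AS {metric.name} FROM {metric.table}"
    if metric.where:
        sql += f" WHERE {metric.where}"
    return sql


print(compile_query("revenue"))
```

The point of the sketch is that the person (or LLM) asking only needs to say "revenue". The decision about which table to use and which rows count is made once, in one place, by someone who knows the business.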
This is not a new problem. In 1992, Business Objects patented what was probably the first semantic layer, which:
provides a new data representation and a query technique which allows information system end users to access (query) relational databases without knowing the relational structure or the structure query language (SQL)
Semantic layers have existed in various forms since then, most commonly within BI tools: Business Objects first, with many others to follow, and more recently LookML in Looker. Embedding the semantic layer in BI tools was not perfect but clearly good enough. We haven’t seen the standalone semantic layer take off. Sure, AtScale has been around since 2013. Lloyd Tabb, of Looker fame, created Malloy in part to improve on SQL itself but with a semantic layer as a core component. Cube is a very popular open source project and a growing commercial company.
Why haven’t semantic layers taken off? It’s a twofold challenge:
1. Semantic layers are not simply a technology solution; they require implementation with deep enterprise knowledge to be useful
2. Historically, no one wants to own them as a standalone product
Is AI going to change all of this? There are some pretty big clues from 2 AI giants, Palantir and OpenAI, that suggest it might.
Palantir is, by the numbers, the second most successful AI company, after OpenAI, with ~$2B in annual revenue and a $200B market cap. For context, OpenAI just closed a record $40B funding round at a $300B valuation and has an estimated $5B annual revenue run rate.
Folks have leveled criticisms that Palantir is just warmed-over Spark, or Databricks with a 10x government sales team, etc. But criticisms aside, what Palantir does very, very well is understand the organizations and problem spaces in which it operates (hat tip to Donald Farmer and Malthe Karbo for insights on this). Palantir develops ontologies for its customers.
Let’s look at Palantir’s definition of an ontology:
The Palantir Ontology is an operational layer for the organization. The Ontology sits on top of the digital assets integrated into the Palantir platform (datasets and models) and connects them to their real-world counterparts, ranging from physical assets like plants, equipment, and products to concepts like customer orders or financial transactions. In many settings, the Ontology serves as a digital twin of the organization, containing both the semantic elements (objects, properties, links) and kinetic elements (actions, functions, dynamic security) needed to enable use cases of all types.
Does that sound familiar? It’s not exactly the same thing as a semantic layer but there are definitely shared elements.
How does Palantir develop ontologies for their customers? The Forward Deployed Engineer (FDE). According to LinkedIn, there are 846 forward deployed engineers at Palantir. These folks are embedded within the customer to understand the business and map that understanding back to the ontology to power AI and analytics.
It seems that the FDE has been a key component of Palantir’s success and, as an aside, also a key reason why there are so many successful ex-Palantir founders.
OpenAI seems to be following Palantir’s lead. They recently created an FDE function led by Colin Jarvis and have 10 open FDE roles.
Palantir and OpenAI are only 2 data points, but they are compelling. Their approach suggests 2 things to me:
1. A human on the ground understanding the business is critical to the success of semantic layers.
2. Semantic layers are indeed critical to AI success in enterprise data.
The unanswered question, in my mind, is whether standalone semantic layers will be part of the solution or whether AI or data platform companies will develop (or acquire) their own to solve this problem.
Warmly,
Paul Dudley