Baking the context cake

If you ask 10 executives which roles they need to run an AI adoption project in their company, 9 of them won't mention data engineers. The AI development community has spent immense effort over the past few years on models and orchestration methods, but only recently have we started to pay real attention to the context that AI operates on. I believe this is a major reason why 80% of enterprise AI projects fail and why leading agents complete only 30-35% of multi-step tasks.

Let's review the recent advances in context management and see why context is much more than search over company documents.

A deceptively simple problem

As LLMs grew in capability, developers converged on two major ways of grounding them in relevant knowledge: including whole documents in the prompt, and giving the model access to tools such as search or retrieval APIs.

It is easy to see why the former approach doesn't scale well. Even though frontier models support context lengths of a million tokens, in practice their ability to use knowledge passed in the prompt diminishes quickly as the prompt grows. This problem came to be known as "context rot".

The tool-based approach looks more scalable on the surface. Indeed, it shouldn't matter how many documents you push into an Elasticsearch cluster if the model can simply search over them and pull the information it needs. Dig deeper, though, and you notice that even recent models tend to get confused by the number of tools available to them. That's on top of all the usual concerns of running a well-structured search engine: indexing and retrieval performance, access control, ranking, and so on.
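
To make this concrete, here is roughly what a single search tool definition looks like in the JSON-schema style of function calling that many model APIs share. The tool name and parameters are my own illustration, not any specific vendor's; now imagine handing the model dozens of these.

```python
# A minimal sketch of the tool-based approach: one tool definition in the
# common JSON-schema function-calling format. Names here are hypothetical.
search_tool = {
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Full-text and semantic search over the company knowledge base.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The search query."},
                "top_k": {"type": "integer", "description": "Number of results to return."},
            },
            "required": ["query"],
        },
    },
}
```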

On top of this, think about what agents need to know to do their tasks well. It's not just company documentation or emails; they also need access to aggregated data such as financial reports or product metrics. We could allow an agent to send SQL queries to the data warehouse, but how would the poor thing make sense of ten years of accumulated migrations and schema changes?

The problem of context management requires a solution that:

  • scales well with the size of the organization's knowledge base,
  • allows both fact retrieval and aggregation queries,
  • and, with all that, presents knowledge in a form that agents can successfully use.

Thinking in layers

Back in January, OpenAI released an article titled "Inside OpenAI’s in-house data agent". The article explains how they built an agent that answers "high-impact data questions" through natural language: teams like Finance, Engineering or GTM use it to evaluate new product launches or track financial health.

At the scale OpenAI operates, they couldn't just give the agent access to their data lake or warehouse and leave it to run SQL queries. Even for users who are familiar with the structure of the internal database and the schemas of the major tables, producing these reports is a daunting task. Invite a fellow data analyst for a beer, and you will quickly understand how complicated this gets. You need to know where the data comes from (its data lineage), how frequently it is updated, whether it is accurate, and who is responsible for producing it. You also need to be able to link warehouse tables to the company's vocabulary: which table contains "customers", which contains "orders", and how they are related.

All these concerns led OpenAI to a layered approach to context retrieval.

Context retrieval in OpenAI's data agent. © OpenAI 2026

In order to understand the company's data and the user's request, the agent reads from the following sources (a rough sketch of how they might be combined follows the list):

  • Historical view on table usage. The agent recognizes which tables are frequently joined together.
  • Human annotations. Curated descriptions provided by experts.
  • Codex enrichment. Codebase provides insights into how data is produced and transformed.
  • Institutional knowledge. Documentation from Slack, Google Docs or Notion.
  • Memory. When the agent is given corrections or discovers nuances about certain data questions, it's able to save these learnings for next time.
  • Runtime context. The agent issues live queries to the warehouse in order to infer additional details about data and schema.
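
Here is a rough sketch, mine rather than OpenAI's, of how these six sources might be combined into a single context for the model. Every layer name and lookup below is a toy stand-in:

```python
from typing import Callable

def build_context(question: str, layers: dict[str, Callable[[str], str]]) -> str:
    """Query each context layer and concatenate the labeled results."""
    sections = []
    for name, lookup in layers.items():
        result = lookup(question)  # each layer returns text relevant to the question
        if result:
            sections.append(f"## {name}\n{result}")
    return "\n\n".join(sections)

# Toy stand-ins for the six sources; in reality each is backed by its own store.
layers = {
    "table-usage history": lambda q: "orders is usually joined to customers on customer_id",
    "human annotations": lambda q: "orders: one row per confirmed purchase",
    "codex enrichment": lambda q: "orders is rebuilt nightly by the stg_orders pipeline",
    "institutional knowledge": lambda q: "Finance reports revenue from orders.amount_usd",
    "memory": lambda q: "exclude test accounts (is_internal = true)",
    "runtime context": lambda q: "orders(customer_id, amount_usd, created_at, ...)",
}

print(build_context("What was Q3 revenue per customer?", layers))
```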

As a data engineer myself, I can't help but be amused at how much complex processing goes into approximating what my fellow DEs do daily.

OpenAI proposes the terminology of context layers, which let the agent retrieve the necessary information from various sources at different granularities. We will see that OpenAI's agent is not unique in this; in fact, we can speak of an emerging pattern for context management.

Layered Context Pattern

In recent months, several other companies have shared their approaches to building organizational context: Writer, Glean, Sentra, Embra, Zep. We have also seen open-source projects in this area: Mem0, Databao, Graphiti.

Across these projects we can see a common pattern, implemented in very different ways: layered context indexing and retrieval.

This pattern aims to support various types of queries: granular fact retrieval, summaries of larger topics, and aggregated metrics. To achieve this, knowledge is collected and organized across several systems (a sketch of routing between these layers follows the list):

  • Search over documentation, on the level of whole documents or chunks. This search often relies on a combination of lexical and semantic approaches.
  • Entity layer which captures structured relationships between entities described in the documents. Graph databases provide a good storage and retrieval basis for this data.
  • Catalog/topic layer. Engineers use LLMs to extract summaries from documents and group them into topics. This gives agents the ability to understand larger corpora of text and route their requests to the right topic clusters.
  • Runtime/live layer. Agents have the capability to access databases and warehouses to retrieve information or run aggregation queries.
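
A minimal sketch of routing across these layers might look as follows. The functions are hypothetical stand-ins for real backends (a search index, a graph database, an LLM-built catalog, a warehouse):

```python
from enum import Enum, auto

# Hypothetical stand-ins for the four layers of the pattern.
def document_search(q: str) -> str: return f"chunks matching '{q}'"
def entity_graph(q: str) -> str: return f"entities and relations around '{q}'"
def topic_catalog(q: str) -> str: return f"summary of the topic cluster for '{q}'"
def warehouse_query(q: str) -> str: return f"rows returned by SQL derived from '{q}'"

class QueryKind(Enum):
    FACT = auto()       # granular fact retrieval
    TOPIC = auto()      # summary of a larger topic
    AGGREGATE = auto()  # aggregated metrics

def route(kind: QueryKind, query: str) -> str:
    """Send each query type to the layer best suited to answer it."""
    if kind is QueryKind.FACT:
        # document chunks, optionally expanded through the entity graph
        return document_search(query) + "\n" + entity_graph(query)
    if kind is QueryKind.TOPIC:
        return topic_catalog(query)
    return warehouse_query(query)

print(route(QueryKind.AGGREGATE, "monthly active users by region"))
```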

Some projects extend this basic pattern with more sophisticated approaches to creating and managing memory. For example, Zep/Graphiti employ a bi-temporal model: they track when facts occurred and when they were recorded. This helps agents distinguish outdated information from current truth.
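
In spirit, a bi-temporal record keeps two separate timelines for each fact: when it was true in the world, and when the system learned about it. The sketch below uses my own field names rather than Graphiti's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Fact:
    statement: str
    valid_from: datetime       # when the fact became true in the world
    valid_to: datetime | None  # None = still true as far as we know
    recorded_at: datetime      # when the system learned about it

def current_facts(facts: list[Fact], now: datetime) -> list[Fact]:
    """Keep only facts still valid at `now`, regardless of when recorded."""
    return [
        f for f in facts
        if f.valid_from <= now and (f.valid_to is None or f.valid_to > now)
    ]

facts = [
    Fact("Alice leads the Payments team",
         valid_from=datetime(2022, 1, 1), valid_to=datetime(2024, 6, 1),
         recorded_at=datetime(2024, 6, 2)),
    Fact("Bob leads the Payments team",
         valid_from=datetime(2024, 6, 1), valid_to=None,
         recorded_at=datetime(2024, 6, 2)),
]
print(current_facts(facts, datetime(2025, 1, 1)))  # only Bob's fact survives
```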

Future of context

Most of the tools used in context management are not new. Lexical and semantic search and graph databases have been areas of active development for decades. We are now "just" composing them in a way that is accessible to modern LLMs and agents. This composition brings new questions and problems of its own. For instance, conflict resolution is a hot topic that remains unsolved at scale: when multiple sources say different things, someone has to decide whether to merge them, invalidate one, or skip them entirely. Mem0 uses LLM-powered resolution, but at scale it is expensive and non-deterministic.
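
For contrast, here is a toy deterministic policy: the most recently recorded value wins, and superseded values are kept for audit. This is purely illustrative and not Mem0's logic; it is cheap and predictable, but it throws away the nuance an LLM-based resolver can capture:

```python
# A toy deterministic conflict resolver: newest record wins. Illustrative only.
def resolve(existing: dict, incoming: dict) -> tuple[dict, list[str]]:
    """Merge two {key: (value, recorded_at)} maps; the newest record wins."""
    merged, superseded = dict(existing), []
    for key, (value, ts) in incoming.items():
        if key not in merged or ts > merged[key][1]:
            if key in merged:
                superseded.append(f"{key}: {merged[key][0]!r} superseded")
            merged[key] = (value, ts)
    return merged, superseded

old = {"payments_lead": ("Alice", 1718236800)}
new = {"payments_lead": ("Bob", 1725148800)}
print(resolve(old, new))
```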

Resolving access and governance issues across different sources is another problem that will have to be solved at enterprise scale. The security of data and organizational content is paramount, and it will have to be carefully balanced against the usability and reliability of agentic systems.

Given that enterprise AI adoption is only just beginning to rely on agents, I see strong indications that context engineering will be a major factor in adoption projects going forward. Projects like OpenAI's Frontier and JetBrains' Databao show that it is also a major target of investment these days.

Sources and further reading

  • Inside OpenAI’s in-house data agent. How OpenAI built an in-house AI data agent that uses GPT-5, Codex, and memory to reason over massive datasets and deliver reliable insights in minutes.
  • Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. A scalable memory-centric architecture that dynamically extracts, consolidates, and retrieves salient information from ongoing conversations, with a graph-based variant for relational structure.
  • Zep: A Temporal Knowledge Graph Architecture for Agent Memory. A memory layer for AI agents built around Graphiti, a temporally-aware knowledge graph engine that synthesizes unstructured conversational data and structured business data while maintaining historical relationships.
  • Knowledge Graph. A graph-based RAG that achieves higher accuracy than traditional RAG approaches using vector retrieval.
  • Enterprise Graph: Powering AI with Deep Organizational Knowledge. How Glean’s Enterprise Graph connects people, projects, and processes to deliver AI that understands your business context.
