Context engineering: why your AI agent fails when the wrong information goes in

AI agents are designed to reason over information, not to select it. Because most builds do not explicitly design what enters the context window, agents operate on an incomplete or overloaded picture. The result is confident wrong outputs in production: decisions that look correct but reflect the wrong information, not the real situation.

Stefan Finch
Founder, Head of AI
Apr 18, 2026


"Imagine you are sitting at the boardroom table with all of your data sliced into single A4 sheets. You have an envelope. Your job is to decide what data goes into the envelope, and then you pass it to an AI to answer. From that premise, the only question that matters is: what did you put in the envelope, and did the AI actually look at it? That is context engineering."

What context engineering is

Context engineering is the design practice that governs what information enters an AI agent's context window for each task. It is not prompt engineering. It is not model selection. It is a data and architecture decision that sits upstream of both, and it determines output quality before the model runs.

An AI agent does not decide what information it needs. It receives a context window (a finite working space) and reasons over whatever is in it. The quality of that reasoning is bounded by the quality of what was passed in. Too much, and the agent loses focus. Too little, and it acts on an incomplete picture. Contradictory or stale, and the outputs are internally coherent but wrong relative to the actual situation.

The context window is not infinitely large, and size alone does not solve the problem. A curated context of 128,000 tokens with fresh, relevant information outperforms an uncurated context of 200,000 tokens where critical information is buried in noise. Context engineering is the work of making that curation a deliberate design decision rather than an accident of implementation.

Gartner identified context engineering as the breakout AI capability of 2026: the capability most organisations building agent systems have not yet named as a design practice. Anthropic's engineering team documented the same finding from a different direction. The majority of production AI agent failures stem from poor context management, not from inferior language models. The model is reasoning correctly. The information going in is wrong.

This practice connects to what AI agents are and how they work, because context engineering only makes sense as a design layer once the mechanics of agent reasoning are clear. It explains why undesigned context is not a minor inefficiency. It is a quality ceiling that no subsequent refinement can raise.

Why agents fail without designed context

Most teams building AI agents are iterating on prompts when the problem is upstream of the prompt.

The agent produces a confident, well-structured response. It follows the instruction. The logic is sound. But the output does not match the real situation. It matches the information it was given, which was incomplete, stale, or overloaded. From the outside, this looks like a model error or a prompt error. It is neither. The agent reasoned correctly over the wrong inputs.

This is the diagnostic gap that makes undesigned context expensive. A visible failure (the agent crashes, refuses to respond, or produces garbled output) is debuggable. An invisible failure, where the agent produces a confident response based on the wrong data, is not. The organisation acts on outputs it trusts, because the output looks right.

This is not a speculative failure mode. Anthropic's engineering team identified it as the primary cause of production AI agent failures: context rot, where model accuracy degrades as context length increases and critical information gets buried in noise. The agent has the right model. It has a well-written prompt. The context is wrong.

The commercial consequence is specific. Teams that attribute failures to prompt quality spend weeks on prompt iteration. Teams that attribute them to model quality spend budget on evaluation and switching. Both are solving the wrong problem. The failure is at the context layer, which was never explicitly designed. Understanding how that failure manifests requires naming the two forms it takes.

What happens when you don't give agents enough context

Context underload occurs when the agent acts on an incomplete information set. The context is not too large; it is too sparse. The agent is asked to make a decision about a customer, a process, or a system state without access to the data that would make that decision correct. The output is internally consistent. It is wrong relative to reality.

The agent did not fail. It reasoned correctly from what it was given. The failure is upstream — in the information design decision that was never explicitly made. Because the output looks right, the error is invisible until it has consequences.

Common indicators: the agent acts on stale data rather than current state; responses ignore material facts present in the system but absent from the context pack; the system recommends an action that contradicts information held in a different data source it was never given access to.

What happens when you give agents too much context

Context overload occurs when the agent receives more information than it can weight correctly. The transformer architecture creates attention relationships between every pair of tokens in the context. As context length grows, maintaining focused attention becomes mechanically harder, and information at the edges of the context window receives more attention weight than information in the middle.

This is the phenomenon researchers call the lost-in-the-middle effect. Studies show accuracy drops of up to 30% for information positioned mid-context compared to information at the start or end. The practical implication: if an agent receives a large, unstructured context pack and the critical fact is buried in the middle, the model is statistically likely to underweight it. GPT-4o's measured accuracy drops from 98.1% to 64.1% based solely on where relevant information is positioned within the context window, not based on model capability, not based on prompt quality.
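
The practical countermeasure is positional: place the facts that must not be missed at the start and end of the assembled pack, and let lower-stakes background sit in the middle. A minimal sketch of that ordering step, assuming a simple priority field on each context section (the schema is illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Section:
    name: str       # e.g. "task_instruction", "customer_record"
    text: str
    priority: int   # 1 = critical, higher numbers = background

def order_for_attention(sections: list[Section]) -> str:
    """Place critical sections at the edges of the context, where
    positional attention is strongest, and background material in
    the middle, where underweighting is least damaging."""
    critical = [s for s in sections if s.priority == 1]
    background = [s for s in sections if s.priority > 1]
    # Split the critical material across the two high-attention positions.
    mid = (len(critical) + 1) // 2
    head, tail = critical[:mid], critical[mid:]
    ordered = head + background + tail
    return "\n\n".join(s.text for s in ordered)
```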

Context overload also compounds through contradiction. When accumulated information contains conflicting signals (a customer record updated in one system but not another, a policy document alongside a superseded version) the model must arbitrate between them. Research from Galileo AI found a 39% average performance drop across tested models when contradictory information accumulates in context across conversation turns. The model does not flag the contradiction. It produces a response.

Both failure modes share one diagnostic characteristic: the output looks correct. The agent did not fail. It reasoned well from what it was given. The failure is upstream.

How do I know if my AI agent has a context problem?

The diagnostic signature is consistent: the agent produces outputs that are internally coherent but wrong relative to the actual situation, and the error is not flagged as an error. Common indicators: the agent acts on stale data rather than current state; confident responses ignore material facts present in the system but absent from the context pack; output quality varies by task without a clear pattern tied to prompt or model changes. In each case, the agent performed its task correctly. The information it received was not.

Context rot and accuracy

Modern frontier models have context windows of 200,000 to 1 million tokens. The temptation is to use them. The correct design discipline is to use as little as possible — ideally below 30% of the available window.

Why the ceiling? Accuracy degrades as context fills up. It is not a linear relationship. The more you pack in, the harder it becomes for the model to weight signal against noise. The critical information does not disappear from the window — it gets outweighed by everything else in it. Past a certain fill rate, adding more context actively degrades output quality even when the additional content is relevant.

[CHART: Context accuracy vs context fill rate — Stefan to supply]
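
The fill-rate ceiling can be enforced mechanically at assembly time rather than left to discipline. A minimal sketch, assuming an illustrative 200,000-token window and a crude word-count token estimate; a real build would use the model provider's tokenizer:

```python
WINDOW_TOKENS = 200_000   # illustrative frontier-model window size
TARGET_FILL = 0.30        # design ceiling: stay under ~30% of the window

def estimate_tokens(text: str) -> int:
    # Rough heuristic (~0.75 words per token in English prose);
    # swap in the provider's tokenizer for real budgeting.
    return int(len(text.split()) / 0.75)

def enforce_budget(context: str) -> str:
    """Fail loudly when a context pack exceeds the fill ceiling,
    forcing the caller to prune rather than silently degrade."""
    used = estimate_tokens(context)
    budget = int(WINDOW_TOKENS * TARGET_FILL)
    if used > budget:
        raise ValueError(f"context pack uses {used} tokens; budget is {budget}")
    return context
```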

The most common cause of context rot is not a single large context pack assembled for one task. It is accumulation. Multi-turn agents and long conversation threads keep appending context — the history of previous turns, the outputs of previous steps, accumulated background from earlier in the workflow. Each turn adds more. Nothing gets pruned. By the time the agent reaches the task that matters, the signal it needs is buried under everything that came before it.

The fix is architectural: treat context like working memory, not like a filing cabinet. Each task receives a bounded context slice. What is not needed for this task does not go in. What was needed for a previous task does not carry forward unless explicitly required. The context window is not a session log. Designing it that way is the most common source of gradual production degradation that teams attribute to model drift rather than context growth.
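
The difference between the two architectures is easiest to see side by side. A sketch with hypothetical helper names (`fetch_required` and `model.generate` stand in for whatever the build actually uses); the point is that the bounded version rebuilds the slice per task instead of appending to a session log:

```python
# Filing-cabinet pattern: the context grows every turn and is never pruned.
def run_turn_accumulating(history: list[str], task: str, model) -> str:
    history.append(task)                         # nothing is ever removed
    return model.generate("\n\n".join(history))  # signal sinks under the log

# Working-memory pattern: each task gets a fresh, bounded slice.
def run_turn_bounded(task: str, model, fetch_required) -> str:
    slice_parts = fetch_required(task)   # only what THIS task needs
    # Nothing carries forward: the next task assembles its own slice.
    return model.generate("\n\n".join(slice_parts + [task]))
```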

Context engineering vs prompt engineering

The distinction is architectural, not a matter of degree.

Prompt engineering works inside the context window. A system prompt, a chain-of-thought instruction, a role definition: each of these is information that goes into the context window. They govern agent behaviour. They do not govern what other information sits in the window alongside them.

Context engineering works on the window itself. It is the design decision about what structured data, retrieved documents, session state, and instruction content goes into the window for a specific task, assembled at the moment the task runs.

A useful analogy: consider a lawyer preparing for a case. The closing argument is the prompt: precise, sequenced, persuasive. But the lawyer can only argue from the case file in front of them. If the case file contains the wrong documents, the argument is irrelevant. Context engineering is the work of assembling the correct case file before the argument begins.

Most agent builds treat context as an implementation detail. Pass in the system prompt, the conversation history, and whatever data the task requires, and let the model sort it out. Context engineering treats context as a design surface: a set of explicit decisions about what goes in, what stays out, and in what form.

The consequence of conflating the two is direct. A precisely written prompt cannot compensate for a context that is overloaded, incomplete, or incorrectly scoped. The instruction is correct. The information environment it operates over is wrong. Switching to a more capable model makes the same problem worse. The agent reasons more effectively over the wrong information, and its incorrect outputs become more convincing.

No amount of prompt refinement or model upgrading resolves a context design failure. Context engineering is the upstream design layer that determines whether the information environment is correct in the first place.

What context slices are and how to design them

Think of it like sitting at the boardroom table and chopping all the relevant data into one-page sheets, then passing them to the AI in an envelope. The only question that matters is what pieces went into the envelope, and whether the AI actually used them.

That envelope is what Graph Digital calls a context slice: a bounded, task-scoped information pack assembled at runtime for a specific agent task. Not a general-purpose context that serves all tasks. Not a growing conversation history that accumulates across the session. One envelope. One task. Assembled from exactly what that task requires, and nothing else. Context slices operationalise the bounded context principle from domain-driven design for agentic systems: each agent task receives a scoped information boundary assembled at runtime, analogous to a DDD bounded context but broader in scope than chunk selection in retrieval-only RAG.

Four design questions govern each context slice:

  • What does this task require to reason correctly? Not what data is available, but what data is necessary. The distinction forces specificity.
  • What would degrade reasoning if included? Adjacent data, historical records, or conflicting versions that are irrelevant to this task but present in the system.
  • What is the right recency scope? Some tasks require real-time state. Others require session context. Others require historical records from a defined window. The time boundary is a design decision, not a default.
  • What format does the model process most reliably for this information type? Structured tables, prose summaries, key-value pairs, and retrieved document chunks each carry information differently. The format affects how the model weights and uses the content.

Each context slice is assembled at runtime, not pre-built at system design time. The agent task triggers the assembly. The slice is constructed from the relevant sources, scoped to the task, and passed to the model. After the task completes, the slice is not retained in the next task's window unless that task explicitly requires it.
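
Expressed as code, the four questions become fields on a slice specification and steps in the runtime assembly. A sketch under stated assumptions: `SliceSpec`, the `fetch` callable, and the field names are illustrative, not Katelyn's API or any standard library:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable

@dataclass
class SliceSpec:
    required_sources: list[str]     # Q1: what this task needs to reason correctly
    excluded_sources: list[str]     # Q2: what would degrade reasoning if included
    recency: timedelta              # Q3: the time boundary for this task
    render: Callable[[dict], str]   # Q4: the format the model weights most reliably

def assemble_slice(spec: SliceSpec, fetch: Callable[[str, datetime], dict]) -> str:
    """Build one bounded, task-scoped context pack at runtime.
    The result is used for this task and not retained for the next."""
    cutoff = datetime.now() - spec.recency
    records = [
        fetch(source, cutoff)                 # scoped to the recency boundary
        for source in spec.required_sources
        if source not in spec.excluded_sources
    ]
    return "\n\n".join(spec.render(record) for record in records)
```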

Understanding which of these questions was not explicitly answered during your agent build is the starting point for a context architecture review.

How context slices work in practice

Katelyn, Graph Digital's production multi-agent AI platform, implements context slices as the core information architecture pattern.

Katelyn operates as a system of specialist workers, each handling a specific task type within a larger workflow. Every worker receives a bounded context pack scoped to its exact task: the files, state, knowledge, and instruction content that task requires, assembled at runtime. Workers do not share a single global context. Each operates in its own bounded information environment.
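
Katelyn's internals are not public beyond what is described here, so the following is an illustrative sketch of the pattern rather than the platform's code. Each worker type is paired with its own slice specification, and dispatch assembles a fresh pack per task, reusing the `assemble_slice` sketch above:

```python
def dispatch(task_type: str, task: str, workers: dict, specs: dict, fetch) -> str:
    """Route a task to its specialist worker with an isolated,
    freshly assembled context pack. No shared global context."""
    spec = specs[task_type]                # each worker type owns its slice spec
    context = assemble_slice(spec, fetch)  # built at runtime, scoped to this task
    return workers[task_type].run(context, task)
```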

In early prototypes, workers received full-system context: the complete knowledge base, all session state, and the full instruction set for the platform. Output quality degraded consistently. Workers weighted irrelevant content, missed task-specific signals buried in the larger context, and produced responses that were technically coherent but operationally wrong. Model capability and prompt quality were identical between the degraded and corrected versions. The variable was context scope.

Reducing each worker's context pack to its task-specific slice restored output quality without model changes. The production system now assembles per-worker context packs as a deliberate architectural decision: not an optimisation, but the primary design principle. Context engineering is built into the platform from the first worker outwards.

This finding matches what Anthropic's multi-agent architecture research identifies as producing the highest-quality outputs for long-horizon tasks: isolated context per sub-agent, assembled per task, rather than a shared context that grows across the workflow. The production evidence and the research evidence point to the same design conclusion.

Context engineering and memory: the boundary that matters

Context engineering and memory are commonly conflated. They are different design problems at different layers.

Context engineering governs what an agent receives per run: the information assembled for a specific task, used during that task, and not automatically carried forward. Memory governs what persists between runs: the information retained from previous sessions and made available to future tasks.

The boundary matters in both directions. An agent with well-designed context slices but no memory architecture will reason correctly on each task but will not accumulate learning across tasks. An agent with strong memory architecture but undesigned context will accumulate information but reason over it poorly. Both layers require explicit design. Neither substitutes for the other.
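
The boundary can be drawn directly in code: the memory store persists between runs and is read and written explicitly, while the context slice is rebuilt and discarded per task. A sketch with hypothetical interfaces on both layers:

```python
class MemoryStore:
    """Memory layer: persists BETWEEN runs and across sessions."""

    def __init__(self) -> None:
        self._facts: list[str] = []

    def recall(self, task: str) -> list[str]:
        # Naive relevance filter for the sketch: any stored fact
        # sharing a word with the task. Real systems use retrieval.
        words = set(task.lower().split())
        return [f for f in self._facts if words & set(f.lower().split())]

    def retain(self, fact: str) -> None:
        self._facts.append(fact)

def run_task(task: str, memory: MemoryStore, fetch_required, model) -> str:
    slice_parts = fetch_required(task)   # context layer: assembled per run
    remembered = memory.recall(task)     # memory layer: persisted from past runs
    output = model.generate("\n\n".join(remembered + slice_parts + [task]))
    # Retention is an explicit memory decision, never a side effect
    # of the context window:
    memory.retain(f"{task}: {output}")
    return output
```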

Context engineering failures have a specific diagnostic signature: the agent produces outputs that look right but are wrong relative to the actual situation, and the organisation cannot easily determine whether the model is failing or the data is failing. Both look identical from the outside. If that description applies to a current production system, the problem is almost certainly not the model.

The context layer was never explicitly designed. Closing that gap is a days-scale intervention. Waiting until the system is embedded in production workflows and decision-making makes it a weeks-scale problem, and the cost of decisions made on wrong outputs in the interim is harder to quantify than the diagnosis.

The memory layer is a separate design problem, covered at how AI agents handle memory between runs.

What to do with this:

  • Audit what context each agent task in your current system actually receives: not what you intended to pass, but what it gets (a minimal logging sketch follows this list)
  • Apply the four design questions to your highest-stakes agent task first
  • If context design was not an explicit decision during the build, treat it as a gap to close before extending the system further
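
A context audit can start as simple instrumentation: record exactly what each task's assembled slice contained, so intended and actual context can be diffed. A minimal sketch; the log fields are suggestions, not a standard:

```python
import json
import logging
from datetime import datetime, timezone

audit_log = logging.getLogger("context_audit")

def log_slice(task_id: str, sources: list[str], context: str) -> None:
    """Record what a task's context pack actually contained,
    so it can be compared against what the design intended."""
    audit_log.info(json.dumps({
        "task_id": task_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sources": sources,                      # where the slice was assembled from
        "token_estimate": len(context.split()),  # crude size signal
        "preview": context[:200],                # enough to spot stale or missing data
    }))
```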

If you are sponsoring the build internally, a context architecture review before the next sprint is the most defensible step you can take.

Graph Digital's AI agent development practice includes context architecture review as part of every build readiness assessment, auditing what information each agent task receives, whether the context design matches task requirements, and where overload or underload is producing the failures that look like model problems.

What is context engineering?

Context engineering is the design practice of deciding exactly what information an AI agent receives per task: what enters the context window, in what form, and at what scope. It is distinct from prompt engineering, which governs how the agent is instructed, and from memory architecture, which governs what persists between runs. Context engineering determines agent output quality before the model runs.

What is the difference between prompt engineering and context engineering?

Prompt engineering defines how an agent behaves: the instructions, constraints, and reasoning guidance that govern its response. Context engineering defines what information the agent has available when it executes those instructions. Both affect output quality. Only context engineering can fix a context problem. Prompt refinement cannot compensate for missing, overloaded, or incorrectly scoped information in the context window.

What are context slices?

A context slice is a bounded, task-scoped information pack assembled at runtime for a specific agent task. Rather than passing a general-purpose context that serves all tasks, a context slice contains exactly the data, state, and retrieved content that one task requires, and excludes everything else. The design question for each slice: what does this task require to reason correctly, and what would degrade reasoning if included?

What is the difference between context engineering and context management?

Context engineering is the design discipline: the upstream architectural decisions about what information each agent task should receive, in what form, and at what scope. Context management is the operational process of handling context at runtime. Context engineering makes context management predictable; without it, context management is reactive and inconsistent.


Stefan Finch — Founder, Graph Digital

Stefan Finch is the founder of Graph Digital, advising leaders on AI strategy, commercial systems, and agentic execution. He works with digital and commercial leaders in complex B2B organisations on AI visibility, buyer journeys, growth systems, and AI-enabled execution.

Connect with Stefan: LinkedIn

Graph Digital is an AI-powered B2B marketing and growth consultancy that specialises in AI visibility and answer engine optimisation (AEO) for complex B2B companies. AI strategy and advisory →