
How to design memory in a multi-agent AI system — a practical guide

Memory in a multi-agent AI system requires four typed layers, each with a defined lifetime and write permission. Collapsing them into a shared blob produces five failure modes that compound with agent count, making production deployments unreliable and failures untraceable.

Stefan Finch
Founder, Head of AI
Apr 18, 2026


When you are building a multi-agent AI system, memory feels like a detail you will sort out once the core logic works. It is not. I'm Stefan Finch, founder of Graph Digital. I designed the memory architecture for the Katelyn Skills OS, a production multi-agent platform running hundreds of pipeline cycles. The decision that changed everything was not a model choice or a framework change. It was architectural: partitioning memory into four typed layers instead of treating it as a single shared store.

Memory architecture is not a configuration detail. It is what determines whether a production multi-agent system stays coherent or degrades as agent count increases.

This guide gives you the design framework, grounded in standard AI memory concepts, that I use in production. The four layers, the write permission model, and the production proof are all here.

How to design AI agent memory in a multi-agent system

Memory in a multi-agent AI system requires four typed layers, each with a defined lifetime and write permission. In-context memory holds short-term working state, ephemeral per run. Persisted state is durable structured truth, in typed JSON. External memory is shared knowledge and doctrine, versioned and governed. Episodic memory is the append-only event history for audit and replay. Collapsing all four into a shared context produces five compounding failure modes: context bloat, stale state, conflicting writes, silent doctrine drift, and failures with no traceable cause, all of which multiply as agent count increases.

The framework is straightforward. The reason it matters commercially is where most teams stop short.

Why AI agent memory architecture is a first-class decision in multi-agent systems

AI agent memory architecture is a commercially significant decision. Gartner forecasts that AI agents will intermediate more than $15 trillion in global B2B purchases by 2028. Vellum's analysis of enterprise AI adoption projects that 40% of enterprise applications will use task-specific agents by 2026. The architectural decisions being made in current production builds will determine whether those deployments hold or require costly remediation at scale.

The memory architecture decision sits at the centre of that reliability question. Most multi-agent systems start with a shared context window or a single JSON file. The prototype works. Two agents coordinate cleanly. The demo is stable. Then the system scales, with more agents, more runs, more accumulated state, and the behaviour becomes unpredictable. Agents contradict each other. State from previous runs bleeds into current ones. When something goes wrong, there is no traceable record of what any agent knew when it acted.

The failure does not announce itself as a memory problem. It presents as inconsistency, as debugging time, as the engineering team spending cycles on state investigation instead of capability extension.

Memory in a multi-agent AI system is not a technical detail to address after the core logic works. It is a first-class architectural decision; standard memory design concepts exist precisely because these distinctions matter. This guide is written for teams building or evaluating multi-agent systems. If your workflow runs a single agent on a stateless task, the four-layer architecture is not the subject here. If you are still evaluating what AI agents are and how they work, that question belongs before this one.

Why shared context fails in multi-agent AI systems — five compounding failure modes

The shared context blob does not degrade at a steady rate. It compounds. Every new agent and every accumulated run multiplies the probability that two or more failure modes interact simultaneously. The two-agent demo is stable. The ten-agent production system is not.

"The shared context blob doesn't degrade at a steady rate. Each agent that joins the system multiplies the noise — by the time you have four or five agents reading the same store, errors compound on top of errors."

Stefan Finch, Founder, Graph Digital

Five failure modes drive the degradation.

1. Context bloat. Each agent loads everything it might need because there is no read scope boundary. Context size grows. Retrieval becomes noisy. Token consumption rises. The system loads more than it needs on every run.

2. Stale state. Data from previous runs persists because there is no lifecycle mechanism. Agents act on stale information. Decisions made in run 47 bleed into run 52. There is no expiry, no versioning, no signal that a value is no longer current.

3. Conflicting writes. Multiple agents write to the same store without an authority model. Agent A updates a state value. Agent B overwrites it. Neither knows the other acted. The result is silent corruption: not an error, not a log entry, just wrong state.

4. Silent doctrine drift. Shared rules, schemas, and policies sit in the same blob as working context. Any agent can overwrite them. Rules change without versioning. Behaviour changes without audit trail. The system you tested last week is not the system running today.

5. Failures you can't trace back to a cause. When something goes wrong, there is no governed record of what each agent knew when it acted. You cannot reproduce the failure. You cannot isolate the cause. You spend sessions on state investigation because there is no episodic record to query.

Each of these failures is consequential on its own, but they also compound: context bloat raises the probability that stale state is acted on; conflicting writes interact with silent doctrine drift; and untraceable failures are the aggregate result, because no single cause can be isolated.

| Memory dimension | Shared blob | Typed memory stores | Consequence |
| --- | --- | --- | --- |
| Working context | Mixed with persistent state, no expiry | Isolated per run, expires at run end | Stale context bleeds into subsequent runs |
| Structured state | Overwritable by any agent | Typed JSON, authority-controlled | Conflicting updates corrupt workflow state |
| System doctrine | Any agent can overwrite | Versioned, read-only for workers | Rules drift silently; behaviour changes without audit |
| Event history | Absent or reconstructed | Append-only, no deletions | Failures cannot be reproduced or debugged at scale |

The diagnostic question for your current system: can any agent determine what type of information it is consuming, how current it is, and who had authority to write it? In a shared blob model, that question cannot be answered. The four typed layers are the architectural answer.

In-context memory, episodic memory and the four typed layers every multi-agent AI system needs

The four-layer model takes four standard AI memory concepts as its primary design frame: in-context memory, persisted state, external memory, and episodic memory. Each layer has a defined lifetime, a defined read scope, and defined write permissions. Graph Digital's Typed Memory Stores governance model runs all four in production. The standard concepts are the architecture; Typed Memory Stores is the implementation.

Here is how each layer works and how the absence of each breaks production systems.

Layer 1 — In-context memory (working memory)

How it breaks without a bounded layer: Without a defined in-context layer, agents load everything they might need — previous run outputs, shared doctrine, accumulated state — into a single context. There is no expiry, no scope boundary. Context bloat is the result.

Why it breaks: The agent has no instruction to treat working context as ephemeral. Everything is persistent by default. What should expire at run end instead accumulates across runs.

What good looks like: In-context memory is ephemeral. It contains the agent's task inputs, retrieved context for the current run, and intermediate reasoning. It expires when the run ends. Nothing persists from in-context memory to the next run unless explicitly promoted to Layer 2. Each agent receives a bounded context pack: exactly what it needs for its task.

Checkpoint: Does each agent in your system load a defined, bounded context for its task, or does it load everything available and filter at inference time?
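The bounded context pack described above can be sketched in a few lines. This is an illustrative sketch, not the Katelyn implementation; all names (`ContextPack`, `build_context_pack`, the example keys) are hypothetical. The point is that the agent receives only the keys its contract declares, and the pack is discarded at run end:

```python
# Hypothetical sketch: a bounded, ephemeral context pack built from an
# agent's declared read scope, rather than handing it everything available.
from dataclasses import dataclass, field

@dataclass
class ContextPack:
    """Working memory for one run; discarded when the run ends."""
    run_id: str
    items: dict = field(default_factory=dict)

def build_context_pack(run_id: str, available: dict, read_scope: set) -> ContextPack:
    # Only keys the agent's contract declares are loaded; nothing else leaks in.
    scoped = {k: v for k, v in available.items() if k in read_scope}
    return ContextPack(run_id=run_id, items=scoped)

# Everything currently in the system...
available = {
    "task_brief": "Draft the Q3 summary",
    "approval_flags": {"legal": True},
    "previous_run_output": "stale draft",   # should NOT reach this agent
    "style_rules": "doctrine text",          # read via Layer 3, not the blob
}

# ...but the agent only sees its declared inputs.
pack = build_context_pack("run-052", available,
                          read_scope={"task_brief", "approval_flags"})
print(sorted(pack.items))
```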

Layer 2 — Persisted state (structured state)

How it breaks without a typed layer: Durable facts — workflow stage, asset status, approval flags — live in the same blob as ephemeral working context. Any agent can overwrite them. There is no authority check, no version record, no signal that a value changed.

Why it breaks: A shared blob treats durable facts and ephemeral working state as the same type of thing. Same lifetime, same write access. The governance distinctions that durable state requires cannot be applied.

What good looks like: Persisted state is durable machine-readable typed JSON. It contains facts that outlast a single run: workflow stage, decision outcomes, approval flags. Write authority is assigned by agent role; only the agents designed to update a value are permitted to do so. Changes are timestamped. The state from two weeks ago is recoverable.

Checkpoint: Can you identify which agent updated which state value, when, and why, for any run in the past month?
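Authority-controlled persisted state can be sketched as follows. The `WRITE_AUTHORITY` map and class names are assumptions for illustration, not production code; the essentials are that writes are checked against role, and every change is timestamped and recorded:

```python
# Sketch of Layer 2: typed persisted state with role-based write authority
# and a timestamped change record. Key and role names are illustrative.
import time

WRITE_AUTHORITY = {            # which agent role may update which state key
    "workflow_stage": {"orchestrator"},
    "approval_flags": {"governance"},
}

class PersistedState:
    def __init__(self):
        self.values: dict = {}
        self.history: list = []    # every change is recorded, never erased

    def write(self, key: str, value, agent_role: str):
        allowed = WRITE_AUTHORITY.get(key, set())
        if agent_role not in allowed:
            raise PermissionError(f"{agent_role} may not write {key}")
        self.values[key] = value
        self.history.append({"key": key, "value": value,
                             "by": agent_role, "at": time.time()})

state = PersistedState()
state.write("workflow_stage", "review", agent_role="orchestrator")  # permitted
try:
    state.write("approval_flags", {"legal": True}, agent_role="worker")
except PermissionError as err:
    print(err)   # workers have no authority over approval flags
```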

Layer 3 — External memory (shared knowledge)

How it breaks without a governed layer: System doctrine — skills, schemas, policies, rubrics — sits in the shared blob alongside working context. Worker agents can overwrite it. Rules drift silently. The system you tested is not the system running.

Why it breaks: There is no permission asymmetry between the agents that execute tasks and the agents (or governed flows) that are authorised to update system doctrine. When any agent can write to the doctrine layer, versioning is impossible and drift is inevitable.

What good looks like: External memory is versioned, governed, and read-only for all worker agents. System doctrine — the rules, schemas, and policies that govern how agents behave — lives here. Only governance agents, through controlled update flows, can change it. Worker agents read doctrine. They do not touch it.

Checkpoint: Is there a single, versioned source of truth for system doctrine that no worker agent can overwrite?
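One way to make doctrine read-only for workers while keeping it versioned is sketched below. Class and method names are assumptions, not the production implementation; the mechanism shown is that workers only ever receive an immutable view, and every governed update bumps the version so drift is visible:

```python
# Sketch of Layer 3: a versioned doctrine store. Workers get a read-only
# view; only a governed update flow can change content.
from types import MappingProxyType

class DoctrineStore:
    def __init__(self, rules: dict):
        self._rules = dict(rules)
        self.version = 1

    def read_view(self):
        # MappingProxyType yields a live view workers cannot mutate.
        return MappingProxyType(self._rules)

    def governed_update(self, key, value, approved_by: str):
        # In production this flow would carry review and audit steps;
        # here it just records the approver and bumps the version.
        self._rules[key] = value
        self.version += 1
        return {"version": self.version, "approved_by": approved_by}

doctrine = DoctrineStore({"tone": "formal"})
view = doctrine.read_view()
try:
    view["tone"] = "casual"          # a worker attempting a direct write
except TypeError:
    print("workers cannot write doctrine")

doctrine.governed_update("tone", "casual", approved_by="governance-agent")
print(doctrine.version)  # 2
```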

Layer 4 — Episodic memory (event history)

How it breaks without an append-only layer: When something fails, there is no governed record of what any agent knew when it acted. Failures cannot be reproduced. The engineering team reconstructs events from logs, context window traces, and inference, which is unreliable and time-consuming.

Why it breaks: Without an append-only episodic layer, the event record is either absent or reconstructed. Neither supports production debugging at scale or compliance audit requirements.

What good looks like: Episodic memory is the append-only event log. It records which agent ran, what inputs it used, what decisions it took, and what failed, for every run. No record is modified. No record is deleted. When something goes wrong, the episodic trail gives you a complete, governed account of what every agent knew and when. Debugging a state-related issue moves from multi-session investigation to single-session resolution.

Checkpoint: Can you reproduce any agent's reasoning from a specific run two weeks ago without interrogating the model?
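An append-only episodic layer can be as simple as a store that exposes only append and replay, with no update or delete path at all. A minimal in-memory sketch with illustrative names (a production version would typically append to a log file instead):

```python
# Sketch of Layer 4: an append-only episodic log. The class deliberately
# exposes no method to modify or delete a record.
class EpisodicLog:
    def __init__(self):
        self._records: list = []

    def append(self, agent: str, run_id: str, inputs: dict, decision: str):
        self._records.append({"agent": agent, "run_id": run_id,
                              "inputs": inputs, "decision": decision})

    def replay(self, run_id: str):
        # Reconstruct exactly what happened in one run, in order.
        return [r for r in self._records if r["run_id"] == run_id]

log = EpisodicLog()
log.append("researcher", "run-047", {"query": "q3 data"}, "fetched 12 sources")
log.append("writer", "run-047", {"sources": 12}, "drafted summary")

# Two weeks later: replay the run without interrogating the model.
for event in log.replay("run-047"):
    print(event["agent"], "->", event["decision"])
```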

The write permission model that governs AI agent memory in multi-agent systems

The four layers resolve the lifetime problem. The write permission model resolves the authority problem.

A shared blob has no permission mechanism. Even a well-structured blob, carefully labelled, properly formatted, thoughtfully organised, has no way to prevent one agent from overwriting another's state. It cannot stop a worker agent from modifying system doctrine it should only be reading. Structure without permission enforcement is fragile. It degrades as agent count increases and as the people who designed the original structure move on.

The permission model is simple in principle:

| Agent role | Can write to |
| --- | --- |
| Worker agents | In-context memory (own run only), persisted state (observations and task outputs), episodic memory (append-only) |
| Orchestrators | Persisted state (workflow state and decision records), episodic memory (append-only) |
| Governance agents | External memory (via governed update flow), persisted state |
| Any agent | Episodic memory: append-only, no deletions permitted |

Three rules govern the model.

Workers read doctrine, they do not change it. Worker agents consume system doctrine from external memory — skills, schemas, policies — but they have no write access to it. The only way doctrine changes is through a governed update flow, triggered by a governance agent.

Orchestrators manage workflow state without touching doctrine. Orchestrators read from external memory and write to persisted state. They update workflow stage, route decisions, and track completion. They do not modify system doctrine.

Episodic memory is append-only for every role. No agent — worker, orchestrator, or governance — can delete or modify an episodic record. Every agent can append. This is the mechanism that makes production debugging tractable.
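The three rules collapse into a small permission matrix. The sketch below mirrors the table, with role and store names chosen for illustration rather than taken from any specific framework:

```python
# Illustrative permission matrix for the write authority model.
PERMISSIONS = {
    "worker":       {"in_context", "persisted_state", "episodic"},
    "orchestrator": {"persisted_state", "episodic"},
    "governance":   {"external", "persisted_state", "episodic"},
}
APPEND_ONLY = {"episodic"}

def check_write(role: str, store: str, op: str = "write") -> bool:
    """True if `role` may perform `op` on `store` under the model."""
    if store in APPEND_ONLY and op != "append":
        return False   # no role may modify or delete episodic records
    return store in PERMISSIONS.get(role, set())

assert check_write("worker", "external") is False        # workers read doctrine only
assert check_write("governance", "external") is True     # via governed update flow
assert check_write("orchestrator", "episodic", op="append") is True
assert check_write("governance", "episodic", op="delete") is False
```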

"A shared blob is not an architecture that can be gradually fixed through better structuring."

Stefan Finch, Founder, Graph Digital

Checkpoint: In your current system, can you identify which agent has authority to write to which memory store, and can you verify that no other agent can bypass that constraint?

The permission model is the governance layer. What you store in each layer matters just as much as who can write to it.

Why conflating facts, judgments and recommendations corrupts multi-agent memory

The four layers define where information lives. The three information types define what kind of information belongs where.

Facts change only when underlying reality changes. A client's approval status, a workflow stage, a decision outcome: these are facts. They belong in persisted state, where they are versioned and authority-controlled.

Judgments are produced by specific evaluation runs and belong to those runs. A quality assessment, a relevance score, a routing decision: these are not facts about the world. They are outputs of a specific agent run at a specific moment. They belong in episodic memory, where they are recorded as part of the event history rather than promoted as durable state.

Recommendations are time-bound. They are relevant to a specific decision window and become obsolete as context changes. They belong in in-context memory for the duration of the run that needs them; they should not persist.

In a shared blob, these three information types are indistinguishable. A stale judgment gets treated as a current fact. A superseded recommendation gets retrieved alongside valid state. An agent cannot determine what type of information it is consuming, how current it is, or who had authority to write it.

The diagnostic question: Does your memory model distinguish between a fact (durable, authority-controlled), a judgment (run-specific, episodic), and a recommendation (time-bound, ephemeral)? If not, the system will produce decisions based on information it cannot correctly interpret.
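The lifetime distinctions lend themselves to a simple routing rule: the information type determines the layer it may be written to. A sketch, with type and layer names mirroring the text:

```python
# Illustrative routing: each information type carries a lifetime, and the
# lifetime determines the memory layer. Untyped information is rejected.
LAYER_FOR_TYPE = {
    "fact":           "persisted_state",   # durable, authority-controlled
    "judgment":       "episodic",          # belongs to the run that produced it
    "recommendation": "in_context",        # time-bound, expires with the run
}

def route(item_type: str) -> str:
    try:
        return LAYER_FOR_TYPE[item_type]
    except KeyError:
        raise ValueError(f"untyped information cannot be stored: {item_type!r}")

print(route("fact"))        # a workflow stage or approval flag
print(route("judgment"))    # a quality score from a specific run
```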

These distinctions are not theoretical. The Katelyn Skills OS case study shows what happens in production when they are absent, and what changes when they are applied.

How Graph Digital redesigned the Katelyn Skills OS from three untyped stores to four typed layers

This is a production account, not a thought experiment. The Katelyn Skills OS, Graph Digital's multi-agent content platform, operated for several months on a naive memory model before being redesigned in March 2026. The before/after data is first-party and verifiable.

Before: three independent memory systems without governed lifecycle

The system operated three separate stores with no architectural relationship between them:

  • Claude auto-memory: session-scoped, prose blobs, machine-specific. Not a governed memory store; session context that accumulated without expiry.
  • Katelyn workspace memory: markdown files updated inconsistently across runs. Stale entries persisted. No version control. No lifecycle enforcement.
  • Pipeline state: scattered JSON across individual run folders. No unified read scope, no authority model, no episodic record.

The result: agents contradicted each other across runs. Sessions lost context from previous decisions. State from earlier runs bled into current ones. When something went wrong, reproducing the failure required multi-session investigation: querying the model, cross-referencing run outputs, reconstructing what each agent had known.

The failure rate: cross-session coherence failures required manual state correction in approximately 1 in 8 sessions across 300+ production pipeline runs between January and April 2026.

After: four typed layers with defined lifetimes and write permissions

The redesign replaced the three untyped stores with four governed layers:

  • doctrine/: External memory, versioned skills, schemas, policies. Read-only for worker agents. Updated only through governed flows.
  • project/: Persisted state, durable structured facts. Typed JSON, authority-controlled writes.
  • state/: Operational truth, Katelyn-governed, single source of truth for system state.
  • Run artifacts: In-context working memory per run. Ephemeral; expires at run end.

The outcome across 150+ subsequent runs: zero cross-session coherence failures requiring manual intervention. Debugging time for state-related issues fell from multi-session investigation to single-session resolution. The episodic memory trail makes a complete, governed account of every run immediately accessible.

The architectural change was the only variable. Same agents, same models, same tasks. The memory governance model was what determined production reliability.

For the Digital Director making the business case internally: the case study outcome translates directly. The debugging overhead eliminated here is the kind of cost that compounds invisibly until a production system is already under stress.

The data closes the question of whether the four-layer model holds at production scale. What remains is the implementation sequence.

How to apply the four-layer AI agent memory model in your multi-agent system today

You now have the framework: four typed layers, a write permission model, and a distinction between facts, judgments, and recommendations. The next question is how to implement it.

The four-layer model requires no dedicated infrastructure. It does not depend on a specific memory platform or agent framework. It is enforced through file structure, worker contracts, and role-based write authority: the same mechanisms the Katelyn Skills OS uses in production. The model maps directly to the memory abstractions in frameworks such as LangGraph (persistent state graphs, ephemeral context packs) and LangChain's agent memory modules; the design principles translate across tooling choices.

The handoff artifact pattern is the practical implementation. Each agent, when it completes a task, produces a bounded context pack for the agents that follow it:

  • Ephemeral outputs stay in in-context memory and are not persisted. They exist for the duration of the current run only.
  • Decision records are written to persisted state with a version timestamp. They include who made the decision, on what basis, and when.
  • Doctrine updates — changes to system rules, schemas, or policies — go through a governed update flow to external memory. No worker agent updates doctrine directly.
  • Event records are appended to episodic memory for every run. They include which agent ran, what inputs it used, what it decided, and what failed.

This pattern works at the file system level before any programmatic memory management is needed. The permission model can be enforced through worker contracts — defined input and output scopes for each agent role — before it is enforced at the infrastructure level.
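The handoff artifact pattern at the file system level can be sketched as follows. Directory names follow the case-study layout (`project/`, per-run folders, an episodic log file); the helper function and its arguments are hypothetical, and a real system would add schemas and authority checks:

```python
# Hedged sketch: one agent finishing a run under the handoff artifact
# pattern. Ephemeral output stays in the run folder, the decision record is
# promoted to persisted state, and an event is appended to the episodic log.
import json
import pathlib
import tempfile
import time

def finish_run(root: pathlib.Path, run_id: str, agent: str,
               decision: dict, scratch: dict):
    run_dir = root / "runs" / run_id
    run_dir.mkdir(parents=True, exist_ok=True)

    # Ephemeral output: lives only in the run folder, never promoted.
    (run_dir / "scratch.json").write_text(json.dumps(scratch))

    # Decision record: promoted to persisted state with a timestamp.
    record = {"agent": agent, "decision": decision, "at": time.time()}
    (root / "project").mkdir(exist_ok=True)
    (root / "project" / f"{run_id}-decision.json").write_text(json.dumps(record))

    # Event record: appended to the episodic log, never rewritten.
    with open(root / "episodic.jsonl", "a") as log:
        log.write(json.dumps({"run": run_id, "agent": agent}) + "\n")

root = pathlib.Path(tempfile.mkdtemp())
finish_run(root, "run-001", "writer", {"approved": True}, {"draft": "v1"})
finish_run(root, "run-002", "writer", {"approved": False}, {"draft": "v2"})
print(len((root / "episodic.jsonl").read_text().splitlines()))  # 2
```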

Can you reproduce any agent's reasoning from a specific run two weeks ago without interrogating the model? If not, the memory architecture needs work before the system scales.

Frequently asked questions

What is episodic memory in an AI agent system?

Episodic memory in an AI agent system is the append-only event log that records what happened during each agent run: which agent executed, what inputs it consumed, what decisions it made, and what failed. Unlike in-context memory or persisted state, episodic records are never modified or deleted. They exist to make production debugging tractable. When agent behaviour requires investigation, the episodic trail gives you a complete, timestamped account of what every agent knew and when it acted.

What is external memory in an AI agent?

External memory in an AI agent system is the governed knowledge store that holds system doctrine: skills, schemas, rules, and policies that all agents share. It is versioned, persistent, and read-only for worker agents. Only governance agents, through controlled update flows, can change it. External memory prevents silent doctrine drift. Because no worker agent can write to it, the rules governing agent behaviour remain stable across runs until a deliberate, audited update is made.

Do I need dedicated infrastructure to implement the four-layer AI agent memory model?

No. The four-layer model requires no dedicated infrastructure, memory platform, or framework. It is enforced through file structure, worker contracts, and role-based write authority. In-context memory lives in the agent's run context. Persisted state is typed JSON with defined write authority. External memory is a versioned directory that workers read but cannot write. Episodic memory is an append-only log file. The permission model can be implemented and enforced at the file system level before any programmatic memory management is introduced.

Key takeaways

  • AI agent memory architecture requires four typed layers — in-context memory, persisted state, external memory, and episodic memory — each with a defined lifetime and write permission, not a shared context blob.
  • Shared context blobs produce five compounding failure modes (context bloat, stale state, conflicting writes, silent doctrine drift, failures with no traceable cause) that compound as agent count increases.
  • The write permission model is as important as the layer structure: worker agents read doctrine but cannot write it; orchestrators manage workflow state without touching doctrine; episodic memory is append-only for every agent role.
  • Facts, judgments, and recommendations have different lifetimes and must be stored in different layers — conflating them produces decisions based on information the system cannot correctly interpret.
  • Graph Digital's Katelyn Skills OS moved from a 1-in-8 cross-session failure rate (300+ runs, three untyped stores) to zero failures in 150+ subsequent runs after redesigning to four typed layers; the memory governance model was the only variable.
  • The four-layer model requires no dedicated infrastructure — it is enforced through file structure, worker contracts, and role-based write authority before programmatic memory management is needed.

If you are building or evaluating a multi-agent AI system and want to assess whether your current memory model will hold up at production scale, an AI Readiness Assessment gives you a structured gap analysis — from your current architecture to the four-layer governance model.


Stefan Finch — Founder, Graph Digital

Stefan Finch is the founder of Graph Digital, advising leaders on AI strategy, commercial systems, and agentic execution. He works with digital and commercial leaders in complex B2B organisations on AI visibility, buyer journeys, growth systems, and AI-enabled execution.

Connect with Stefan: LinkedIn

Graph Digital is an AI-powered B2B marketing and growth consultancy that specialises in AI visibility and answer engine optimisation (AEO) for complex B2B companies. AI strategy and advisory →