Every production agentic system needs to think across four distinct memory layers, each with a different scope, latency, and persistence characteristic:

| Agent memory architecture | In-context (working memory) | External (long-term) | Episodic (events) | Semantic (facts) |
|---|---|---|---|---|
| Backed by | Context window, token budget, current turn | Vector DB / PostgreSQL, Redis cache | Event log, session DB, timelines | Knowledge base / RAG, embeddings |
| Latency | Fastest | Fast | Moderate | Moderate |
| Cost | Most expensive | Cheap | Cheap | Cheap |
| Persistence | Volatile | Persistent | Persistent | Persistent |


- In-context (working memory): The LLM's context window at any given moment; everything the model "knows" for this inference call. It is the most constrained and expensive resource in the system.
- External / long-term memory: State persisted outside the model in vector stores, relational DBs, and key-value caches. Accessed via retrieval (RAG, tool calls) and effectively unbounded in size.
- Episodic memory: A log of what happened, covering past agent runs, tool call sequences, user interactions, and outcomes. It lets the agent reason about its own history and is typically stored as structured event records.
- Semantic memory: Factual world knowledge such as domain documents, ontologies, and reference data. This is the corpus your RAG pipeline retrieves from; it is static or slowly updated.


    Agent Turn N
    │
    ├── Retrieve from Semantic Memory (RAG)       → inject into context
    ├── Retrieve from Episodic Memory (past runs) → inject summary
    ├── Read from External KV store (user prefs)  → inject as system block
    │
    ▼
    LLM Inference ←── Working Memory (context window)
    │
    ├── Write tool outputs  → External store
    ├── Write turn summary  → Episodic store
    └── Update user state   → External KV

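The read/infer/write cycle above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `MemoryStores`, `run_turn`, the keyword-based retrieval, and the `llm` callable (any function from prompt string to reply string) are all hypothetical stand-ins, and tool-output writes are omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStores:
    """Hypothetical in-memory stand-ins for the external layers."""
    semantic: dict[str, str] = field(default_factory=dict)  # doc id -> text
    episodic: list[str] = field(default_factory=list)       # turn summaries
    kv: dict[str, str] = field(default_factory=dict)        # user state


def run_turn(user_msg: str, stores: MemoryStores, llm) -> str:
    # Read phase: pull from each layer and assemble working memory.
    # Retrieval here is a naive keyword match standing in for real RAG.
    words = user_msg.lower().split()
    docs = [text for text in stores.semantic.values()
            if any(w in text.lower() for w in words)]
    recent = stores.episodic[-3:]
    prefs = stores.kv.get("user_prefs", "")
    context = "\n".join([
        f"[system] user preferences: {prefs}",
        "[retrieved] " + " | ".join(docs),
        "[history] " + " | ".join(recent),
        f"[user] {user_msg}",
    ])

    reply = llm(context)  # inference over the assembled context

    # Write phase: persist a turn summary and update user state.
    stores.episodic.append(f"user asked: {user_msg!r}; agent replied: {reply!r}")
    stores.kv["last_topic"] = user_msg
    return reply
```

In a real system each store would be a separate service, and the write phase would usually run after the reply has been returned to the user.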
| What you need to remember | Store |
|---|---|
| Current task state, tool results this turn | In-context (working memory) |
| User preferences, persistent profile | External KV (Redis / PostgreSQL) |
| What happened in past sessions | Episodic DB (PostgreSQL event log) |
| Domain knowledge, documents | Semantic store (vector DB / pgvector) |
| Frequently accessed reference data | Cache layer (Redis, in-memory) |
Working memory is the context window, the single most constrained resource in any agentic system. Poor management leads to context overflow, silent truncation, and degraded reasoning quality well before the hard limit is hit.
Think of the context window as a fixed budget that must be allocated across competing consumers:

Context window budget (e.g. 200K tokens):

| Segment | Allocation | Approximate share |
|---|---|---|
| System prompt | Reserved (fixed) | ~5-10% |
| Retrieved RAG chunks | Dynamic (per-turn) | ~10-15% |
| Tool schemas | Dynamic (per-turn) | ~5% |
| Conversation / scratchpad | Dynamic (per-turn) | ~20-30% |
| Tool call results | Dynamic (per-turn) | ~20-30% |
| Output buffer (generation) | Dynamic (per-turn) | ~10-15% |

Key principle: always reserve headroom. A model operating near its context limit tends to lose reasoning quality (the "lost in the middle" problem is one symptom) long before it hits a hard error.
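As a worked example of the budget breakdown above, the following sketch converts the approximate percentage shares into token counts for the 200K-token example window. The segment names and the `token_ranges` helper are illustrative only.

```python
CONTEXT_LIMIT = 200_000  # example window size from the budget breakdown

# (low, high) share of the window per segment, as fractions.
SHARES = {
    "system_prompt": (0.05, 0.10),
    "rag_chunks": (0.10, 0.15),
    "tool_schemas": (0.05, 0.05),
    "conversation": (0.20, 0.30),
    "tool_results": (0.20, 0.30),
    "output_buffer": (0.10, 0.15),
}


def token_ranges(limit: int) -> dict[str, tuple[int, int]]:
    """Convert fractional shares into (min, max) token counts."""
    return {name: (round(limit * lo), round(limit * hi))
            for name, (lo, hi) in SHARES.items()}


ranges = token_ranges(CONTEXT_LIMIT)
low_total = sum(lo for lo, _ in ranges.values())    # 140_000
high_total = sum(hi for _, hi in ranges.values())   # 210_000
```

The lower bounds total 140,000 tokens and the upper bounds 210,000, which exceeds the window. The segments cannot all run at their maximum at once, so per-turn allocation has to trade them off against each other.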

    class TokenBudget:
        """Tracks fixed context costs and splits the remainder by priority."""

        def __init__(self, model_context_limit: int):
            self.total = model_context_limit
            self.system_prompt = 0      # set to the measured token count
            self.tool_schemas = 0       # set to the measured token count
            self.output_reserve = 2000  # always reserve for generation

        def remaining(self) -> int:
            used = self.system_prompt + self.tool_schemas + self.output_reserve
            return max(self.total - used, 0)  # never report a negative budget

        def allocate(self, priorities: dict[str, float]) -> dict[str, int]:
            """
            priorities: {"rag_chunks": 0.35, "history": 0.30, "tool_results": 0.35}
            Proportionally allocate remaining tokens by priority weights.
            """
            if sum(priorities.values()) > 1.0 + 1e-9:
                raise ValueError("priority weights must sum to at most 1.0")
            budget = self.remaining()
            return {k: int(budget * v) for k, v in priorities.items()}

LLMs attend most strongly to content at the beginning and end of the context window. Information buried in the middle receives less attention and is more likely to be ignored.
[Figure: attention weight (high to low) plotted against context position, from START to END. The curve is U-shaped: high at both ends and low in the middle.]
Mitigation strategies:
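One mitigation that follows directly from the attention curve is to order context items so the most important ones sit at the two ends. The sketch below assumes each chunk already carries an importance score; `order_for_attention` is a hypothetical helper, and how the scores are produced is left open.

```python
def order_for_attention(chunks: list[tuple[str, float]]) -> list[str]:
    """
    chunks: (text, importance) pairs. Returns texts ordered so the most
    important items sit at the two ends and the least important in the middle.
    """
    ranked = sorted(chunks, key=lambda c: c[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for i, (text, _) in enumerate(ranked):
        # Alternate: best -> front, second best -> back, and so on inward.
        (front if i % 2 == 0 else back).append(text)
    return front + back[::-1]
```

For example, chunks scored 0.9, 0.8, 0.5, and 0.1 come out with the 0.9 item first and the 0.8 item last, leaving the two weakest in the middle.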
Tool calls are the biggest source of context bloat in agentic systems. A single database query or API response can consume thousands of tokens.

    # count_tokens and truncate_to_tokens are assumed helpers, e.g. thin
    # wrappers around tiktoken or the model's own tokeniser.
    def manage_tool_result(result: str, budget: int,
                           summarise_fn=None) -> str:
        """
        Fit a tool result into a token budget.
        Moderate overruns (up to 3x budget) are summarised when summarise_fn
        is provided; anything larger is truncated.
        """
        token_count = count_tokens(result)
        if token_count <= budget:
            return result  # fits: use as-is
        if summarise_fn and token_count <= budget * 3:
            return summarise_fn(result, budget)  # moderate overrun: summarise
        # Large overrun: truncate, leaving ~50 tokens of room for the marker
        truncated = truncate_to_tokens(result, max(budget - 50, 0))
        return truncated + "\n\n[... result truncated; full output in external store]"

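The function above relies on `count_tokens` and `truncate_to_tokens`, which the text does not define. A minimal sketch of both follows, assuming a crude whitespace-based approximation of tokens. Real tokenisers split text differently, so in production these would wrap the model's actual tokeniser.

```python
import re

_TOKEN_RE = re.compile(r"\S+\s*")  # one "token" = a word plus trailing space


def count_tokens(text: str) -> int:
    """Rough token count: number of whitespace-delimited words."""
    return len(_TOKEN_RE.findall(text))


def truncate_to_tokens(text: str, max_tokens: int) -> str:
    """Keep at most max_tokens leading words, preserving original spacing."""
    if max_tokens <= 0:
        return ""
    pieces = _TOKEN_RE.findall(text)
    return "".join(pieces[:max_tokens]).rstrip()
```

Word counts typically undercount model tokens, so a budget enforced with this approximation should leave extra margin.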
Best practices:
Multi-turn conversations accumulate tokens quickly. Three strategies, in order of increasing sophistication:

Strategy 1: Sliding window. Keep only the last N turns. Simple and predictable, but it loses early context.

    def sliding_window(history: list[dict], max_turns: int = 10) -> list[dict]:
        if max_turns <= 0:
            return []  # history[-0:] would return the whole list
        return history[-max_turns * 2:]  # *2 for user+assistant pairs

Strategy 2: Token-aware truncation. Evict the oldest turns until the history fits the budget.

    def token_aware_truncate(history: list[dict], budget: int) -> list[dict]:
        kept = []
        used = 0
        # Walk newest to oldest, keeping messages until the budget is spent.
        for message in reversed(history):
            tokens = count_tokens(message["content"])
            if used + tokens > budget:
                break
            kept.append(message)
            used += tokens
        kept.reverse()  # restore chronological order
        return kept

Strategy 3: Progressive summarisation. Summarise old turns into a rolling summary block and keep recent turns verbatim. Covered in §3.
Count tokens with tiktoken or the model's tokeniser.

Summarisation bridges working memory and long-term memory: it compresses what happened into a form that can be cheaply re-injected later without paying the full original token cost.

| Trigger | Action |
|---|---|
| Context > 70% full | Summarise oldest N turns |
| Task / sub-task completes | Write task summary to episodic store |
| Session ends | Write session summary to user profile |
| Agent hands off to another agent | Write handoff brief |
| Periodic (every K turns) | Rolling summary update |

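
The triggers above can be checked after every turn with a small dispatcher. The sketch below is illustrative: `TurnState`, the action names, and the default of K = 10 turns are assumptions, while the 70% fill threshold comes from the trigger list.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    """Hypothetical snapshot of the agent's state after a turn."""
    context_tokens: int
    context_limit: int
    turn_index: int
    task_completed: bool = False
    session_ended: bool = False
    handing_off: bool = False


def summarisation_actions(state: TurnState,
                          fill_threshold: float = 0.70,
                          every_k_turns: int = 10) -> list[str]:
    """Return the summarisation actions whose triggers fire for this turn."""
    actions = []
    if state.context_tokens > fill_threshold * state.context_limit:
        actions.append("summarise_oldest_turns")
    if state.task_completed:
        actions.append("write_task_summary")
    if state.session_ended:
        actions.append("write_session_summary")
    if state.handing_off:
        actions.append("write_handoff_brief")
    if state.turn_index > 0 and state.turn_index % every_k_turns == 0:
        actions.append("update_rolling_summary")
    return actions
```

Several triggers can fire on the same turn, so the function returns a list and leaves ordering and deduplication of the resulting writes to the caller.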
Rolling summarisation is the most important pattern for long-running agents. Instead of keeping the full conversation, maintain a rolling summary that is updated incrementally:

    Turns 1-5:  [Full verbatim history]
        ↓ (context > threshold)
    Turn 6:     [Summary of turns 1-5]  + [Turns 6+ verbatim]
        ↓ (context > threshold again)
    Turn 11:    [Summary of turns 1-10] + [Turns 11+ verbatim]
        ↓
    Turn 16:    [Summary of turns 1-15] + [Turns 16+ verbatim]

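The compaction step in this progression might look like the sketch below. `compact_history` is a hypothetical helper; the `summarise` and `count_tokens` callables are injected so the logic can be shown without committing to a particular model or tokeniser.

```python
def compact_history(summary: str,
                    turns: list[str],
                    summarise,
                    count_tokens,
                    threshold: int,
                    keep_recent: int = 1) -> tuple[str, list[str]]:
    """
    If summary + turns exceed `threshold` tokens, fold all but the last
    `keep_recent` turns into the summary. Otherwise return inputs unchanged.
    """
    total = count_tokens(summary) + sum(count_tokens(t) for t in turns)
    if total <= threshold or len(turns) <= keep_recent:
        return summary, turns
    cut = len(turns) - keep_recent          # index of first turn kept verbatim
    old, recent = turns[:cut], turns[cut:]
    return summarise(summary, old), recent
```

Calling this after every turn reproduces the pattern shown: the verbatim tail stays short while the summary absorbs everything older.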

    SUMMARISE_PROMPT = """
    You are summarising a conversation segment for an AI agent's memory.

    Preserve:
    - All decisions made
    - All tool calls and their outcomes
    - All user-stated preferences or constraints
    - Any unresolved questions or pending actions

    Discard:
    - Pleasantries, filler, and meta-commentary
    - Redundant restatements
    - Intermediate reasoning that led to a discarded path

    Previous summary (if any):
    {previous_summary}

    New turns to incorporate:
    {new_turns}

    Write an updated, consolidated summary in past tense.
    """


    def format_turns(turns: list[dict]) -> str:
        # Assumed helper: render each turn as "role: content" on its own line.
        return "\n".join(f"{t['role']}: {t['content']}" for t in turns)


    def rolling_summarise(previous_summary: str,
                          new_turns: list[dict],
                          llm) -> str:
        # llm is assumed to expose invoke(prompt) and return the summary text.
        prompt = SUMMARISE_PROMPT.format(
            previous_summary=previous_summary or "None",
            new_turns=format_turns(new_turns),
        )
        return llm.invoke(prompt)

For very long agent runs (hundreds of turns), a single rolling summary becomes stale and lossy. Use a three-level hierarchy:

- Level 1: Turn summaries (one per 5-10 turns, ~100 tokens each)
- Level 2: Session summary (one per session, ~300-500 tokens)
- Level 3: User/task profile (persistent, ~200 tokens, updated on change)

At retrieval time:

- Always inject: Level 3 (user profile)
- Conditionally inject: Level 2 (recent session summary)
- Rarely inject: Level 1 (only if directly relevant to the current task)

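The injection rules above could be applied as in the following sketch. `select_summaries` is a hypothetical helper, and the keyword-overlap test for "directly relevant" is a deliberately naive stand-in for whatever relevance measure (for example embedding similarity) a real system would use.

```python
from typing import Optional


def select_summaries(profile: str,
                     session_summary: Optional[str],
                     turn_summaries: list[str],
                     task: str,
                     session_is_recent: bool) -> list[str]:
    """Pick which summary levels to inject for the current task."""
    selected = [profile]                      # Level 3: always
    if session_summary and session_is_recent:
        selected.append(session_summary)      # Level 2: conditional
    # Level 1: only turn summaries sharing a keyword with the task.
    task_words = {w for w in task.lower().split() if len(w) > 3}
    for summary in turn_summaries:
        if task_words & set(summary.lower().split()):
            selected.append(summary)
    return selected
```

Ignoring words of three letters or fewer is a crude way to skip stop words; it keeps the example short but would miss short domain terms.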
Not everything should be written back to long-term memory. Apply a write filter:

    import json

    WRITE_DECISION_PROMPT = """
    Evaluate this agent turn. Decide what (if anything) should be
    written to long-term memory.

    Categories:
      USER_PREFERENCE - stated preference about behaviour, format, topics
      DECISION        - a confirmed decision made by user or agent
      TASK_OUTCOME    - result of a completed task or sub-task
      CONSTRAINT      - a rule or constraint to remember
      DISCARD         - nothing worth persisting

    Turn content:
    {turn}

    Respond as JSON with a "category" field.
    """


    def evaluate_for_write(turn: str, llm) -> dict:
        # llm is assumed to expose invoke(prompt) and return the raw text reply.
        response = llm.invoke(WRITE_DECISION_PROMPT.format(turn=turn))
        try:
            return json.loads(response)
        except json.JSONDecodeError:
            # Malformed model output: persist nothing rather than fail the turn.
            return {"category": "DISCARD"}

Write routing by category:

| Category | Write destination |
|---|---|
| USER_PREFERENCE | User profile store (PostgreSQL / Redis) |
| DECISION | Episodic event log |
| TASK_OUTCOME | Task result store; update task state |
| CONSTRAINT | System prompt augmentation store |
| DISCARD | Nothing written |

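
The routing above maps naturally onto a dispatch table. In the sketch below the four destinations are plain in-memory containers standing in for the real stores, and all names (`ROUTES`, `route_write`, and the containers) are hypothetical. The task-state update for TASK_OUTCOME is omitted.

```python
from typing import Callable

# Hypothetical in-memory destinations standing in for real stores.
user_profile: dict[str, str] = {}
event_log: list[dict] = []
task_results: list[dict] = []
prompt_constraints: list[str] = []


def _write_preference(content: str) -> None:
    user_profile[f"pref_{len(user_profile)}"] = content


ROUTES: dict[str, Callable[[str], None]] = {
    "USER_PREFERENCE": _write_preference,
    "DECISION": lambda c: event_log.append({"type": "decision", "content": c}),
    "TASK_OUTCOME": lambda c: task_results.append({"outcome": c}),
    "CONSTRAINT": prompt_constraints.append,
}


def route_write(category: str, content: str) -> bool:
    """Write content to the store for its category; False if nothing was written."""
    handler = ROUTES.get(category)  # DISCARD and unknown categories have no route
    if handler is None:
        return False
    handler(content)
    return True
```

Treating unknown categories the same as DISCARD means a misclassified turn is dropped rather than written to the wrong store.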
Not all memories should live forever. Apply a TTL (time-to-live) and relevance decay model:

| Memory type | TTL / retention policy |
|---|---|
| Turn-level summary | 24-48 hours (session scope) |
| Session summary | 30 days (unless reinforced by re-access) |
| User preference | Indefinite (until explicitly changed) |
| Task outcome | Duration of project / task lifecycle |
| Semantic (RAG docs) | Versioned; expire on document update |

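
    import json
    from datetime import datetime, timedelta, timezone

    def write_with_ttl(store, key: str, value: dict, ttl_days: int):
        # Stamp an explicit, timezone-aware expiry without mutating the input.
        expires = datetime.now(timezone.utc) + timedelta(days=ttl_days)
        record = {**value, "expires_at": expires.isoformat()}
        store.set(key, json.dumps(record))
        # In Redis: store.expire(key, ttl_days * 86400)
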
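The text names a relevance decay model without specifying one. The sketch below assumes a simple exponential half-life on time since last access, plus the "reinforced by re-access" behaviour from the retention policy. The `last_accessed` field, the 30-day half-life, and all three function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def is_expired(record: dict, now: datetime) -> bool:
    """A record with no expires_at is treated as indefinite."""
    expires_at = record.get("expires_at")
    return expires_at is not None and datetime.fromisoformat(expires_at) <= now


def relevance(record: dict, now: datetime, half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 when just accessed, 0.5 after one half-life."""
    last = datetime.fromisoformat(record["last_accessed"])
    age_days = (now - last).total_seconds() / 86400
    return 0.5 ** (age_days / half_life_days)


def reinforce(record: dict, now: datetime, ttl_days: int) -> dict:
    """On re-access, reset the decay clock and extend the expiry."""
    return {**record,
            "last_accessed": now.isoformat(),
            "expires_at": (now + timedelta(days=ttl_days)).isoformat()}
```

A retrieval layer could multiply its similarity score by `relevance(...)` so that older, untouched memories rank lower before they finally expire.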
A clean interface pattern for agent memory operations, usable as tool definitions in an agentic framework:

    # Three memory tools exposed to the agent
    REMEMBER = {
        "name": "remember",
        "description": "Persist a fact, decision, or preference to long-term memory.",
        "parameters": {
            "category": "USER_PREFERENCE | DECISION | TASK_OUTCOME | CONSTRAINT",
            "content": "The information to persist, concisely stated.",
            "ttl_days": "How long to retain (0 = indefinite).",
        }
    }

    RECALL = {
        "name": "recall",
        "description": "Retrieve relevant memories given a query.",
        "parameters": {
            "query": "Natural language description of what to retrieve.",
            "limit": "Max number of memories to return (default 5).",
        }
    }

    FORGET = {
        "name": "forget",
        "description": "Delete or invalidate a stored memory.",
        "parameters": {
            "memory_id": "ID of the memory to remove.",
            "reason": "Why this memory is no longer valid.",
        }
    }

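To show how these three tool definitions might be backed, here is a minimal in-memory handler with one method per tool. `InMemoryStore` and its word-overlap ranking are hypothetical; a real implementation would sit on the external stores described earlier and use proper retrieval.

```python
import itertools


class InMemoryStore:
    """Hypothetical backing store for the remember / recall / forget tools."""

    def __init__(self):
        self._items: dict[str, dict] = {}
        self._ids = itertools.count(1)

    def remember(self, category: str, content: str, ttl_days: int = 0) -> str:
        memory_id = f"mem-{next(self._ids)}"
        self._items[memory_id] = {
            "category": category, "content": content, "ttl_days": ttl_days,
        }
        return memory_id

    def recall(self, query: str, limit: int = 5) -> list[dict]:
        # Naive relevance: count query words that appear in the content.
        words = set(query.lower().split())
        scored = []
        for memory_id, item in self._items.items():
            overlap = len(words & set(item["content"].lower().split()))
            if overlap:
                scored.append((overlap, {"memory_id": memory_id, **item}))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [item for _, item in scored[:limit]]

    def forget(self, memory_id: str, reason: str) -> bool:
        # `reason` would normally be logged for audit; it is unused here.
        return self._items.pop(memory_id, None) is not None
```

Returning the `memory_id` from `remember` and including it in `recall` results gives the agent what it needs to call `forget` later.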
| Principle | Guidance |
|---|---|
| Budget first | Always know your token budget before composing a context |
| Retrieve, don't store | Keep working memory lean; pull from external stores on demand |
| Summarise at boundaries | Sub-task end, session end, and context threshold are all write triggers |
| Write with intent | Not every turn deserves persistence; classify before writing |
| Position matters | Critical information at start/end of context; verbose content in middle |
| Instrument everything | Token usage, memory hit rates, and summarisation quality are observable metrics |
| Decay is healthy | Stale memories degrade agent performance; apply TTLs aggressively |