Why AI Agents Forget (and How to Give Them Lasting Memory)
Why AI agents lose context and forget across sessions - context windows, per-agent scope, retrieval limits - and how a shared write-time memory layer fixes it.
TL;DR
- AI agents forget for four structural reasons: context windows overflow, memory dies at session end, retrieval returns what is close instead of correct, and no record tracks when a fact stops being true.
- Anthropic's research agent truncates context past 200,000 tokens, and frontier models drop from ~91% to ~7% accuracy as tool outputs grow (arxiv.org).
- Bolting a vector store onto an agent fixes retrieval latency, not memory. Storage is a bucket. Memory is shared state.
- The fix is one org-wide, write-time, bi-temporal graph every agent and human reads and writes.
- Sentra is the only system above 30% on both MEME Cascade and Absence (KAIST, 2026).
Why AI Agents Forget: Four Structural Causes
Agents forget for four distinct reasons, each rooted in how they are built rather than how they are tuned. Understanding all four matters because a fix for one does nothing for the others.
1. Context overflow. The context window is a fixed token budget, and long tasks fill it until older information gets truncated and lost.
2. Session death. Memory inside a session is scratchpad state that disappears when the session ends, so each new run and each new agent starts blind.
3. Close-not-correct retrieval. Vector search ranks chunks by similarity, returning what sits near the query rather than what answers it.
4. No record of change. Standard retrieval treats every stored fact as equally current, so a deprecated value sits beside the fact that replaced it.
The sections below take each cause in turn and explain the mechanism.
The Context Window Is a Fixed Budget That Overflows
An agent's context window works like a fixed amount of RAM. Everything the model can think about at once, including instructions, conversation history, tool results, and retrieved documents, has to fit inside that budget. Anthropic's multi-agent research system makes the ceiling explicit: the lead agent saves its plan to external memory because a context window that exceeds 200,000 tokens will be truncated. Once the budget overflows, the oldest tokens fall out, and the agent loses the very details it needs to finish the task.
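The overflow behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular vendor truncates: the function name `fit_to_budget`, the word-count tokenizer, and the 20-token budget are all assumptions chosen to keep the example small.

```python
from collections import deque

def fit_to_budget(messages, budget_tokens, count_tokens):
    """Keep the newest messages that fit in budget_tokens; drop the oldest."""
    kept = deque()
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if used + cost > budget_tokens:
            break  # everything older than this point is lost
        kept.appendleft(message)
        used += cost
    return list(kept)

def rough_token_count(text):
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return len(text.split())

history = [
    "plan: migrate billing service first",
    "tool result: 40 rows returned from orders table",
    "tool result: schema diff shows 3 changed columns",
    "user: now write the migration",
]

# Hypothetical tiny budget so the overflow is visible.
window = fit_to_budget(history, budget_tokens=20, count_tokens=rough_token_count)
```

With this budget only the last two messages survive, and the plan written at the start of the task is the first thing to go.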
The numbers involved make this a hard wall, not a tuning problem you can prompt your way around. IBM Research measured a materials science workflow that would have consumed an average of 20,822,181 tokens to pass a single 3D molecular grid through context. No major model has a window anywhere near that size, so the run does not degrade gracefully. It throws a hard error. You cannot trim or summarize your way out when the input itself is roughly two orders of magnitude larger than a 200,000-token window.
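As a quick check on the scale, the token count above can be compared against the 200,000-token ceiling quoted earlier. Using that ceiling as the reference window is an assumption for illustration; other models have different limits.

```python
import math

grid_tokens = 20_822_181   # average reported for the molecular grid workflow
window_tokens = 200_000    # reference window: the truncation ceiling quoted earlier

ratio = grid_tokens / window_tokens              # how many times larger the input is
orders_of_magnitude = math.log10(ratio)          # about 2 means "a hundred times larger"
windows_needed = math.ceil(grid_tokens / window_tokens)  # full windows to hold the grid
```

The input is a little over a hundred times the size of the reference window, which is why summarizing or trimming cannot close the gap.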
Performance also collapses well before the hard ceiling. Research from LongFuncEval shows that as tool outputs grow longer, frontier model accuracy falls from roughly 91% to roughly 7%. The model still receives the tokens, but it stops using them reliably. An agent buried in long traces, spans, and tool results behaves as if it forgot what it just read, because the relevant fact is now diluted across thousands of competing tokens. The window is finite, and the closer you push toward its limit, the less the agent actually remembers.
Memory Dies at Session End
Most agent memory lives in a scratchpad that disappears when the session closes. Engineers describe this as short-term memory, information held only long enough to finish the current task (jtanruan.medium.com). When the conversation ends or the context resets, the agent loses what it learned. The next session starts from zero.
To fight this, developers write key details to an external store so the information survives a reset and can be re-injected later. LlamaIndex ships named blocks for exactly this, including VectorMemory for stored chat history and FactExtractionMemory for pulling factual statements out of a conversation (jtanruan.medium.com). Without that wiring, knowledge stays trapped per agent and per session.
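The write-then-re-inject pattern can be sketched with nothing but the standard library. This is not the LlamaIndex API; the `open_store`, `remember`, and `recall` helpers and the table layout are hypothetical names invented for this example, and SQLite stands in for whatever external store a team actually uses.

```python
import sqlite3
import tempfile
from pathlib import Path

def open_store(path):
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS facts (topic TEXT, statement TEXT)")
    return conn

def remember(conn, topic, statement):
    conn.execute("INSERT INTO facts VALUES (?, ?)", (topic, statement))
    conn.commit()

def recall(conn, topic):
    rows = conn.execute("SELECT statement FROM facts WHERE topic = ?", (topic,))
    return [row[0] for row in rows]

db_path = Path(tempfile.mkdtemp()) / "agent_memory.db"

# Session 1: the agent learns something, then the session ends.
session_one = open_store(db_path)
remember(session_one, "deploy", "The deployment pipeline changed last week.")
session_one.close()

# Session 2: a fresh connection with no in-memory state re-injects what was stored.
session_two = open_store(db_path)
reinjected = recall(session_two, "deploy")
session_two.close()
```

The fact survives the reset because it was written to disk before the first session closed, but only an agent that knows about this particular file can read it back.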
The deeper problem is that each store belongs to one agent. Your coding agent learns that a deployment pipeline changed last week, and your support agent never finds out. The stores sit in separate silos, so one agent's hard-won knowledge stays invisible to every other agent and to the humans on the team. Nothing accumulates across the organization, because no shared place exists for it to accumulate.
That isolation is why bolting storage onto a single agent never produces lasting memory. Memory has to be shared state every agent and human reads and writes, not a private cache that dies with each run.
Retrieval Returns What Is Close, Not What Is Correct
Retrieval-augmented generation reads more like a search trick than a memory system, and the limit is built into how it works. RAG ranks stored chunks by cosine similarity between the query embedding and each chunk embedding. That measure finds text that sits close in vector space, which is not the same as text that answers the question. Practitioners report that "retrieving code by embedding search alone became unreliable as the codebase grew," because proximity degrades as the corpus expands (jtanruan.medium.com).
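The ranking step can be made concrete with a toy example. The three-dimensional vectors below are hand-picked for illustration and do not come from a real embedding model; the point is only that the ranking function has no notion of "answers the question".

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, chunks):
    # Sort chunks by similarity to the query, most similar first.
    return sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, chunk["vec"]),
        reverse=True,
    )

# Hypothetical chunks: one restates the question, one actually answers it.
chunks = [
    {"text": "How do I reset my password? (forum post, no reply)", "vec": [0.9, 0.1, 0.0]},
    {"text": "Open Settings, then Security, then choose Reset.", "vec": [0.5, 0.5, 0.7]},
]
query_vec = [1.0, 0.0, 0.0]  # stands in for the embedding of "how do I reset my password"

ranked = rank(query_vec, chunks)
```

The unanswered forum post ranks first because it sits closest to the query, while the chunk that contains the answer lands second.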
Closeness also surfaces facts that have no business in the answer. ChatGPT's long-term memory once pulled a user's location from a past conversation and injected it into an unrelated image request (jtanruan.medium.com). Nothing in the math distinguished a relevant stored fact from a nearby one, so the agent restated a detail the user never asked for. Tuning chunk size or adding rerankers softens the failure, but it does not remove the root cause.
Sentra states the problem plainly: vector search returns what is close, not what is correct. The fix is not better ranking on top of embeddings. It is resolving what a fact means when you write it, against a per-organization ontology, so retrieval pulls the correct answer rather than the nearest one.
Agents Have No Record of When Facts Change
Standard RAG treats time as an optional signal you bolt on after the fact, not a property of the data itself. A vector index stores chunks by meaning and retrieves them by similarity. Whether a fact is current or two years out of date makes no difference to the math. One practitioner guide advises that "if freshness matters, rank newer documents higher," which in practice means custom post-retrieval filtering and a date-descending sort (jtanruan.medium.com). Recency is a manual engineering patch, not a structural guarantee.
That leaves the agent reading from a flat haystack. Old facts sit next to new ones, equally weighted, ready to restate yesterday as today. When your pricing changed in March, the deprecated number from January still scores just as relevant to a query about current pricing, because nothing in the store records that the old fact stopped being true.
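The manual recency patch looks roughly like the sketch below. The prices, scores, dates, and the `freshest_first` helper are invented for illustration; only the shape of the patch (filter by score, then sort by date descending) comes from the text above.

```python
from datetime import date

def freshest_first(hits, min_score):
    """Drop weak matches, then sort what is left newest first."""
    relevant = [hit for hit in hits if hit["score"] >= min_score]
    return sorted(relevant, key=lambda hit: hit["updated"], reverse=True)

# Hypothetical retrieval results for a query about current pricing.
hits = [
    {"text": "Pro plan is $40 per seat.", "score": 0.91, "updated": date(2025, 1, 10)},
    {"text": "Pro plan is $55 per seat.", "score": 0.90, "updated": date(2025, 3, 4)},
    {"text": "Office closed on Friday.", "score": 0.20, "updated": date(2025, 4, 1)},
]

ordered = freshest_first(hits, min_score=0.5)
```

The newer price now comes first, but the January price is still returned right behind it, and nothing in the data says it stopped being true.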
Sentra's framing names the result directly: "Vector search returns what's close, not what's correct." A system with no notion of validity windows cannot tell a superseded fact from a live one. The fix is a bi-temporal fact model, where every fact carries when it became true and when it stopped, and old facts are invalidated rather than deleted. An agent reading that graph knows which statements still hold, so it never quotes a deprecated commitment as if it were current.
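A bi-temporal fact model of the kind described above might look like the following sketch. It is not Sentra's schema: the `Fact` and `FactLog` names, the `recorded_on` field used as the second time axis, and the sample prices are all assumptions, and the code assumes facts are asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Fact:
    subject: str
    value: str
    valid_from: date                    # when the fact became true
    valid_to: Optional[date] = None     # when it stopped being true; None means still true
    recorded_on: Optional[date] = None  # when the system learned it (second time axis)

class FactLog:
    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, value, valid_from, recorded_on):
        # Close the currently open fact for this subject instead of deleting it.
        for fact in self.facts:
            if fact.subject == subject and fact.valid_to is None:
                fact.valid_to = valid_from
        self.facts.append(Fact(subject, value, valid_from, None, recorded_on))

    def as_of(self, subject, when):
        # Return the value whose validity window contains `when`, if any.
        for fact in self.facts:
            still_open = fact.valid_to is None or when < fact.valid_to
            if fact.subject == subject and fact.valid_from <= when and still_open:
                return fact.value
        return None

log = FactLog()
log.assert_fact("pro_plan_price", "$40 per seat", date(2025, 1, 10), recorded_on=date(2025, 1, 10))
log.assert_fact("pro_plan_price", "$55 per seat", date(2025, 3, 4), recorded_on=date(2025, 3, 5))

price_in_february = log.as_of("pro_plan_price", date(2025, 2, 1))
price_in_june = log.as_of("pro_plan_price", date(2025, 6, 1))
```

Both facts remain in the log, so the history is auditable, but a query for a given date only ever returns the one that was valid then.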
Storage Is Not Memory
Adding a vector store to an agent fixes how fast you fetch data, not whether the agent holds coherent knowledge. Storage is a bucket. You can drop facts in and pull them back out, but the bucket has no idea what any fact means, how it relates to other facts, or whether it has been replaced.
IBM Research makes the limit visible. Their pointer method keeps a 3D molecular grid outside the context window and hands the agent only a reference, cutting token use roughly sevenfold and dropping execution time from 43 seconds to 11 (arxiv.org). That is excellent engineering, and it is still a per-run cache. The pointers exist for the duration of one task. Nothing carries forward to the next run, and nothing is visible to a second agent working the same problem. You have solved retrieval cost without building memory.
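The pointer idea can be sketched as follows. This is not IBM's implementation; the `PointerStore` class, the `ptr://` reference scheme, and the small synthetic grid are hypothetical stand-ins that show the shape of the technique: large data stays outside the prompt and tools dereference it on demand.

```python
import uuid

class PointerStore:
    """Holds large payloads outside the prompt and hands back short references."""

    def __init__(self):
        self._blobs = {}  # lives only as long as this process: a per-run cache

    def put(self, payload):
        pointer = f"ptr://{uuid.uuid4().hex[:8]}"
        self._blobs[pointer] = payload
        return pointer

    def get(self, pointer):
        return self._blobs[pointer]

def grid_max_tool(store, pointer):
    # A tool dereferences the pointer itself; the model never sees the raw grid.
    grid = store.get(pointer)
    return max(max(max(row) for row in plane) for plane in grid)

store = PointerStore()
# Synthetic 20 x 20 x 20 grid standing in for a real molecular grid.
grid = [[[float(x + y + z) for x in range(20)] for y in range(20)] for z in range(20)]
pointer = store.put(grid)

prompt_with_data = f"Find the peak density in {grid}"
prompt_with_pointer = f"Find the peak density in {pointer}"
peak = grid_max_tool(store, pointer)
```

The prompt that carries the pointer is a few dozen characters long, while the one that inlines the grid carries every value. The dictionary behind the store disappears when the process exits, which is the per-run limitation the paragraph above describes.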
The deeper failure shows up when storage holds the same entity under different labels. Sentra describes the case directly. Sarah Chen in HubSpot, S. Chen in Gmail, and @schen in Slack read as three different people, so context never joins up. A bucket faithfully stores all three. It cannot tell you they are one person, because resolving that requires understanding the data, not just retaining it. The result is incoherent state. An agent answers a question about Sarah using one third of what the organization actually knows about her, and it has no way to know the other two thirds exist.
That is the line between storage and memory. Storage retains bytes. Memory is shared state that resolves who and what a fact refers to, links it to everything related, and stays consistent across every agent and person reading it. Bolting a store onto one agent gives you a faster bucket, not a brain.
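The Sarah Chen case can be sketched to show the gap between retaining labels and resolving them. The alias table here is written by hand and the notes are invented; a real system would have to infer these links, so treat `IdentityResolver` and the `person:sarah-chen` identifier as hypothetical.

```python
class IdentityResolver:
    """Maps surface labels from different tools to one canonical entity id."""

    def __init__(self):
        self._alias_to_entity = {}

    def link(self, entity_id, *aliases):
        for alias in aliases:
            self._alias_to_entity[alias.lower()] = entity_id

    def resolve(self, alias):
        return self._alias_to_entity.get(alias.lower())

# Hypothetical records, one per tool, each using a different label.
records = [
    {"source": "HubSpot", "who": "Sarah Chen", "note": "Renewal due in Q3"},
    {"source": "Gmail", "who": "S. Chen", "note": "Asked for a security review"},
    {"source": "Slack", "who": "@schen", "note": "Prefers async updates"},
]

resolver = IdentityResolver()
resolver.link("person:sarah-chen", "Sarah Chen", "S. Chen", "@schen")

def notes_by_label(label):
    # What a plain bucket can do: exact match on the stored string.
    return [r["note"] for r in records if r["who"] == label]

def notes_by_entity(entity_id):
    # What resolved memory can do: match on who the label refers to.
    return [r["note"] for r in records if resolver.resolve(r["who"]) == entity_id]

bucket_view = notes_by_label("Sarah Chen")
resolved_view = notes_by_entity("person:sarah-chen")
```

The label lookup returns one note out of three, while the entity lookup returns all of them.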
The Four Forgetting Modes and Their Fixes
Each forgetting mode traces back to a specific architectural gap, and each gap has a specific structural fix. The table below pairs them.
| Forgetting mode | Structural fix |
|---|---|
| Context overflow: the window fills past its token ceiling and older content gets truncated | Memory pointers that live outside the window, so the agent references data instead of carrying it |
| Session death: scratchpad memory vanishes when a session ends and never reaches other agents | A persistent shared graph every agent and human reads and writes |
| Close-not-correct retrieval: cosine similarity surfaces what is near the query, not what is true | Write-time semantic resolution, where meaning is fixed at ingestion against an org ontology |
| Stale facts: old and current facts sit equally weighted, ready to restate yesterday as today | A bi-temporal fact model that records when each fact became true and when it stopped |
The pattern across all four rows is the same. You cannot patch forgetting at query time when the cause sits in how the agent stores and scopes knowledge. The fixes move the work earlier, to write time, and wider, to shared state.
How a Shared, Write-Time Memory Layer Fixes This
A shared, write-time, bi-temporal memory layer attacks each forgetting mode at its source rather than patching the symptom. Sentra resolves meaning at ingestion against a per-organization ontology, so when a fact enters the graph, its structure is already settled. Standard RAG systems store embeddings at write time and then guess structure at query time, which forces every request to re-crawl Slack, email, and docs to rediscover what something means. Sentra makes meaning a primitive, not a side effect, so retrieval returns what is correct instead of what is merely close.
The bi-temporal fact model fixes stale knowledge directly. Every fact carries when it became true and when it stopped being true, and old facts are invalidated rather than deleted. An agent reading the graph sees that a deprecated price or a closed deal is no longer current, so it never restates yesterday as today. Provenance stays first-class, which means an agent can trace where a fact came from and why it changed.
One shared graph solves session death and per-agent silos. Every team, every tool, and every model reads and writes to the same graph through REST or MCP, so what you teach one agent, every agent remembers. A support agent learns a customer's escalation history, and the sales agent reads the same record without a fresh crawl. Sentra resolves identity continuously across names, emails, handles, and internal IDs, so Sarah Chen in HubSpot and @schen in Slack join into one person instead of three.
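The difference between a shared store and per-agent silos can be shown with a small in-process sketch. This does not use Sentra's REST or MCP interface, whose details are not given here; `SharedMemory`, `Agent`, and the sample key and value are hypothetical, and a plain dictionary stands in for the graph.

```python
class SharedMemory:
    """One store that every agent holds a reference to."""

    def __init__(self):
        self._facts = {}

    def write(self, key, value, author):
        self._facts[key] = {"value": value, "author": author}

    def read(self, key):
        entry = self._facts.get(key)
        return entry["value"] if entry else None

class Agent:
    def __init__(self, name, memory):
        self.name = name
        self.memory = memory  # injected from outside, not created per agent

    def learn(self, key, value):
        self.memory.write(key, value, author=self.name)

    def recall(self, key):
        return self.memory.read(key)

org_memory = SharedMemory()
support = Agent("support", org_memory)
sales = Agent("sales", org_memory)

support.learn("acme.escalations", "Two P1 tickets in the last 30 days")
seen_by_sales = sales.recall("acme.escalations")

# Contrast: an agent wired to its own private store never sees the fact.
siloed = Agent("siloed", SharedMemory())
seen_by_siloed = siloed.recall("acme.escalations")
```

The sales agent reads what the support agent wrote without any sync step, while the siloed agent gets nothing back. The only difference between them is which store they were handed.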
Sentra organizes this state into three coordinated layers. Factual memory holds what is true, where it came from, and when it changed. Action memory tracks what was promised, what is blocked, and what needs follow-up. Interaction memory records who said what and which perspective shaped a decision, so the reasoning behind an outcome survives, not just the artifact.
The proof sits in the two hardest tasks on the MEME benchmark from KAIST. Cascade and Absence break almost every system, with field averages of 3% and 1%. Sentra scores 40% and 43%, the only system above 30% on both.
FAQ
- What is the difference between agent memory and storage?
- Storage is a bucket that holds data until something asks for it. Memory is shared state that knows what a fact means, where it came from, and when it changed. Sentra resolves meaning at write time and keeps it queryable, so agents read coherent knowledge instead of raw chunks.
- Does this replace my existing agents or tools?
- No. Sentra is the memory layer underneath your stack, reachable over REST or MCP. Your agents in Cursor, Claude, or Slack keep running, and they read and write to one shared graph instead of carrying isolated, per-session context.
- What makes bi-temporal memory different from versioning?
- Versioning tracks edits to a record over time. Bi-temporal memory tracks when a fact became true and when it stopped being true, so old facts are invalidated rather than deleted. That lets an agent answer "what was true then" and "what is true now" without restating a superseded fact as current.
- How does one agent's learning reach another agent?
- Every agent writes to the same org-wide graph, so what you teach one agent, every agent remembers. There are no per-agent silos to sync. A fact captured by your sales agent is immediately available to your support agent the next time it queries.
- Is this secure enough for enterprise use?
- Sentra is SOC 2 Type II and ISO 27001 certified and does not train models on your data. You can deploy in the cloud, an isolated VPC, or fully air-gapped on-prem. Identity resolution is confidence-scored across names, emails, and handles, so access stays tied to the right person across tools.
Conclusion
Agents forget for structural reasons. The context window overflows, sessions reset, vector search returns what is close instead of what is correct, and stale facts sit unmarked next to current ones. Bolting a vector store onto an agent addresses retrieval latency, not memory. Storage is a bucket. Memory is shared state that every agent and human reads and writes, with semantics resolved at write time and a bi-temporal record of when each fact became true and when it stopped.
Give your agents a company brain. Start with Sentra and let what you teach one agent stick for every agent.