What Do LLMs Actually Forget?

LLMs don’t remember: they just read context windows. Learn why true memory requires external systems and how Flumes bridges the gap between prompts and persistence.

Ask most people how a language model like GPT "remembers" things, and you'll get a vague answer: "It sees the prompt." But if you're building real-world AI agents, that answer quickly breaks down.

Large language models (LLMs) don't remember in any conventional sense. They don't retain long-term state between calls. They don’t build memory over time. What they "know" comes down to one thing: what's in the current context window.

So, what happens when that context window runs out? The model forgets. Instantly.

Context Windows Are a Crutch

Every LLM has a token limit, the maximum number of tokens it can process at once. For GPT-4 Turbo, that's 128k tokens. Sounds huge? Sure. But:

  • A day's worth of chat history can easily surpass 10k tokens
  • Detailed project plans or logs add up fast
  • JSON, code, and long user messages tokenize into far more tokens than you'd expect

When the context window fills up, developers are forced to choose: what stays, what gets cut, and what gets summarized. These are memory decisions. And most LLM-based apps aren’t equipped to make them intelligently.
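In practice, that decision often looks like a crude trimming loop: estimate the token cost of each message, then drop the oldest turns until the conversation fits. Here's a minimal sketch of the idea (the 4-characters-per-token estimate and the budget are illustrative; a real system would use the model's own tokenizer, e.g. tiktoken):

```python
def estimate_tokens(text: str) -> int:
    # Crude approximation: roughly 4 characters per token for English text.
    # A real implementation would use the model's tokenizer (e.g. tiktoken).
    return max(1, len(text) // 4)

def fit_to_budget(messages: list[dict], budget: int = 128_000) -> list[dict]:
    """Keep the most recent messages that fit inside the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk from newest to oldest
        cost = estimate_tokens(msg["content"])
        if used + cost > budget:
            break                       # everything older silently falls away
        kept.append(msg)
        used += cost
    return list(reversed(kept))         # restore chronological order
```

Everything outside the budget is simply gone unless something else stored it, which is exactly the decision most apps make by default rather than by design.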

Short-Term vs Long-Term Memory

Language models have short-term memory, not long-term memory. That means:

  • They remember what you told them in this session
  • They forget everything between sessions
  • They can fake continuity if you cram past content into the prompt, but it's expensive and brittle (sketched below)

For example:

  • GPT can remember your name if you repeat it in every message
  • It can track a to-do list if you keep past tasks in context
  • But if you end the session, it forgets you existed
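That "cram it into the prompt" pattern looks something like this (call_llm is a hypothetical stand-in for whatever chat-completion client you use):

```python
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for your chat-completion API of choice."""
    return "(model reply)"

history: list[dict] = []   # lives only in this process, not real memory

def chat(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    # Every call replays the ENTIRE history: cost grows with each turn,
    # and the "memory" vanishes the moment the session or process ends.
    reply = call_llm(
        [{"role": "system", "content": "You are a helpful assistant."}] + history
    )
    history.append({"role": "assistant", "content": reply})
    return reply
```

Restart the process and history is empty again. The model never remembered anything; the application did.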

This isn't a flaw in GPT; it's a design choice. But it creates limitations for any system that needs persistence, memory, or reasoning across time.

Why RAG Isn’t Real Memory

Retrieval-Augmented Generation (RAG) is a popular workaround: store documents or facts in a vector database, then retrieve the most relevant-looking ones at inference time and inject them into the prompt.

But RAG has real issues when misused as memory:

  • It retrieves semantically similar text, which is not the same thing as the right memory
  • It lacks structure: no timeline, no source attribution, no update logic
  • It’s stateless: no awareness of prior interactions

In short: RAG helps models answer questions. But it doesn't help them remember.
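A deliberately tiny retrieval sketch makes the gap concrete (bag-of-words cosine similarity stands in for embedding search; the stored strings are invented):

```python
from collections import Counter
from math import sqrt

# Two "memories" about the same fact. Pure similarity search has no notion
# of which one is current: both match the query about equally well.
store = [
    "2023-02-11: user prefers email notifications",
    "2024-06-30: user switched to Slack notifications only",
]

def cosine(a: str, b: str) -> float:
    # Toy bag-of-words cosine similarity, standing in for an embedding model.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

query = "how should we notify the user"
for snippet in sorted(store, key=lambda s: cosine(query, s), reverse=True):
    print(f"{cosine(query, snippet):.2f}  {snippet}")
```

Both snippets score nearly the same, and the stale one can easily rank first. Nothing in a similarity score says "this fact was superseded last June", and that missing structure is the whole problem.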

External Memory = Structured Continuity

To build truly persistent agents, you need something more powerful than vector search. You need external memory:

  • Stored outside the prompt
  • Recallable based on intent, not just similarity
  • Structured to reflect tasks, timelines, and interactions
  • Updateable: able to retain the latest version of a fact or state
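Concretely, and independent of any particular product, that can look like a memory record that carries structure alongside its text. A hypothetical sketch (the field names and store are illustrative, not Flumes' API):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class MemoryRecord:
    """One unit of agent memory: structured, attributed, and updateable."""
    subject: str                            # what it's about, e.g. "notification_preference"
    content: str                            # the fact or state itself
    source: str                             # where it came from: session id, tool call, user message
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    superseded_by: Optional[str] = None     # set when a newer version of the fact arrives

class MemoryStore:
    """Minimal in-memory store that keeps the latest version of each fact."""

    def __init__(self) -> None:
        self._records: dict[str, MemoryRecord] = {}

    def remember(self, record: MemoryRecord) -> None:
        old = self._records.get(record.subject)
        if old is not None:
            old.superseded_by = record.source   # update logic, not blind appending
        self._records[record.subject] = record

    def recall(self, subject: str) -> Optional[MemoryRecord]:
        return self._records.get(subject)
```

The exact fields matter less than the shift they represent: recall becomes a lookup against structure (subject, recency, supersession) instead of a similarity guess over raw text.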

This is what Flumes enables. Instead of shoving more tokens into every call, Flumes lets your agent:

  • Recall only what matters from prior sessions
  • Store structured memories that persist across time
  • Summarize and compress as needed
  • Drop irrelevant or outdated data automatically

Forgetting Isn't the Problem. Not Remembering Is.

LLMs will always forget. That’s not going to change.

But agents need continuity. They need to build up knowledge, context, and goals over time. External memory systems are the only way to bridge that gap.

With Flumes, your agent doesn’t need to pretend to remember. It actually can.

Curious what a real memory system for LLMs looks like? [Join the early access] and start building with structured, persistent memory that scales.
