
How AI agents remember

A field guide to agent memory: the kinds of memory an agent keeps, where each one lives in Postgres, and when the agent reads or writes it during a single turn.

The gist

  • Agent memory splits into three classic kinds borrowed from cognitive science (semantic, episodic, procedural) and four the agent era added (working, entity/graph, reflective, shared).
  • Storage follows the access pattern: exact lookups in a SQL table, semantic recall (including which tools to use) in a pgvector index, relationships in a graph, all inside one Postgres.
  • Most reads are deterministic and run before the model thinks; expensive reads are just-in-time; durable writes happen only when the turn ends.
  • An agent that curates its own memory by forgetting the trivial and reflecting on hard runs stays sharp instead of drowning in its own history.

An agent with no memory is a brilliant stranger. It reasons well, calls tools, returns an answer, and then forgets everything the moment the turn ends. Memory is the fourth capability, alongside perception, reasoning, and action, and it is the one that turns a clever demo into a system you can rely on across days and sessions. This article is a map of three questions: what an agent stores, where each kind of memory lives in a database, and when the agent reaches for it.

The stateless default

A stateless agent reads its input, reasons over it, and produces output, with nothing carried between calls. That is fine for a one-shot tool and useless for anything real. Without memory an agent cannot run a task that spans many steps, cannot recognize a user who returns tomorrow, and cannot learn from a mistake it made an hour ago. The usual patch, pasting the entire history into every prompt, only delays the failure: the context window fills, latency climbs, cost grows with every turn, and eventually the window overflows.

Memory is the alternative to stuffing. Instead of carrying everything in the prompt, the agent persists what matters outside the context window and retrieves only the slice a given turn needs. The rest of this piece is about how to do that well.
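A rough back-of-the-envelope sketch shows why stuffing only delays the failure. The figures used here (500 tokens per turn, a 2,000-token retrieved slice) and the function names are illustrative assumptions, not measurements.

```python
def stuffed_prompt_tokens(turn: int, tokens_per_turn: int) -> int:
    """Prompt size at a given turn when the whole history is pasted in."""
    return turn * tokens_per_turn


def cumulative_tokens_stuffed(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens paid across all turns with full-history stuffing."""
    return sum(stuffed_prompt_tokens(t, tokens_per_turn) for t in range(1, turns + 1))


def cumulative_tokens_with_memory(turns: int, slice_tokens: int) -> int:
    """Total prompt tokens when each turn retrieves a fixed-size slice instead."""
    return turns * slice_tokens


# Assumed figures: 100 turns, 500 tokens per turn, a 2,000-token slice.
stuffed = cumulative_tokens_stuffed(100, 500)        # grows quadratically with turns
sliced = cumulative_tokens_with_memory(100, 2000)    # grows linearly with turns
```

Under these assumptions the stuffed agent pays for about 2.5 million prompt tokens over 100 turns, against 200,000 for the agent that retrieves a fixed slice, and the gap widens with every further turn.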

Seven kinds of memory

"Memory" is not one thing. It helps to split it into kinds, because each kind answers a different question and, as we will see, wants a different home in the database.

Three kinds are borrowed straight from how people remember. Semantic memory holds facts: what is true about the user, the domain, and the world. Episodic memory holds experiences: a time-ordered record of what happened and what the agent did. Procedural memory holds skills: how to carry out a task without re-deriving it each time.

Four more kinds were added by the practice of building agents. Working memory is the scratchpad for the task in flight. Entity/graph memory is a map of the things an agent has seen and how they connect. Reflective memory is the lessons an agent draws from its own past runs. Shared memory is a pool that several agents read and write in common.

Memory types: 3 classic · 4 from the agent era

Semantic (facts)
  • Stored: Durable facts about the user, the domain, and the world.
  • Extracted: Stable preferences and attributes pulled from the chat.
  • In a chatbot: "The user is vegetarian and prefers metric units."
  • In an agent: A knowledge base the agent retrieves over (RAG).

Episodic (experiences)
  • Stored: A time-ordered record of what happened and what was done.
  • Extracted: A short note per turn or event, kept in order.
  • In a chatbot: "Yesterday you helped me draft an email to Sam."
  • In an agent: A run log of past executions, for replay and audit.

Procedural (skills)
  • Stored: How to carry out a task, plus the catalog of tools.
  • Extracted: Reusable steps and workflows that worked before.
  • In a chatbot: The system prompt and persona rules.
  • In an agent: A learned tool sequence (workflow memory) and the toolbox.

Working (task state, new)
  • Stored: The scratchpad for the task currently in flight.
  • Extracted: The current goal, plan, and intermediate results.
  • In a chatbot: The active conversation in the context window.
  • In an agent: Plan and partial results held across loop steps.

Entity / graph (relationships, new)
  • Stored: Entities and the typed edges between them.
  • Extracted: People, systems, and orgs, and how they connect.
  • In a chatbot: "Sam is your manager; the thread was about Q3."
  • In an agent: A knowledge graph the agent traverses (Graphiti, Zep).

Reflective (lessons, new)
  • Stored: Lessons the agent draws from its own past runs.
  • Extracted: What went wrong, and the rule that avoids it next time.
  • In a chatbot: "Ask for the deadline before drafting."
  • In an agent: A lessons file loaded before similar tasks (Reflexion).

Shared (common pool, new)
  • Stored: A pool several agents read and write in common.
  • Extracted: Facts and state shared across the team.
  • In a chatbot: A team-wide FAQ every bot reads from.
  • In an agent: Multi-agent crew state, a shared blackboard.

Seven kinds, three borrowed from how people remember and four added by building agents. For each: what gets stored, what the agent extracts, and how it shows up in a chatbot versus a broader agentic system.

These seven kinds are not seven databases. They collapse into a handful of storage shapes, and a single turn touches several of them at once, which is what the rest of this piece shows.

Where memory lives

Here is the engineering core, and it is simpler than the catalog of kinds suggests. The access pattern picks the backend. Ask one question of every read: do I know exactly which rows I want, only that they should resemble something, or only how they connect to something I already have? The answer routes the data to one of three homes, all of which are just Postgres.

Exact lookups go in a SQL table. Conversational memory is the clearest case. You want this thread's most recent messages, in order. That is an index range scan, not a similarity search.

messages.sql
create table messages (
  id          bigint generated always as identity primary key,
  thread_id   text        not null,
  role        text        not null,          -- 'user' | 'assistant'
  content     text        not null,
  created_at  timestamptz not null default now()
);
 
create index on messages (thread_id, created_at desc);

Reading the recent turns is a single indexed query, no embeddings involved:

recent.sql
select role, content
from messages
where thread_id = $1
order by created_at desc
limit 20;

A tool log or event log has the same shape: rows you fetch by id and time, for replay and audit.
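One detail the query leaves to the caller: it returns the newest rows first, while a prompt usually wants them oldest first. The sketch below shows that reversal. It uses Python's built-in sqlite3 purely as a stand-in so it runs without a Postgres server, and it orders by id rather than created_at; the function names and sample rows are assumptions for illustration.

```python
import sqlite3


def recent_messages(conn: sqlite3.Connection, thread_id: str, limit: int = 20) -> list:
    """Return the newest `limit` messages of a thread, oldest first."""
    rows = conn.execute(
        "select role, content from messages "
        "where thread_id = ? order by id desc limit ?",
        (thread_id, limit),
    ).fetchall()
    # The query returns newest-first; reverse so the prompt reads in order.
    return rows[::-1]


def demo() -> list:
    # sqlite3 is only a stand-in here; the article's schema targets Postgres.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table messages ("
        "id integer primary key autoincrement, "
        "thread_id text not null, role text not null, content text not null)"
    )
    conn.executemany(
        "insert into messages (thread_id, role, content) values (?, ?, ?)",
        [
            ("t1", "user", "hi"),
            ("t2", "user", "other thread"),
            ("t1", "assistant", "hello"),
            ("t1", "user", "book a flight"),
        ],
    )
    return recent_messages(conn, "t1", limit=2)
```

Calling demo() yields the last two messages of thread t1 in chronological order, with the other thread's row never touched.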

Semantic recall goes in a vector index. When you only know the meaning you want, store an embedding of the content and search by distance. The pgvector extension adds a vector column type and distance operators to plain Postgres.

knowledge.sql
create extension if not exists vector;
 
create table knowledge (
  id        bigint generated always as identity primary key,
  content   text        not null,
  embedding vector(768) not null,
  metadata  jsonb       not null default '{}'
);
 
-- An approximate-nearest-neighbor index (HNSW) over cosine distance.
create index on knowledge using hnsw (embedding vector_cosine_ops);

Retrieval pulls the top-K rows closest to the query embedding. The <=> operator is cosine distance:

recall.sql
select content, metadata
from knowledge
order by embedding <=> $1   -- $1 is the query embedding
limit 5;
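To make the ordering concrete, here is a small pure-Python sketch of what cosine distance measures and what a top-K search returns. It scans every row exactly, whereas the HNSW index approximates the same ranking; the function names and toy two-dimensional vectors are assumptions for illustration.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: list[tuple[str, list[float]]], k: int = 5) -> list[str]:
    """Exact (brute-force) nearest neighbors: sort every row by distance to the query."""
    ranked = sorted(rows, key=lambda row: cosine_distance(query, row[1]))
    return [content for content, _ in ranked[:k]]


# Toy rows: (content, embedding). Real embeddings have hundreds of dimensions.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
nearest = top_k([1.0, 0.1], rows, k=2)
```

Because the distance depends only on direction, a vector and a scaled copy of it sit at distance zero from each other.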

Because it is still SQL, you can scope the search with an ordinary filter on the JSONB metadata in the same query:

filtered.sql
select content
from knowledge
where metadata @> '{"source": "papers"}'
order by embedding <=> $1
limit 5;

Knowledge, learned workflows, the tool catalog, entity descriptions, and reflective lessons all share this layout: a content column, an embedding, and some metadata.

Relationships go in a graph. Entity memory is not really about similarity; it is about connection. Model entities and typed edges as two tables, then walk the graph with a recursive query.

graph.sql
create table entities (
  id        bigint generated always as identity primary key,
  name      text not null,
  kind      text,
  embedding vector(768)          -- so an entity can also be found by meaning
);
 
create table edges (
  src bigint references entities(id),
  rel text   not null,
  dst bigint references entities(id)
);

Pulling the two-hop neighborhood around an entity, following outbound edges, is a recursive CTE:

neighborhood.sql
with recursive nbrhood as (
  select id, name, 0 as depth
  from entities
  where name = $1
  union all
  select e.dst, t.name, n.depth + 1
  from nbrhood n
  join edges e    on e.src = n.id
  join entities t on t.id  = e.dst
  where n.depth < 2
)
select distinct name, depth from nbrhood order by depth;
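The same traversal can be sketched in plain Python as a breadth-first search, which may help when reasoning about what the query returns. This version keys entities by name rather than id and keeps only the shortest depth per entity, whereas the CTE can list an entity once per depth at which it is reached; the function name and sample edges are assumptions.

```python
from collections import deque


def neighborhood(
    edges: list[tuple[str, str, str]], start: str, max_depth: int = 2
) -> dict[str, int]:
    """Map each entity reachable within `max_depth` outbound hops to its shortest depth."""
    outbound: dict[str, list[str]] = {}
    for src, _rel, dst in edges:
        outbound.setdefault(src, []).append(dst)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the hop limit
        for nxt in outbound.get(node, []):
            if nxt not in depth:  # first visit is the shortest path in BFS
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth


# Hypothetical edges as (src, rel, dst) triples.
edges = [
    ("Sam", "manages", "You"),
    ("You", "works_on", "Q3 report"),
    ("Q3 report", "owned_by", "Finance"),
    ("Sam", "member_of", "Leadership"),
]
two_hops = neighborhood(edges, "Sam")
```

Here Finance sits three hops from Sam, so it falls outside the two-hop neighborhood, just as the depth guard in the CTE would exclude it.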

Tools are memory, too. As an agent gains tools, sending every tool's schema to the model on every call is the same mistake as stuffing the prompt: it is noise the model has to read past. Treat the tool catalog as procedural memory. Store each tool's description as an embedding, exactly like the knowledge table, and retrieve only the few tools whose description is closest to the task at hand.

Example: semantic tool retrieval (intent to top 3 tools)
  • Task: "Book me a flight and a hotel for Berlin next month."
  • Kept: book_flight, find_hotel, convert_currency
  • Not sent: send_email, run_sql, search_papers, summarize_doc, create_ticket
  • Tools available: 8
  • Schemas sent to model: 3 / 8
Tools are procedural memory: store each tool's description as an embedding, then retrieve only the few that match the task instead of sending every schema to the model. The catalog can grow without bloating the prompt.

The retrieval is the same <=> nearest-neighbor search you already run for knowledge; only the rows differ. A catalog of two hundred tools costs the model little if the agent loads only the three it needs, and the model reasons better when it is not handed a wall of schemas it will never call.
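A minimal sketch of that selection step, under loud assumptions: the tool names follow the flight-and-hotel example, but the descriptions are invented, and word counts stand in for a real embedding model, which would capture meaning rather than shared words.

```python
import math
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "the", "for", "me", "to", "in", "of", "or"}

# Hypothetical catalog: descriptions are invented for this sketch.
TOOLS = {
    "book_flight": "Book a flight between two cities",
    "find_hotel": "Find a hotel in a city",
    "convert_currency": "Convert a flight or hotel price into another currency",
    "send_email": "Send an email to a contact",
    "run_sql": "Run a SQL query against the warehouse",
    "search_papers": "Search academic papers by topic",
    "summarize_doc": "Summarize a long document",
    "create_ticket": "Create a support ticket",
}


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def cosine_distance(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def select_tools(task: str, k: int = 3) -> list[str]:
    """Return the k tool names whose descriptions sit closest to the task."""
    query = embed(task)
    ranked = sorted(TOOLS, key=lambda name: cosine_distance(query, embed(TOOLS[name])))
    return ranked[:k]


selected = select_tools("Book me a flight and a hotel for Berlin next month.")
```

Only the schemas for the selected names would then be attached to the model call. In a real system the description embeddings would be computed once at registration time and stored, not recomputed per query as this toy does.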

One database, not three

Notice that nothing above leaves Postgres. The vector index is an extension, the graph is two ordinary tables, and the conversation log is a plain table with a B-tree. You do not need a dedicated vector store or a separate graph engine to ship a memory-aware agent; you need one Postgres and a clear idea of which question each table answers.

It is tempting to embed everything and search by similarity for all of it. Resist that. Asking a vector index for "this thread's last ten messages, in order" is both slow and wrong: it returns rows that are merely similar, not the ones you asked for. Exact, ordered reads are a B-tree's job. Similarity is the vector index's job. Connection is the graph's job. Match the backend to the question and most of the hard decisions disappear.

When the agent remembers

Knowing where memory lives is half the design. The other half is timing, and it is where most agents go wrong. A single turn touches memory at three distinct moments, and keeping them separate is what makes an agent both cheap and debuggable.

First, deterministic reads, before the model thinks. An agent cannot ask for what it does not know exists, so the core stores load on every turn by a fixed rule, not by the model's choice: recent conversation by thread, relevant knowledge by similarity, similar past workflows, and the entities named in the query. This is the context the model starts from.

Second, just-in-time reads, inside the loop. Expensive or conditional fetches wait until the model actually asks. A web search fires only when local memory comes up thin, and a full document loads only when the model decides it needs the detail. A lean prompt reasons better than a stuffed one, so the agent pulls detail just in time rather than just in case.

Third, durable writes, at the stop condition. When the turn produces a final answer, the agent persists what it learned: the exchange, the workflow it followed, any new entities, and a reflective lesson. Writing only at the end keeps half-finished reasoning out of long-term memory.

The walkthrough below traces one full turn through all three phases: the deterministic reads load from each store, the loop reaches outside only when it must, and the durable writes flow back when the turn ends.

the system

One agent, one Postgres, three stores

Everything the agent remembers lives in three shapes inside a single database: a SQL table for exact, ordered reads, a vector index for meaning, and a graph for connection. Here is the whole system at rest; the steps below follow one turn through it.

step 1 · before

Deterministic reads load the context

Before the model thinks, core memory loads by a fixed rule, not by the model's choice: recent conversation from the SQL table, relevant knowledge from the vector index, named entities from the graph. This is the context the turn starts from.

step 2 · the loop

The model reasons, and reaches out just in time

Inside the loop the model reasons over that context. Only when local memory comes up thin does it reach outside, a web search or a document fetch, just in time rather than just in case. A lean prompt reasons better than a stuffed one.

step 3 · after

Durable writes, only when the turn ends

When the turn produces a final answer, the agent writes what it learned back to the stores: the exchange, the workflow it followed, any new entities, and a reflective lesson. Writing only at the stop condition keeps half-finished reasoning out of long-term memory.

tools are memory

The tools live in the vector index too

The catalog of tools is procedural memory. Each tool's description is an embedding in the same vector index, so the agent retrieves the few tools that match the task instead of sending every schema to the model. The catalog can grow without bloating the prompt.

Diagram (agent_memory): read, loop, write. The agent (assemble · reason · act) sits between the web and one Postgres holding three stores: a SQL table (conversation · tool log), a vector index (knowledge · workflow), and a graph (entities · relations).

In code, the three phases are the top, middle, and bottom of one function. The model chooses only what happens inside the loop; everything around it is fixed.

run_turn.py
MAX_STEPS = 8  # assumed cap on loop iterations; tune per agent

def run_turn(query, thread_id):
    # 1. Deterministic reads: assemble context before the model thinks.
    query_vec = embed(query)                               # embed once, reuse below
    context = assemble(
        recent    = sql_recent_messages(thread_id),        # exact, by thread
        knowledge = knn("knowledge", query_vec, k=5),      # semantic top-K
        workflow  = knn("workflow",  query_vec, k=3),
        entities  = graph_neighborhood(query),             # graph traversal
    )
    write_message(thread_id, "user", query)                # durable, every turn

    # 2. The loop: reason, act, fetch just in time.
    messages = [system_prompt(), user(context, query)]
    for _ in range(MAX_STEPS):
        step = model(messages)
        if step.is_final:
            break
        result = call_tool(step)        # e.g. expand_summary(id), web_search(q)
        messages += [step, tool_result(result)]
    else:
        # Step budget exhausted without a final answer: skip the durable writes
        # so half-finished reasoning never reaches long-term memory.
        raise RuntimeError("no final answer within MAX_STEPS")

    # 3. Stop condition: persist what the turn learned.
    write_message(thread_id, "assistant", step.text)
    upsert_workflow(query, steps_taken=messages)
    upsert_entities(extract_entities(step.text))
    write_lesson(reflect(query, step.text))               # reflective memory
    return step.text

Read the structure, not the syntax: reads at the top, a tight loop in the middle, writes at the bottom. Because the reads and writes are deterministic, every run starts and ends the same way, which is exactly what makes a misbehaving agent possible to debug.
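One way to enforce the write-at-the-end rule is to stage writes during the turn and flush them only at the stop condition. The sketch below is an assumption about how that could look, not the article's implementation: TurnWrites, finish_turn, and the list standing in for the database are all hypothetical names.

```python
from typing import Optional


class TurnWrites:
    """Collects writes during a turn; nothing reaches the store until commit()."""

    def __init__(self, store: list):
        self._store = store    # stand-in for the database
        self._pending = []     # writes staged during the turn

    def stage(self, kind: str, payload: str) -> None:
        self._pending.append((kind, payload))

    def commit(self) -> None:
        """Flush staged writes; call once a final answer exists."""
        self._store.extend(self._pending)
        self._pending.clear()

    def discard(self) -> None:
        """Drop staged writes; call when the turn fails or runs out of steps."""
        self._pending.clear()


def finish_turn(writes: TurnWrites, final_answer: Optional[str]) -> bool:
    """Commit only when the turn produced a final answer."""
    if final_answer is None:
        writes.discard()
        return False
    writes.commit()
    return True


store: list = []

failed = TurnWrites(store)
failed.stage("lesson", "half-finished thought")
finish_turn(failed, None)              # nothing is persisted

ok = TurnWrites(store)
ok.stage("workflow", "search, then book")
ok.stage("lesson", "Ask for the deadline before drafting.")
finish_turn(ok, "Booked.")             # both writes land together
```

In Postgres, wrapping the end-of-turn writes in a single transaction gives the same all-or-nothing behavior.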

Keeping memory from rotting

A memory that only grows eventually buries the agent in its own history. So writing needs discipline, the same way human memory keeps what matters and lets the trivial fade.

Forgetting. Not every fact earns a permanent place. A durable preference is worth a write; a one-off aside is not. Deciding what to keep takes judgment, so it is one of the few memory operations that belongs to the model rather than to a rule.

Forgetting is a feature

The instinct is to save everything, just in case. But noise retrieved is noise reasoned over. An agent that writes selectively keeps its future retrievals sharp; an agent that hoards slowly poisons its own context.

Reflection. After a hard task, the agent writes down what it learned. The next time a similar task begins, that lesson loads with the rest of memory, and the agent avoids repeating the mistake. This is the loop that lets an agent get better at a job over time instead of starting fresh each morning.
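The reflection loop can be sketched in a few lines. This is a simplified illustration under stated assumptions: lessons are keyed by a hypothetical task_kind label instead of an embedding, and "hard" is taken to mean the run failed or needed more than one attempt.

```python
from collections import defaultdict


class LessonStore:
    """Keeps reflective lessons, grouped by the kind of task they came from."""

    def __init__(self) -> None:
        self._by_kind: dict[str, list[str]] = defaultdict(list)

    def reflect(self, task_kind: str, succeeded: bool, attempts: int, lesson: str) -> bool:
        """Record a lesson only for hard runs; return whether the run counted as hard."""
        hard = (not succeeded) or attempts > 1   # assumed definition of a hard run
        if hard and lesson not in self._by_kind[task_kind]:
            self._by_kind[task_kind].append(lesson)
        return hard

    def lessons_for(self, task_kind: str) -> list[str]:
        """Lessons to load alongside the rest of memory before a similar task."""
        return list(self._by_kind.get(task_kind, []))


lessons = LessonStore()
lessons.reflect("draft_email", succeeded=True, attempts=1, lesson="Nothing to note.")
lessons.reflect(
    "draft_email", succeeded=False, attempts=3,
    lesson="Ask for the deadline before drafting.",
)
```

The easy run leaves no trace, while the hard one leaves a single lesson that the next draft_email task would load before the model starts reasoning.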

The takeaway

Three questions carry the whole design. The taxonomy answers what to store: seven kinds, three of them borrowed from how people remember and four added by building agents. Postgres answers where it goes: exact reads in a SQL table, semantic reads (knowledge, and the tools to act on it) in a pgvector index, relationships in a graph, all in one database. The loop answers when to touch it: deterministic reads before the model thinks, just-in-time reads while it works, durable writes when it stops. Get those three right and you have crossed most of the distance between a demo that forgets you and a system that does not.