How AI agents remember

An agent with no memory is a brilliant stranger. It reasons well, calls tools, returns an answer, and then forgets everything the moment the turn ends. Memory is the fourth capability, alongside perception, reasoning, and action, and it is the one that turns a clever demo into a system you can rely on across days and sessions. This article is a map of three questions: what an agent stores, where each kind of memory lives in a database, and when the agent reaches for it. It closes on a fourth: when one Postgres stops being enough.

Here is one turn of a memory-aware agent at work. The bright lines are the conversation and its tools; the dim lines beside them are the memory it reads before acting and writes when it stops. The rest of this piece explains those tags.

[Interactive figure: agent_session. One turn that reads, then writes, memory. Prompt shown: "triage the spike in checkout 500s and propose a fix".]

The stateless default

A stateless agent reads its input, reasons over it, and produces output, with nothing carried between calls. That is fine for a one-shot tool and useless for anything real. Without memory an agent cannot run a task that spans many steps, cannot recognize a user who returns tomorrow, and cannot learn from a mistake it made an hour ago. The usual patch, pasting the entire history into every prompt, only delays the failure: the context window fills, latency climbs, cost grows with every turn, and eventually the window overflows.

Memory is the alternative to stuffing. Instead of carrying everything in the prompt, the agent persists what matters outside the context window and retrieves only the slice a given turn needs. The rest of this piece is about how to do that well.

Seven kinds of memory

"Memory" is not one thing. It helps to split it into kinds, because each kind answers a different question and, as we will see, wants a different home in the database.

Three kinds are borrowed straight from how people remember. Semantic memory holds facts: what is true about the user, the domain, and the world. Episodic memory holds experiences: a time-ordered record of what happened and what the agent did. Procedural memory holds skills: how to carry out a task without re-deriving it each time.

Four more kinds were added by the practice of building agents. Working memory is the scratchpad for the task in flight. Entity/graph memory is a map of the things an agent has seen and how they connect. Reflective memory is the lessons an agent draws from its own past runs. Shared memory is a pool that several agents read and write in common.

memory_types: 3 classic, 4 from the agent era

Semantic (facts)
  Stored: Durable facts about the user, the domain, and the world.
  Extracted: Stable preferences and attributes pulled from the chat.
  In a chatbot: "The user is vegetarian and prefers metric units."
  In an agent: A knowledge base the agent retrieves over (RAG).

Episodic (experiences)
  Stored: A time-ordered record of what happened and what was done.
  Extracted: A short note per turn or event, kept in order.
  In a chatbot: "Yesterday you helped me draft an email to Sam."
  In an agent: A run log of past executions, for replay and audit.

Procedural (skills)
  Stored: How to carry out a task, plus the catalog of tools.
  Extracted: Reusable steps and workflows that worked before.
  In a chatbot: The system prompt and persona rules.
  In an agent: A learned tool sequence (workflow memory) and the toolbox.

Working (task state, new)
  Stored: The scratchpad for the task currently in flight.
  Extracted: The current goal, plan, and intermediate results.
  In a chatbot: The active conversation in the context window.
  In an agent: Plan and partial results held across loop steps.

Entity / graph (relationships, new)
  Stored: Entities and the typed edges between them.
  Extracted: People, systems, and orgs, and how they connect.
  In a chatbot: "Sam is your manager; the thread was about Q3."
  In an agent: A knowledge graph the agent traverses (Graphiti, Zep).

Reflective (lessons, new)
  Stored: Lessons the agent draws from its own past runs.
  Extracted: What went wrong, and the rule that avoids it next time.
  In a chatbot: "Ask for the deadline before drafting."
  In an agent: A lessons file loaded before similar tasks (Reflexion).

Shared (common pool, new)
  Stored: A pool several agents read and write in common.
  Extracted: Facts and state shared across the team.
  In a chatbot: A team-wide FAQ every bot reads from.
  In an agent: Multi-agent crew state, a shared blackboard.

Seven kinds, three borrowed from how people remember and four added by building agents. For each: what gets stored, what the agent extracts, and how it shows up in a chatbot versus a broader agentic system.

Shared memory is the multi-agent case

The first six kinds are written by a single agent. Shared memory is the exception: a pool several agents read and write at once. Keep it as an ordinary namespaced table (a scope column every agent filters on), but because there are many writers it is the one place you need concurrency control: row-level locks (select ... for update) or an optimistic version column so two agents do not overwrite each other. The rest of this article assumes a single writer.
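As a rough illustration of the optimistic version column, the sketch below uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere. The shared_memory table, its columns, and the write_shared helper are assumptions made for the example, not names from a real library. The idea is a compare-and-swap: a write succeeds only if the version is still the one the agent read.

```python
import sqlite3

def write_shared(conn, scope, name, value, expected_version):
    """Compare-and-swap: update only if nobody wrote since we read the row."""
    cur = conn.execute(
        "update shared_memory set value = ?, version = version + 1 "
        "where scope = ? and name = ? and version = ?",
        (value, scope, name, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another agent won the race

# Hypothetical shared table: a scope column every agent filters on, plus a version.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table shared_memory ("
    " scope text not null, name text not null, value text not null,"
    " version integer not null default 0, primary key (scope, name))"
)
conn.execute(
    "insert into shared_memory (scope, name, value) values ('crew-1', 'plan', 'draft')"
)
conn.commit()

# Two agents both read version 0, then both try to write.
first = write_shared(conn, "crew-1", "plan", "agent A's plan", expected_version=0)
second = write_shared(conn, "crew-1", "plan", "agent B's plan", expected_version=0)
stored = conn.execute(
    "select value, version from shared_memory where scope = 'crew-1' and name = 'plan'"
).fetchone()
```

The losing agent gets False back and should re-read the row and retry, rather than overwrite what the first agent stored.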

These seven kinds are not seven databases. They collapse into a handful of storage shapes, and a single turn touches several of them at once, which is what the rest of this piece shows.

Where memory lives

Here is the engineering core, and it is simpler than the catalog of kinds suggests. The access pattern picks the backend. Ask one question of every read: do I know exactly which rows I want, or only that they should resemble something? The answer routes the data to one of three homes, all of which are just Postgres.

Exact lookups go in a SQL table. Conversational memory is the clearest case. You want this thread's most recent messages, in order. That is an index range scan, not a similarity search.

messages.sql

create table messages (
  id          bigint generated always as identity primary key,
  thread_id   text        not null,
  role        text        not null,          -- 'user' | 'assistant'
  content     text        not null,
  created_at  timestamptz not null default now()
);
 
create index on messages (thread_id, created_at desc);

Reading the recent turns is a single indexed query, no embeddings involved:

recent.sql

select role, content
from messages
where thread_id = $1
order by created_at desc
limit 20;

A tool log or event log has the same shape: rows you fetch by id and time, for replay and audit.
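The query above returns the newest rows first, while a prompt usually wants them oldest first. A minimal sketch of that last step follows, using sqlite3 as a stand-in for Postgres with a simplified messages table; the recent_messages helper and the id tiebreak for equal timestamps are assumptions added for the example.

```python
import sqlite3

def recent_messages(conn, thread_id, limit=20):
    """Fetch the newest `limit` messages, then return them oldest-first."""
    rows = conn.execute(
        "select role, content from messages "
        "where thread_id = ? "
        "order by created_at desc, id desc "  # id breaks ties between equal timestamps
        "limit ?",
        (thread_id, limit),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in reversed(rows)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table messages ("
    " id integer primary key, thread_id text not null, role text not null,"
    " content text not null, created_at integer not null)"
)
conn.executemany(
    "insert into messages (thread_id, role, content, created_at) values (?, ?, ?, ?)",
    [
        ("t1", "user", "checkout is failing", 100),
        ("t1", "assistant", "looking at the logs", 100),  # same timestamp as above
        ("t2", "user", "unrelated thread", 101),
        ("t1", "user", "any update?", 102),
    ],
)
window = recent_messages(conn, "t1", limit=2)
```

Here window holds the last two messages of thread t1 in chronological order, ready to append to a prompt.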

Semantic recall goes in a vector index. When you only know the meaning you want, store an embedding of the content and search by distance. The pgvector extension adds a vector column type and distance operators to plain Postgres.

knowledge.sql

create extension if not exists vector;
 
create table knowledge (
  id        bigint generated always as identity primary key,
  content   text        not null,
  embedding vector(768) not null,
  metadata  jsonb       not null default '{}'
);
 
-- An approximate-nearest-neighbor index (HNSW) over cosine distance.
create index on knowledge using hnsw (embedding vector_cosine_ops);

Retrieval pulls the top-K rows closest to the query embedding. The <=> operator is cosine distance:

recall.sql

select content, metadata
from knowledge
order by embedding <=> $1   -- $1 is the query embedding
limit 5;

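Because it is still SQL, you can scope the search with an ordinary filter on the JSONB metadata in the same query. Note that with an approximate index the filter is applied after the index scan, so a very selective filter can return fewer than five rows: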

filtered.sql

select content
from knowledge
where metadata @> '{"source": "papers"}'
order by embedding <=> $1
limit 5;
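To make the <=> operator concrete, here is a small pure-Python sketch of what it computes: cosine distance, which is one minus cosine similarity. The top_k function is an exact brute-force scan over toy three-dimensional vectors, written for illustration; an HNSW index approximates the same ordering without comparing against every row.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for same direction, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def top_k(query, rows, k=5):
    """Exact nearest neighbors: sort every row by distance to the query."""
    return sorted(rows, key=lambda row: cosine_distance(query, row["embedding"]))[:k]

# Toy rows; real embeddings would have hundreds of dimensions.
rows = [
    {"content": "refund policy", "embedding": [1.0, 0.0, 0.0]},
    {"content": "shipping times", "embedding": [0.0, 1.0, 0.0]},
    {"content": "returns and refunds", "embedding": [0.9, 0.1, 0.0]},
]
nearest = top_k([1.0, 0.0, 0.0], rows, k=2)
```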

Knowledge, learned workflows, the tool catalog, entity descriptions, and reflective lessons all share this layout: a content column, an embedding, and some metadata.

One caveat on that layout: vector(768) pins a specific embedding model. Change the model and the dimension may change with it, and even at the same dimension, vectors from different models are not comparable. So version the embedding (a model column, or a parallel index) and plan a backfill rather than migrating in place.

Relationships go in a graph. Entity memory is not really about similarity, it is about connection. Model entities and typed edges as two tables, then walk the graph with a recursive query.

graph.sql

create table entities (
  id        bigint generated always as identity primary key,
  name      text not null,
  kind      text,
  embedding vector(768)          -- so an entity can also be found by meaning
);
 
create table edges (
  src bigint not null references entities(id),
  rel text   not null,
  dst bigint not null references entities(id),
  primary key (src, rel, dst)    -- no duplicate edges; its index also serves lookups by src
);

Pulling the two-hop neighborhood around an entity is a recursive CTE:

neighborhood.sql

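with recursive nbrhood as (
  select id, name, 0 as depth
  from entities
  where name = $1
  union all
  select e.dst, t.name, n.depth + 1
  from nbrhood n
  join edges e    on e.src = n.id
  join entities t on t.id  = e.dst
  where n.depth < 2               -- the depth cap also keeps cycles from looping forever
)
-- An entity reachable by several paths is reported once, at its shortest depth.
select name, min(depth) as depth
from nbrhood
group by id, name
order by depth;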
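The same traversal can be sketched in plain Python as a breadth-first search, which may make the logic of the recursive query easier to follow. The neighborhood function and the sample edges are hypothetical; like the query, it follows outgoing edges only, and it records each entity at the shortest depth at which it is reached.

```python
from collections import deque

def neighborhood(edges, start, max_depth=2):
    """Breadth-first walk over (src, rel, dst) edges, up to max_depth hops."""
    adjacency = {}
    for src, _rel, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand past the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in depth:  # first visit is the shortest path in BFS
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

edges = [
    ("checkout", "calls", "payments"),
    ("checkout", "calls", "inventory"),
    ("payments", "calls", "ledger"),
    ("ledger", "calls", "warehouse"),    # three hops from checkout: excluded
    ("payments", "calls", "checkout"),   # a cycle: must not loop or duplicate
]
result = neighborhood(edges, "checkout")
```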

Working memory is the exception. The three homes above hold memory that outlives a turn. Working memory, the scratchpad for the task in flight, does not: it is the run's own state, the plan and partial results the loop carries between steps. It lives in the orchestrator, not the long-term stores. You persist it as a checkpoint (a LangGraph PostgresSaver, for instance) so a paused or crashed run resumes where it stopped, not so you can recall it next week.
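To show the shape of a checkpoint without depending on any framework, here is a minimal sketch using sqlite3 and JSON. It is not the LangGraph API; the checkpoints table and the save_checkpoint, load_checkpoint, and run helpers are assumptions made for the example. The run saves its state after every step, so a restart picks up at the first unfinished step.

```python
import json
import sqlite3

def save_checkpoint(conn, run_id, state):
    """Overwrite the single saved state for this run."""
    conn.execute(
        "insert or replace into checkpoints (run_id, state) values (?, ?)",
        (run_id, json.dumps(state)),
    )
    conn.commit()

def load_checkpoint(conn, run_id):
    row = conn.execute(
        "select state from checkpoints where run_id = ?", (run_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

def run(conn, run_id, plan):
    """Execute the plan, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(conn, run_id) or {"step": 0, "results": []}
    while state["step"] < len(plan):
        state["results"].append(f"done: {plan[state['step']]}")  # stand-in for real work
        state["step"] += 1
        save_checkpoint(conn, run_id, state)
    return state

conn = sqlite3.connect(":memory:")
conn.execute("create table checkpoints (run_id text primary key, state text not null)")

# Simulate a run that crashed after finishing its first step.
save_checkpoint(conn, "run-1", {"step": 1, "results": ["done: read logs"]})
final = run(conn, "run-1", ["read logs", "find regression", "propose fix"])
```

The resumed run skips the step that already completed and finishes the remaining two.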

Tools are memory, too. As an agent gains tools, sending every tool's schema to the model on every call is the same mistake as stuffing the prompt: it is noise the model has to read past. Treat the tool catalog as procedural memory. Store each tool's description as an embedding, exactly like the knowledge table, and retrieve only the few tools whose description is closest to the task at hand.

semantic_tool_retrieval: intent → top 3 tools

Task: "Book me a flight and a hotel for Berlin next month."

  book_flight       reserve a seat on a flight        kept
  find_hotel        search hotels by city and date    kept
  convert_currency  convert between currencies        kept
  send_email        send a message to a recipient
  run_sql           query the production database
  search_papers     find academic papers by topic
  summarize_doc     condense a long document
  create_ticket     open a tracking ticket

Tools available: 8. Schemas sent to model: 3 of 8.

Tools are procedural memory: store each tool's description as an embedding, then retrieve only the few that match the task instead of sending every schema to the model. The catalog can grow without bloating the prompt.

The retrieval is the same <=> nearest-neighbor search you already run for knowledge; only the rows differ. A catalog of two hundred tools costs the model nothing if the agent loads the three it needs, and the model reasons better when it is not handed a wall of schemas it will never call. This only pays off past a few dozen tools, though. Under roughly thirty, send them all: retrieval adds latency and a failure mode for no real benefit.
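A minimal sketch of that rule, under stated assumptions: the select_tools function, the two-dimensional toy embeddings, and the SEND_ALL_BELOW constant (set to the rough threshold of thirty mentioned above) are all hypothetical. Small catalogs are sent whole; larger ones are ranked by cosine distance to the task embedding.

```python
import math

SEND_ALL_BELOW = 30  # rough threshold from the text; tune it for your own agent

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def select_tools(task_embedding, catalog, k=3, send_all_below=SEND_ALL_BELOW):
    """Return the tool names whose schemas should be sent to the model."""
    if len(catalog) < send_all_below:
        return [tool["name"] for tool in catalog]  # small catalog: skip retrieval
    ranked = sorted(
        catalog, key=lambda tool: cosine_distance(task_embedding, tool["embedding"])
    )
    return [tool["name"] for tool in ranked[:k]]

# Toy embeddings: the first axis loosely means "travel", the second "back office".
catalog = [
    {"name": "book_flight", "embedding": [1.0, 0.0]},
    {"name": "run_sql", "embedding": [0.0, 1.0]},
    {"name": "find_hotel", "embedding": [0.9, 0.3]},
    {"name": "send_email", "embedding": [0.1, 0.9]},
]
task = [1.0, 0.1]  # stands in for the embedding of a travel-booking request

everything = select_tools(task, catalog)                     # 4 tools < 30: send all
top_two = select_tools(task, catalog, k=2, send_all_below=2)  # force retrieval
```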

One database, not three

Notice that nothing above leaves Postgres. The vector index is an extension, the graph is two ordinary tables, and the conversation log is a plain table with a B-tree. You do not need a dedicated vector store or a separate graph engine to ship a memory-aware agent; you need one Postgres and a clear idea of which question each table answers.

It is tempting to embed everything and search by similarity for all of it. Resist that. Asking a vector index for "this thread's last ten messages, in order" is both slow and wrong: it returns rows that are merely similar, not the ones you asked for. Exact, ordered reads are a B-tree's job. Similarity is the vector index's job. Connection is the graph's job. Match the backend to the question and most of the hard decisions disappear.

When the agent remembers

Knowing where memory lives is half the design. The other half is timing, and it is where most agents go wrong. A single turn touches memory at three distinct moments, and keeping them separate is what makes an agent both cheap and debuggable.

First, deterministic reads, before the model thinks. An agent cannot ask for what it does not know exists, so the core stores load on every turn by a fixed rule, not by the model's choice: recent conversation by thread, relevant knowledge by similarity, similar past workflows, and the entities named in the query. This is the context the model starts from.

Deterministic by default, adaptive when recall matters

Loading by a fixed rule is a deliberate trade. It is predictable and easy to debug, but it leaves recall on the table: the model can often phrase a query better than the user did. The alternatives (query rewriting, HyDE, multi-query, or plan-then-retrieve) let the model shape retrieval, at the cost of determinism and extra calls. Start deterministic, and reach for adaptive retrieval on the queries where recall actually matters.

Second, just-in-time reads, inside the loop. Expensive or conditional fetches wait until the model actually asks. A web search fires only when local memory comes up thin, and a full document loads only when the model decides it needs the detail. A lean prompt reasons better than a stuffed one, so the agent pulls detail just in time rather than just in case.

Third, durable writes, at the stop condition. When the turn produces a final answer, the agent persists what it learned: the exchange, the workflow it followed, any new entities, and a reflective lesson. Writing only at the end keeps half-finished reasoning out of long-term memory.

The system below traces one full turn through all three phases: the deterministic reads load from each store, the loop reaches outside only when it must, and the durable writes flow back when the turn ends.

[Interactive figure: "One agent, one Postgres, three stores" (agent_memory: read → loop → write). Everything the agent remembers lives in three shapes inside a single database: a SQL table for exact, ordered reads, a vector index for meaning, and a graph for connection. Panels: step 1, before: deterministic reads load the context. Step 2, the loop: the model reasons, and reaches out just in time. Step 3, after: durable writes, only when the turn ends. Tools are memory: the tools live in the vector index too.]

In code, the three phases are the top, middle, and bottom of one function. The model chooses only what happens inside the loop; everything around it is fixed.

run_turn.py

def run_turn(query, thread_id):
    # Helper names (assemble, knn, embed, ...) are placeholders for your own code.
    # 1. Deterministic reads: assemble context before the model thinks.
    query_vec = embed(query)                              # embed once, reuse for both searches
    context = assemble(
        recent    = sql_recent_messages(thread_id),       # exact, by thread
        knowledge = knn("knowledge", query_vec, k=5),     # semantic top-K
        workflow  = knn("workflow",  query_vec, k=3),
        entities  = graph_neighborhood(query),            # graph traversal
    )
    write_message(thread_id, "user", query)               # durable, every turn
 
    # 2. The loop: reason, act, fetch just in time.
    messages = [system_prompt(), user(context, query)]
    for _ in range(MAX_STEPS):
        step = model(messages)
        if step.is_final:
            break
        result = call_tool(step)        # e.g. expand_summary(id), web_search(q)
        messages += [step, tool_result(result)]
    else:
        # Step budget ran out with no final answer: do not persist a tool call as the reply.
        raise RuntimeError("no final answer within MAX_STEPS")
 
    # 3. Stop condition: persist what the turn learned.
    write_message(thread_id, "assistant", step.text)
    upsert_workflow(query, steps_taken=messages)
    upsert_entities(extract_entities(step.text))
    if turn_was_hard(step):                               # not every turn: reflection costs a model call
        write_lesson(reflect(query, step.text))           # reflective memory, can run async
    return step.text

Read the structure, not the syntax: reads at the top, a tight loop in the middle, writes at the bottom. Because the reads and writes are deterministic, every run starts and ends the same way, which is exactly what makes a misbehaving agent possible to debug.

One line in there hides real work: assemble. The four reads can return more than fits the prompt, so assembling context is a budget problem, not a concatenation. Give each store a token quota, dedupe across them (the same fact often surfaces from both knowledge and the graph), and set a priority for conflicts: an explicit user statement outranks an inferred one, and fresh outranks stale. The model only ever sees what survives that pass.
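One way that budgeting pass might look is sketched below. Everything here is an assumption made for illustration: the item shape (text, explicit, ts), the whitespace token count standing in for a real tokenizer, and dedupe by normalized text only. Detecting that two differently worded facts contradict each other needs more than this.

```python
def count_tokens(text):
    return len(text.split())  # crude stand-in for a real tokenizer

def assemble(stores, quotas):
    """Dedupe across stores, then fill each store's token quota by priority.

    stores: {store_name: [{"text": str, "explicit": bool, "ts": number}, ...]}
    quotas: {store_name: max_tokens}
    """
    best = {}
    for store, items in stores.items():
        for item in items:
            key = " ".join(item["text"].lower().split())  # normalize case and spacing
            rank = (item["explicit"], item["ts"])          # explicit first, then fresh
            if key not in best or rank > best[key][0]:
                best[key] = (rank, store, item)

    context = {store: [] for store in stores}
    used = {store: 0 for store in stores}
    ordered = sorted(best.values(), key=lambda entry: entry[0], reverse=True)
    for _rank, store, item in ordered:
        cost = count_tokens(item["text"])
        if used[store] + cost <= quotas[store]:
            context[store].append(item["text"])
            used[store] += cost
    return context

stores = {
    "knowledge": [
        {"text": "User prefers metric units", "explicit": True, "ts": 3},
        {"text": "Checkout service calls payments", "explicit": False, "ts": 1},
    ],
    "entities": [
        {"text": "checkout service calls payments", "explicit": False, "ts": 2},
        {"text": "Payments service writes to the ledger", "explicit": False, "ts": 1},
    ],
}
context = assemble(stores, quotas={"knowledge": 10, "entities": 4})
```

In this run the duplicated fact survives once, in its fresher copy from the entities store, and the ledger fact is dropped because it would exceed that store's quota.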

Keeping memory from rotting

A memory that only grows eventually buries the agent in its own history. So writing needs discipline, the same way human memory keeps what matters and lets the trivial fade.

Forgetting. Not every fact earns a permanent place. A durable preference is worth a write; a one-off aside is not. Deciding what to keep takes judgment, so it is one of the few memory operations that belongs to the model rather than to a rule.

Forgetting is a feature

The instinct is to save everything, just in case. But noise retrieved is noise reasoned over. An agent that writes selectively keeps its future retrievals sharp; an agent that hoards slowly poisons its own context.

Consolidation. Forgetting prunes; consolidation compresses. An episodic log that only grows will eventually dwarf its own useful signal, and you cannot delete your way out of that. So fold old episodes upward: summarize a run into a paragraph, a week of runs into a page, the way human memory keeps the gist and lets the transcript fade. Tiered schemes like this, recent detail in full and older memory as recursive summaries, are what let a long-lived agent stay both informed and small.
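A tiered scheme of that kind could be sketched as follows. The consolidate function, the episode dictionaries, and the naive_summarize placeholder are hypothetical; in practice the summarizer would be a model call. Because an earlier summary can itself be folded into the next one, the summaries are recursive.

```python
def consolidate(episodes, keep_recent, summarize):
    """Keep the newest `keep_recent` entries in full; fold the rest into one summary."""
    cut = len(episodes) - keep_recent
    if cut <= 0:
        return list(episodes)  # nothing old enough to fold
    old, recent = episodes[:cut], episodes[cut:]
    summary = {
        "kind": "summary",
        "text": summarize([entry["text"] for entry in old]),
        # An old summary already stands for several episodes, so add its count.
        "covers": sum(entry.get("covers", 1) for entry in old),
    }
    return [summary] + recent

def naive_summarize(texts):
    # Placeholder: a real system would ask a model for the gist.
    return f"{len(texts)} earlier entries, starting with: {texts[0]}"

episodes = [{"kind": "episode", "text": f"run {i}"} for i in range(1, 6)]
log = consolidate(episodes, keep_recent=2, summarize=naive_summarize)

# Two more runs arrive, and the log is consolidated again.
log = log + [{"kind": "episode", "text": "run 6"}, {"kind": "episode", "text": "run 7"}]
log2 = consolidate(log, keep_recent=2, summarize=naive_summarize)
```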

Reflection. After a hard task, the agent writes down what it learned. The next time a similar task begins, that lesson loads with the rest of memory, and the agent avoids repeating the mistake. This is the loop that lets an agent get better at a job over time instead of starting fresh each morning. Reflection is a model call, though, so it does not belong on every turn. Run it only when a turn was hard, failed, or surfaced something new, and run it out of band when you can, so it never sits on the response path the user is waiting for.
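The gating and the out-of-band execution can be sketched with a queue and a background worker. The should_reflect heuristic, its step threshold, the turn dictionary, and the reflect placeholder are all assumptions for the example; a real reflect would be a model call, and a production system would more likely use a job queue than a thread.

```python
import queue
import threading

HARD_STEP_COUNT = 6  # assumed threshold for "this turn was hard"

def should_reflect(turn):
    """Reflect only when the turn failed, ran long, or surfaced something new."""
    return turn["failed"] or turn["steps"] >= HARD_STEP_COUNT or turn["new_entities"] > 0

def reflect(turn):
    return f"lesson from: {turn['query']}"  # placeholder for a model call

lessons = []
reflection_queue = queue.Queue()

def worker():
    while True:
        turn = reflection_queue.get()
        if turn is None:  # sentinel: shut the worker down
            break
        lessons.append(reflect(turn))

def finish_turn(turn):
    if should_reflect(turn):
        reflection_queue.put(turn)  # returns immediately; the user is not kept waiting
    return turn["answer"]

thread = threading.Thread(target=worker)
thread.start()
finish_turn({"query": "convert 5 km to miles", "answer": "3.1 mi",
             "failed": False, "steps": 1, "new_entities": 0})
finish_turn({"query": "triage checkout 500s", "answer": "roll back the deploy",
             "failed": False, "steps": 9, "new_entities": 2})
reflection_queue.put(None)
thread.join()
```

Only the second turn produces a lesson, and the reply is returned before the reflection runs.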

When one Postgres is not enough

The design so far is deliberately monolithic: one Postgres, three access patterns. It will carry you much further than the urge to add infrastructure suggests. But it is not the end state for every system. Three axes each have a line where a specialized backend stops being premature and starts being correct. Be honest about where yours sit.

when_to_graduate (vector scale shown): postgres → first lever → graduate

1. Stay on Postgres: Postgres + pgvector (HNSW). Up to ~1M vectors, modest QPS. Transactional rows and vector search live in one database.
2. First lever: tune HNSW (m, ef_construction, ef_search). When recall or latency slips first. Defaults are conservative; tuning buys a lot before any migration.
3. Graduate: dedicated vector DB (Qdrant, Weaviate, Milvus). Past ~10M vectors or high QPS. Quantization, sharding, and filtered search that stay fast at scale.

pgvector comfort: up to ~1M. Graduate near: 10M+ or high QPS.

One Postgres is the right default, and it goes further than most expect. Each axis has a line where a specialized backend earns its keep. Migrate an axis only when you measure its line, not before.

On vector scale, pgvector with a tuned HNSW index is comfortable into the low millions of rows. Past that, or under high query load, a dedicated vector database (Qdrant, Weaviate, Milvus) pays for itself: quantization, sharding, and filtered search that stay fast where Postgres starts to strain.

On graph depth, two tables and a recursive CTE are fine for shallow, occasional traversals. They break when edges need weights or timestamps, when traversals run deep and hot, or when you want graph algorithms like community detection or centrality that SQL will not express well. That is where a property-graph database (Neo4j, Memgraph) and Cypher earn their place, and where a temporal knowledge-graph layer (Graphiti, Zep) keeps facts that were only true for a while. In relationship-heavy, time-aware work, this is usually the first axis to graduate.

On retrieval quality, cosine similarity alone is a simplification. Production retrieval is typically hybrid: lexical search (BM25) fused with dense vectors to catch the exact terms embeddings miss, then a cross-encoder reranker over the top candidates. Pair that with adaptive retrieval (query rewriting, HyDE, or multi-query) when a single pass leaves recall on the table.
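The text does not say how the lexical and dense results are fused. One common choice is reciprocal rank fusion (RRF), sketched here as an illustration rather than as the article's method. It needs only the two ranked lists of ids, not their raw scores; the constant k=60 is the conventional default, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

lexical = ["d3", "d1", "d2"]  # e.g. a BM25 ranking
dense = ["d1", "d4", "d3"]    # e.g. a vector-distance ranking
fused = reciprocal_rank_fusion([lexical, dense])
```

Documents that appear in both lists rise to the top, and the fused list is what a reranker would then score.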

Start simple, graduate on evidence

None of this argues against starting on one Postgres. It argues against guessing. Build the simple design, measure where it strains, and graduate the one axis that crosses its line, not all three on a hunch. Premature specialization costs more than it saves.

The takeaway

Three questions carry the whole design. The taxonomy answers what to store: seven kinds, three of them borrowed from how people remember and four added by building agents. Postgres answers where it goes: exact reads in a SQL table, semantic reads (knowledge, and the tools to act on it) in a pgvector index, relationships in a graph, all in one database. The loop answers when to touch it: deterministic reads before the model thinks, just-in-time reads while it works, durable writes when it stops. A fourth question waits at scale: one Postgres is the right default, and you graduate a single axis (vector scale, graph depth, or retrieval quality) only once you have measured it past the line. Get those right and you have crossed most of the distance between a demo that forgets you and a system that does not.

Memory'si olmayan bir agent, zeki ama unutkan bir yabancı gibidir: güzel reasoning yapar, tool çağırır, cevabını döndürür, ama turn biter bitmez her şeyi unutur. Memory; perception, reasoning ve action'ın yanındaki dördüncü temel yetenektir ve bir agent'ı gösterişli bir demo olmaktan çıkarıp günler, hatta session'lar boyunca güvenebileceğin bir sisteme dönüştüren şey de tam olarak budur. Bu yazı üç sorunun haritasını çıkarıyor: agent neyi saklar, her memory türü database'in neresinde durur ve agent ona ne zaman başvurur. Dördüncü bir soruyla da kapanıyor: tek bir Postgres ne zaman yetmemeye başlar.

Aşağıda, memory'si olan bir agent'ın tek bir turn'ü iş başında. Parlak satırlar konuşma ve tool çağrıları; yanlarındaki soluk satırlar ise agent'ın harekete geçmeden önce okuduğu, işi bitince yazdığı memory. Yazının geri kalanı da bu etiketleri anlatıyor.

agent_sessionone turn · reads then writes memory

>triage the spike in checkout 500s and propose a fix

01 / 11

The stateless default

Stateless bir agent input'unu alır, üzerinde reasoning yapar ve bir output üretir; çağrılar arasında hiçbir şey taşımaz. Tek seferlik bir tool için bu yeterli, ama gerçek hiçbir iş için yetmez. Memory olmadan bir agent ne çok adımlı bir task'ı yürütebilir, ne yarın geri dönen kullanıcıyı tanıyabilir, ne de bir saat önce yaptığı hatadan ders çıkarabilir. Akla ilk gelen çözüm, yani tüm geçmişi her prompt'a yapıştırmak, sorunu yalnızca erteler: context window dolar, latency artar, maliyet her turn'le birlikte büyür ve er ya da geç window taşar.

Memory işte bu stuffing'in alternatifi. Agent her şeyi prompt'ta taşımak yerine önemli olanı context window'un dışında persist eder, o turn'ün ihtiyaç duyduğu kısmı da retrieve eder. Yazının geri kalanı bunu nasıl düzgün yapacağını anlatıyor.

Seven kinds of memory

"Memory" tek parça bir şey değil. Onu türlere ayırmak işe yarıyor, çünkü her tür farklı bir soruya cevap veriyor ve birazdan göreceğimiz gibi database'de farklı bir yere oturuyor.

Bunların üçü doğrudan insan hafızasından geliyor. Semantic memory, facts'i tutar: kullanıcı, domain ve dünya hakkında neyin doğru olduğunu. Episodic memory, yaşananları tutar: ne olduğunun ve agent'ın ne yaptığının zaman sıralı kaydını. Procedural memory ise beceriyi tutar: bir task'ı her seferinde sıfırdan türetmeden nasıl yapacağını.

Kalan dört türü ise agent geliştirme pratiği sonradan ekledi. Working memory, o an üzerinde çalışılan task'ın müsvedde defteri. Entity/graph memory, agent'ın gördüğü şeylerin ve aralarındaki bağların haritası. Reflective memory, agent'ın kendi geçmiş run'larından çıkardığı dersler. Shared memory ise birden fazla agent'ın ortaklaşa okuyup yazdığı bir havuz.

memory_types3 classic · 4 from the agent era

Semanticfacts

stored

Durable facts about the user, the domain, and the world.

extracted

Stable preferences and attributes pulled from the chat.

in a chatbot

"The user is vegetarian and prefers metric units."

in an agent

A knowledge base the agent retrieves over (RAG).

Episodicexperiences

stored

A time-ordered record of what happened and what was done.

extracted

A short note per turn or event, kept in order.

in a chatbot

"Yesterday you helped me draft an email to Sam."

in an agent

A run log of past executions, for replay and audit.

Proceduralskills

stored

How to carry out a task, plus the catalog of tools.

extracted

Reusable steps and workflows that worked before.

in a chatbot

The system prompt and persona rules.

in an agent

A learned tool sequence (workflow memory) and the toolbox.

Workingtask statenew

stored

The scratchpad for the task currently in flight.

extracted

The current goal, plan, and intermediate results.

in a chatbot

The active conversation in the context window.

in an agent

Plan and partial results held across loop steps.

Entity / graphrelationshipsnew

stored

Entities and the typed edges between them.

extracted

People, systems, and orgs, and how they connect.

in a chatbot

"Sam is your manager; the thread was about Q3."

in an agent

A knowledge graph the agent traverses (Graphiti, Zep).

Reflectivelessonsnew

stored

Lessons the agent draws from its own past runs.

extracted

What went wrong, and the rule that avoids it next time.

in a chatbot

"Ask for the deadline before drafting."

in an agent

A lessons file loaded before similar tasks (Reflexion).

Sharedcommon poolnew

stored

A pool several agents read and write in common.

extracted

Facts and state shared across the team.

in a chatbot

A team-wide FAQ every bot reads from.

in an agent

Multi-agent crew state, a shared blackboard.

Shared memory: birden fazla agent durumu

İlk altı tür tek bir agent tarafından yazılır. Shared memory istisna: birden fazla agent'ın aynı anda okuyup yazdığı bir havuz. Onu sıradan, namespace'li bir table olarak tut (her agent'ın filtrelediği bir scope column'u), ama çok sayıda writer olduğu için concurrency control'e ihtiyaç duyduğun tek yer burası: row-level lock'lar (select ... for update) ya da iki agent birbirini ezmesin diye optimistic bir version column. Yazının geri kalanı tek bir writer varsayıyor.

Bu yedi tür, yedi ayrı database demek değil. Hepsi birkaç temel storage biçimine iner; üstelik tek bir turn aynı anda birkaçına birden dokunur. Yazının geri kalanı da zaten bunu gösteriyor.

Where memory lives

İşin mühendislik özü burada ve türlerin listesine bakınca sanacağından çok daha basit. Backend'i seçen şey access pattern. Her read'de kendine tek bir soru sor: hangi row'ları istediğimi tam olarak biliyor muyum, yoksa sadece bir şeye benzemelerini mi istiyorum? Bu sorunun cevabı veriyi üç evden birine yönlendirir ve üçü de aslında sadece Postgres.

Tam bilinen lookup'lar SQL table'a gider. En net örnek conversational memory. Bu thread'in en son mesajlarını sırasıyla istiyorsun; bu bir index range scan işi, similarity search değil.

messages.sql

create table messages (
  id          bigint generated always as identity primary key,
  thread_id   text        not null,
  role        text        not null,          -- 'user' | 'assistant'
  content     text        not null,
  created_at  timestamptz not null default now()
);
 
create index on messages (thread_id, created_at desc);

Son turn'leri okumak için tek bir indexed query yeter, hiçbir embedding'e gerek yok:

recent.sql

select role, content
from messages
where thread_id = $1
order by created_at desc
limit 20;

Bir tool log ya da event log da aynı yapıda: id ve zamana göre çektiğin, replay ve audit için tuttuğun row'lar.

Anlamca yakın recall vector index'e gider. Elinde sadece aradığın anlam varsa, içeriğin embedding'ini saklar ve distance'a göre ararsın. pgvector extension'ı, düz Postgres'e bir vector column tipi ve distance operatörleri kazandırır.

knowledge.sql

create extension if not exists vector;
 
create table knowledge (
  id        bigint generated always as identity primary key,
  content   text        not null,
  embedding vector(768) not null,
  metadata  jsonb       not null default '{}'
);
 
-- An approximate-nearest-neighbor index (HNSW) over cosine distance.
create index on knowledge using hnsw (embedding vector_cosine_ops);

Retrieval, query embedding'ine en yakın top-K row'u getirir. <=> operatörü de cosine distance demek:

recall.sql

select content, metadata
from knowledge
order by embedding <=> $1   -- $1 is the query embedding
limit 5;

Sonuçta hâlâ SQL yazdığın için, aramayı aynı query içinde JSONB metadata üzerinde sıradan bir filter ile daraltabilirsin:

filtered.sql

select content
from knowledge
where metadata @> '{"source": "papers"}'
order by embedding <=> $1
limit 5;

Knowledge, öğrenilmiş workflow'lar, tool kataloğu, entity tanımları ve reflective ders'lerin hepsi aynı düzeni paylaşır: bir content column, bir embedding ve biraz metadata.

Bu düzene dair bir uyarı: vector(768) belirli bir embedding modelini sabitler. Modeli değiştirdiğinde boyut ve saklı vektörler de onunla değişir; o yüzden embedding'i versiyonla (bir model column'u ya da paralel bir index) ve yerinde migration yerine bir backfill planla.

İlişkiler graph'a gider. Entity memory'nin derdi aslında benzerlik değil, bağlantı. Entity'leri ve aralarındaki tipli edge'leri iki table olarak modelle, sonra graph'ı recursive bir query ile dolaş.

graph.sql

create table entities (
  id        bigint generated always as identity primary key,
  name      text not null,
  kind      text,
  embedding vector(768)          -- so an entity can also be found by meaning
);
 
create table edges (
  src bigint references entities(id),
  rel text   not null,
  dst bigint references entities(id)
);

Bir entity'nin çevresindeki two-hop komşuluğu çekmek için recursive bir CTE kullanırsın:

neighborhood.sql

with recursive nbrhood as (
  select id, name, 0 as depth
  from entities
  where name = $1
  union all
  select e.dst, t.name, n.depth + 1
  from nbrhood n
  join edges e    on e.src = n.id
  join entities t on t.id  = e.dst
  where n.depth < 2
)
select distinct name, depth from nbrhood order by depth;
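
The same traversal can be written as a breadth-first walk in Python, which may make it easier to see what the query computes. This is a sketch over an in-memory edge list with made-up entity names. Like the CTE it follows only outgoing edges (src to dst) and stops at two hops. Unlike the CTE it records only the shortest depth for each entity, whereas select distinct name, depth can list an entity once per depth at which it is reached.

```python
from collections import deque

# Hypothetical (src, rel, dst) triples standing in for the edges table.
edges = [
    ("checkout", "calls", "payments"),
    ("payments", "calls", "ledger"),
    ("ledger", "writes_to", "warehouse"),
    ("checkout", "owned_by", "team-cart"),
]

def neighborhood(start, edges, max_depth=2):
    """Breadth-first walk over outgoing edges, up to max_depth hops from start."""
    outgoing = {}
    for src, _rel, dst in edges:
        outgoing.setdefault(src, []).append(dst)

    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth_of[node] == max_depth:
            continue                      # same cutoff as `where n.depth < 2`
        for nxt in outgoing.get(node, []):
            if nxt not in depth_of:       # first visit is the shortest path
                depth_of[nxt] = depth_of[node] + 1
                queue.append(nxt)
    return depth_of

result = neighborhood("checkout", edges)
```

Here warehouse is three hops from checkout, so it stays out of the result.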

Working memory is the exception. The three homes above hold memory that outlives a turn. Working memory, the scratchpad for the task in flight, does not: it is the run's own state, the plan and intermediate results the loop carries between steps. It lives in the orchestrator, not in the long-term stores. You persist it as a checkpoint (a LangGraph PostgresSaver, for example) so that a paused or crashed run can resume where it left off, not so that you remember it next week.
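
The checkpoint idea can be sketched without any framework. The example below is not the LangGraph API; it assumes a JSON file as a stand-in for the checkpoint table, and the function names and plan steps are hypothetical. It saves the run state after every step, simulates a crash, and resumes from the last saved step.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write atomically: dump to a temp file, then rename over the old checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0, "plan": [], "results": []}
    with open(path) as f:
        return json.load(f)

def run(path, plan, crash_at=None):
    """Execute the plan step by step, checkpointing after each completed step."""
    state = load_checkpoint(path)
    if not state["plan"]:
        state["plan"] = plan
    while state["step"] < len(state["plan"]):
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        task = state["plan"][state["step"]]
        state["results"].append(f"done: {task}")
        state["step"] += 1
        save_checkpoint(path, state)
    return state

path = os.path.join(tempfile.mkdtemp(), "run.json")
plan = ["read logs", "find regression", "draft fix"]
try:
    run(path, plan, crash_at=2)      # dies before the third step
except RuntimeError:
    pass
final = run(path, plan)              # resumes at step 2, repeats nothing
```

The state holds only the plan and intermediate results of one run, which is why it belongs with the orchestrator and can be discarded once the run finishes.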

Tools are memory too. As an agent's tool count grows, sending every tool's schema to the model on every call is the same mistake as stuffing the prompt: it becomes noise the model has to read past. Treat the tool catalog like procedural memory. Store each tool's description as an embedding, just as in the knowledge table, then retrieve only the few tools whose descriptions are nearest the task at hand.

semantic_tool_retrieval · intent → top 3 tools

task · Book me a flight and a hotel for Berlin next month.

kept: book_flight (reserve a seat on a flight), find_hotel (search hotels by city and date), convert_currency (convert between currencies)

not sent: send_email (send a message to a recipient), run_sql (query the production database), search_papers (find academic papers by topic), summarize_doc (condense a long document), create_ticket (open a tracking ticket)

tools available: 8 · schemas sent to model: 3 / 8

This retrieval is the same <=> nearest-neighbor search you already use for knowledge; the only difference is the rows that come back. Even a catalog of two hundred tools adds almost nothing to the prompt as long as the agent loads only the three it needs, and the model reasons better when it is not wading through a pile of schemas it will never call. But this only pays off for catalogs beyond a few dozen tools. Under thirty, send them all: retrieval only adds latency and another source of error.
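
Both rules, retrieve the nearest few and skip retrieval for small catalogs, fit in one function. The sketch below assumes tool descriptions have already been embedded; the two-dimensional vectors are hand-made for illustration, and select_tools and SEND_ALL_BELOW are hypothetical names.

```python
import math

SEND_ALL_BELOW = 30   # threshold from the text: small catalogs skip retrieval

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_tools(task_vec, catalog, k=3, send_all_below=SEND_ALL_BELOW):
    """Return the names of the tools whose schemas should go to the model."""
    if len(catalog) < send_all_below:
        return [tool["name"] for tool in catalog]      # small catalog: send everything
    ranked = sorted(
        catalog,
        key=lambda t: cosine_similarity(t["embedding"], task_vec),
        reverse=True,
    )
    return [tool["name"] for tool in ranked[:k]]

# Hand-made vectors: axis 0 is roughly "travel", axis 1 roughly "engineering".
catalog = [
    {"name": "book_flight",      "embedding": [1.0, 0.0]},
    {"name": "find_hotel",       "embedding": [0.9, 0.1]},
    {"name": "convert_currency", "embedding": [0.7, 0.3]},
    {"name": "run_sql",          "embedding": [0.0, 1.0]},
    {"name": "create_ticket",    "embedding": [0.2, 0.8]},
]
task_vec = [1.0, 0.05]   # stands in for the embedded travel-booking task

picked = select_tools(task_vec, catalog, k=3, send_all_below=0)   # force retrieval for the demo
everything = select_tools(task_vec, catalog)                       # 5 tools < 30: send all
```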

One database, not three

Notice that nothing above left Postgres. The vector index is an extension, the graph is two ordinary tables, and the conversation log is a plain table with a B-tree. To ship a memory-aware agent you do not need a separate vector store or a separate graph engine; one Postgres, and a clear sense of which question each table answers, is enough.

It is tempting to embed everything and search everything by similarity, but resist it. Asking a vector index for "the last ten messages in this thread, in order" is both slow and wrong: it returns rows that resemble the ones you want, not the rows themselves. Exact, ordered reads are the B-tree's job, similarity is the vector index's job, and connection is the graph's job. Once you match the backend to the question, most of the hard decisions disappear on their own.

When the agent remembers

Knowing where memory lives is half the design. The other half is timing, and that is where most agents stumble. A single turn touches memory at three distinct moments; keeping those three separate is what makes an agent both cheap and debuggable.

First, deterministic reads, before the model starts thinking. An agent cannot ask for something it does not know exists, so the core stores are loaded on every turn by a fixed rule, not at the model's whim: recent conversation by thread, relevant knowledge by similarity, similar past workflows, and the entities named in the query. This is the context the model starts from.

Deterministic by default, adaptive when recall matters

Loading by a fixed rule is a deliberate trade. It is predictable and easy to debug, but it leaves recall on the table: the model can often phrase the query better than the user did. The alternatives let the model shape retrieval (query rewriting, HyDE, multi-query, or plan-then-retrieve), at the cost of determinism and extra calls. Start deterministic, and reach for adaptive retrieval on the queries where recall really matters.

Second, just-in-time reads, inside the loop. Expensive or conditional fetches wait until the model actually asks for them. Web search kicks in only when local memory comes up short; a document is loaded in full only when the model needs its detail. The model reasons better over a lean prompt than over a stuffed one, so the agent pulls detail at the moment of need, not "just in case".

Third, durable writes, at the stop condition. When the turn produces a final answer, the agent persists what it learned: the exchange, the workflow it followed, new entities, and, when the turn warrants one, a reflective lesson. Writing only at the end keeps half-finished reasoning from leaking into long-term memory.

The system below walks a single turn through all three phases: deterministic reads load from every store, the loop reaches outside only when it must, and durable writes flow back as the turn ends.

the system · One agent, one Postgres, three stores

step 1 · before: Deterministic reads load the context

step 2 · the loop: The model reasons, and reaches out just in time

step 3 · after: Durable writes, only when the turn ends

tools are memory: The tools live in the vector index too

agent_memory · read→loop→write

In code, these three phases are just the top, middle, and bottom of one function. The model decides only what happens inside the loop; everything around it is fixed.

run_turn.py

def run_turn(query, thread_id):
    # 1. Deterministic reads: assemble context before the model thinks.
    query_vec = embed(query)                              # embed once, reuse for both searches
    context = assemble(
        recent    = sql_recent_messages(thread_id),       # exact, by thread
        knowledge = knn("knowledge", query_vec, k=5),     # semantic top-K
        workflow  = knn("workflow",  query_vec, k=3),
        entities  = graph_neighborhood(query),            # graph traversal
    )
    write_message(thread_id, "user", query)               # durable, every turn

    # 2. The loop: reason, act, fetch just in time.
    messages = [system_prompt(), user(context, query)]
    for _ in range(MAX_STEPS):
        step = model(messages)
        if step.is_final:
            break
        result = call_tool(step)        # e.g. expand_summary(id), web_search(q)
        messages += [step, tool_result(result)]
    else:
        # Step budget spent with no final answer: stop here so that
        # unfinished reasoning is never persisted as if it were an answer.
        raise RuntimeError("no final answer within MAX_STEPS")

    # 3. Stop condition: persist what the turn learned.
    write_message(thread_id, "assistant", step.text)
    upsert_workflow(query, steps_taken=messages)
    upsert_entities(extract_entities(step.text))
    if turn_was_hard(step):                               # not every turn: reflection costs a model call
        write_lesson(reflect(query, step.text))           # reflective memory, can run async
    return step.text

Look at the structure, not the syntax: reads on top, a tight loop in the middle, writes at the bottom. Because the reads and writes are deterministic, every run starts the same way and ends the same way, and that is what makes a misbehaving agent debuggable.

One line there hides the real work: assemble. Four stores can return more than fits in the prompt, so assembling context is a budget problem, not a concatenation. Give each store a token quota, dedupe across stores (the same fact often arrives from both knowledge and the graph), and set a priority for conflicts: an explicit user statement beats an inference, and fresh beats stale. The model sees only what survives that cut.
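
A minimal sketch of that budgeting step might look like the following. It is one possible shape for assemble, not the article's implementation: the candidate fields (store, text, explicit, age_days), the word-count tokenizer, and the exact-match dedupe are all simplifying assumptions, and a real system would use a proper tokenizer and semantic deduplication.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def assemble(candidates, quotas):
    """Pick context items under per-store token quotas.

    candidates: dicts with store, text, explicit (bool), age_days.
    quotas: maximum tokens allowed per store name.
    """
    # Priority for conflicts: explicit user statements first, then fresher items.
    ordered = sorted(candidates, key=lambda c: (not c["explicit"], c["age_days"]))

    used = {}
    seen = set()
    kept = []
    for item in ordered:
        key = " ".join(item["text"].lower().split())   # normalize for cross-store dedupe
        if key in seen:
            continue                                    # same fact already kept from another store
        cost = count_tokens(item["text"])
        store = item["store"]
        if used.get(store, 0) + cost > quotas.get(store, 0):
            continue                                    # this store's quota is spent
        seen.add(key)
        used[store] = used.get(store, 0) + cost
        kept.append(item)
    return kept

candidates = [
    {"store": "knowledge", "text": "Checkout uses the payments service", "explicit": False, "age_days": 30},
    {"store": "entities",  "text": "checkout uses the payments service", "explicit": False, "age_days": 2},
    {"store": "recent",    "text": "User prefers rollbacks over hotfixes", "explicit": True, "age_days": 0},
    {"store": "knowledge", "text": "Payments retries failed charges three times before alerting",
     "explicit": False, "age_days": 5},
]
quotas = {"recent": 50, "knowledge": 6, "entities": 20}
kept = assemble(candidates, quotas)
```

In this run the explicit user statement is kept first, the fresher copy of the duplicated fact wins over the stale one, and the eight-word knowledge item is dropped because it exceeds that store's six-token quota.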

Keeping memory from rotting

A memory that only grows will eventually bury the agent under its own past. So writing needs a discipline, much as human memory keeps what matters and lets the trivial fade over time.

Forgetting. Not every fact deserves a permanent place. A lasting preference is worth writing down; a one-off remark made in passing is not. Deciding what to keep takes judgment, which is why it is one of the few memory operations left to the model itself and not to a rule.

Forgetting is a feature

The first instinct is to store everything, "just in case". But the noise you retrieve becomes the noise you reason over. An agent that writes selectively keeps its future retrievals sharp; an agent that hoards everything slowly poisons its own context.

Consolidation. Forgetting prunes; consolidation compresses. An episodic log that only grows eventually drowns out its own useful signal, and you cannot delete your way out of that. So fold old episodes upward: summarize a run into a paragraph, a week of runs into a page, the way human memory keeps the gist and lets the transcript fade. This kind of layered arrangement, with recent detail in full and older memory as recursive summaries, is what keeps a long-lived agent both knowledgeable and small.
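
The layering can be sketched in two levels. In the example below summarize is a truncating stand-in for a model call, and the day numbers, word limits, and seven-day window are assumptions chosen for illustration.

```python
def summarize(texts, max_words):
    """Stand-in for a model call: join the inputs and truncate to max_words."""
    return " ".join(" ".join(texts).split()[:max_words])

def consolidate(episodes, today, keep_full_days=7, run_words=12, week_words=20):
    """Keep recent episodes in full; fold older ones into per-week summaries.

    episodes: dicts with day (int) and text. Returns (recent, weekly_summaries).
    """
    recent = [e for e in episodes if today - e["day"] < keep_full_days]
    old = [e for e in episodes if today - e["day"] >= keep_full_days]

    # Level 1: each old run becomes a short paragraph, grouped by week number.
    by_week = {}
    for e in old:
        by_week.setdefault(e["day"] // 7, []).append(summarize([e["text"]], run_words))

    # Level 2: each week of run summaries becomes one page.
    weekly = {week: summarize(runs, week_words) for week, runs in sorted(by_week.items())}
    return recent, weekly

episodes = [
    {"day": 1,  "text": "Investigated checkout latency and found a slow index scan on orders"},
    {"day": 3,  "text": "Added the missing index and confirmed p95 latency dropped"},
    {"day": 9,  "text": "Payments webhook retries caused duplicate ledger rows"},
    {"day": 20, "text": "Triaged a spike in checkout 500s after the latest deploy"},
]
recent, weekly = consolidate(episodes, today=21)
```

Only the day-20 episode stays in full; the three older ones collapse into two weekly summaries, so the store grows with the number of weeks, not the number of runs.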

Reflection. After a hard task the agent writes down what it learned. The next time a similar task arrives, that lesson loads with the rest of memory and the agent does not repeat the mistake. This is the loop that lets an agent get better at a job over time instead of starting from zero every morning. But reflection is a model call, so it does not belong in every turn. Run it only when a turn was hard, failed, or surfaced something new, and where possible take it off the response path the user is waiting on and run it in the background.
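
Gating and backgrounding reflection could look like this. The sketch uses a thread pool from the standard library; the step threshold, the turn fields, and the reflect stand-in are hypothetical, and a production service would more likely hand the work to a job queue.

```python
from concurrent.futures import ThreadPoolExecutor

HARD_STEP_COUNT = 6   # hypothetical threshold for "this turn was hard"

def should_reflect(turn):
    """Reflect only on turns that failed, ran long, or surfaced something new."""
    return turn["failed"] or turn["steps"] >= HARD_STEP_COUNT or turn["new_entities"] > 0

def reflect(turn):
    """Stand-in for the model call that writes a lesson."""
    return f"lesson from '{turn['query']}' after {turn['steps']} steps"

lessons = []                                   # stands in for the reflective-memory table
executor = ThreadPoolExecutor(max_workers=1)

def finish_turn(turn):
    """Return the answer immediately; queue reflection off the response path."""
    if should_reflect(turn):
        future = executor.submit(reflect, turn)
        future.add_done_callback(lambda f: lessons.append(f.result()))
    return turn["answer"]

easy = {"query": "what time is it", "answer": "09:00",
        "steps": 1, "failed": False, "new_entities": 0}
hard = {"query": "triage checkout 500s", "answer": "roll back deploy",
        "steps": 9, "failed": False, "new_entities": 2}
answers = [finish_turn(easy), finish_turn(hard)]
executor.shutdown(wait=True)   # only so the demo can inspect lessons; a service keeps the pool up
```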

When one Postgres is not enough

The design so far is deliberately monolithic: one Postgres, three access patterns. It carries you much further than the urge to add infrastructure would suggest. But it is not the last stop for every system. Each of three axes has a limit past which a specialized backend stops being premature and becomes the right call. Be honest about where yours stand.

when_to_graduate · postgres → first lever → graduate

1 · stay on postgres · Postgres + pgvector (HNSW)
Up to ~1M vectors, modest QPS. Transactional rows and vector search live in one database.

2 · first lever · Tune HNSW: m, ef_construction, ef_search
Recall or latency slips first. Defaults are conservative; tuning buys a lot before any migration.

3 · graduate · Dedicated vector DB: Qdrant, Weaviate, Milvus
Past ~10M vectors or high QPS. Quantization, sharding, and filtered search that stay fast at scale.

pgvector comfort: up to ~1M · graduate near: 10M+ or high QPS

On vector scale, pgvector with a tuned HNSW index is comfortable up to a few million rows. Beyond that, or under heavy query load, a dedicated vector database (Qdrant, Weaviate, Milvus) pays for itself: quantization, sharding, and filtered search that stays fast where Postgres starts to strain.

On graph depth, two tables and a recursive CTE are enough for shallow, sparse traversals. The structure breaks down when edges need weights or timestamps, when traversals are deep and frequent, or when you want graph algorithms such as community detection or centrality that SQL expresses poorly. That is the point where a property-graph database (Neo4j, Memgraph) and Cypher earn their place, along with a temporal knowledge-graph layer (Graphiti, Zep) that holds facts which are true only for a while. For relationship-heavy, time-sensitive work, this is usually the first axis to graduate.

On retrieval quality, cosine similarity alone is a simplification. In production, retrieval is usually hybrid: lexical search (BM25) is combined with dense vectors to catch the exact terms embeddings miss, and the top candidates then pass through a cross-encoder reranker. When a single pass leaves recall on the table, complement it with adaptive retrieval: query rewriting, HyDE, or multi-query.
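
The text does not say how the lexical and dense results are combined. One common choice is reciprocal rank fusion, sketched here as an assumption and not as the article's method. It merges ranked lists using only positions, so BM25 scores and cosine distances never need to share a scale. The document ids are made up, and the reranking stage is left out.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

lexical = ["doc_err_502", "doc_checkout", "doc_retry"]     # e.g. from a BM25 search
dense   = ["doc_checkout", "doc_timeouts", "doc_err_502"]  # e.g. from the vector index
fused = reciprocal_rank_fusion([lexical, dense])
```

Documents that both retrievers agree on rise to the top, and the head of the fused list is what a cross-encoder reranker would then rescore.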

Start simple, graduate on evidence

None of this is an argument against starting with one Postgres. It is an argument against guessing. Build the simple design, measure where it strains, and graduate the one axis that crosses its limit, not all three at once on a hunch. Premature specialization costs more than it returns.

The takeaway

Three questions carry the whole design. The taxonomy tells you what to store: seven kinds, three drawn from human memory and four added by agent engineering. Postgres tells you where to put it: exact reads in a SQL table, reads by meaning (knowledge, and the tools that will act on it) in a pgvector index, relationships in a graph, all inside one database. The loop tells you when to touch it: deterministic reads before the model thinks, just-in-time reads while it works, durable writes when it is done. At scale a fourth question waits: one Postgres is the right default, and you graduate a single axis (vector scale, graph depth, or retrieval quality) only once you have measured it and seen it cross its limit. Get these right and you have already covered most of the distance between a demo that forgets you and a system that does not.