The Agent Memory System


AI agents are goldfish. Every session starts from zero. The system prompt gives them identity and rules, but nothing about what happened yesterday — what broke, what Joel prefers, what patterns emerged, what decisions were made. Without memory, agents repeat mistakes, re-discover constraints, and can’t compound knowledge across sessions.

The naive fix is a big text file. Append everything, load it all. This collapses within weeks: the file grows unbounded, signal drowns in noise, stale facts contradict current ones, and the context window fills with garbage. You need a pipeline, not a file.

The real fix: a system where AI agents observe their own sessions, extract durable knowledge, triage it through quality gates, store it in searchable vector collections, and retrieve it with time-decay ranking — so every future session starts smarter than the last.

sessions → observe → write-gate → store → decay/rank → retrieve → inject
    ↑                                                              ↓
    └──── nightly maintenance (merge dupes, prune stale) ←────────┘

         reflect → propose → triage → promote to MEMORY.md

Components

0. Decide what memory is for

Memory is for patterns, not noise. A good memory system captures:

  • User preferences and communication style
  • Hard rules the user has established (“never do X”)
  • System architecture facts that don’t change daily
  • Operational conventions (naming, paths, tools)
  • Debugging insights that save future sessions hours

It does NOT capture:

  • Step-by-step narration of what happened (git log does this)
  • Transient task state (Redis does this)
  • Raw session transcripts (session storage does this)
  • Every file that was read or edited

The test: will this fact still be true and useful next month? If yes, it’s memory. If not, it’s a log entry.

1. The structured log (slog)

Before memory, you need a system log. Not application logs — a structured operational log that records what agents did, not what they thought.

A JSONL file in a stable location. Each entry:

{
  "timestamp": "2026-03-02T10:14:00.000Z",
  "action": "deploy",
  "tool": "system-bus-worker",
  "detail": "Published ghcr.io/joelhooks/system-bus-worker:20260302-064120",
  "reason": "subscription URL fix + model ref update"
}

The log is append-only, never edited. A CLI wraps it with HATEOAS JSON responses so agents can write entries and query history without knowing the file path.

Actions are verbs: install, configure, deploy, fix, create, ingest, restart. Tools are nouns: caddy, inngest, redis, gateway. This vocabulary stabilizes quickly and becomes searchable.

The rule: slog liberally. You can’t filter nothing into something later. Every deploy, config change, service restart, debug finding, secrets rotation — log it. Skip only pure file-edit bookkeeping that git log already captures.
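The append-only discipline is easy to enforce in code. Here is a minimal sketch of a slog writer, assuming Node and a JSONL path of your choosing — the entry shape mirrors the example above, but the CLI wrapper and its HATEOAS responses are out of scope:

```typescript
import { appendFileSync } from "node:fs";

interface SlogEntry {
  timestamp: string;
  action: string;  // verb: install, configure, deploy, fix, create, ingest, restart
  tool: string;    // noun: caddy, inngest, redis, gateway
  detail: string;
  reason?: string;
}

// Append one entry as a single JSON line. The file is never edited in place;
// appendFileSync creates it on first write.
function slog(path: string, entry: Omit<SlogEntry, "timestamp">): SlogEntry {
  const full: SlogEntry = { timestamp: new Date().toISOString(), ...entry };
  appendFileSync(path, JSON.stringify(full) + "\n");
  return full;
}
```

The wrapper function is the whole point: agents call `slog(...)` without knowing the file path or format, which keeps the vocabulary and shape consistent.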

2. Architecture Decision Records (ADRs)

Long-lived architectural decisions get their own documents. Not in the log — in a decisions directory, numbered sequentially, with structured frontmatter:

---
status: shipped  # proposed | accepted | shipped | superseded | deprecated | rejected
date: 2026-02-19
deciders: joel
tags: [memory, pipeline, inngest]
supersedes: [0021]
---

ADRs record the context (why this decision was needed), the decision (what was chosen), and the consequences (what changed as a result). They are the institutional memory of why the system is shaped the way it is.

Status must be a single canonical word. Cross-references go in superseded-by: frontmatter, not inline. ADR numbers never get reused.

Agents should read relevant ADRs before modifying any subsystem. “Why is it this way?” is almost always answered by an ADR.
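The conventions above — one canonical status word, an ISO date — are lintable. A hypothetical frontmatter check (function and field names are my own, not the system's):

```typescript
// Canonical status vocabulary from the frontmatter example above.
const ADR_STATUSES = ["proposed", "accepted", "shipped", "superseded", "deprecated", "rejected"];

// Returns a list of convention violations; empty means the frontmatter is valid.
function validateAdrFrontmatter(fm: { status?: string; date?: string }): string[] {
  const errors: string[] = [];
  if (!fm.status || !ADR_STATUSES.includes(fm.status)) {
    errors.push(`status must be one of: ${ADR_STATUSES.join(" | ")}`);
  }
  if (!fm.date || !/^\d{4}-\d{2}-\d{2}$/.test(fm.date)) {
    errors.push("date must be YYYY-MM-DD");
  }
  return errors;
}
```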

3. The observation pipeline

This is where sessions become memory. The pipeline fires on two triggers:

  1. Session compaction — when a long session hits a context limit and compacts, the compacted content is observed before it’s discarded
  2. Session end — when a session closes, its full transcript is observed

The observer is an LLM call with a specific system prompt. It does NOT summarize — it distills. The prompt instructs it to:

  • Identify coherent conversation segments (topic boundaries, not arbitrary chunks)
  • For each segment, produce a narrative (1-3 sentences of context) and a fact list
  • Tag each fact with priority: 🔴 High (corrections, constraints, hard requirements), 🟡 Medium (recurring patterns), 🟢 Low (minor notes)
  • Annotate each fact with a write gate: allow (durable signal), hold (ambiguous, preserve but don’t auto-inject), or discard (noise)
  • Classify each fact into a taxonomy category

The output is structured XML — not freeform text. This matters because downstream steps parse it programmatically.

<observations>
  <segment>
    <narrative>Joel debugged a stale NFS mount on the Mac Mini...</narrative>
    <facts>
      - 🔴 [gate=allow confidence=0.92 category=jc:operations] Both NFS mounts require sudo to unmount when stale
      - 🟡 [gate=hold confidence=0.61 category=jc:system-architecture] K8s NFS PV needs explicit server/path
      - 🟢 [gate=discard confidence=0.84 category=jc:projects] Checked three-body export list in ADM
    </facts>
  </segment>
</observations>
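Because the output is structured, the fact lines parse mechanically. A sketch of a parser for the annotated fact format shown above — the regex and type names are illustrative, not the system's actual code:

```typescript
interface ParsedFact {
  priority: "high" | "medium" | "low";
  gate: "allow" | "hold" | "discard";
  confidence: number;
  category: string;
  text: string;
}

// Matches lines like:
//   - 🔴 [gate=allow confidence=0.92 category=jc:operations] Some fact text
const FACT_RE =
  /^-\s*(🔴|🟡|🟢)\s*\[gate=(allow|hold|discard)\s+confidence=([\d.]+)\s+category=([\w:-]+)\]\s*(.+)$/u;

function parseFactLine(line: string): ParsedFact | null {
  const m = FACT_RE.exec(line.trim());
  if (!m) return null; // not a fact line — let the caller decide what to do
  const priority = m[1] === "🔴" ? "high" : m[1] === "🟡" ? "medium" : "low";
  return {
    priority,
    gate: m[2] as ParsedFact["gate"],
    confidence: parseFloat(m[3]),
    category: m[4],
    text: m[5],
  };
}
```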

4. The write gate

Not everything observed deserves storage. The write gate is a filter between observation and persistence:

  • allow — durable factual signal. Gets stored and retrieved by default.
  • hold — ambiguous or contextual. Gets stored but excluded from default retrieval. Available when explicitly requested.
  • discard — noise. Dropped entirely. Never stored.

The gate is determined by the observer LLM (via the annotation), but the system also applies rules-based overrides:

  • Observations shorter than 12 characters → discard
  • Instruction-text artifacts (“Add after…”, “Replace…”) → discard
  • Facts containing ADR references, file paths, or explicit commands → boost toward allow

The write gate prevents the “append everything” failure mode. Most observations are hold or discard. Only genuine signal becomes retrievable memory.
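The rules-based overrides reduce to a small pure function layered over the observer's verdict. The thresholds come from the list above; treating backticks as markers of "explicit commands" is my assumption:

```typescript
type Verdict = "allow" | "hold" | "discard";

function applyGateOverrides(text: string, llmVerdict: Verdict): Verdict {
  // Too short to be durable signal — discard regardless of the LLM's call.
  if (text.trim().length < 12) return "discard";

  // Instruction-text artifacts that leaked in from prompts — discard.
  if (/^(Add after|Replace)\b/i.test(text)) return "discard";

  // ADR references, file paths, or explicit commands — boost toward allow.
  const hasStrongSignal =
    /ADR-\d+/.test(text) ||          // ADR reference
    /\/[\w.-]+\/[\w.-]+/.test(text) || // path-like string
    /`[^`]+`/.test(text);            // inline command (assumption)
  if (hasStrongSignal && llmVerdict === "hold") return "allow";

  return llmVerdict; // otherwise, trust the observer
}
```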

5. The vector store

Observations that pass the write gate get stored in a collection with auto-embedding. Each document:

{
  id: "uuid",
  session_id: "source-session",
  observation: "Both NFS mounts require sudo to unmount when stale",
  observation_type: "fact",
  source: "session-abc-segment-2",
  timestamp: 1709395200,  // unix seconds
  write_verdict: "allow",
  write_confidence: 0.92,
  category_id: "jc:operations",
  category_confidence: 0.88,
  taxonomy_version: "v1",
  merged_count: 1,
  recall_count: 0,
  superseded_by: null
}

The collection handles both keyword search and semantic (vector) search. Hybrid search combines them. The observation field is auto-embedded — no external embedding pipeline needed.

Before storing, dedup. Search for semantically similar existing observations (cosine similarity > 0.85). If a near-duplicate exists, merge: increment merged_count on the existing doc, update its timestamp, and skip the new insert. This prevents the same insight from being stored 15 times across sessions.
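The dedup-on-write step is a similarity check plus an in-place merge. A sketch under the stated 0.85 cosine threshold — in practice the similar-document lookup would be a vector query against the collection, not a linear scan:

```typescript
interface StoredObs {
  id: string;
  observation: string;
  vector: number[];
  timestamp: number;     // unix seconds
  merged_count: number;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Returns the merged existing doc if a near-duplicate was found (skip the
// insert), or null if the new observation should be stored as-is.
function dedupOnWrite(
  existing: StoredObs[],
  newVector: number[],
  now: number,
  threshold = 0.85
): StoredObs | null {
  for (const doc of existing) {
    if (cosine(doc.vector, newVector) > threshold) {
      doc.merged_count += 1; // same insight seen again
      doc.timestamp = now;   // refresh recency for decay ranking
      return doc;
    }
  }
  return null;
}
```

Note that the merge refreshes the timestamp: a repeatedly observed fact stays "recent" under time-decay ranking, which is exactly the behavior you want.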

6. Category taxonomy

Every observation gets a category. Start simple — seven categories cover most agent memory:

Category              Signal
preferences           User communication style, tool preferences, aesthetic taste
rules-conventions     Hard rules, naming conventions, process requirements
system-architecture   Architecture decisions, topology, design patterns
operations            Deploy procedures, incident responses, debugging insights
memory-system         Meta: observations about the memory system itself
projects              Active projects, milestones, feature context
people-relationships  People mentioned, meeting context, relationship notes

Classification uses keyword matching first (fast, free), falling back to LLM classification when keywords are ambiguous. The taxonomy version is stored per-observation so you can evolve it without invalidating historical data.
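Keyword-first classification might look like this sketch. The category IDs follow the taxonomy; the keyword lists and the confidence heuristic are illustrative, and a null result means "fall back to the LLM":

```typescript
// Illustrative keyword lists — a real deployment tunes these over time.
const CATEGORY_KEYWORDS: Record<string, string[]> = {
  "jc:operations": ["deploy", "restart", "incident", "debug", "sudo"],
  "jc:system-architecture": ["architecture", "topology", "kubernetes", "gateway"],
  "jc:rules-conventions": ["never", "always", "convention", "naming"],
  "jc:preferences": ["prefers", "style", "tone"],
};

function classifyByKeywords(text: string): { category: string; confidence: number } | null {
  const lower = text.toLowerCase();
  let best: { category: string; hits: number } | null = null;
  for (const [category, words] of Object.entries(CATEGORY_KEYWORDS)) {
    const hits = words.filter((w) => lower.includes(w)).length;
    if (hits > 0 && (!best || hits > best.hits)) best = { category, hits };
  }
  if (!best) return null; // ambiguous — caller escalates to LLM classification
  // Crude confidence heuristic (assumption): more keyword hits, more confidence.
  return { category: best.category, confidence: Math.min(0.5 + 0.2 * best.hits, 0.95) };
}
```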

7. Retrieval with time decay

When an agent session starts, it retrieves relevant memory. The retrieval pipeline:

  1. Query rewrite — The raw query gets rewritten by a fast LLM (Haiku-class) to improve search recall. “that thing with the NFS mounts” becomes “NFS mount stale sudo unmount three-body Mac Mini”. This step has a hard timeout (2 seconds) — if it’s slow, use the original query.

  2. Hybrid search — Both keyword and vector search against the observations collection. Combine results with rank fusion.

  3. Time decay — Apply exponential decay to scores:

    final_score = raw_score × exp(-0.01 × days_since_created)

    A fact from 70 days ago gets ~50% weight. A fact from 180 days ago is nearly invisible. Facts marked stale get an additional 0.5 multiplier.

  4. Cap — Take the top 10 results after decay ranking. This is the inject cap — no matter how many observations match, the session gets at most 10. This prevents memory from consuming the context window.

  5. Budget profiles — Different contexts need different depth:

    • lean (2-3 results, no query rewrite) — for automated checks, heartbeats
    • balanced (5-7 results, rewrite enabled) — default for interactive sessions
    • deep (10-15 results, aggressive rewrite) — for research, debugging complex issues
    • auto — system picks based on context
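Steps 3 and 4 are a few lines of arithmetic. A sketch using the constants from the text (0.01/day decay, 0.5 stale multiplier, inject cap of 10); the `Scored` shape is illustrative:

```typescript
interface Scored {
  id: string;
  raw: number;      // fused hybrid-search score
  ageDays: number;  // days since created
  stale: boolean;
}

function rankWithDecay(results: Scored[], cap = 10): (Scored & { final: number })[] {
  return results
    .map((r) => {
      // final_score = raw_score × exp(-0.01 × days_since_created)
      let final = r.raw * Math.exp(-0.01 * r.ageDays);
      if (r.stale) final *= 0.5; // stale facts take an extra penalty
      return { ...r, final };
    })
    .sort((a, b) => b.final - a.final)
    .slice(0, cap); // inject cap: at most `cap` observations reach the session
}
```

At 70 days, `exp(-0.7) ≈ 0.497` — the "~50% weight" cited above; a lean budget profile would simply call this with a smaller cap.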

8. The reflection cycle

Observations are raw signal. Reflection compresses them into durable updates for a curated memory file (MEMORY.md). This is the step that turns hundreds of observations into a compact, high-signal reference document.

The reflector is a separate LLM call that reads:

  • Recent observations (since last reflection)
  • Current MEMORY.md content
  • Compression guidance (target 60-80% of input length)

It outputs proposals — structured changes to specific sections of MEMORY.md:

<proposals>
  <proposal>
    <section>Hard Rules</section>
    <change>NEVER set retries: 0 on Inngest functions. Let Inngest defaults handle retries.</change>
  </proposal>
</proposals>

Proposals don’t go directly into MEMORY.md. They enter a triage pipeline.

9. The proposal triage pipeline

Three tiers, from automatic to human:

Tier 1: Auto-action (no human needed)

  • Auto-promote: proposal contains an ADR reference, file path, or explicit command; confidence > 0.8; no duplicate in existing memory
  • Auto-reject: instruction-text artifact, too short, exact duplicate, or confidence < 0.3
  • Auto-merge: semantically similar to existing memory entry (merge the text, increment count)

Tier 2: LLM batch review (~$0.01 per batch of 20)

  • Proposals that aren’t auto-actioned get batched and sent to an LLM for review
  • The LLM reads current MEMORY.md + the batch and decides: promote (with clean formatted text), reject (with reason), or needs-review
  • Runs every 30 minutes via cron to catch stragglers

Tier 3: Human review (rare)

  • Only proposals the LLM flags as ambiguous or risky
  • Creates a task in the task manager: “Review memory proposal: [summary]”
  • Human approves, edits, or rejects

This pipeline keeps MEMORY.md curated without requiring constant human attention. Most proposals resolve in Tier 1 or 2.
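Tier 1 is deterministic enough to sketch as a single function. The thresholds (0.8, 0.3, 12 characters) come from the lists above; treating backticks as "explicit command" markers is my assumption, and the auto-merge branch is omitted for brevity:

```typescript
type TriageAction = "auto-promote" | "auto-reject" | "needs-batch-review";

function tierOneTriage(p: {
  text: string;
  confidence: number;
  exactDuplicate: boolean; // already present verbatim in MEMORY.md
}): TriageAction {
  const artifact = /^(Add after|Replace)\b/i.test(p.text);
  if (artifact || p.text.trim().length < 12 || p.confidence < 0.3 || p.exactDuplicate) {
    return "auto-reject";
  }
  const strong = /ADR-\d+/.test(p.text) || /`[^`]+`/.test(p.text);
  if (strong && p.confidence > 0.8) return "auto-promote";
  return "needs-batch-review"; // falls through to the Tier 2 LLM batch
}
```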

10. Nightly maintenance

A cron job that runs daily to keep the observation store healthy:

  1. Dedup sweep — Find semantically similar observations (cosine > threshold), merge the best pair, mark the weaker one as superseded_by the stronger one
  2. Stale pruning — Observations older than 90 days with recall_count: 0 (never retrieved) get marked stale. Already-stale observations older than 180 days get soft-deleted.
  3. Stats emission — Write maintenance stats to telemetry: total observations, merged count, stale count, category distribution

The maintenance job is idempotent — running it twice produces the same result. It’s also conservative: it marks stale and supersedes, it doesn’t hard-delete. Recovery is always possible.
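The stale-marking and soft-delete windows can be sketched as one conservative pass; field names are illustrative, and note that nothing is ever hard-deleted:

```typescript
interface MaintObs {
  ageDays: number;
  recall_count: number;
  stale: boolean;
  deleted: boolean; // soft-delete flag only — recovery stays possible
}

function pruneSweep(obs: MaintObs[]): MaintObs[] {
  return obs.map((o) => {
    // Already stale and past 180 days: soft-delete.
    if (o.stale && o.ageDays > 180) return { ...o, deleted: true };
    // Past 90 days and never retrieved: mark stale.
    if (o.ageDays > 90 && o.recall_count === 0) return { ...o, stale: true };
    return o;
  });
}
```

A never-recalled observation thus takes at least two nightly runs spread across the 90/180-day windows to disappear from default retrieval, which is the intended slow, recoverable forgetting.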

11. Observability

Every step in the memory pipeline emits structured telemetry:

{
  level: "info",
  source: "worker",
  component: "observe",
  action: "memory.observe.completed",
  success: true,
  duration_ms: 2340,
  metadata: {
    session_id: "abc",
    observations_extracted: 7,
    allowed: 3,
    held: 2,
    discarded: 2,
    deduped: 1
  }
}

Telemetry goes to a searchable collection (same search engine as memory, different collection). This means you can query “how many observations were discarded this week?” or “what’s the average reflect duration?” without log parsing.

Silent failures are bugs. If the observe step fails, it must emit an error event. If the triage step silently drops a proposal, that’s a bug. Every transition in the pipeline is observable.

12. The curated memory file

MEMORY.md is the crown jewel. It’s a small, version-controlled markdown file organized by section:

## Joel
- (2026-02-14) Dry, direct communication. No filler.
- (2026-02-14) "Let the loop earn its pass" — won't hand-wave failures.

## Hard Rules
- (2026-02-27) NEVER set retries: 0 on Inngest functions.
- (2026-02-14) Never fabricate experiences in Joel's voice.

## Conventions
- (2026-02-16) Repos clone to ~/Code/{org}/{repo}.
- (2026-02-17) ADR status must be a single canonical word.

Every entry has a date. Sections map to the taxonomy categories. The file is loaded into every agent session as part of the system context — it IS the distilled institutional memory.

MEMORY.md is deliberately small. If it grows past ~200 entries, that’s a signal to prune. The vector store holds everything; MEMORY.md holds only what’s worth injecting into every session.

13. Git as memory backbone

Git is the most underrated memory system. Every commit is a timestamped, attributed, diffable record of what changed. The memory system leans on this:

  • MEMORY.md is version-controlled. Every promotion is a commit. You can git log MEMORY.md to see the full history of what the system learned and when.
  • ADRs are version-controlled. Decision history is preserved.
  • Skills are version-controlled. Operational knowledge evolves in git, not in databases.
  • The slog is in the vault, which is version-controlled. Operational history survives database failures.

If the vector store dies, you lose search ranking and retrieval speed — but the actual knowledge is in git. This is a design choice: the durable layer is files, the fast layer is the search engine.

14. The discovery pipeline

Not all memory comes from sessions. The discovery pipeline captures interesting finds — URLs, repos, ideas — as structured vault notes:

  1. Agent or human shares a URL with context (“this is interesting”)
  2. Pipeline investigates: clone repo, extract article, analyze content
  3. LLM generates a title, tags, relevance assessment, and a vault note in the user’s voice
  4. Note is written to the vault and logged
  5. If the resource is worth monitoring, a feed subscription is created

Discovery captures are first-class knowledge — they’re indexed alongside session observations and searchable through the same recall interface.

15. The docs/PDF corpus

Books, papers, and documentation get chunked and indexed in a separate collection. Each chunk:

{
  id: "uuid",
  doc_id: "tempo-by-rao",
  chunk_index: 42,
  content: "The actual text of the chunk...",
  source_path: "/volume1/home/joel/books/2026/tempo.epub",
  tags: ["strategy", "tempo", "decision-making"]
}

This is not session memory — it’s reference memory. Agents can search it when they need deep context on a topic. It’s queried explicitly (“search the library for X”), not auto-injected.
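Chunking for the corpus can be as simple as fixed windows with overlap. A naive sketch — real chunkers usually respect paragraph or heading boundaries, and the sizes here are arbitrary:

```typescript
interface Chunk {
  doc_id: string;
  chunk_index: number;
  content: string;
}

// Fixed-size sliding window; assumes size > overlap so the loop advances.
function chunkText(docId: string, text: string, size = 800, overlap = 100): Chunk[] {
  const chunks: Chunk[] = [];
  for (let start = 0, i = 0; start < text.length; start += size - overlap, i++) {
    chunks.push({ doc_id: docId, chunk_index: i, content: text.slice(start, start + size) });
  }
  return chunks;
}
```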

Properties

The system has these properties when built correctly:

  • Compound learning. Each session makes future sessions smarter. Knowledge accumulates, noise doesn’t.
  • Durable pipeline. Every step is memoized in a durable function framework. A crash mid-observe doesn’t lose observations.
  • Quality gates. Write gates, triage tiers, and dedup prevent garbage accumulation.
  • Time-aware retrieval. Old facts fade. Recent facts dominate. Stale facts get pruned.
  • Observable. Every pipeline step emits telemetry. You can diagnose “why didn’t the system remember X?” by tracing the pipeline.
  • Git-backed durability. The search engine is the fast path. Git is the durable path. Lose one, recover from the other.
  • Human sovereignty. MEMORY.md is curated through a review pipeline. The human decides what patterns are worth encoding as hard rules.
  • Category-aware. Facts are classified, enabling filtered retrieval (“show me only architecture facts”) and maintenance (“what categories are over-represented?”).

What this doesn’t cover

  • Authentication for the recall API (you probably want it)
  • Multi-user memory (extend the category scheme and add user scoping)
  • Memory sharing between agents (add an agent-mail protocol)
  • Embedding model selection (any auto-embedding search engine works)
  • The specific LLM or inference provider (any instruction-following model works)
  • Real-time memory injection mid-session (build it after the core pipeline works)
  • Forgetting on request (“forget everything about X” — add a purge command)

Implementation notes

Use whatever tools match your stack. The architecture is:

Concern                   Options
Structured log            JSONL file + CLI wrapper, or any append-only log
ADRs                      Markdown files in a decisions directory
Durable pipeline          Inngest, Vercel Workflow, Temporal, Trigger.dev
Vector + keyword search   Typesense (auto-embedding), Qdrant + embedding API, Pinecone, Weaviate
Curated memory            Markdown file, version-controlled
Observation LLM           Any instruction-following model
Triage LLM                Fast/cheap model (Haiku-class) for batch review
Telemetry store           Same search engine (different collection), or dedicated OTEL backend
Document corpus           Same search engine (different collection), chunked and tagged

The observation prompt is ~60 lines. The write gate is ~50 lines. The triage pipeline is three tiers. The nightly maintenance is four steps. The curated file is one markdown document.

The complexity is in the pipeline — making sure observations flow reliably from session to storage to retrieval, with quality gates at every transition. Get the pipeline right and the rest is details.

Provenance

This system wasn’t designed in a whiteboard session. It grew through 18 days of building, breaking, and rebuilding — tracked in structured logs, architecture decisions, git commits, and session transcripts. Here’s the full trail.

Architecture Decision Records

The decision chain that shaped the system, in chronological order:

ADR       Date        Status      Decision
ADR-0014  2026-02-14  Superseded  Agent memory workspace — the first attempt, a flat file in ~/.joelclaw/workspace/
ADR-0020  2026-02-15  Superseded  Observational memory pipeline — session transcript → LLM extraction → storage
ADR-0021  2026-02-15  Shipped     Comprehensive agent memory system — 3-tier architecture, dual-write Redis+Qdrant, local embedding. Superseded 0014 and 0020. The one that stuck.
ADR-0068  2026-02-19  Shipped     Memory proposal auto-triage pipeline — the 3-tier promote system
ADR-0077  2026-02-20  Shipped     Memory system next phase — reflection, nightly maintenance, weekly summaries
ADR-0082  2026-02-20  Shipped     Typesense as unified search layer — replaced Qdrant with Typesense for everything
ADR-0094  2026-02-22  Proposed    Memory write gate v1 — soft, LLM-first, three-state (allow/hold/discard)
ADR-0095  2026-02-22  Proposed    Typesense-native memory categories — SKOS-lite v1 taxonomy
ADR-0096  2026-02-22  Proposed    Budget-aware memory retrieval — lean/balanced/deep/auto profiles
ADR-0097  2026-02-22  Proposed    Forward triggers for time-based memory preload
ADR-0098  2026-02-22  Proposed    Write gate v2 calibration and governance
ADR-0099  2026-02-22  Proposed    Memory knowledge-graph substrate — deferred, activation-gated
ADR-0100  2026-02-22  Proposed    Dual search (vector + graph) activation plan
ADR-0105  2026-02-22  Superseded  PDF brain — document library as first-class network utility

System Log Entries

Key operational moments from the structured log (slog), showing the system being built in production:

  • 2026-02-14 20:20 — Installed Qdrant 1.16.3 in Docker. The first vector store.
  • 2026-02-14 20:20 — Created Project 08: memory-system. Credits: Alex Hillman, John Lindquist.
  • 2026-02-15 05:15 — Set up agent memory workspace at ~/.joelclaw/workspace/ with MEMORY.md and daily logs.
  • 2026-02-16 15:02 — Debated Qdrant location: NAS (storage) vs Mac Mini (GPU for embedding). Chose hybrid.
  • 2026-02-16 15:04 — ADR-0021 created (986 lines). Two codex reviews, all blockers fixed. Superseded ADR-0014 and 0020.
  • 2026-02-16 15:04 — Session-lifecycle extension: auto-briefing from MEMORY.md + daily log + slog + projects.
  • 2026-02-16 15:04 — 7 memory pipeline spikes passed: pi structured output, XML parser, Qdrant client, local embedding (nomic-embed-text-v1.5, 768 dims), semantic search, Redis state, background processes.
  • 2026-02-16 15:04 — observe.ts Phase 1 complete: validate-input → call-observer-llm → parse-observations → store-to-qdrant → update-redis-state → emit-accumulated.
  • 2026-02-17 01:15 — Observe pipeline working end-to-end. Smoke test 3/3 PASS.
  • 2026-02-17 03:35 — Phase 3 complete: promote.ts. 16 Inngest functions registered.
  • 2026-02-17 03:48 — Fixed reflect.ts: MEMORY.md read path, XML tag format, prompt structure. Full pipeline verified: observe → reflect (4 proposals) → promote (1 approved, 1 rejected).
  • 2026-02-17 05:44 — Generic embedding function using local all-mpnet-base-v2 via sentence-transformers. 768-dim vectors.
  • 2026-02-17 06:04 — Backfilled 491 zero-vector observations in 8 batches. Semantic search verified across full corpus.
  • 2026-02-17 06:04 — Backfill-observe function: 35 pi session transcripts, 900 Qdrant points (was 511).
  • 2026-02-18 04:03 — Review pipeline end-to-end fix. Curated 112 duplicate proposals into MEMORY.md.
  • 2026-02-18 17:32 — joelclaw recall: semantic search over 520+ observations via CLI.
  • 2026-02-28 16:23 — Paradigm shift: stop using markdown as primary retrieval. Skills, memory, knowledge → all in Typesense with SKOS taxonomy.
  • 2026-02-28 16:25 — Qdrant is dead. Typesense handles ALL retrieval. Single backend.

Git Commits

The implementation trail in code:

Commit   What shipped
4e65b51  Sync memory system (reflect + promote) from worker clone
494683d  Sync worker → monorepo observe.ts
eb0049b  Generic embedding function (all-mpnet-base-v2, 768-dim, local)
369356a  Backfill-observe — resumable, idempotent transcript backfill via Inngest
cc9978a  Phase 3: event-driven promote, review CLI, remove REVIEW.md
7f8914d  joelclaw recall — semantic search over 520+ memory observations
a0929c2  Vault content sync (memory-related)
a825f89  CLI capability adapters — mail, otel, recall, subscriptions
a6de1e0  Fix embedding query for vector search in nightly maintenance

Influences

The architecture vision originates from @jumperz’s 31-piece agent memory system — a comprehensive framework splitting memory into Storage (short-term, files, graph, episodic) and Intelligence (retrieval, decay, advanced, ops) across three implementation phases. Joel shared the thread via Telegram on Feb 20; it became the source blueprint for ADR-0077 and every memory ADR that followed. Concepts like score decay (exp(-0.01 × days)), write gates, dedup at cosine > 0.85, query rewriting, tiered search, nightly/weekly maintenance, budget-aware retrieval, and the trust pass all trace directly to jumperz’s diagrams.

Other influences that shaped the implementation:

  • TrustGraph — Context Cores — Versioned, portable agent memory that ships like code. The “Context Core” pattern: version-pin your agent’s domain knowledge and promote it through environments like a Docker image.
  • Andy (JFDI bot) — Narrative Continuity vs Fact Retrieval — An agent studying another agent’s architecture. The key frame: narrative continuity (maintaining the thread) vs fact retrieval (indexed lookup). Same phylum, different niche.
  • Kuato — Alex Hillman’s session recall tool for Claude Code. Key insight: user messages are the signal — requests, confirmations, corrections, completions reconstruct a session without loading the full transcript.
  • Nat Eliason / Felix — Three-Layer Memory — Autonomous agent with layered memory (working/episodic/long-term), heartbeat crons, and Codex delegation. The three-layer separation validated the tiered approach.
  • Mem0 — Memory layer for AI agents. Observation → extraction → storage → retrieval patterns.
  • A-MEM — Agentic memory with self-reflection. The reflect → propose → promote cycle.
  • Alex Hillman — Credited in the original project creation. Community-building patterns applied to agent memory.
  • John Lindquist — Credited in the original project creation. egghead co-founder, shared context on session-based learning.
  • The Agent Writing Loop — Companion tutorial covering the content pipeline that sits alongside this memory system