Agent Memory in 2026: An Honest Comparison of Mem0, Zep, Letta, and the Rest - AutoMem Blog

I have a Claude routine that runs every day and keeps on top of what’s going on in the agentic memory blogosphere. It seems like the hot topic this month is comparing memory systems to Mem0 — I’ve seen six posts just this week.

So I’ll pile in. Is AutoMem better?

Of course, I build it, so take this with the appropriate grain of salt. But the honest answer is the one nobody wants: it depends, and the benchmark numbers you'd use to settle it are mostly self-reported and quietly impossible to compare head-to-head.

That's not a dodge. It's the actual state of the field in 2026. Every memory system publishes a number. Almost none of them publish the same number — same dataset slice, same answer model, same judge, same ingestion rules.

Yes, I put AutoMem in a comparison table like everyone else, but I also want to be realistic about what those scores mean:

This is a field guide, not a victory lap. AutoMem is not the highest-scoring system on this page, and I'm going to show you exactly who scores higher. The point is to help you pick the right tool for your failure mode — not to crown one.

The short version

There is no clean leaderboard. The widely-cited numbers come from vendor blogs, use different models and judges, and sometimes run different slices of the same benchmark.
Retrieval is close to solved; synthesis isn't. On LongMemEval, AutoMem retrieves the right evidence 97% of the time — but only answers correctly 87% of the time. The gap is the interesting part.
Pick on architecture and failure mode, not the headline percentage. Vector-only vs graph+vector, append-everything vs distilled, managed vs self-host, framework-locked vs MCP-shared. Those differences will matter to you long after a two-point benchmark gap stops mattering.

At a glance

Seven systems, by architecture and how you run them:

Mem0 — vector + entity linking · Apache-2.0 · OSS, plus a free managed tier
Zep / Graphiti — temporal knowledge graph · Graphiti is Apache-2.0 · free tier, Zep Cloud managed
Letta — OS-style tiered memory · Apache-2.0 · free 3-agent tier, self-hostable
Supermemory — memory API + RAG · closed-source · managed only, 1M-token free tier
Cognee — graph + vector · open core · self-hostable
LangMem — KV + vector, LangGraph-native · MIT · free, no new infrastructure
AutoMem — graph + vector · MIT · free self-host, or ~$5/month on Railway

The benchmark numbers come later — along with why most of them are hard to trust.

External scores are vendor-reported unless noted. AutoMem's are full-run reruns with a pinned judge — more on why that distinction matters below.

What these benchmarks actually measure

Two datasets dominate the conversation, and they test different things.

LongMemEval (ICLR 2025) stresses long-range episodic memory: bury facts and updates across a large personal history, then ask questions that require finding the right one. It splits into categories — knowledge updates, multi-session, temporal reasoning, preferences — which is what makes it useful for diagnosis instead of a single vanity number.

LoCoMo tests conversational memory: multi-session QA over long dialogues, including single-hop, multi-hop, and temporal questions.

The metric worth understanding is recall@5 — did the top five retrieved memories contain the evidence needed to answer? It cleanly separates finding the right memory from using it. Hold onto that one; it's the whole story later.
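To make the metric concrete, here is a minimal sketch of recall@5 in Python. The question format (`retrieved`, `evidence`) and the session IDs are made up for illustration, and the sketch counts a hit when any gold evidence item shows up in the top five; a benchmark with several evidence sessions per question may instead require all of them.

```python
def recall_at_k(retrieved_ids, evidence_ids, k=5):
    """True if any of the top-k retrieved IDs is a gold evidence ID."""
    top_k = retrieved_ids[:k]
    return any(mem_id in evidence_ids for mem_id in top_k)


def mean_recall_at_k(questions, k=5):
    """Fraction of questions whose top-k retrieval contains gold evidence."""
    hits = sum(recall_at_k(q["retrieved"], q["evidence"], k) for q in questions)
    return hits / len(questions)


# Hypothetical data: ranked retrieval results plus the gold evidence per question.
questions = [
    # Evidence found at rank 3: counts as a hit.
    {"retrieved": ["s3", "s9", "s1"], "evidence": {"s1"}},
    # Evidence only at rank 6: outside the top five, so a miss.
    {"retrieved": ["s4", "s2", "s8", "s7", "s6", "s5"], "evidence": {"s5"}},
]

score = mean_recall_at_k(questions)
print(f"recall@5 = {score:.2f}")
```

Note that nothing here looks at the final answer. That is the point: the metric scores the retrieval step alone.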

Read every number on this page skeptically

Here's the part most comparison posts skip, including the ones written by people who don't sell a memory system.

In December 2025, Penfield Labs audited LoCoMo and found 6.4% of the answer key is wrong, and the LLM judge accepts up to 63% of deliberately incorrect answers. A benchmark whose own answer key is 6% wrong cannot distinguish an 88% system from a 94% one. The error bar swallows the gap.
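The arithmetic behind that claim can be sketched in a few lines. This assumes a simple worst case: every question with a wrong key entry could have been scored the wrong way, so the true accuracy can sit up to the key error rate above or below the measured one. It ignores judge leniency, which only widens the range. The function names are my own.

```python
def true_accuracy_bounds(measured, key_error_rate):
    """Worst-case range for true accuracy when part of the answer key is wrong."""
    low = max(0.0, measured - key_error_rate)
    high = min(1.0, measured + key_error_rate)
    return low, high


def intervals_overlap(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


KEY_ERROR = 0.064  # share of the LoCoMo answer key reported as wrong

system_a = true_accuracy_bounds(0.88, KEY_ERROR)
system_b = true_accuracy_bounds(0.94, KEY_ERROR)

print("88% system:", system_a)
print("94% system:", system_b)
print("ranges overlap:", intervals_overlap(system_a, system_b))
```

When the two ranges overlap, the benchmark alone can't tell you which system is actually better.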

Then there's the methodology drift:

Different slices. Many published LoCoMo numbers run a single conversation or the 81-question slice. AutoMem's 84.74% is the full 10 conversations and 1,986 judged questions. A smaller, easier slice produces a bigger, prettier number.
Different judges and answer models. Swap GPT-4o for a frontier model and scores move several points without the memory system changing at all. Mastra, for example, reports 84.23% with gpt-4o and 94.87% with gpt-5-mini: same system, a 10.6-point swing.
Vendor math fights. When Zep and Mem0 disagreed publicly on LoCoMo, the same configuration got reported as 84%, then 58%, then 75%, depending on who was counting.

A number with no judge, no slice, and no answer model attached isn't a result. It's marketing.
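One way to hold yourself to that rule is to refuse to compare scores that don't carry their own context. The sketch below is a hypothetical record type, not anything from a real harness; `SomeVendor` and its score are invented, while the AutoMem fields are the ones reported later in this post.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkResult:
    system: str
    benchmark: str
    score_pct: float
    benchmark_slice: Optional[str] = None
    answer_model: Optional[str] = None
    judge_model: Optional[str] = None

    def missing_context(self):
        """Names of the provenance fields that were left blank."""
        fields = ("benchmark_slice", "answer_model", "judge_model")
        return [name for name in fields if getattr(self, name) is None]


def comparable(a, b):
    """Scores are comparable only if fully specified and run the same way."""
    if a.missing_context() or b.missing_context():
        return False
    key_a = (a.benchmark, a.benchmark_slice, a.answer_model, a.judge_model)
    key_b = (b.benchmark, b.benchmark_slice, b.answer_model, b.judge_model)
    return key_a == key_b


automem = BenchmarkResult(
    "AutoMem", "LoCoMo", 84.74,
    benchmark_slice="full: 10 conversations, 1,986 questions",
    answer_model="gpt-5-mini",
    judge_model="gpt-5.4-mini",
)
# Hypothetical vendor that published only a headline number.
bare = BenchmarkResult("SomeVendor", "LoCoMo", 92.0)

print(bare.missing_context())
print(comparable(automem, bare))
```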

That's the lens for everything below. I'll label what's vendor-reported, because most of it is.

The systems

Mem0 — the baseline everyone compares against

Mem0 is the gravitational center of this space: ~51k GitHub stars, Apache-2.0, an AWS integration, and the broadest framework support. Architecturally it's vector-first — semantic similarity plus keyword matching — with built-in entity linking for lightweight relationships.

(The standalone graph-store integrations like Neo4j were removed from the open-source SDK in favor of that built-in linking.) It reports 92.5% on LoCoMo and 94.4% on LongMemEval (vendor-reported), the highest pair in this lineup.

The honest tradeoff: it's fundamentally a vector store with entity linking, not a full knowledge graph — so deep multi-hop traversal isn't its native model, and Mem0's own category breakdown shows multi-session as the soft spot. On the managed platform, higher request volumes climb to a $249/month Pro tier.

Best for: teams that want the most-adopted, best-documented option and mostly need fast semantic recall.

Zep / Graphiti — temporal reasoning

Zep is built on Graphiti, an open-source (Apache-2.0) bi-temporal knowledge graph that tracks both when a fact was true and when it was recorded. That's a genuinely good idea: "what did I believe last March" is a different question from "what's true now," and most systems can't tell them apart. Zep reports 71.2% on LongMemEval with gpt-4o, and strong results on its own deep-memory-retrieval benchmark.
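If the two time axes feel abstract, a toy version helps. This is a generic bi-temporal sketch, not Graphiti's data model or API; the `Fact` fields, the `as_of` helper, and the example data are all assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Fact:
    subject: str
    value: str
    valid_from: date           # when the fact became true in the world
    valid_to: Optional[date]   # when it stopped being true (None = open-ended)
    recorded_at: date          # when the system learned about it


def as_of(facts, subject, valid_on, known_by):
    """What the system believed on `known_by` about `subject` at `valid_on`."""
    candidates = [
        f for f in facts
        if f.subject == subject
        and f.recorded_at <= known_by
        and f.valid_from <= valid_on
        and (f.valid_to is None or valid_on < f.valid_to)
    ]
    # The most recently recorded matching fact wins.
    return max(candidates, key=lambda f: f.recorded_at, default=None)


# The user moved on 1 March 2025, but the system only heard about it on 10 April.
facts = [
    Fact("user.city", "Berlin", date(2024, 1, 1), None, date(2024, 1, 5)),
    Fact("user.city", "Berlin", date(2024, 1, 1), date(2025, 3, 1), date(2025, 4, 10)),
    Fact("user.city", "Lisbon", date(2025, 3, 1), None, date(2025, 4, 10)),
]

believed_then = as_of(facts, "user.city", date(2025, 3, 15), known_by=date(2025, 3, 20))
known_now = as_of(facts, "user.city", date(2025, 3, 15), known_by=date(2025, 5, 1))
print(believed_then.value, known_now.value)
```

Same date in the world, two different answers, depending on when you ask what the system knew. A store with a single timestamp can only give you one of them.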

The honest tradeoff: its headline LongMemEval number trails the vector-heavy systems, and its LoCoMo figures are the ones that got into a public dispute. The temporal modeling is real; the marketing math got messy.

Best for: agents where time and fact-validity actually matter — support histories, anything where stale facts are dangerous.

Letta — agent-managed memory

Letta (formerly MemGPT) treats memory like an operating system: a core context the agent edits directly, archival storage for the long tail, and recall search over history. The agent decides what to keep in focus. It's Apache-2.0 with a free tier (3 agents).

The honest tradeoff: there's no standardized LoCoMo or LongMemEval score to point at — there's an open issue requesting exactly that. That's not a knock on the architecture; it just means you can't benchmark-compare it, and self-hosting wants PostgreSQL and some ops appetite.

Best for: builders who want agents that actively curate their own memory rather than a passive store.

Supermemory — built for coding workflows

Supermemory is a managed memory API aimed at coding agents: automatic fact extraction, contradiction resolution, selective forgetting, tight MCP and editor integration, and a generous 1M-token free tier. It reports 81.6% on LongMemEval-S with gpt-4o.

The honest tradeoff: it's closed-source and managed-only, so you're trusting the numbers rather than rerunning them, and reported scores vary a bit across its own materials.

Best for: coding agents that want drop-in memory with minimal plumbing and don't need to self-host.

Cognee — graph+vector with broad ingestion

Cognee is an open-core graph+vector hybrid with 30+ data connectors (Slack, Notion, Drive, databases) and multimodal support. The architecture is sound and the connector breadth is the real draw.

The honest tradeoff: no published LongMemEval or LoCoMo scores — the case rests on architecture rather than measured outcomes — and managed pricing runs $35/month (Developer) to $200/month (Team), with on-prem as custom enterprise.

Best for: pulling memory from many existing data sources, where ingestion breadth beats a benchmark number.

LangMem — native to LangGraph

LangMem is MIT-licensed and built into LangGraph's persistent store: episodic, semantic, and procedural memory with zero new infrastructure to deploy. If you already live in LangChain, it's the lowest-friction option on this list.

The honest tradeoff: it's vector-only with no published standardized benchmarks, and it's bound to the LangGraph ecosystem. Outside that stack, it's not really a candidate.

Best for: teams already on LangGraph who want memory without standing up another service.

AutoMem — graph+vector, built in the open

Since I build it, here's the same honest treatment. AutoMem is MIT-licensed, runs a graph database (FalkorDB) alongside a vector store (Qdrant), and ranks recall with a nine-component score whose signals include vector similarity, keyword overlap, graph-edge strength, recency, importance, and confidence. It carries 11 typed relationships and first-class temporal validity, runs a consolidation/decay cycle so wrong rabbit holes fade, and speaks MCP so the same memory works across Claude Code, Cursor, ChatGPT Desktop, and the rest. Free to self-host with Docker, or about $5/month on Railway.
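To show what a blended score does in practice, here is a simplified sketch using only the six signals named above. The weights, the half-life, and the example memories are invented for illustration; this is not AutoMem's actual formula or its real component list.

```python
# Hypothetical weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "vector": 0.35,
    "keyword": 0.15,
    "graph": 0.20,
    "recency": 0.10,
    "importance": 0.10,
    "confidence": 0.10,
}


def recency(age_days, half_life_days=30.0):
    """Exponential decay: a memory loses half its recency every half-life."""
    return 0.5 ** (age_days / half_life_days)


def blended_score(signals, weights=WEIGHTS):
    """Weighted sum of signals in [0, 1]; missing signals count as zero."""
    return sum(weight * signals.get(name, 0.0) for name, weight in weights.items())


memories = {
    # Closest embedding match, but old, isolated, and low importance.
    "A": {"vector": 0.9, "keyword": 0.2, "graph": 0.0,
          "recency": recency(90), "importance": 0.3, "confidence": 0.8},
    # Weaker embedding match, but recent and well connected in the graph.
    "B": {"vector": 0.7, "keyword": 0.4, "graph": 0.9,
          "recency": recency(3), "importance": 0.6, "confidence": 0.9},
}

ranked = sorted(memories.items(), key=lambda item: blended_score(item[1]), reverse=True)
for name, signals in ranked:
    print(name, round(blended_score(signals), 4))
```

With these made-up numbers, memory B outranks A even though A is the nearer vector match. That is the behavior a pure similarity search can't give you.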

On the numbers: 87.00% on LongMemEval full (435/500) and 84.74% on LoCoMo full (1,986 questions) — answerer gpt-5-mini, judge gpt-5.4-mini, with committed artifacts and sha256 hashes. Mid-pack on raw accuracy, and that's partly by design: the architecture optimizes for relationships and temporal validity over verbatim recall.
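Checking a published hash takes only the standard library. The sketch below is generic: the file name and contents are placeholders written to a temp directory, not AutoMem's real artifacts, and `sha256_of` and `verify` are my own helper names.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hex SHA-256 digest of a file, read in chunks to keep memory flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """True if the file's digest matches the published one."""
    return sha256_of(path) == expected_hex.lower()


# Placeholder artifact standing in for a committed results file.
payload = b'{"correct": 435, "total": 500}'
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "results.json"
    artifact.write_bytes(payload)
    published = hashlib.sha256(payload).hexdigest()
    matches = verify(artifact, published)
    tampered = verify(artifact, "0" * 64)

print(matches, tampered)
```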

The honest tradeoff: more moving parts than a vector store, and it's not the top of the leaderboard. What you get for the extra parts is structured reasoning over your memories, not just similarity.

Best for: agents that need to reason over connected memories and survive across tools, where you'd rather self-host than rent.

The numbers

All external numbers are vendor-reported unless flagged otherwise; AutoMem's are full-run reruns with a pinned judge and committed artifacts:

AutoMem — LoCoMo 84.74%, LongMemEval 87.00%, recall@5 97.00%. Official rerun, full slices. Experiment log
Mem0 — LoCoMo 92.5%, LongMemEval 94.4%; BEAM 64.1 / 48.6. Mem0 research
Zep / Graphiti — LongMemEval 71.2% (gpt-4o); LoCoMo not reported; strong on its own DMR benchmark. Zep paper
Supermemory — LongMemEval-S 81.6% (gpt-4o); LoCoMo not reported. Supermemory research
Letta — no standardized LoCoMo or LongMemEval score published. Open benchmark issue
Cognee — no standardized scores published.
LangMem — no standardized scores published.

For completeness: a cluster of newer entrants report north of 90% on LongMemEval, including Hindsight (91.4%), Honcho (90.4%), MemMachine (93%), and HydraDB (90.2%), most on the easier LongMemEval-S slice and most with frontier answer models. The full external table, with sources and status flags, lives on the AutoMem benchmarks page.

Recall is a lookup, not an LLM call

Here's an architectural difference the percentages hide. A lot of memory systems run an LLM on the read path — to extract, re-rank, or summarize what they hand back. AutoMem doesn't. Recall is a vector lookup plus deterministic scoring over the graph, and it returns the stored memories themselves. The only LLM in the loop is the agent's own — the one that was going to run anyway.

That changes the cost shape. Recall latency is database latency, not database-plus-an-inference-round-trip, and you're not spending summarization tokens on every retrieval.

Writes follow the same rule: storing a memory is a graph operation, and the enrichment that follows — entities, relationships, embeddings — runs in the background on lightweight, non-LLM extraction rather than a model call per write. (Oversized entries can optionally get LLM-compressed on the way in, but a typical short memory never touches one.)

Not an optimized prompt. No prompt in that path at all.

The thing no table shows

Run the recall@5 number back. On LongMemEval, AutoMem retrieves the right session 97% of the time but only answers correctly 87% of the time. I dug into the 65 wrong answers: 54 of them had the correct session sitting right there in the top-5 retrieval. Only 11 were genuine retrieval misses.
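The bookkeeping is worth doing explicitly. The inputs below are the figures from this post; the derived values assume recall@5 was measured over the same 500 questions, which is my reading rather than something the artifacts are quoted on here.

```python
TOTAL = 500                 # LongMemEval full run
CORRECT = 435               # 87.00% end-to-end accuracy
WRONG_WITH_EVIDENCE = 54    # wrong answer, correct session in the top 5
WRONG_RETRIEVAL_MISS = 11   # wrong answer, correct session not retrieved
RECALL_AT_5 = 0.97

wrong = TOTAL - CORRECT
assert wrong == WRONG_WITH_EVIDENCE + WRONG_RETRIEVAL_MISS

# Derived, assuming recall@5 covers the same 500 questions.
retrieval_hits = round(RECALL_AT_5 * TOTAL)
retrieval_misses = TOTAL - retrieval_hits
correct_despite_miss = retrieval_misses - WRONG_RETRIEVAL_MISS

# Share of all errors that happened after retrieval had already succeeded.
synthesis_share = WRONG_WITH_EVIDENCE / wrong
# Accuracy restricted to questions where the evidence was in the top 5.
accuracy_given_hit = (retrieval_hits - WRONG_WITH_EVIDENCE) / retrieval_hits

print(f"wrong answers: {wrong}")
print(f"retrieval misses overall: {retrieval_misses}")
print(f"errors that are synthesis failures: {synthesis_share:.1%}")
print(f"accuracy when evidence was retrieved: {accuracy_given_hit:.1%}")
```

Roughly five out of six wrong answers happen with the evidence already in hand.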

That's not "memory is hard to find." That's "memory was found and the answer still came out wrong." First-stage retrieval is close to saturated. The remaining work is synthesis (reading the retrieved evidence and reasoning over it without drifting), and it's most visible on preference questions, where AutoMem hits 90% retrieval but only 57% end-to-end accuracy.

This is the reframe I'd push on the whole field: we are collectively polishing the retrieval number while the answer-synthesis gap goes unmeasured. The leaderboard rewards the part that's nearly done.

How to choose

Match the failure mode, not the feature list:

Vector-only vs graph+vector. If your queries are "find things like this," a vector store (or Mem0's free tier, or LangMem) is plenty. If they're "what led to this decision" or "what did I use to believe," you want a graph layer: AutoMem, Cognee, or Zep.
Append-everything vs distilled. Some systems store raw history and search it. AutoMem stores distilled memories — decisions, patterns, preferences — and decays the noise. Different bet on what "remembering" means.
Managed vs self-host. Supermemory and Zep Cloud need the least ops. AutoMem, Letta, Cognee, and LangMem can run on your own metal.
Framework-locked vs portable. LangMem is great inside LangGraph and irrelevant outside it. MCP-native systems like AutoMem move with you across tools.

What none of these benchmarks measure

LongMemEval and LoCoMo test recall and answer synthesis. They say nothing about write precision (did you store the right thing), forgetting (does stale data actually leave), privacy boundaries (per-agent isolation, ACLs), or long-running coding-agent workflows where memory accumulates over weeks. Those are where real deployments live, and the field doesn't have good public benchmarks for any of them yet.

So treat every percentage on this page, mine included, as one signal, not a verdict. AutoMem's claim is reproducibility and systems design, not a top-of-the-leaderboard trophy. The numbers are version-stamped and rerunnable on the benchmarks page; the harness, judge policy, and artifact hashes are all public.

If you want to argue about any of this, I'm around: the repo is on GitHub and there's a Discord.

— Jack