:: BENCHMARK_SIGNAL

Benchmarks for agent memory under pressure.

AutoMem's benchmark suite measures whether a memory system can retrieve the right evidence and support correct answers across long histories. The page separates official reruns from external context, canary checks, and exploratory evals so the numbers stay useful without over-claiming.

Canonical current: May 18, 2026

LongMemEval: 87.00% (435/500), recall@5 97.00% (485/500)
LoCoMo: 84.74% (1683/1986), pinned judge, full run

Source: verygoodplugins/automem@a742602. This is a reproducibility and systems claim, not a state-of-the-art claim.

:: WHAT_THE_SCORES_MEAN

What these benchmarks mean

LongMemEval and LoCoMo are not generic product leaderboards. They stress the two parts of agent memory that matter most in practice: recovering the right evidence from long histories, then using that evidence to answer questions without drifting.

LongMemEval

Tests long-range episodic memory: can the system recover facts and updates from a large personal-memory history and answer with the right evidence?

Current full run: 500 questions with a separate answer model and judge.

LoCoMo

Tests conversational memory: multi-session QA over long dialogues, including temporal, single-hop, and multi-hop questions.

Current full run: 10 conversations and 1,986 judged questions.

Recall@5

Separates retrieval from answer synthesis: did the five returned memories include the evidence needed to answer correctly?

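The full LongMemEval run reached recall@5 of 97.00%, so most of the remaining misses (54 of the 65 wrong answers had the answer session retrieved) point to synthesis or representation work rather than retrieval.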
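
As a minimal sketch of how a recall@5 figure can be scored, assume each question comes with a set of evidence session IDs and the retriever returns a ranked list of IDs. The function names and data layout here are illustrative assumptions, not AutoMem's harness, and the sketch counts a hit when at least one evidence ID appears in the top k; a stricter variant would require all of them.

```python
from typing import Iterable, Sequence


def hit_at_k(retrieved: Sequence[str], evidence: Iterable[str], k: int = 5) -> bool:
    """Return True if any evidence ID appears among the top-k retrieved IDs."""
    top_k = set(retrieved[:k])
    return any(e in top_k for e in evidence)


def recall_at_k(
    runs: Sequence[tuple[Sequence[str], Iterable[str]]], k: int = 5
) -> float:
    """Fraction of questions whose top-k results contain evidence.

    Each item in `runs` is (ranked retrieved IDs, evidence IDs) for one question.
    """
    if not runs:
        raise ValueError("no questions to score")
    hits = sum(hit_at_k(retrieved, evidence, k) for retrieved, evidence in runs)
    return hits / len(runs)
```

Scoring retrieval separately from the judged answer is what lets a wrong answer be attributed either to a retrieval miss or to synthesis over evidence that was already present.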

How to compare AutoMem

Treat this as market context, not a leaderboard. The external rows below are reported by their own projects or papers, usually with different models, judges, token budgets, and ingestion rules.

AutoMem canonical runs

official rerun

Generated from the main repository benchmark artifacts with publishable flags, pinned judges, and reproduction links.

Mem0, Zep/Graphiti, Letta, A-MEM

external reported

Useful research and market context when cited from their own artifacts, but not AutoMem-controlled reruns.

External memory systems

context only

Benchmarks are not apples-to-apples unless dataset version, extraction policy, answer model, judge, context budget, and scale are aligned.

BEAM and LongMemEval-V2

not yet canonical

Tracked as future comparison surfaces until the main repo promotes AutoMem-controlled runs.
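
One way to make the apples-to-apples condition concrete is to record the six alignment factors named above for each run and list where two runs differ. This is only a sketch: `RunConfig`, `mismatched_fields`, and `comparable` are hypothetical names, and the field values a real harness would store are not specified on this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunConfig:
    """The six factors that must match before two scores are compared."""
    dataset_version: str
    extraction_policy: str
    answer_model: str
    judge: str
    context_budget: str
    scale: str


def mismatched_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Names of the factors on which the two runs differ."""
    return [
        f.name
        for f in fields(RunConfig)
        if getattr(a, f.name) != getattr(b, f.name)
    ]


def comparable(a: RunConfig, b: RunConfig) -> bool:
    """Two runs are apples-to-apples only if no factor differs."""
    return not mismatched_fields(a, b)
```

Returning the list of differing factors, rather than a bare boolean, makes it easy to annotate a comparison row with the reason it is context only.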

Published external context

These are public claims and papers found in May 2026. They help readers orient AutoMem, but the rows are not apples-to-apples unless rerun under one harness.

System | LoCoMo | LongMemEval | Other | Status | Source

AutoMem | 84.74% | 87.00% | recall@5 97.00% | official AutoMem rerun | AutoMem experiment log
  Same canonical results shown above; included here so readers can scan against external reported rows.

Mem0 Cloud | 92.5% | 94.4% | BEAM 64.1 / 48.6 | external reported | Mem0 research
  Mem0 reports managed-platform results for LoCoMo, LongMemEval, and BEAM; their docs caution the cloud stack includes proprietary optimizations.

Zep / Graphiti | not reported | 71.2% (gpt-4o) | 63.8% (gpt-4o-mini) | external reported | Zep paper
  Zep's paper reports LongMemEval accuracy and latency versus full-context baselines with GPT-4o-family models.

Supermemory | not reported | 81.6% (gpt-4o) | LongMemEval-S | external reported | Supermemory research
  Supermemory reports LongMemEval-S category scores and compares against Zep and full-context baselines.

Hindsight | up to 89.61% | 91.4% | scaled backbone | research preprint | Hindsight arXiv
  The Hindsight preprint reports LongMemEval and LoCoMo improvements for a structured memory architecture.

Mastra Observational Memory | not reported | 84.23% (gpt-4o); 94.87% (gpt-5-mini) | LongMemEval-S | external reported | Mastra research
  Mastra reports an open-source LongMemEval-S runner and shows model-dependent scores for Observational Memory.

Honcho | 89.9% | 90.4%; 92.6% (Gemini 3 Pro) | LongMemEval-S / BEAM | external reported | Honcho benchmark
  Honcho reports LongMemEval-S, LoCoMo, and BEAM results while cautioning that some LongMemEval-S setups now fit in frontier context windows.

Letta / MemGPT | not published | not published | benchmark issue open | no official standardized score found | Letta benchmark issue
  Letta has an open request for LOCOMO / MemBench / LongMemEval benchmark coverage; no official standardized score was found.

MemMachine | 91.69% | 93.0% | gpt-4.1-mini / LongMemEval-S | research preprint | MemMachine arXiv
  A 2026 arXiv preprint reports LoCoMo and LongMemEval-S accuracy-efficiency results under its own optimized setup.

HydraDB | not reported | 90.23% | LongMemEval-S / Gemini 3 Pro | external reported | HydraDB benchmark
  HydraDB reports LongMemEval-S category results against Supermemory, Zep, full-context, and memory-layer baselines.

Exabase M-1 | not reported | 96.4% | top-50 Gemini 3 Flash | external reported | Exabase research
  Exabase reports a May 2026 LongMemEval run focused on retrieval quality with a smaller answer model and no question-specific prompt tuning.

Why the current numbers are credible now

Recent changelog entries show the benchmark harness, evaluator, and recall system maturing before the May full-run verification.

Changelog
  1. Feb 2026

    LongMemEval harness

    Added the ICLR 2025 LongMemEval benchmark harness and Recall Quality Lab for data-driven recall work.

  2. Mar 2026

    LoCoMo and relationship engine

    Added LoCoMo judge coverage, optimized the relationship taxonomy, and fixed benchmark evaluator bugs.

  3. Apr 2026

    Recall quality and judge reliability

    Improved keyword-heavy recall and hardened LoCoMo judge runs against flaky rate limits.

  4. May 2026

    Full-run verification

    Verified full LongMemEval and LoCoMo runs and promoted only publishable claims.

:: CANONICAL_RESULTS
Canonical full

LongMemEval

87.00% (435/500), 500 questions

recall@5 97.00% (485/500)

Answer model: gpt-5-mini
Judge: gpt-5.4-mini-2026-03-17
Judge errors: 0

Category | Score
knowledge update | 88.46% (69/78)
multi session | 84.21% (112/133)
single session assistant | 98.21% (55/56)
single session preference | 56.67% (17/30)
single session user | 92.86% (65/70)
temporal reasoning | 87.97% (117/133)

Of the 65 wrong answers, 54 had the answer session retrieved within recall@5 and 11 were retrieval misses. A further 4 questions were answered correctly despite a retrieval miss.
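
The counts in this breakdown can be cross-checked against the headline figures with a few lines of arithmetic. The sketch below uses only numbers stated on this page; the function name and the two derived ratios are illustrative additions, not metrics the harness reports.

```python
def longmemeval_breakdown(
    total: int,
    correct: int,
    recall_hits: int,
    wrong_with_evidence: int,
    wrong_retrieval_miss: int,
    correct_retrieval_miss: int,
) -> dict[str, float]:
    """Check that the error counts are consistent and derive two ratios."""
    wrong = total - correct
    retrieval_misses = total - recall_hits
    if wrong != wrong_with_evidence + wrong_retrieval_miss:
        raise ValueError("wrong-answer counts do not add up")
    if retrieval_misses != wrong_retrieval_miss + correct_retrieval_miss:
        raise ValueError("retrieval-miss counts do not add up")
    correct_with_evidence = correct - correct_retrieval_miss
    return {
        "wrong": wrong,
        "retrieval_misses": retrieval_misses,
        # Share of wrong answers where the evidence was already retrieved.
        "share_of_errors_with_evidence": wrong_with_evidence / wrong,
        # Accuracy restricted to questions whose evidence was retrieved.
        "accuracy_given_evidence": correct_with_evidence / recall_hits,
    }


stats = longmemeval_breakdown(500, 435, 485, 54, 11, 4)
print(f"{stats['share_of_errors_with_evidence']:.1%}")  # 83.1%
print(f"{stats['accuracy_given_evidence']:.1%}")        # 88.9%
```

Both consistency checks pass: 54 + 11 = 65 wrong answers, and 11 + 4 = 15 retrieval misses, which matches 500 - 485.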

Source: Experiment log

Artifact: benchmarks/results/longmemeval-full-publication-20260518.json (generated, gitignored)

SHA256: ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75

Canonical full

LoCoMo

84.74% (1683/1986), 1986 questions

Judge: gpt-5.4-mini-2026-03-17
Judge calls: 444
Judge errors: 0

Category | Score
single hop | 52.13% (147/282)
temporal | 86.60% (278/321)
multi hop | 46.88% (45/96)
open domain | 93.58% (787/841)
complex | 95.52% (426/446)
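
The 84.74% headline matches the pooled count over all questions, so it is weighted by category size. As a worked check using the category counts above, the sketch below compares that pooled (micro) accuracy with an unweighted (macro) mean over categories; the macro figure is an illustrative addition, not a number this page publishes.

```python
# (correct, total) per LoCoMo category, copied from the table above.
LOCOMO_COUNTS = {
    "single hop": (147, 282),
    "temporal": (278, 321),
    "multi hop": (45, 96),
    "open domain": (787, 841),
    "complex": (426, 446),
}


def micro_accuracy(counts: dict[str, tuple[int, int]]) -> float:
    """Pooled accuracy: total correct over total questions."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


def macro_accuracy(counts: dict[str, tuple[int, int]]) -> float:
    """Unweighted mean of per-category accuracies."""
    return sum(c / t for c, t in counts.values()) / len(counts)


print(f"micro {micro_accuracy(LOCOMO_COUNTS):.2%}")  # micro 84.74%
print(f"macro {macro_accuracy(LOCOMO_COUNTS):.2%}")  # macro 74.94%
```

The gap between the two reflects the weaker single hop and multi hop categories, which together hold 378 of the 1,986 questions.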

Source: Experiment log

Artifact: benchmarks/results/locomo_baseline_20260517_193934.json (generated, gitignored)

SHA256: a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716
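
Because the result files are generated and gitignored, a published digest is only useful against a local copy of the artifact. A minimal sketch of that check with Python's standard `hashlib` follows; the helper names are hypothetical, and the page does not say whether regenerating a run reproduces the file byte for byte.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Compare a local artifact against the digest printed on this page."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest(Path("benchmarks/results/locomo_baseline_20260517_193934.json"), "a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716")` would return `True` only for a file identical to the one behind the LoCoMo result.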

:: CANARY_AND_EXPLORATORY

Canonical vs exploratory

Canary runs catch drift quickly. Exploratory runs expose useful signals, but they are not public headline claims until the main repository promotes them through the official benchmark flow.

AutoMem should not be described as the "best memory system" or as a state-of-the-art winner on the basis of these rows. External comparisons need apples-to-apples reruns with current systems, judge policies, dataset versions, and scale settings.

Verification canaries

Benchmark | Scope | Score | Status
LongMemEval | mini stratified | 70.00% (21/30) | Representative canary
LoCoMo | mini | 85.20% (259/304) | Fresh verification
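
Canary scores come from small samples, so a drift check is easier to interpret alongside an uncertainty range. The sketch below computes a Wilson score interval for a proportion. It is an illustrative aid rather than part of the AutoMem benchmark flow, and it treats questions as independent draws, which a stratified sample only approximates.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo, hi = wilson_interval(21, 30)      # LongMemEval canary counts
print(f"{lo:.3f} to {hi:.3f}")
lo, hi = wilson_interval(259, 304)    # LoCoMo canary counts
print(f"{lo:.3f} to {hi:.3f}")
```

With 30 questions the interval spans roughly 0.52 to 0.83, while 304 questions narrow it to roughly 0.81 to 0.89, so a drift threshold on the smaller canary needs correspondingly more slack.
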

Exploratory signals

Benchmark | Scope | Result
BEAM 100K | V1 raw-dialogue shim | 76.25% (305/400), avg 0.677
BEAM 100K | V2 fact-extraction shim | 73.75% (295/400), avg 0.653
Writ | drift category, 5 scenarios | 100.0% recall_accuracy; 20.0% update_fidelity; 0.0% drift_rate
Claude Code hook replay | fixture and metrics harness | harness tests only; no publication score

:: REPRODUCE

Reproduction trail

The website consumes artifact-manifest.json from the main repository benchmark artifact directory. Detailed result JSON files may remain generated and gitignored; the durable claim source is the committed experiment log plus the synced artifact manifest.

Not yet canonical

  • [ ] BEAM official 1M/10M
  • [ ] LongMemEval-V2
  • [ ] Mem0 managed-platform apples-to-apples comparison