:: BENCHMARK_SIGNAL

Benchmarks for agent memory
under pressure.

AutoMem's primary numbers now come from the neutral Agent Memory Benchmark: the standardized answerer, judge, and grader that the harness ran for AutoMem. BEAM is the one same-harness, apples-to-apples axis against published competitors, and there AutoMem is a clear #2: ahead of Honcho at every scale, with the largest lead (+16.8pp) at 10M. Competitor figures elsewhere are their own published / self-reported numbers, and the conversational-recall gap behind Hindsight is named here too, not hidden.


Submitted - PR under review June 26, 2026

BEAM @ 10M

57.4%

vs Honcho 40.6%

+16.8pp at the hardest tier

Context / answer

~2.6-4.8k

board leader 17-27k

flat across every scale

Neutral harness — every provider feeds the same answerer and judge, so the score measures retrieval, not grader mood. BEAM is the only apples-to-apples axis; AutoMem is #2 behind Hindsight. A reproducibility and scaling claim, not a state-of-the-art one.

:: AMB_NEUTRAL_HARNESS

Submitted - PR under review AMB v1

Primary claim set: the neutral Agent Memory Benchmark

AutoMem's current primary numbers come from the neutral Agent Memory Benchmark (AMB / vectorize-io), run in single-query / RAG mode with a gemini-3.1-pro-preview answerer and a gemini-2.5-flash-lite judge. The provider self-spins its full stack: Docker (FalkorDB + Qdrant), FastEmbed-local bge-base-en-v1.5 (768d), no embedding API keys, and lean enrichment (ENRICHMENT_ENABLED=false). Run name: automem-sub. The single biggest reason to trust these: you can run them yourself.

Status: Submitted to the neutral board; vectorize-io provider PR is under review. Not live/official on the leaderboard yet.

Core-3 (single-query / RAG)

On conversational Core-3, AutoMem trails the AMB leader Hindsight on all three benchmarks (LoCoMo 85.1% vs 92%, LongMemEval 74.4% vs 94.6%, PersonaMem 76.1% vs 86.6%). The strength is large-context BEAM scaling and context-token efficiency, not verbatim conversational recall.

LoCoMo

85.1% +/-1.8

locomo10 - n=1540

Hindsight (AMB): 92%
Honcho (self-rep.): 89.9%
Mean ctx tokens: 4,768

Hindsight (also AMB, clean comparison) is the proper yardstick; AutoMem trails it 85.1% vs 92%. Honcho's 89.9% is self-reported on its own harness (directional only).

LongMemEval

74.4% +/-3.8

longmemeval/s - n=500

Hindsight (AMB): 94.6%
Honcho (self-rep.): 90.4%
Mean ctx tokens: 3,756

AutoMem trails Hindsight 74.4% vs 94.6% on the clean AMB comparison. Honcho's 90.4% is self-reported on its own harness (directional only).

PersonaMem

76.1% +/-3.4

personamem/32k - n=589

Hindsight (AMB): 86.6%
Honcho (self-rep.): n/a
Mean ctx tokens: 2,588

AutoMem trails Hindsight 76.1% vs 86.6% on the clean AMB comparison. Honcho does not report PersonaMem.
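The +/- figures on the three Core-3 cards are not explained on this page. As a rough sanity check, and assuming they are 95% normal-approximation binomial intervals (an assumption, not something the source states), the half-widths can be recomputed from each score and its n. The function name below is illustrative.

```python
import math

def binomial_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a pass/fail accuracy.

    Assumption: the +/- values on the cards are 95% intervals (z = 1.96).
    The page does not say how they were computed.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)

# (accuracy, n) pairs copied from the Core-3 cards.
core3 = {
    "LoCoMo": (0.851, 1540),
    "LongMemEval": (0.744, 500),
    "PersonaMem": (0.761, 589),
}

for name, (acc, n) in core3.items():
    half = binomial_half_width(acc, n)
    print(f"{name}: {acc * 100:.1f}% +/- {half * 100:.1f}pp (n={n})")
```

Under that assumption the recomputed half-widths round to the published 1.8, 3.8, and 3.4 points. The same formula does not apply directly to BEAM, whose items are scored 0 / 0.5 / 1 rather than pass/fail.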

BEAM scaling (apples-to-apples)

BEAM is the only same-benchmark, same-harness, apples-to-apples axis vs published competitors. AutoMem beats Honcho at every tier; the margin is narrow at 500K and 1M (+0.7pp each) and widest at 10M (+16.8pp). AutoMem is a clear #2 behind Hindsight (vectorize's own reference system).

Scores are rubric-mean (0 / 0.5 / 1 per item, averaged) - a different scale than pass/fail benchmarks.
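To make the scale difference concrete, here is a minimal sketch of rubric-mean scoring next to a strict pass rate. The item scores are hypothetical, and treating only full credit (1.0) as a "pass" is an assumption for illustration; the AMB grader's actual implementation is not shown on this page.

```python
from typing import Sequence

ALLOWED_SCORES = (0.0, 0.5, 1.0)

def rubric_mean(item_scores: Sequence[float]) -> float:
    """Average of per-item rubric scores, each of which must be 0, 0.5, or 1."""
    if not item_scores:
        raise ValueError("need at least one scored item")
    for score in item_scores:
        if score not in ALLOWED_SCORES:
            raise ValueError(f"unexpected rubric score: {score}")
    return sum(item_scores) / len(item_scores)

def strict_pass_rate(item_scores: Sequence[float]) -> float:
    """Share of items with full credit (assumed pass/fail threshold)."""
    return sum(1 for score in item_scores if score == 1.0) / len(item_scores)

# Hypothetical scores for four rubric items.
scores = [1.0, 0.5, 0.5, 0.0]
print(f"rubric mean: {rubric_mean(scores):.2f}")
print(f"strict pass rate: {strict_pass_rate(scores):.2f}")
```

Partial credit lifts the rubric mean above the strict pass rate for the same answers, which is why BEAM percentages should not be read against pass/fail benchmarks.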

BEAM rubric-mean accuracy as the haystack grows 100x (neutral AMB, single-query mode; the Hindsight band is approximate): AutoMem holds 67.5% at 100K and degrades gracefully to 57.4% at 10M, while Honcho stays flat near 63% through 1M then collapses to 40.6% at 10M; Hindsight holds a ~73-64% band across the curve.

Scale	AutoMem	Honcho	Hindsight	vs Honcho (pp)	Mean ctx tokens
100K	67.5% x3 repro, spread 1.8pp	63.0%	~73%	+4.5	3,842
500K	65.6% +/-2.8 (n=700)	64.9%	band	+0.7	3,929
1M	63.8% +/-2.7 (n=700)	63.1%	band	+0.7	3,900
10M	57.4% +/-5.5 (n=200)	40.6%	~64%	+16.8	3,932

Graceful vs collapse

AutoMem degrades gracefully (67.5% -> 57.4%, -10pp) across a 100x haystack increase. Honcho collapses over the same range (63.0% -> 40.6%, -22pp): flat through 1M then a cliff at 10M. Hindsight holds ~73-64% across the curve.
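The per-tier margins and the end-to-end drops quoted above follow directly from the table. A short sketch that recomputes them from the published scores (the variable and function names are illustrative):

```python
# BEAM rubric-mean scores (%) copied from the scaling table.
beam = {
    "100K": {"automem": 67.5, "honcho": 63.0},
    "500K": {"automem": 65.6, "honcho": 64.9},
    "1M": {"automem": 63.8, "honcho": 63.1},
    "10M": {"automem": 57.4, "honcho": 40.6},
}

def margin_pp(row: dict) -> float:
    """AutoMem minus Honcho, in percentage points."""
    return round(row["automem"] - row["honcho"], 1)

def drop_pp(system: str) -> float:
    """Score lost between the smallest and largest haystack, in percentage points."""
    return round(beam["100K"][system] - beam["10M"][system], 1)

margins = {scale: margin_pp(row) for scale, row in beam.items()}

for scale, margin in margins.items():
    print(f"{scale}: AutoMem leads Honcho by {margin:+.1f}pp")
print(f"AutoMem drop 100K -> 10M: {drop_pp('automem')}pp")
print(f"Honcho drop 100K -> 10M: {drop_pp('honcho')}pp")
```

The margin is not monotonic: it shrinks at 500K and 1M before opening up at 10M, which is where the two curves separate.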

The 10M centerpiece

At 10M tokens, context-stuffing is physically impossible, so the score reflects retrieval architecture, not context window. AutoMem holds ~57% where the well-marketed Honcho breaks to ~41%.

// what does 10M tokens actually mean?

plain-English scale check

The haystack

7.5M

words held in memory

Ten million tokens is about 7.5 million words — the complete works of Shakespeare roughly eight times over, or the entire seven-book Harry Potter series read seven times through. One unbroken haystack.

Ask a single question against a memory store that size, and AutoMem returns just the part you needed — about 3,900 tokens of context, call it 3,000 words, in roughly 1.7 seconds (median, on our own hardware). Not the whole library. The page you were looking for.
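The scale check uses a simple tokens-to-words conversion. The sketch below assumes 0.75 words per token, which is the ratio implied by "ten million tokens is about 7.5 million words"; real tokenizers vary by text and model, so treat this as a rule of thumb.

```python
# Assumption: 0.75 words per token, implied by the 10M tokens ~ 7.5M words figure.
WORDS_PER_TOKEN = 0.75

def tokens_to_words(tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough word count for a token count, rounded to the nearest word."""
    return round(tokens * words_per_token)

haystack_words = tokens_to_words(10_000_000)
context_words = tokens_to_words(3_900)

print(f"haystack: ~{haystack_words:,} words")
print(f"returned context: ~{context_words:,} words")
```

At that ratio, 3,900 tokens comes to a little under 3,000 words, matching the "call it 3,000 words" figure.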

Context-token efficiency

AutoMem

~2.6-4.8k

Board leader

17-27k

AutoMem feeds ~2.6-4.8k context tokens at every scale; the board's leader feeds 17-27k on BEAM. This is architectural, not hardware-dependent. Reported as the mean context tokens fed to the answerer (matches the board's avg_context_tokens metric).
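One way to read the efficiency claim is as the share of the haystack that actually reaches the answerer. The sketch below uses AutoMem's mean context tokens from the BEAM table and assumes each tier's haystack is exactly its nominal size (100K, 500K, 1M, 10M tokens), which the page does not confirm.

```python
# (nominal haystack tokens, AutoMem mean context tokens) per BEAM tier.
# Context figures are copied from the scaling table; haystack sizes are nominal.
beam_context = {
    "100K": (100_000, 3_842),
    "500K": (500_000, 3_929),
    "1M": (1_000_000, 3_900),
    "10M": (10_000_000, 3_932),
}

def context_share(haystack_tokens: int, mean_ctx_tokens: int) -> float:
    """Percentage of the haystack that is fed to the answerer."""
    return 100.0 * mean_ctx_tokens / haystack_tokens

for scale, (haystack, ctx) in beam_context.items():
    share = context_share(haystack, ctx)
    print(f"{scale}: {ctx:,} context tokens = {share:.4f}% of the haystack")
```

Because the context stays near 3.9k tokens while the haystack grows 100x, the share falls by roughly two orders of magnitude across the tiers.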

Competitor numbers come from AMB external_results.json (published / self-reported with source attribution), not re-run through the AMB Gemini harness. BEAM is the only same-benchmark, same-harness, apples-to-apples axis vs published competitors. Reproduce with the public image ghcr.io/verygoodplugins/automem:amb-v1 and AUTOMEM_REPRODUCE.md (one command per split, no embedding API keys).

:: WHAT_THE_SCORES_MEAN

What these benchmarks mean

LongMemEval and LoCoMo are not generic product leaderboards. They stress the two parts of agent memory that matter most in practice: recovering the right evidence from long histories, then using that evidence to answer questions without drifting.

LongMemEval

Tests long-range episodic memory: can the system recover facts and updates from a large personal-memory history and answer with the right evidence?

Current full run: 500 questions with a separate answer model and judge.

LoCoMo

Tests conversational memory: multi-session QA over long dialogues, including temporal, single-hop, and multi-hop questions.

Current full run: 10 conversations and 1,986 judged questions.

Recall@5

Separates retrieval from answer synthesis: did the five returned memories include the evidence needed to answer correctly?

LongMemEval full reached recall@5 97.00%, so many remaining misses are synthesis or representation work.
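A minimal sketch of recall@k as described above. It assumes a question counts as a hit when at least one evidence memory appears in the top k results; a stricter variant could require all evidence items. The memory IDs are hypothetical and this is not AutoMem's evaluator code.

```python
from typing import Iterable, List, Sequence, Tuple

def hit_at_k(retrieved_ids: Sequence[str], evidence_ids: Iterable[str], k: int = 5) -> bool:
    """True if any evidence memory is among the first k retrieved memories."""
    top_k = set(retrieved_ids[:k])
    return any(evidence in top_k for evidence in evidence_ids)

def recall_at_k(questions: List[Tuple[Sequence[str], Sequence[str]]], k: int = 5) -> float:
    """Fraction of questions whose evidence was retrieved within the top k."""
    if not questions:
        raise ValueError("need at least one question")
    hits = sum(hit_at_k(retrieved, evidence, k) for retrieved, evidence in questions)
    return hits / len(questions)

# Hypothetical (retrieved ranking, evidence) pairs.
questions = [
    (["m1", "m7", "m3", "m9", "m2", "m4"], ["m3"]),  # hit: evidence ranked 3rd
    (["m5", "m6", "m8", "m1", "m2", "m3"], ["m3"]),  # miss: evidence ranked 6th
]
print(f"recall@5 = {recall_at_k(questions, k=5):.2f}")
```

The metric says nothing about whether the final answer was right, which is why it can sit well above answer accuracy.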

How to compare AutoMem

Treat this as market context, not a leaderboard. The external rows below are reported by their own projects or papers, usually with different models, judges, token budgets, and ingestion rules.

AutoMem canonical runs

official rerun

Generated from the main repository benchmark artifacts with publishable flags, pinned judges, and reproduction links.

Mem0, Zep/Graphiti, Letta, A-MEM

external reported

Useful research and market context when cited from their own artifacts, but not AutoMem-controlled reruns.

External memory systems

context only

Benchmarks are not apples-to-apples unless dataset version, extraction policy, answer model, judge, context budget, and scale are aligned.

BEAM (neutral harness)

apples-to-apples

The one axis where AutoMem and published competitors ran the same benchmark under the same neutral harness. AutoMem is #2 — above Honcho at every tier, below Hindsight. LongMemEval-V2 is still tracked as a future surface.

Published external context

These are public claims and papers found in May 2026. They help readers orient AutoMem, but the rows are not apples-to-apples unless rerun under one harness.

System	LoCoMo	LongMemEval	Other	Status	Source	Notes
AutoMem	84.74%	87.00%	recall@5 97.00%	official AutoMem rerun	AutoMem experiment log	Same canonical internal-run results reported on this page; included here so readers can scan against external reported rows.
Mem0 Cloud	92.5%	94.4%	BEAM 64.1 / 48.6	external reported	Mem0 research	Mem0 reports managed-platform results for LoCoMo, LongMemEval, and BEAM; their docs caution the cloud stack includes proprietary optimizations.
Zep / Graphiti	not reported	71.2% (gpt-4o)	63.8% (gpt-4o-mini)	external reported	Zep paper	Zep's paper reports LongMemEval accuracy and latency versus full-context baselines with GPT-4o-family models.
Supermemory	not reported	81.6% (gpt-4o)	LongMemEval-S	external reported	Supermemory research	Supermemory reports LongMemEval-S category scores and compares against Zep and full-context baselines.
Hindsight	up to 89.61%	91.4%	scaled backbone	external reported	Hindsight benchmarks	Hindsight benchmark materials report LongMemEval and LoCoMo improvements for a structured memory architecture.
Mastra Observational Memory	not reported	84.23% (gpt-4o); 94.87% (gpt-5-mini)	LongMemEval-S	external reported	Mastra research	Mastra reports an open-source LongMemEval-S runner and shows model-dependent scores for Observational Memory.
Honcho	89.9%	90.4%; 92.6% (Gemini 3 Pro)	LongMemEval-S / BEAM	external reported	Honcho benchmark	Honcho reports LongMemEval-S, LoCoMo, and BEAM results while cautioning that some LongMemEval-S setups now fit in frontier context windows.
Letta / MemGPT	not published	not published	benchmark issue open	no official standardized score found	Letta benchmark issue	Letta has an open request for LOCOMO / MemBench / LongMemEval benchmark coverage; no official standardized score was found.
MemMachine	91.69%	not sourced here	LoCoMo benchmark	external reported	MemMachine LoCoMo	MemMachine reports LoCoMo accuracy-efficiency results under its own optimized setup. LongMemEval-S is omitted until a non-removed public source supports the value.
HydraDB	not reported	90.23%	LongMemEval-S / Gemini 3 Pro	external reported	HydraDB benchmark	HydraDB reports LongMemEval-S category results against Supermemory, Zep, full-context, and memory-layer baselines.
Exabase M-1	not reported	96.4% top-50	Gemini 3 Flash	external reported	Exabase research	Exabase reports a May 2026 LongMemEval run focused on retrieval quality with a smaller answer model and no question-specific prompt tuning.

Why the current numbers are credible now

Recent changelog entries show the benchmark harness, evaluator, and recall system maturing before the May full-run verification.

Changelog

Feb 2026

LongMemEval harness

Added the ICLR 2025 LongMemEval benchmark harness and Recall Quality Lab for data-driven recall work.

Mar 2026

LoCoMo and relationship engine

Added LoCoMo judge coverage, optimized the relationship taxonomy, and fixed benchmark evaluator bugs.

Apr 2026

Recall quality and judge reliability

Improved keyword-heavy recall and hardened LoCoMo judge runs against flaky rate limits.

May 2026

Full-run verification

Verified full LongMemEval and LoCoMo runs and promoted only publishable claims.

Jun 2026

Neutral AMB submission

Submitted to the neutral Agent Memory Benchmark, with the same answerer and judge as everyone else. A clear #2 on BEAM scaling, with the conversational-recall gap named openly.

:: INTERNAL_RUNS

Internal runs (own judge — directional)

These full LongMemEval and LoCoMo runs use AutoMem's own answerer and judge, so they are not comparable to other systems' published numbers. The same answers scored 82.0% under a gpt-5-mini judge and 70.25% under a gpt-5 judge — a ~12-point swing from grader strictness alone. Read these as directional; the neutral AMB numbers above are the comparable ones.

Why these aren't a leaderboard claim →

Canonical full

LongMemEval

87.00% 500 questions

87.00% (435/500)

recall@5 97.00% (485/500)

Answer model: gpt-5-mini
Judge: gpt-5.4-mini-2026-03-17
Judge errors: 0

Category	Score
knowledge update	88.46% (69/78)
multi session	84.21% (112/133)
single session assistant	98.21% (55/56)
single session preference	56.67% (17/30)
single session user	92.86% (65/70)
temporal reasoning	87.97% (117/133)

65 wrong total; 54 wrong had answer session retrieved at recall@5; 11 wrong were retrieval misses; 4 correct answers were retrieval misses.
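The error breakdown above can be cross-checked against the headline accuracy and recall@5 figures. A small sketch using only the counts published in this section (the variable names are illustrative):

```python
# Counts copied from the LongMemEval canonical run.
total_questions = 500
correct_answers = 435
recall_hits_at_5 = 485

wrong_with_evidence = 54    # wrong, but the answer session was retrieved
wrong_retrieval_miss = 11   # wrong, and the answer session was not retrieved
correct_despite_miss = 4    # correct even though retrieval missed

wrong_total = total_questions - correct_answers
retrieval_misses = total_questions - recall_hits_at_5

# The two breakdowns should each add up to the headline counts.
assert wrong_total == wrong_with_evidence + wrong_retrieval_miss
assert retrieval_misses == wrong_retrieval_miss + correct_despite_miss

synthesis_share = wrong_with_evidence / wrong_total
print(f"wrong answers: {wrong_total}, retrieval misses: {retrieval_misses}")
print(f"share of wrong answers with evidence retrieved: {synthesis_share:.1%}")
```

Most wrong answers had the right session in the top five, which is consistent with the remaining work being synthesis or representation rather than retrieval.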

Source: Experiment log

Artifact: Generated result file; see the experiment log for path and run context. (generated, gitignored)

SHA256: ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75

Canonical full

LoCoMo

84.74% 1986 questions

84.74% (1683/1986)

Judge: gpt-5.4-mini-2026-03-17
Judge calls: 444
Judge errors: 0

Category	Score
single hop	52.13% (147/282)
temporal	86.60% (278/321)
multi hop	46.88% (45/96)
open domain	93.58% (787/841)
complex	95.52% (426/446)
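The 84.74% headline is a question-weighted (micro) average, so the large open-domain and complex categories dominate it. The sketch below recomputes it from the category counts and contrasts it with an unweighted per-category (macro) average, which is an added illustration and not a figure the source reports.

```python
# (correct, total) per LoCoMo category, copied from the table.
locomo = {
    "single hop": (147, 282),
    "temporal": (278, 321),
    "multi hop": (45, 96),
    "open domain": (787, 841),
    "complex": (426, 446),
}

total_correct = sum(correct for correct, _ in locomo.values())
total_questions = sum(total for _, total in locomo.values())

# Micro average: every question counts equally.
micro_accuracy = total_correct / total_questions

# Macro average: every category counts equally, regardless of size.
macro_accuracy = sum(correct / total for correct, total in locomo.values()) / len(locomo)

print(f"micro: {micro_accuracy:.2%} ({total_correct}/{total_questions})")
print(f"macro: {macro_accuracy:.2%}")
```

The macro figure comes out lower because the two weakest categories, single hop and multi hop, are also among the smallest.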

Source: Experiment log

Artifact: Generated result file; see the experiment log for path and run context. (generated, gitignored)

SHA256: a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716

:: CANARY_AND_EXPLORATORY

Canonical vs exploratory

Canary runs catch drift quickly. Exploratory runs expose useful signals, but they are not public headline claims until the main repository promotes them through the official benchmark flow.

AutoMem should not be described as "best memory system" or a state-of-the-art winner from these rows. External comparisons need apples-to-apples reruns with current systems, judge policies, dataset versions, and scale settings.

Verification canaries
Benchmark	Scope	Score	Status
LongMemEval	mini stratified	70.00% (21/30)	Representative canary
LoCoMo	mini	85.20% (259/304)	Fresh verification

Exploratory signals
Benchmark	Scope	Result
BEAM	100K V1 raw-dialogue shim	76.25% (305/400), avg 0.677
BEAM	100K V2 fact-extraction shim	73.75% (295/400), avg 0.653
Writ	drift category, 5 scenarios	100.0% recall_accuracy; 20.0% update_fidelity; 0.0% drift_rate
Claude Code hook replay	fixture and metrics harness	harness tests only; no headline score

:: REPRODUCE

Reproduction trail

The website uses checked-in benchmark data maintained from the main repository experiment log, judge policy, and neutral AMB submission details. Detailed result JSON files may remain generated and gitignored; the durable claim source is the committed experiment log plus the reproducible AMB run notes.

Experiment Log Judge Policy

Not yet canonical

[ ] LongMemEval-V2
[ ] Mem0 managed-platform apples-to-apples comparison
