LongMemEval
Tests long-range episodic memory: can the system recover facts and updates from a large personal-memory history and answer with the right evidence?
Current full run: 500 questions with a separate answer model and judge.
AutoMem's benchmarks measure whether a memory system can retrieve the right evidence and support correct answers across long histories. This page separates official reruns from external context, canary checks, and exploratory evals so the numbers stay useful without over-claiming.
Source: verygoodplugins/automem@a742602. This is a reproducibility and systems claim, not a state-of-the-art claim.
LongMemEval and LoCoMo are not generic product leaderboards. They stress the two parts of agent memory that matter most in practice: recovering the right evidence from long histories, then using that evidence to answer questions without drifting.
LoCoMo tests conversational memory: multi-session QA over long dialogues, including temporal, single-hop, and multi-hop questions.
Current full run: 10 conversations and 1,986 judged questions.
Recall@5 separates retrieval from answer synthesis: did the five returned memories include the evidence needed to answer correctly?
The full LongMemEval run reached recall@5 of 97.00% (485/500), so most remaining misses (54 of 65 wrong answers) are synthesis or representation work rather than retrieval failures.
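A minimal sketch of how a recall@5 figure like this could be computed, assuming each question records a ranked list of retrieved session IDs and a set of evidence session IDs. The function names and the data layout are hypothetical and are not taken from the AutoMem harness. This version counts a hit when any evidence session appears in the top k; a stricter harness might require all of them.

```python
from typing import Iterable, Sequence


def recall_hit(retrieved: Sequence[str], evidence: Iterable[str], k: int = 5) -> bool:
    """Return True if any evidence session ID appears in the top-k retrieved IDs."""
    top_k = set(retrieved[:k])
    return any(session_id in top_k for session_id in evidence)


def recall_at_k(questions: Sequence[dict], k: int = 5) -> float:
    """Fraction of questions whose top-k results contain at least one evidence session."""
    if not questions:
        return 0.0
    hits = sum(recall_hit(q["retrieved"], q["evidence"], k) for q in questions)
    return hits / len(questions)


# Toy data (illustrative only): the first question is a hit, the second
# has its evidence ranked sixth, so it misses at k=5.
sample = [
    {"retrieved": ["s3", "s9", "s1", "s4", "s7", "s2"], "evidence": ["s1"]},
    {"retrieved": ["s5", "s6", "s8", "s4", "s7", "s2"], "evidence": ["s2"]},
]
print(f"recall@5 = {recall_at_k(sample):.2%}")
```

Scoring retrieval separately from the judged answer is what lets a wrong answer be attributed either to a retrieval miss or to synthesis.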
Treat this as market context, not a leaderboard. The external rows below are reported by their own projects or papers, usually with different models, judges, token budgets, and ingestion rules.
Official reruns are generated from the main repository benchmark artifacts with publishable flags, pinned judges, and reproduction links.
External reported results are useful research and market context when cited from their own artifacts, but they are not AutoMem-controlled reruns.
Benchmarks are not apples-to-apples unless dataset version, extraction policy, answer model, judge, context budget, and scale are aligned.
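One way to make that alignment rule concrete is to record the six dimensions for each run and list the ones that differ before comparing scores. This is only a sketch: the `RunSetup` class, its field names, and the placeholder values are hypothetical and do not describe any real system's configuration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunSetup:
    """The six dimensions named above (hypothetical field names)."""
    dataset_version: str
    extraction_policy: str
    answer_model: str
    judge: str
    context_budget: str
    scale: str


def mismatched_dimensions(a: RunSetup, b: RunSetup) -> list[str]:
    """Names of the dimensions on which two runs differ, in declaration order."""
    return [f.name for f in fields(RunSetup) if getattr(a, f.name) != getattr(b, f.name)]


# Placeholder values for illustration only.
run_a = RunSetup("v1", "raw-dialogue", "model-x", "judge-x", "5 memories", "500 questions")
run_b = RunSetup("v1", "raw-dialogue", "model-y", "judge-y", "5 memories", "500 questions")

diff = mismatched_dimensions(run_a, run_b)
print("comparable" if not diff else f"not apples-to-apples, differs on: {diff}")
```

An empty list is the only case where two scores can be read side by side without caveats.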
Tracked as future comparison surfaces until the main repo promotes AutoMem-controlled runs.
These are public claims and papers found in May 2026. They help readers orient AutoMem, but the rows are not apples-to-apples unless rerun under one harness.
| System | Notes | LoCoMo | LongMemEval | Other | Status | Source |
|---|---|---|---|---|---|---|
| AutoMem | Same canonical results shown above; included here so readers can scan against external reported rows. | 84.74% | 87.00% | recall@5 97.00% | official AutoMem rerun | AutoMem experiment log |
| Mem0 Cloud | Mem0 reports managed-platform results for LoCoMo, LongMemEval, and BEAM; their docs caution the cloud stack includes proprietary optimizations. | 92.5% | 94.4% | BEAM 64.1 / 48.6 | external reported | Mem0 research |
| Zep / Graphiti | Zep's paper reports LongMemEval accuracy and latency versus full-context baselines with GPT-4o-family models. | not reported | 71.2% (gpt-4o) | 63.8% (gpt-4o-mini) | external reported | Zep paper |
| Supermemory | Supermemory reports LongMemEval-S category scores and compares against Zep and full-context baselines. | not reported | 81.6% (gpt-4o) | LongMemEval-S | external reported | Supermemory research |
| Hindsight | The Hindsight preprint reports LongMemEval and LoCoMo improvements for a structured memory architecture. | up to 89.61% | 91.4% | scaled backbone | research preprint | Hindsight arXiv |
| Mastra Observational Memory | Mastra reports an open-source LongMemEval-S runner and shows model-dependent scores for Observational Memory. | not reported | 84.23% (gpt-4o); 94.87% (gpt-5-mini) | LongMemEval-S | external reported | Mastra research |
| Honcho | Honcho reports LongMemEval-S, LoCoMo, and BEAM results while cautioning that some LongMemEval-S setups now fit in frontier context windows. | 89.9% | 90.4%; 92.6% (Gemini 3 Pro) | LongMemEval-S / BEAM | external reported | Honcho benchmark |
| Letta / MemGPT | Letta has an open request for LOCOMO / MemBench / LongMemEval benchmark coverage; no official standardized score was found. | not published | not published | benchmark issue open | no official standardized score found | Letta benchmark issue |
| MemMachine | A 2026 arXiv preprint reports LoCoMo and LongMemEvalS accuracy-efficiency results under its own optimized setup. | 91.69% | 93.0% | gpt4.1-mini / LongMemEvalS | research preprint | MemMachine arXiv |
| HydraDB | HydraDB reports LongMemEval-S category results against Supermemory, Zep, full-context, and memory-layer baselines. | not reported | 90.23% | LongMemEval-S / Gemini 3 Pro | external reported | HydraDB benchmark |
| Exabase M-1 | Exabase reports a May 2026 LongMemEval run focused on retrieval quality with a smaller answer model and no question-specific prompt tuning. | not reported | 96.4% top-50 | Gemini 3 Flash | external reported | Exabase research |
Recent changelog entries show the benchmark harness, evaluator, and recall system maturing before the May full-run verification.
Added the ICLR 2025 LongMemEval benchmark harness and Recall Quality Lab for data-driven recall work.
Added LoCoMo judge coverage, optimized the relationship taxonomy, and fixed benchmark evaluator bugs.
Improved keyword-heavy recall and hardened LoCoMo judge runs against flaky rate limits.
Verified full LongMemEval and LoCoMo runs and promoted only publishable claims.
LongMemEval full run: 87.00% (435/500)
recall@5: 97.00% (485/500)
| Category | Score |
|---|---|
| knowledge update | 88.46% (69/78) |
| multi session | 84.21% (112/133) |
| single session assistant | 98.21% (55/56) |
| single session preference | 56.67% (17/30) |
| single session user | 92.86% (65/70) |
| temporal reasoning | 87.97% (117/133) |
Of the 65 wrong answers, 54 had the answer session retrieved within the top five and 11 were retrieval misses. A further 4 answers were correct despite a retrieval miss, which accounts for all 15 recall@5 misses (500 - 485).
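The error breakdown can be cross-checked against the headline numbers with a few lines of arithmetic. The sketch below uses only the counts reported on this page; the variable names are illustrative.

```python
# Counts reported for the full LongMemEval run.
total = 500
correct = 435
wrong_with_evidence = 54      # wrong, but the answer session was in the top five
wrong_retrieval_miss = 11     # wrong, and the answer session was not retrieved
correct_retrieval_miss = 4    # correct even though retrieval missed

wrong = total - correct
# The two kinds of wrong answer should account for every wrong answer.
assert wrong == wrong_with_evidence + wrong_retrieval_miss

retrieval_misses = wrong_retrieval_miss + correct_retrieval_miss
recall_hits = total - retrieval_misses
share_with_evidence = wrong_with_evidence / wrong

print(f"wrong answers: {wrong}")
print(f"recall@5: {recall_hits}/{total} = {recall_hits / total:.2%}")
print(f"wrong answers with evidence retrieved: {share_with_evidence:.2%}")
```

The last figure is the share of errors that better retrieval alone would not fix, which is why the remaining work is framed as synthesis or representation.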
Source: Experiment log
Artifact: benchmarks/results/longmemeval-full-publication-20260518.json (generated, gitignored)
SHA256: ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75
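Because the result file is generated and gitignored, the published digest is the way to confirm that a local copy matches the one behind this claim. A minimal sketch using Python's standard `hashlib`, assuming you already have the artifact at the path listed above; the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published hex digest, ignoring case and whitespace."""
    return sha256_of_file(path) == expected_hex.strip().lower()


artifact = Path("benchmarks/results/longmemeval-full-publication-20260518.json")
published = "ed6f7cf69b7be6fa0050536ec2b0f947f5510afd8c2a374b3fafb9cde009da75"

if artifact.exists():
    print("digest matches" if matches_published_hash(artifact, published) else "digest differs")
else:
    print("artifact not found locally")
```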
LoCoMo full run: 84.74% (1683/1986)
| Category | Score |
|---|---|
| single hop | 52.13% (147/282) |
| temporal | 86.60% (278/321) |
| multi hop | 46.88% (45/96) |
| open domain | 93.58% (787/841) |
| complex | 95.52% (426/446) |
Source: Experiment log
Artifact: benchmarks/results/locomo_baseline_20260517_193934.json (generated, gitignored)
SHA256: a75816e9a6d3302c22b34852b75ac19a9d9f5cb27d1a109e0af7e49359330716
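The LoCoMo headline is a question-weighted (micro) average, so the large open-domain and complex categories dominate it. The sketch below recomputes it from the category counts reported above and contrasts it with an unweighted (macro) average across categories. The macro figure is an illustration of the difference, not a number this page publishes.

```python
# (correct, total) per category, as reported in the LoCoMo table.
locomo = {
    "single hop": (147, 282),
    "temporal": (278, 321),
    "multi hop": (45, 96),
    "open domain": (787, 841),
    "complex": (426, 446),
}

correct = sum(c for c, _ in locomo.values())
total = sum(t for _, t in locomo.values())

# Micro average: every question counts equally.
micro = correct / total
# Macro average: every category counts equally, regardless of size.
macro = sum(c / t for c, t in locomo.values()) / len(locomo)

print(f"micro: {correct}/{total} = {micro:.2%}")
print(f"macro: {macro:.2%}")
```

The gap between the two shows how much the weaker single-hop and multi-hop categories are masked by the larger ones in the headline score.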
Canary runs catch drift quickly. Exploratory runs expose useful signals, but they are not public headline claims until the main repository promotes them through the official benchmark flow.
| Benchmark | Scope | Score | Status |
|---|---|---|---|
| LongMemEval | mini stratified | 70.00% (21/30) | Representative canary |
| LoCoMo | mini | 85.20% (259/304) | Fresh verification |
| Benchmark | Scope | Result |
|---|---|---|
| BEAM | 100K V1 raw-dialogue shim | 76.25% (305/400), avg 0.677 |
| BEAM | 100K V2 fact-extraction shim | 73.75% (295/400), avg 0.653 |
| Writ | drift category, 5 scenarios | 100.0% recall_accuracy; 20.0% update_fidelity; 0.0% drift_rate |
| Claude Code hook replay | fixture and metrics harness | harness tests only; no publication score |
The website consumes artifact-manifest.json from the main repository benchmark artifact directory.
Detailed result JSON files may remain generated and gitignored; the durable claim source is the committed experiment log plus the synced artifact manifest.