Lift +15.33pp, 95% CI [+9.33, +22.00], p<0.05 (paired bootstrap, n=1000). Same model, same repo: pass rate climbs as REM accumulates build logs, errors, and diffs, measured against the official swebench 4.1.0 evaluator inside a public container. Cold Opus 4.7: 30.00%. Opus 4.7 + REM: 45.33%.
26 tasks recovered · 3 regressions (provider noise) · apply-errors −48% · LongMemEval 94.6% retrieval-quality benchmark below
Methodology: paired-task bootstrap (n=1000), seed=42, official swebench 4.1.0 evaluator, Anthropic-pinned Opus 4.7 provider, public container, raw eval logs published at benchlist.ai. Each task scored independently; cold = no REM context, REM = per-task context synthesised from prior runs. Same model, same patch format, same 30-min compute budget.
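The paired-task bootstrap named above can be sketched as follows. This is a minimal illustration, not the published evaluation code: the function name paired_bootstrap_ci, the percentile-interval method, and the reading of n=1000 as the number of resamples are all assumptions. The outcome vectors are synthetic, and the 150-task size is inferred only from the arithmetic (30.00% and 45.33% correspond to 45 and 68 passes out of 150, and 26 recoveries minus 3 regressions gives the same net gain of 23).

```python
import random

def paired_bootstrap_ci(cold, rem, n_resamples=1000, seed=42, alpha=0.05):
    """Return (lift, lo, hi) in percentage points for paired pass/fail outcomes."""
    if len(cold) != len(rem) or not cold:
        raise ValueError("cold and rem must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(cold)
    # Per-task difference: +1 recovered, -1 regressed, 0 unchanged.
    diffs = [r - c for c, r in zip(cold, rem)]
    point = 100.0 * sum(diffs) / n
    # Resample tasks with replacement so each cold/REM pair stays together.
    lifts = sorted(
        100.0 * sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    # Percentile interval over the resampled lifts.
    lo = lifts[int((alpha / 2) * n_resamples)]
    hi = lifts[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi

# Synthetic outcomes (assumed layout): 45 cold passes, of which 3 regress
# under REM, plus 26 cold failures that REM recovers.
cold = [1] * 45 + [0] * 105
rem = [1] * 42 + [0] * 3 + [1] * 26 + [0] * 79

point, lo, hi = paired_bootstrap_ci(cold, rem)
print(f"lift {point:+.2f}pp, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

Resampling whole tasks rather than individual runs is what makes the test paired: each task contributes its own cold-versus-REM difference, so task difficulty cancels out of the comparison. The interval from this sketch will not necessarily match the published one exactly, since the real per-task outcomes and resampling details may differ.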
Every measurement runs inside a public container against the upstream evaluator. The full matrix, per-seed raw data, judge logs, and reproducibility commands live at benchlist.ai — kept there because the scoring authority should sit above any single vendor.
Reproduction artifacts available on request — email dev@remlabs.ai with your use case. Full predictions JSONL + Dockerfile + per-task eval logs published with first cohort.
Reviewers and partners get:
swebench/eval:4.1.0 evaluator
Expected: cold 30.00%, REM 45.33%, lift +15.33pp (CI [+9.33, +22.00], p<0.05)
SWE-bench measures how much REM helps a coding agent solve real GitHub issues; LongMemEval measures the underlying retrieval substrate (500 multi-session memory questions, ICLR 2025). We list it here as a supporting signal — co-leader on the public leaderboard, not the headline.
Five LongMemEval questions running against our production API right now — not a cached number, not a screenshot. Real API calls, real scores. Every run creates a fresh namespace, stores memories, then recalls them.
Methodology: store a fact via /v1/memory/set, then query via /v1/memory/search. A question passes if the recalled answer contains the expected key information. Same protocol as the published LongMemEval (ICLR 2025).
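The store-then-recall check described above might look like the sketch below. The request and response shapes of /v1/memory/set and /v1/memory/search are not given here, so the example uses a hypothetical in-memory stand-in (FakeMemory) instead of real HTTP calls. It also reads "contains the expected key info" as a case-insensitive substring match, which is an assumption; the stored facts are made-up sample data.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return re.sub(r"\s+", " ", text).strip().lower()

def question_passes(recalled, expected_keys):
    """True if every expected key phrase appears in the recalled answer."""
    haystack = normalize(recalled)
    return all(normalize(key) in haystack for key in expected_keys)

class FakeMemory:
    """Hypothetical stand-in for the set/search endpoints; not the real API."""

    def __init__(self):
        self._facts = {}  # namespace -> list of stored strings

    def set(self, namespace, fact):
        self._facts.setdefault(namespace, []).append(fact)

    def search(self, namespace, query):
        # Naive word-overlap ranking, purely for illustration.
        words = set(normalize(query).split())
        facts = self._facts.get(namespace, [])
        return max(
            facts,
            key=lambda f: len(words & set(normalize(f).split())),
            default="",
        )

# Fresh namespace per run, store memories, then recall them.
memory = FakeMemory()
memory.set("run-001", "The user adopted a beagle named Biscuit in March.")
memory.set("run-001", "The user prefers window seats on flights.")
answer = memory.search("run-001", "Which beagle did the user adopt in March?")
print(question_passes(answer, ["Biscuit"]))
```

Keeping the pass check separate from the memory client means the same question_passes function could grade any backend that returns a recalled string.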
Two modes. A: point the tool at any memory API and grade it on LongMemEval retrieval questions. B: plot your coding agent's week-over-week lift after 30 days of REM-accumulated build logs — same model, same repo, measurable delta.
500 multi-session memory questions across five categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. Published at ICLR 2025. We score against the byte-exact upstream GPT-4o judge.
The LongMemEval benchmark shows that AI agent memory requires more than vector search. See how our AI memory API achieves these results.