Cold Opus 4.7: 30.00%. Opus 4.7 + REM: 45.33%. Lift: +15.33pp, 95% CI [+9.33, +22.00], p<0.05 (paired bootstrap, n=1000). Same model, same repo: pass rate climbs as REM accumulates build logs, errors, and diffs, measured against the official SWE-bench 4.1.0 evaluator inside a public container.
26 tasks recovered · 3 regressions (provider noise) · apply-errors −48% · LongMemEval 94.6% (retrieval-quality benchmark below)
Methodology: paired-task bootstrap (n=1000), seed=42, official SWE-bench 4.1.0 evaluator, Anthropic-pinned Opus 4.7 provider, public container, raw eval logs published at benchlist.ai. Each task is scored independently; cold = no REM context, REM = per-task context synthesised from prior runs. Same model, same patch format, same 30-min compute budget.
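For the curious, here is a minimal sketch of the paired-task bootstrap behind that interval: resample task indices with replacement, keep each task's (cold, REM) outcome pair intact, and take percentile bounds of the lift. Function and variable names are illustrative, not from our harness, and the toy data below is random, not our per-task results.

import numpy as np

def paired_bootstrap_ci(cold, rem, n_boot=1000, seed=42, alpha=0.05):
    # Resample task indices with replacement; using the same index matrix
    # for both arms preserves the per-task pairing.
    rng = np.random.default_rng(seed)
    cold, rem = np.asarray(cold, float), np.asarray(rem, float)
    idx = rng.integers(0, len(cold), size=(n_boot, len(cold)))
    lifts = rem[idx].mean(axis=1) - cold[idx].mean(axis=1)
    lo, hi = np.percentile(lifts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return 100 * lo, 100 * hi  # in percentage points; p<0.05 iff the 95% CI excludes 0

# toy pass/fail vectors for 150 tasks -- NOT our real per-task data
rng = np.random.default_rng(7)
cold_runs = rng.random(150) < 0.30
rem_runs = rng.random(150) < 0.45
print(paired_bootstrap_ci(cold_runs, rem_runs))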
Every measurement runs inside a public container against the upstream evaluator. The full matrix, per-seed raw data, judge logs, and reproducibility commands live at benchlist.ai — kept there because the scoring authority should sit above any single vendor.
Paste the commands below into a terminal with Docker installed. They pull the official swebench/eval:4.1.0 container and run our prediction JSONL pinned to seed 42, the same artifacts that produced the +15.33pp number above.
# clone the harness + predictions
git clone https://github.com/remlabs-ai/swebench-rem-reproduction.git
cd swebench-rem-reproduction

# run the official evaluator against our REM-arm predictions (n=150, seed=42, Opus 4.7)
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-rem-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# cold baseline (same model, no REM context) for comparison
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-cold-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# expected: cold 30.00%, REM 45.33%, lift +15.33pp (CI [+9.33, +22.00], p<0.05)
The predictions JSONL is still being polished for public release; email dev@remlabs.ai for early access to the predictions file plus per-task eval logs. We'll send the same artifacts an arXiv reviewer would get.
SWE-bench measures how much REM helps a coding agent solve real GitHub issues; LongMemEval measures the underlying retrieval substrate (500 multi-session memory questions, ICLR 2025). We list it here as a supporting signal — co-leader on the public leaderboard, not the headline.
Five LongMemEval questions running against our production API right now — not a cached number, not a screenshot. Real API calls, real scores. Every run creates a fresh namespace, stores memories, then recalls them.
Methodology: store a fact via /v1/memory/set, then query via /v1/memory/search. A question passes if the recalled answer contains the expected key information. Same protocol as the published LongMemEval (ICLR 2025).
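A minimal sketch of that store-then-recall check. Only the two endpoint paths come from the protocol above; the base URL, bearer-token auth, and JSON field names are assumptions for illustration.

import os
import requests

BASE = "https://api.remlabs.ai"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['REM_API_KEY']}"}  # assumed auth scheme

# store one fact in a fresh namespace (JSON field names are assumptions)
requests.post(f"{BASE}/v1/memory/set", headers=HEADERS, json={
    "namespace": "longmemeval-demo",
    "content": "The user's dentist appointment moved to March 14.",
}).raise_for_status()

# recall it and apply the pass criterion from the methodology above
resp = requests.post(f"{BASE}/v1/memory/search", headers=HEADERS, json={
    "namespace": "longmemeval-demo",
    "query": "When is my dentist appointment?",
})
resp.raise_for_status()
answer = resp.json()["answer"]  # response shape is an assumption
print("PASS" if "March 14" in answer else "FAIL")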
Two modes. A: point the tool at any memory API and grade it on LongMemEval retrieval questions. B: plot your coding agent's week-over-week lift after 30 days of REM-accumulated build logs — same model, same repo, measurable delta.
500 multi-session memory questions across five categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. Published at ICLR 2025. We score against the byte-exact upstream GPT-4o judge.
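To make "byte-exact upstream judge" concrete, here is the shape of a judge call, assuming the OpenAI Python client. The template path is a placeholder for the verbatim prompt that ships with the LongMemEval release; the template's placeholder names and the yes/no output parsing are assumptions, not the upstream format.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# placeholder path: the real template ships with the LongMemEval release
JUDGE_TEMPLATE = open("longmemeval_judge_prompt.txt").read()

def judge(question: str, expected: str, model_answer: str) -> bool:
    # Fill the template and ask GPT-4o for a deterministic verdict.
    prompt = JUDGE_TEMPLATE.format(
        question=question, expected=expected, answer=model_answer
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    # yes/no parsing is an assumption about the template's output format
    return out.choices[0].message.content.strip().lower().startswith("yes")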
The LongMemEval results show that AI agent memory requires more than vector search. See how our AI memory API achieves these numbers.