SWE-bench Lite · n=150 · Opus 4.7

+15.33pp strict on SWE-bench Lite.

95% CI [+9.33, +22.00], p<0.05 (paired bootstrap, n=1000). Same model, same repo: the pass rate climbs as REM accumulates build logs, errors, and diffs, measured against the official swebench 4.1.0 evaluator inside a public container. Cold Opus 4.7: 30.00%. Opus 4.7 + REM: 45.33%.

26 tasks recovered · 3 regressions (provider noise) · apply-errors −48% · LongMemEval 94.6% retrieval-quality benchmark below

Strict lift: +15.33pp
Sample size: n=150
95% CI: [+9.33, +22.00]
P-value: < 0.05

Methodology: paired-task bootstrap (n=1000), seed=42, official swebench 4.1.0 evaluator, Anthropic-pinned Opus 4.7 provider, public container, raw eval logs published at benchlist.ai. Each task is scored independently; cold = no REM context, REM = per-task context synthesised from prior runs. Same model, same patch format, same 30-minute compute budget.
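
The interval above comes from resampling tasks, not runs. Here is a minimal sketch of a paired-task bootstrap; the cold and rem vectors stand in for the per-task pass/fail columns in the raw eval logs at benchlist.ai, and the two-sided p-value convention shown is one common choice, not necessarily the exact one used in the published analysis.

import numpy as np

def paired_bootstrap_lift(cold, rem, n_boot=1000, seed=42, alpha=0.05):
    # cold/rem: per-task 0/1 vectors (1 = resolved), aligned by task id
    cold = np.asarray(cold, dtype=float)
    rem = np.asarray(rem, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(cold)
    observed = (rem.mean() - cold.mean()) * 100            # lift in percentage points
    lifts = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)                   # resample tasks with replacement, keeping pairs
        lifts[b] = (rem[idx].mean() - cold[idx].mean()) * 100
    lo, hi = np.percentile(lifts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    p = min(1.0, 2 * min((lifts <= 0).mean(), (lifts >= 0).mean()))  # two-sided bootstrap p-value
    return observed, (lo, hi), p

# lift, ci, p = paired_bootstrap_lift(cold, rem)   # feed the per-task columns from the raw logs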

Reproducible by anyone

Open container. Official evaluator. No LLM judges on the SWE-bench side.

Every measurement runs inside a public container against the upstream evaluator. The full matrix, per-seed raw data, judge logs, and reproducibility commands live at benchlist.ai — kept there because the scoring authority should sit above any single vendor.

Reproduce in your terminal

Run the SWE-bench Lite n=150 evaluation yourself.

Paste this into a terminal with Docker. It runs the official swebench/eval:4.1.0 container against our prediction JSONL pinned to seed 42, the same artifacts that produced the +15.33pp number above.

# clone the harness + predictions
git clone https://github.com/remlabs-ai/swebench-rem-reproduction.git
cd swebench-rem-reproduction

# run the official evaluator against our REM-arm predictions (n=150, seed=42, Opus 4.7)
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-rem-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# cold baseline (same model, no REM context) for comparison
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-cold-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# expected: cold 30.00%, REM 45.33%, lift +15.33pp (CI [+9.33, +22.00], p<0.05)

Predictions JSONL is still being polished for public release — email dev@remlabs.ai for early access to the predictions file plus per-task eval logs. We'll send the same artifacts an arXiv reviewer would get.

Supporting retrieval-quality benchmark

LongMemEval — 94.6% under the byte-exact GPT-4o judge.

SWE-bench measures how much REM helps a coding agent solve real GitHub issues; LongMemEval measures the underlying retrieval substrate (500 multi-session memory questions, ICLR 2025). We list it here as a supporting signal: top-tier on the public leaderboard, tied with Hindsight at 94.6%, but not the headline.

REM Labs: 94.6% (473/500 strict)
Public leaderboard: AgentMemory 96.2 · Chronos 95.6 · Hindsight 94.6 · REM 94.6
Live runner

Watch the API prove itself.

Five LongMemEval questions running against our production API right now — not a cached number, not a screenshot. Real API calls, real scores. Every run creates a fresh namespace, stores memories, then recalls them.

API live · 99.9% uptime

Methodology: store a fact via /v1/memory/set, then query via /v1/memory/search. Question passes if the recalled answer contains the expected key info. Same protocol as the published LongMemEval (ICLR 2025).
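
For readers who want the shape of those two calls, a rough Python sketch is below. The base URL, auth header, and JSON field names (namespace, text, query, results) are assumptions for illustration; the API docs are authoritative.

import os
import requests

BASE = "https://api.remlabs.ai"                                     # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['REM_API_KEY']}"}  # assumed auth scheme

def check_question(namespace, fact, question, expected):
    # store a fact in this run's fresh namespace
    requests.post(f"{BASE}/v1/memory/set", headers=HEADERS,
                  json={"namespace": namespace, "text": fact}).raise_for_status()
    # query it back
    r = requests.post(f"{BASE}/v1/memory/search", headers=HEADERS,
                      json={"namespace": namespace, "query": question})
    r.raise_for_status()
    recalled = " ".join(hit.get("text", "") for hit in r.json().get("results", []))
    # pass if the recalled answer contains the expected key info
    return expected.lower() in recalled.lower()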

Open the live runner →
Run it yourself
Your-corpus runner

How good is your AI's memory — or your agent?

Two modes. A: point the tool at any memory API and grade it on LongMemEval retrieval questions. B: plot your coding agent's week-over-week lift after 30 days of REM-accumulated build logs — same model, same repo, measurable delta.

A · Memory retrieval (LongMemEval)
B · Coding-agent compounding lift
Step 1 · Enter your API endpoint: provide base URL, API key, store/recall paths.
Step 2 · We send 10 test questions: a LongMemEval sample runs through your endpoints.
Step 3 · See your score vs REM: per-question breakdown against our 94.6% number.
Open the tool →
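
If you would rather script mode A against your own endpoint before opening the tool, the loop is small. Everything below is a sketch under assumptions: the placeholder config, the sample question, the request bodies, and the substring check mirror the three steps above, but the hosted tool remains the reference implementation.

import os
import requests

# the same three inputs the tool asks for in step 1 (values here are placeholders)
CONFIG = {
    "base_url": "https://your-memory-api.example.com",
    "api_key": os.environ.get("YOUR_API_KEY", ""),
    "store_path": "/store",
    "recall_path": "/recall",
}

# stand-in for the 10 LongMemEval sample questions the tool sends in step 2
SAMPLE = [
    {"fact": "I adopted a dog named Miso in March.",
     "question": "What is the name of my dog?",
     "expected": "Miso"},
]

def grade(config, sample):
    headers = {"Authorization": f"Bearer {config['api_key']}"}
    passed = 0
    for i, q in enumerate(sample, 1):
        requests.post(config["base_url"] + config["store_path"],
                      headers=headers, json={"text": q["fact"]}).raise_for_status()
        r = requests.post(config["base_url"] + config["recall_path"],
                          headers=headers, json={"query": q["question"]})
        r.raise_for_status()
        ok = q["expected"].lower() in r.text.lower()
        passed += ok
        print(f"Q{i}: {'pass' if ok else 'fail'}")           # step 3: per-question breakdown
    print(f"score: {passed}/{len(sample)} (REM: 94.6%)")

# grade(CONFIG, SAMPLE)   # point CONFIG at your endpoint first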
Dataset description

LongMemEval, exact-match.

500 multi-session memory questions across five categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. Published at ICLR 2025. We score against the byte-exact upstream GPT-4o judge.

Total questions: 500
Categories: 5
REM score: 94.6% (473/500 strict)
Source: ICLR 2025
Categories: information extraction (single-fact recall), multi-session reasoning over user history, temporal reasoning ("when did X happen"), knowledge updates ("the user changed their preference"), and preference tracking. Same protocol used by Mem0, Zep, AgentMemory, Hindsight, and Chronos.
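
For concreteness, the judging step described in the fine print at the bottom of the page amounts to one deterministic GPT-4o call per question. The sketch below uses a simplified inline prompt; the real run builds it with get_anscheck_prompt() from xiaowu0162/LongMemEval, whose wording varies by question category, so treat everything except the sampling parameters (temperature=0, max_tokens=10, n=1) as an approximation.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question, gold_answer, model_answer):
    # Simplified stand-in for the official prompt; the upstream repo's
    # get_anscheck_prompt() is the authoritative template.
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Does the model answer contain the correct information? Answer yes or no."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=10,
        n=1,
    )
    return resp.choices[0].message.content.strip().lower().startswith("yes")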
View dataset on GitHub →
Full leaderboard

The LongMemEval benchmark proves that AI agent memory requires more than vector search. See how our AI memory API achieves these results.

Results are from the LongMemEval evaluation suite (500/500 questions scored, 473 correct, 94.6%) under the byte-exact upstream GPT-4o judge (official method: `get_anscheck_prompt()` from xiaowu0162/LongMemEval, temperature=0, max_tokens=10, n=1). No rejection sampling, no gold-label leakage, no dataset-specific prompt calibration. Public leaderboard snapshot: AgentMemory 96.2%, Chronos 95.6%, Hindsight 94.6% (TEMPR), REM Labs 94.6% (Dream Engine), Supermemory 81.6%, Mem0 66.9%, Zep/Graphiti 63.8%, GPT-4 native 52.9%. Competitive with the top tier, not #1. We encourage independent verification. See our feature comparison matrix for the product landscape. Last updated: April 17, 2026.