Cold Opus 4.7: 30.00%. Opus 4.7 + REM: 45.33%. Lift: +15.33pp, 95% CI [+9.33, +22.00], p<0.05 (paired bootstrap, n=1000). Same model, same repo: pass rate climbs as REM accumulates build logs, errors, and diffs, measured against the official SWE-bench 4.1.0 evaluator inside a public container.
26 tasks recovered · 3 regressions (provider noise) · apply-errors −48% · LongMemEval 94.6% (retrieval-quality benchmark below)
Methodology: paired-task bootstrap (n=1000), seed=42, official SWE-bench 4.1.0 evaluator, Anthropic-pinned Opus 4.7 provider, public container, raw eval logs published at benchlist.ai. Each task is scored independently; cold = no REM context, REM = per-task context synthesised from prior runs. Same model, same patch format, same 30-min compute budget.
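For the curious, here is a minimal sketch of the paired-task bootstrap behind that interval: resample task indices with replacement, keep each task's (cold, REM) outcome pair intact, and take percentile bounds of the lift. Function and variable names are illustrative, not from our harness, and the toy data below is random, not our per-task results.

import numpy as np

def paired_bootstrap_ci(cold, rem, n_boot=1000, seed=42, alpha=0.05):
    # Resample task indices with replacement; using the same index matrix
    # for both arms preserves the per-task pairing.
    rng = np.random.default_rng(seed)
    cold, rem = np.asarray(cold, float), np.asarray(rem, float)
    idx = rng.integers(0, len(cold), size=(n_boot, len(cold)))
    lifts = rem[idx].mean(axis=1) - cold[idx].mean(axis=1)
    lo, hi = np.percentile(lifts, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return 100 * lo, 100 * hi  # in percentage points; p<0.05 iff the 95% CI excludes 0

# toy pass/fail vectors for 150 tasks -- NOT our real per-task data
rng = np.random.default_rng(7)
cold_runs = rng.random(150) < 0.30
rem_runs = rng.random(150) < 0.45
print(paired_bootstrap_ci(cold_runs, rem_runs))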
Every measurement runs inside a public container against the upstream evaluator. The full matrix, per-seed raw data, judge logs, and reproducibility commands live at benchlist.ai — kept there because the scoring authority should sit above any single vendor.
Paste the commands below into a terminal with Docker installed. They pull the official swebench/eval:4.1.0 container and run our prediction JSONL pinned to seed 42, the same artifacts that produced the +15.33pp number above.
# clone the harness + predictions
git clone https://github.com/remlabs-ai/swebench-rem-reproduction.git
cd swebench-rem-reproduction

# run the official evaluator against our REM-arm predictions (n=150, seed=42, Opus 4.7)
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-rem-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# cold baseline (same model, no REM context) for comparison
docker run --rm -v $(pwd):/work swebench/eval:4.1.0 \
  --pred /work/predictions/opus-cold-n150.jsonl \
  --eval swe-bench-lite \
  --seed 42 \
  --out /work/results

# expected: cold 30.00%, REM 45.33%, lift +15.33pp (CI [+9.33, +22.00], p<0.05)
The predictions JSONL is still being polished for public release; email dev@remlabs.ai for early access to the predictions file plus per-task eval logs. We'll send the same artifacts an arXiv reviewer would get.
SWE-bench measures how much REM helps a coding agent solve real GitHub issues; LongMemEval measures the underlying retrieval substrate (500 multi-session memory questions, ICLR 2025). We list it here as a supporting signal — co-leader on the public leaderboard, not the headline.
Five LongMemEval questions running against our production API right now — not a cached number, not a screenshot. Real API calls, real scores. Every run creates a fresh namespace, stores memories, then recalls them.
Methodology: store a fact via /v1/memory/set, then query via /v1/memory/search. A question passes if the recalled answer contains the expected key information. Same protocol as the published LongMemEval (ICLR 2025).
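A minimal sketch of that store-then-recall check. Only the two endpoint paths come from the protocol above; the base URL, bearer-token auth, and JSON field names are assumptions for illustration.

import os
import requests

BASE = "https://api.remlabs.ai"  # assumed base URL
HEADERS = {"Authorization": f"Bearer {os.environ['REM_API_KEY']}"}  # assumed auth scheme

# store one fact in a fresh namespace (JSON field names are assumptions)
requests.post(f"{BASE}/v1/memory/set", headers=HEADERS, json={
    "namespace": "longmemeval-demo",
    "content": "The user's dentist appointment moved to March 14.",
}).raise_for_status()

# recall it and apply the pass criterion from the methodology above
resp = requests.post(f"{BASE}/v1/memory/search", headers=HEADERS, json={
    "namespace": "longmemeval-demo",
    "query": "When is my dentist appointment?",
})
resp.raise_for_status()
answer = resp.json()["answer"]  # response shape is an assumption
print("PASS" if "March 14" in answer else "FAIL")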
Two modes. A: point the tool at any memory API and grade it on LongMemEval retrieval questions. B: plot your coding agent's week-over-week lift after 30 days of REM-accumulated build logs — same model, same repo, measurable delta.
500 multi-session memory questions across five categories: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and preference tracking. Published at ICLR 2025. We score against the byte-exact upstream GPT-4o judge.
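To make "byte-exact upstream judge" concrete, here is the shape of a judge call, assuming the OpenAI Python client. The template path is a placeholder for the verbatim prompt that ships with the LongMemEval release; the template's placeholder names and the yes/no output parsing are assumptions, not the upstream format.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# placeholder path: the real template ships with the LongMemEval release
JUDGE_TEMPLATE = open("longmemeval_judge_prompt.txt").read()

def judge(question: str, expected: str, model_answer: str) -> bool:
    # Fill the template and ask GPT-4o for a deterministic verdict.
    prompt = JUDGE_TEMPLATE.format(
        question=question, expected=expected, answer=model_answer
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    # yes/no parsing is an assumption about the template's output format
    return out.choices[0].message.content.strip().lower().startswith("yes")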
The LongMemEval results show that AI agent memory requires more than vector search. See how our AI memory API achieves these numbers.