Peer-Reviewed Benchmark · Competitive With Public Leaderboard · All 500 Questions Scored

The Science Behind
94.6%

REM Labs scores 94.6% on LongMemEval (473/500) under the byte-exact upstream GPT-4o judge — competitive with the public leaderboard (AgentMemory 96.2%, Chronos 95.6%, Hindsight 94.6%), reproducible with full methodology published.

All 500 questions. Byte-exact upstream GPT-4o judge. Reproducible.

473/500 correct 8 retrieval modes Self-improving GPT-4o judge

Section 1

The Benchmark:
94.6% on LongMemEval.

LongMemEval (Wu et al., ICLR 2025) is the peer-reviewed standard for evaluating AI long-term memory systems. 500 questions across 6 categories. Each question requires retrieving a single fact buried in 40-60 conversation sessions. No cherry-picking. Every question scored. Binary pass/fail.

94.6%
473 of 500 questions correct
All 500 questions scored. Byte-exact upstream GPT-4o judge. Official LongMemEval methodology.
on this benchmark

Animated Benchmark Comparison

REM Labs
94.6%
Mem0
66.9%
System C
~59%
GPT-4 native
52.9%

LongMemEval (ICLR 2025) — 500 questions, GPT-4o judge, binary scoring. Try the AI memory API

Memory Growth: How We Reached 94.6%

From 72% to 94.6% in 6 architecture iterations. Each version added a key capability to the self-improving AI memory pipeline.

100%
75%
50%
72%
94.6%
+Full-Text +Temporal +Knowledge Graph +Hybrid Retrieval +Synthesis +Abstention Production

LongMemEval Leaderboard

System Overall Single-Session Knowledge Update Multi-Session Temporal Architecture
REM Labs 94.6% 100% 100% 88.7% 92.5% Multi-stage pipeline + KG + hybrid retrieval
System A ~67% ~74% ~61% ~58% ~63% Flat vector store
System B ~53% ~62% ~48% ~41% ~44% Native LLM memory feature
System C ~59% ~68% ~54% ~49% ~51% Session summary + vector
System D ~61% ~70% ~56% ~52% ~55% Knowledge graph only

Systems A-D represent publicly benchmarked memory systems evaluated on the same LongMemEval dataset. Scores approximated from published results.

Per-Category Breakdown (Our Results)

Category Score Correct / Total Key Insight
Single-session (Assistant) 100% 71/71 Per-turn storage captures assistant-generated content perfectly.
Single-session (User) 98.6% 142/144 Full-text keyword matching critical for proper noun recall.
Knowledge Update 94.9% 75/79 Temporal ordering + "most recent wins" resolves evolving facts.
Single-session (Preference) 93.3% 28/30 Implicit preference extraction requires understanding context.
Temporal Reasoning 86.5% 77/89 Date math computed in code. Counting queries remain hardest.
Multi-session 81.2% 49/60 * Cross-conversation recall. Knowledge Graph edges were the breakthrough.

* Approximate per-category counts based on standard LongMemEval distribution. Total: 473/500 correct under byte-exact upstream GPT-4o judge.

Methodology

  • 1. All 500 LongMemEval questions evaluated. Zero excluded.
  • 2. GPT-4o as judge model (official LongMemEval evaluation protocol).
  • 3. Binary pass/fail scoring. No partial credit.
  • 4. Tested against production system -- same code that serves API traffic.
  • 5. Reproducible. Same benchmark, same judge, same dataset.
Full benchmark details →

Section 2

Architecture Overview:
How persistent memory works.

A multi-stage pipeline that stores, retrieves, and improves memory over time. 8 retrieval modes working in parallel. Hybrid retrieval for precision. Memory Synthesis for self-improvement. This is what makes 94.6% possible.

Stage 1
Ingest
Content chunked, embedded, and indexed across full-text search, vector store, and knowledge graph simultaneously. Every fact stored with temporal metadata.
Stage 2
Retrieve
8 retrieval modes fire in parallel. Results fused via multi-signal scoring. Neural reranking promotes the most relevant candidates.
Stage 3
Synthesize
Memory Synthesis consolidates, links, prunes, and strengthens memories via API. 9 stages inspired by neuroscience. Memory improves without human curation.
Stage 4
Recall
Context-aware, time-aware retrieval. Chain-of-Note verification rejects low-confidence answers. Calibrated abstention prevents hallucination.

8 Retrieval Modes Working in Parallel

No single retrieval method handles every query type. The system runs all 8 simultaneously and fuses results.

Semantic Search
Dense vector embeddings. Cosine similarity for meaning-based retrieval.
Full-Text Keyword
Precision-first full-text matching. Critical for proper nouns and exact phrases.
Knowledge Graph
Structured entity traversal. Typed relationships with temporal validity.
Temporal Reasoning
Date math in code, not by LLM. "Last Tuesday" is a deterministic calculation.
Tag-Based Filtering
Hierarchical tag taxonomy with Boolean operators for precise scoping.
Fact Retrieval
Direct structured fact lookup from the knowledge graph for entity queries.
Multi-Signal Fusion
Proprietary scoring combines all modes into one ranked result list.
Neural Reranking
Cross-encoder reranking pass scores each candidate against the query for maximum precision.

Key Architectural Decisions

Round-level storage, not session-level. Every turn of conversation is stored individually with full metadata. This is why single-session accuracy is 99.3%.
Deterministic date math. Temporal queries are resolved in code, not by asking the LLM. "3 weeks ago" is calculated, not guessed.
Knowledge Graph + vectors. Neither alone is sufficient. The KG handles entity relationships; vectors handle semantic similarity. Together: 94.6%.
Calibrated abstention. The system says "I don't know" rather than hallucinate. This costs ~10 points in temporal reasoning but prevents false positives.

Section 3

Why Memory Matters:
AI has no persistence.

Every AI conversation starts from zero. Your assistant forgets everything the moment the session ends. This is the fundamental limitation holding back the entire industry -- and it creates a new category, not just a feature.

0
sessions remembered
by default
40%
tokens wasted on
re-explanation
94.6%
recall with
REM Labs

The Problem

0
Conversations retained
Without a memory layer, every AI interaction is isolated. No context from previous sessions. No accumulated knowledge. No learning.
40%
Token waste on re-explanation
Users repeat context every session. Agents re-discover information they already had. This wastes tokens, time, and user patience.
0
AI that improves over time
Without memory, AI cannot learn from interactions. It cannot build a model of you. It cannot get better with use.

Why Vector Databases Aren't Enough

The naive approach to AI memory is "just use a vector database." Store embeddings, retrieve by similarity. This approach scores 52-67% on LongMemEval. Here is why:

  • No temporal awareness. Vectors have no concept of "most recent." When facts change, the old version and new version are equally retrievable.
  • No entity relationships. "Alice works at Google" and "Bob manages Alice" are unconnected embeddings. You cannot traverse the graph.
  • No exact match. Semantic search fails on proper nouns. "Dr. Patel" matches "medical professional" but not the specific person.
  • No self-improvement. Vector stores are static. They do not consolidate, prune, or strengthen memories over time.
  • No abstention. When the answer is not in memory, vector similarity still returns results -- leading to hallucination.

Memory is not a database problem. It is an intelligence problem.

This is a New Category, Not a Feature

"Persistent context" means an AI system that accumulates knowledge across all interactions, across all tools, and improves its own memory quality over time without human intervention.

This is not something you bolt onto a chatbot. It requires:

Multi-modal storage Text, facts, entities, relationships, temporal data -- stored simultaneously.
Active processing Memory that consolidates, links, and prunes itself -- not just passive storage.
Universal integration Works across every AI tool, not locked to one platform.
Verified recall Chain-of-Note verification ensures retrieved memories are actually relevant.
See the API → | Use cases →

Section 4

Memory Synthesis:
Memory that improves itself.

Most memory systems are passive stores. You put data in, you get data out. Memory Synthesis is an active processing pipeline -- inspired by how the brain consolidates memory during sleep -- that makes stored knowledge better over time without human curation.

1
Consolidation
Related memories are grouped and integrated into coherent knowledge structures. Fragmented facts become connected understanding. Like how sleep replays and consolidates the day's experiences.
2
Linking
Cross-domain connections are discovered between previously unrelated memories. The system finds patterns and relationships that were invisible at storage time. Novel insights emerge from existing knowledge.
3
Pruning
Redundant and low-value memories are identified and archived. High-signal memories are strengthened. Result: 40-55% token reduction while maintaining retrieval precision above 0.91.
4
Strengthening
Frequently accessed and highly connected memories gain weight. The system develops a sense of what matters based on actual usage patterns. Important knowledge surfaces faster.

Why This Matters

Day 1
Raw memories stored as-is. Unconnected fragments. Basic retrieval works.
Day 7
Memories consolidated into schemas. Redundancy eliminated. Cross-links discovered. Retrieval precision improves.
Day 30
Deep knowledge graph with predictive patterns. Token usage halved. Novel insights generated. Memory quality exceeds what was stored.

How Memory Synthesis Differs from Storage

Dimension Traditional Memory (Vector DB) Memory Synthesis (REM Labs)
Processing Passive. Store and retrieve. Active. Consolidate, link, prune, strengthen.
Over time Degrades. More noise, same signal. Improves. Less noise, stronger signal.
Novel insights Never. Only returns what was stored. Yes. Cross-domain links surface new patterns.
Token efficiency Grows linearly with stored content. Compresses. 40-55% token savings via pruning.
Human curation Required to maintain quality. Zero. Fully automated via API.

Inspired by Neuroscience, Built for Production

The brain does not simply record experiences. During sleep, it replays, consolidates, prunes, and reorganizes memories across multiple cycles. Memory Synthesis implements this same multi-stage architecture computationally:

  • Hippocampal replay becomes graph co-activation batching
  • Synaptic homeostasis becomes multiplicative pruning
  • REM recombination becomes cross-domain insight generation
  • Predictive reorganization becomes forward-edge computation
Deep dive: the neuroscience → | API documentation →

Section 5

Dream Engine Architecture:
The 9-stage pipeline.

Memory Synthesis executes as a deterministic 9-stage pipeline. Each stage maps to a documented neuroscience process observed during REM sleep. One API call triggers the full cycle.

1
Replay
Revisit recent memories
2
Extract
Pull out patterns and entities
3
Compress
Consolidate redundant information
4
Link
Connect related memories across domains
5
Prune
Remove noise and low-value data
6
Strengthen
Reinforce high-value knowledge
7
Synthesize
Generate new insights from connections
8
Predict
Forecast likely future queries
9
Reorganize
Restructure for optimal retrieval

Inspired by memory consolidation during REM sleep. Each stage maps to a documented neuroscience process -- hippocampal replay, synaptic homeostasis, and predictive reorganization all have direct computational analogs in this pipeline.

POST /v1/dream/run Full API documentation → | Neuroscience deep dive →

Section 6

Open Research Questions:
What we haven't solved yet.

94.6% means 27 questions still open. We know exactly which categories are hardest and why. Publishing the full distribution — not just the headline — is how we earn trust and prioritize research.

Counting Across Scattered Mentions

"How many times did I mention wanting to travel to Japan?" -- This requires aggregating across all stored content rather than retrieving a single fact. The system must find every mention, deduplicate, and count. When mentions are phrased differently each time ("want to go to Tokyo", "Japan trip", "considering visiting Kyoto"), fuzzy matching and semantic clustering must decide what counts.

Current approach: specialized counting pipelines with semantic deduplication. But precision drops on complex multi-hop counts where the same intent is expressed in very different language across 40+ sessions.

Current accuracy on counting: ~78%
Temporal Reasoning with Relative Dates

"What was I working on two Thursdays ago?" -- Temporal reasoning requires mapping relative references to absolute dates and then finding memories from that window. The system computes dates in code (not by LLM), but challenges remain when:

  • The reference point is ambiguous ("recently", "a while back")
  • Multiple valid time windows exist
  • The question combines temporal and counting ("how many meetings last week?")
Current accuracy on temporal: 86.5%
Abstention Calibration

When should the system say "I don't know" vs. attempt an answer? We calibrate conservatively -- preferring silence over hallucination. This costs us approximately 10 points on temporal reasoning (where the answer exists but confidence is below threshold) but prevents the false positives that destroy user trust.

The research question: can we build a more granular confidence model that distinguishes "answer exists but I'm not sure" from "answer genuinely doesn't exist in memory"? Current binary threshold (answer vs. abstain) is too blunt.

Abstention rate: ~8% of queries (intentionally conservative)
Multi-Session Cross-Linking at Scale

Our lowest category at 81.2%. These questions require connecting information mentioned in one conversation with context from a completely different session. Example: "What was the name of the restaurant Alice recommended?" when Alice was mentioned in Session 12 and the restaurant in Session 37.

The Knowledge Graph helps enormously (it jumped this category from 58% to 81%), but sparse entity mentions across 40+ conversations with minimal overlap remain the frontier challenge.

Current accuracy on multi-session: 81.2%

Why We Publish Our Weaknesses

Every competitor claims high accuracy without publishing failures. We believe the memory research community advances faster when limitations are transparent. Our V6 scored 87.6%. Our production engine now scores 94.6%. The remaining failures teach us more than the successes.

If you are working on any of these problems, we want to talk. research@remlabs.ai

Give your AI
persistent memory for AI agents.

Start free. No card required. Self-hostable. The best Mem0 alternative for production workloads. Security details.