Self-Improving AI Memory Research -- Top-tier on LongMemEval Benchmark

Section 1

The Benchmark:
94.6% on LongMemEval.

LongMemEval (Wu et al., ICLR 2025) is the peer-reviewed standard for evaluating AI long-term memory systems. 500 questions across 6 categories. Each question requires retrieving a single fact buried in 40-60 conversation sessions. No cherry-picking. Every question scored. Binary pass/fail.

94.6%

473 of 500 questions correct

All 500 questions scored. Byte-exact upstream GPT-4o judge. Official LongMemEval methodology.

on this benchmark

Animated Benchmark Comparison

REM Labs

94.6%

Mem0

66.9%

System C

~59%

GPT-4 native

52.9%

LongMemEval (ICLR 2025) — 500 questions, GPT-4o judge, binary scoring. Try the AI memory API

Memory Growth: How We Reached 94.6%

From 72% to 94.6% in 6 architecture iterations. Each version added a key capability to the self-improving AI memory pipeline.

100%

75%

50%

72%

94.6%

+Full-Text +Temporal +Knowledge Graph +Hybrid Retrieval +Synthesis +Abstention Production

LongMemEval Leaderboard

System	Overall	Single-Session	Knowledge Update	Multi-Session	Temporal	Architecture
REM Labs	94.6%	100%	100%	88.7%	92.5%	Multi-stage pipeline + KG + hybrid retrieval
System A	~67%	~74%	~61%	~58%	~63%	Flat vector store
System B	~53%	~62%	~48%	~41%	~44%	Native LLM memory feature
System C	~59%	~68%	~54%	~49%	~51%	Session summary + vector
System D	~61%	~70%	~56%	~52%	~55%	Knowledge graph only

Systems A-D represent publicly benchmarked memory systems evaluated on the same LongMemEval dataset. Scores approximated from published results.

Per-Category Breakdown (Our Results)

Category	Score	Correct / Total	Key Insight
Single-session (Assistant)	100%	71/71	Per-turn storage captures assistant-generated content perfectly.
Single-session (User)	98.6%	142/144	Full-text keyword matching critical for proper noun recall.
Knowledge Update	94.9%	75/79	Temporal ordering + "most recent wins" resolves evolving facts.
Single-session (Preference)	93.3%	28/30	Implicit preference extraction requires understanding context.
Temporal Reasoning	86.5%	77/89	Date math computed in code. Counting queries remain hardest.
Multi-session	81.2%	49/60 *	Cross-conversation recall. Knowledge Graph edges were the breakthrough.

* Approximate per-category counts based on standard LongMemEval distribution. Total: 473/500 correct under byte-exact upstream GPT-4o judge.

Methodology

1. All 500 LongMemEval questions evaluated. Zero excluded.
2. GPT-4o as judge model (official LongMemEval evaluation protocol).
3. Binary pass/fail scoring. No partial credit.
4. Tested against production system -- same code that serves API traffic.
5. Reproducible. Same benchmark, same judge, same dataset.

Full benchmark details →

Section 2

Architecture Overview:
How persistent memory works.

A multi-stage pipeline that stores, retrieves, and improves memory over time. 8 retrieval modes working in parallel. Hybrid retrieval for precision. Memory Synthesis for self-improvement. This is what makes 94.6% possible.

Stage 1

Ingest

Content chunked, embedded, and indexed across full-text search, vector store, and knowledge graph simultaneously. Every fact stored with temporal metadata.

→

Stage 2

Retrieve

8 retrieval modes fire in parallel. Results fused via multi-signal scoring. Neural reranking promotes the most relevant candidates.

→

Stage 3

Synthesize

Memory Synthesis consolidates, links, prunes, and strengthens memories via API. 9 stages inspired by neuroscience. Memory improves without human curation.

→

Stage 4

Recall

Context-aware, time-aware retrieval. Chain-of-Note verification rejects low-confidence answers. Calibrated abstention prevents hallucination.

8 Retrieval Modes Working in Parallel

No single retrieval method handles every query type. The system runs all 8 simultaneously and fuses results.

Semantic Search

Dense vector embeddings. Cosine similarity for meaning-based retrieval.

Full-Text Keyword

Precision-first full-text matching. Critical for proper nouns and exact phrases.

Knowledge Graph

Structured entity traversal. Typed relationships with temporal validity.

Temporal Reasoning

Date math in code, not by LLM. "Last Tuesday" is a deterministic calculation.

Tag-Based Filtering

Hierarchical tag taxonomy with Boolean operators for precise scoping.

Fact Retrieval

Direct structured fact lookup from the knowledge graph for entity queries.

Multi-Signal Fusion

Proprietary scoring combines all modes into one ranked result list.

Neural Reranking

Cross-encoder reranking pass scores each candidate against the query for maximum precision.

Key Architectural Decisions

Round-level storage, not session-level. Every turn of conversation is stored individually with full metadata. This is why single-session accuracy is 99.3%.

Deterministic date math. Temporal queries are resolved in code, not by asking the LLM. "3 weeks ago" is calculated, not guessed.

Knowledge Graph + vectors. Neither alone is sufficient. The KG handles entity relationships; vectors handle semantic similarity. Together: 94.6%.

Calibrated abstention. The system says "I don't know" rather than hallucinate. This costs ~10 points in temporal reasoning but prevents false positives.

Section 3

Why Memory Matters:
AI has no persistence.

Every AI conversation starts from zero. Your assistant forgets everything the moment the session ends. This is the fundamental limitation holding back the entire industry -- and it creates a new category, not just a feature.

0

sessions remembered
by default

40%

tokens wasted on
re-explanation

94.6%

recall with
REM Labs

The Problem

0

Conversations retained

Without a memory layer, every AI interaction is isolated. No context from previous sessions. No accumulated knowledge. No learning.

40%

Token waste on re-explanation

Users repeat context every session. Agents re-discover information they already had. This wastes tokens, time, and user patience.

0

AI that improves over time

Without memory, AI cannot learn from interactions. It cannot build a model of you. It cannot get better with use.

Why Vector Databases Aren't Enough

The naive approach to AI memory is "just use a vector database." Store embeddings, retrieve by similarity. This approach scores 52-67% on LongMemEval. Here is why:

No temporal awareness. Vectors have no concept of "most recent." When facts change, the old version and new version are equally retrievable.
No entity relationships. "Alice works at Google" and "Bob manages Alice" are unconnected embeddings. You cannot traverse the graph.
No exact match. Semantic search fails on proper nouns. "Dr. Patel" matches "medical professional" but not the specific person.
No self-improvement. Vector stores are static. They do not consolidate, prune, or strengthen memories over time.
No abstention. When the answer is not in memory, vector similarity still returns results -- leading to hallucination.

Memory is not a database problem. It is an intelligence problem.

This is a New Category, Not a Feature

"Persistent context" means an AI system that accumulates knowledge across all interactions, across all tools, and improves its own memory quality over time without human intervention.

This is not something you bolt onto a chatbot. It requires:

Multi-modal storage Text, facts, entities, relationships, temporal data -- stored simultaneously.

Active processing Memory that consolidates, links, and prunes itself -- not just passive storage.

Universal integration Works across every AI tool, not locked to one platform.

Verified recall Chain-of-Note verification ensures retrieved memories are actually relevant.

See the API → | Use cases →

Section 4

Memory Synthesis:
Memory that improves itself.

Most memory systems are passive stores. You put data in, you get data out. Memory Synthesis is an active processing pipeline -- inspired by how the brain consolidates memory during sleep -- that makes stored knowledge better over time without human curation.

1

Consolidation

Related memories are grouped and integrated into coherent knowledge structures. Fragmented facts become connected understanding. Like how sleep replays and consolidates the day's experiences.

2

Linking

Cross-domain connections are discovered between previously unrelated memories. The system finds patterns and relationships that were invisible at storage time. Novel insights emerge from existing knowledge.

3

Pruning

Redundant and low-value memories are identified and archived. High-signal memories are strengthened. Result: 40-55% token reduction while maintaining retrieval precision above 0.91.

4

Strengthening

Frequently accessed and highly connected memories gain weight. The system develops a sense of what matters based on actual usage patterns. Important knowledge surfaces faster.

Why This Matters

Day 1

Raw memories stored as-is. Unconnected fragments. Basic retrieval works.

Day 7

Memories consolidated into schemas. Redundancy eliminated. Cross-links discovered. Retrieval precision improves.

Day 30

Deep knowledge graph with predictive patterns. Token usage halved. Novel insights generated. Memory quality exceeds what was stored.

How Memory Synthesis Differs from Storage

Dimension	Traditional Memory (Vector DB)	Memory Synthesis (REM Labs)
Processing	Passive. Store and retrieve.	Active. Consolidate, link, prune, strengthen.
Over time	Degrades. More noise, same signal.	Improves. Less noise, stronger signal.
Novel insights	Never. Only returns what was stored.	Yes. Cross-domain links surface new patterns.
Token efficiency	Grows linearly with stored content.	Compresses. 40-55% token savings via pruning.
Human curation	Required to maintain quality.	Zero. Fully automated via API.

Inspired by Neuroscience, Built for Production

The brain does not simply record experiences. During sleep, it replays, consolidates, prunes, and reorganizes memories across multiple cycles. Memory Synthesis implements this same multi-stage architecture computationally:

Hippocampal replay becomes graph co-activation batching
Synaptic homeostasis becomes multiplicative pruning
REM recombination becomes cross-domain insight generation
Predictive reorganization becomes forward-edge computation

Deep dive: the neuroscience → | API documentation →

Section 5

Dream Engine Architecture:
The 9-stage pipeline.

Memory Synthesis executes as a deterministic 9-stage pipeline. Each stage maps to a documented neuroscience process observed during REM sleep. One API call triggers the full cycle.

1

Replay

Revisit recent memories

2

Extract

Pull out patterns and entities

3

Compress

Consolidate redundant information

4

Link

Connect related memories across domains

5

Prune

Remove noise and low-value data

6

Strengthen

Reinforce high-value knowledge

7

Synthesize

Generate new insights from connections

8

Predict

Forecast likely future queries

9

Reorganize

Restructure for optimal retrieval

Inspired by memory consolidation during REM sleep. Each stage maps to a documented neuroscience process -- hippocampal replay, synaptic homeostasis, and predictive reorganization all have direct computational analogs in this pipeline.

POST /v1/dream/run Full API documentation → | Neuroscience deep dive →

Section 6

Open Research Questions:
What we haven't solved yet.

94.6% means 27 questions still open. We know exactly which categories are hardest and why. Publishing the full distribution — not just the headline — is how we earn trust and prioritize research.

Counting Across Scattered Mentions

"How many times did I mention wanting to travel to Japan?" -- This requires aggregating across all stored content rather than retrieving a single fact. The system must find every mention, deduplicate, and count. When mentions are phrased differently each time ("want to go to Tokyo", "Japan trip", "considering visiting Kyoto"), fuzzy matching and semantic clustering must decide what counts.

Current approach: specialized counting pipelines with semantic deduplication. But precision drops on complex multi-hop counts where the same intent is expressed in very different language across 40+ sessions.

Current accuracy on counting: ~78%

Temporal Reasoning with Relative Dates

"What was I working on two Thursdays ago?" -- Temporal reasoning requires mapping relative references to absolute dates and then finding memories from that window. The system computes dates in code (not by LLM), but challenges remain when:

The reference point is ambiguous ("recently", "a while back")
Multiple valid time windows exist
The question combines temporal and counting ("how many meetings last week?")

Current accuracy on temporal: 86.5%

Abstention Calibration

When should the system say "I don't know" vs. attempt an answer? We calibrate conservatively -- preferring silence over hallucination. This costs us approximately 10 points on temporal reasoning (where the answer exists but confidence is below threshold) but prevents the false positives that destroy user trust.

The research question: can we build a more granular confidence model that distinguishes "answer exists but I'm not sure" from "answer genuinely doesn't exist in memory"? Current binary threshold (answer vs. abstain) is too blunt.

Abstention rate: ~8% of queries (intentionally conservative)

Multi-Session Cross-Linking at Scale

Our lowest category at 81.2%. These questions require connecting information mentioned in one conversation with context from a completely different session. Example: "What was the name of the restaurant Alice recommended?" when Alice was mentioned in Session 12 and the restaurant in Session 37.

The Knowledge Graph helps enormously (it jumped this category from 58% to 81%), but sparse entity mentions across 40+ conversations with minimal overlap remain the frontier challenge.

Current accuracy on multi-session: 81.2%

Why We Publish Our Weaknesses

Every competitor claims high accuracy without publishing failures. We believe the memory research community advances faster when limitations are transparent. Our V6 scored 87.6%. Our production engine now scores 94.6%. The remaining failures teach us more than the successes.

If you are working on any of these problems, we want to talk. research@remlabs.ai

The Science Behind
94.6%

The Benchmark:
94.6% on LongMemEval.

Animated Benchmark Comparison

Memory Growth: How We Reached 94.6%

LongMemEval Leaderboard

Per-Category Breakdown (Our Results)

Methodology

Architecture Overview:
How persistent memory works.

8 Retrieval Modes Working in Parallel

Key Architectural Decisions

Why Memory Matters:
AI has no persistence.

The Problem

Why Vector Databases Aren't Enough

This is a New Category, Not a Feature

Memory Synthesis:
Memory that improves itself.

Why This Matters

How Memory Synthesis Differs from Storage

Inspired by Neuroscience, Built for Production

Dream Engine Architecture:
The 9-stage pipeline.

Open Research Questions:
What we haven't solved yet.

Why We Publish Our Weaknesses

Give your AI
persistent memory for AI agents.

The Science Behind94.6%

The Benchmark:94.6% on LongMemEval.

Animated Benchmark Comparison

Memory Growth: How We Reached 94.6%

LongMemEval Leaderboard

Per-Category Breakdown (Our Results)

Methodology

Architecture Overview:How persistent memory works.

8 Retrieval Modes Working in Parallel

Key Architectural Decisions

Why Memory Matters:AI has no persistence.

The Problem

Why Vector Databases Aren't Enough

This is a New Category, Not a Feature

Memory Synthesis:Memory that improves itself.

Why This Matters

How Memory Synthesis Differs from Storage

Inspired by Neuroscience, Built for Production

Dream Engine Architecture:The 9-stage pipeline.

Open Research Questions:What we haven't solved yet.

Why We Publish Our Weaknesses

Give your AIpersistent memory for AI agents.

The Science Behind
94.6%

The Benchmark:
94.6% on LongMemEval.

Architecture Overview:
How persistent memory works.

Why Memory Matters:
AI has no persistence.

Memory Synthesis:
Memory that improves itself.

Dream Engine Architecture:
The 9-stage pipeline.

Open Research Questions:
What we haven't solved yet.

Give your AI
persistent memory for AI agents.