$δ$-mem: Efficient Online Memory for Large Language Models

Jingdi Lei, Di Zhang, Junxian Li, Weida Wang, Kaixuan Fan, Xiang Liu, Qihan Liu, Xiaoteng Ma

May 12, 2026

arXiv:2605.12357v1

cs.AI (primary)

#194 of 2292 · Artificial Intelligence

Tournament Score: 1520 ± 47 (scale: 1050 to 1800)

Win Rate: 80%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $δ$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $δ$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation during generation. With only an $8\times8$ online memory state, $δ$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$δ$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: δ-mem: Efficient Online Memory for Large Language Models

1. Core Contribution

δ-mem proposes a memory mechanism that augments frozen full-attention LLM backbones with a compact online state of associative memory (OSAM). The key idea is threefold: (1) compress historical information into a fixed-size r×r state matrix (where r=8) updated via a gated delta-rule learning procedure; (2) read from this state using current-input queries to produce associative memory signals; and (3) transform these signals into low-rank corrections applied to the backbone's query and output attention components. This creates a dynamic, history-dependent steering mechanism that differs fundamentally from static adapters like LoRA—though the projection matrices are fixed after training, the corrections vary because they depend on the evolving state.
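The write, read, and correct steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`write`, `read`, `corrected_query`), the projection matrices (`W_k`, `W_v`, `W_q`, `U_q`), the scalar write strength `beta`, and the row-wise form of the retention `gate` are all hypothetical. The assessment mentions α=16 without stating its role; treating it as a LoRA-style `alpha / r` scale is likewise an assumption. Only the query correction is shown; the output correction would follow the same pattern.

```python
import numpy as np

r, d = 8, 32  # r = 8 is the state rank from the paper; d is an arbitrary toy hidden size
rng = np.random.default_rng(0)

# Hypothetical projections, fixed after training (random here for illustration).
W_k = rng.normal(size=(r, d)) / np.sqrt(d)   # hidden state -> memory key
W_v = rng.normal(size=(r, d)) / np.sqrt(d)   # hidden state -> memory value
W_q = rng.normal(size=(r, d)) / np.sqrt(d)   # hidden state -> memory read query
U_q = rng.normal(size=(d, r)) / np.sqrt(r)   # memory readout -> query correction


def write(S, h, beta=0.5, gate=None):
    """One gated delta-rule write of hidden vector h into the r x r state S."""
    k = W_k @ h
    k = k / (np.linalg.norm(k) + 1e-8)       # unit-norm key keeps the update stable
    v = W_v @ h
    gate = np.ones(r) if gate is None else gate
    S = S * gate[:, None]                    # dimension-wise retention (assumed form)
    err = v - S @ k                          # what the state currently gets wrong for k
    return S + beta * np.outer(err, k)       # erase the old association, write the new one


def read(S, h):
    """Associative readout of the state with a query derived from the current input."""
    return S @ (W_q @ h)


def corrected_query(q, S, h, alpha=16.0):
    """Add a low-rank, history-dependent correction to a backbone query vector."""
    return q + (alpha / r) * (U_q @ read(S, h))


S = np.zeros((r, r))
history = rng.normal(size=(5, d))            # stand-in for past hidden states
for h in history:
    S = write(S, h)

h_now = rng.normal(size=d)
q = rng.normal(size=d)
q_new = corrected_query(q, S, h_now)
print(S.shape, q_new.shape)                  # (8, 8) (32,)
```

The projections never change at inference time, yet `q_new` differs from `q` by an amount that depends on everything written into `S`, which is the sense in which the correction is dynamic while a LoRA-style adapter is static. With an empty state the correction is exactly zero.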

The paper situates its contribution well within a taxonomy of memory mechanisms: textual (RAG, compression), outside-channel (external modules), and parametric (LoRA, prefix-tuning). δ-mem occupies a unique position by maintaining an online, continuously evolving state that directly participates in forward computation without requiring context window expansion, backbone fine-tuning, or external retrieval infrastructure.

2. Methodological Rigor

The approach is grounded in well-understood principles. The delta-rule update is a classical associative memory formulation recast as online SGD on a regression loss, enhanced with dimension-wise gating inspired by recent linear attention/retention designs. The mathematical formulation is clean and the decomposition of the update into retention, erasure, and write terms (Eq. 11-12) provides clear interpretability.
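For reference, the textbook delta rule can be obtained as a single SGD step on a per-token regression loss, which makes the three-term decomposition explicit. The paper's Eq. 11-12 are not reproduced in this assessment, so the notation below ($S_t$, $k_t$, $v_t$, step size $\beta_t$, retention gate $\gamma_t$) is assumed, and the gate is written as a scalar for simplicity even though the paper's gating is described as dimension-wise.

```latex
\begin{align}
\mathcal{L}_t(S) &= \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^2,
\qquad
\nabla_S \mathcal{L}_t(S) = (S k_t - v_t)\,k_t^{\top} \\[4pt]
S_t &= S_{t-1} - \beta_t\,\nabla_S \mathcal{L}_t(S_{t-1})
     = S_{t-1} - \beta_t\,S_{t-1} k_t k_t^{\top} + \beta_t\,v_t k_t^{\top} \\[4pt]
S_t &= \underbrace{\gamma_t\,S_{t-1}}_{\text{retention}}
     \;-\; \underbrace{\beta_t\,\gamma_t\,S_{t-1} k_t k_t^{\top}}_{\text{erasure}}
     \;+\; \underbrace{\beta_t\,v_t k_t^{\top}}_{\text{write}}
     \qquad \text{(gated variant, assumed form)}
\end{align}
```

With a unit-norm key and $\beta_t = 1$, the updated state satisfies $S_t k_t = v_t$ exactly: the old value stored under $k_t$ is fully erased and replaced.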

The experimental evaluation covers both memory-heavy benchmarks (HotpotQA, LoCoMo, MemoryAgentBench) and general capability benchmarks (IFEval, GPQA-Diamond), providing a balanced assessment. The comparison against six baselines spanning all three memory paradigms is reasonably comprehensive. The ablation studies are well-designed: the heads ablation (Table 3) reveals that query+output correction is the sweet spot; the insertion depth ablation (Table 4) shows middle layers are most effective for partial injection; and the context recovery experiment (Figure 2) demonstrates the state retains useful information even without explicit context.

However, several methodological concerns deserve mention. The training set is small (2,219 samples from QASPER), and the paper does not investigate how training data diversity or scale affects generalization. The choices of r=8 and α=16 appear somewhat arbitrary: the paper shows that they work, but a systematic hyperparameter sensitivity analysis would strengthen claims about the mechanism's robustness. Additionally, the evaluation relies on a single primary backbone (Qwen3-4B) for the main comparison, with Qwen3-8B and SmolLM3-3B serving as supplementary experiments.

3. Potential Impact

The practical implications are significant. Long-running LLM agents and assistants are a rapidly growing deployment paradigm, and the inability to efficiently accumulate and leverage historical context is a genuine bottleneck. δ-mem's combination of an extremely compact state (8×8 = 64 state entries per layer), negligible memory overhead (Figure 3b), and compatibility with frozen backbones makes it immediately deployable in production systems.

The 1.31× improvement on MemoryAgentBench and the near-doubling of TTL (Test-Time Learning) scores from 26.14 to 50.50 are particularly noteworthy, suggesting the mechanism is especially effective for tasks requiring online adaptation. The preservation of general capabilities (IFEval scores remain stable or slightly improve) addresses a common concern with memory augmentation approaches.
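To make "near-doubling" concrete, the TTL figures quoted above can be checked directly. The variable names are illustrative only; the two scores are the ones reported in this assessment.

```python
# TTL (Test-Time Learning) scores quoted in the assessment.
baseline_ttl = 26.14
delta_mem_ttl = 50.50

ratio = delta_mem_ttl / baseline_ttl   # relative improvement
gain = delta_mem_ttl - baseline_ttl    # absolute improvement in points

print(f"{ratio:.2f}x, +{gain:.2f} points")  # 1.93x, +24.36 points
```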

The conceptual contribution—that a tiny associative memory state coupled with attention correction can substitute for context window expansion—could influence how the community thinks about memory in LLMs, potentially reducing the emphasis on ever-longer context windows.

4. Timeliness & Relevance

This work is highly timely. The deployment of LLM-based agents (OpenAI Codex, Claude Code, etc.) has made persistent memory a first-class engineering problem. The paper explicitly references these systems and targets the gap between single-turn reasoning capability and multi-turn memory requirements. The concurrent development of linear attention and state-space models with recurrent states (Titans, Mamba, etc.) provides a theoretical backdrop, but δ-mem's approach of retrofitting existing full-attention models is more practically relevant given the massive investment in existing Transformer infrastructure.

5. Strengths & Limitations

Key Strengths:

Extreme compactness: an 8×8 state matrix achieving meaningful memory improvements is a striking result that challenges assumptions about required memory capacity

Clean separation of concerns: the frozen backbone handles reasoning while δ-mem handles memory evolution

Three writing granularities (TSW, SSW, MSW) provide practical flexibility for different deployment scenarios

The context recovery experiment (Section 5.1) is a compelling demonstration that the state genuinely encodes historical information

Very low parameter overhead (0.12% of backbone) compared to alternatives like MLP Memory (76.40%)

Notable Weaknesses:

Training on only 2,219 QASPER samples raises questions about generalization breadth; the mechanism is tested on specific benchmark types but real-world memory demands are far more diverse

The paper does not address how δ-mem handles contradictory or evolving information over very long horizons (hundreds or thousands of interactions)

No analysis of what the 8×8 state actually encodes—interpretability of the learned associative memory is absent

The decoding speed penalty (visible in Figure 3a) is acknowledged but not quantified precisely in terms of wall-clock impact at scale

Comparison with recent strong memory systems like Mem0 (which is cited but not benchmarked against directly) and Titans would strengthen the positioning

The paper lacks analysis of failure modes—when does the compact state become a bottleneck?

6. Additional Observations

The connection between δ-mem and the broader literature on fast weight programmers and modern linear attention mechanisms is underexplored. The delta rule update is essentially a fast weight programmer, and acknowledging this lineage more explicitly would provide deeper theoretical grounding. The scalability to larger models (70B+) and longer interaction histories (thousands of turns) remains untested but is critical for the claimed use case.

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 13, 2026

Comparison History (20)

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to broader cross-domain relevance and timeliness: efficient long-term memory for LLMs/agents is a central, widely applicable problem across NLP, agentic systems, and deployment. The proposed online delta-rule associative state coupled to attention is a novel, lightweight mechanism that can be adopted in many LLM settings without retraining. Paper 2 appears strong and impactful within materials generation, but its scope is more domain-specific. Given current momentum in LLM efficiency and memory, Paper 1 has higher expected breadth and uptake.

vs. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination

gemini-3.1 · 5/16/2026

Paper 1 provides fundamental mechanistic insights into LLM hallucinations and memory conflicts using attractor geometry. It addresses a critical, widely studied problem in AI reliability and safety, offering a novel detection metric and revealing scaling laws. Paper 2, while presenting an efficient architectural improvement for memory, focuses more on engineering optimization and has narrower theoretical implications compared to the deep interpretability contributions of Paper 1.

vs. MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

claude-opus-4.6 · 5/16/2026

MemQ introduces a more theoretically novel framework by formalizing memory management as an Exogenous-Context MDP and applying TD(λ) eligibility traces over provenance DAGs—a principled integration of reinforcement learning with memory systems that opens new research directions. It addresses a fundamental limitation (ignoring dependency chains in memory) with a well-grounded formalism, demonstrates broad impact across six diverse benchmarks, and provides principled parameter selection guidance. While δ-mem offers practical engineering value with its compact memory state, MemQ's theoretical contributions and broader applicability across agent paradigms suggest higher long-term scientific impact.

vs. Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

gpt-5.2 · 5/16/2026

Paper 2 ($δ$-mem) likely has higher impact due to broader applicability and timeliness: efficient long-term memory for LLM assistants/agents is a core bottleneck across many domains, and its lightweight, frozen-backbone online update mechanism is broadly deployable without retraining or context extension. The method is simple, scalable, and could influence architectures and systems work widely. Paper 1 addresses an important reliability issue in RL-for-reasoning, but is more specialized to planner–executor setups and trace-supervision paradigms, with narrower cross-field reach.

vs. Revealing Interpretable Failure Modes of VLMs

gemini-3.1 · 5/16/2026

Paper 1 addresses a fundamental and pervasive bottleneck in LLMs—long-term memory and context limits—by introducing a highly efficient, lightweight architectural augmentation. Its approach of using delta-rule learning for low-rank attention corrections without requiring full fine-tuning has the potential for widespread adoption across numerous AI applications. While Paper 2 presents a valuable safety evaluation framework for VLMs, Paper 1 introduces a core capability enhancement that will likely drive broader methodological advancements and impact a wider range of downstream AI systems.

vs. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

gpt-5.2 · 5/16/2026

Paper 2 likely has higher impact because it introduces a broadly applicable evaluation framework and metric for a previously under-measured capability (prospective metacognitive control) directly tied to real-world agent deployment under budgets. Its scope spans multiple domains and model families, creating a common benchmark that can shape future research and system design. Paper 1 is a clever, practical memory mechanism with clear utility, but its impact is narrower (architecture/efficiency for LLM memory) and depends on adoption within specific model pipelines, whereas TRIAGE can influence evaluation standards across the field.

vs. Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis

claude-opus-4.6 · 5/16/2026

Paper 1 introduces a fundamentally novel paradigm for self-improving language models—shifting from synthetic data generation to environment construction with verifiable reward signals. The concept of solve-verify asymmetry as the key property for sustained self-improvement is a deep theoretical insight with broad implications for RL-based LLM training. While Paper 2 presents a useful engineering contribution (compact memory augmentation), it is more incremental, addressing a well-studied problem with a specific mechanism. Paper 1's framework has greater potential to reshape how the field thinks about scalable self-improvement, giving it higher long-term scientific impact.

vs. SkillEvolver: Skill Learning as a Meta-Skill

claude-opus-4.6 · 5/16/2026

δ-mem addresses a fundamental and broadly applicable challenge in LLM systems—efficient long-term memory without context expansion or fine-tuning. Its lightweight, architecture-level contribution (compact associative memory with delta-rule learning coupled to attention) has broad applicability across all LLM-based systems, including agents, assistants, and dialogue. SkillEvolver is innovative but operates in a narrower niche (skill learning for agents) and relies on prompt/code refinement rather than introducing a new architectural primitive. δ-mem's theoretical grounding in associative memory and attention modification gives it wider methodological influence across the deep learning community.

vs. BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

claude-opus-4.6 · 5/16/2026

δ-mem addresses a fundamental and broadly applicable challenge in LLM research—efficient long-term memory—with a novel, lightweight mechanism that works with frozen backbones. Its potential impact spans all LLM applications (agents, assistants, long-context tasks) across many fields. BenchCAD, while rigorous and valuable, serves a narrower community (CAD/engineering AI) as a benchmark contribution. δ-mem's architectural innovation (compact associative memory with delta-rule learning coupled to attention) offers a more generalizable contribution with wider adoption potential.

vs. The Evaluation Differential: When Frontier AI Models Recognise They Are Being Tested

gpt-5.2 · 5/16/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, lightweight mechanism for online long-term memory in LLMs with clear empirical gains on multiple benchmarks, strong real-world applicability (assistants/agents), and straightforward integration with existing frozen backbones. Its methodological contribution is implementable and measurable, enabling follow-up work across model efficiency, continual learning, and agent systems. Paper 1 is timely and important for AI safety/governance, but is more conceptual/audit-focused with less direct technical validation; its impact may be significant but narrower and slower to translate into widely adopted methods.

vs. Rethinking AI Hardware: A Three-Layer Cognitive Architecture for Autonomous Agents

gpt-5.2 · 5/16/2026

Paper 1 offers a concrete, technically novel mechanism (delta-rule online associative memory producing low-rank attention corrections) that plugs into existing frozen LLMs with minimal state, addressing a timely bottleneck (long-term memory without context expansion). It is likely to be broadly reusable across models, tasks, and deployments, and is experimentally grounded on recognized memory-heavy benchmarks with clear ablations implied by baseline comparisons. Paper 2 is potentially impactful but reads more like a systems/architecture proposal with synthetic-task evaluation and less demonstrated methodological rigor or generalizability to real deployments.

vs. AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

gemini-3.1 · 5/13/2026

Paper 1 proposes a novel architectural solution to a critical bottleneck in LLMs (long-term memory and context scaling costs) via a lightweight, delta-rule based associative memory. This technical innovation has direct, high-impact applications in developing efficient AI agents. While Paper 2 provides a valuable and timely meta-analysis of AI safety benchmarks, Paper 1's algorithmic contribution offers foundational performance improvements that are likely to broadly influence core LLM development and deployment across multiple domains.

vs. Revisiting Privacy Preservation in Brain-Computer Interfaces: Conceptual Boundaries, Risk Pathways, and a Protection-Strength Grading Framework

gemini-3.1 · 5/13/2026

Paper 2 addresses a critical bottleneck in the highly active field of Large Language Models: efficient long-term memory. Its technical contribution—a lightweight, fixed-size online memory state that avoids expensive context expansion or full fine-tuning—offers immediate, measurable performance gains and cost reductions for AI agents. While Paper 1 provides a timely and important conceptual framework for BCI privacy, Paper 2's concrete algorithmic innovation has broader, more immediate real-world applicability and scalability across the rapidly expanding AI landscape, leading to higher short- to medium-term scientific and practical impact.

vs. STAR: Failure-Aware Markovian Routing for Multi-Agent Spatiotemporal Reasoning

claude-opus-4.6 · 5/13/2026

δ-mem addresses a fundamental and broadly applicable challenge—efficient long-term memory for LLMs—with a simple, elegant mechanism (compact associative memory state with delta-rule learning). Its lightweight design (8×8 state matrix), compatibility with frozen backbones, and strong empirical gains make it highly practical and widely adoptable across LLM applications. Paper 2 tackles a narrower problem (spatiotemporal multi-agent routing) with a more specialized framework. While methodologically sound, its impact is limited to compositional reasoning pipelines. δ-mem's broader applicability to assistants, agents, and general LLM deployment gives it significantly higher potential impact.

vs. BadSKP: Backdoor Attacks on Knowledge Graph-Enhanced LLMs with Soft Prompts

gpt-5.2 · 5/13/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: efficient long-term memory for LLM assistants/agents is a central bottleneck, and δ-mem proposes a lightweight, general mechanism that works with a frozen backbone and tiny state, making deployment plausible across many systems. Methodologically it offers a clear, scalable architectural contribution (online delta-rule memory + low-rank attention corrections) with benchmarked gains while preserving generality. Paper 1 is novel and important for security of KG-enhanced LLMs, but its scope is narrower (specific KG soft-prompt architecture, adversarial setting).

vs. On the Limitations of Large Language Models for Conceptual Database Modeling

gemini-3.1 · 5/13/2026

Paper 2 proposes a novel, lightweight memory mechanism for LLMs, addressing a critical bottleneck (context length limitations) in AI research. Its architectural innovation has broad applications across various LLM-based agents and systems, promising significant performance gains without extensive fine-tuning. In contrast, Paper 1 is a domain-specific evaluation of existing LLM capabilities for database modeling. Therefore, Paper 2 offers higher novelty, broader applicability across the AI field, and greater potential to influence future foundational model designs.

vs. Agentic AI platforms for autonomous training and rule induction of human-human and virus-human protein-protein interactions

gemini-3.1 · 5/13/2026

While Paper 1 presents an innovative application of AI agents to bioinformatics, Paper 2 addresses a fundamental bottleneck in foundation models: long-term memory and context window efficiency. By introducing a lightweight, plug-and-play memory mechanism that works with frozen backbones, Paper 2 offers broad utility across the entire AI ecosystem. Its impact spans multiple domains by enabling more efficient LLMs, making its potential scientific and practical impact significantly higher and more far-reaching.

vs. LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters

claude-opus-4.6 · 5/13/2026

Paper 1 addresses a fundamental challenge in LLM architecture—efficient long-term memory—with a novel, generalizable mechanism (delta-rule associative memory as low-rank attention corrections). Its compact design (8×8 state matrix) and strong empirical gains across multiple benchmarks suggest broad applicability across many LLM applications. Paper 2, while valuable as a practical deployment case study in legal AI, is more domain-specific and applies existing techniques (RAG/CAG) rather than introducing new methods. Paper 1's architectural innovation has greater potential for widespread adoption and cross-domain impact.

vs. Random-Set Graph Neural Networks

gpt-5.2 · 5/13/2026

Paper 2 likely has higher impact: it addresses a central, timely bottleneck for LLM agents (efficient long-term memory) with a simple, broadly applicable mechanism that plugs into frozen full-attention models and shows strong gains on multiple memory-heavy benchmarks while preserving general capabilities. Its real-world applicability to assistants/agents is immediate and cross-cuts NLP, systems, and continual learning. Paper 1 is novel and relevant for GNN uncertainty, but its impact is narrower (graph tasks) and the belief-function/random-set formalism may see slower adoption despite solid empirical evaluation.

vs. E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

gpt-5.2 · 5/13/2026

Paper 1 is more novel and broadly impactful: it proposes a lightweight, principled online associative memory (delta-rule) that directly modulates attention via low-rank corrections, improving long-context behavior without finetuning or context expansion. This architectural mechanism is likely reusable across many LLM backbones and settings (agents, assistants, continual interaction), making applications wide and timely. Paper 2 improves a training paradigm for tool-integrated reasoning with better data/ROI, but is more domain-specific and may be more sensitive to task/setup choices, limiting breadth and long-term generality compared to a general memory mechanism.
