Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Ali Hatamizadeh, Yejin Choi, Jan Kautz

May 21, 2026

arXiv:2605.22791v1

cs.AI (primary)

#202 of 2292 · Artificial Intelligence

Tournament Score

1518 ± 50 (scale: 1050 to 1800)

Win Rate: 89%

Rating

6.8 / 10

Significance 6.5

Rigor 8

Novelty 6

Clarity 8.5

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Gated DeltaNet-2

1. Core Contribution

Gated DeltaNet-2 addresses a specific structural limitation in delta-rule linear attention models: the use of a single scalar gate β_t to simultaneously control both how much old content is erased (key-side) and how much new content is written (value-side). The paper introduces two independent channel-wise gates — an erase gate b_t ∈ [0,1]^{d_k} operating on the key axis and a write gate w_t ∈ [0,1]^{d_v} operating on the value axis. This is a clean generalization that subsumes both Gated DeltaNet (scalar decay, scalar delta gate) and KDA (channel-wise decay, scalar delta gate) as special cases through gate tying.
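The gate tying described above can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the paper's exact update or its Triton kernels: the state layout `S` of shape `(d_k, d_v)`, the function names `decoupled_delta_step` and `tied_scalar_step`, and the order of decay, erase, and write are all hypothetical choices made here so that the special-case reduction can be checked numerically.

```python
import numpy as np


def decoupled_delta_step(S, k, v, alpha, b, w):
    """One recurrent step with separate erase and write gates (illustrative).

    Assumed shapes (not taken from the paper):
      S     : (d_k, d_v) state matrix mapping keys to values
      k, v  : (d_k,), (d_v,) current key and value
      alpha : (d_k,) channel-wise decay in [0, 1]
      b     : (d_k,) erase gate in [0, 1], acts on the key axis
      w     : (d_v,) write gate in [0, 1], acts on the value axis
    """
    S = alpha[:, None] * S          # forget: decay each key channel
    read = S.T @ k                  # what the decayed memory returns for k
    S = S - np.outer(b * k, read)   # erase: subtract the key-gated read
    S = S + np.outer(k, w * v)      # write: commit the value-gated content
    return S


def tied_scalar_step(S, k, v, alpha, beta):
    """Scalar-gated baseline: a single beta scales both erase and write."""
    S = alpha[:, None] * S
    return S + beta * np.outer(k, v - S.T @ k)


# Small demo with arbitrary sizes and random inputs.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S0 = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)              # unit-norm key
v = rng.normal(size=d_v)
alpha = rng.uniform(0.8, 1.0, size=d_k)
beta = 0.5

# Tying both gates to one scalar recovers the scalar-gated update.
tied = decoupled_delta_step(S0, k, v, alpha,
                            np.full(d_k, beta), np.full(d_v, beta))
max_gap = np.abs(tied - tied_scalar_step(S0, k, v, alpha, beta)).max()
```

In this sketch, setting `b` and `w` to the same scalar reproduces the channel-wise-decay, scalar-gate case, and additionally making `alpha` constant across channels gives the scalar-decay case, mirroring the reductions to KDA and Gated DeltaNet that the assessment describes. With untied gates, the model can for example erase an association fully (`b = 1`) while writing nothing (`w = 0`), which a single scalar cannot express.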

The conceptual insight is well-motivated: erasing and writing act on different axes of the state matrix (key vs. value dimensions), and tying them to one scalar unnecessarily constrains the model's ability to manage its compressed memory. The paper also provides a fast-weight interpretation (Eq. 13) showing the update as the solution to a local online optimization problem, which gives theoretical grounding to the design choice.
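For orientation, the fast-weight view is easiest to see in the classic scalar-gated delta rule, which is one gradient step on a local regression loss. The block below is a sketch of that well-known baseline case only; it is not the paper's Eq. 13, and the state layout $S \in \mathbb{R}^{d_k \times d_v}$ and the symbol $\mathcal{L}_t$ are assumptions made here for illustration.

```latex
\begin{aligned}
\mathcal{L}_t(S) &= \tfrac{1}{2}\,\bigl\lVert S^{\top} k_t - v_t \bigr\rVert^2,
\qquad
\nabla_S \mathcal{L}_t(S) = k_t \bigl(S^{\top} k_t - v_t\bigr)^{\top}, \\
S_t &= S_{t-1} - \beta_t \,\nabla_S \mathcal{L}_t(S_{t-1})
     = \bigl(I - \beta_t\, k_t k_t^{\top}\bigr) S_{t-1} + \beta_t\, k_t v_t^{\top}.
\end{aligned}
```

Here the single step size $\beta_t$ multiplies both the erase term $k_t k_t^{\top} S_{t-1}$ and the write term $k_t v_t^{\top}$, which is the scalar tie that the decoupled gates $b_t$ and $w_t$ are meant to remove.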

2. Methodological Rigor

The paper demonstrates strong mathematical rigor. The chunkwise parallel training algorithm is carefully derived, showing how channel-wise decay can be absorbed into asymmetric erase factors to preserve the WY representation structure. The decay-normalized recurrence (Eq. 19) is the key technical insight enabling this — by absorbing cumulative decay into the rank-one factors, the algorithm retains the same triangular-solve-plus-dense-matmul structure as KDA.

The full backward derivation is provided in Appendix B, with explicit identification of why scalar post-scaling (valid for KDA) breaks for the decoupled gates (Appendix B.5). Numerical verification details (Appendix D) demonstrate fp64 gradient agreement to machine precision.

The experimental setup is reasonably controlled: all models share 1.3B parameters, identical training recipes (100B FineWeb-Edu tokens, same optimizer, same batch size), and matched recurrent state sizes. The comparison set is comprehensive and current, including Mamba-2, Gated DeltaNet, KDA, Mamba-3 (SISO and MIMO), and Transformers.

However, some limitations in rigor should be noted: (1) only one model scale (1.3B) is evaluated, leaving scaling behavior unclear; (2) training on only 100B tokens is modest by current standards; (3) the improvements on some benchmarks are within noise margins (e.g., some commonsense tasks show mixed results across methods).

3. Potential Impact

Direct impact on efficient sequence models: The paper makes an incremental but practically useful contribution to the delta-rule family of linear attention models. The decoupled gate mechanism is conceptually simple, adds minimal overhead (~5% throughput decrease vs. KDA), and provides consistent improvements across multiple evaluation axes.

Long-context retrieval: The most compelling results are on RULER needle-in-a-haystack tasks (Table 3), where Gated DeltaNet-2 shows clear advantages in multi-key retrieval, the hardest setting for fixed-state recurrence. At 4K context length with recurrent-only models, it achieves 37.8% on MK-NIAH-1 vs. 28.0% for KDA and 27.8% for Gated DeltaNet, a gain of roughly 10 points over both. On S-NIAH-2 at 4K, it reaches 93.0% vs. 89.0% for KDA and 87.2% for Gated DeltaNet.

Code availability: Open-source Triton kernels are provided, which lowers the adoption barrier significantly.

Broader influence: The paper's contribution is primarily within the linear attention/efficient transformer subfield. It does not introduce a fundamentally new paradigm but rather refines the gating mechanism. The impact is likely to be absorbed into future iterations of delta-rule models rather than spawning independent research directions.

4. Timeliness & Relevance

The paper is highly timely. It enters a very active research area where Mamba-2, Gated DeltaNet, KDA, and Mamba-3 have been published in rapid succession (2024-2026). The competition among these approaches is fierce, and the improvements shown here, while modest in absolute terms, address a well-identified bottleneck. The need for efficient long-context models with strong retrieval capabilities is a pressing concern for LLM deployment.

5. Strengths & Limitations

Strengths:

Clean mathematical formulation with elegant special-case reductions to KDA and Gated DeltaNet

Thorough derivation of both forward and backward chunkwise algorithms

Informative ablations (Table 5) showing that the erase gate contributes more than the write gate, providing mechanistic insight

Practical efficiency: near-flat throughput scaling with sequence length, minimal overhead vs. KDA

The fast-weight interpretation (Table 1) provides a unified theoretical lens across all compared methods

Limitations:

Single-scale evaluation (1.3B only) — unclear if benefits persist or amplify at larger scales

Improvements on language modeling and commonsense are modest (e.g., ~0.5-1 point average improvement in Table 2)

The expanded erase range [0,2] ablation shows no benefit, suggesting the theoretical motivation from negative eigenvalues doesn't translate to practical gains at this scale

No evaluation on generation quality tasks, code generation, or instruction following

The hybrid results sometimes show smaller gaps than the recurrent-only results, suggesting that sliding-window attention (SWA) partially compensates for the scalar gate limitation, which reduces the practical urgency of the fix

Training data (FineWeb-Edu) is a subset of web text filtered for educational content; generalization to broader pretraining mixes is untested

Overall Assessment

Gated DeltaNet-2 is a well-executed incremental contribution to delta-rule linear attention. The insight of decoupling erase and write is clean and mathematically grounded, the engineering is thorough, and the results are consistently positive across evaluations. The paper's strongest case is on long-context retrieval, where the architectural change most directly addresses interference in compressed memory. However, the improvements are modest in many settings, and single-scale evaluation limits confidence in the generality of the findings.

Rating: 6.8 / 10

Significance 6.5 · Rigor 8 · Novelty 6 · Clarity 8.5

Generated May 22, 2026

Comparison History (18)

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

claude-opus-4.6 · 5/22/2026

Gated DeltaNet-2 introduces a fundamental architectural improvement to linear attention mechanisms by decoupling erase and write gates, with strong empirical results across multiple benchmarks. This addresses a core limitation in efficient sequence modeling, has broad applicability beyond any single domain, and advances the frontier of efficient transformer alternatives—a highly active research area. Paper 1, while useful, is primarily a benchmark contribution for a specific application domain (finance spreadsheets) with findings that current agents fall short, offering less methodological novelty and narrower impact.

vs. Latent-space Attacks for Refusal Evasion in Language Models

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it proposes a broadly useful architectural improvement for efficient long-context modeling (decoupled erase/write gating in linear attention), provides algorithmic/optimization contributions (chunkwise WY, gate-aware backward pass), and reports large-scale training with consistent gains across multiple benchmarks and settings, enabling real-world deployment benefits (constant-memory decoding, long-context retrieval). Paper 1 is insightful and timely for safety research, but its primary contribution is an attack/analysis on refusal mechanisms with narrower positive application scope and potentially shorter-lived impact as defenses evolve.

vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

claude-opus-4.6 · 5/22/2026

Gated DeltaNet-2 introduces a fundamental architectural improvement to linear attention mechanisms by decoupling erase and write operations, with strong empirical results across multiple benchmarks at 1.3B scale. This addresses a core limitation in efficient sequence modeling—a rapidly growing field with broad impact on LLM efficiency, long-context modeling, and inference cost. Paper 2 contributes a valuable evaluation metric for VLM explainability, but its scope is narrower (benchmarking/evaluation rather than model architecture), and its impact is primarily within the XAI subcommunity rather than the broader deep learning ecosystem.

vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it advances core sequence modeling architecture by decoupling erase/write in linear attention, provides new algorithms (chunkwise WY, gate-aware backward) enabling efficient training, and demonstrates strong scaling and broad benchmark gains, especially for long-context retrieval—highly timely and broadly applicable across NLP and systems. Paper 1 addresses an important safety issue in multi-agent KV-sharing, but its impact is narrower (specific to latent KV communication setups) and depends on adoption of KV-sharing agents; it also hinges on a particular leakage definition (reconstruction) that may not cover all threat models.

vs. AMEL: Accumulated Message Effects on LLM Judgments

claude-opus-4.6 · 5/22/2026

Gated DeltaNet-2 introduces a fundamental architectural innovation in linear attention mechanisms—decoupling erase and write gates—with strong empirical results across language modeling, reasoning, and retrieval benchmarks. It generalizes multiple existing architectures (Gated DeltaNet, KDA), provides theoretical grounding (fast-weight update view, chunkwise algorithm), and addresses a core challenge in efficient sequence modeling. Its impact spans architecture design, efficient inference, and long-context modeling. Paper 1 identifies an interesting LLM evaluation bias (AMEL) but is more narrowly scoped to evaluation practices with a straightforward mitigation (fresh context per item), limiting its broader scientific impact.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact: it introduces a novel architectural refinement (decoupled channel-wise erase/write gates) in linear attention/fast-weight memory with clear algorithmic and training-system contributions (chunkwise WY, gate-aware backward), and demonstrates strong results at large scale (1.3B, 100B tokens) on broad benchmarks including long-context retrieval. This advances core sequence modeling infrastructure with wide applicability across LLMs and efficient attention alternatives. Paper 2 is timely and useful for agentic test-time scaling, but is more workflow/protocol-level and may be more sensitive to tooling/model specifics, limiting foundational breadth.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact: it proposes a concrete architectural innovation in linear attention (decoupled channel-wise erase/write gates), provides theoretical/algorithmic contributions enabling efficient training, and demonstrates strong large-scale empirical gains (1.3B params, 100B tokens) across diverse tasks, especially long-context retrieval—an area of high current relevance. Its advances can generalize broadly to sequence modeling systems and hardware-efficient inference. Paper 1 is timely and valuable for evaluation of emotional intelligence, but its impact may be narrower (benchmark/dataset-focused) and more sensitive to annotation scale and domain specificity.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact: it proposes a broadly applicable architectural advance in linear attention by decoupling erase/write with channel-wise gates, includes algorithmic/optimization contributions (chunkwise WY, efficient backward), and demonstrates competitive results at scale (1.3B/100B tokens) on diverse tasks, especially long-context retrieval—highly timely and relevant. This can influence future model designs beyond a specific inference regime. Paper 1 is impactful for Tree-of-Thought inference efficiency, but is more specialized (KV-cache management for tree search) and likely narrower in cross-field adoption.

vs. Correct Is Not Enough: Training Reasoning Planners with Executor-Grounded Rewards

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical and highly timely challenge in LLM reasoning: reward hacking and unfaithful reasoning traces in RL training. By introducing executor-grounded rewards, it ensures reasoning is both high-quality and practically useful for the consuming model. This has broad, immediate applications across agentic AI, complex problem solving, and verifiable reasoning frameworks, potentially offering a wider real-world impact than the architectural improvements to linear attention presented in Paper 1.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

claude-opus-4.6 · 5/22/2026

Paper 1 presents a concrete architectural innovation in linear attention mechanisms with rigorous methodology, open-source code, and strong empirical results across multiple benchmarks. It advances a highly active research area (efficient sequence modeling) with broad implications for scaling language models. Paper 2 is a qualitative interview study (n=24) at a single company, offering useful but incremental observations about AI's workplace impact. Paper 1's technical contribution is more likely to be widely cited, built upon, and adopted by the research community.

vs. Optimal Experiments for Partial Causal Effect Identification

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical bottleneck in large language models by improving linear attention mechanisms. Its novel decoupling of erase and write operations yields state-of-the-art results among sub-quadratic models. Given the ubiquitous use of LLMs, improvements in sequence modeling efficiency and long-context reasoning offer immediate, high-impact real-world applications and broad adoption potential across the AI community, edging out Paper 2's theoretical contributions to causal inference.

vs. Optimal Experiments for Partial Causal Effect Identification

claude-opus-4.6 · 5/22/2026

Gated DeltaNet-2 addresses a fundamental limitation in linear attention mechanisms—the coupled erase/write gates—with a clean architectural solution that generalizes prior work (Gated DeltaNet, KDA). It demonstrates strong empirical results at scale (1.3B params, 100B tokens) across multiple benchmarks, particularly on long-context retrieval. The efficient attention/sequence modeling space is extremely active and high-impact, with broad applications across NLP and beyond. Paper 1, while rigorous and novel in causal inference experiment design, addresses a more specialized problem with a smaller immediate audience and fewer downstream applications.

vs. Mind the Sim-to-Real Gap & Think Like a Scientist

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a concrete architectural improvement (Gated DeltaNet-2) to linear attention mechanisms with strong empirical results across multiple benchmarks, including language modeling, reasoning, and retrieval at 1.3B scale. The efficient attention/linear RNN space is extremely active and high-impact, and improvements here have immediate broad applicability across NLP and beyond. Paper 2 addresses an important but more niche theoretical problem at the intersection of simulation and experimentation in sequential decision-making. While rigorous, its narrower scope and more specialized audience limit its breadth of impact compared to advances in foundational transformer-alternative architectures.

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

gpt-5.2 · 5/22/2026

Paper 1 has higher likely scientific impact: it introduces a concrete, technically novel architectural improvement to linear attention (decoupled channel-wise erase/write gates), provides efficient training/inference algorithms, and demonstrates strong empirical gains at scale (1.3B, 100B tokens) on multiple established benchmarks, especially long-context retrieval—an area of high current relevance. Its method is broadly applicable to foundation-model design and can influence subsequent model architectures. Paper 2 is valuable for AI education and accountability, but its impact is more domain-specific, with smaller-scale evidence and less direct methodological generalization to core ML systems.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it proposes a principled architectural improvement to linear attention by decoupling erase/write with channel-wise gates, unifying prior DeltaNet/KDA variants with clear theoretical derivations and efficient training algorithms. It is broadly applicable as an alternative to softmax attention for long-context and constant-memory decoding, with strong evidence at scale (1.3B, 100B tokens) across diverse benchmarks and open code—suggesting robustness and adoption potential. Paper 1 targets KV-cache eviction for existing transformers, valuable but narrower and more heuristic, with less demonstrated breadth and rigor.

vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it introduces a broadly applicable architectural innovation for linear attention by decoupling erase/write with channel-wise gates, provides derivations and efficient training algorithms, and demonstrates strong empirical gains at scale (1.3B, 100B tokens) across multiple benchmarks, especially long-context retrieval—highly timely for foundation models. Paper 1 is interesting for multimodal political-speech analysis but is based on a small single-speech case study, relies on proprietary LLM judgments, and has narrower domain impact and weaker methodological generality.

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.6 · 5/22/2026

Paper 2 presents a foundation model for wearable health pretrained on unprecedented scale (1 trillion minutes, 5 million participants), addressing a fundamental challenge in digital health. Its breadth of impact spans 35 health prediction tasks across multiple domains, introduces novel LLM agent-based architecture search, and demonstrates clinical validation. While Paper 1 is a solid incremental improvement in linear attention mechanisms, Paper 2 opens new paradigms in personalized health AI with far broader real-world applications across healthcare, potentially impacting millions of lives.

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gpt-5.2 · 5/22/2026

Paper 1 offers a concrete algorithmic advance in sequence modeling (decoupled erase/write gating for delta-rule linear attention) with derived efficient training/inference machinery and strong large-scale empirical validation (1.3B params, 100B tokens) across diverse benchmarks, especially long-context retrieval—an area of broad, timely interest. Its methodological rigor and potential to influence architectures across NLP and beyond are high. Paper 2 is innovative for autonomous agents and has clear practical relevance, but impact may be narrower (systems/engineering-specific), with less evidence of generalizable scientific principles or extensive evaluation.