Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer
Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.
This paper provides the first systematic head-to-head comparison of three leading subquadratic sequence architectures—xLSTM, Mamba-2, and Gated DeltaNet—across domains with complex structured dependencies (code and time series), complemented by a unified mathematical framework that explains observed performance differences. The paper's central claim is that xLSTM's advantage stems from two primitive capabilities: accumulation (counting-like operations over unbounded lengths) and finite-state tracking, enabled by its architectural separation of matrix-state linear attention (mLSTM/xLSTM[1:0]) from nonlinear recurrence (sLSTM/xLSTM[0:1]).
The contribution is structured as an "applications-to-principles" pipeline: first establish empirical differences on practical tasks, then derive a unified formulation that predicts which synthetic primitives should differentiate the architectures, and finally validate on controlled tasks. This structure is methodologically appealing and provides more explanatory depth than a pure benchmark paper.
Strengths in experimental design: The paper evaluates across three distinct experimental paradigms (from-scratch pre-training, distillation, TSFM pre-training), which strengthens the generalizability claim. The time-series experiments include a five-point parameter scaling sweep (1M–80M), and the distillation experiments carefully control for teacher, data, initialization, and optimization recipe. The synthetic experiments use length generalization (4× and 16× extrapolation) as a discriminative test.
Weaknesses in experimental design:
1. Scale limitations: Code language modeling is conducted only at 400M parameters. The paper acknowledges this, but given that architectural differences often diminish with scale (as hinted by the models converging at 80M in the TSFM sweep), the practical relevance of the findings at larger scales remains uncertain.
2. Single teacher in distillation: Only Qwen3-4B-Instruct is used as teacher, limiting generalizability claims about distillation.
3. Mamba-2 exclusion from distillation: Mamba-2 is excluded from the distillation comparison due to architectural incompatibility with the plug-in protocol. While justified, this weakens the three-way comparison in one of three experimental settings.
4. Max-over-seeds reporting: Reporting the maximum over 5 seeds on synthetic tasks (rather than mean ± std) inflates apparent performance and obscures reliability.
5. Small margins on reasoning benchmarks: The paper honestly reports that reasoning/commonsense margins are often <0.5 points, raising questions about whether the differences are statistically meaningful.
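To make the seed-reporting concern in item 4 concrete, here is a minimal sketch of how a max-over-seeds summary can reverse the ranking given by mean ± std. The per-seed scores and the model names are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import statistics

# Hypothetical per-seed accuracies (illustrative numbers only,
# NOT taken from the paper under review).
seed_scores = {
    "model_a": [0.99, 0.62, 0.58, 0.60, 0.61],  # one lucky seed
    "model_b": [0.90, 0.88, 0.91, 0.89, 0.87],  # consistently good
}

def summarize(scores):
    """Return (max, mean, sample standard deviation) for a list of scores."""
    return max(scores), statistics.mean(scores), statistics.stdev(scores)

summary = {name: summarize(scores) for name, scores in seed_scores.items()}

for name, (best, mean, std) in summary.items():
    print(f"{name}: max={best:.2f}  mean={mean:.3f} ± {std:.3f}")
```

Under these assumed numbers, model_a looks better by maximum while model_b is clearly better and more reliable by mean and standard deviation, which is the pattern a max-only report would hide.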
Practical impact: The comparison is timely and directly relevant to practitioners designing hybrid foundation models. The finding that xLSTM consistently outperforms on code and time-series tasks—where structured dependencies are critical—provides actionable guidance for architecture selection. The distillation results are particularly useful given the growing interest in linearizing existing Transformers.
Theoretical impact: The unified formulation expressing all three architectures in terms of input/forget gates, overwriting mechanisms, and state dynamics is a useful conceptual contribution. The key insight—that Mamba-2's tied gates limit accumulation (analogous to GRU limitations), while Gated DeltaNet's explicit overwriting interferes with counting—provides testable hypotheses. However, the framework is more of a notational unification than a deep theoretical result; the formal analysis doesn't prove impossibility results or provide new complexity-theoretic bounds beyond citing existing work (Merrill et al., 2024; Grazzi et al., 2025).
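The accumulation argument can be illustrated with a toy scalar recurrence. The sketch below is only an illustration of the general tied-gate intuition: it is not the paper's unified formulation, nor the actual Mamba-2 or Gated DeltaNet parameterization, and the function names and gate values are hypothetical.

```python
def run_independent_gates(xs, forget=1.0, inp=1.0):
    """c_t = f * c_{t-1} + i * x_t, with f and i chosen independently."""
    c = 0.0
    for x in xs:
        c = forget * c + inp * x
    return c

def run_tied_gates(xs, forget=0.9):
    """c_t = f * c_{t-1} + (1 - f) * x_t: input gate tied to the forget gate."""
    c = 0.0
    for x in xs:
        c = forget * c + (1.0 - forget) * x
    return c

def run_delta_rule(xs, beta=0.5):
    """Scalar delta-style write c_t = c_{t-1} + beta * (x_t - c_{t-1}).

    With a unit key this is the same convex update as the tied-gate case,
    i.e. the new value partially overwrites the old one.
    """
    c = 0.0
    for x in xs:
        c = c + beta * (x - c)
    return c

for length in (10, 100, 1000):
    ones = [1.0] * length
    print(length,
          run_independent_gates(ones),       # grows with length (a counter)
          round(run_tied_gates(ones), 4),    # stays bounded by the input range
          round(run_delta_rule(ones), 4))    # also bounded
```

With independent gates, setting both to 1 turns the state into an exact counter. With tied gates, each update is a convex combination of the old state and the input, so the state never leaves the input range; retaining everything (forget gate at 1) means writing nothing.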
Broader influence: This work could influence the design of hybrid models (e.g., the ratio of mLSTM to sLSTM layers) and inform which subquadratic operator to use in emerging hybrid architectures like Nemotron, Kimi Linear, and OLMo Hybrid.
The paper addresses a critical current question: as hybrid architectures become mainstream (with major labs adopting them), which subquadratic operator should be preferred? The timing is excellent—the referenced hybrid models (Samba, Nemotron Nano, Kimi Linear, OLMo Hybrid) are all from 2025-2026, and the paper directly informs ongoing design decisions. The focus on code and time series rather than yet another English language modeling comparison is a welcome departure.
Strengths:
1. Multi-domain evaluation: Testing across code pre-training, distillation, and time-series provides robust evidence that the findings aren't domain-specific.
2. Principled explanation: The unified formulation connects empirical results to architectural mechanisms, moving beyond pure benchmarking.
3. Synthetic validation: The counting and state-tracking experiments cleanly isolate the predicted capabilities and confirm the framework's predictions.
4. Practical relevance: The xLSTM[m:s] ratio analysis (e.g., xLSTM[7:1] for code, xLSTM[3:1] for time series) provides concrete architectural guidance.
5. Honest reporting: The paper acknowledges when margins are small and when competing methods have advantages (e.g., Mamba-2 at 80M CRPS).
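For readers unfamiliar with the synthetic setup referred to in item 3, the following sketch shows what a length-generalization split for the two primitives might look like. It is an assumption-based illustration: parity stands in for finite-state tracking and counting ones stands in for accumulation, and the training length, sample counts, and helper names are hypothetical rather than taken from the paper.

```python
import random

def parity_target(bits):
    """Finite-state tracking: a two-state automaton that flips on every 1."""
    state = 0
    for b in bits:
        state ^= b
    return state

def count_target(bits):
    """Accumulation: total number of 1s, unbounded in the sequence length."""
    return sum(bits)

def make_split(length, n_examples, rng):
    """Sample random bit strings of a fixed length together with both targets."""
    examples = []
    for _ in range(n_examples):
        bits = [rng.randint(0, 1) for _ in range(length)]
        examples.append((bits, parity_target(bits), count_target(bits)))
    return examples

TRAIN_LEN = 32  # hypothetical training length, not a value from the paper
rng = random.Random(0)

# 1x is the in-distribution length; 4x and 16x are the extrapolation lengths.
splits = {factor: make_split(TRAIN_LEN * factor, 4, rng) for factor in (1, 4, 16)}

for factor, examples in splits.items():
    bits, parity, count = examples[0]
    print(f"{factor}x: length={len(bits)} parity={parity} count={count}")
```

A model would be trained only on the 1x split and scored on the 4x and 16x splits; the counting target grows with length while the parity target stays in a fixed two-value set, which is why the two tasks probe different capabilities.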
Weaknesses:
1. Conflict of interest: Several authors are affiliated with the group that developed xLSTM (Hochreiter's lab at JKU/NXAI). While the experiments appear fairly designed, external replication would strengthen credibility.
2. Limited scale: 400M parameters for code LMs is far below the frontier; performance orderings may not hold at 7B+.
3. Narrow operator set: Only three architectures are compared. RWKV, RetNet, Griffin/Hawk, and other competitive subquadratic designs are excluded.
4. No wall-clock or memory comparisons: The paper compares model quality but doesn't report training/inference efficiency metrics, which are central to the subquadratic motivation.
5. Incremental theory: The unified framework, while useful for intuition, doesn't yield new formal results; the theoretical contribution is incremental over existing analyses.
6. Small synthetic models: The synthetic tasks, while clean, use very small models (2 layers, 128 hidden), and it's unclear how these findings translate to large-scale practical models.
This is a well-structured empirical comparison paper with a useful theoretical framing. Its primary value is in providing the first matched comparison of three leading subquadratic architectures on challenging structured tasks, with a principled explanation for the observed differences. The evidence consistently favors xLSTM, though the margins are sometimes small and the scale is limited. The unified formulation is a helpful organizational contribution rather than a deep theoretical advance. The paper would benefit from larger-scale experiments and efficiency comparisons, but as published, it provides timely, actionable insights for the hybrid architecture design community.
Generated Jun 11, 2026
Paper 1 addresses a fundamental bottleneck in AI: the quadratic computational cost of Transformers. By systematically evaluating and theorizing why specific subquadratic architectures like xLSTM succeed in complex tasks, it directly informs the design of next-generation foundation models. This architectural advancement has massive breadth of impact across sequence modeling domains. While Paper 2 offers a highly rigorous and valuable benchmark for evaluating LLM hallucinations in citation contexts, Paper 1 provides foundational algorithmic principles that are more likely to shape the future trajectory of core deep learning architectures.
Paper 2 addresses a critical inefficiency in the dominant paradigm of Chain-of-Thought reasoning. By identifying the 'commitment boundary' and proving that subsequent reasoning steps are epiphenomenal, it offers massive real-world computational savings (up to 55%) for deploying large reasoning models. Paper 1 provides valuable comparisons of alternative subquadratic architectures, but Paper 2's deep interpretability insights and immediate practical utility for scaling inference compute give it broader and more timely scientific impact.
Paper 2 addresses the fundamental and highly active question of efficient alternatives to quadratic attention in Transformers, comparing leading subquadratic architectures (xLSTM, Mamba-2, Gated DeltaNet) across diverse practical tasks and providing principled theoretical analysis. This has broader impact across NLP, time-series, and efficient ML. Paper 1, while technically sound, addresses a narrower optimization of MoE router design. Paper 2's breadth of applications, timeliness given the efficiency scaling crisis, and potential to guide future architecture design give it higher impact potential.
Paper 2 addresses the fundamental and highly active problem of efficient sequence modeling alternatives to Transformers, comparing leading subquadratic architectures (xLSTM, Mamba-2, Gated DeltaNet) with both empirical evaluation and principled analysis. This topic has enormous breadth of impact across NLP, time-series, and code modeling. The unified formulation and mechanistic analysis of why xLSTM succeeds provides actionable insights for architecture design. Paper 1, while creative in applying INRs to behavioral data, addresses a more niche problem (unsupervised policy representation) with narrower applicability and incremental methodological contribution.
Paper 1 addresses a fundamental question in deep learning architecture design—understanding why certain subquadratic architectures outperform others—with broad implications across sequence modeling (code, time-series, language). Its unified theoretical framework explaining state tracking and memory dynamics provides generalizable principles that can guide future architecture development. Paper 2, while practically useful, is a narrowly scoped engineering contribution focused on quantizing a specific model for specific hardware, with limited generalizability beyond its immediate context.
Paper 2 introduces a clear, novel diagnosis (head–backbone competition) and a simple, scalable design principle (Backbone-as-Architect) plus an extremely lightweight adaptive mechanism (CLP) that achieves measurable inference speedups with zero quality loss across multiple model sizes. This targets a major, timely bottleneck—LLM inference efficiency—with immediate real-world applicability and broad impact across deployment, systems, and model design. Paper 1 is valuable and rigorous but is primarily a comparative/analytical study of existing subquadratic architectures with more incremental innovation and less direct, near-term deployment leverage.
Paper 1 addresses a fundamental bottleneck in modern AI—the quadratic scaling of Transformers—by evaluating and theoretically unifying subquadratic alternatives like xLSTM and Mamba-2. Its findings on state tracking and memory dynamics have broad implications across multiple domains, including NLP, code generation, and time-series analysis. Paper 2 offers a valuable but more niche application of explainability for efficient ECG training. Because Paper 1 tackles a core architectural challenge with field-wide relevance and high timeliness, it possesses significantly higher potential scientific impact.
Paper 1 addresses a fundamental and timely problem in sequence modeling—understanding why certain subquadratic architectures outperform others—with broad implications for LLMs, code models, and time-series foundation models. Its unified theoretical framework and principled analysis of memory dynamics provide actionable insights for architecture design across multiple domains. Paper 2 presents an interesting observation about RL-induced gradient disruption for adversarial robustness, but its practical impact is more limited: RL training for classifiers is computationally expensive, and the adversarial robustness community has well-established defenses. Paper 1's breadth of impact and architectural insights give it higher potential.
Paper 2 introduces a more novel algorithmic change (CPPO) that addresses a widely used RLHF/RLVR bottleneck (token-level trust regions) with a principled link to policy-improvement bounds and clear empirical gains in stability and reasoning across scales. Its real-world applicability is immediate for LLM post-training pipelines and broadly relevant across NLP and RL. Paper 1 is timely and useful, but is primarily a comparative/diagnostic study among existing subquadratic architectures with narrower impact and less standalone methodological novelty.
Paper 2 addresses a fundamental and timely question in deep learning—understanding which subquadratic architectures best replace Transformers—with broad implications for LLMs, time-series modeling, and efficient AI. Its unified theoretical framework comparing xLSTM, Mamba-2, and Gated DeltaNet, combined with principled analysis of memory dynamics and state tracking, provides foundational insights for architecture design. Paper 1 offers a solid incremental contribution to dataset pruning with class-aware sampling, but its scope is narrower (classification pruning) and impact more specialized compared to Paper 2's influence on the widely studied efficient sequence modeling paradigm.