Forget Attention: Importance-Aware Attention Is All You Need
Suhyeong Shin, Yeongwook Yang
Abstract
Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M parameters / 5B tokens, SISA reaches 17.3% on LAMBADA-greedy (vs. 13.9% for the Transformer and 15.5% for Mamba-3) and attains 100% on NIAH from step 1K, 7x faster than the Transformer's retrieval convergence; at 369M parameters, Mamba-3 leads on LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
AI Impact Assessments
Scientific Impact Assessment: "Forget Attention: Importance-Aware Attention Is All You Need"
1. Core Contribution
SISA proposes "score-level fusion" — injecting an SSM-derived importance bias directly into the attention score computation, as opposed to the block-level (Jamba, Samba) or head-level (Hymba, Falcon-H1) fusion strategies used by existing hybrids. The key insight is that by deriving cumulative decay and data-dependent rotation terms from Mamba-3's mathematical framework, and then concatenating these channels onto Q and K vectors, the entire operation reduces to a single standard SDPA call on augmented Q/K. This means no recurrent state, no custom CUDA kernel, and full compatibility with FlashAttention.
The mathematical realization is elegant: the additive score bias λ·C̄ᵢᵀB̄ⱼ decomposes into an inner product that can be absorbed into augmented Q/K vectors via a scaling constant s = d_h^{1/4}√λ. This is a clean algebraic insight that makes the method practically deployable.
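The absorption step can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single head, a causal mask, and a softmax scale held at 1/√d_h (rather than 1/√(d_h + d_s)) after augmentation. The arrays `c_bar` and `b_bar` are random placeholders standing in for the SSM-derived C̄ and B̄ channels, and names such as `causal_attention`, `d_s`, and `lam` are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, scale, bias=None):
    """Reference single-head softmax attention. q, k: (T, d); v: (T, d_v)."""
    scores = (q @ k.T) * scale
    if bias is not None:
        scores = scores + bias  # additive score bias, applied after scaling
    T = q.shape[0]
    mask = np.tril(np.ones((T, T), dtype=bool))  # position i attends to j <= i
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

rng = np.random.default_rng(0)
T, d_h, d_s, lam = 6, 16, 4, 0.5  # hypothetical sizes and bias weight

q, k, v = (rng.standard_normal((T, d_h)) for _ in range(3))
# Placeholders for the SSM-derived importance channels (assumption: random here).
c_bar = rng.standard_normal((T, d_s))
b_bar = rng.standard_normal((T, d_s))

# Path 1: explicit additive bias lam * C_i^T B_j on top of q.k / sqrt(d_h).
bias = lam * (c_bar @ b_bar.T)
out_bias = causal_attention(q, k, v, scale=d_h ** -0.5, bias=bias)

# Path 2: fold the bias into augmented Q/K with s = d_h^(1/4) * sqrt(lam),
# so that s^2 / sqrt(d_h) = lam and no separate bias tensor is needed.
s = d_h ** 0.25 * np.sqrt(lam)
q_aug = np.concatenate([q, s * c_bar], axis=-1)  # (T, d_h + d_s)
k_aug = np.concatenate([k, s * b_bar], axis=-1)  # (T, d_h + d_s)
out_aug = causal_attention(q_aug, k_aug, v, scale=d_h ** -0.5)

max_abs_diff = np.abs(out_bias - out_aug).max()
```

The two paths agree because (q·k + s²·C̄ᵢᵀB̄ⱼ)/√d_h = q·k/√d_h + λ·C̄ᵢᵀB̄ⱼ when s² = λ√d_h. In a framework SDPA call the same identity would require passing the 1/√d_h scale explicitly, since the default scale is usually derived from the augmented width.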
2. Methodological Rigor
Strengths in experimental design:
Weaknesses:
3. Potential Impact
The conceptual contribution — identifying score-level fusion as a third design axis — is the paper's most valuable offering. Even if the specific instantiation (SISA) doesn't scale perfectly, the idea that SSM signals can directly modulate attention scores opens a design space that others can explore. The FlashAttention compatibility is a genuine practical advantage over Mamba variants that require custom kernels.
However, the impact may be limited by:
4. Timeliness & Relevance
The paper addresses a genuinely active research question: how to combine attention and SSMs effectively. The hybrid architecture space is competitive (Jamba, Hymba, Falcon-H1, Nemotron-H all appeared in 2024-2025), and offering a new integration paradigm is timely. The connection to FoX (forgetting transformer) is well-drawn — SISA generalizes scalar decay biases to vector-valued, data-dependent biases.
The emphasis on stock SDPA compatibility is practically relevant given the fragmentation of custom kernel requirements across SSM variants.
5. Strengths & Limitations
Key Strengths:
1. Clean mathematical formulation: The augmented Q/K construction is simple, principled, and implementable in a few lines of code
2. NIAH convergence speed: 100% at step 1K (7× faster than Transformer) is a striking result suggesting genuine architectural benefit for retrieval
3. Honest reporting: The authors transparently disclose where Mamba-3 wins (369M LAMBADA), protocol differences, and statistical overlap at 369M
4. Comprehensive ablation: The d_s study across scales provides useful guidance
Notable Limitations:
1. Scale ceiling: 369M is too small to draw architectural conclusions for modern LLMs. The observation that Mamba-3 overtakes SISA at 369M raises the question of whether SISA's advantages diminish further at scale.
2. Narrow evaluation: Only five benchmarks, several of which sit near the random baseline (HellaSwag at 25-27%, WinoGrande at 50-53%), making differences hard to interpret
3. Training efficiency claim is misleading: "5× fewer tokens" refers to matching Transformer's final accuracy early in training, not a genuine training compute reduction
4. RoPE length limitation: Context limited to 2,048 tokens with no extrapolation capability
5. Softmax dilution: The authors identify this as a fundamental limitation but offer only a speculative SISA-2 solution
6. No perplexity reported: The most standard LM metric is absent
Overall Assessment
SISA introduces a conceptually clean and practically implementable idea: fusing SSM importance signals at the attention score level via augmented Q/K vectors. The NIAH convergence results are compelling and suggest a real architectural advantage for retrieval. However, the limited scale (≤369M), narrow benchmarks, non-monotonic hyperparameter sensitivity, and the fact that Mamba-3 overtakes SISA at the largest tested scale collectively weaken the case for broad impact. This is a promising preliminary study that identifies a useful design principle, but it requires substantially more validation before its practical importance can be assessed.
Generated Jun 3, 2026
Comparison History (23)
Paper 1 introduces a novel, generally applicable architectural primitive (score-level SSM–attention fusion) that can be deployed with standard SDPA, avoiding custom kernels and recurrent state. This makes adoption and scaling in mainstream LLM stacks easy, increasing real-world impact. It targets a timely, central bottleneck—combining long-range retrieval with importance/prioritization—and reports clear gains on both language modeling and retrieval-style benchmarks. Its contribution is broad across sequence modeling and efficient LLM design. Paper 2 is valuable but more domain/tooling- and dataset-specific, with narrower architectural novelty.
Paper 1 targets the highly active field of Large Language Model architectures by fusing Transformers and State Space Models. Its approach to improving attention mechanisms has broad real-world applicability and high relevance to current AI trends. While Paper 2 presents a significant theoretical milestone in classical planning by learning admissible heuristics, its overall breadth of impact and application scope are narrower compared to fundamental advancements in foundational language models.
Paper 2 introduces a concrete architectural innovation (SISA) that addresses a fundamental challenge in language modeling—fusing attention and SSM mechanisms at the score level rather than block or head level. This defines a new design axis with empirical results showing improvements on standard benchmarks, and it integrates seamlessly with existing infrastructure (stock SDPA). Paper 1 addresses an important safety concern (compliance bias in agents) with a useful taxonomy and evaluation protocols, but its contributions are more framework/position-oriented with preliminary results on a small-scale evaluation. Paper 2's architectural contribution is more likely to drive widespread follow-up research and adoption in the rapidly evolving LLM architecture space.
LAP addresses a critical infrastructure gap in autonomous science by standardizing the agent-to-instrument interface, complementing existing protocols (MCP, A2A). Its potential impact spans all experimental sciences adopting self-driving labs, offering safety-critical primitives, measurement standards, and federation capabilities. While Paper 2 proposes an interesting score-level SSM-attention fusion, it shows mixed results at scale (Mamba-3 leads at 369M) and represents an incremental architectural contribution in a crowded hybrid modeling space. LAP's broader cross-disciplinary applicability and foundational infrastructure nature give it higher long-term impact potential.
Paper 2 introduces a novel architectural design principle ('score-level fusion') for hybrid language models that addresses a fundamental tension between attention and SSMs. This defines a new design axis with broad implications for the entire language modeling field. The approach is elegant (single SDPA call, no custom kernels), shows strong empirical results on standard benchmarks, and could influence how future foundation models are built. Paper 1, while useful, addresses a narrower problem (clarification in LLM agents) with more incremental improvements (3.7% success rate gain) and limited architectural novelty.
Paper 2 likely has higher impact: it introduces a rigorous evaluation framework (AgentCL) plus diagnostic tooling (MemProbe) for continual learning in language agents, addressing a broadly recognized measurement gap. Benchmarks can influence many future papers across agent memory, adaptation, and evaluation, with immediate real-world relevance as agents are deployed in long-running settings. Paper 1 is a promising modeling contribution with good results and practical implementation (stock SDPA), but its impact is narrower to hybrid attention/SSM architecture design and may compete with many fast-moving alternatives.
Paper 2 likely has higher impact: it introduces a simple, broadly applicable architectural primitive (score-level SSM-attention fusion) that plugs into standard SDPA without custom kernels or recurrent state, making adoption easy across model builders and hardware stacks. The method targets a timely, central bottleneck (long-context retrieval + prioritization) and reports strong gains on established benchmarks and retrieval metrics, suggesting cross-field relevance (NLP, efficient transformers, systems). Paper 1 is novel for autonomous RL training workflows, but its impact may be narrower and more dependent on complex evaluation setups and toolchains.
Paper 1 introduces a fundamentally new design paradigm (score-level fusion) for hybrid language models, addressing a core architectural challenge in the field. It proposes a clean, elegant solution (SISA) that integrates SSM importance signals directly into attention scores without custom kernels, offering broad applicability across language modeling. Paper 2, while solving practical problems in news-augmented forecasting, addresses a narrower application domain with incremental improvements combining existing techniques (compression, reward models). Paper 1's architectural contribution has broader potential impact across the rapidly evolving foundation model landscape.
Paper 2 likely has higher scientific impact: it proposes a concrete, easily adoptable architectural modification (score-level SSM-attention fusion) with strong empirical gains and no custom kernels, making integration into existing Transformer stacks straightforward. Its timeliness is high given intense interest in efficient long-context and hybrid SSM/attention models, and improvements on widely used benchmarks suggest broad applicability across NLP and sequence modeling. Paper 1 addresses an important emerging governance need and claims formal rigor, but its impact may be narrower and more dependent on adoption by specific IAM/agent platforms.
Paper 2 offers a concrete, novel algorithmic contribution (score-level fusion of SSM importance into attention) with demonstrated empirical gains and practical implementation advantages (single stock-SDPA call, no custom kernel), making it likely to be adopted and extended broadly in ML/NLP and systems. Its timeliness is high given active work on Transformer–SSM hybrids, and results on standard benchmarks strengthen rigor. Paper 1 is valuable as a systems architecture perspective for edge agents, but is more conceptual and less empirically grounded, limiting near-term scientific traction.
Paper 1 addresses a fundamental problem in LLM architecture by proposing a novel fusion of Transformers and State Space Models. Improvements to foundational AI architectures have a profound, cross-disciplinary impact on efficiency and performance. In contrast, Paper 2 introduces a benchmark for a more specific niche (predicting user decisions from prediction markets/on-chain data), which, while valuable, has a narrower scope and less potential for widespread foundational impact across fields.
Paper 1 addresses a critical and timely issue, gender bias in LLM-based medical triage, with clear real-world implications for patient safety as AI systems are increasingly deployed in healthcare. It demonstrates a systematic, reproducible bias across multiple leading model families with a well-designed methodology, and its findings have immediate policy relevance for AI regulation and clinical deployment. Paper 2 proposes a technically interesting architectural innovation (score-level SSM-attention fusion), but its results are mixed (Mamba-3 leads on LAMBADA at 369M), it is tested only at small scale, and it represents an incremental contribution in a crowded hybrid architecture space with uncertain long-term adoption.
Paper 2 addresses a fundamental challenge in foundation models by fusing State Space Models with Transformer attention, directly impacting the highly active field of Large Language Models. Its advancements in efficiency and sequence retrieval have massive, broad implications across AI and its downstream applications. In contrast, Paper 1 focuses on a relatively niche application within video game procedural content generation, resulting in a much narrower scope of potential scientific and real-world impact.
Paper 2 proposes a fundamental architectural innovation for hybrid language models by integrating SSM-derived importance directly into the attention mechanism. Improvements in foundation model architectures have massive breadth of impact across all of AI. Paper 1 addresses a valuable but much narrower niche (refactoring formal proofs in Lean), limiting its overall scientific impact compared to core LLM advancements.
Paper 1 proposes a fundamental architectural advancement by seamlessly integrating state space models (SSMs) into the attention mechanism. Core improvements to foundational language model architectures typically yield massive, cross-disciplinary impact, influencing how future models are trained and deployed. While Paper 2 offers a valuable evaluation framework for multi-agent systems, Paper 1 addresses a more central bottleneck in AI with a highly timely and widely applicable solution.
While Paper 1 offers valuable insights into data curation and agent training, Paper 2 proposes a fundamental architectural innovation for foundation models. By fusing State Space Models with Attention at the score level (SISA), Paper 2 addresses a core bottleneck in hybrid language modeling. Foundational architectural improvements generally yield broader downstream impact across all AI applications, making Paper 2's contribution more universally significant.
Paper 1 introduces a novel architectural design principle (score-level fusion) for hybrid language models that defines a new design axis beyond existing block-level and head-level paradigms. It addresses a fundamental challenge in language modeling with an elegant solution requiring no custom kernels, making it broadly applicable. Its impact spans the entire LLM community. Paper 2, while rigorous and clinically relevant, addresses a narrower domain (lung cancer early detection) with incremental advances combining existing techniques (multi-agent systems, RAG, MARL). Paper 1's architectural innovation has broader potential to influence future model designs across NLP.
Paper 1 introduces a novel architectural paradigm (score-level fusion) for hybrid language models that addresses a fundamental tension between attention and SSMs. It defines a new design axis beyond existing block-level and head-level approaches, with strong empirical results and practical appeal (no custom kernels needed). While Paper 2 makes solid contributions to safe RL with formal guarantees, it extends existing shielding frameworks to RMDPs—a more incremental advance. Paper 1's impact potential is broader given the massive scale of LLM research and the practical simplicity of the proposed method.
LEAP demonstrates broader and more transformative impact. It achieves state-of-the-art results on formal theorem proving (solving all 12 Putnam 2025 problems), introduces a new benchmark (Lean-IMO-Bench), and shows research-level utility by formalizing proofs for open mathematical problems. This bridges a critical gap between informal and formal mathematical reasoning with immediate practical applications. SISA proposes an interesting architectural fusion technique but is evaluated only at small scale (152M-369M parameters) with incremental improvements, limiting its demonstrated impact compared to LEAP's breakthrough-level results.
Paper 2 likely has higher scientific impact due to stronger novelty (score-level fusion of SSM importance directly inside attention), broad applicability to foundational LLM architecture design, and high timeliness in hybrid attention/SSM research. It reports clear empirical gains on widely recognized benchmarks and offers a practical implementation path (single stock SDPA call, no custom kernels), increasing adoption potential across ML systems. Paper 1 is valuable and applicable to urban planning, but the approach (GA-based calibration) is more incremental and domain-specific, with narrower cross-field reach.