Forget Attention: Importance-Aware Attention Is All You Need
Soohyeong Shin, Yeongwook Yang
Abstract
Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9% and Mamba-3 15.5%) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.
AI Impact Assessments
Scientific Impact Assessment: "Forget Attention: Importance-Aware Attention Is All You Need"
1. Core Contribution
SISA proposes "score-level fusion" — injecting an SSM-derived importance bias directly into the attention score computation, realized through augmented Q/K vectors fed into a single standard SDPA call. The key equation adds a term λ·C̄ᵢᵀB̄ⱼ (encoding cumulative decay and data-dependent rotation from Mamba-3's framework) to the standard qᵢᵀkⱼ/√dₕ content similarity score. The algebraic trick of concatenating SSM channels onto Q and K vectors allows this to be computed without custom CUDA kernels, maintaining FlashAttention compatibility.
This positions SISA as a "third axis" of SSM-attention hybridization beyond block-level (Jamba) and head-level (Hymba) designs. The conceptual contribution — that the attention score itself is the natural interface point for fusion — is clean and well-articulated.
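The augmented Q/K reduction described in this section can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation. The function names are hypothetical, the arrays `c` and `b` are random stand-ins for the SSM-derived C̄ and B̄ terms (their construction from Mamba-3's decay and rotation is not reproduced here), and placing λ and the 1/√dₕ factor on the query side is an assumption chosen so that a plain attention call with scale 1 reproduces the stated score.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax with a causal mask (position i attends to j <= i)."""
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

def plain_attention(q, k, v, scale):
    """Stand-in for a stock SDPA call: softmax(scale * q k^T) v."""
    return causal_softmax(scale * (q @ k.T)) @ v

def sisa_reference(q, k, v, c, b, lam):
    """Score written out explicitly: q_i.k_j / sqrt(d_h) + lam * c_i.b_j."""
    d_h = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_h) + lam * (c @ b.T)
    return causal_softmax(scores) @ v

def sisa_augmented(q, k, v, c, b, lam):
    """Same score obtained by concatenating the SSM channels onto Q and K."""
    d_h = q.shape[-1]
    q_aug = np.concatenate([q / np.sqrt(d_h), lam * c], axis=-1)
    k_aug = np.concatenate([k, b], axis=-1)
    # Assumption: scale is fixed to 1 because the scaling is folded into q_aug.
    return plain_attention(q_aug, k_aug, v, scale=1.0)

# Random stand-in tensors: T tokens, head dim d_h, SSM channel dim d_s.
rng = np.random.default_rng(0)
T, d_h, d_s = 6, 8, 4
q, k, v = (rng.standard_normal((T, d_h)) for _ in range(3))
c, b = (rng.standard_normal((T, d_s)) for _ in range(2))

ref = sisa_reference(q, k, v, c, b, lam=0.5)
aug = sisa_augmented(q, k, v, c, b, lam=0.5)
print(np.allclose(ref, aug))
```

One practical consequence of this construction: a library SDPA call that applies its default 1/√d scaling to the augmented width dₕ + dₛ would rescale both terms, so the scale has to be set explicitly (as done here) or compensated for in the augmented vectors.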
2. Methodological Rigor
Strengths in experimental design:
Weaknesses:
3. Potential Impact
Positive aspects:
Limitations on impact:
4. Timeliness & Relevance
The paper addresses a genuinely active research area: SSM-attention hybridization is one of the most explored topics in language model architecture design (2024-2025). The timing relative to Mamba-3 (cited as an ICLR 2026 paper), FoX, Hymba, and Falcon-H1 is appropriate. The distinction from FoX (scalar decay bias vs. vector-valued data-dependent bias) is well-drawn and technically meaningful.
However, the field is moving rapidly toward very large scale, and a paper whose largest model has 369M parameters may struggle for attention among practitioners focused on 7B+ models.
5. Strengths & Limitations
Key strengths:
1. Clean mathematical formulation with the augmented Q/K reduction (Proposition 1)
2. No custom kernels — a significant practical advantage
3. Excellent NIAH convergence (100% from step 1K, 7× faster than Transformer)
4. Thorough ablation and transparency about protocol differences
5. Honest reporting of limitations (369M results, softmax dilution, scale gaps)
Notable weaknesses:
1. Scale is too small for confident architectural conclusions — 369M/5B is at least an order of magnitude below where architectural choices have been validated in recent literature
2. Non-monotonic dₛ optimum across scales makes practical deployment guidance unclear
3. The 39% throughput overhead vs. Transformer is non-trivial and undermines the "free lunch" narrative
4. At the largest tested scale, SISA doesn't clearly beat the Transformer baseline (bootstrap CIs overlap on all benchmarks)
5. The paper's strongest metric (LAMBADA-greedy) shows dramatic improvement, but this narrows to marginal differences on most other benchmarks
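The overlap check behind weakness 4 can be illustrated with a percentile bootstrap over per-example outcomes. This is a minimal sketch under stated assumptions: the paper's exact bootstrap protocol is not given here, the function names are hypothetical, and the 0/1 outcome arrays are synthetic placeholders rather than results from the paper.

```python
import numpy as np

def bootstrap_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over per-example 0/1 outcomes."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n = correct.size
    # Resample example indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_resamples, n))
    means = correct[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def intervals_overlap(a, b):
    """True if closed intervals a=(lo, hi) and b=(lo, hi) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Synthetic per-example outcomes for two hypothetical models (not paper data).
rng = np.random.default_rng(1)
model_a = rng.random(500) < 0.17
model_b = rng.random(500) < 0.15

ci_a = bootstrap_ci(model_a, seed=2)
ci_b = bootstrap_ci(model_b, seed=3)
print(ci_a, ci_b, intervals_overlap(ci_a, ci_b))
```

Interval overlap is a coarse check. When both models are scored on the same examples, a paired bootstrap over per-example differences is often more sensitive, so overlapping intervals alone do not rule out a real difference.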
Overall Assessment
SISA presents a well-motivated and cleanly executed architectural idea — score-level fusion of SSM and attention signals. The augmented Q/K trick is the paper's most impactful technical contribution, enabling the fusion without infrastructure changes. However, the experimental validation is limited to small scale, the improvements are inconsistent across scales and benchmarks, and the fundamental FFN trade-off raises questions about scalability. This is a solid workshop or incremental conference contribution that introduces a promising direction, but falls short of the evidence needed to establish score-level fusion as a validated architectural paradigm.
Generated Jun 2, 2026
Comparison History (23)
Paper 2 introduces a fundamental architectural improvement to language models by fusing SSMs and attention at the score level. Innovations in foundational AI architectures typically yield massive cross-disciplinary impact, high citation rates, and rapid adoption. While Paper 1 presents a highly valuable protocol for the emerging field of autonomous labs, its immediate impact is constrained to a specific intersection of robotics and science, whereas Paper 2's methodology broadly advances the core AI ecosystem.
Paper 2 likely has higher impact: it delivers a substantial capability jump (general-purpose LLMs reaching strong formal-proof performance) via an agentic framework applicable across models, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates broader real-world/research utility (autonomous formalization and a verified result on an open combinatorics challenge). Its applications span automated reasoning, software verification, and mathematics. Paper 1 is technically novel and efficient, but its impact is more specialized to LM architecture design and depends on broader adoption and scaling validation.
Paper 2 proposes a fundamental architectural innovation by fusing State Space Models directly into the attention mechanism (score-level fusion). This foundational improvement to language modeling architecture has the potential to influence a vast array of downstream applications and models across AI. While Paper 1 presents a strong, highly applicable system for automated data science, Paper 2's contribution tackles a core mechanism in foundation models, offering a broader and deeper potential impact across the entire field of deep learning.
Paper 2 introduces a novel architectural concept (score-level fusion of SSMs and attention) that defines a new design axis for hybrid language models, addressing a fundamental challenge in the dominant field of language modeling. Its innovation is more foundational and broadly applicable across NLP/AI. Paper 1, while valuable as a benchmark for single-cell multi-omics translation, is more incremental—systematizing existing methods rather than proposing a new paradigm. The breadth of impact for advances in language model architecture far exceeds that of a domain-specific benchmark study.
Paper 1 proposes a fundamental architectural innovation for language models by integrating SSM importance signals directly into attention scores. Given the rapid scaling and widespread application of hybrid foundational models, this score-level fusion approach has the potential for broad, transformative impact across AI. In contrast, Paper 2 presents an interesting but niche empirical study with a very small sample size (20 papers) limited to computer architecture. While useful for researchers, its scope and fundamental technical contribution are significantly narrower than the architectural advancements proposed in Paper 1.
Paper 2 introduces a fundamentally new design axis (score-level fusion) for hybrid language models that elegantly combines attention and SSM mechanisms within a single operation. This architectural innovation is broadly applicable to the entire LM community, requires no custom kernels, and addresses a core limitation of both Transformers and SSMs. Its simplicity and generality give it high adoption potential. While Paper 1 presents a solid agentic orchestration framework, it is more niche and builds incrementally on existing agent/skill paradigms. Paper 2's contribution to foundational architecture design gives it wider and longer-lasting impact.
Paper 1 introduces a novel architectural contribution (score-level fusion of SSMs and attention) that addresses a fundamental challenge in hybrid language modeling with a clean, practical implementation requiring no custom kernels. It defines a new design axis beyond existing block-level and head-level paradigms, with strong empirical results. Paper 2 introduces a valuable benchmark for agent values, but benchmarks tend to have more transient impact. Paper 1's methodological innovation has broader potential to influence future architecture design across the rapidly evolving foundation model landscape.
DAG-MoE introduces a fundamentally new axis for scaling MoE models—structural aggregation via learned DAG structures—with theoretical grounding showing expanded combination spaces and implicit multi-step reasoning within a single layer. This has broader impact across the large and active MoE research community, applicable to any MoE architecture. SISA, while clever in fusing SSM importance into attention scores, addresses a more niche hybrid modeling problem at relatively small scales (152M-369M), and its advantages over existing hybrids are incremental. DAG-MoE's contribution is more generalizable and addresses a core scalability challenge in modern LLMs.
Paper 1 introduces a novel architectural contribution (score-level fusion) that defines a new design axis for hybrid language models, with clear empirical improvements in both language modeling and retrieval tasks. It addresses a fundamental challenge in the dominant paradigm of LLM architecture design, has immediate practical applicability (no custom kernels needed), and could influence the rapidly growing hybrid SSM-attention field. Paper 2 offers a useful conceptual reframing of heuristic design but operates in a narrower application domain (combinatorial optimization) with less transformative potential for the broader ML/AI community.
Paper 2 addresses a highly active and competitive area (hybrid language modeling combining attention and SSMs) with a novel 'score-level fusion' paradigm that is practical, implementable with existing infrastructure (single SDPA call), and demonstrates clear empirical improvements. The timeliness is exceptional given the current intense focus on efficient LLM architectures. Paper 1, while rigorous, systematizes existing belief change theory in a mature, niche area of knowledge representation, offering theoretical unification rather than opening new research directions. Paper 2's broader applicability to NLP/AI and its introduction of a new design axis give it substantially higher impact potential.
Paper 2 has higher potential scientific impact because it proposes a unifying research agenda (“Model Science”) that can reshape how the field evaluates, understands, and governs deployed AI systems. Its scope (verify/explore/steer/refine), infrastructure emphasis (catalogues, shared principles), and relevance to current shortcomings (hallucinations, shortcut learning, interpretability) make it broadly applicable across subfields and timely for safety and deployment at scale. Paper 1 is a strong technical contribution with clear empirical gains, but its impact is narrower and contingent on adoption within specific LM architectures.
Paper 2 likely has higher impact: it introduces an agentic framework for protein design that tightly couples PLMs with biophysical tools and a novel training objective (CAPO) to learn when/why to query feedback. This is highly timely for AI-driven biology and can translate directly to real-world protein engineering (enzymes, antibodies, PPIs), influencing multiple communities (ML, structural biology, drug discovery). Paper 1 is a clever, efficient architectural fusion for LMs, but its primary impact is within model design/efficiency and may face fast-moving competition from adjacent hybrid attention/SSM methods.
Paper 1 introduces a novel architectural paradigm (score-level fusion) for hybrid language models, addressing a fundamental challenge in the rapidly evolving field of efficient sequence modeling. It offers a clean, implementable solution (single SDPA call, no custom kernels) with strong empirical results across multiple benchmarks. The contribution defines a new design axis beyond existing block-level and head-level hybrids, with broad implications for LLM architecture. Paper 2 applies existing techniques (knowledge graphs, graph attention, temporal modeling) to a narrower educational domain with limited generalizability beyond advanced mathematics courses.
Paper 1 proposes a fundamental architectural innovation by fusing attention and state space models at the score level, addressing core limitations in current foundation model designs. This could broadly influence the architecture of future language models. Paper 2 offers a valuable inference optimization (speculative decoding speedup), but its impact is mostly limited to deployment efficiency rather than foundational model capabilities.
Paper 1 (SISA) addresses a highly active and competitive area—hybrid attention/SSM architectures for language modeling—with a clean, practical innovation (score-level fusion) that requires no custom kernels and integrates into standard SDPA. It defines a new design axis beyond existing block-level and head-level paradigms, with strong empirical results on established benchmarks. Paper 2 (SHARP) proposes an interesting neuroscience-inspired framework for streaming temporal learning, but its contributions are more incremental, validated primarily on smaller-scale benchmarks, and the practical adoption pathway is less clear given the dominance of transformer-based approaches.
Paper 2 offers a foundational architectural innovation for Large Language Models by introducing score-level fusion of State Space Models and Attention. This addresses a critical bottleneck in AI research: combining the global retrieval of Transformers with the sequential efficiency of SSMs without requiring custom kernels. While Paper 1 provides a valuable, domain-specific application of Mamba for energy grids, Paper 2's breakthrough has a vastly wider breadth of impact, directly influencing the design of future foundation models, improving long-context retrieval, and advancing the highly active field of efficient AI architectures.
Mind-Omni introduces a fundamentally new paradigm for brain-computer interfaces by unifying seven encoding/decoding tasks through discrete diffusion, creating the first versatile multi-task brain-vision-language framework. Its breadth of impact spans neuroscience, BCI, and AI, with strong real-world applications in neural decoding. While SISA offers a clever architectural contribution (score-level fusion for SSM-attention hybrids), it represents an incremental improvement in language model architecture. Mind-Omni's novelty as a foundation model paradigm for neural activity, combined with its cross-disciplinary impact and practical BCI applications, gives it higher potential scientific impact.
Paper 1 introduces a fundamental architectural innovation by fusing SSMs and Attention at the score level, addressing a major bottleneck in hybrid language models. Given the current intense focus on efficient LLM architectures, this approach has high potential for widespread adoption in next-generation foundation models, offering broader practical applications and immediate performance gains compared to the theoretical alignment insights of Paper 2.
Paper 1 likely has higher scientific impact: it proposes a novel, low-friction architectural fusion (score-level SSM-informed attention) that can be deployed with standard SDPA and no custom kernels, making adoption easy and broadly relevant to foundation-model training. The method targets a central, timely limitation (prioritization + long-context retrieval) and reports strong gains on established benchmarks (LAMBADA, NIAH) including faster retrieval convergence. Its potential impact spans LLM architecture research, efficiency, and long-context applications. Paper 2 is application-driven and useful, but more domain-specific and less foundational.
Paper 2 proposes a fundamental architectural innovation by fusing SSMs and Attention at the score level, addressing core limitations in current foundation models. This has broader applicability across all domains relying on language modeling, potentially influencing the design of next-generation LLMs. While Paper 1 presents a valuable post-training method for reasoning, its primary domain (theorem proving) is more specialized, giving Paper 2 a wider breadth of potential scientific impact.