Forget Attention: Importance-Aware Attention Is All You Need

Soohyeong Shin, Yeongwook Yang

Frozen v1: this version was superseded on arXiv by v2. Stats below reflect the state at freeze time and will not change.
#285 of 3404 · Artificial Intelligence
Tournament Score: 1510±44 (scale 1050 to 1800)
Win Rate: 78% (18 wins, 5 losses, 23 matches)
Rating: 4.5/10
Significance 5 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

Combining attention's global retrieval with the sequential importance signal of state space models (SSMs) is the open challenge of hybrid language modeling. Transformers see everywhere but cannot prioritize; SSMs know what matters but cannot revisit. Existing hybrids -- Jamba (block level) and Hymba (head level) -- place the two in separate compartments, so neither informs the other during the attention computation itself. We propose SISA (SSM-Informed Softmax Attention), which adds an SSM-derived importance term directly inside the attention score and realizes the full operation as a single SDPA call on augmented query/key vectors -- no recurrent state, no custom kernel. At 152M / 5B tokens, SISA reaches LAMBADA-greedy 17.3% (vs. Transformer 13.9 and Mamba-3 15.5) and attains NIAH 100% from step 1K, 7x faster than Transformer's retrieval convergence; at 369M, Mamba-3 leads LAMBADA while SISA preserves perfect NIAH and stock-SDPA execution. SISA thus defines a third design axis for SSM-attention hybrids -- score-level fusion -- beyond the block-level and head-level paradigms that have dominated the field.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Forget Attention: Importance-Aware Attention Is All You Need"

1. Core Contribution

SISA proposes "score-level fusion" — injecting an SSM-derived importance bias directly into the attention score computation, realized through augmented Q/K vectors fed into a single standard SDPA call. The key equation adds a term λ·C̄ᵢᵀB̄ⱼ (encoding cumulative decay and data-dependent rotation from Mamba-3's framework) to the standard qᵢᵀkⱼ/√dₕ content similarity score. The algebraic trick of concatenating SSM channels onto Q and K vectors allows this to be computed without custom CUDA kernels, maintaining FlashAttention compatibility.
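
The augmented Q/K reduction described above can be checked numerically. The sketch below is not the authors' code. It uses random stand-ins for C̄ and B̄ (in SISA these would come from the SSM), and the rescaling inside the hypothetical `augment` helper is one assumed way to make the default 1/√d scaling of a stock SDPA call reproduce the biased score; the paper's Proposition 1 may normalize differently.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sdpa_weights(Q, K):
    """Causal attention weights with the stock 1/sqrt(d) scaling."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [softmax([scale * dot(Q[i], K[j]) for j in range(i + 1)])
            for i in range(len(Q))]

def biased_weights(Q, K, C, B, lam):
    """Reference: q.k / sqrt(d_h) + lam * C.B, computed explicitly."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [softmax([scale * dot(Q[i], K[j]) + lam * dot(C[i], B[j])
                     for j in range(i + 1)])
            for i in range(len(Q))]

def augment(Q, K, C, B, lam):
    """Hypothetical helper: concatenate SSM channels onto Q and K.

    Assumed rescaling so that the 1/sqrt(d_h + d_s) factor applied by a
    plain attention call cancels out and leaves the biased score.
    """
    d_h, d_s = len(Q[0]), len(C[0])
    d_aug = d_h + d_s
    q_scale = math.sqrt(d_aug / d_h)   # restores the 1/sqrt(d_h) content scaling
    c_scale = lam * math.sqrt(d_aug)   # leaves exactly lam * C.B after scaling
    Q_aug = [[q_scale * x for x in q] + [c_scale * x for x in c]
             for q, c in zip(Q, C)]
    K_aug = [list(k) + list(b) for k, b in zip(K, B)]
    return Q_aug, K_aug

def rand_matrix(rng, rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, d_h, d_s, lam = 6, 8, 4, 0.5        # illustrative sizes, not the paper's
Q, K = rand_matrix(rng, T, d_h), rand_matrix(rng, T, d_h)
C, B = rand_matrix(rng, T, d_s), rand_matrix(rng, T, d_s)  # stand-ins for SSM outputs

ref = biased_weights(Q, K, C, B, lam)
Q_aug, K_aug = augment(Q, K, C, B, lam)
fused = sdpa_weights(Q_aug, K_aug)     # one unmodified attention call

max_err = max(abs(a - b) for r, f in zip(ref, fused) for a, b in zip(r, f))
print(f"max abs difference: {max_err:.2e}")
```

Because the bias is a dot product between per-position vectors, it folds into extra Q/K channels; this is why no custom kernel is needed. With a framework SDPA that accepts an explicit scale argument, the rescaling step could instead be handled by passing the scale directly.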

This positions SISA as a "third axis" of SSM-attention hybridization beyond block-level (Jamba) and head-level (Hymba) designs. The conceptual contribution — that the attention score itself is the natural interface point for fusion — is clean and well-articulated.

2. Methodological Rigor

Strengths in experimental design:

  • Parameter-matched comparisons across four architectures (Transformer, SISA, Mamba-2, Mamba-3) at three scales (50M, 152M, 369M)
  • Careful documentation of micro-batch protocol differences (mb=2 vs. mb=4), with controlled retrains for fair comparison
  • Multi-seed NIAH verification (5 seeds)
  • Systematic dₛ ablation study across scales
  • Bootstrap confidence intervals at 369M showing statistical overlap
  • FLOPs analysis (only 7% overhead)
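
For readers unfamiliar with the bootstrap check mentioned in the list above, here is a minimal sketch of a percentile bootstrap on the accuracy difference between two models. It is a generic procedure, not the paper's exact protocol (which is not described here), and the per-example correctness vectors are synthetic placeholders, not results from the paper.

```python
import random

def bootstrap_diff_ci(correct_a, correct_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(a) - accuracy(b) on paired examples."""
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Synthetic 0/1 correctness for two hypothetical models (illustration only).
data_rng = random.Random(1)
model_a = [1 if data_rng.random() < 0.17 else 0 for _ in range(500)]
model_b = [1 if data_rng.random() < 0.15 else 0 for _ in range(500)]

ci_low, ci_high = bootstrap_diff_ci(model_a, model_b)
print(f"95% CI for accuracy difference: [{ci_low:+.3f}, {ci_high:+.3f}]")
```

An interval that contains zero means the two models are not statistically distinguishable on that benchmark at the chosen level, which is the situation the review describes at 369M.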

Weaknesses:

  • The maximum scale is 369M parameters on 5B tokens — far below what would be needed to establish architectural claims with confidence. The 369M models are Chinchilla-undertrained (13.5× tokens/param vs. ~20× optimal), which the authors acknowledge but which limits generalizability.
  • Only five benchmarks are used, with several showing minimal differentiation (HellaSwag clusters at 25-27%). The LAMBADA-greedy metric, where SISA shines most, is a relatively narrow evaluation.
  • The NIAH test is synthetic and somewhat simplistic (200 trials, single sentence retrieval). While informative, it's not a comprehensive retrieval benchmark.
  • At 369M, the results are mixed: Mamba-3 leads LAMBADA by +2.6 pp, and bootstrap CIs overlap with Transformer on all benchmarks. The paper's strongest claims rest on 152M results.
  • The micro-batch sensitivity is concerning — the mb=2 vs. mb=4 differences suggest some fragility in the experimental setup that isn't fully explained.
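
The tokens-per-parameter figure cited in the first weakness above is simple arithmetic and can be reproduced as follows. The nominal parameter counts (152M, 369M) and the 5B-token budget are taken from the text; the exact counts in the paper may differ slightly, and the roughly 20 tokens per parameter reference point is the rule of thumb the review itself quotes.

```python
def tokens_per_param(tokens, params):
    """Training tokens seen per model parameter."""
    return tokens / params

TOKENS = 5e9            # 5B training tokens, as stated in the review
RULE_OF_THUMB = 20.0    # approximate compute-optimal ratio quoted in the review

ratios = {}
for name, params in [("152M", 152e6), ("369M", 369e6)]:
    ratios[name] = tokens_per_param(TOKENS, params)
    status = "below" if ratios[name] < RULE_OF_THUMB else "above"
    print(f"{name}: {ratios[name]:.2f} tokens/param ({status} ~{RULE_OF_THUMB:.0f})")
```

Under these nominal counts the 369M model sits below the rule of thumb while the 152M model sits above it, which is consistent with the review's point that only the largest model is undertrained.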

3. Potential Impact

Positive aspects:

  • The augmented Q/K trick is elegant and practically useful — it requires no custom kernels and works with stock FlashAttention, lowering the adoption barrier significantly compared to Mamba variants.
  • The conceptual framework of "score-level fusion" could inspire follow-up work exploring richer bias forms or different SSM signals injected at the score level.
  • The 25% throughput advantage over Mamba-3 with competitive performance is practically meaningful.

Limitations on impact:

  • The improvements are demonstrated only at very small scale. The non-monotonic dₛ scaling behavior (optimal dₛ = 64 at 50M, 16 at 152M, 128 at 369M) suggests the method may be difficult to tune at larger scales without extensive ablation.
  • The FFN reduction trade-off is fundamental: SISA adds SSM parameters by shrinking the FFN, and this becomes increasingly costly at scale (as acknowledged in the parameter budget analysis). This could limit the approach's viability at the 1B+ scales where practical impact matters most.
  • The softmax dilution issue (Section 6.2) is a fundamental limitation that the authors themselves identify — the additive bias is most useful precisely when attention is already decisive, but gets washed out when many tokens are relevant.
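
One way to read the softmax dilution point above is with a toy calculation. This is an illustrative assumption, not the analysis in the paper's Section 6.2: suppose `n_relevant` tokens share the same content score and exactly one of them receives an additive importance bias. The softmax weight on the biased token is then exp(bias) / (exp(bias) + n_relevant - 1), which shrinks as more tokens compete.

```python
import math

def biased_token_weight(n_relevant, bias):
    """Softmax weight on one biased token among n_relevant equally scored tokens."""
    boosted = math.exp(bias)
    return boosted / (boosted + (n_relevant - 1))

BIAS = 2.0  # hypothetical bias magnitude, chosen only for illustration
weights = [biased_token_weight(n, BIAS) for n in (1, 2, 8, 64, 512)]
for n, w in zip((1, 2, 8, 64, 512), weights):
    print(f"n_relevant={n:4d}  weight on biased token={w:.3f}")
```

In this toy setting a fixed bias dominates when few tokens compete and contributes little once hundreds of tokens have comparable content scores.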

4. Timeliness & Relevance

The paper addresses a genuinely active research area: SSM-attention hybridization is one of the most explored topics in language model architecture design (2024-2025). The timing relative to Mamba-3 (cited as 2026 ICLR), FoX, Hymba, and Falcon-H1 is appropriate. The distinction from FoX (scalar decay bias vs. vector-valued data-dependent bias) is well-drawn and technically meaningful.

However, the field is moving rapidly toward very large scale, and a paper maxing out at 369M may struggle for attention among practitioners focused on 7B+ models.

5. Strengths & Limitations

Key strengths:

1. Clean mathematical formulation with the augmented Q/K reduction (Proposition 1)

2. No custom kernels, a significant practical advantage

3. Excellent NIAH convergence (100% from step 1K, 7× faster than Transformer)

4. Thorough ablation and transparency about protocol differences

5. Honest reporting of limitations (369M results, softmax dilution, scale gaps)

Notable weaknesses:

1. Scale is too small for confident architectural conclusions: 369M/5B is at least an order of magnitude below where architectural choices have been validated in recent literature

2. Non-monotonic dₛ optimum across scales makes practical deployment guidance unclear

3. The 39% throughput overhead vs. Transformer is non-trivial and undermines the "free lunch" narrative

4. At the largest tested scale, SISA doesn't clearly beat the Transformer baseline (bootstrap CIs overlap on all benchmarks)

5. The paper's strongest metric (LAMBADA-greedy) shows dramatic improvement, but this narrows to marginal differences on most other benchmarks

Overall Assessment

SISA presents a well-motivated and cleanly executed architectural idea: score-level fusion of SSM and attention signals. The augmented Q/K trick is the paper's most impactful technical contribution, enabling the fusion without infrastructure changes. However, the experimental validation is limited to small scale, the improvements are inconsistent across scales and benchmarks, and the fundamental FFN trade-off raises questions about scalability. This is a solid workshop or incremental conference contribution that introduces a promising direction, but falls short of the evidence needed to establish score-level fusion as a validated architectural paradigm.

Rating: 4.5/10
Significance 5 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated Jun 2, 2026

    Comparison History (23)

    vs. LAP: An Agent-to-Instrument Protocol for Autonomous Science
    gemini-3.1 · 6/3/2026

    Paper 2 introduces a fundamental architectural improvement to language models by fusing SSMs and attention at the score level. Innovations in foundational AI architectures typically yield massive cross-disciplinary impact, high citation rates, and rapid adoption. While Paper 1 presents a highly valuable protocol for the emerging field of autonomous labs, its immediate impact is constrained to a specific intersection of robotics and science, whereas Paper 2's methodology broadly advances the core AI ecosystem.

    vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact: it delivers a substantial capability jump (general-purpose LLMs reaching strong formal-proof performance) via an agentic framework applicable across models, introduces a timely new benchmark (Lean-IMO-Bench), and demonstrates broader real-world/research utility (autonomous formalization and a verified result on an open combinatorics challenge). Its applications span automated reasoning, software verification, and mathematics. Paper 1 is technically novel and efficient, but its impact is more specialized to LM architecture design and depends on broader adoption and scaling validation.

    vs. EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
    gemini-3.1 · 6/3/2026

    Paper 2 proposes a fundamental architectural innovation by fusing State Space Models directly into the attention mechanism (score-level fusion). This foundational improvement to language modeling architecture has the potential to influence a vast array of downstream applications and models across AI. While Paper 1 presents a strong, highly applicable system for automated data science, Paper 2's contribution tackles a core mechanism in foundation models, offering a broader and deeper potential impact across the entire field of deep learning.

    vs. scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation
    claude-opus-4.6 · 6/3/2026

    Paper 2 introduces a novel architectural concept (score-level fusion of SSMs and attention) that defines a new design axis for hybrid language models, addressing a fundamental challenge in the dominant field of language modeling. Its innovation is more foundational and broadly applicable across NLP/AI. Paper 1, while valuable as a benchmark for single-cell multi-omics translation, is more incremental—systematizing existing methods rather than proposing a new paradigm. The breadth of impact for advances in language model architecture far exceeds that of a domain-specific benchmark study.

    vs. Can AI Review Improve Paper Drafting? An Empirical Study on 20 Computer Architecture Submissions
    gemini-3.1 · 6/2/2026

    Paper 1 proposes a fundamental architectural innovation for language models by integrating SSM importance signals directly into attention scores. Given the rapid scaling and widespread application of hybrid foundational models, this score-level fusion approach has the potential for broad, transformative impact across AI. In contrast, Paper 2 presents an interesting but niche empirical study with a very small sample size (20 papers) limited to computer architecture. While useful for researchers, its scope and fundamental technical contribution are significantly narrower than the architectural advancements proposed in Paper 1.

    vs. SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration
    claude-opus-4.6 · 6/2/2026

    Paper 2 introduces a fundamentally new design axis (score-level fusion) for hybrid language models that elegantly combines attention and SSM mechanisms within a single operation. This architectural innovation is broadly applicable to the entire LM community, requires no custom kernels, and addresses a core limitation of both Transformers and SSMs. Its simplicity and generality give it high adoption potential. While Paper 1 presents a solid agentic orchestration framework, it is more niche and builds incrementally on existing agent/skill paradigms. Paper 2's contribution to foundational architecture design gives it wider and longer-lasting impact.

    vs. Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
    claude-opus-4.6 · 6/2/2026

    Paper 1 introduces a novel architectural contribution (score-level fusion of SSMs and attention) that addresses a fundamental challenge in hybrid language modeling with a clean, practical implementation requiring no custom kernels. It defines a new design axis beyond existing block-level and head-level paradigms, with strong empirical results. Paper 2 introduces a valuable benchmark for agent values, but benchmarks tend to have more transient impact. Paper 1's methodological innovation has broader potential to influence future architecture design across the rapidly evolving foundation model landscape.

    vs. DAG-MoE: From Simple Mixture to Structural Aggregation in Mixture-of-Experts
    claude-opus-4.6 · 6/2/2026

    DAG-MoE introduces a fundamentally new axis for scaling MoE models—structural aggregation via learned DAG structures—with theoretical grounding showing expanded combination spaces and implicit multi-step reasoning within a single layer. This has broader impact across the large and active MoE research community, applicable to any MoE architecture. SISA, while clever in fusing SSM importance into attention scores, addresses a more niche hybrid modeling problem at relatively small scales (152M-369M), and its advantages over existing hybrids are incremental. DAG-MoE's contribution is more generalizable and addresses a core scalability challenge in modern LLMs.

    vs. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs
    claude-opus-4.6 · 6/2/2026

    Paper 1 introduces a novel architectural contribution (score-level fusion) that defines a new design axis for hybrid language models, with clear empirical improvements in both language modeling and retrieval tasks. It addresses a fundamental challenge in the dominant paradigm of LLM architecture design, has immediate practical applicability (no custom kernels needed), and could influence the rapidly growing hybrid SSM-attention field. Paper 2 offers a useful conceptual reframing of heuristic design but operates in a narrower application domain (combinatorial optimization) with less transformative potential for the broader ML/AI community.

    vs. An Abstract Worlds Semantic Framework for Belief Change Operators
    claude-opus-4.6 · 6/2/2026

    Paper 2 addresses a highly active and competitive area (hybrid language modeling combining attention and SSMs) with a novel 'score-level fusion' paradigm that is practical, implementable with existing infrastructure (single SDPA call), and demonstrates clear empirical improvements. The timeliness is exceptional given the current intense focus on efficient LLM architectures. Paper 1, while rigorous, systematizes existing belief change theory in a mature, niche area of knowledge representation, offering theoretical unification rather than opening new research directions. Paper 2's broader applicability to NLP/AI and its introduction of a new design axis give it substantially higher impact potential.

    vs. The Case for Model Science: Verify, Explore, Steer, Refine
    gpt-5.2 · 6/2/2026

    Paper 2 has higher potential scientific impact because it proposes a unifying research agenda (“Model Science”) that can reshape how the field evaluates, understands, and governs deployed AI systems. Its scope (verify/explore/steer/refine), infrastructure emphasis (catalogues, shared principles), and relevance to current shortcomings (hallucinations, shortcut learning, interpretability) make it broadly applicable across subfields and timely for safety and deployment at scale. Paper 1 is a strong technical contribution with clear empirical gains, but its impact is narrower and contingent on adoption within specific LM architectures.

    vs. AgentPLM: Agentic Protein Language Models with Reasoning-Augmented Decoding for Protein Sequence Design
    gpt-5.2 · 6/2/2026

    Paper 2 likely has higher impact: it introduces an agentic framework for protein design that tightly couples PLMs with biophysical tools and a novel training objective (CAPO) to learn when/why to query feedback. This is highly timely for AI-driven biology and can translate directly to real-world protein engineering (enzymes, antibodies, PPIs), influencing multiple communities (ML, structural biology, drug discovery). Paper 1 is a clever, efficient architectural fusion for LMs, but its primary impact is within model design/efficiency and may face fast-moving competition from adjacent hybrid attention/SSM methods.

    vs. Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
    claude-opus-4.6 · 6/2/2026

    Paper 1 introduces a novel architectural paradigm (score-level fusion) for hybrid language models, addressing a fundamental challenge in the rapidly evolving field of efficient sequence modeling. It offers a clean, implementable solution (single SDPA call, no custom kernels) with strong empirical results across multiple benchmarks. The contribution defines a new design axis beyond existing block-level and head-level hybrids, with broad implications for LLM architecture. Paper 2 applies existing techniques (knowledge graphs, graph attention, temporal modeling) to a narrower educational domain with limited generalizability beyond advanced mathematics courses.

    vs. TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding
    gemini-3.1 · 6/2/2026

    Paper 1 proposes a fundamental architectural innovation by fusing attention and state space models at the score level, addressing core limitations in current foundation model designs. This could broadly influence the architecture of future language models. Paper 2 offers a valuable inference optimization (speculative decoding speedup), but its impact is mostly limited to deployment efficiency rather than foundational model capabilities.

    vs. SHARP: Sleep-based Hierarchical Accelerated Replay for Long Range Non-Stationary Temporal Pattern Recognition
    claude-opus-4.6 · 6/2/2026

    Paper 1 (SISA) addresses a highly active and competitive area—hybrid attention/SSM architectures for language modeling—with a clean, practical innovation (score-level fusion) that requires no custom kernels and integrates into standard SDPA. It defines a new design axis beyond existing block-level and head-level paradigms, with strong empirical results on established benchmarks. Paper 2 (SHARP) proposes an interesting neuroscience-inspired framework for streaming temporal learning, but its contributions are more incremental, validated primarily on smaller-scale benchmarks, and the practical adoption pathway is less clear given the dominance of transformer-based approaches.

    vs. EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction
    gemini-3.1 · 6/2/2026

    Paper 2 offers a foundational architectural innovation for Large Language Models by introducing score-level fusion of State Space Models and Attention. This addresses a critical bottleneck in AI research: combining the global retrieval of Transformers with the sequential efficiency of SSMs without requiring custom kernels. While Paper 1 provides a valuable, domain-specific application of Mamba for energy grids, Paper 2's breakthrough has a vastly wider breadth of impact, directly influencing the design of future foundation models, improving long-context retrieval, and advancing the highly active field of efficient AI architectures.

    vs. Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion
    claude-opus-4.6 · 6/2/2026

    Mind-Omni introduces a fundamentally new paradigm for brain-computer interfaces by unifying seven encoding/decoding tasks through discrete diffusion, creating the first versatile multi-task brain-vision-language framework. Its breadth of impact spans neuroscience, BCI, and AI, with strong real-world applications in neural decoding. While SISA offers a clever architectural contribution (score-level fusion for SSM-attention hybrids), it represents an incremental improvement in language model architecture. Mind-Omni's novelty as a foundation model paradigm for neural activity, combined with its cross-disciplinary impact and practical BCI applications, gives it higher potential scientific impact.

    vs. Subliminal Learning Is Steering Vector Distillation
    gemini-3.1 · 6/2/2026

    Paper 1 introduces a fundamental architectural innovation by fusing SSMs and Attention at the score level, addressing a major bottleneck in hybrid language models. Given the current intense focus on efficient LLM architectures, this approach has high potential for widespread adoption in next-generation foundation models, offering broader practical applications and immediate performance gains compared to the theoretical alignment insights of Paper 2.

    vs. NBQ: Next-Best-Question for Dynamic Profiling
    gpt-5.2 · 6/2/2026

    Paper 1 likely has higher scientific impact: it proposes a novel, low-friction architectural fusion (score-level SSM-informed attention) that can be deployed with standard SDPA and no custom kernels, making adoption easy and broadly relevant to foundation-model training. The method targets a central, timely limitation (prioritization + long-context retrieval) and reports strong gains on established benchmarks (LAMBADA, NIAH) including faster retrieval convergence. Its potential impact spans LLM architecture research, efficiency, and long-context applications. Paper 2 is application-driven and useful, but more domain-specific and less foundational.

    vs. Distilling LLM Feedback for Lean Theorem Proving
    gemini-3.1 · 6/2/2026

    Paper 2 proposes a fundamental architectural innovation by fusing SSMs and Attention at the score level, addressing core limitations in current foundation models. This has broader applicability across all domains relying on language modeling, potentially influencing the design of next-generation LLMs. While Paper 1 presents a valuable post-training method for reasoning, its primary domain (theorem proving) is more specialized, giving Paper 2 a wider breadth of potential scientific impact.