Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen

Jun 10, 2026 · arXiv:2606.11634v1

cs.AI

#1162 of 3489 · Artificial Intelligence

Tournament Score: 1437 ± 49

Win Rate: 69%

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Abstract

The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SWARR - Architecture-Aware RL for Sliding-Window Attention in Math Reasoning

1. Core Contribution

The paper presents SWARR, a two-stage pipeline that (1) converts a pretrained self-attention (SA) transformer to sliding-window attention (SWA) via supervised fine-tuning, then (2) applies reinforcement learning to adapt the model's generation behavior to the SWA constraint. The central empirical finding is that RL substantially closes the performance gap between SWA and SA that persists after SFT alone, changing the practical viability assessment of SWA for math reasoning.

The key insight — framed as "data-architecture mismatch" — is that SFT data are generated assuming full attention and may contain long-range dependencies incompatible with SWA, whereas on-policy RL generates trajectories under the SWA constraint, naturally favoring patterns that work within the limited attention window. This is a clean, intuitive idea that reframes RL not merely as a reward optimizer but as an implicit architecture adapter.
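The SWA constraint mentioned above can be made concrete with a small mask sketch. This is an illustration only, not the paper's implementation: the function name `sliding_window_mask` is hypothetical, and the convention that the window counts the current token is an assumption (some implementations count only preceding tokens).

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask (illustrative sketch).

    Query position i may attend to key position j only if
    j <= i (causal) and i - j < window (local).
    Assumption: the window includes the current token.
    """
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]


if __name__ == "__main__":
    # With window=3, each row allows at most 3 key positions,
    # so anything the model wrote more than 2 tokens back is invisible.
    for row in sliding_window_mask(8, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

Setting `window >= seq_len` recovers ordinary causal full attention, which is one way to see SA as a special case of the same mask. Any dependency in a training trajectory that spans more than `window` tokens falls outside the allowed region, which is the kind of mismatch the hypothesis describes.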

2. Methodological Rigor

The experimental design is reasonably thorough for an empirical paper. The authors conduct:

Fair comparisons across multiple SWA window sizes (2k, 4k, 8k) with both equal-step and equal-time training budgets

Bootstrap confidence intervals over multiple evaluation runs (8-32 repeats per benchmark)

Controlled cross-SFT experiments (Table 4) that isolate the data-architecture mismatch by matching length distributions and keeping only correct trajectories

Locality metrics (probability-based information gap) that provide quantitative evidence for the hypothesis

Ablation studies on conversion strategies (Table 5)

Scale validation at 4B parameters (Table 7)
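For readers unfamiliar with the bootstrap confidence intervals listed above, the following is a minimal percentile-bootstrap sketch over repeated evaluation runs. The paper's exact procedure is not described here, so the resampling unit (whole runs), the number of resamples, and the example accuracies are all assumptions made for illustration.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-run accuracies.

    scores: one accuracy per evaluation run (e.g. 8 to 32 repeats).
    Returns (mean, lower, upper). Illustrative sketch, not the paper's code.
    """
    rng = random.Random(seed)
    n = len(scores)
    # Resample runs with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper


if __name__ == "__main__":
    # Hypothetical per-run accuracies, NOT taken from the paper.
    runs = [0.50, 0.60, 0.55, 0.65, 0.50, 0.60, 0.70, 0.55]
    print(bootstrap_ci(runs))
```

With only 8 repeats the interval can be wide, which is worth keeping in mind when comparing models whose scores differ by a point or two.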

However, several methodological concerns arise:

The "data-architecture mismatch" explanation, while plausible, remains a hypothesis rather than a proven mechanism. RL improves SA models substantially too (SA goes from 48.6 to 65.9), so the improvement isn't solely about architecture adaptation — much of the gain is the well-known benefit of RLVR for reasoning.

The cross-SFT experiment (Table 4) is informative but uses a relatively small 3.3B-token dataset with only correct trajectories, which is a controlled but somewhat artificial setting.

The paper uses a private 42B-token SFT dataset, which limits reproducibility of Stage 1.

The locality metric (Equation 4) uses SA-SFT as a reference model, which introduces a potential confound — the metric measures alignment with SA-SFT's predictions rather than an architecture-independent notion of locality.

3. Potential Impact

Practical significance: If SWA can match SA performance in reasoning tasks, the efficiency gains are substantial — the paper demonstrates ~6.2× throughput improvement at 32k context length. This directly addresses the inference cost bottleneck in reasoning models that generate long chain-of-thought traces.
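A back-of-the-envelope count shows where the saving comes from during token-by-token generation. The sketch below counts only how many key positions attention reads; it is not a throughput model, and the ~6.2× figure is the paper's measurement, not something derived from this count. The function name and the 4k window used in the example are illustrative choices.

```python
def attended_keys_during_decode(num_tokens, window=None):
    """Total key positions read by attention while generating
    num_tokens tokens one at a time (per layer, per head).

    Full attention: step t reads t keys, for a quadratic total.
    Sliding window: step t reads min(t, window) keys, for a linear total.
    """
    if window is None or window >= num_tokens:
        return num_tokens * (num_tokens + 1) // 2
    # The first `window` steps form a growing triangle;
    # every later step is capped at `window` keys.
    return window * (window + 1) // 2 + (num_tokens - window) * window


if __name__ == "__main__":
    n = 32 * 1024
    full = attended_keys_during_decode(n)
    swa = attended_keys_during_decode(n, window=4 * 1024)
    print(full, swa, full / swa)
```

Measured speedups also depend on the non-attention layers, KV-cache memory traffic, and kernel efficiency, so the ratio from this count should not be expected to match a wall-clock number.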

Broader implications:

The finding that effective reasoning doesn't require full global attention challenges assumptions in the field and could redirect architecture design efforts

The "architecture-aware RL" principle could generalize to other efficient architectures (RNNs, hybrid models, sparse attention), though this remains untested

The conversion-from-SA approach avoids expensive pretraining of new architectures, lowering barriers to experimentation

Limitations on impact:

Results are demonstrated only at 1.5B and 4B scales; commercial-scale validation (>100B) is absent

SWA2k still lags significantly even after RL, suggesting the approach has limits for very aggressive efficiency targets

The approach is validated only on math reasoning; the appendix acknowledges challenges for long-context understanding tasks

4. Timeliness & Relevance

This paper is highly timely. The reasoning LLM paradigm (DeepSeek-R1, OpenAI o1/o3) has made long-generation inference a practical bottleneck. Simultaneously, RLVR has become the dominant post-training paradigm. Studying the intersection of efficient architectures and RL-based reasoning training addresses an immediate need. The observation that RL's on-policy nature provides implicit architecture adaptation is a fresh angle on an active area.

5. Strengths & Limitations

Key Strengths:

Clean experimental narrative: SFT leaves a gap → RL closes it → analyses explain why

The cross-SFT experiment (Table 4) provides compelling evidence for the data-architecture mismatch hypothesis

Practical recipe that builds on existing SA checkpoints rather than requiring new pretraining

Efficiency analysis is concrete with real throughput measurements and memory profiling

The paper is honest about limitations (SWA2k still struggles, results may not generalize to other tasks)

Notable Weaknesses:

The contribution is primarily empirical observation rather than architectural or algorithmic innovation — the SWA mechanism and the RL algorithms both already exist

The "architecture-aware" framing of RL is somewhat tautological: on-policy RL always generates from the current model, so it's always "architecture-aware" in this sense

SA-RL-900 also improves dramatically over SA-SFT (48.6→65.9), so RL's benefit isn't unique to SWA. The paper could better decompose how much of SWA's improvement is generic RL benefit vs. architecture-specific adaptation

The gap narrowing is partly an artifact of ceiling effects: as all models approach high accuracy, absolute gaps naturally shrink

Limited to math reasoning; the appendix shows SWA struggles significantly on long-context tasks without additional inference-time methods
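The decomposition asked for in the weaknesses above can be sketched as a simple difference-in-differences. The SA scores (48.6 and 65.9) are the ones quoted in this assessment; the SWA scores in the example are placeholders chosen only to make the arithmetic visible and are NOT results from the paper. The function name is hypothetical.

```python
def decompose_rl_gain(sa_sft, sa_rl, swa_sft, swa_rl):
    """Split the SWA model's RL gain into a generic part (what RL also
    gives the SA model) and an architecture-specific remainder."""
    generic = sa_rl - sa_sft        # RL gain on the SA model
    swa_gain = swa_rl - swa_sft     # RL gain on the SWA model
    return {
        "generic_rl_gain": generic,
        "swa_rl_gain": swa_gain,
        "architecture_specific": swa_gain - generic,
        "gap_before": sa_sft - swa_sft,
        "gap_after": sa_rl - swa_rl,
    }


if __name__ == "__main__":
    # SA numbers are quoted in the assessment; SWA numbers are placeholders.
    result = decompose_rl_gain(sa_sft=48.6, sa_rl=65.9, swa_sft=40.0, swa_rl=62.0)
    for name, value in result.items():
        print(f"{name}: {value:.1f}")
```

Algebraically, the architecture-specific term equals `gap_before - gap_after`, so it is the same quantity as the gap narrowing. This means the ceiling-effect concern applies to it directly: an accuracy-scale difference can shrink simply because both models are closer to the top of the scale.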

Missing Comparisons:

No comparison with other efficient attention methods (sparse attention, hybrid architectures) that could serve as alternative baselines

No comparison with distillation approaches that could also produce architecture-matched training data

Overall Assessment

This is a solid empirical study with a clear and timely message: don't dismiss SWA for reasoning based on SFT results alone, because RL adaptation substantially changes the picture. The findings are well-supported by controlled experiments and provide actionable guidance for practitioners. However, the contribution is primarily observational, the scale is modest, and the underlying mechanism explanation remains at the hypothesis level. The paper advances practical understanding of efficient architectures for reasoning but does not introduce fundamentally new methods.

Rating: 6.2 / 10

Significance 6.5 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 11, 2026

Comparison History (16)

Won vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

Paper 1 addresses a critical bottleneck in LLM deployment (the quadratic scaling of self-attention) by offering a novel RL-based training paradigm to make efficient sliding-window attention viable for rigorous tasks like math reasoning. This structural improvement has broad implications for foundational model training and efficient inference. In contrast, Paper 2 proposes an inference-time prompting strategy tailored to a specific cognitive domain (Theory of Mind), which, while valuable, has a narrower potential scientific and practical impact.

gemini-3.1-pro-preview·Jun 11, 2026

Won vs. R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

Paper 1 addresses a fundamental scalability bottleneck (quadratic attention complexity) in LLM reasoning with a practical, broadly applicable recipe. The finding that RL can bridge the SWA-SA gap has significant implications for efficient long-context inference across many applications. Paper 2, while technically interesting, targets a narrow application domain (planar mechanism synthesis) with a complex multi-component framework. Paper 1's contribution is more likely to influence widespread LLM deployment and efficiency research, affecting a larger community of researchers and practitioners.

claude-opus-4-6·Jun 11, 2026

Won vs. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

Paper 1 addresses a highly practical and timely problem—making efficient attention mechanisms viable for reasoning LLMs—with clear empirical results showing RL can recover SWA performance. This has immediate implications for deploying long-context reasoning models at scale. Paper 2 provides a theoretically interesting decomposition of RLVR reward signals, but its impact is narrower: it's primarily a methodological audit tool for the alignment community. Paper 1's broader applicability to efficient inference, combined with the growing demand for reasoning LLMs, gives it higher potential real-world impact.

claude-opus-4-6·Jun 11, 2026

Won vs. AdMem: Advanced Memory for Task-solving Agents

Paper 2 likely has higher impact: it targets a core scaling bottleneck (quadratic attention) with an actionable, efficient recipe to convert existing SA models to linear-complexity SWA and recover performance via RL. This is timely given long-context demand and offers clear real-world deployment benefits (cheaper inference) and broad relevance across LLM architectures beyond math. Methodologically, it provides a concrete hypothesis (data-architecture mismatch) and an empirical demonstration that RL alters conclusions about SWA viability. Paper 1 is useful for agents, but memory frameworks are crowded and impact may be more incremental/less general.

gpt-5.2·Jun 11, 2026

Won vs. Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Paper 1 addresses a critical bottleneck in large language models—the quadratic computational cost of long-context self-attention. By demonstrating a novel RL-based method to make efficient sliding-window attention competitive in complex mathematical reasoning, it offers immediate, high-impact applications in foundational AI development. Paper 2 presents an interesting HCI study on creativity, but its small sample size and niche gamified setup limit its broader scientific impact compared to the core architectural advancements in Paper 1.

gemini-3.1-pro-preview·Jun 11, 2026

Won vs. Forecasting Future Behavior as a Learning Task

Paper 2 has higher likely scientific impact: it targets a central, timely bottleneck (long-context efficiency) with broad applicability beyond math reasoning (any long-context LLM deployment). The two-stage SA→SWA conversion plus on-policy RL adaptation offers a practical pathway to retrofit existing models, with clear real-world systems benefits (linear attention at higher accuracy). The data–architecture mismatch hypothesis and empirical demonstration that RL shifts conclusions about SWA viability is broadly relevant to architecture-aware training. Paper 1 is novel for interpretability/behavior prediction but has narrower immediate deployment impact.

gpt-5.2·Jun 11, 2026

Won vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 2 is likely higher impact: it addresses a broadly relevant, timely bottleneck (long-context efficiency) with a general recipe (SA→SWA conversion + RL adaptation) that can transfer across LLM reasoning tasks and model families, potentially influencing both research and deployment. Its key insight (data-architecture mismatch and RL as an adaptation mechanism) is conceptually novel and could affect how the community evaluates efficient attention. Paper 1 is strong for autonomous driving scenario mining and competition results, but is more domain-specific and appears more system/engineering-oriented, limiting cross-field breadth.

gpt-5.2·Jun 11, 2026

Won vs. SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

Paper 2 has higher potential impact: it tackles a major scalability bottleneck (quadratic attention) with a broadly applicable, timely recipe (SA→SWA conversion + RL adaptation) that could enable efficient long-context reasoning without retraining from scratch. The architecture-aware RL insight generalizes beyond math to other tasks where data/architecture mismatch matters, potentially influencing model design and training pipelines across NLP and systems. Paper 1 is novel for agent-skill organization evaluation, but its scope is narrower (benchmarking/authoring paradigm effects) and outcome gains are modest and task-dependent.

gpt-5.2·Jun 11, 2026

Won vs. Towards Responsibly Non-Compliant Machines

Paper 1 addresses a critical technical bottleneck in LLMs (long-context efficiency) with a concrete, empirically validated methodology. Its practical applications and methodological rigor offer immediate utility. In contrast, Paper 2 is a conceptual position paper; while discussing an important safety topic, it lacks the concrete technical contributions likely to drive immediate and measurable scientific impact.

gemini-3.1-pro-preview·Jun 11, 2026

Won vs. ReflectiChain: Epistemic Grounding in LLM-Driven World Models for Supply Chain Resilience

Paper 1 addresses a fundamental scalability bottleneck in LLM reasoning (quadratic attention complexity) with a practical, broadly applicable recipe (SWA + RL). Its finding that RL can recover accuracy lost from architectural efficiency changes has wide implications for the entire LLM community working on long-context reasoning. Paper 2, while technically interesting, targets a narrower domain (supply chain resilience) with a complex, specialized framework evaluated on a single synthetic benchmark (10-node network), limiting its generalizability and broader impact.

claude-opus-4-6·Jun 11, 2026
