Kai Liu, Peijie Dong, Xinchen Xie, Jianfei Gao, Qipeng Guo, Xiaowen Chu, Shaoting Zhang, Kai Chen
The rapid progress of reasoning and agentic large language models (LLMs) has increased the demand for long-context inference, but self-attention (SA) scales quadratically with context length. To address this, we study SWARR (Sliding-Window Attention with Reinforced Adaptation for Math Reasoning), a practical recipe for adapting SWA models to mathematical reasoning. SWARR has two stages: (1) efficient conversion from a pretrained SA model to SWA with supervised fine-tuning (SFT), which avoids pretraining a new base model, and (2) policy adaptation with reinforcement learning (RL). We find that SWA still underperforms SA after SFT, and we hypothesize that this gap is caused in part by a data-architecture mismatch: most SFT data are prepared for SA models and may contain long-range dependencies that are difficult for SWA to model. Because on-policy RL optimizes self-generated trajectories under the SWA constraint, it can adapt trajectories to better match SWA. Experiments on mathematical reasoning benchmarks show that this recipe substantially narrows the gap between SWA and SA, recovering much of the accuracy lost during SWA conversion while preserving the efficiency benefits of linear-complexity attention. Our central contribution is the empirical finding that RL changes the conclusion one would draw from conversion and SFT alone about SWA's viability for math reasoning.
The paper presents SWARR, a two-stage pipeline that (1) converts a pretrained self-attention (SA) transformer to sliding-window attention (SWA) via supervised fine-tuning, then (2) applies reinforcement learning to adapt the model's generation behavior to the SWA constraint. The central empirical finding is that RL substantially closes the performance gap between SWA and SA that persists after SFT alone, changing the practical viability assessment of SWA for math reasoning.
The key insight — framed as "data-architecture mismatch" — is that SFT data are generated assuming full attention and may contain long-range dependencies incompatible with SWA, whereas on-policy RL generates trajectories under the SWA constraint, naturally favoring patterns that work within the limited attention window. This is a clean, intuitive idea that reframes RL not merely as a reward optimizer but as an implicit architecture adapter.
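To make the SWA constraint concrete, the sketch below builds the attention mask that a causal sliding window implies and compares it with a full causal mask. It is an illustration only: the function names and the window size are hypothetical, and the paper's actual implementation and window length are not given in this review.

```python
def full_causal_mask(seq_len: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def sliding_window_causal_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal mask in which each query sees only itself and the window - 1 keys before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [i - window < j <= i for j in range(seq_len)]
        for i in range(seq_len)
    ]


if __name__ == "__main__":
    # Hypothetical toy sizes, chosen only so the masks are easy to read.
    for row in sliding_window_causal_mask(seq_len=6, window=3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this mask, a token that depends on something more than `window` positions back cannot attend to it directly within a layer, which is the kind of long-range dependency the mismatch hypothesis says SA-oriented SFT data may contain.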
The experimental design is reasonably thorough for an empirical paper. The authors conduct:
However, several methodological concerns arise:
Practical significance: If SWA can match SA performance in reasoning tasks, the efficiency gains are substantial — the paper demonstrates ~6.2× throughput improvement at 32k context length. This directly addresses the inference cost bottleneck in reasoning models that generate long chain-of-thought traces.
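As a rough illustration of where the efficiency gain comes from, the sketch below counts the query-key pairs scored under full causal attention and under a causal sliding window. The window size (4096) and the reading of "32k" as 32768 tokens are assumptions made for this example, and the pair counts are not meant to reproduce the ~6.2× throughput figure, which also depends on hardware, kernels, and the non-attention parts of the model.

```python
def attended_pairs_full(seq_len: int) -> int:
    """Query-key pairs under full causal attention: 1 + 2 + ... + seq_len."""
    return seq_len * (seq_len + 1) // 2


def attended_pairs_window(seq_len: int, window: int) -> int:
    """Query-key pairs when each query sees at most `window` keys (itself included)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    if seq_len <= window:
        return attended_pairs_full(seq_len)
    # The first `window` queries see a growing prefix; every later query sees exactly `window` keys.
    return attended_pairs_full(window) + (seq_len - window) * window


if __name__ == "__main__":
    WINDOW = 4096  # hypothetical window size, not taken from the paper
    for seq_len in (8192, 16384, 32768):  # assumes "32k" means 32768 tokens
        full = attended_pairs_full(seq_len)
        windowed = attended_pairs_window(seq_len, WINDOW)
        print(f"{seq_len:>6} tokens: full={full:,} windowed={windowed:,} ratio={full / windowed:.2f}")
```

Once the sequence is longer than the window, each additional token adds a fixed `window` pairs under SWA, whereas under full attention it adds as many pairs as there are tokens so far. This is the linear versus quadratic growth the abstract refers to.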
Broader implications:
Limitations on impact:
This paper is highly timely. The reasoning LLM paradigm (DeepSeek-R1, OpenAI o1/o3) has made long-generation inference a practical bottleneck. Simultaneously, reinforcement learning with verifiable rewards (RLVR) has become the dominant post-training paradigm. Studying the intersection of efficient architectures and RL-based reasoning training addresses an immediate need. The observation that RL's on-policy nature provides implicit architecture adaptation is a fresh angle on an active area.
This is a solid empirical study with a clear and timely message: don't dismiss SWA for reasoning based on SFT results alone, because RL adaptation substantially changes the picture. The findings are well supported by controlled experiments and provide actionable guidance for practitioners. However, the contribution is primarily observational, the scale is modest, and the explanation of the underlying mechanism remains a hypothesis. The paper advances practical understanding of efficient architectures for reasoning but does not introduce fundamentally new methods.
Generated Jun 11, 2026
Paper 1 addresses a critical bottleneck in LLM deployment (the quadratic scaling of self-attention) by offering a novel RL-based training paradigm to make efficient sliding-window attention viable for rigorous tasks like math reasoning. This structural improvement has broad implications for foundational model training and efficient inference. In contrast, Paper 2 proposes an inference-time prompting strategy tailored to a specific cognitive domain (Theory of Mind), which, while valuable, has a narrower potential scientific and practical impact.
Paper 1 addresses a fundamental scalability bottleneck (quadratic attention complexity) in LLM reasoning with a practical, broadly applicable recipe. The finding that RL can bridge the SWA-SA gap has significant implications for efficient long-context inference across many applications. Paper 2, while technically interesting, targets a narrow application domain (planar mechanism synthesis) with a complex multi-component framework. Paper 1's contribution is more likely to influence widespread LLM deployment and efficiency research, affecting a larger community of researchers and practitioners.
Paper 1 addresses a highly practical and timely problem—making efficient attention mechanisms viable for reasoning LLMs—with clear empirical results showing RL can recover SWA performance. This has immediate implications for deploying long-context reasoning models at scale. Paper 2 provides a theoretically interesting decomposition of RLVR reward signals, but its impact is narrower: it's primarily a methodological audit tool for the alignment community. Paper 1's broader applicability to efficient inference, combined with the growing demand for reasoning LLMs, gives it higher potential real-world impact.
Paper 2 likely has higher impact: it targets a core scaling bottleneck (quadratic attention) with an actionable, efficient recipe to convert existing SA models to linear-complexity SWA and recover performance via RL. This is timely given long-context demand and offers clear real-world deployment benefits (cheaper inference) and broad relevance across LLM architectures beyond math. Methodologically, it provides a concrete hypothesis (data-architecture mismatch) and an empirical demonstration that RL alters conclusions about SWA viability. Paper 1 is useful for agents, but memory frameworks are crowded and impact may be more incremental/less general.
Paper 1 addresses a critical bottleneck in large language models—the quadratic computational cost of long-context self-attention. By demonstrating a novel RL-based method to make efficient sliding-window attention competitive in complex mathematical reasoning, it offers immediate, high-impact applications in foundational AI development. Paper 2 presents an interesting HCI study on creativity, but its small sample size and niche gamified setup limit its broader scientific impact compared to the core architectural advancements in Paper 1.
Paper 2 likely has higher scientific impact: it targets a central, timely bottleneck (long-context efficiency) with broad applicability beyond math reasoning (any long-context LLM deployment). The two-stage SA→SWA conversion plus on-policy RL adaptation offers a practical pathway to retrofit existing models, with clear real-world systems benefits (linear-complexity attention with much of the lost accuracy recovered). The data-architecture mismatch hypothesis and the empirical demonstration that RL shifts conclusions about SWA viability are broadly relevant to architecture-aware training. Paper 1 is novel for interpretability/behavior prediction but has narrower immediate deployment impact.
Paper 2 is likely higher impact: it addresses a broadly relevant, timely bottleneck (long-context efficiency) with a general recipe (SA→SWA conversion + RL adaptation) that can transfer across LLM reasoning tasks and model families, potentially influencing both research and deployment. Its key insight (data-architecture mismatch and RL as an adaptation mechanism) is conceptually novel and could affect how the community evaluates efficient attention. Paper 1 is strong for autonomous driving scenario mining and competition results, but is more domain-specific and appears more system/engineering-oriented, limiting cross-field breadth.
Paper 2 has higher potential impact: it tackles a major scalability bottleneck (quadratic attention) with a broadly applicable, timely recipe (SA→SWA conversion + RL adaptation) that could enable efficient long-context reasoning without retraining from scratch. The architecture-aware RL insight generalizes beyond math to other tasks where data/architecture mismatch matters, potentially influencing model design and training pipelines across NLP and systems. Paper 1 is novel for agent-skill organization evaluation, but its scope is narrower (benchmarking/authoring paradigm effects) and outcome gains are modest and task-dependent.
Paper 1 addresses a critical technical bottleneck in LLMs (long-context efficiency) with a concrete, empirically validated methodology. Its practical applications and methodological rigor offer immediate utility. In contrast, Paper 2 is a conceptual position paper; while discussing an important safety topic, it lacks the concrete technical contributions likely to drive immediate and measurable scientific impact.
Paper 1 addresses a fundamental scalability bottleneck in LLM reasoning (quadratic attention complexity) with a practical, broadly applicable recipe (SWA + RL). Its finding that RL can recover accuracy lost from architectural efficiency changes has wide implications for the entire LLM community working on long-context reasoning. Paper 2, while technically interesting, targets a narrower domain (supply chain resilience) with a complex, specialized framework evaluated on a single synthetic benchmark (10-node network), limiting its generalizability and broader impact.