Yang Zhou, Ranajoy Sadhukhan, Zhaofeng Sun, Zhuoming Chen, Souvik Kundu, Saket Dingliwal, Sai Muralidhar Jayanthi, Aram Galstyan
Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long CoT, making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with dense even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.
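The schedule-selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' cost model: it assumes that rollout cost grows with the KV budget, so the cheapest admissible choice in each sequence-length bin is simply the smallest budget whose tail statistic clears the threshold. The function `pick_budgets`, the bin names, the budgets, and every number in `profile` are hypothetical; the 0.86 threshold is the value quoted later in this review, and the statistic is treated here as an agreement score where higher is better.

```python
def pick_budgets(tail_by_bin, threshold):
    """For each sequence-length bin, return the smallest KV budget whose
    tail statistic meets the threshold, or None if no budget does."""
    schedule = {}
    for length_bin, measurements in tail_by_bin.items():
        feasible = [budget for budget, tail in measurements.items()
                    if tail >= threshold]
        # Assumption: a smaller budget is always cheaper, so min() is optimal.
        schedule[length_bin] = min(feasible) if feasible else None
    return schedule


# Hypothetical profiling data: {length bin: {KV budget in tokens: tail statistic}}
profile = {
    "0-8K":    {512: 0.90, 1024: 0.93, 2048: 0.97},
    "8K-16K":  {512: 0.81, 1024: 0.87, 2048: 0.94},
    "16K-37K": {512: 0.70, 1024: 0.83, 2048: 0.90},
}

schedule = pick_budgets(profile, threshold=0.86)
print(schedule)  # {'0-8K': 512, '8K-16K': 1024, '16K-37K': 2048}
```

In this made-up profile the admissible budget grows with sequence length, which is the kind of length-dependent schedule the abstract describes; a real schedule would come from measured mismatch and a measured cost model.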
Sparrow addresses a critical computational bottleneck in RLVR training: the cost of generating long chain-of-thought (CoT) rollouts, which can consume >70% of per-step training time. The paper's central insight is that sparse attention can accelerate rollout generation, but naively applying it creates an actor-policy mismatch that destabilizes training. The key novel contributions are:
The methodology is well-structured around controlled studies. The authors systematically sweep sparsity configurations across four model sizes and multiple sequence-length bins, then identify stability thresholds through careful ablation. Several aspects stand out:
Practical impact is substantial. RLVR training cost is a major barrier for both academic and industrial labs. Achieving 2.0–2.4× rollout speedup (translating to ~1.8–2.1× end-to-end speedup) without performance degradation is highly valuable. The approach is:
Conceptual impact is also noteworthy. The tail-distribution perspective on actor-policy mismatch is a genuinely useful lens that could inform other settings where approximate policies are used (e.g., distilled models as actors, quantized rollout, speculative decoding in RL). The finding that average mismatch is a poor stability predictor while tail statistics are highly informative is an insight that extends beyond sparse attention.
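A small synthetic example shows why an average can hide the tokens that matter. This is a sketch under stated assumptions, not the paper's metric: the per-token agreement score exp(-|log p_sparse - log p_dense|) and the 1% nearest-rank quantile are illustrative choices, and the log-probabilities are invented.

```python
import math


def agreement(logp_sparse, logp_dense):
    """Per-token agreement in (0, 1]; 1.0 means both policies assign the
    sampled token the same probability. (Assumed metric, for illustration.)"""
    return [math.exp(-abs(s - d)) for s, d in zip(logp_sparse, logp_dense)]


def lower_quantile(values, q):
    """Nearest-rank lower quantile, e.g. q=0.01 for the 1st percentile."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Synthetic trajectory: 980 tokens agree exactly, 20 tokens diverge sharply.
logp_dense = [-0.5] * 1000
logp_sparse = [-0.5] * 980 + [-2.5] * 20

scores = agreement(logp_sparse, logp_dense)
mean_score = sum(scores) / len(scores)    # about 0.983, looks healthy
tail_score = lower_quantile(scores, 0.01)  # about 0.135, exposes the outliers

print(f"mean={mean_score:.3f} tail={tail_score:.3f}")
```

With only 2% of tokens diverging, the mean stays near 1 while the lower-tail quantile drops to the score of the divergent tokens, which is the pattern the review attributes to the paper's finding.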
The DistillSparse concept—using already-computed dense logprobs to distill back into the sparse actor via LoRA—is elegant in its zero-marginal-cost design and could inspire similar approaches.
This work is exceptionally timely. The field is rapidly scaling CoT length (from 8K to 100K+ tokens) for reasoning and agentic tasks. The rollout bottleneck will only worsen. The paper explicitly targets thinking models (Qwen3 thinking family) with 37K generation cutoffs, reflecting cutting-edge training practices. The emergence of sparse attention in production models (DeepSeek-V3.2, NSA) further validates the relevance of understanding sparse-dense interactions in training loops.
1. The paper identifies a genuinely important and under-studied problem (sparse rollout stability in RL) and provides both theoretical framing and practical solutions.
2. The tail-distribution insight is well-motivated, empirically validated, and practically actionable.
3. The cost model analysis is thorough and provides a principled optimization framework rather than ad-hoc tuning.
4. The decline in speedup with model size (2.4× for 4B down to 1.48× for 14B) is honestly reported and well explained: attention accounts for a smaller fraction of total cost in larger models, and larger models need higher KV budgets.
5. Open-sourced code and project website enhance reproducibility.
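The relationship between the rollout speedups and the end-to-end speedups quoted in this review can be checked with Amdahl's law. The sketch below assumes that only rollout generation is accelerated and that the rest of the training step is unchanged; pairing 2.0× with 1.8× and 2.4× with 2.1× is also an assumption, since the review gives only the two ranges. The function names are illustrative.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's law: overall speedup when only the rollout share is accelerated."""
    return 1.0 / ((1.0 - rollout_fraction) + rollout_fraction / rollout_speedup)


def implied_rollout_fraction(rollout_speedup, e2e_speedup):
    """Invert Amdahl's law: solve 1/e2e = 1 - f * (1 - 1/s) for f."""
    return (1.0 - 1.0 / e2e_speedup) / (1.0 - 1.0 / rollout_speedup)


# Rollout share of step time implied by the quoted speedup pairs.
low = implied_rollout_fraction(2.0, 1.8)   # about 0.889
high = implied_rollout_fraction(2.4, 2.1)  # about 0.898

# For comparison: a rollout share of exactly 70% with a 2.0x rollout speedup.
baseline = end_to_end_speedup(0.70, 2.0)   # about 1.54

print(f"implied rollout share: {low:.3f} to {high:.3f}; 70% share gives {baseline:.2f}x")
```

Under these assumptions the quoted numbers imply that rollout takes roughly 89 to 90% of step time, which is consistent with the ">70%" figure but considerably higher than its lower bound. The same formula explains why gains shrink whenever the accelerated component is a smaller share of total cost.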
1. Single model family (Qwen3) limits generalizability claims.
2. The 0.86 threshold is a point estimate—no confidence intervals or sensitivity analysis around this value.
3. DistillSparse introduces system complexity (LoRA management during generation/training transitions) that may complicate adoption.
4. The speedup diminishes for larger models, precisely where cost savings matter most.
5. No comparison against concurrent work like SparseRL (Luo et al., 2026) under identical conditions.
6. The paper does not explore how the threshold might shift with different RL algorithms (e.g., GRPO variants, ReMax).
Sparrow makes a meaningful contribution by providing a principled framework for using sparse attention in RL rollouts—an increasingly important practical problem. The tail-mismatch perspective is insightful and the dynamic scheduling + DistillSparse solutions are practical. The main limitations are the restricted experimental scope (one model family) and diminishing returns at larger scales. Nevertheless, the work provides actionable guidance for practitioners and opens useful research directions.
Generated Jun 9, 2026
Paper 2 addresses a critical bottleneck in RLVR training for LLMs—the computational cost of long-context rollout generation—which is highly timely given the rapid scaling of reasoning models. It offers concrete, practical speedups (2.0-2.4x) with principled theoretical grounding via sparse-to-dense mismatch analysis. The breadth of impact is significant as it applies across model scales and RL domains. Paper 1 addresses a meaningful but more niche evaluation problem in conditional generation under compositional shift. While rigorous, its scope and immediate applicability are narrower compared to the broad LLM training community that Paper 2 serves.
Paper 1 addresses a fundamental limitation in LLM distillation by removing the requirement for shared tokenizers. This unlocks the ability to distill knowledge across arbitrary model families (e.g., Llama to Qwen), significantly broadening the applicability of on-policy distillation. While Paper 2 offers valuable efficiency gains for RLVR, Paper 1's contribution has more profound and immediate implications for the broader AI community's ability to train and adapt diverse models.
Paper 1 provides tight theoretical characterizations (VC dimension bounds) for Transformers and chain-of-thought learning, establishing fundamental sample complexity results that are architecture-agnostic and broadly applicable. These foundational results will likely influence theoretical understanding of Transformers for years. Paper 2, while practically useful with its sparse rollout acceleration for RLVR, addresses a more specific engineering optimization problem. Its contributions are tied to current model families (Qwen3) and may become less relevant as architectures evolve. The theoretical rigor and generality of Paper 1 give it broader and more lasting scientific impact.
Paper 1 likely has higher scientific impact due to stronger conceptual novelty and cross-domain breadth: it embeds full GENERIC nonequilibrium thermodynamics (energy conservation + entropy production with exact degeneracy conditions) directly into neural operators, extending structure-preserving learning beyond Hamiltonian/single-law constraints and beyond finite-dimensional systems. The exact-by-construction guarantees and gauge-invariant diagnostics suggest high methodological rigor and potential influence across scientific ML, PDE modeling, and thermodynamics-driven simulation. Paper 2 is timely and practically valuable for efficient RL of LLMs, but is more incremental/engineering-focused and narrower in foundational scientific reach.
Paper 2 likely has higher impact: it tackles a major practical bottleneck in RLVR—long-context rollout cost—via a principled stability-efficiency framework (tail actor-policy mismatch), dynamic sparsity scheduling, and a cost model, delivering consistent ~2x+ speedups across multiple model sizes and domains with some generalization evidence. This improves feasibility and scalability of long-context RL training, benefiting many labs and applications. Paper 1 is novel in implicitly weighting reasoning quality via in-context utility, but its gains appear more incremental and narrower to reasoning-quality supervision rather than broad training efficiency.
Paper 1 addresses a critical computational bottleneck in RLVR training for LLMs—long-context rollout generation—with a principled framework (sparse-to-dense mismatch analysis) that achieves significant speedups (2-2.4x) across multiple model scales. Its practical impact on making RL-based LLM training more efficient is substantial given the field's trajectory. Paper 2 introduces an interesting causal attribution framework for LLM agent failures, but its validation is limited to synthetic settings, and the niche scope (debugging agent failures) limits breadth. Paper 1's methodological rigor, scalability evidence, and relevance to the booming RLVR paradigm give it higher impact potential.
Paper 1 addresses a critical bottleneck in modern LLM development (long-context RL and chain-of-thought training) by introducing a dynamic sparse attention schedule. Given the massive computational costs of LLM training, this approach offers highly relevant and immediately applicable real-world benefits. While Paper 2 provides rigorous theoretical guarantees for data pruning, it primarily evaluates on traditional vision benchmarks, whereas Paper 1's focus on scalable LLM RL has greater timeliness and broader impact across the current AI landscape.
Paper 1 likely has higher scientific impact: it proposes a novel, mechanistic stability criterion (tail per-token sparse-to-dense mismatch) and a dynamic sparsity schedule with demonstrated multi-model and cross-domain generalization, addressing a timely bottleneck in long-context RL for LLMs. The method offers broadly applicable efficiency gains (rollout acceleration) and introduces an additional technique (DistillSparse) to extend sparsity limits. Paper 2 is highly practical for enterprise benchmarking but is more domain/tooling-specific, with narrower cross-field methodological novelty.
Paper 1 addresses a critical computational bottleneck in modern LLM training—efficient reinforcement learning for long-context reasoning. By achieving over 2x speedups for state-of-the-art models, its methods have immediate potential for widespread adoption in both industry and academia. While Paper 2 offers strong theoretical contributions to Deep Gaussian Processes, Paper 1's timely relevance to the rapidly expanding field of LLM reasoning gives it significantly higher potential for broad scientific and real-world impact.
Paper 2 addresses a critical bottleneck in RLVR training—the computational cost of long-context rollouts—with a principled, theoretically grounded approach (sparse-to-dense mismatch analysis). It provides substantial speedups (2.0-2.4x) validated across multiple model scales and domains, with a novel dynamic sparsity schedule and DistillSparse technique. The work is highly timely given the rapid adoption of RLVR and reasoning LLMs. Paper 1 presents a useful but relatively incremental engineering contribution combining LLMs with graph-based query planning, with narrower impact scope and less methodological novelty.