Reasoning Can Be Restored by Correcting a Few Decision Tokens

Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, Xiang Wang

May 16, 2026

arXiv:2605.16874v1

cs.AI (primary)

#125 of 2292 · Artificial Intelligence

Tournament Score: 1535 ± 47

Win Rate: 85%

Rating: 7.4 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8.5

Abstract

Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty -- suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab-USTC/RRTokenIntervention.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Reasoning Can Be Restored by Correcting a Few Decision Tokens"

1. Core Contribution

This paper makes two intertwined contributions: an analytical finding and a practical method. The analytical finding is that the performance gap between base LLMs and reasoning-trained models (LRMs) is driven by a surprisingly sparse set of tokens (~8% in Qwen3-0.6B) that are (a) early in generation, (b) enriched 7-17× for planning-related content, and (c) aligned with base-model uncertainty peaks. The practical contribution is a disagreement-guided token intervention scheme: at positions where cross-entropy between the base and reasoning model exceeds a calibrated threshold, the reasoning model generates a single token before control returns to the base model. With ~4-13% token replacement, this recovers 91-157% of the same-size reasoning model's Pass@8 gap.
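The one-token takeover described above can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names (`cross_entropy`, `decode_with_takeover`), the toy three-token vocabulary, and the two step functions are all hypothetical. It also assumes greedy decoding and takes the disagreement score to be the cross-entropy of the base distribution under the reasoning distribution; the paper may use a different direction, divergence, or sampling scheme.

```python
import math

def cross_entropy(p_ref, p_base, eps=1e-12):
    """H(p_ref, p_base) = -sum_v p_ref[v] * log p_base[v] over a shared vocabulary."""
    return -sum(pr * math.log(max(pb, eps)) for pr, pb in zip(p_ref, p_base))

def argmax(p):
    """Index of the most probable token."""
    return max(range(len(p)), key=p.__getitem__)

def decode_with_takeover(base_step, reason_step, prefix, tau, max_new_tokens):
    """Greedy decoding; the reasoning model emits exactly one token whenever
    the disagreement score exceeds tau, then control returns to the base model."""
    tokens = list(prefix)
    takeovers = []  # positions where the reasoning model supplied the token
    for _ in range(max_new_tokens):
        p_base = base_step(tokens)
        p_reason = reason_step(tokens)
        if cross_entropy(p_reason, p_base) > tau:
            tokens.append(argmax(p_reason))
            takeovers.append(len(tokens) - 1)
        else:
            tokens.append(argmax(p_base))
    return tokens, takeovers

# Toy stand-ins for the two models (assumption: 3-token vocabulary).
def base_step(tokens):
    return [0.1, 0.8, 0.1]

def reason_step(tokens):
    # The models disagree only at the first generated position.
    return [0.05, 0.05, 0.90] if len(tokens) == 1 else [0.1, 0.8, 0.1]

tokens, takeovers = decode_with_takeover(base_step, reason_step, [0], tau=1.0, max_new_tokens=3)
print(tokens, takeovers)  # [0, 2, 1, 1] [1]
```

Note that both models are queried at every step, which is the dual-forward-pass cost the review raises later.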

The key insight — that reasoning capability is not a diffuse property but concentrates at discrete planning decision points — is conceptually clean and carries implications beyond the specific intervention mechanism proposed.

2. Methodological Rigor

The analytical methodology is sound. The use of Lorenz curves and Gini coefficients (average G=0.936) to quantify disagreement sparsity is well-chosen and interpretable. The positional analysis (Figure 2b), IoU overlap between disagreement and entropy spikes (Figure 2c), and planning enrichment analysis (Table 1) triangulate the core claim from multiple angles.
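For readers unfamiliar with the two statistics mentioned above, the following sketch shows one common way to compute them. The helper names `gini` and `iou` are hypothetical, and the estimator shown (the sorted-rank form of the Gini coefficient on non-negative per-token scores) is an assumption; the paper may use a different variant.

```python
def gini(values):
    """Gini coefficient of non-negative scores: 0 means mass is spread evenly,
    values near 1 mean a few entries carry almost all of the mass."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending scores (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n

def iou(positions_a, positions_b):
    """Intersection-over-union of two sets of token positions."""
    a, b = set(positions_a), set(positions_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(gini([1.0, 1.0, 1.0, 1.0]))  # 0.0
print(gini([0.0, 0.0, 0.0, 1.0]))  # 0.75
print(iou({1, 2, 3}, {2, 3, 4}))   # 0.5
```

With this finite-sample estimator the maximum for n scores is (n - 1)/n, so a value such as 0.936 indicates that a small fraction of positions carries most of the disagreement mass.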

The intervention mechanism uses a principled two-part gate: a global threshold τ calibrated as the (1-r)-quantile of disagreement scores on a held-out set, combined with a local sliding-window ratio test to suppress spurious triggers. The calibration procedure (Algorithm 1) is clearly specified and reproducible.
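The two-part gate can be illustrated as follows. This is a sketch under stated assumptions, not Algorithm 1 from the paper: `calibrate_tau`, `should_intervene`, and `flag_positions` are hypothetical names, the quantile is a simple empirical one, and the local test (score at least `ratio` times the mean of the previous `window` scores) is only one plausible reading of "sliding-window ratio test".

```python
from collections import deque

def calibrate_tau(calibration_scores, r):
    """Empirical (1 - r)-quantile: roughly a fraction r of held-out scores exceed tau."""
    xs = sorted(calibration_scores)
    n = len(xs)
    k = int(r * n)                 # number of scores allowed above the threshold
    return xs[max(n - k - 1, 0)]

def should_intervene(score, recent, tau, ratio=2.0):
    """Global test (score > tau) AND local test (score stands out from recent scores)."""
    if score <= tau:
        return False
    if not recent:
        return True
    local_mean = sum(recent) / len(recent)
    return score >= ratio * local_mean

def flag_positions(scores, tau, window=4, ratio=2.0):
    """Positions in a score sequence that pass both parts of the gate."""
    recent = deque(maxlen=window)  # sliding window of previous scores
    flagged = []
    for t, s in enumerate(scores):
        if should_intervene(s, recent, tau, ratio):
            flagged.append(t)
        recent.append(s)
    return flagged

tau = calibrate_tau([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], r=0.25)
print(tau)                                                  # 6.0
print(flag_positions([0.5, 0.5, 5.0, 3.5, 0.5], tau=3.0))   # [2]
```

In the second call, position 3 clears the global threshold but is suppressed by the local test because it immediately follows a larger spike, which is the kind of spurious trigger the local gate is meant to filter.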

Strengths in experimental design: The paper includes important controls — random replacement and early-only baselines (Table 4) decisively show that position selection, not mere strong-model injection, drives gains. The flip analysis (Table 14: 152 error→correct vs. 3 correct→error) provides compelling evidence that intervention is overwhelmingly constructive. Cross-family generalization (LLaMA pair, Appendix C.5) and cross-domain testing (GPQA-Diamond, Appendix C.6) strengthen generalizability claims.

Weaknesses: The planning/execution token classification relies on a heuristic keyword-matching approach (Appendix B.2), which is acknowledged as coarse. This is a soft limitation — the enrichment ratios are large enough (7-17×) that classification noise is unlikely to invalidate the finding, but a more sophisticated classifier would strengthen the claim. The base model (Qwen3-0.6B) is small; whether the same sparsity structure holds at larger scales (e.g., 70B base models) remains unaddressed. The intervention requires running both models at every step to compute disagreement, making it diagnostically valuable but computationally impractical — though the entropy-only variant (Appendix C.3) partially addresses this.
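To make the enrichment figure concrete, the sketch below computes an enrichment ratio from a keyword-based planning classifier. Everything here is illustrative: the cue list `PLANNING_CUES` is invented for the example and is not the keyword set from Appendix B.2, and the ratio is assumed to be the planning rate among high-disagreement tokens divided by the planning rate among all tokens.

```python
import math

# Hypothetical cue words; the paper's actual keyword list is not reproduced here.
PLANNING_CUES = {"first", "let", "instead", "alternatively", "wait"}

def is_planning(token):
    """Coarse keyword heuristic: a token counts as planning if it matches a cue word."""
    return token.strip().lower() in PLANNING_CUES

def enrichment(tokens, flagged_positions):
    """Planning rate among flagged tokens divided by planning rate among all tokens."""
    flagged = set(flagged_positions)
    if not tokens or not flagged:
        return math.nan
    base_rate = sum(is_planning(t) for t in tokens) / len(tokens)
    if base_rate == 0:
        return math.nan
    flagged_rate = sum(is_planning(tokens[i]) for i in flagged) / len(flagged)
    return flagged_rate / base_rate

tokens = ["First", "compute", "x", "then", "add", "y", "Instead", "use"]
print(enrichment(tokens, [0, 6]))  # 4.0
print(enrichment(tokens, [0, 1]))  # 2.0
```

Because the classifier is a plain keyword match, misclassified tokens shift both rates, which is why the review treats the reported ratios as robust only to the extent that they are large.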

3. Potential Impact

Theoretical impact: The "sparse control" view of reasoning — that a few early planning commitments steer entire trajectories — provides a concrete, testable framework for understanding what reasoning post-training actually changes. This connects to and extends concurrent work on latent reasoning capabilities in base models, activation steering, and the "echo chamber" hypothesis (Zhao et al., 2025). It offers a token-level mechanistic explanation complementary to parameter-level analyses (e.g., reasoning subspaces).

Practical impact: The intervention scheme demonstrates that small, targeted corrections can substitute for expensive full reasoning model deployment. This has immediate implications for: (1) efficient inference routing — using a small model for most tokens and a large one sparingly; (2) distillation — identifying which tokens to emphasize during knowledge transfer; (3) RLVR training — focusing policy gradient updates on the tokens that matter most, extending Wang et al. (2025a)'s high-entropy token insight.

Adjacent fields: The finding that autoregressive generation has sparse "steering points" resonates with control theory and planning in sequential decision-making, potentially influencing how we think about agentic LLM systems, code generation planning, and multi-step reasoning architectures more broadly.

4. Timeliness & Relevance

This paper arrives at an opportune moment. The community is grappling with the cost of reasoning models (o1, R1, etc.) and the question of what post-training actually teaches. The paper directly addresses a current bottleneck: understanding *where* reasoning capability manifests in token generation. The finding that base models already "know how to execute" but fail at "planning under uncertainty" provides a practical and theoretically grounded perspective on the latent capability hypothesis that several groups are pursuing simultaneously.

5. Key Strengths & Limitations

Strengths:

Clean, well-operationalized research question with quantitative answers

Multiple converging lines of evidence (sparsity, position, uncertainty, enrichment, predictiveness)

Strong controls ruling out trivial explanations (random/early-only baselines, flip analysis)

The recovery percentage exceeding 100% (surpassing same-size thinking model) is a striking result

Reproducible: code released, calibration procedure fully specified
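On the recovery percentage cited in the strengths above: the review does not spell out the formula, but a natural reading, stated here as an assumption, is the share of the base-to-reasoning accuracy gap that the intervened base model closes. With $A_{\text{base}}$, $A_{\text{reason}}$, and $A_{\text{int}}$ denoting Pass@8 of the base model, the same-size reasoning model, and the base model with intervention:

```latex
\[
  \text{Recovery} \;=\;
  \frac{A_{\text{int}} - A_{\text{base}}}{A_{\text{reason}} - A_{\text{base}}}
  \times 100\%,
  \qquad A_{\text{reason}} > A_{\text{base}} .
\]
```

Under this definition, a recovery above 100% is equivalent to $A_{\text{int}} > A_{\text{reason}}$, that is, the intervened base model outscoring the reasoning model it borrows tokens from.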

Limitations:

Scale limited to 0.6B-8B range; unclear if sparsity holds for frontier-scale models

Math-centric evaluation; GPQA results (Appendix C.6) are on only 50 questions

The diagnostic intervention requires dual-model forward passes at every step

Heuristic planning/execution classification

No analysis of what happens in the reasoning model's hidden states at these positions — the paper characterizes *where* but not *how* the reasoning model generates better planning tokens

Missing comparisons: The paper does not compare against speculative decoding or other collaborative inference methods (RelayLLM, RouteLLM) under matched compute budgets, which would contextualize the practical utility more precisely.

Overall Assessment

This is a well-executed analysis paper with a clean central finding that advances our understanding of the base-reasoning gap. The sparsity of the reasoning advantage is convincingly demonstrated and the intervention experiments provide a compelling "proof by construction" that correcting planning tokens recovers reasoning capability. The main limitations are scope (model scale and task diversity) rather than methodological flaws.

Rating: 7.4 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8.5

Generated May 19, 2026

Comparison History (20)

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

claude-opus-4.6 · 5/20/2026

Paper 1 offers a highly novel and actionable insight: the reasoning gap between base and reasoning LLMs is concentrated in a sparse set of early 'decision tokens.' This finding has immediate practical implications for efficient inference (delegating only ~8% of tokens to a stronger model), directly addresses the timely and high-impact area of LLM reasoning, and provides a simple yet effective method with broad applicability. Paper 2 makes a solid theoretical contribution distinguishing volatility from stochasticity in exploration, but its scope is narrower (bandits/computational psychiatry) and less immediately transformative for the broader AI community.

vs. NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

claude-opus-4.6 · 5/19/2026

Paper 2 offers a more fundamental scientific insight: reasoning capability differences between base and reasoning LLMs are concentrated in a sparse set of early 'decision tokens.' This finding is both novel and actionable, leading to an elegant inference-time intervention that achieves reasoning-model performance with minimal overhead. The mechanistic understanding of where reasoning fails has broad implications for model training, interpretability, and efficiency. While Paper 1 presents an interesting architectural framework for multi-agent systems, Paper 2's discovery is more surprising, empirically clean, and immediately applicable across the LLM community.

vs. MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

gpt-5.2 · 5/19/2026

Paper 1 is more novel and broadly impactful: it offers a general, token-level diagnosis of the base–reasoning gap and introduces a lightweight, inference-time intervention that can recover/surpass reasoning performance with minimal compute—relevant to many LLM/LRM settings and future model efficiency work. Its methodology (distributional divergences, sparsity analysis, targeted delegation) is conceptually reusable across tasks and models. Paper 2 is strong in real-world applicability and deployment evidence, but is more systems/engineering-oriented and domain-specific (enterprise invoices), with limited methodological novelty and narrower cross-field impact.

vs. Self-supervised Hierarchical Visual Reasoning with World Model

gpt-5.2 · 5/19/2026

Paper 1 offers a novel, crisp mechanism-level finding (reasoning gains are sparse and early/planning-token concentrated) plus a simple, general inference-time intervention that can recover/surpass reasoning-model performance with minimal compute. This is timely given widespread deployment cost constraints and could impact LLM inference, routing, distillation, and interpretability across many tasks. Paper 2 is promising for model-based RL, but hierarchical residual world models are closer to an incremental architectural advance in a narrower domain, with impact depending on robustness across diverse 3D environments and baselines.

vs. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

gpt-5.2 · 5/19/2026

Paper 2 is likely higher impact due to a clearer, broadly applicable insight (reasoning gains are sparse and early/planning-heavy) plus a simple, efficient inference-time method (token-level delegation) that can recover or exceed reasoning-model performance with minimal extra compute. This is timely and practical for deployment, cost reduction, and interpretability, and can generalize across models and tasks without complex training loops. Paper 1 is innovative and rigorous but more specialized (RLVR + population LoRA self-play) and heavier to reproduce/deploy, potentially narrowing immediate adoption.

vs. Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

claude-opus-4.6 · 5/19/2026

Paper 1 presents a novel and rigorous analysis of where base language models fail compared to reasoning models, discovering that the reasoning gap is highly sparse and concentrated in early planning-related tokens. The proposed disagreement-guided token intervention is both theoretically insightful and practically impactful, offering an efficient inference-time method to recover reasoning performance. This has broad implications for understanding and improving LLM reasoning. Paper 2 describes an engineering framework (PuppyChatter) for LLM API abstraction, which is primarily a software tooling contribution with limited scientific novelty or broad research impact.

vs. XDecomposer: Learning Prior-Free Set Decomposition for Multiphase X-ray Diffraction

gpt-5.2 · 5/19/2026

Paper 1 targets a long-standing, high-friction scientific bottleneck (multiphase PXRD) with a prior-free set-prediction formulation that removes key practical constraints (no candidate lists, templates, or known phase count) and enforces physics-consistent reconstruction. This combination of novelty and direct applicability could materially accelerate materials discovery and characterization workflows across chemistry, physics, and materials science. Paper 2 is timely and clever for efficient inference, but its impact may be narrower and more incremental within LLM engineering, and depends on access to a stronger reasoning model at inference time.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel empirical finding—that reasoning advantages concentrate in sparse 'decision tokens'—with a concrete, validated intervention method that recovers reasoning performance efficiently. This offers immediate practical applications (cheaper inference via selective delegation), strong methodological rigor with reproducible code, and broad relevance to the LLM reasoning community. Paper 1, while intellectually interesting, is a position paper proposing a theoretical architecture without implementation or empirical validation, and its impact depends on future work to address its own stated open problems.

vs. State-Centric Decision Process

gpt-5.2 · 5/19/2026

Paper 2 has higher estimated impact: it introduces a general runtime formalism (SDP) that supplies missing MDP structure in text-only environments via certified predicates, enabling broader methodological and analytical advances (state construction, transition certification, termination criteria, credit assignment, failure localization). Its applicability spans many agent settings (web, tools, scientific exploration), making it more cross-field and timely for LLM agent evaluation/training. Paper 1 is novel and practical for efficient reasoning gains, but is narrower (token-level delegation between two models) and depends on access to a stronger “reasoning model.”

vs. Voices in the Loop: Mapping Participatory AI

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel and rigorous mechanistic insight into how reasoning models differ from base LLMs, identifying that the gap concentrates on sparse early 'decision tokens.' The practical intervention method—delegating only ~8% of tokens to a reasoning model—offers an efficient, actionable technique with broad implications for LLM deployment, efficiency, and interpretability. This combines fundamental understanding with immediate practical utility. Paper 1, while valuable as a resource for participatory AI mapping, is primarily a cataloging/framework contribution with narrower methodological novelty and more limited cross-field impact.

vs. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

gemini-3.1 · 5/19/2026

Paper 2 provides fundamental insights into the mechanics of LLM reasoning, revealing that reasoning advantages stem from a sparse set of early planning tokens. Its proposed inference-time intervention offers a highly efficient, generalizable method to boost base model performance across various tasks. While Paper 1 presents a valuable memory architecture for specific long-horizon scientific agents, Paper 2's findings on the base-reasoning gap have broader applicability and address core theoretical and practical challenges in LLM reasoning and efficiency.

vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

claude-opus-4.6 · 5/19/2026

Paper 2 offers a more fundamental scientific insight—that reasoning failures in LLMs concentrate at sparse, early 'decision tokens'—which is both novel and actionable. The finding that ~8% of tokens account for the base-reasoning gap, and that targeted single-token interventions can recover reasoning performance, is a clean, surprising result with broad implications for efficient inference, model understanding, and distillation. Paper 1 addresses an important governance problem by applying formal methods (LTL) to LLM monitoring, but is more incremental in combining known techniques. Paper 2's mechanistic insight is likely to inspire more follow-up research across the field.

vs. Unlocking LLM Creativity in Science through Analogical Reasoning

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a fundamentally new paradigm (analogical reasoning) for LLM-driven scientific discovery with validated real-world biomedical applications achieving state-of-the-art results across multiple domains. Its breadth of impact spans AI methodology and multiple scientific fields. Paper 2, while offering elegant mechanistic insights about reasoning token sparsity and a practical inference-time intervention, addresses a more narrowly scoped technical problem within LLM reasoning. Paper 1's novelty in connecting cognitive science concepts to AI-driven science and its demonstrated cross-domain applicability give it higher potential for broad scientific impact.

vs. CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical bottleneck in large language models by revealing that reasoning failures stem from a few early planning tokens. Its proposed intervention method offers highly efficient inference for reasoning tasks, presenting immediate, broad applicability and high citation potential across the rapidly expanding AI field. While Paper 1 presents an innovative approach to brain-computer interfaces, its immediate impact is restricted to the narrower domain of neurotechnology.

vs. From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical challenge in modern AI—understanding and improving the reasoning capabilities of LLMs. By introducing a highly efficient inference-time token intervention strategy, it offers broad applicability and significant computational benefits for deploying large models. In contrast, Paper 2 focuses on a specific card game using shallow reinforcement learning, which has narrower applicability and lower overall relevance to current major AI trends. Thus, Paper 1 promises broader real-world applications and higher timeliness.

vs. AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it identifies a broadly applicable, mechanistically interpretable phenomenon (reasoning gaps concentrate in a few early planning tokens) and introduces a simple, practical inference-time intervention that can yield large gains with minimal compute. This is timely for efficient deployment and can influence model editing, routing/mixture-of-experts, interpretability, and system design across many LLMs. Paper 1 is innovative but more complex and RLVR-specific, with higher implementation/training overhead and narrower applicability, making downstream adoption and cross-field impact less certain.

vs. How Much is Brain Data Worth for Machine Learning?

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and critical problem in AI—improving LLM reasoning efficiency. Its discovery that reasoning gaps stem from a few early planning tokens and its proposed low-budget intervention method offer immediate, practical applications that could significantly reduce compute costs. While Paper 2 provides an interesting theoretical foundation for the niche field of NeuroAI, Paper 1 has broader applicability, greater immediate relevance to the booming field of large language models, and a more straightforward path to real-world deployment.

vs. Learning to Solve Compositional Geometry Routing Problems

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and impactful challenge in AI: improving the reasoning capabilities and inference efficiency of Large Language Models. Its finding that reasoning gaps concentrate on a few early planning tokens, along with the proposed sparse intervention method, offers significant theoretical insights and broad, real-world applications in deploying efficient AI systems. Paper 2 is methodologically strong but targets a more specialized niche in routing and combinatorial optimization, resulting in a narrower potential scientific impact.

vs. A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

claude-opus-4.6 · 5/19/2026

Paper 1 presents a novel mechanistic insight into why base LLMs fail at reasoning compared to reasoning models, showing the gap is concentrated in sparse 'decision tokens.' The proposed intervention method is elegant and efficient, with broad implications for LLM research, interpretability, and efficient deployment. It addresses a timely, high-impact topic (LLM reasoning) relevant across AI/ML. Paper 2, while practically useful, applies an existing technique (Noise2Noise) to Raman spectroscopy denoising—a more incremental, domain-specific contribution with narrower impact scope.

vs. ScreenSearch: Uncertainty-Aware OS Exploration

claude-opus-4.6 · 5/19/2026

Paper 1 offers a novel, rigorous analysis revealing that reasoning failures in LLMs concentrate on sparse early 'decision tokens,' providing both mechanistic insight and a practical inference-time intervention that recovers reasoning performance with minimal compute. This finding has broad implications for understanding and improving LLMs across the field. Paper 2 addresses an interesting but narrower problem (desktop GUI exploration), with contributions more incremental in nature—combining existing retrieval and bandit techniques. Paper 1's insights are more fundamental, widely applicable, and timely given the centrality of LLM reasoning research.