Tengyao Tu, Yulin Li, Hui-Ling Zhen, Libo Qin, Zhoujun Wei, Jinghua Piao, Zhuotao Tian, Yong Li
Recent advances in Large Reasoning Models (LRMs) demonstrate remarkable performance improvements by iteratively reflecting, exploring, and executing complex tasks, yet suffer from inefficiencies due to redundant reasoning, known as "overthinking". Existing methods to mitigate this issue either rely on static difficulty estimates or require task-specific training, and thus fail to adapt to the dynamic complexity during reasoning. In this work, we empirically show that the problem difficulty evolves dynamically throughout the reasoning process and is linearly encoded in the LRM's step-level embeddings. Building on this insight, we propose DyCon, a training-free framework that leverages latent step-level representations to explicitly model the evolving task difficulty, enabling the dynamic control of reasoning depth to mitigate the overthinking issue. Extensive experiments conducted on four models ranging from 4B to 32B, and across twelve benchmarks in math reasoning, general question answering, and coding tasks demonstrate that DyCon significantly enhances reasoning efficiency by reducing redundant steps without sacrificing accuracy or generalization. Project page and code are available at https://github.com/yu-lin-li/DyCon.
DyCon addresses the "overthinking" problem in Large Reasoning Models (LRMs), where models generate redundant reasoning steps even for simple problems. The key insight is twofold: (1) problem difficulty evolves dynamically during reasoning rather than being static, and (2) this evolving difficulty is linearly encoded in step-level hidden representations. Building on these observations, the authors propose a training-free framework that fits a lightweight ridge regressor on step embeddings to estimate difficulty at each reasoning step, then modulates logits of reflection-related tokens (e.g., "wait," "alternatively," "hmm") to suppress unnecessary deliberation when difficulty is low while preserving deep reasoning when difficulty is high.
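The control loop described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token ids, the `threshold` and `strength` parameters, and the linear penalty rule are all assumptions made for illustration. The source states only that a linear probe reads difficulty from step embeddings and that reflection-token logits are softly modulated.

```python
# Hypothetical reflection-token vocabulary (token string -> token id).
# The ids are invented for this sketch; a real tokenizer would supply them.
REFLECTION_TOKEN_IDS = {"wait": 1, "alternatively": 3, "hmm": 4}


def predict_difficulty(step_embedding, weights, bias):
    """Linear probe: difficulty = w . h + b for one step embedding h."""
    return sum(w * h for w, h in zip(weights, step_embedding)) + bias


def modulate_logits(logits, difficulty, threshold=0.5, strength=5.0):
    """Return a copy of `logits` with reflection tokens softly suppressed.

    Assumed rule: the penalty grows linearly as the estimated difficulty
    drops below `threshold`, and is zero at or above it, so deliberation
    is left untouched on steps the probe considers hard.
    """
    penalty = strength * max(0.0, threshold - difficulty)
    out = list(logits)
    for token_id in REFLECTION_TOKEN_IDS.values():
        out[token_id] -= penalty
    return out


# Toy vocabulary of six tokens.
logits = [0.0, 2.0, 0.0, 1.0, 3.0, 0.5]
easy_step = modulate_logits(logits, difficulty=0.1)  # reflection tokens pushed down
hard_step = modulate_logits(logits, difficulty=0.9)  # logits left unchanged
```

Because the adjustment is a subtraction on logits rather than a hard mask, a reflection token can still be sampled on an easy step if the model assigns it a large enough score, which is the difference between soft modulation and a hard early exit.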
The novelty lies in the combination of dynamic (rather than static) difficulty estimation, the use of latent representations (rather than handcrafted heuristics), and soft logit modulation (rather than hard early-exit). The approach is notably simple—a linear regressor fitted on 600 samples with log-transformed remaining length as the supervision signal—yet effective across diverse settings.
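The fitting step can be sketched in pure Python as follows. This is an illustrative reconstruction under stated assumptions, not the paper's code: the exact log transform (`log(1 + remaining)` here), the regularization strength `lam`, the unpenalized bias term, and all function names are choices made for this example.

```python
import math


def log_remaining_targets(step_lengths):
    """For each step, return log(1 + number of tokens remaining after it)."""
    total = sum(step_lengths)
    targets, consumed = [], 0
    for n in step_lengths:
        consumed += n
        targets.append(math.log1p(total - consumed))
    return targets


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(features, targets, lam=1.0):
    """Closed-form ridge regression; returns (weights, bias).

    Solves (X^T X + lam * D) w = X^T y, where X has a constant column
    appended for the bias and D is the identity except for a zero in the
    bias position, so only the weights are shrunk.
    """
    d = len(features[0])
    rows = [list(x) + [1.0] for x in features]
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(d + 1)]
            for i in range(d + 1)]
    for i in range(d):
        gram[i][i] += lam
    rhs = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(d + 1)]
    solution = solve_linear(gram, rhs)
    return solution[:d], solution[d]


# Toy example: three reasoning steps of 10, 20, and 30 tokens.
targets = log_remaining_targets([10, 20, 30])
```

In practice each row of `features` would be a step-level hidden state from the model, and the roughly 600 fitting samples mentioned above would supply the rows; the toy call shows only how the supervision signal is derived from step lengths.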
Practical applications: DyCon is immediately deployable—it requires no model retraining, no architectural changes, and only 600 fitting samples. The framework is compatible with existing inference engines (HuggingFace, vLLM). Token reductions of 10–40% translate directly to reduced inference costs and latency, which is highly relevant for production LRM deployment.
Broader influence: The finding that difficulty is linearly encoded in step embeddings is potentially impactful beyond efficient reasoning. It could inform curriculum learning, adaptive compute allocation, model routing (assigning easy queries to smaller models), and difficulty-aware data curation. The cross-domain transferability (math-fitted regressor working on GPQA, coding) suggests these difficulty representations may be universal features of reasoning models.
Limitations for impact: The method specifically targets overthinking in reasoning-style models (those that emit `<think>` tokens). Its applicability to standard instruction-tuned models is limited (Appendix B.13). The reliance on predefined reflection-token vocabularies, while shown to be robust, adds a manual component.
The paper is highly timely. The explosion of reasoning models (DeepSeek-R1, QwQ, Qwen3-Thinking, o1) has created an acute need for inference efficiency. Overthinking is widely recognized as a key bottleneck—models routinely generate 10K–20K tokens for problems solvable in 2K. DyCon addresses this with a practical, deployable solution rather than requiring expensive retraining, which positions it well for near-term adoption.
The paper's appendix (35+ pages) is remarkably thorough, covering System 1/2 framing, cross-lingual analysis, regressor refinement, GRU-based alternatives, and more. The iterative refinement analysis (Appendix B.10) revealing that excessive refinement hurts performance is a valuable practical insight. The code availability enhances reproducibility.
Generated Jun 8, 2026
DyCon addresses a fundamental and broadly applicable problem in LLM reasoning efficiency (overthinking) with a training-free, model-agnostic framework validated across 4 models and 12 benchmarks spanning multiple domains. Its discovery that difficulty is linearly encoded in step-level embeddings is a novel mechanistic insight with broad implications. Lung-R1, while valuable, is domain-specific (pulmonary diagnosis) with narrower applicability. DyCon's generalizability, theoretical insight about dynamic difficulty evolution, and practical utility across the rapidly growing LRM ecosystem give it higher potential for widespread adoption and cross-field impact.
HERO addresses a fundamental challenge in multi-turn agent training—credit assignment and feedback alignment—with a novel hindsight-enhanced self-distillation framework. Its contribution is more foundational, tackling core issues in agentic RL that affect a broad range of applications. DyCon addresses reasoning efficiency (overthinking), which is practically useful but more incremental; it builds on existing observations about redundant reasoning and proposes a training-free intervention. HERO's methodological innovation (converting observations into turn-level diagnoses for self-distillation) opens new research directions for agent training, giving it higher potential impact.
Paper 2 addresses a highly timely and critical bottleneck in Large Reasoning Models—inference inefficiency or 'overthinking'. By proposing a training-free method to dynamically control reasoning depth using latent embeddings, it offers immediate, practical improvements to inference-time compute scaling across multiple tasks. While Paper 1 presents a strong theoretical reframing of RLHF, Paper 2's broad applicability, computational efficiency gains, and direct relevance to current trends in inference-time reasoning give it a higher potential for widespread scientific and practical impact.
DyCon addresses a fundamental and widely-recognized problem (overthinking in LRMs) with a novel, training-free approach grounded in an empirical insight about difficulty being linearly encoded in step-level embeddings. Its broad evaluation across 4 models, 12 benchmarks, and multiple task domains demonstrates strong generalizability. The finding that problem difficulty evolves dynamically and is linearly decodable is a significant scientific contribution with broad implications for efficient inference. While Claw-Eval is a solid benchmark contribution addressing real evaluation gaps for autonomous agents, benchmarks tend to have more transient impact compared to fundamental methodological insights about reasoning efficiency that can be widely adopted.
Paper 1 has higher likely impact: it addresses a widely observed, timely problem (LLM/LRM overthinking) with a training-free, model-agnostic method validated across multiple model scales and 12 benchmarks, suggesting broad applicability and easier adoption. Its core claim—difficulty evolution encoded in step embeddings—could influence future inference-time control and efficiency work across tasks (math, QA, coding). Paper 2 is innovative but narrower (legal evidence selection), relies on complex parsing/optimization pipelines and specialized hardware evaluation, and its benefits may depend strongly on domain “contamination” assumptions, limiting generalizability.
Paper 1 addresses a critical and highly timely challenge in foundational Large Language Models (inference-time reasoning efficiency and 'overthinking') with a training-free method. Its impact spans multiple broad domains like math, coding, and general QA. In contrast, Paper 2, while methodologically rigorous, focuses on a niche application (Traditional Chinese Medicine), significantly limiting its broader scientific and interdisciplinary impact compared to core LLM advancements.
Paper 1 addresses a fundamental AI safety concern—the ability of frontier models to reason without chain-of-thought, which could undermine oversight mechanisms. It introduces novel metrics (time horizons, reasoning token horizons), provides large-scale empirical measurements across 43 benchmarks, and offers actionable forecasts for the AI safety community. Its implications span AI governance, policy, and alignment research. Paper 2, while practically useful for efficiency optimization, addresses a narrower technical problem (overthinking in LRMs) with incremental methodology. Paper 1's broader safety implications and policy relevance give it higher potential impact.
DyCon addresses a broadly relevant problem (LLM reasoning efficiency) affecting the rapidly growing field of large reasoning models. It offers a training-free, generalizable framework validated across 4 models and 12 benchmarks, with immediate practical applications for reducing computational costs. The insight that difficulty is linearly encoded in step-level embeddings is novel and could inspire further research. Paper 2, while solid, addresses a narrower task (audio-visual event localization) with more incremental contributions combining known techniques (hyperbolic embeddings, graph networks), limiting its broader impact.
Paper 2 addresses a highly timely and critical issue in modern Large Reasoning Models—inference inefficiency or 'overthinking'. Its training-free approach using dynamic difficulty modeling from latent representations offers an elegant, computationally efficient solution. This broadens its applicability across various domains like math, QA, and coding. While Paper 1 presents a solid methodological improvement for tool-use agents, Paper 2's potential to significantly reduce inference costs for state-of-the-art reasoning models without sacrificing accuracy gives it a wider and more immediate scientific and practical impact.
Paper 2 addresses a highly timely and critical bottleneck in modern Large Reasoning Models: inference inefficiency, or 'overthinking'. Its insight that problem difficulty evolves and is encoded in step-level embeddings offers a novel, training-free solution to dynamically control reasoning depth. Given the massive compute costs associated with inference-time scaling (e.g., o1-like models), an efficient, generalizable approach tested across multiple models and benchmarks provides higher immediate real-world utility and broader impact than Paper 1's LLM agent memory framework, which enters a relatively more saturated field.