ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning
Ziyan Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin
Abstract
Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, since long CoTs naturally contain trial and error and mainstream RLVR approaches choose outcome-correct CoT trajectories for memorization, the redundant explorations in long CoTs are inevitably reinforced, which results in the overthinking problem of LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. Therefore, we propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency. It reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
AI Impact Assessments
Scientific Impact Assessment: ThoughtFold
1. Core Contribution
ThoughtFold addresses the "overthinking" problem in Large Reasoning Models (LRMs) — the tendency to produce verbose chain-of-thought (CoT) reasoning containing redundant explorations, self-repetitions, and off-target attempts. The key insight is that standard RLVR methods uniformly reinforce all tokens in outcome-correct trajectories, indiscriminately memorizing both essential deductions and noise.
The paper proposes a two-pronged solution: (1) an introspective redundancy identification strategy that uses prune-and-verify with binary search to identify which reasoning steps are dispensable, and (2) a masked preference optimization (MDPO) objective that provides step-level supervision — penalizing redundant steps while encouraging "fold anchors" that bridge essential reasoning segments. The method jointly optimizes this fine-grained preference loss with standard GRPO for trajectory-level accuracy.
The novelty lies in moving beyond outcome-level length penalties to step-level credit assignment for reasoning efficiency. The "folding" metaphor is apt: rather than truncating reasoning, the method identifies and collapses internal redundancy while preserving logical structure.
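The prune-and-verify idea with binary search described above can be pictured with a small sketch. This is not the paper's algorithm: the function name `max_prunable_span`, the `still_correct` verifier, and the toy steps are all hypothetical, and the sketch assumes monotonicity (if dropping k consecutive steps breaks the answer, dropping more than k also breaks it), which is what makes binary search valid here.

```python
from typing import Callable, List, Sequence


def max_prunable_span(
    steps: Sequence[str],
    start: int,
    still_correct: Callable[[List[str]], bool],
) -> int:
    """Largest k such that dropping steps[start:start + k] keeps the answer correct.

    Hypothetical helper. Assumes the unpruned trajectory is correct (k = 0)
    and that correctness is monotone in k, so binary search applies.
    """
    lo, hi = 0, len(steps) - start
    while lo < hi:
        mid = (lo + hi + 1) // 2  # upper midpoint so the loop always makes progress
        pruned = list(steps[:start]) + list(steps[start + mid:])
        if still_correct(pruned):  # verify: does the pruned chain still reach the answer?
            lo = mid               # the span of length mid is dispensable
        else:
            hi = mid - 1           # pruned too much, shrink the span
    return lo


# Toy trajectory: two exploratory steps sit between the essential ones.
steps = [
    "set up equation",
    "try guess x=1",
    "guess fails, retry",
    "solve: x=3",
    "answer: 3",
]
essential = {"set up equation", "solve: x=3", "answer: 3"}


def still_correct(pruned: List[str]) -> bool:
    # Stand-in verifier: in practice this would re-run the model on the
    # pruned prefix and check the final answer against the ground truth.
    return essential.issubset(pruned)


redundant_len = max_prunable_span(steps, 1, still_correct)
```

In this toy case the two exploratory steps starting at index 1 are found to be dispensable, using a logarithmic number of verifier calls rather than one call per candidate span length.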
2. Methodological Rigor
Strengths in methodology:
Concerns:
3. Experimental Evaluation
The experimental coverage is comprehensive: four models (7B-14B scale), five benchmarks spanning difficulty levels (GSM8K to AIME), and multiple baselines including the strong S-GRPO. Key results:
The ablation study is informative, isolating contributions of attention-based pruning, internal folding, and dynamic masking. The hyperparameter analysis (Table 3) shows a smooth accuracy-efficiency tradeoff controlled by λ.
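To make the role of λ concrete, here is a minimal sketch of a masked, step-level logistic preference term added to a trajectory-level loss. The exact objective is not given in this assessment, so everything below is an assumption: the DPO-style log-sigmoid form, the labels (+1 for a fold anchor, -1 for a redundant step, 0 for masked out), the `beta` scale, and the choice to have λ multiply the preference term.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def masked_step_preference_loss(
    log_ratios: Sequence[float],  # per-step log pi(step) - log pi_ref(step), assumed given
    labels: Sequence[int],        # +1 fold anchor, -1 redundant step, 0 masked out
    beta: float = 1.0,
) -> float:
    """Mean logistic loss over unmasked steps (hypothetical form)."""
    terms = [
        -log_sigmoid(beta * y * r)
        for r, y in zip(log_ratios, labels)
        if y != 0  # masked steps contribute no gradient signal
    ]
    return sum(terms) / len(terms) if terms else 0.0


def joint_loss(trajectory_loss: float, preference_loss: float, lam: float) -> float:
    # Assumed combination: trajectory-level (e.g. GRPO) loss plus a
    # lambda-weighted step-level preference loss.
    return trajectory_loss + lam * preference_loss


# A redundant step that the policy still favors is penalized more heavily
# than one the policy has already down-weighted.
favored_redundant = masked_step_preference_loss([1.0], [-1])
suppressed_redundant = masked_step_preference_loss([-1.0], [-1])
```

Under these assumptions, a larger λ puts more weight on suppressing redundant steps relative to the trajectory-level term, which is one way a single scalar could trade accuracy against length.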
The ML@k metric (Section 4.3) is a thoughtful contribution — it reveals that ThoughtFold's gains come from structural reasoning improvements rather than mere length distribution reshaping, as evidenced by the steeper decay compared to Short-RL.
Limitations in evaluation:
4. Timeliness & Relevance
This paper is highly timely. The overthinking problem in reasoning models (DeepSeek-R1, Qwen3, etc.) is a widely recognized bottleneck for deployment, as excessive token generation increases latency and cost. The efficient reasoning subfield is rapidly growing (2025-2026), and ThoughtFold offers a principled alternative to the dominant length-penalty approaches by providing finer-grained supervision.
The work connects to broader themes in RL for LLMs: credit assignment, reward shaping, and preference optimization — making it relevant beyond just efficient reasoning.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Overall Assessment
ThoughtFold makes a meaningful contribution to the efficient reasoning literature by introducing step-level preference learning as an alternative to coarse length penalties. The introspective redundancy identification and masked preference optimization are novel and well-executed. The empirical results are strong, though the incomplete analysis of training cost is a notable gap. The work should influence future approaches to reasoning efficiency and credit assignment in RLVR.
Generated Jun 3, 2026
Comparison History (25)
Paper 1 pioneers an evaluation framework for autonomous agent development, addressing a critical milestone towards AGI: recursive self-improvement. While Paper 2 offers significant practical efficiency gains for current reasoning models, Paper 1 explores a fundamentally novel capability, highlighting critical safety and alignment issues like reward hacking. Its focus on meta-agents provides broader long-term implications across AI safety, alignment, and systems design, giving it a higher potential for foundational scientific impact.
Paper 1 addresses a highly critical and timely bottleneck in modern AI: the massive inference costs and 'over-thinking' of Large Reasoning Models (like DeepSeek-R1). By achieving a 56% reduction in token usage without sacrificing accuracy, ThoughtFold offers immediate, widespread practical applications and computational savings. While Paper 2 presents a rigorous Shapley-based solution to multi-agent credit assignment, the broader and more immediate industry and academic focus on efficient single-agent reasoning gives Paper 1 a higher potential impact.
Paper 2 likely has higher impact due to a broadly applicable training framework targeting a widely observed LRM issue (overthinking/inefficient long CoTs) with clear, large efficiency gains (~56% token reduction) while preserving accuracy—high practical value for deployment cost/latency across many tasks and models. Its methodological contribution (introspective redundancy identification + masked preference optimization) is a general learning signal beyond outcome-based RLVR, timely given current reasoning-model scaling. Paper 1 is valuable for agent reliability auditing and benchmarking, but is narrower (deep-research agents) and more evaluation/audit-focused than a general capability-improving training method.
Paper 2 addresses a critical bottleneck in state-of-the-art Large Reasoning Models by reducing redundant Chain-of-Thought tokens by 56% without sacrificing accuracy. Given the massive scale of LLM deployment and current interest in inference-time reasoning, this offers immense computational savings and broad applicability. While Paper 1 presents an elegant and necessary solution for embodied AI memory constraints on edge hardware, Paper 2's focus on general-purpose reasoning models makes it more timely, widely applicable, and highly influential across the broader AI research community and industry.
Paper 2 is more likely to have higher scientific impact due to stronger novelty and breadth: it contributes to the theoretical foundations of causal inference by introducing derivation graphs to characterize do-calculus equivalence classes and providing a bounded (≤4 steps) reasoning procedure, with downstream implications for identification and estimation efficiency. This can influence multiple areas (statistics, epidemiology, economics, ML causality) and has long-term relevance. Paper 1 is timely and practically useful for LLM efficiency, but it appears more incremental within a fast-moving, model-specific optimization landscape and may generalize less broadly.
ThoughtFold addresses a highly critical and timely bottleneck in Large Reasoning Models: inference inefficiency and over-thinking. Reducing token usage by 56% without sacrificing accuracy offers massive real-world cost savings and scalability benefits. While SkillDAG introduces an innovative structural approach for agent skill selection, improving general reasoning efficiency has a broader, more immediate impact across the entire LLM ecosystem.
Paper 1 addresses a highly timely and critical issue in modern AI: the inefficiency and over-thinking of Large Reasoning Models (like DeepSeek-R1) during Chain-of-Thought generation. By reducing token usage by 56% while maintaining accuracy, it offers massive computational savings and immediate real-world applicability across the booming LLM industry. While Paper 2 presents innovative work in BCI and EEG foundation models, Paper 1's impact is substantially broader and more immediate across the AI and computer science landscape.
Paper 2 addresses a critical and highly timely issue in state-of-the-art large reasoning models: over-thinking and token inefficiency. By reducing token usage by 56% without sacrificing accuracy, it offers massive practical benefits for deployment cost and latency across a wide range of LLM applications. Paper 1 offers a valuable methodological contribution to automated theorem proving, but its impact is confined to a much narrower domain compared to the broad applicability of the general reasoning efficiency improvements proposed in Paper 2.
Paper 1 addresses a critical challenge in Large Language Models (computational efficiency in reasoning chains), offering broad implications for reducing AI operational costs. Its 56% token reduction without accuracy loss demonstrates massive real-world value. Paper 2 focuses on a narrow niche in video game procedural generation, which has significantly less potential for broad scientific and societal impact.
Paper 1 likely has higher impact: it introduces an agentic framework that materially advances automated formal theorem proving, plus a new challenging benchmark (Lean-IMO-Bench) and demonstrates success on high-profile, time-relevant tasks (Putnam 2025, IMO-style problems) and research-grade formalization of open combinatorial challenges. This combines novelty, strong methodological grounding via compiler-verified proofs, broad cross-field implications (AI, formal methods, mathematics), and clear real-world applications in verification and mathematical discovery. Paper 2 is valuable but more incremental/optimization-focused on efficiency for existing RLVR CoT training.
Paper 2 presents a concrete, novel training framework (ThoughtFold) that targets a timely and widely observed issue in LRMs—over-long/overthinking chain-of-thought—via introspective redundancy detection and a masked preference optimization objective. It reports substantial empirical gains (≈56% token reduction at maintained SOTA accuracy), indicating methodological rigor and immediate practical value for lowering inference cost/latency across many LLM deployments. Paper 1 is a useful architectural perspective for embedded agent systems, but is less empirically grounded and likely to have narrower near-term impact compared to an algorithmic advance applicable broadly across reasoning models.
Paper 2 addresses the fundamental problem of scalable oversight in AI alignment—how weaker models can supervise stronger ones—which is a critical challenge as AI capabilities advance. Its novel framing of 'weak-critic strong oversight' and the OPCD method have broader implications across reasoning, alignment, and AI safety. While Paper 1 (ThoughtFold) makes a solid contribution to reasoning efficiency with impressive token reduction, it addresses a more incremental optimization problem. Paper 2's relevance to alignment and scalable oversight gives it greater breadth of impact and timeliness given current AI safety concerns.
ThoughtFold addresses a highly critical and timely issue in the rapidly growing field of Large Reasoning Models: mitigating over-thinking and redundant explorations in long Chain-of-Thought paths. By achieving a massive 56% reduction in token usage without sacrificing accuracy, it offers a fundamental algorithmic improvement to reasoning efficiency. While ToolGate provides valuable efficiency gains for tool-augmented VLMs, ThoughtFold's approach is likely to have a broader and more transformative impact on general-purpose AI reasoning capabilities.
Paper 2 likely has higher scientific impact due to greater novelty and broader relevance: it introduces a general framework (ThoughtFold) to reduce redundant exploration in chain-of-thought reasoning via introspective, fine-grained preference learning and a masked preference optimization objective. If results generalize, it directly improves efficiency (large token savings) while preserving accuracy—highly timely for LRM deployment costs and latency. Its applications span many reasoning tasks and model families. Paper 1 is valuable but more domain-specific (CS1 C++ grading) with narrower cross-field impact despite solid methodological contributions (rubric conditioning and distribution matching).
Paper 2 addresses a fundamental issue in the training of Large Reasoning Models (reinforcing redundant explorations during RLVR) and proposes a novel preference learning framework to inherently shorten reasoning chains. This algorithmic improvement to model training is likely to have broader, lasting impact on how future reasoning models are developed, whereas Paper 1 offers a practical but more localized inference-time mitigation for quantization artifacts.
ThoughtFold addresses a timely and practical problem (over-thinking in LRMs) with a concrete, well-evaluated solution achieving 56% token reduction while maintaining accuracy. Its immediate applicability to widely-used reasoning models (DeepSeek-R1) and clear methodological contribution (masked preference optimization) give it high near-term impact. Paper 2, while intellectually ambitious in formalizing Machine Theory of Mind, is more theoretical and foundational without empirical validation of a new system, limiting its immediate measurable impact despite its potential long-term significance.
ThoughtFold addresses a critical bottleneck—over-thinking and high token consumption—in Large Reasoning Models (LRMs). By significantly reducing token usage while maintaining accuracy, it offers foundational improvements applicable to any domain utilizing reasoning LLMs. While Paper 1 presents an innovative and valuable medical application, Paper 2's methodological advancement in fundamental AI efficiency grants it a much broader potential impact and higher timeliness across the entire artificial intelligence community.
Paper 1 is more likely to have higher impact: it introduces a concrete, novel training framework (introspective redundancy identification + masked preference optimization) that directly targets a timely, widely observed LRM failure mode (overthinking/inefficient long CoTs) and demonstrates large efficiency gains (~56% token reduction) while preserving SOTA accuracy—high immediate practical value and broad relevance across LLM training and deployment. Paper 2 poses an interesting question but relies on stronger assumptions (causal discovery reliability) and frames results as preliminary/limited-scope, likely reducing near-term adoption and impact.
Paper 1 addresses a critical and highly timely issue in Large Reasoning Models (over-thinking in RLVR-trained CoTs). By reducing token usage by 56% without sacrificing accuracy, it offers massive implications for inference efficiency and computational cost reduction. While Paper 2 provides valuable insights into agent training dynamics, Paper 1's direct solution to a major bottleneck in state-of-the-art reasoning models promises broader and more immediate real-world and scientific impact across the AI community.
Paper 2 likely has higher scientific impact because it targets a broad, pervasive failure mode—multi-constraint instruction following—relevant to many LLM deployments (agents, safety/policy compliance, tool use). Its formulation of the Constraint Adherence Problem and graph-based CRGC with “bridge constraints” is a more generally applicable conceptual framework than efficiency-focused CoT shortening. While Paper 1 is timely and useful for reducing overthinking/token cost, its impact is narrower (reasoning-chain compression under RLVR/CoT settings). Paper 2’s approach can transfer across tasks and governance/safety contexts.