DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning
Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu
Abstract
Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.
AI Impact Assessments
Scientific Impact Assessment: DeepTool
1. Core Contribution
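DeepTool addresses a genuine gap in tool-integrated reasoning (TIR): the lack of deliberate, System-2-style thinking during multi-turn tool invocation. The framework introduces three interconnected innovations: interleaved deliberation at every turn, adversarial perturbation during trajectory synthesis, and an action-centric process reward for reinforcement learning.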
The key conceptual insight — that tool invocation should not be treated as an atomic action but as a deliberative cognitive process requiring planning, verification, and self-correction — is well-motivated and practically significant.
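To make the interleaved structure concrete, the following is a minimal sketch of a thinking, action, observation loop. It is an illustration only, not the paper's implementation: the `Turn` record, `run_episode`, and the toy policy and tool are hypothetical names introduced here, and a real system would call an LLM and a sandboxed interpreter instead.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Turn:
    thinking: str                 # deliberation written before acting
    action: Optional[str]         # tool call; None on the final (answer) turn
    observation: Optional[str]    # tool output; None on the final turn


# A policy maps (question, history) to (thinking, action-or-None).
Policy = Callable[[str, List[Turn]], Tuple[str, Optional[str]]]


def run_episode(policy: Policy, tool: Callable[[str], str],
                question: str, max_turns: int = 4) -> List[Turn]:
    """Alternate thinking, action, and observation until the policy stops acting."""
    history: List[Turn] = []
    for _ in range(max_turns):
        thinking, action = policy(question, history)
        if action is None:                     # policy decided to answer
            history.append(Turn(thinking, None, None))
            break
        history.append(Turn(thinking, action, tool(action)))
    return history


# Hypothetical stand-ins for an LLM policy and a calculator tool.
def toy_policy(question: str, history: List[Turn]) -> Tuple[str, Optional[str]]:
    if not history:
        return ("I should verify this with the tool.", "2+3")
    return (f"The tool returned {history[-1].observation}; I can answer.", None)


def toy_tool(action: str) -> str:
    return str(sum(int(x) for x in action.split("+")))


turns = run_episode(toy_policy, toy_tool, "What is 2+3?")
```

Each `Turn` keeps the deliberation next to the action it justifies. That pairing is the unit a per-turn (process-level) reward would score.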
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The framework addresses a practical bottleneck in deploying LLMs for complex reasoning tasks requiring tool use. The core ideas — interleaved deliberation, adversarial perturbation during synthesis, and process-level rewards for TIR — are broadly applicable beyond mathematics to domains like scientific computing, data analysis, and agentic workflows.
The thinking budget scaling analysis (Figure 4) provides practical guidance for deployment, showing that budget should be calibrated to task difficulty. The finding that moderate thinking budgets can achieve Pareto-dominant operating points (fewer tokens AND higher accuracy than non-thinking baselines) is practically valuable.
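The notion of a Pareto-dominant operating point can be checked mechanically. The sketch below assumes each configuration is summarized by an average token count and an accuracy. The numbers are placeholders invented for illustration and are not results from the paper; `dominates` and `pareto_front` are hypothetical helper names.

```python
from typing import List, Tuple

# (label, average tokens per problem, accuracy). Placeholder values only.
Point = Tuple[str, float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a uses no more tokens and is at least as accurate as b,
    and is strictly better on at least one of the two axes."""
    return a[1] <= b[1] and a[2] >= b[2] and (a[1] < b[1] or a[2] > b[2])


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


configs: List[Point] = [
    ("no-thinking", 9000.0, 0.30),       # illustrative, not measured
    ("moderate-budget", 7000.0, 0.35),   # illustrative, not measured
    ("large-budget", 15000.0, 0.40),     # illustrative, not measured
]

front = pareto_front(configs)
```

With these placeholder values, the moderate budget dominates the non-thinking baseline (fewer tokens and higher accuracy), while the large budget stays on the front because it buys extra accuracy at extra cost.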
However, the impact may be somewhat limited by:
4. Timeliness & Relevance
The paper is highly timely, situated at the intersection of two active research frontiers: (1) scaling test-time compute / System 2 reasoning (post-o1/R1 era), and (2) RL for tool-augmented agents. The specific problem of sparse rewards in multi-turn TIR is widely acknowledged, and the process supervision approach is a natural and needed extension.
The work builds on the GRPO framework from DeepSeek-R1 and extends it meaningfully to the multi-turn tool-use setting. The timing aligns well with the community's shift from outcome-based to process-based RL supervision.
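For readers unfamiliar with GRPO, its central step is normalizing each sampled trajectory's reward against the other trajectories sampled for the same prompt. The sketch below shows that step, plus one possible way to fold per-turn process scores into the reward. The blending rule and the `weight` parameter are assumptions made for illustration and are not the paper's Action-Centric Process Reward. Implementations also differ on whether they use the population or the sample standard deviation.

```python
from statistics import mean, pstdev
from typing import List


def total_reward(outcome: float, turn_scores: List[float],
                 weight: float = 0.5) -> float:
    """Outcome reward plus a weighted mean of per-turn scores.
    Hypothetical blending rule, not taken from the paper."""
    process = mean(turn_scores) if turn_scores else 0.0
    return outcome + weight * process


def group_relative_advantages(rewards: List[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four trajectories sampled for one prompt (illustrative scores).
group = [
    total_reward(1.0, [1.0, 1.0]),   # correct answer, clean tool calls
    total_reward(1.0, [1.0, 0.0]),   # correct answer, one bad tool call
    total_reward(0.0, [1.0, 0.0]),   # wrong answer, one bad tool call
    total_reward(0.0, [0.0, 0.0]),   # wrong answer, bad tool calls
]
advantages = group_relative_advantages(group)
```

With an outcome-only reward, the first two trajectories would receive identical advantages. Adding a per-turn term separates them, which is the gap process supervision is meant to close.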
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The paper's framing around "System 2 deliberation" is somewhat loose — what's actually happening is extended chain-of-thought within each tool-use turn, which is a useful but more specific contribution than the framing suggests. The connection to dual-process theory is metaphorical rather than substantive.
The writing is generally clear but could benefit from tighter notation and more formal problem statements. The case study appendix effectively demonstrates the practical difference between thinking and non-thinking modes.
Generated May 29, 2026
Comparison History (16)
Paper 1 addresses the critical frontier of scaling reasoning and tool use in LLMs via process-supervised RL. By integrating interleaved deliberation and achieving massive performance gains on rigorous benchmarks like AIME, it pushes the boundaries of agentic AI. While Paper 2 offers highly practical efficiency gains for safety guardrails, Paper 1's methodology fundamentally expands model capabilities and aligns with the highly impactful trend of reasoning-time scaling, suggesting broader implications for advancing general AI capabilities.
DeepTool addresses a more fundamental problem in tool-integrated reasoning—scaling deliberate thinking with process-supervised RL—showing dramatic improvements (e.g., AIME24: 3.2%→40.4%). Its contributions (action-centric process rewards, adversarial perturbation pipeline) are broadly applicable to any LLM reasoning with tools. ZipRL tackles the important but narrower problem of context compression in multi-turn settings. While both are methodologically rigorous and timely, DeepTool's broader impact on reasoning capabilities and its striking performance gains suggest higher scientific impact.
Paper 1 addresses a highly critical and timely bottleneck in modern LLMs: scaling test-time deliberation and tool-use via process-supervised RL. Its massive empirical gains on hard reasoning benchmarks (e.g., AIME) signal immediate and transformative practical applications in AI agent development. While Paper 2 offers a strong fundamental contribution to deep learning theory and generalization, Paper 1's alignment with the current frontier of scaling reasoning capabilities gives it a significantly higher potential for immediate, widespread scientific and industrial impact.
Paper 1 has higher potential scientific impact due to a more novel technical contribution (process-supervised RL for interleaved tool use with action-centric rewards), strong demonstrated performance gains on widely recognized benchmarks, and broad applicability to agentic LLM systems, planning, and robust tool invocation across many domains. Its methodology appears more directly generalizable and likely to influence downstream model training and deployment. Paper 2 is timely and important for peer-review integrity, but its impact is narrower (scholarly workflow governance) and more contingent on policy adoption and specific review settings.
While Paper 1 offers strong empirical improvements in LLM tool use, Paper 2 addresses a fundamental, systemic issue in modern AI: model collapse from synthetic data in multi-agent environments. Its counter-intuitive finding that human curation can backfire in multi-model loops provides critical, timely insights for the long-term viability of AI alignment and foundation model training.
Paper 1 introduces a highly novel paradigm by repurposing reasoning models as natural context compressors, eliminating the need for dedicated compression modules. This conceptual shift offers broad, real-world utility by significantly reducing LLM inference costs and latency for long-context tasks. While Paper 2 presents strong empirical gains in tool-integrated reasoning using process-supervised RL, it represents a more incremental methodological advance following current RL trends. Paper 1's innovative approach addresses a fundamental architectural bottleneck, giving it a higher potential for broad scientific and practical impact.
Paper 1 advances the fundamental capabilities of LLMs in reasoning and tool-use via Process-Supervised RL, achieving massive gains on rigorous benchmarks like AIME24 with a 7B model. Enhancing LLM deliberation with tools is a critical frontier in AI research, promising broader algorithmic impact and generalizability across domains compared to Paper 2's system-level architectural optimizations for multi-agent serving.
Paper 2 presents a paradigm-shifting contribution: the first large-scale autoformalization of graduate-level mathematics, producing a verified library of 45,000+ Lean 4 declarations across 26 textbooks. This has broader and deeper impact across mathematics, formal verification, and AI-for-math. The open-source artifacts (AutoformBot + Atlas) create lasting infrastructure. While Paper 1 makes solid improvements to tool-integrated reasoning for LLMs, it represents an incremental advance in RL training methodology. Paper 2 demonstrates feasibility of a long-standing goal in mathematical formalization, enabling automated verification of research-level mathematics.
DeepTool addresses a fundamental challenge in LLM reasoning with tool use, proposing a novel process-supervised RL framework that achieves dramatic performance improvements (e.g., 3.2%→40.4% on AIME24). Its contributions—interleaved deliberation, action-centric process rewards, and adversarial perturbation training—are broadly applicable across AI/ML. Paper 2 addresses a niche urban planning problem (tourist mobility in Tokyo) with incremental methodological contributions and limited generalizability. DeepTool's impact spans the rapidly growing LLM reasoning community, making it significantly more impactful.
Paper 2 likely has higher scientific impact: it proposes a generally applicable training framework (process-supervised RL for interleaved tool use) with large, quantifiable gains on widely recognized reasoning benchmarks, making it timely and broadly relevant to LLM capability scaling and agentic systems. Its methodological contribution (action-centric process rewards, robust trajectory synthesis) is transferable across tools and domains, with clear real-world applications in reliable tool-using agents. Paper 1 is valuable but more narrowly scoped to web-generation evaluation and may impact a smaller subcommunity.
Paper 1 is likely to have broader and more durable impact: it introduces a diagnostic benchmark targeting the under-studied “harness” layer that governs real agent behavior, enabling reproducible, cross-model/cross-harness analysis with artifacts and traces. This addresses a timely reliability/auditability gap for deployed agents and can influence evaluation standards across academia and industry. Paper 2 is methodologically ambitious and shows large gains, but appears more model/training-specific and risks narrower generalization; its impact depends on adoption and reproducibility of the RL/synthesis pipeline.
DeepTool presents a novel training framework (process-supervised RL for tool-integrated reasoning) with dramatic empirical improvements (e.g., AIME24: 3.2%→40.4%), introducing both a new data synthesis pipeline and a process reward mechanism. This addresses a fundamental limitation in how LLMs learn to use tools through RL, with broad applicability. AsyncTool, while valuable as a benchmark for asynchronous tool calling, is primarily an evaluation contribution with narrower scope. DeepTool's methodological innovations in training paradigms are likely to influence more subsequent research and have greater real-world impact on building capable reasoning agents.
Paper 1 addresses a fundamental challenge in general AI—scaling deliberate reasoning and tool use in LLMs using process-supervised RL. Its massive performance boosts on complex benchmarks like AIME suggest broad, foundational implications across AI research and applications. In contrast, Paper 2, while highly innovative in its hierarchical approach to trajectory generation, is constrained to a specific domain (urban computing and smart cities), significantly limiting its overall breadth of impact compared to the general-purpose AI advancements in Paper 1.
Paper 1 has higher estimated scientific impact due to stronger novelty and methodological contribution: it introduces a concrete, scalable framework (interleaved deliberation trajectories + adversarial robustness) and a specific process-supervised RL method with an action-centric reward, validated by large performance gains on multiple established benchmarks and efficiency analysis. This combination suggests clear, generalizable advances for tool-augmented LLM reasoning with broad applicability across domains. Paper 2 is timely and important for education, but is largely conceptual/architectural and appears less empirically validated, likely yielding narrower near-term scientific influence.
Paper 2 tackles a fundamental challenge in AI—enhancing LLM tool-integrated reasoning and deliberation via process-supervised RL. Its broad applicability to agentic workflows and massive performance gains on standard benchmarks (e.g., AIME24 from 3.2% to 40.4%) indicate a high potential for widespread adoption across the AI community. While Paper 1 offers valuable insights into public policy text analysis and evaluation, Paper 2's methodological advancements in test-time compute and RL align with the most pressing, high-impact frontiers in core AI research.
DeepTool demonstrates dramatically larger performance gains (e.g., AIME24: 3.2%→40.4%) on established benchmarks, introduces a novel framework combining process-supervised RL with tool-integrated reasoning that has broad applicability across many LLM tasks. Its methodological contributions (action-centric process rewards, adversarial perturbation pipeline) are more technically innovative and generalizable. While MIRA addresses an important health equity concern with the novel 'Differential Information Dilution' concept, its scope is narrower (medical Q&A auditing) and its mitigations show modest improvements (~6-8%). DeepTool's advances in reasoning capabilities have wider downstream impact.