DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

#480 of 2821 · Artificial Intelligence
Tournament Score: 1484±48
Win Rate: 75% (12 wins, 4 losses, 16 matches)
Rating: 6.5/10

Abstract

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DeepTool

1. Core Contribution

DeepTool addresses a genuine gap in tool-integrated reasoning (TIR): the lack of deliberate, System-2-style thinking during multi-turn tool invocation. The framework introduces three interconnected innovations:

  • MOSAIC: A hierarchical synthesis pipeline using a Manager-Actor architecture to generate interleaved thinking trajectories, with stochastic adversarial perturbations (intrinsic code errors, extrinsic environment failures) to build robustness.
  • Process-Supervised RL with Action-Centric Process Rewards: A dense supervision mechanism built on GRPO that evaluates each step's action correctness (via Gestalt Pattern Matching) while permitting diverse reasoning paths, addressing the credit assignment problem in long-horizon TIR.
  • Step-wise decomposition: Converting single trajectories into multiple independent training signals using ground-truth history prefixes, neutralizing error propagation during RL training.

The key conceptual insight, that tool invocation should not be treated as an atomic action but as a deliberative cognitive process requiring planning, verification, and self-correction, is well-motivated and practically significant.
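
To make the step-wise decomposition concrete, here is a minimal sketch of how one trajectory could be split into independent step-level samples, each conditioned on the ground-truth prefix. The names `Turn`, `StepSample`, and `decompose` are hypothetical and chosen for illustration only; the paper's actual data format is not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    thinking: str      # interleaved deliberation for this turn
    action: str        # tool call, e.g. code sent to the interpreter
    observation: str   # tool output returned by the environment


@dataclass
class StepSample:
    history: List[Turn]   # ground-truth turns 0..t-1, used as fixed context
    target_action: str    # reference action for turn t


def decompose(trajectory: List[Turn]) -> List[StepSample]:
    """Split one T-turn trajectory into T independent step-level samples."""
    return [
        StepSample(history=trajectory[:t], target_action=trajectory[t].action)
        for t in range(len(trajectory))
    ]


# Hypothetical three-turn trajectory (illustrative, not from the paper).
traj = [
    Turn("plan the computation", "print(2 ** 10)", "1024"),
    Turn("check the remainder", "print(1024 % 7)", "2"),
    Turn("report the result", "print('answer: 2')", "answer: 2"),
]
samples = decompose(traj)
for s in samples:
    print(len(s.history), "->", s.target_action)
```

Because every sample starts from a reference prefix, an early mistake by the policy cannot contaminate later steps during training. The same property is the source of the train/inference mismatch raised under Methodological Rigor.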

2. Methodological Rigor

Strengths:

  • The experimental setup is reasonably comprehensive, testing across six math benchmarks (AIME24, AIME25, MATH500, OlympiadBench, AMC23, HMMT25) plus GPQA-Diamond, with two backbone models (Qwen2.5-7B, Qwen3-4B).
  • The avg@8 evaluation protocol reduces variance.
  • Multiple ablations systematically isolate contributions: thinking vs. no-thinking, state preservation vs. discarding, thinking budget scaling, and MOSAIC vs. standard synthesis.
  • The cost-effectiveness analysis (accuracy gain per 1k tokens) is a useful practical metric.
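
As a rough illustration of why the evaluation protocol matters, the sketch below contrasts avg@k with a simple empirical pass@k on the same set of sampled answers. The outcome matrix is hypothetical (not data from the paper), and `pass_at_k` here is the plain "at least one of k correct" rate rather than the unbiased estimator some papers use.

```python
from typing import List


def avg_at_k(correct: List[List[bool]]) -> float:
    """Mean accuracy over all k samples of every problem."""
    per_problem = [sum(c) / len(c) for c in correct]
    return sum(per_problem) / len(per_problem)


def pass_at_k(correct: List[List[bool]]) -> float:
    """Fraction of problems with at least one correct sample among the k drawn."""
    return sum(any(c) for c in correct) / len(correct)


# Hypothetical outcomes: 4 problems, 8 samples each.
outcomes = [
    [True] * 8,
    [True] + [False] * 7,
    [False] * 8,
    [True, True] + [False] * 6,
]

print(avg_at_k(outcomes))   # 0.34375
print(pass_at_k(outcomes))  # 0.75
```

On identical samples the two metrics differ by more than a factor of two, so numbers reported under different protocols are not directly comparable.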

Weaknesses:

  • The step-wise decomposition using ground-truth history prefixes (Eq. 5) is a known technique (expert iteration / step-level training), and the paper doesn't fully discuss the distribution mismatch between ground-truth prefixes and model-generated prefixes during inference.
  • The Action-Centric Process Reward uses Gestalt Pattern Matching (string similarity) rather than execution-based correctness verification. This seems fragile — syntactically different but semantically equivalent code would receive lower rewards. The paper doesn't analyze failure modes of this reward signal.
  • The comparison against baselines has notable gaps: ToRL and ReTool numbers appear to come from their respective papers, but evaluation protocols may differ (e.g., pass@k vs. avg@k, sampling temperature). Some baselines lack results on key benchmarks (HMMT25, OlympiadBench, GPQA-D), making direct comparison difficult.
  • The MOSAIC pipeline relies on DeepSeek-V3.2 as the actor/manager, introducing substantial distillation effects. The paper doesn't disentangle the contribution of the stronger teacher model from the framework design.
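
The fragility concern about string-similarity rewards can be illustrated with Python's standard `difflib.SequenceMatcher`, which implements a close variant of Ratcliff/Obershelp gestalt pattern matching. This is a sketch under the assumption that the paper's reward behaves similarly; the exact implementation and any thresholds are not stated in this assessment, and `action_similarity` and `run` are hypothetical helper names.

```python
import io
from contextlib import redirect_stdout
from difflib import SequenceMatcher


def action_similarity(predicted: str, reference: str) -> float:
    """Similarity in [0, 1] between two tool-call strings."""
    return SequenceMatcher(None, predicted, reference).ratio()


def run(code: str) -> str:
    """Execute a trusted snippet and capture what it prints."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        exec(code, {})  # only used on the literal strings defined below
    return buf.getvalue()


reference = "print(sum(range(1, 101)))"
same_text = "print(sum(range(1, 101)))"
equivalent = "n = 100\nprint(n * (n + 1) // 2)"

# Both snippets print the same result when executed...
print(run(reference) == run(equivalent))         # True
# ...but only the textually identical one gets full similarity.
print(action_similarity(same_text, reference))   # 1.0
print(action_similarity(equivalent, reference) < 1.0)  # True
```

An execution-based check such as comparing `run` outputs would score the two snippets equally, which is the alternative the weakness above points toward.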

3. Potential Impact

The framework addresses a practical bottleneck in deploying LLMs for complex reasoning tasks requiring tool use. The core ideas (interleaved deliberation, adversarial perturbation during synthesis, and process-level rewards for TIR) are broadly applicable beyond mathematics to domains like scientific computing, data analysis, and agentic workflows.

The thinking budget scaling analysis (Figure 4) provides practical guidance for deployment, showing that budget should be calibrated to task difficulty. The finding that moderate thinking budgets can achieve Pareto-dominant operating points (fewer tokens AND higher accuracy than non-thinking baselines) is practically valuable.
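
For readers unfamiliar with the terminology, the following sketch shows what a Pareto-dominance check and an "accuracy gain per 1k tokens" figure could look like. All numbers are hypothetical placeholders, not results from the paper, and the metric definition (gain over a base model divided by tokens in thousands) is an assumption about how the paper computes it.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    tokens: float     # average generated tokens per problem
    accuracy: float   # accuracy in percent


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.tokens <= b.tokens and a.accuracy >= b.accuracy
    strictly_better = a.tokens < b.tokens or a.accuracy > b.accuracy
    return no_worse and strictly_better


def gain_per_1k_tokens(point: OperatingPoint, base_accuracy: float) -> float:
    """Accuracy gain over a base model per 1k generated tokens (assumed definition)."""
    return (point.accuracy - base_accuracy) / (point.tokens / 1000.0)


# Hypothetical operating points, for illustration only.
non_thinking = OperatingPoint("non-thinking", tokens=6000, accuracy=30.0)
moderate = OperatingPoint("moderate budget", tokens=5000, accuracy=36.0)

print(dominates(moderate, non_thinking))          # True
print(gain_per_1k_tokens(moderate, base_accuracy=10.0))  # 5.2
```

A point that dominates in this sense is preferable regardless of how tokens and accuracy are traded off against each other, which is why such operating points are useful deployment guidance.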

However, the impact may be somewhat limited by:

  • Restriction to code interpreter as the sole tool; generalization to heterogeneous tool ecosystems (search, APIs, databases) remains undemonstrated.
  • The synthesis pipeline's dependence on a powerful reasoning model (DeepSeek-V3.2) limits accessibility.
  • Results are primarily on mathematical reasoning; generalization to other domains is assumed but not tested.

4. Timeliness & Relevance

The paper is highly timely, situated at the intersection of two active research frontiers: (1) scaling test-time compute / System 2 reasoning (post-o1/R1 era), and (2) RL for tool-augmented agents. The specific problem of sparse rewards in multi-turn TIR is widely acknowledged, and the process supervision approach is a natural and needed extension.

The work builds on the GRPO framework from DeepSeek-R1 and extends it meaningfully to the multi-turn tool-use setting. The timing aligns well with the community's shift from outcome-based to process-based RL supervision.

5. Strengths & Limitations

Key Strengths:

  • Strong empirical results: 3.2% → 40.4% on AIME24 and 0.0% → 28.6% on HMMT25 are dramatic improvements.
  • The monotonic improvement across the three-stage pipeline (base → SFT → RL) provides clean evidence for each component's value.
  • The adversarial perturbation mechanism is a creative and well-motivated design choice for building robust TIR agents.
  • The state preservation analysis (Figure 3) provides useful architectural guidance.
  • Detailed case studies showing error recovery behavior are illuminating.

Notable Limitations:

  • The Gestalt Pattern Matching reward (Eq. 8) is a surface-level metric for code correctness — no discussion of semantic equivalence.
  • Training uses only 8k SFT instances and the LIMO dataset for RL; the data-efficiency claims are interesting, but generalizability to larger data regimes is unknown.
  • No analysis of failure modes or systematic error categorization.
  • The paper tests only on Qwen-family models; cross-architecture generalization is untested.
  • The comparison to concurrent work like SimpleTIR and AutoTIR is limited.
  • Reproducibility concerns: while hyperparameters are detailed, the MOSAIC pipeline involves complex multi-agent interactions with a proprietary model.

6. Additional Observations

The paper's framing around "System 2 deliberation" is somewhat loose: what is actually happening is extended chain-of-thought within each tool-use turn, which is a useful but more specific contribution than the framing suggests. The connection to dual-process theory is metaphorical rather than substantive.

The writing is generally clear but could benefit from tighter notation and more formal problem statements. The case study appendix effectively demonstrates the practical difference between thinking and non-thinking modes.

Rating: 6.5/10
Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7

Generated May 29, 2026

Comparison History (16)

    vs. Robust and Efficient Guardrails with Latent Reasoning
    gemini-3.1 · 5/29/2026

    Paper 1 addresses the critical frontier of scaling reasoning and tool use in LLMs via process-supervised RL. By integrating interleaved deliberation and achieving massive performance gains on rigorous benchmarks like AIME, it pushes the boundaries of agentic AI. While Paper 2 offers highly practical efficiency gains for safety guardrails, Paper 1's methodology fundamentally expands model capabilities and aligns with the highly impactful trend of reasoning-time scaling, suggesting broader implications for advancing general AI capabilities.

    vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay
    claude-opus-4.6 · 5/29/2026

    DeepTool addresses a more fundamental problem in tool-integrated reasoning—scaling deliberate thinking with process-supervised RL—showing dramatic improvements (e.g., AIME24: 3.2%→40.4%). Its contributions (action-centric process rewards, adversarial perturbation pipeline) are broadly applicable to any LLM reasoning with tools. ZipRL tackles the important but narrower problem of context compression in multi-turn settings. While both are methodologically rigorous and timely, DeepTool's broader impact on reasoning capabilities and its striking performance gains suggest higher scientific impact.

    vs. Quantifying and Optimizing Simplicity via Polynomial Representations
    gemini-3.1 · 5/29/2026

    Paper 1 addresses a highly critical and timely bottleneck in modern LLMs: scaling test-time deliberation and tool-use via process-supervised RL. Its massive empirical gains on hard reasoning benchmarks (e.g., AIME) signal immediate and transformative practical applications in AI agent development. While Paper 2 offers a strong fundamental contribution to deep learning theory and generalization, Paper 1's alignment with the current frontier of scaling reasoning capabilities gives it a significantly higher potential for immediate, widespread scientific and industrial impact.

    vs. Review Arcade: On the Human Alignment and Gameability of LLM Reviews
    gpt-5.2 · 5/29/2026

    Paper 1 has higher potential scientific impact due to a more novel technical contribution (process-supervised RL for interleaved tool use with action-centric rewards), strong demonstrated performance gains on widely recognized benchmarks, and broad applicability to agentic LLM systems, planning, and robust tool invocation across many domains. Its methodology appears more directly generalizable and likely to influence downstream model training and deployment. Paper 2 is timely and important for peer-review integrity, but its impact is narrower (scholarly workflow governance) and more contingent on policy adoption and specific review settings.

    vs. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop
    gemini-3.1 · 5/29/2026

    While Paper 1 offers strong empirical improvements in LLM tool use, Paper 2 addresses a fundamental, systemic issue in modern AI: model collapse from synthetic data in multi-agent environments. Its counter-intuitive finding that human curation can backfire in multi-model loops provides critical, timely insights for the long-term viability of AI alignment and foundation model training.

    vs. Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
    gemini-3.1 · 5/29/2026

    Paper 1 introduces a highly novel paradigm by repurposing reasoning models as natural context compressors, eliminating the need for dedicated compression modules. This conceptual shift offers broad, real-world utility by significantly reducing LLM inference costs and latency for long-context tasks. While Paper 2 presents strong empirical gains in tool-integrated reasoning using process-supervised RL, it represents a more incremental methodological advance following current RL trends. Paper 1's innovative approach addresses a fundamental architectural bottleneck, giving it a higher potential for broad scientific and practical impact.

    vs. A Policy-Driven Runtime Layer for Agentic LLM Serving
    gemini-3.1 · 5/29/2026

    Paper 1 advances the fundamental capabilities of LLMs in reasoning and tool-use via Process-Supervised RL, achieving massive gains on rigorous benchmarks like AIME24 with a 7B model. Enhancing LLM deliberation with tools is a critical frontier in AI research, promising broader algorithmic impact and generalizability across domains compared to Paper 2's system-level architectural optimizations for multi-agent serving.

    vs. Formalizing Mathematics at Scale
    claude-opus-4.6 · 5/29/2026

    Paper 2 presents a paradigm-shifting contribution: the first large-scale autoformalization of graduate-level mathematics, producing a verified library of 45,000+ Lean 4 declarations across 26 textbooks. This has broader and deeper impact across mathematics, formal verification, and AI-for-math. The open-source artifacts (AutoformBot + Atlas) create lasting infrastructure. While Paper 1 makes solid improvements to tool-integrated reasoning for LLMs, it represents an incremental advance in RL training methodology. Paper 2 demonstrates feasibility of a long-standing goal in mathematical formalization, enabling automated verification of research-level mathematics.

    vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
    claude-opus-4.6 · 5/29/2026

    DeepTool addresses a fundamental challenge in LLM reasoning with tool use, proposing a novel process-supervised RL framework that achieves dramatic performance improvements (e.g., 3.2%→40.4% on AIME24). Its contributions—interleaved deliberation, action-centric process rewards, and adversarial perturbation training—are broadly applicable across AI/ML. Paper 2 addresses a niche urban planning problem (tourist mobility in Tokyo) with incremental methodological contributions and limited generalizability. DeepTool's impact spans the rapidly growing LLM reasoning community, making it significantly more impactful.

    vs. Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher scientific impact: it proposes a generally applicable training framework (process-supervised RL for interleaved tool use) with large, quantifiable gains on widely recognized reasoning benchmarks, making it timely and broadly relevant to LLM capability scaling and agentic systems. Its methodological contribution (action-centric process rewards, robust trajectory synthesis) is transferable across tools and domains, with clear real-world applications in reliable tool-using agents. Paper 1 is valuable but more narrowly scoped to web-generation evaluation and may impact a smaller subcommunity.

    vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
    gpt-5.2 · 5/29/2026

    Paper 1 is likely to have broader and more durable impact: it introduces a diagnostic benchmark targeting the under-studied “harness” layer that governs real agent behavior, enabling reproducible, cross-model/cross-harness analysis with artifacts and traces. This addresses a timely reliability/auditability gap for deployed agents and can influence evaluation standards across academia and industry. Paper 2 is methodologically ambitious and shows large gains, but appears more model/training-specific and risks narrower generalization; its impact depends on adoption and reproducibility of the RL/synthesis pipeline.

    vs. AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios
    claude-opus-4.6 · 5/29/2026

    DeepTool presents a novel training framework (process-supervised RL for tool-integrated reasoning) with dramatic empirical improvements (e.g., AIME24: 3.2%→40.4%), introducing both a new data synthesis pipeline and a process reward mechanism. This addresses a fundamental limitation in how LLMs learn to use tools through RL, with broad applicability. AsyncTool, while valuable as a benchmark for asynchronous tool calling, is primarily an evaluation contribution with narrower scope. DeepTool's methodological innovations in training paradigms are likely to influence more subsequent research and have greater real-world impact on building capable reasoning agents.

    vs. From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs
    gemini-3.1 · 5/29/2026

    Paper 1 addresses a fundamental challenge in general AI—scaling deliberate reasoning and tool use in LLMs using process-supervised RL. Its massive performance boosts on complex benchmarks like AIME suggest broad, foundational implications across AI research and applications. In contrast, Paper 2, while highly innovative in its hierarchical approach to trajectory generation, is constrained to a specific domain (urban computing and smart cities), significantly limiting its overall breadth of impact compared to the general-purpose AI advancements in Paper 1.

    vs. Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance
    gpt-5.2 · 5/29/2026

    Paper 1 has higher estimated scientific impact due to stronger novelty and methodological contribution: it introduces a concrete, scalable framework (interleaved deliberation trajectories + adversarial robustness) and a specific process-supervised RL method with an action-centric reward, validated by large performance gains on multiple established benchmarks and efficiency analysis. This combination suggests clear, generalizable advances for tool-augmented LLM reasoning with broad applicability across domains. Paper 2 is timely and important for education, but is largely conceptual/architectural and appears less empirically validated, likely yielding narrower near-term scientific influence.

    vs. When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis
    gemini-3.1 · 5/29/2026

    Paper 2 tackles a fundamental challenge in AI—enhancing LLM tool-integrated reasoning and deliberation via process-supervised RL. Its broad applicability to agentic workflows and massive performance gains on standard benchmarks (e.g., AIME24 from 3.2% to 40.4%) indicate a high potential for widespread adoption across the AI community. While Paper 1 offers valuable insights into public policy text analysis and evaluation, Paper 2's methodological advancements in test-time compute and RL align with the most pressing, high-impact frontiers in core AI research.

    vs. MIRA: A Bilingual Benchmark for Medical Information Response Audit
    claude-opus-4.6 · 5/29/2026

    DeepTool demonstrates dramatically larger performance gains (e.g., AIME24: 3.2%→40.4%) on established benchmarks and introduces a novel framework combining process-supervised RL with tool-integrated reasoning that has broad applicability across many LLM tasks. Its methodological contributions (action-centric process rewards, adversarial perturbation pipeline) are more technically innovative and generalizable. While MIRA addresses an important health equity concern with the novel 'Differential Information Dilution' concept, its scope is narrower (medical Q&A auditing) and its mitigations show modest improvements (~6-8%). DeepTool's advances in reasoning capabilities have wider downstream impact.