Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han, Weimiao Yu, Guangnan Ye, Chen Jiang, Yuan Cheng

#1179 of 2682 · Artificial Intelligence
Tournament Score: 1423±41
Win Rate: 53% (10 wins, 9 losses, 19 matches)
Rating: 6.5/10

Abstract

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot close this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Specifically, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes a meaningful gap in medical AI agent systems: the discrepancy between task-level tool selection (choosing the best single tool for a task) and instance-level tool selection (choosing the optimal tool for each individual case). The authors define the "Single-Oracle risk gap" to quantify this difference and propose CSRL (Collaborative Synergy Reinforcement Learning), a GRPO-based framework that learns instance-level tool selection policies. The framework consists of three interconnected components: (1) a Brier reward for probabilistic risk minimization, (2) an override reward for disagreement-aware synergy learning, and (3) entropy-guided sampling to upweight high-disagreement training instances.

The problem formulation is well-motivated. The empirical observation that different diagnostic tools exhibit non-overlapping failure patterns (Figure 1) provides a compelling case for instance-level selection. The gap between the best single tool and an Oracle selector is clearly demonstrated, establishing that there is recoverable performance to be gained.
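To make the gap concrete, the following is a minimal sketch of how the difference between the best fixed tool and an oracle selector could be measured from per-instance correctness records. The input format, the function name `single_oracle_gap`, and the toy data are assumptions for illustration; the paper's own definition is stated in terms of risk and may differ in detail.

```python
from typing import Dict, List


def single_oracle_gap(correct: Dict[str, List[bool]]) -> dict:
    """Compare the best fixed tool with an ideal per-instance selector.

    `correct[tool][i]` is True when `tool` is right on instance i
    (hypothetical input format; the paper's notation may differ).
    """
    tools = list(correct)
    n = len(correct[tools[0]])
    if any(len(correct[t]) != n for t in tools):
        raise ValueError("all tools must cover the same instances")

    # Risk (error rate) of each tool when it is used for every instance.
    risk = {t: 1.0 - sum(correct[t]) / n for t in tools}
    best_tool = min(risk, key=risk.get)

    # The oracle errs only on instances where every tool fails.
    oracle_errors = sum(
        1 for i in range(n) if not any(correct[t][i] for t in tools)
    )
    oracle_risk = oracle_errors / n

    return {
        "best_tool": best_tool,
        "single_risk": risk[best_tool],
        "oracle_risk": oracle_risk,
        "gap": risk[best_tool] - oracle_risk,
    }


# Toy data (invented): three tools whose failures do not overlap.
toy = {
    "tool_a": [True, True, False, True, False],
    "tool_b": [True, False, True, True, False],
    "tool_c": [False, True, True, False, True],
}
result = single_oracle_gap(toy)
```

In this toy case every tool has the same error rate, yet some tool is right on every instance, so the whole error of the best single tool is in principle recoverable by instance-level selection.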

Methodological Rigor

The theoretical grounding is relatively straightforward but appropriate. Propositions 3.1 and 3.2 establish alignment between the reward components and the optimization objectives—maximizing the Brier reward minimizes prediction risk, and maximizing the override reward increases the policy's improvement over the best single tool. These are clean results, though not technically deep.
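As a rough illustration of why these alignments hold, the sketch below uses a standard binary Brier reward, 1 - (p - y)^2, and an assumed form of the override reward that scores the policy against the best single tool. The exact reward definitions in the paper are not reproduced here; `override_reward` in particular is a hypothetical stand-in chosen so that its mean equals the accuracy improvement over the best tool.

```python
def brier_reward(p: float, y: int) -> float:
    """Reward 1 - (p - y)^2 for predicted probability p and binary label y."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    return 1.0 - (p - y) ** 2


def expected_brier_reward(p: float, q: float) -> float:
    """Expected reward when the label is 1 with true probability q."""
    return q * brier_reward(p, 1) + (1.0 - q) * brier_reward(p, 0)


def override_reward(policy_correct: bool, best_tool_correct: bool) -> int:
    """Assumed form: +1 for fixing the best tool's error, -1 for breaking it."""
    return int(policy_correct) - int(best_tool_correct)


# Properness check: on a grid, the expected reward peaks at p == q.
grid = [i / 100 for i in range(101)]
best_p = max(grid, key=lambda p: expected_brier_reward(p, 0.7))

# The mean override reward equals accuracy(policy) - accuracy(best tool).
policy_correct = [True, True, False, True]   # invented outcomes
tool_correct = [True, False, False, True]    # invented outcomes
mean_override = sum(
    override_reward(a, b) for a, b in zip(policy_correct, tool_correct)
) / len(policy_correct)
acc_gain = (sum(policy_correct) - sum(tool_correct)) / len(policy_correct)
```

The grid search reflects the proper-scoring-rule property the review alludes to: reporting the true probability maximizes the expected Brier reward, so maximizing that reward pushes predictions toward calibrated, low-risk outputs.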

The GRPO-based optimization framework is a reasonable architectural choice that builds on established policy optimization techniques. The entropy-guided sampling strategy is a sensible way to counter the under-representation of tool-disagreement cases, though upweighting difficult or informative examples is a well-established idea in machine learning.
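One way such entropy-guided sampling could work is sketched below: each training instance gets a weight that grows with the Shannon entropy of its tool outputs, so instances where tools disagree are drawn more often. The weighting scheme (`floor + entropy`), the function names, and the example labels are assumptions for illustration, not the paper's exact procedure.

```python
import math
import random
from collections import Counter
from typing import List, Sequence


def vote_entropy(votes: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of tool outputs."""
    counts = Counter(votes)
    total = len(votes)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def sampling_weights(
    all_votes: List[Sequence[str]], floor: float = 0.1
) -> List[float]:
    """Normalized weights proportional to floor + entropy (assumed scheme).

    The floor keeps unanimous instances from being dropped entirely.
    """
    raw = [floor + vote_entropy(v) for v in all_votes]
    z = sum(raw)
    return [r / z for r in raw]


# Invented tool outputs for three instances, from unanimous to fully split.
votes = [
    ["pneumonia", "pneumonia", "pneumonia"],
    ["pneumonia", "normal", "pneumonia"],
    ["pneumonia", "normal", "effusion"],
]
weights = sampling_weights(votes)

# Draw a training batch of instance indices according to the weights.
rng = random.Random(0)
batch = rng.choices(range(len(votes)), weights=weights, k=8)
```

A nonzero floor is a design choice: it keeps agreement cases in the training mix so the policy still learns when deferring to consensus is correct.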

One concern is the evaluation scope. While seven benchmarks across two tasks sounds comprehensive, both tasks are within chest X-ray analysis, which limits claims about generalizability. The tool pools are also relatively small (6 tools for Task 1, 3 for Task 2), and it's unclear how the method would scale to larger, more heterogeneous tool ecosystems.

The ablation study (Table 2) is thorough, systematically varying the policy model size, training data scale, dataset composition, sampling strategy, reward components, and tool budget. This provides good evidence for the contribution of each component, though some ablations show modest differences (e.g., removing the override reward decreases average Acc by only 0.9%).

Potential Impact

Practical relevance: The problem of combining imperfect medical AI tools is genuinely important for clinical deployment. No single diagnostic model is universally reliable, and the paper's approach of learning when to trust which tool on a case-by-case basis addresses a real need. The 7.5% accuracy and 6.8% F1 improvements over the strongest individual tool are clinically meaningful.

Broader applicability: While focused on medical imaging, the framework is conceptually applicable to any domain where multiple imperfect tools with complementary failure patterns exist. The instance-level selection formulation could extend to legal document analysis, financial risk assessment, or multi-source information retrieval.

Limitations on impact: The framework requires access to ground-truth labels for computing tool agreement statistics and training rewards, which may limit applicability in low-resource or novel clinical settings. The method also assumes a fixed, predefined tool pool—dynamic tool discovery or adaptation is not addressed.

Timeliness & Relevance

The paper is well-timed, sitting at the intersection of two active research areas: medical AI agents and reinforcement learning for tool use. The recent surge in MLLM-based medical agents (MedRAX, CheXagent, etc.) and RL-based tool learning (ToolRL, Search-R1) makes this contribution timely. The focus on safety and reliability in medical AI is increasingly important as these systems approach clinical deployment.

Strengths

1. Clear problem formulation: The Single-Oracle risk gap provides a clean theoretical framework for understanding when and why tool synergy matters.

2. Comprehensive evaluation: Comparison against 16+ baselines across four categories (general VLMs, medical MLLMs, closed-source models, specialist tools) plus combination baselines provides strong evidence of effectiveness.

3. Practical insights: The scalability analysis (Figure 3) revealing diminishing returns with more tools offers actionable guidelines for system designers.

4. Strong OOD performance: CSRL's consistent gains on out-of-domain datasets (ChestX-ray14, VinDr-CXR, NIH-Google, RSNA) suggest that learned synergy transfers beyond training distributions.

5. Ablation completeness: The systematic ablation across multiple dimensions strengthens confidence in design choices.

Limitations & Weaknesses

1. Narrow domain evaluation: Despite seven benchmarks, all are chest X-ray datasets. The VQA task provides some diversity, but claims about "medical agents" broadly are not fully substantiated.

2. Static tool pool assumption: The framework doesn't address how to handle tools being added or updated, both of which are routine in clinical environments.

3. Computational overhead: Training with 8×H200 GPUs, 16 rollouts per prompt, and up to 6 parallel tool calls per turn represents substantial compute. The inference cost analysis is limited.

4. Modest theoretical depth: The propositions, while correct, are essentially restatements of well-known properties of proper scoring rules and risk decomposition.

5. Limited comparison with ensemble methods: The paper compares against majority voting and logistic regression as combination baselines, but misses comparison with more sophisticated ensemble techniques (stacking, Bayesian model combination, etc.).

6. Reproducibility concerns: While hyperparameters are reported, the reliance on multiple proprietary and specialized tools makes full reproduction challenging.

7. No prospective or human evaluation: All evaluation is retrospective on existing benchmarks, with no assessment of how CSRL would perform in actual clinical workflows.

Overall Assessment

This paper makes a solid contribution to the medical AI agent literature by formalizing instance-level tool synergy and providing a practical RL-based framework to achieve it. The problem is well-motivated, the solution is technically sound if not groundbreaking, and the experimental evidence is comprehensive within its scope. The main limitations are the narrow domain focus and the gap between the paper's broad framing ("medical agents") and its specific evaluation (chest X-ray diagnosis). The work represents meaningful incremental progress rather than a paradigm shift.

Rating: 6.5/10
Significance: 6.5 · Rigor: 7 · Novelty: 6 · Clarity: 7.5

Generated May 27, 2026

Comparison History (19)

vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
gemini-3.1 · 5/28/2026

Paper 1 provides fundamental mechanistic insights into how LLMs allocate computational depth during complex, multi-turn agentic tasks. Its findings on layer-wise dynamics can broadly impact foundation model architecture design, efficient inference routing, and agent reasoning strategies across multiple domains. While Paper 2 offers a valuable applied framework for medical AI safety and tool use, Paper 1's foundational discoveries offer a significantly wider breadth of impact across the core AI and large language model research communities.

vs. SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment
gemini-3.1 · 5/28/2026

Paper 1 addresses a critical safety bottleneck in medical AI—unreliable tool use in clinical settings. By formulating instance-level tool selection to mitigate real-world failure risks, it offers immediate, high-stakes applications in healthcare. While Paper 2 presents a strong technical advancement for LLM agent skill internalization, Paper 1's focus on safety, reliability, and synergistic tool use in a highly sensitive domain gives it broader societal relevance and higher potential real-world impact.

vs. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG
gemini-3.1 · 5/28/2026

Paper 2 addresses a critical and widespread flaw in Retrieval-Augmented Generation (RAG) evaluation—citation laundering. By introducing a novel benchmark (FORCEBENCH) to evaluate evidence-force calibration, its impact spans across all domains utilizing RAG systems. While Paper 1 offers strong methodological innovations for medical AI, Paper 2 provides a foundational evaluation tool for the broader AI community, ensuring higher breadth of impact and timeliness given the ubiquitous deployment of cited RAG applications.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information
gpt-5.2 · 5/28/2026

Paper 1 targets a broadly observed, safety-critical failure mode in reasoning LMs—detecting missing information yet still answering—and formalizes it (detection-to-abstention gap) with a general control framework (Judge-Then-Solve) applicable beyond medicine. It offers a clear, widely transferable mechanism (explicit answerability commitment + RL shaping) with efficiency benefits and likely relevance across many deployments of reasoning models. Paper 2 is timely and rigorous but more domain- and setting-specific (medical tool ensembles/selection), making its cross-field breadth and generality somewhat narrower.

vs. Behavioural Analysis of Alignment Faking
claude-opus-4.6 · 5/28/2026

Paper 1 addresses alignment faking, a critical AI safety concern with broad implications as models become more capable. It provides a systematic decomposition of AF drivers (values, goal guarding, sycophancy), demonstrates AF across a wider range of models including small ones, and offers concrete detection/mitigation directions. This has fundamental implications for AI alignment research. Paper 2, while practically useful for medical AI tool selection, addresses a more incremental optimization problem with narrower scope. The alignment faking work is more timely and impactful given growing concerns about deceptive AI behavior.

vs. Do Clinical Models Change Treatment Decisions?
gemini-3.1 · 5/28/2026

Paper 2 introduces a novel benchmark and evaluation paradigm that challenges the current standard of static medical QA, revealing critical flaws in how clinical models handle shifting patient contexts. By exposing a fundamental gap between QA performance and clinical decision-making, it has the potential to broadly steer future research in clinical AI evaluation and model development, giving it a higher potential for widespread scientific impact than the specific algorithmic improvements proposed in Paper 1.

vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental and broadly applicable problem—instance-level tool selection under imperfect conditions—with a novel RL framework (GRPO-based) that has theoretical grounding (Single-Oracle risk gap) and practical implications for safety-critical medical AI. Its contributions (instance-level selection formulation, disagreement-aware synergy learning, entropy-guided sampling) are generalizable beyond medicine. Paper 1, while practically valuable for ERP benchmarking, is more narrowly focused on a specific benchmark generation methodology. Paper 2's methodological innovations and validation across seven benchmarks suggest broader scientific influence.

vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
gpt-5.2 · 5/27/2026

Paper 2 has higher potential impact due to a clearer path to real-world deployment in high-stakes medicine, addressing a practical and under-studied failure mode (imperfect tools) with an instance-level selection formulation and concrete learning framework. It proposes methodological innovations (risk-aware + disagreement/synergy learning with entropy-guided sampling) and reports consistent gains across multiple medical benchmarks, suggesting rigor and near-term relevance. Paper 1 is valuable as a diagnostic benchmark for LLM skill formation, but it is primarily evaluative and its impact depends on subsequent methods adopting and improving on the benchmark.

vs. Can LLMs Introspect? A Reality Check
gemini-3.1 · 5/27/2026

Paper 1 challenges fundamental assumptions about LLM metacognition, offering a critical re-evaluation that impacts AI safety, alignment, and cognitive science. By highlighting flaws in current evaluation paradigms, it has the potential to redirect future theoretical and empirical research across the broader AI community. While Paper 2 offers significant practical advancements in medical AI, Paper 1 addresses a more foundational scientific question with broader interdisciplinary implications.

vs. Whose Alignment? Comparing LLM Process Alignment Across Diverse Organizational Decision Contexts
claude-opus-4.6 · 5/27/2026

Paper 2 addresses a broadly applicable problem (tool failure in medical AI agents) with a novel RL-based framework that demonstrates consistent improvements across seven benchmarks. Its practical relevance to clinical safety, methodological contribution (GRPO-based instance-level tool selection with disagreement-aware learning), and breadth of experimental validation give it higher potential impact. Paper 1 offers valuable conceptual insights on pluralistic alignment with process-level measurement, but its findings are more domain-specific (legal/credit decisions) and primarily diagnostic rather than providing a scalable solution, limiting its broader impact.

vs. From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch
gemini-3.1 · 5/27/2026

Paper 2 addresses a highly critical and timely global challenge: the escalating energy and water demands of data centers. By introducing a differentiable optimization layer to integrate virtual water impacts into power system dispatch, it provides a novel, interdisciplinary methodological advancement. Its potential to directly mitigate the environmental footprint of AI and cloud computing gives it exceptional real-world applicability and systemic impact across the sustainability, energy, and computing fields.

vs. Back to Parsimonious Latents: Learning Task-Centric World Models from Visual Foundations
gemini-3.1 · 5/27/2026

Paper 1 addresses a critical and highly sensitive problem: the safety and reliability of medical AI agents when utilizing external tools. Its focus on mitigating instance-level tool failures has immediate, high-stakes real-world implications for clinical settings. While Paper 2 offers strong theoretical advancements in world models for robotics, Paper 1's intersection of AI safety, reinforcement learning, and healthcare provides a more urgent and broadly impactful contribution to the rapidly deploying field of medical AI.

vs. Benchmarking the Limits of In-Context Reinforcement Learning for Ad-Hoc Teamwork
gemini-3.1 · 5/27/2026

Paper 2 addresses a critical safety and reliability issue in medical AI agents, offering a novel RL-based framework with strong real-world applicability. While Paper 1 provides a valuable benchmark for multi-agent RL, Paper 2's focus on clinical safety, mitigating tool failures, and immediate relevance to the rapidly growing field of LLM agents gives it broader and more profound potential scientific and societal impact.

vs. Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents
gemini-3.1 · 5/27/2026

Paper 2 introduces a highly novel, cross-domain actuarial framework for autonomous agent safety, offering a fresh perspective on AI alignment through risk pricing and reserve capital. This approach addresses a critical bottleneck in deploying autonomous agents with real-world side effects. In contrast, while Paper 1 is methodologically sound and highly relevant to medical AI, its impact is more narrowly focused on a specific domain, making Paper 2's broader, foundational contribution likely to have a wider scientific and practical impact.

vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
gpt-5.2 · 5/27/2026

Paper 2 has higher potential impact due to a clearer, generalizable methodological contribution (instance-level tool selection under imperfect tools with RL and disagreement/entropy mechanisms) that applies broadly to medical agentic systems and tool-augmented LLMs beyond a single deployment. It addresses a timely, widely recognized failure mode (tool unreliability) with evaluation across multiple benchmarks, suggesting stronger rigor and reproducibility. Paper 1 is valuable and practical, but is more architecture/integration-centric and partially tied to one real-world system, with verification guarantees limited to structured requirements and embedding-based checks for semantics.

vs. 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
gemini-3.1 · 5/27/2026

Paper 1 addresses a highly timely and critical issue in the rapidly growing field of medical AI agents: tool reliability and safety. Its approach to mitigating instance-level tool failures has direct, high-stakes real-world applications in healthcare. In contrast, Paper 2 focuses on theoretical complexity and implementation details of a specific logic programming fragment (ASP(Q)), which, while rigorous, has a much narrower breadth of impact and caters to a more niche academic audience.

vs. DemoEvolve: Overcoming Sparse Feedback in Agentic Harness Evolution with Demonstrations
gemini-3.1 · 5/27/2026

Paper 1 tackles a critical safety issue in medical AI—handling tool failures—which has profound real-world implications for patient safety and clinical adoption. Its proposed instance-level RL framework addresses a realistic and urgent gap in medical agent reliability. In contrast, while Paper 2 presents an innovative approach to agent harness evolution, its current validation on game environments suggests a narrower immediate real-world impact compared to the life-critical and broad applications of Paper 1.

vs. From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation
gpt-5.2 · 5/27/2026

Paper 2 has higher likely impact due to broader applicability (general medical agent tool-use reliability) and stronger methodological contribution (instance-level tool selection framed via a defined risk gap, RL optimization, and disagreement/entropy-driven sampling) validated across multiple tasks and seven benchmarks. Its focus on safety-critical, real-world tool failures is timely and relevant for clinical deployment, and the ideas generalize to other agentic settings with imperfect tools. Paper 1 is valuable but more domain- and dataset-specific (French marine environmental law) with narrower cross-field reach.

vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
gpt-5.2 · 5/27/2026

Paper 2 is likely higher impact due to stronger novelty and timeliness: it targets diffusion LLM safety monitoring, an underexplored and rapidly emerging model class, and leverages a distinctive diffusion-specific signal (trajectory “hesitation”) unavailable to AR-LLMs. The proposed dynamic routing monitor is broadly applicable to safety deployment with clear real-world benefits (efficient always-on moderation) and general relevance across many D-LLMs and datasets. Paper 1 addresses an important medical agent issue, but its impact may be narrower (medical tool orchestration) and more incremental relative to existing RL-based tool-selection work.