Active Evidence-Seeking and Diagnostic Reasoning in Large Language Models for Clinical Decision Support

Chen Zhan, Xihe Qiu, Xiaoyu Tan, Xibing Zhuang, Gengchen Ma, Yue Zhang, Shuo Li, Peifeng Liu

May 21, 2026

arXiv:2605.22047v1

cs.AI (primary)

#808 of 2292 · Artificial Intelligence

Tournament Score: 1443 ± 44

Win Rate: 70%

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 5.5 · Clarity 6.5

Abstract

Large language models perform well on static medical examinations, yet clinical diagnosis often requires iterative evidence gathering under uncertainty. Building on prior interactive evaluation efforts, we introduce an OSCE-inspired standardized patient simulator and a controlled, reproducible benchmark for active diagnostic inquiry. Across 468 cases and 15 models in our protocol, we observe that multi-turn evidence seeking reduces diagnostic accuracy by 12.75% and lowers supporting-evidence quality by 24.36% relative to full-context evaluation; error analyses associate these drops with premature diagnostic closure and inefficient questioning. Together, these results suggest that static full-context benchmarks may overestimate performance in interactive evidence-seeking settings, motivating complementary interactive assessment for safer clinical decision support.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Active Evidence-Seeking and Diagnostic Reasoning in LLMs for Clinical Decision Support

1. Core Contribution

This paper introduces ROUNDS-Bench, an OSCE-inspired interactive evaluation framework that benchmarks LLMs on their ability to actively seek clinical evidence through multi-turn dialogue rather than passively diagnose from complete case information. The central finding is a consistent "capability gap": across 468 cases and 15 models, transitioning from full-context diagnosis (Task 1) to active evidence-seeking (Task 2) reduces diagnostic accuracy by 12.75% and evidence quality by 24.36% on average. The paper also introduces the concept of "hallucinated reasoning"—correct diagnoses produced without adequate evidentiary grounding—and proposes metrics (StrictEQ, FSA) to detect this failure mode.
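The "hallucinated reasoning" idea can be illustrated with a minimal sketch. It assumes per-case scores on the 0/1/2 accuracy and evidence-quality scales mentioned later in this assessment. The names `CaseScore`, `joint_success_rate`, and `hallucinated_rate`, and the thresholds, are hypothetical. This is not the paper's definition of StrictEQ or FSA, which this assessment does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseScore:
    accuracy: int  # 0, 1, or 2 (assumption: 2 = fully correct diagnosis)
    evidence: int  # 0, 1, or 2 (assumption: 2 = fully grounded evidence)


def joint_success_rate(scores, acc_min=2, ev_min=2):
    """Share of cases that are both correct and well grounded."""
    if not scores:
        return 0.0
    ok = sum(1 for s in scores if s.accuracy >= acc_min and s.evidence >= ev_min)
    return ok / len(scores)


def hallucinated_rate(scores, acc_min=2, ev_min=2):
    """Share of cases with a correct diagnosis but insufficient evidence."""
    if not scores:
        return 0.0
    flagged = sum(1 for s in scores if s.accuracy >= acc_min and s.evidence < ev_min)
    return flagged / len(scores)


# Illustrative scores only, not data from the paper.
example = [CaseScore(2, 2), CaseScore(2, 0), CaseScore(1, 2), CaseScore(0, 0)]
print(joint_success_rate(example))  # 0.25
print(hallucinated_rate(example))   # 0.25
```

In this toy example, accuracy alone would credit two of four cases, while the joint score credits only one, which is the kind of gap such a metric is meant to expose.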

The problem addressed is genuinely important: current medical LLM benchmarks overwhelmingly evaluate models under unrealistic full-information conditions, potentially inflating perceived readiness for clinical deployment. The paper makes a convincing case that interactive evidence acquisition is a qualitatively different cognitive task from static pattern matching.

2. Methodological Rigor

Strengths in design:

The dual-task within-subject design (same cases, same models, two information conditions) is well-constructed for separating the effect of information availability from that of reasoning ability.

The Information Gating Mechanism—requiring models to explicitly request specific modules/tests to unlock information—is a clean operationalization of progressive disclosure.

Stratification across 6 clinical systems (78 cases each) and 4 data sources mitigates domain and source bias.

The 15-model evaluation spans proprietary, open-weight, reasoning-enhanced, and distilled architectures across multiple parameter scales, enabling scaling law analysis.
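The information gating described above can be sketched roughly as follows. This is an illustrative toy, not the authors' simulator: the class name `GatedCase`, the module names, and the fallback message are hypothetical, and the verbatim return and 10-turn cap follow the description given elsewhere in this assessment.

```python
from dataclasses import dataclass, field

MAX_TURNS = 10  # turn cap as described in this assessment


@dataclass
class GatedCase:
    """A structured case whose modules stay hidden until explicitly requested."""

    modules: dict  # hypothetical layout: module name -> record text
    unlocked: set = field(default_factory=set)
    turns_used: int = 0

    def request(self, module: str) -> str:
        """Handle one action per turn and return the stored text verbatim."""
        if self.turns_used >= MAX_TURNS:
            raise RuntimeError("turn limit reached; a final diagnosis is required")
        # Assumption: an invalid request still consumes a turn.
        self.turns_used += 1
        if module not in self.modules:
            return "No such information is available."
        self.unlocked.add(module)
        return self.modules[module]


# Illustrative case content only.
case = GatedCase(modules={
    "chief_complaint": "Cough for 3 weeks.",
    "chest_xray": "Right lower lobe opacity.",
})
print(case.request("chest_xray"))
```

Keeping the unlocked set per case makes it possible to check afterwards whether a final diagnosis cites only evidence the model actually requested.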

Methodological concerns:

LLM-as-Judge reliance: Both the SP simulator (Qwen2.5-32B-Instruct) and the evaluator (DeepSeek-v3) are LLM-based. The authors acknowledge this but provide no inter-rater reliability with human clinicians. Given that DeepSeek-v3 is also an evaluated model, the potential for systematic bias (even with blinding) is non-trivial.

Coarse scoring granularity: The 0/1/2 scales for accuracy and evidence quality are quite coarse. Many nuanced diagnostic situations may be poorly captured by three levels.

SP simulator fidelity: The "verbatim retrieval" approach means the simulator returns exact text from structured records rather than naturalistic patient responses. This makes the interaction closer to EHR querying than patient interviewing, limiting ecological validity for the OSCE analogy.

Single-action-per-turn constraint and 10-turn limit: These are pragmatic but arbitrary. Real clinical encounters don't have such rigid constraints. The 10-turn cap especially may disadvantage models that take a more thorough, systematic approach.

No statistical testing: The paper reports percentage drops but no confidence intervals, significance tests, or variance estimates across cases, making it difficult to assess whether differences are robust.

Dataset construction uses LLMs (DeepSeek-v3) for structuring, introducing potential systematic biases in how information is parsed and segmented.
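One way the missing uncertainty estimates could be obtained is a paired bootstrap over cases, sketched below. The function name `paired_bootstrap_ci` and its parameters are hypothetical, the inputs are assumed to be per-case scores for the same cases under both conditions, and nothing here reproduces an analysis from the paper.

```python
import random


def paired_bootstrap_ci(full_ctx, interactive, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-case score drop (full minus interactive)."""
    if len(full_ctx) != len(interactive) or not full_ctx:
        raise ValueError("need two non-empty score lists of equal length")
    # Pair scores by case so that case difficulty cancels out.
    diffs = [f - i for f, i in zip(full_ctx, interactive)]
    n = len(diffs)
    rng = random.Random(seed)
    # Resample cases with replacement and record the mean drop each time.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper


# Illustrative scores on a 0/1/2 scale, not data from the paper.
task1 = [2, 2, 1, 2, 0, 2, 1, 2]
task2 = [2, 1, 1, 1, 0, 2, 0, 1]
mean_drop, lower, upper = paired_bootstrap_ci(task1, task2)
print(mean_drop, lower, upper)
```

An interval that excludes zero would support the claim that the drop is robust across cases rather than driven by a handful of hard ones.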

3. Potential Impact

The paper addresses a real and growing concern in medical AI evaluation. As LLMs are increasingly proposed for clinical decision support, the demonstration that static benchmark performance significantly overestimates interactive capability is valuable for:

Regulatory and safety frameworks: The evidence quality gap and "hallucinated reasoning" concept are directly relevant to discussions about deploying LLMs in clinical settings.

Model development: The finding that distilled reasoning models (DeepSeek-R1-Distill variants) perform particularly poorly in active settings, despite competitive static performance, suggests that current distillation approaches fail to transfer planning capabilities.

Benchmark design: ROUNDS-Bench could become a complementary evaluation tool alongside MedQA, though it would need human validation to gain broad adoption.

Training methodology: The results motivate RL-based training that rewards evidence-seeking efficiency and penalizes unsupported diagnoses.

However, impact may be limited by the fact that several prior works (MediQ, AMIE, MedQA-CS, MedAgentBench) have already established interactive evaluation as important. The paper positions itself as offering better controllability and reproducibility than these predecessors, but the comparison is largely qualitative rather than demonstrated empirically.

4. Timeliness & Relevance

The paper is highly timely. The rush to deploy LLMs in healthcare—combined with impressive but potentially misleading static benchmark scores—creates an urgent need for more realistic evaluations. The gap between static and interactive performance is a finding the community needs, even if the specific magnitude depends on benchmark design choices. The scaling law analysis and reasoning architecture comparisons are also timely given the rapid proliferation of model families.

5. Strengths & Limitations

Key strengths:

Clear, well-motivated research question with practical clinical relevance

Comprehensive model coverage (15 models, 4 scales, multiple architectures)

The FSA metric jointly capturing diagnostic accuracy and evidence grounding is a useful contribution

Code availability enhances reproducibility

Domain-specific analysis (respiratory vs. neurological) provides actionable insights

Notable limitations:

No human clinician validation of any component (SP simulator, evaluator, or benchmark cases)

The OSCE analogy is somewhat oversold—the interaction is more like structured database querying than patient communication

Absence of formal statistical analysis weakens quantitative claims

The "12.75% average drop" averages over models with very different capabilities and architectures, which hides per-model variation

Limited discussion of how prompt engineering or few-shot examples might close the gap

The paper is long and repetitive, with substantial redundancy between results, discussion, and figure captions
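The missing human validation noted above is usually quantified with an agreement statistic between the LLM judge and clinician raters. Below is a minimal sketch of unweighted Cohen's kappa on categorical labels such as the 0/1/2 scores. The function name and the example labels are hypothetical, and a weighted variant may suit ordinal scales better.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: LLM judge vs. a hypothetical clinician.
llm_judge = [2, 1, 0, 2, 1, 2]
clinician = [2, 1, 1, 2, 0, 2]
print(cohens_kappa(llm_judge, clinician))
```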

Summary

This is a solid benchmarking contribution that quantifies an important and timely phenomenon—the gap between static and interactive diagnostic performance in LLMs. The framework design is sensible though not without significant methodological caveats, particularly the absence of human validation and statistical rigor. The findings are largely expected but usefully quantified, and the evidence quality analysis adds meaningful safety-relevant insights. The work is incremental relative to prior interactive evaluation efforts but offers improved structure and reproducibility.

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 5.5 · Clarity 6.5

Generated May 22, 2026

Comparison History (20)

vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical, high-stakes domain (healthcare) by revealing significant flaws in how medical LLMs are currently evaluated. Demonstrating that static benchmarks overestimate interactive diagnostic performance has profound implications for patient safety and AI deployment in clinical settings. While Paper 2 provides a valuable benchmark for software agents, Paper 1's focus on clinical reasoning under uncertainty directly impacts the safe integration of AI in life-critical applications, giving it a broader and more urgent scientific and societal impact.

vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental conceptual gap in LLM research by providing a taxonomy and shared vocabulary for AI sycophancy—a cross-cutting concern affecting alignment, safety, and governance. Its breadth of impact spans multiple fields (AI safety, policy, HCI), and its contributions (taxonomy from 70 papers, survey of 106 experts) offer a foundational framework that will shape future research directions. Paper 1 makes a valuable but narrower contribution to clinical LLM evaluation. While important, its findings (that interactive settings reduce LLM accuracy) are somewhat expected, whereas Paper 2's conceptual clarification has broader, more lasting influence on the field.

vs. Cross-domain benchmarks reveal when coordinated AI agents improve scientific inference from partial evidence

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical safety gap in a high-stakes domain (healthcare) by demonstrating that static benchmarks significantly overestimate LLM performance in realistic, interactive clinical settings. This finding has immediate, profound implications for the deployment of medical AI. While Paper 2 offers a broad, multi-disciplinary methodological benchmark, Paper 1's direct relevance to patient safety and its timely critique of current AI evaluation paradigms give it higher potential for immediate and critical real-world impact.

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

gpt-5.2 · 5/22/2026

Paper 2 has higher likely impact because it introduces a controlled, reproducible interactive benchmark (OSCE-style simulator) directly addressing a timely safety-critical gap: LLM performance under active evidence seeking for clinical decision support. It evaluates many models across hundreds of cases with quantitative findings and error analysis, making it immediately useful to the medical AI, evaluation, and safety communities and likely to shape subsequent benchmarking and deployment guidance. Paper 1 is conceptually novel for agent system architecture, but its impact is more speculative and depends on broader adoption and stronger empirical validation.

vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

gpt-5.2 · 5/22/2026

Paper 2 introduces a principled new metric (Synergistic Faithfulness) grounded in Shapley interaction to isolate true cross-modal contributions, addressing a clear failure mode (evaluation collapse) in VLM explainability. It offers strong methodological rigor (theory-backed metric, high surrogate correlation, large speedup, multi-model/multi-dataset evaluation) and broad relevance across multimodal ML, XAI, safety, and auditing. Paper 1 provides an important benchmark and negative result for clinical LLM interaction, but its impact is narrower (clinical decision support) and more incremental relative to existing interactive evaluation efforts.

vs. MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

gpt-5.2 · 5/22/2026

Paper 2 is more novel and broadly impactful: it advances autonomous agents from prompt/memory tweaking to verified source-level self-rewriting with a concrete, deployable pipeline (evidence batching, deterministic stages, replay-based verification, safe promotion/rollback). This could generalize across many agent systems and application domains, with clear real-world operational benefits. Paper 1 offers timely, rigorous evaluation for clinical LLMs and important safety insights, but it is primarily a benchmarking/diagnostic analysis contribution with narrower domain impact compared to a general mechanism for self-improving production agents.

vs. Parametric Modular Answer Set Programs Made Declarative

gpt-5.2 · 5/22/2026

Paper 2 has higher likely impact: it introduces a timely, clinically relevant interactive benchmark (OSCE-inspired patient simulator) addressing a known evaluation gap for LLMs (active evidence seeking vs static full-context). The results are broadly applicable across AI safety, NLP evaluation, and medical decision support, and can directly influence model development, deployment practices, and regulatory/clinical assessment. Paper 1 is novel and rigorous within ASP theory, but its applications and cross-field reach are narrower, making its near-term real-world and broad scientific impact likely lower.

vs. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact due to its timeliness and direct real-world application to clinical decision support, where safety and evaluation methodology are urgent. Introducing an OSCE-inspired simulator and reproducible interactive benchmark across many models/cases can reshape how medical LLMs are assessed, revealing systematic shortcomings (premature closure, inefficient questioning) that static benchmarks miss. This evaluation framework is likely to influence both clinical AI deployment and broader interactive LLM benchmarking. Paper 1 is innovative and useful for long-horizon agents, but its impact is more engineering-focused and less immediately tied to high-stakes domains.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

gpt-5.2 · 5/22/2026

Paper 2 has higher impact potential due to stronger novelty and broader applicability: a large, guaranteed-solvable, multi-turn, long-horizon benchmark spanning 5 realistic drug-design task types, 102 targets, and wide chemical space, with a public leaderboard enabling community adoption and sustained progress. Its applications (autonomous computational drug design) are high-value and cross-cut chemistry, biology, and AI/tool-use agent research. Paper 1 is timely and rigorous for clinical LLM evaluation, but is narrower in domain scope and primarily diagnostic/benchmarking rather than directly enabling discovery pipelines.

vs. Scaling Observation-aware Planning in Uncertain Domains

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a highly timely and broadly impactful problem—evaluating LLMs for clinical decision support in realistic interactive settings. The finding that multi-turn evidence seeking significantly degrades LLM diagnostic performance challenges prevailing assumptions from static benchmarks, with direct implications for patient safety and AI deployment in healthcare. The benchmark and methodology are likely to influence a large research community working on LLMs in medicine. Paper 2, while technically impressive with major computational improvements, addresses a more specialized problem (POMDP sensor selection) with a narrower audience and less immediate societal impact.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a critical gap in evaluating LLMs for clinical decision support, revealing that static benchmarks significantly overestimate real-world diagnostic performance. This finding has broad implications for AI safety in healthcare, regulatory evaluation, and clinical deployment—areas with enormous societal impact. The 12.75% accuracy drop in interactive settings is a striking result that could reshape how medical AI is benchmarked. Paper 1, while technically solid, addresses incremental optimization of KV cache compression, a narrower systems-level concern with less cross-disciplinary reach and fewer direct real-world safety implications.

vs. Meta-Soft: Leveraging Composable Meta-Tokens for Context-Preserving KV Cache Compression

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to its direct relevance to high-stakes real-world clinical decision support and its timely contribution: a controlled, reproducible OSCE-inspired simulator plus a multi-model benchmark revealing a systematic gap between static and interactive evaluation. This can influence evaluation standards, safety practices, and regulatory expectations across medical AI and LLM alignment. Paper 1 is innovative and useful for long-context efficiency, but its impact is more specialized to systems optimization within LLM inference, with narrower cross-domain consequences than a clinically grounded benchmarking framework.

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical safety gap in a high-stakes domain (healthcare) by revealing how static LLM benchmarks overestimate clinical diagnostic abilities in realistic, multi-turn scenarios. Its focus on dynamic reasoning and medical AI safety gives it broader, more urgent real-world implications and higher potential for citations in both the machine learning and medical informatics communities compared to the pedagogical focus of Paper 2.

vs. Enhancing Visual Token Representations for Video Large Language Models via Training-Free Spatial-Temporal Pooling and Gridding

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical safety and performance gap in high-stakes clinical AI, demonstrating that static benchmarks overestimate LLM diagnostic abilities. Its focus on interactive evaluation and healthcare applications provides broader societal and interdisciplinary impact compared to Paper 2's domain-specific, technical optimization for video processing.

vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a highly timely and broadly impactful problem—evaluating LLMs for clinical decision support in realistic interactive settings. The finding that multi-turn evidence seeking significantly degrades LLM diagnostic accuracy challenges prevailing assumptions from static benchmarks and has immediate implications for patient safety, AI regulation, and future LLM development. Its large-scale benchmark (468 cases, 15 models) provides strong methodological rigor. Paper 2, while technically sound, addresses a niche topic (compositional semantics for assurance arguments using Subjective Logic) with a narrower audience in safety engineering. Paper 1's broader relevance across AI, medicine, and evaluation methodology gives it higher impact potential.

vs. A Causal Argumentation Method for Explainability of Machine Learning Models

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical and highly timely issue: the evaluation of LLMs in real-world, interactive clinical settings. By demonstrating that current static benchmarks significantly overestimate models' diagnostic capabilities, it directly impacts the safe deployment of AI in healthcare and introduces a much-needed methodological shift in evaluation. Paper 2 presents an interesting theoretical advancement in XAI, but Paper 1 has broader immediate relevance, higher potential real-world impact, and addresses a more urgent gap in current AI research.

vs. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact: it introduces an interactive, OSCE-inspired benchmark highlighting a critical mismatch between static LLM medical evals and real diagnostic workflows, directly affecting patient safety and clinical deployment. Its findings (performance drops, error modes like premature closure) are actionable and timely for LLM evaluation, alignment, and healthcare regulation, with broad implications for interactive AI systems beyond medicine. Paper 2 is a solid, practical benchmark for multi-page parsing with clear applications, but its scope is narrower and less societally high-stakes than clinical decision support.

vs. Towards Resilient and Autonomous Networks: A BlueSky Vision on AI-Native 6G

gemini-3.1 · 5/22/2026

Paper 2 presents concrete empirical findings and a novel reproducible benchmark in a highly critical field (medical AI). By demonstrating significant performance drops in interactive settings compared to static benchmarks, it provides actionable insights that directly impact patient safety and LLM evaluation methods. In contrast, Paper 1 is a visionary roadmap without empirical validation. Paper 2's methodological rigor, immediate real-world applicability in healthcare, and timely identification of current LLM limitations give it a much higher potential for measurable scientific impact.

vs. Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

gemini-3.1 · 5/22/2026

Paper 1 presents a foundational algorithmic contribution to LLM post-training, simplifying search-augmented reasoning without relying on external supervision. Its method of self-distillation and self-evolution addresses a core bottleneck in developing reasoning agents, offering broad applicability across all AI domains. While Paper 2 provides critical insights for medical AI safety, Paper 1's general-purpose training paradigm is likely to drive wider methodological shifts and achieve higher cross-disciplinary impact in the rapidly evolving landscape of self-improving AI.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gpt-5.2 · 5/22/2026

Paper 2 has higher impact potential due to a more broadly applicable, novel finding (inverse scaling where stronger LLMs make worse forecasts under superlinear growth/tail-risk), spanning multiple high-stakes domains (finance, epidemiology, macroeconomics) and supported by simulated and real-world replications plus mechanistic analysis (upper-tail decomposition, within-family scale vs post-training). It also identifies a widely relevant evaluation flaw (threshold metrics missing tail costs) with clear methodological recommendations. Paper 1 is timely and important for clinical CDS evaluation, but its scope is narrower and primarily benchmark-focused.