ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models
Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan
Abstract
Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.
AI Impact Assessments
Scientific Impact Assessment: ClinicalMC
1. Core Contribution
ClinicalMC addresses a genuine gap in clinical NLP benchmarking: the absence of multi-course evaluation scenarios that reflect how patient conditions evolve over multiple treatment rounds during hospitalization. While existing benchmarks (ClinicalLab, AI Hospital, MedJourney) evaluate single-course decision-making—essentially one round of diagnosis and treatment—ClinicalMC models the iterative admission-to-discharge process across multiple clinical courses (averaging 5.11 courses in English, 3.42 in Chinese). The benchmark spans four clinical stages: triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. It includes 1,275 Chinese and 5,804 English samples across 16 and 24 departments, respectively.
The paper also introduces SimHospital, a multi-agent evaluation framework with patient, examiner, and doctor agents, enabling both single-turn static and multi-turn dynamic evaluation settings.
2. Methodological Rigor
Data Construction: The pipeline is reasonably thorough. Chinese data derives from MedEureka EHRs with two-stage anonymization and filtering (6,947→3,317 records). English data is sourced from PMC-Patients case reports (167,034→6,748), filtered via GPT-4o for completeness and multi-course structure. A perplexity analysis comparing GPT-4o-converted EHRs against real MIMIC-III/MIMIC-IV discharge summaries shows reasonable alignment (6.064 vs. 3.144 for MIMIC-IV).
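For readers unfamiliar with the perplexity comparison mentioned above, the following is a minimal sketch of how perplexity is conventionally computed from per-token log-probabilities. The function name and the idea of passing in precomputed natural-log probabilities are assumptions made for illustration; the assessment does not say which scoring model or tooling the authors used.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log probability per token.

    `token_logprobs` is a hypothetical input: one natural-log probability per
    token, as produced by whatever language model scores the text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A text whose every token gets probability 0.25 has perplexity 4.
uniform_text = [math.log(0.25)] * 4
print(perplexity(uniform_text))
```

Lower values mean the scoring model finds the text more predictable, so two corpora with similar perplexity under the same scorer are, by this one measure, similarly "natural" to that model. It says nothing directly about clinical correctness.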
Quality Control: A three-round annotation workflow (LLM segmentation → medical student verification → dual clinician review) with batch-based iterative validation is employed. The reported 93.3% quality pass rate and Cohen's kappa of 0.85 are respectable but not extraordinary. However, significant reliance on GPT-4o for converting PMC-Patients case reports into full EHRs with multiple courses raises concerns about synthetic data fidelity—despite perplexity validation, the clinical realism of synthetically generated progress notes remains questionable.
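As a reference point for the kappa figure above, here is a minimal sketch of Cohen's kappa for two raters labelling the same items. The label values and the two-rater framing are assumptions for illustration; the assessment does not state which pair of annotators or which label set the reported 0.85 was computed over.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgements from two reviewers on four records.
reviewer_1 = ["pass", "pass", "fail", "fail"]
reviewer_2 = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

In this toy case the reviewers agree on 3 of 4 records (0.75), chance agreement is 0.5, and kappa is therefore 0.5, which shows how a raw agreement rate overstates reliability when labels are imbalanced.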
Evaluation Framework: Using GPT-4o-mini for the patient and examiner agents introduces a dependency that could influence results, though the ablation study (Appendix A.7), which replaces it with Qwen3 and DeepSeek-V3.2, shows relative stability. The evaluation metrics are appropriately diverse: Accuracy for triage, Recall for examinations, F1 for diagnoses, IoU for treatment plans, and LLM-scored assessments for diagnostic basis. However, using an LLM to score diagnostic reasoning (PB_Score, FB_Score, DD_Score) is inherently circular when the capability under evaluation is itself LLM reasoning.
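The set-based metrics named above can be sketched as follows. This is a minimal illustration that assumes predictions and gold labels are compared by exact string match over sets; the paper may normalise terms or match them differently, and the function names and example items are hypothetical.

```python
def recall(predicted, gold):
    """Fraction of gold items that the model also produced."""
    predicted, gold = set(predicted), set(gold)
    return len(predicted & gold) / len(gold) if gold else 0.0


def f1(predicted, gold):
    """Harmonic mean of precision and recall over the two sets."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    rec = overlap / len(gold)
    return 2 * precision * rec / (precision + rec)


def iou(predicted, gold):
    """Intersection over union (Jaccard index) of the two sets."""
    predicted, gold = set(predicted), set(gold)
    union = predicted | gold
    return len(predicted & gold) / len(union) if union else 0.0


# Hypothetical example: three predicted items against four gold items.
predicted_items = ["CBC", "chest X-ray", "ECG"]
gold_items = ["CBC", "ECG", "troponin", "echocardiogram"]
print(recall(predicted_items, gold_items))  # 0.5
print(iou(predicted_items, gold_items))     # 0.4
```

Recall ignores extra predictions, which suits examination ordering where over-ordering is penalised less than omissions, whereas IoU penalises both missing and superfluous items. That difference is one plausible reason to pair IoU with treatment plans.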
Experimental Design: The inclusion of Asclepius-Llama2 (trained on PMC-Patients) as a data leakage control is a thoughtful design choice, and its poor performance (12.58-12.61% average) suggests the benchmark genuinely tests reasoning beyond memorization. Human performance baselines (85-87.5%) provide useful reference points, though they rest on only 100 samples from a single medical student.
3. Potential Impact
The benchmark fills a meaningful niche. Multi-course clinical reasoning is indeed closer to real inpatient care than single-pass diagnosis. The dramatic performance gap between LLMs (~43-50% best average) and human performance (~85-87%) provides clear evidence that current models struggle with longitudinal clinical reasoning. Key findings—that performance degrades in multi-course versus single-course settings, and that error cascading occurs in dynamic settings—have practical implications for clinical AI deployment.
However, the impact is somewhat bounded by: (a) the synthetic nature of much of the data, particularly the English dataset; (b) restriction to text-only modalities; and (c) the narrow applicability to Chinese and English clinical contexts. The benchmark would be more impactful if it incorporated real multi-course EHRs rather than LLM-generated approximations.
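The error-cascading finding discussed in this section can be illustrated with a toy sketch. It assumes, purely for illustration, that the static setting conditions each course on the gold record of earlier courses while the dynamic setting conditions it on the model's own earlier outputs; the assessment does not spell out the settings at this level of detail. `toy_doctor`, the plan labels, and the observations are all hypothetical and are not part of SimHospital.

```python
def toy_doctor(history, observation):
    """Hypothetical stand-in for a doctor agent: it keeps the latest plan in
    its context unless the change is obvious, so it misses subtle changes."""
    if not history:
        return "plan A"
    if observation == "obvious change":
        return "plan B"
    return history[-1]  # "stable" and "subtle change" both continue the last plan


def run_static(doctor, courses):
    """Each course sees the gold plans of all earlier courses as context."""
    return [
        doctor([c["gold"] for c in courses[:i]], course["observation"])
        for i, course in enumerate(courses)
    ]


def run_dynamic(doctor, courses):
    """Each course sees the doctor's own earlier outputs as context."""
    outputs = []
    for course in courses:
        outputs.append(doctor(list(outputs), course["observation"]))
    return outputs


def accuracy(outputs, courses):
    """Fraction of courses where the output matches the gold plan."""
    return sum(o == c["gold"] for o, c in zip(outputs, courses)) / len(courses)


courses = [
    {"observation": "admission", "gold": "plan A"},
    {"observation": "subtle change", "gold": "plan B"},
    {"observation": "stable", "gold": "plan B"},
    {"observation": "stable", "gold": "plan B"},
]
print(accuracy(run_static(toy_doctor, courses), courses))   # 0.75
print(accuracy(run_dynamic(toy_doctor, courses), courses))  # 0.25
```

The same single mistake (missing the subtle change in the second course) costs one course in the static run but three in the dynamic run, because the wrong plan is carried forward as context. This is only a caricature of the mechanism, not a reproduction of the paper's results.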
4. Timeliness & Relevance
This work is timely given the rapid deployment of LLMs in healthcare settings and the recognized need for more realistic evaluation beyond exam-style Q&A. The shift toward evaluating temporal reasoning and treatment adjustment over multiple encounters addresses a genuine bottleneck. The bilingual (Chinese-English) nature also adds practical value given the global interest in medical LLMs.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations:
The error analysis revealing that ~30% of errors are "Redundant Diagnostic and Treatment Plans" and ~20% involve "Failure to Detect Subtle but Critical Changes" provides valuable direction for future model improvement. The finding that assessment performance actually improves with more courses (while examination and treatment degrade) is an interesting insight about how accumulated context helps condition evaluation but hinders specific action planning.
Generated Jun 3, 2026
Comparison History (21)
Paper 2 addresses a fundamental and widespread challenge in AI: LLM self-correction. By introducing a structured method for error localization that improves reasoning across domains, it has broad foundational impact. Paper 1, while highly valuable, offers a domain-specific benchmark for clinical decision-making, giving it a narrower scope of impact compared to the core algorithmic advancements in Paper 2.
Paper 2 introduces a novel benchmark for multi-course clinical decision-making, addressing a critical gap in current LLM evaluation. Benchmark papers in the LLM domain typically yield broader scientific impact and higher citation rates than specialized algorithmic improvements. Its bilingual dataset, dynamic multi-agent evaluation framework, and relevance to healthcare deployment give it extensive interdisciplinary appeal compared to Paper 1's domain-specific focus on autonomous driving trajectory prediction.
ClinicalMC addresses a more impactful gap—benchmarking LLMs for multi-course clinical decision-making across the full patient journey. Healthcare is a high-stakes domain with broad societal impact. The benchmark's bilingual design (Chinese/English), multi-agent evaluation framework, and comprehensive coverage from admission to discharge offer lasting utility for the research community. While Paper 1 presents solid methodological contributions (CPO with synthetic contrastive traces for multi-table QA), its scope is narrower and more incremental. Paper 2's benchmark nature means it will likely be widely adopted and cited across medical AI research.
Paper 1 introduces a novel, technically substantive residual-centric framework for high-fidelity learned scientific data compression, with clear methodological contributions (deterministic residual coding pipeline; neural-guided Lorenzo residual with deterministic decoding) and strong, quantified gains across multiple major scientific datasets at stringent error targets. Its impact spans HPC, numerical simulation workflows, storage/I/O, and learned compression, with immediate deployability. Paper 2 is timely and useful as a benchmark, but benchmarks often face faster obsolescence, higher sensitivity to curation/bias, and narrower methodological novelty; its real-world clinical impact is indirect without demonstrated deployment or safety validation.
Paper 2 likely has higher scientific impact: it proposes a broadly applicable methodological advance for improving LLM reasoning via consistency-regulated, step-level reward decomposition, addressing key RL/self-alignment bottlenecks (noise, sparse supervision, collapse). If rigorous and reproducible (code provided), it can influence multiple subfields (RLHF/RLAIF, reasoning, training dynamics) and downstream domains. Paper 1 is valuable and timely for healthcare evaluation, but as a benchmark/framework it is more domain-specific and its impact depends on adoption and dataset quality; it advances assessment rather than core learning methods.
Paper 2 introduces a concrete, highly relevant benchmark and evaluation framework for LLMs in healthcare, addressing a critical bottleneck in multi-course clinical decision-making. Its empirical approach, large-scale dataset, and direct real-world applicability in medical AI provide broader utility, higher citation potential, and more immediate practical impact compared to the abstract, philosophical arguments presented in Paper 1.
ClinicalMC addresses a significant gap in clinical AI evaluation by introducing a benchmark for multi-course clinical decision-making, which better reflects real-world healthcare complexity. Its broader impact across healthcare AI, systematic evaluation framework with multi-agent assessment, and bilingual dataset make it more impactful. Paper 1, while practically useful for cost reduction, addresses a narrower engineering optimization problem. Healthcare benchmarks tend to have wider adoption and influence across both AI and medical communities, and the temporal/longitudinal aspect of clinical reasoning is a timely and important research direction.
Paper 1 has higher impact potential due to a more novel, generalizable contribution: it identifies and explains a failure mode (critique-induced confusion), proposes a predictive condition for when debate helps vs hurts, and validates it via large factorial experiments plus external generalization across 19 prior studies and multiple domains. This yields actionable design principles for multi-agent systems broadly (beyond data cleaning). Paper 2 is timely and application-relevant, but primarily contributes a benchmark/evaluation framework in one domain; its broader methodological novelty and cross-field generalization appear more limited.
Paper 2 is likely to have higher scientific impact due to strong timeliness and broad real-world relevance: it introduces a substantial bilingual benchmark for longitudinal (multi-course) clinical decision-making, a key gap in current LLM evaluation. Benchmarks often become community standards, enabling reproducibility and rapid cross-model progress across NLP, healthcare AI, and agent evaluation. Paper 1 is methodologically rigorous and novel for scalable constrained MARL with guarantees, but its immediate impact may be narrower to MARL specialists and specific coordination/constraint structures, whereas ClinicalMC can influence many downstream studies and deployments.
Paper 1 addresses a critical, highly timely issue in foundational AI research: the failure modes of test-time compute in Large Reasoning Models. Its insights into 'harmful overthinking' and logical drift have broad implications across all LLM applications, potentially altering how inference scaling is designed. While Paper 2 presents a valuable domain-specific medical benchmark, Paper 1's fundamental methodological contributions to AI reasoning and alignment offer broader, more paradigm-shifting scientific impact across the entire field of artificial intelligence.
Paper 1 addresses a high-stakes, critical domain (healthcare) by introducing a benchmark for multi-course clinical decision-making. Tackling the evolution of patient conditions over time represents a significant leap in medical AI evaluation. In contrast, Paper 2 focuses on a narrower educational application (automated grading for CS1), which, while methodologically sound, has a more limited scope and lower potential for broad real-world impact compared to advancing AI in clinical settings.
ANDES addresses a fundamental challenge in automated AI alignment—autonomous data curation for post-training—by introducing a novel framework that reimagines data generation as a reusable agent skill. Its self-evolving routing mechanism and closed-loop interface represent genuine methodological innovation with broad applicability across AI development. It achieves SOTA on PostTrainBench and demonstrates cross-task generalization. ClinicalMC, while valuable, is primarily a benchmark contribution for a specific domain (clinical decision-making) with more incremental novelty. ANDES has wider impact potential across the rapidly growing AI agents and alignment research fields.
Paper 2 pioneers a novel integration of LLMs with physics-based simulations for materials synthesis, addressing a critical bottleneck in discovering and manufacturing novel inorganic materials. While Paper 1 provides a valuable benchmark for clinical LLMs, Paper 2's approach represents a transformative paradigm shift in AI for science, offering broad methodological implications for bridging data-driven generative models with rigorous physical constraints.
Paper 2 is more novel and broadly applicable: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, yielding a compact pseudo-tool library that improves performance and reduces inference cost across diverse reasoning/planning tasks. This has wide impact potential for agentic LLM design, efficiency, and automation beyond any single domain. Paper 1 is timely and valuable for healthcare evaluation, but as a benchmark its impact is narrower, more dataset-dependent, and primarily methodological/assessment-focused rather than introducing a generalizable new learning/control mechanism.
Paper 1 (AGENTCL) addresses a fundamental challenge in language agent research—rigorous evaluation of continual learning—with broader applicability across AI. It introduces a novel evaluation framework with controlled task streams, transfer metrics, and a diagnostic probing method (MemProbe), contributing methodological innovations applicable to the growing field of language agents. Paper 2 (ClinicalMC) is a valuable domain-specific benchmark for clinical decision-making but is narrower in scope. AGENTCL's contributions to understanding memory, plasticity, and knowledge reuse in agents have wider cross-field impact and address a timely, foundational problem.
Paper 2 presents a fundamental methodological advancement in agentic reinforcement learning (EAPO), addressing a critical bottleneck (tool abuse/efficiency) applicable to LLM agents across all domains. While Paper 1 introduces a valuable medical benchmark, Paper 2's algorithmic improvements to reasoning efficiency and accuracy over state-of-the-art baselines like GRPO offer broader, immediate impact across the entire AI ecosystem.
LEAP demonstrates a major breakthrough in formal theorem proving, solving all 12 Putnam 2025 problems and significantly outperforming specialized IMO systems on a new benchmark. It introduces a novel agentic framework bridging informal reasoning with formal verification, with demonstrated utility on open research problems. Its impact spans AI, mathematics, and formal verification. ClinicalMC, while valuable, is primarily a benchmark contribution for clinical LLM evaluation—important but more incremental, addressing a narrower evaluation gap without introducing fundamentally new methods.
Paper 2 introduces a novel and generalizable framework (PDA with GAM) for aggregating weak supervision signals to improve strong LLMs, addressing the fundamental challenge of scarce high-quality training data. Its methodological innovation—geometric alignment merging for LoRA adapters—is broadly applicable across domains and model scales. Paper 1 contributes a useful benchmark for clinical LLM evaluation, but benchmarks typically have narrower impact than new training methodologies. Paper 2's approach has broader applicability, stronger methodological novelty, and greater potential to influence future LLM training paradigms.
Paper 2 addresses a critical gap in healthcare AI by introducing a large-scale, bilingual benchmark for multi-course clinical decision-making. Its focus on dynamic, evolving patient scenarios offers significant real-world applications and has a broader, more immediate impact on improving medical LLMs and patient care compared to the specialized social simulation framework in Paper 1.
Paper 1 likely has higher scientific impact due to a concrete, novel benchmark addressing an important evaluation gap (multi-course, longitudinal clinical decision-making) with substantial bilingual data and an explicit multi-agent evaluation framework. It is directly applicable to healthcare model validation and deployment, supporting reproducibility and downstream method development. Paper 2 is timely and potentially influential conceptually, but as a visionary survey/framework without new experimental results, its impact is more speculative and harder to operationalize. Overall, Paper 1 offers clearer methodological rigor and immediate real-world utility.