ClinicalMC: A Benchmark for Multi-Course Clinical Decision-Making with Large Language Models

Ruihui Hou, Siyi Zhu, Ziyue Huai, Guangya Yu, Yongqi Fan, Chunming Wang, Tong Ruan

Rank: #1888 of 3404 · Artificial Intelligence
Tournament Score: 1390±44 · Win Rate: 43% (9 wins, 12 losses, 21 matches)
Rating: 5.8/10

Abstract

Large language models (LLMs) have been widely adopted in healthcare, yet they still encounter significant challenges in complex clinical decision-making scenarios. Existing benchmarks primarily assess LLM performance in single-course settings and lack systematic evaluation in multi-course scenarios, where a patient's condition evolves over time. To address this gap, we propose ClinicalMC, a benchmark for multi-course clinical decision-making. It includes 1,275 Chinese and 5,804 English samples across four stages from admission to discharge. These stages cover triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. In ClinicalMC, patients in the English dataset undergo an average of 5.11 clinical courses, whereas those in the Chinese dataset undergo 3.42. To assess LLM performance, we construct a multi-agent evaluation framework that includes patient, examiner, and doctor agents. Based on the benchmark and framework, we design two experimental settings -- a single-turn static setting and a multi-turn dynamic setting -- and assess three categories of LLMs: 1) closed-source LLMs like GPT5-mini; 2) open-source LLMs like DeepSeek-V3.2; and 3) medical LLMs like HuatuoGPT-o1. Through extensive evaluation, we aim to better understand LLM performance in the medical domain and support its effective deployment in healthcare.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ClinicalMC

1. Core Contribution

ClinicalMC addresses a genuine gap in clinical NLP benchmarking: the absence of multi-course evaluation scenarios that reflect how patient conditions evolve over multiple treatment rounds during hospitalization. While existing benchmarks (ClinicalLab, AI Hospital, MedJourney) evaluate single-course decision-making—essentially one round of diagnosis and treatment—ClinicalMC models the iterative admission-to-discharge process across multiple clinical courses (averaging 5.11 courses in English, 3.42 in Chinese). The benchmark spans four clinical stages: triage, first-course examination/diagnosis/treatment, subsequent multi-course examination/assessment/treatment, and final diagnosis. It includes 1,275 Chinese and 5,804 English samples across 16 and 24 departments, respectively.

The paper also introduces SimHospital, a multi-agent evaluation framework with patient, examiner, and doctor agents, enabling both single-turn static and multi-turn dynamic evaluation settings.

2. Methodological Rigor

Data Construction: The pipeline is reasonably thorough. Chinese data derives from MedEureka EHRs with two-stage anonymization and filtering (6,947→3,317 records). English data is sourced from PMC-Patients case reports (167,034→6,748), filtered via GPT-4o for completeness and multi-course structure. A perplexity analysis comparing GPT-4o-converted EHRs against real MIMIC-III/MIMIC-IV discharge summaries shows reasonable alignment (6.064 vs. 3.144 for MIMIC-IV).
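As a reminder of what the perplexity comparison measures, here is a minimal sketch that turns per-token log-probabilities into a perplexity score. The function name and the input values are hypothetical; the paper's actual scoring model and tokenization are not described in this review, so this shows only the standard definition, not the authors' pipeline.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean of natural-log token probabilities)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical input: every token was assigned probability 0.25,
# so the model is as uncertain as a uniform choice among 4 options.
uniform_quarter = [math.log(0.25)] * 8
print(round(perplexity(uniform_quarter), 6))
```

Lower values mean the scoring model finds the text more predictable, so comparing scores for converted and real records is only meaningful when the same scoring model and tokenizer are used for both.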

Quality Control: A three-round annotation workflow (LLM segmentation → medical student verification → dual clinician review) with batch-based iterative validation is employed. The reported 93.3% quality pass rate and Cohen's kappa of 0.85 are respectable but not extraordinary. However, significant reliance on GPT-4o for converting PMC-Patients case reports into full EHRs with multiple courses raises concerns about synthetic data fidelity—despite perplexity validation, the clinical realism of synthetically generated progress notes remains questionable.
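For readers less familiar with the agreement statistic cited above, the following is a minimal sketch of Cohen's kappa for two raters. The labels and the `cohens_kappa` helper are illustrative assumptions, not the annotation data or tooling used by the authors.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Hypothetical pass/fail judgments from two clinicians on four records.
clinician_1 = ["pass", "pass", "fail", "pass"]
clinician_2 = ["pass", "fail", "fail", "pass"]
print(cohens_kappa(clinician_1, clinician_2))
```

In this toy case the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is 0.5, which shows how kappa can sit well below the raw agreement rate.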

Evaluation Framework: Using GPT-4o-mini for patient and examiner agents introduces a dependency that could influence results, though the ablation study (Appendix A.7) replacing with Qwen3 and DeepSeek-V3.2 shows relative stability. The evaluation metrics are appropriately diverse—Accuracy for triage, Recall for examinations, F1 for diagnoses, IoU for treatment plans, and LLM-scored assessments for diagnostic basis. However, using an LLM to evaluate diagnostic reasoning (PB_Score, FB_Score, DD_Score) is inherently circular when evaluating LLM capabilities.
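The set-based metrics named above (Recall, F1, IoU) can be sketched as follows. This is a simplified illustration that assumes exact string matching between predicted and reference entities; the function names and example items are hypothetical, and the paper's synonym-list matching is not reproduced here.

```python
def recall(predicted, reference):
    """Fraction of reference items that the model recovered."""
    predicted, reference = set(predicted), set(reference)
    return len(predicted & reference) / len(reference) if reference else 0.0

def f1(predicted, reference):
    """Harmonic mean of precision and recall over the two sets."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    rec = true_positives / len(reference)
    return 2 * precision * rec / (precision + rec)

def iou(predicted, reference):
    """Intersection over union (Jaccard index) of the two sets."""
    predicted, reference = set(predicted), set(reference)
    union = predicted | reference
    return len(predicted & reference) / len(union) if union else 0.0

# Hypothetical examination orders for one clinical course.
reference_exams = {"CBC", "chest CT", "ECG"}
predicted_exams = {"CBC", "ECG", "MRI", "urinalysis"}
print(recall(predicted_exams, reference_exams))
print(f1(predicted_exams, reference_exams))
print(iou(predicted_exams, reference_exams))
```

Recall ignores extra predictions, while F1 and IoU penalize them, which is one plausible reason to score examinations with Recall and treatment plans with IoU.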

Experimental Design: The inclusion of Asclepius-Llama2 (trained on PMC-Patients) as a data leakage control is a thoughtful design choice, and its poor performance (12.58-12.61% avg) suggests the benchmark genuinely tests reasoning beyond memorization. Human performance baselines (85-87.5%) provide useful reference points, though based on only 100 samples from a single medical student.

3. Potential Impact

The benchmark fills a meaningful niche. Multi-course clinical reasoning is indeed closer to real inpatient care than single-pass diagnosis. The large gap between the best LLM averages (~43-50%) and the human baseline (~85-87.5%) provides clear evidence that current models struggle with longitudinal clinical reasoning. Two key findings, that performance degrades in multi-course versus single-course settings and that errors cascade in dynamic settings, have practical implications for clinical AI deployment.

However, the impact is somewhat bounded by: (a) the synthetic nature of much of the data, particularly the English dataset; (b) restriction to text-only modalities; and (c) the narrow applicability to Chinese and English clinical contexts. The benchmark would be more impactful if it incorporated real multi-course EHRs rather than LLM-generated approximations.

4. Timeliness & Relevance

This work is timely given the rapid deployment of LLMs in healthcare settings and the recognized need for more realistic evaluation beyond exam-style Q&A. The shift toward evaluating temporal reasoning and treatment adjustment over multiple encounters addresses a genuine bottleneck. The bilingual (Chinese-English) nature also adds practical value given the global interest in medical LLMs.

5. Strengths & Limitations

Strengths:

  • Novel and clinically meaningful problem formulation (multi-course decision-making)
  • Comprehensive task coverage from admission to discharge with well-defined subtasks
  • Bilingual dataset with reasonable scale
  • Thorough evaluation across 20+ models spanning medical, open-source, and closed-source categories
  • Error taxonomy (RDTP, FDSC, ICD, IRC) provides actionable insights
  • Multi-turn dynamic evaluation captures realistic error propagation

Limitations:

  • Heavy reliance on GPT-4o for data synthesis, especially for English data—the "benchmark" partly evaluates models against GPT-4o-generated ground truth
  • Human evaluation baseline is limited (100 samples, single medical student)
  • LLM-as-judge evaluation for subjective metrics creates circularity concerns
  • Department distribution imbalance is acknowledged but not mitigated
  • The paper mentions "GPT5-mini" in the abstract/tables, which appears to be a very recent model—unclear if this refers to an actual released model or is mislabeled
  • No analysis of how specific clinical specialties or disease complexity affects multi-course performance
  • The synonym list construction for entity matching relies on LLMs, adding another layer of potential systematic bias
  • Text-only modality is a significant limitation for realistic clinical decision-making

Additional Observations:

    The error analysis revealing that ~30% of errors are "Redundant Diagnostic and Treatment Plans" and ~20% involve "Failure to Detect Subtle but Critical Changes" provides valuable direction for future model improvement. The finding that assessment performance actually improves with more courses (while examination and treatment degrade) is an interesting insight about how accumulated context helps condition evaluation but hinders specific action planning.

    Rating: 5.8/10
    Significance 6 · Rigor 5.5 · Novelty 6 · Clarity 6.5

    Generated Jun 3, 2026

    Comparison History (21)

    vs. Structure Enables Effective Self-Localization of Errors in LLMs
    gemini-3.1 · 6/6/2026

    Paper 2 addresses a fundamental and widespread challenge in AI: LLM self-correction. By introducing a structured method for error localization that improves reasoning across domains, it has broad foundational impact. Paper 1, while highly valuable, offers a domain-specific benchmark for clinical decision-making, giving it a narrower scope of impact compared to the core algorithmic advancements in Paper 2.

    vs. MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction
    gemini-3.1 · 6/5/2026

    Paper 2 introduces a novel benchmark for multi-course clinical decision-making, addressing a critical gap in current LLM evaluation. Benchmark papers in the LLM domain typically yield broader scientific impact and higher citation rates than specialized algorithmic improvements. Its bilingual dataset, dynamic multi-agent evaluation framework, and relevance to healthcare deployment give it extensive interdisciplinary appeal compared to Paper 1's domain-specific focus on autonomous driving trajectory prediction.

    vs. Synthetic Contrastive Reasoning for Multi-Table Q&A
    claude-opus-4.6 · 6/5/2026

    ClinicalMC addresses a more impactful gap—benchmarking LLMs for multi-course clinical decision-making across the full patient journey. Healthcare is a high-stakes domain with broad societal impact. The benchmark's bilingual design (Chinese/English), multi-agent evaluation framework, and comprehensive coverage from admission to discharge offer lasting utility for the research community. While Paper 1 presents solid methodological contributions (CPO with synthetic contrastive traces for multi-table QA), its scope is narrower and more incremental. Paper 2's benchmark nature means it will likely be widely adopted and cited across medical AI research.

    vs. Residual Modeling for High-Fidelity Learned Compression of Scientific Data
    gpt-5.2 · 6/5/2026

    Paper 1 introduces a novel, technically substantive residual-centric framework for high-fidelity learned scientific data compression, with clear methodological contributions (deterministic residual coding pipeline; neural-guided Lorenzo residual with deterministic decoding) and strong, quantified gains across multiple major scientific datasets at stringent error targets. Its impact spans HPC, numerical simulation workflows, storage/I/O, and learned compression, with immediate deployability. Paper 2 is timely and useful as a benchmark, but benchmarks often face faster obsolescence, higher sensitivity to curation/bias, and narrower methodological novelty; its real-world clinical impact is indirect without demonstrated deployment or safety validation.

    vs. LC-ERD: Mining Latent Logic for Self-Evolving Reasoning via Consistency-Regulated Reward Decomposition
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher scientific impact: it proposes a broadly applicable methodological advance for improving LLM reasoning via consistency-regulated, step-level reward decomposition, addressing key RL/self-alignment bottlenecks (noise, sparse supervision, collapse). If rigorous and reproducible (code provided), it can influence multiple subfields (RLHF/RLAIF, reasoning, training dynamics) and downstream domains. Paper 1 is valuable and timely for healthcare evaluation, but as a benchmark/framework it is more domain-specific and its impact depends on adoption and dataset quality; it advances assessment rather than core learning methods.

    vs. Unplugging a Seemingly Sentient Machine Is the Rational Choice -- A Metaphysical Perspective
    gemini-3.1 · 6/5/2026

    Paper 2 introduces a concrete, highly relevant benchmark and evaluation framework for LLMs in healthcare, addressing a critical bottleneck in multi-course clinical decision-making. Its empirical approach, large-scale dataset, and direct real-world applicability in medical AI provide broader utility, higher citation potential, and more immediate practical impact compared to the abstract, philosophical arguments presented in Paper 1.

    vs. Cross-Lingual Token Arbitrage: Optimizing Code Agent Context Windows via Local LLM Preprocessing
    claude-opus-4.6 · 6/3/2026

    ClinicalMC addresses a significant gap in clinical AI evaluation by introducing a benchmark for multi-course clinical decision-making, which better reflects real-world healthcare complexity. Its broader impact across healthcare AI, systematic evaluation framework with multi-agent assessment, and bilingual dataset make it more impactful. Paper 1, while practically useful for cost reduction, addresses a narrower engineering optimization problem. Healthcare benchmarks tend to have wider adoption and influence across both AI and medical communities, and the temporal/longitudinal aspect of clinical reasoning is a timely and important research direction.

    vs. When Helping Hurts and How to Fix It: Multi-Agent Debate for Data Cleaning
    gpt-5.2 · 6/3/2026

    Paper 1 has higher impact potential due to a more novel, generalizable contribution: it identifies and explains a failure mode (critique-induced confusion), proposes a predictive condition for when debate helps vs hurts, and validates it via large factorial experiments plus external generalization across 19 prior studies and multiple domains. This yields actionable design principles for multi-agent systems broadly (beyond data cleaning). Paper 2 is timely and application-relevant, but primarily contributes a benchmark/evaluation framework in one domain; its broader methodological novelty and cross-field generalization appear more limited.

    vs. Coordination Graphs for Constrained Multi-Agent Reinforcement Learning
    gpt-5.2 · 6/3/2026

    Paper 2 is likely to have higher scientific impact due to strong timeliness and broad real-world relevance: it introduces a substantial bilingual benchmark for longitudinal (multi-course) clinical decision-making, a key gap in current LLM evaluation. Benchmarks often become community standards, enabling reproducibility and rapid cross-model progress across NLP, healthcare AI, and agent evaluation. Paper 1 is methodologically rigorous and novel for scalable constrained MARL with guarantees, but its immediate impact may be narrower to MARL specialists and specific coordination/constraint structures, whereas ClinicalMC can influence many downstream studies and deployments.

    vs. Thinking Past the Answer: Evaluating Harmful Overthinking in Large Reasoning Models
    gemini-3.1 · 6/3/2026

    Paper 1 addresses a critical, highly timely issue in foundational AI research: the failure modes of test-time compute in Large Reasoning Models. Its insights into 'harmful overthinking' and logical drift have broad implications across all LLM applications, potentially altering how inference scaling is designed. While Paper 2 presents a valuable domain-specific medical benchmark, Paper 1's fundamental methodological contributions to AI reasoning and alignment offer broader, more paradigm-shifting scientific impact across the entire field of artificial intelligence.

    vs. Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria
    gemini-3.1 · 6/3/2026

    Paper 1 addresses a high-stakes, critical domain (healthcare) by introducing a benchmark for multi-course clinical decision-making. Tackling the evolution of patient conditions over time represents a significant leap in medical AI evaluation. In contrast, Paper 2 focuses on a narrower educational application (automated grading for CS1), which, while methodologically sound, has a more limited scope and lower potential for broad real-world impact compared to advancing AI in clinical settings.

    vs. ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment
    claude-opus-4.6 · 6/3/2026

    ANDES addresses a fundamental challenge in automated AI alignment—autonomous data curation for post-training—by introducing a novel framework that reimagines data generation as a reusable agent skill. Its self-evolving routing mechanism and closed-loop interface represent genuine methodological innovation with broad applicability across AI development. It achieves SOTA on PostTrainBench and demonstrates cross-task generalization. ClinicalMC, while valuable, is primarily a benchmark contribution for a specific domain (clinical decision-making) with more incremental novelty. ANDES has wider impact potential across the rapidly growing AI agents and alignment research fields.

    vs. Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials
    gemini-3.1 · 6/3/2026

    Paper 2 pioneers a novel integration of LLMs with physics-based simulations for materials synthesis, addressing a critical bottleneck in discovering and manufacturing novel inorganic materials. While Paper 1 provides a valuable benchmark for clinical LLMs, Paper 2's approach represents a transformative paradigm shift in AI for science, offering broad methodological implications for bridging data-driven generative models with rigorous physical constraints.

    vs. Inducing Reasoning Primitives from Agent Traces
    gpt-5.2 · 6/3/2026

    Paper 2 is more novel and broadly applicable: it proposes a general, single-pass method to distill reusable “reasoning primitives” from agent traces, yielding a compact pseudo-tool library that improves performance and reduces inference cost across diverse reasoning/planning tasks. This has wide impact potential for agentic LLM design, efficiency, and automation beyond any single domain. Paper 1 is timely and valuable for healthcare evaluation, but as a benchmark its impact is narrower, more dataset-dependent, and primarily methodological/assessment-focused rather than introducing a generalizable new learning/control mechanism.

    vs. AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
    claude-opus-4.6 · 6/3/2026

    Paper 1 (AGENTCL) addresses a fundamental challenge in language agent research—rigorous evaluation of continual learning—with broader applicability across AI. It introduces a novel evaluation framework with controlled task streams, transfer metrics, and a diagnostic probing method (MemProbe), contributing methodological innovations applicable to the growing field of language agents. Paper 2 (ClinicalMC) is a valuable domain-specific benchmark for clinical decision-making but is narrower in scope. AGENTCL's contributions to understanding memory, plasticity, and knowledge reuse in agents have wider cross-field impact and address a timely, foundational problem.

    vs. Learning When Not to Act: Mitigating Tool Abuse in Agentic Reinforcement Learning
    gemini-3.1 · 6/3/2026

    Paper 2 presents a fundamental methodological advancement in agentic reinforcement learning (EAPO), addressing a critical bottleneck (tool abuse/efficiency) applicable to LLM agents across all domains. While Paper 1 introduces a valuable medical benchmark, Paper 2's algorithmic improvements to reasoning efficiency and accuracy over state-of-the-art baselines like GRPO offer broader, immediate impact across the entire AI ecosystem.

    vs. LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
    claude-opus-4.6 · 6/3/2026

    LEAP demonstrates a major breakthrough in formal theorem proving, solving all 12 Putnam 2025 problems and significantly outperforming specialized IMO systems on a new benchmark. It introduces a novel agentic framework bridging informal reasoning with formal verification, with demonstrated utility on open research problems. Its impact spans AI, mathematics, and formal verification. ClinicalMC, while valuable, is primarily a benchmark contribution for clinical LLM evaluation—important but more incremental, addressing a narrower evaluation gap without introducing fundamentally new methods.

    vs. From "Weak" Signals to Strong Models: Preference Delta Aggregation with LoRA Merging
    claude-opus-4.6 · 6/3/2026

    Paper 2 introduces a novel and generalizable framework (PDA with GAM) for aggregating weak supervision signals to improve strong LLMs, addressing the fundamental challenge of scarce high-quality training data. Its methodological innovation—geometric alignment merging for LoRA adapters—is broadly applicable across domains and model scales. Paper 1 contributes a useful benchmark for clinical LLM evaluation, but benchmarks typically have narrower impact than new training methodologies. Paper 2's approach has broader applicability, stronger methodological novelty, and greater potential to influence future LLM training paradigms.

    vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation
    gemini-3.1 · 6/3/2026

    Paper 2 addresses a critical gap in healthcare AI by introducing a large-scale, bilingual benchmark for multi-course clinical decision-making. Its focus on dynamic, evolving patient scenarios offers significant real-world applications and has a broader, more immediate impact on improving medical LLMs and patient care compared to the specialized social simulation framework in Paper 1.

    vs. Model-Native Computing Architecture: Envisioning Future System Architecture Through the Lens of Computer Architecture
    gpt-5.2 · 6/3/2026

    Paper 1 likely has higher scientific impact due to a concrete, novel benchmark addressing an important evaluation gap (multi-course, longitudinal clinical decision-making) with substantial bilingual data and an explicit multi-agent evaluation framework. It is directly applicable to healthcare model validation and deployment, supporting reproducibility and downstream method development. Paper 2 is timely and potentially influential conceptually, but as a visionary survey/framework without new experimental results, its impact is more speculative and harder to operationalize. Overall, Paper 1 offers clearer methodological rigor and immediate real-world utility.