Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Xavier Theimer-Lienhard, Mushtaha El-Amin, Fay Elhassan, Sahaj Vaidya, Victor Cartier-Negadi, David Sasu, Lars Klein, Mary-Anne Hartley

May 15, 2026

arXiv:2605.16215v1

cs.AI (primary), cs.CL

#151 of 2292 · Artificial Intelligence

Tournament Score: 1529±46 (axis 1050 to 1800)

Win Rate: 81%

Rating: 6.8/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Abstract

Clinical decision support systems (CDSS) require scrutable, auditable pipelines that enable rigorous, reproducible validation. Yet current LLM-based CDSS remain largely opaque. Most "open" models are open-weight only, releasing parameters while withholding the data provenance, curation procedures, and generation pipelines that determine model behavior. Fully Open (FO) models, which expose the complete training stack end-to-end, do not currently exist in medicine. We introduce Fully Open Meditron, the first fully open pipeline for building LLM-CDSS, comprising a clinician-audited training corpus, a reproducible data construction and training framework, and a use-aligned evaluation protocol. The corpus unifies eight public medical QA datasets into a normalized conversational format and expands coverage with three clinician-vetted synthetic extensions: exam-style QA, guideline-grounded QA derived from 46,469 clinical practice guidelines, and clinical vignettes. The pipeline enforces system-wide decontamination, gold-label resampling of teacher generations, and end-to-end validation by a four-physician panel. We evaluate using an LLM-as-a-judge protocol over expert-written clinical vignettes, calibrated against 204 human raters. We apply the recipe to five FO base models (Apertus-70B/8B-Instruct, OLMo-2-32B-SFT, EuroLLM-22B/9B-Instruct). All MeditronFO variants are preferred over their bases. Apertus-70B-MeditronFO improves +6.6 points over its base (47.2% to 53.8%) on aggregate medical benchmarks, establishing a new FO SoTA. Gemma-3-27B-MeditronFO is preferred over MedGemma in 58.6% of LLM-as-a-judge comparisons and outperforms it on HealthBench (58% vs 55.9%). These results show that fully open pipelines can achieve state-of-the-art domain-specific performance without sacrificing auditability or reproducibility.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Fully Open Meditron

1. Core Contribution

The paper introduces "Fully Open Meditron" (MeditronFO), positioned as the first end-to-end fully open pipeline for adapting LLMs into clinical decision support systems. The core novelty lies not in a single technical breakthrough but in the principled assembly of an auditable, reproducible pipeline spanning corpus construction, synthetic data generation, decontamination, training, and evaluation — all under fully open constraints where data provenance, code, and weights are publicly released.

The key contributions are: (a) a clinician-audited training corpus unifying 8 public medical QA datasets with three synthetic extensions (exam-style QA, guideline-grounded QA from 46,469 clinical practice guidelines, and clinical vignettes); (b) a systematic decontamination pipeline; (c) Auto-MOOVE, an LLM-as-a-judge evaluation protocol validated against 204 human raters; and (d) a family of fine-tuned models across five fully open bases. The paper explicitly defines a taxonomy of openness (Table 1) and demonstrates that no prior medical LLM satisfies all dimensions simultaneously.
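As a rough illustration of contribution (b), the sketch below flags a training sample when it shares any word n-gram with a benchmark item. It is a simplified stand-in, not the paper's pipeline: the review describes a two-stage n-gram plus token-alignment procedure, and only a naive n-gram stage is shown here. The function names, the window size n=8, and the example sentences are all assumptions made for illustration.

```python
import re

def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(sample: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a sample that shares at least one n-gram with any benchmark item.

    n=8 is an assumed window size, not a value taken from the paper.
    """
    sample_grams = ngrams(sample, n)
    return any(sample_grams & ngrams(item, n) for item in benchmark_items)

# Invented example sentences (not drawn from any real benchmark).
benchmark = [
    "A 45-year-old man presents with crushing chest pain radiating to the left arm."
]
verbatim = (
    "Question: A 45-year-old man presents with crushing chest pain "
    "radiating to the left arm. What next?"
)
paraphrase = (
    "A middle-aged male reports severe retrosternal pain "
    "that spreads to his left upper limb."
)

flags = [is_contaminated(s, benchmark) for s in (verbatim, paraphrase)]
print(flags)
```

In this toy case the verbatim copy is flagged and the paraphrase is not, which is the general behavior to expect from any purely surface-level overlap check.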

2. Methodological Rigor

Strengths in methodology:

The decontamination pipeline (two-stage n-gram + token alignment from Apertus) is systematically applied against all evaluation benchmarks, addressing a well-documented concern in medical LLM evaluation.

Gold-label rejection sampling (up to 8× at T=0.7) for synthetic data provides a principled approach to mitigating hallucinations in teacher-generated content.

The clinician panel (4 physicians) audited generation prompts, though the coverage is limited (3 sampled QA pairs per template).

Auto-MOOVE validation against 204 human raters with κ analysis situating the judge within the human distribution is methodologically sound.

Corpus component ablations (Table 3) are informative and reveal non-trivial tradeoffs between MCQA accuracy and open-ended evaluation.

Methodological concerns:

The clinician audit is relatively thin: 4 physicians reviewing 3 samples per prompt template cannot systematically catch item-level errors across ~386K synthetic samples (64% of the corpus). The authors acknowledge this limitation.

The decontamination is purely syntactic (n-gram based), which may miss semantic contamination through teacher paraphrasing — a real risk given that GPT-OSS-120B generates from seeds that overlap with evaluation distributions.

Auto-MOOVE's judge κ (0.232 with ties, 0.487 without) falls below the human median, and the judge is systematically less discriminating on safety-critical dimensions (harmlessness: +0.03 judge vs. +0.76 human Likert delta). This is a significant weakness for a clinical evaluation tool.

Single teacher (GPT-OSS-120B) and single primary judge (Qwen3-235B) introduce correlated biases. The judge ablation (Table 12) partially addresses this but shows EuroLLM-22B results flip with GPT-OSS-120B as judge, suggesting non-trivial judge sensitivity.
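To make the two κ figures quoted above easier to interpret, here is a small sketch of Cohen's kappa computed on pairwise preference labels, once with "tie" kept as a label and once with tied items removed. The toy labels are invented, and the tie-handling rule (drop an item if either rater called a tie) is an assumption; the review does not say how the paper computes its "without ties" value.

```python
from collections import Counter

def cohen_kappa(rater1: list[str], rater2: list[str]) -> float:
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two raters on the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater1)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement from each rater's marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    p_e = sum(c1[label] * c2[label] for label in c1) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

def drop_ties(rater1: list[str], rater2: list[str]) -> tuple[list[str], list[str]]:
    """Keep only items where neither rater chose 'tie' (assumed tie-handling rule)."""
    pairs = [(a, b) for a, b in zip(rater1, rater2) if "tie" not in (a, b)]
    return [a for a, _ in pairs], [b for _, b in pairs]

# Invented toy labels: which of two responses (A or B) each rater preferred.
human = ["A", "A", "B", "tie", "B", "A", "tie", "B"]
judge = ["A", "tie", "B", "A", "B", "A", "B", "A"]

kappa_with_ties = cohen_kappa(human, judge)
kappa_without_ties = cohen_kappa(*drop_ties(human, judge))
print(round(kappa_with_ties, 3), round(kappa_without_ties, 3))
```

On this toy data, removing ties raises κ because disagreements that involve a tie no longer count. The same direction of change appears in the reported values (0.232 with ties, 0.487 without).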

3. Potential Impact

High-impact dimensions:

Regulatory and auditability: As clinical AI faces increasing regulatory scrutiny (EU AI Act, FDA guidance), fully open pipelines enabling third-party auditing are practically essential. This work provides a template that regulators, hospitals, and researchers can inspect end-to-end.

Reproducibility infrastructure: The release of corpus, code, prompts, and training configurations creates a reusable foundation. Other groups can build on, modify, or critique specific pipeline components rather than starting from scratch.

Benchmarking openness: The openness taxonomy (Table 1, Appendix L) provides a clear framework for holding future medical LLM releases accountable.

Coverage gap analysis: The systematic metadata extraction revealing underrepresentation of emergency care (15%), life-threatening cases (8.6%), and low-resource settings in public QA datasets is itself a useful contribution.

Limitations on impact:

The performance gap between fully open models and frontier systems remains substantial. Apertus-70B-MeditronFO (53.8% average) trails MedGemma-27B (60.7%) on aggregate MCQA, and both trail proprietary systems significantly.

The Gemma-3-27B-MeditronFO vs. MedGemma comparison is the most compelling result (+2.1 points on HealthBench, 58.6% preference on Auto-MOOVE), but Gemma-3-27B is itself only open-weight, not fully open, so this is a "partially open" rather than a fully open achievement.

No actual clinical deployment or prospective validation is attempted.

4. Timeliness & Relevance

This work addresses a genuine and growing need. The tension between LLM capability and auditability in healthcare is well-recognized, and regulatory frameworks are actively demanding transparency. The paper arrives as medical LLMs are being deployed in patient-facing settings (symptom checkers, clinical note generation) with minimal scrutiny of training data provenance. The inclusion of AfriMed-QA and attention to low-resource settings reflects growing awareness of health equity concerns in AI.

The timing relative to MedGemma's release (which withholds training data) makes this a timely counterpoint advocating for a different development paradigm.

5. Strengths & Limitations

Key strengths:

First systematic definition and implementation of "fully open" for medical LLMs

Comprehensive pipeline with all artifacts released

Consistent improvements across 5 diverse base models

Thoughtful evaluation combining MCQA, HealthBench, and Auto-MOOVE

Honest reporting of limitations (IFEval degradation, judge weaknesses)

Notable weaknesses:

Absolute performance still substantially below frontier models

Thin clinician oversight relative to corpus scale

Catastrophic forgetting on instruction-following for some bases (Apertus-70B: IFEval 64.7→41.0)

Auto-MOOVE underperforms humans on safety-critical criteria, undermining its utility for the dimensions that matter most clinically

Single-epoch SFT only; no preference optimization or continued pretraining explored

Overall Assessment

This is a well-executed systems paper that makes a principled case for fully open medical AI development. Its primary contribution is the pipeline and framework rather than raw performance gains. The work is most impactful as infrastructure and as an advocacy piece for auditability standards, though the performance gap with closed systems and the thin clinical validation limit its immediate practical significance.

Rating: 6.8/10

Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Generated May 18, 2026

Comparison History (16)

vs. PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

gpt-5.2 · 5/19/2026

Paper 2 has higher likely scientific impact due to its strong real-world applicability (clinical CDSS), broad community value (fully open, auditable end-to-end pipeline), and timeliness amid demands for transparency, reproducibility, and safety in medical AI. Methodologically, it emphasizes decontamination, clinician auditing, and human-calibrated evaluation, enabling adoption and extension across institutions and domains. Paper 1 is novel and potentially important for RL/agent training, but its impact is more specialized and depends on wider validation across tasks/models and downstream performance gains beyond reward-model AUROC.

vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches

claude-opus-4.6 · 5/19/2026

Fully Open Meditron addresses a critical gap in clinical AI—full auditability and reproducibility of LLM-based clinical decision support. Its impact spans healthcare AI, regulatory compliance, and open science. The rigorous pipeline with clinician oversight, decontamination, and calibrated evaluation sets a new standard for medical LLMs. It has broader societal implications (patient safety, trust in AI) and affects a larger research community. While Paper 2 makes a solid contribution to OR democratization, its scope is narrower, and LLM-agent frameworks for optimization are becoming increasingly common.

vs. Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability

claude-opus-4.6 · 5/18/2026

Fully Open Meditron addresses a critical gap in clinical AI—the lack of truly open, auditable LLM pipelines for medicine. Its contributions span data curation, training methodology, evaluation protocols, and reproducibility, with direct clinical applications. The release of the first fully open medical LLM pipeline sets a new standard for transparency in healthcare AI. While Paper 2 makes valuable methodological contributions to AI agent reliability measurement, Paper 1's broader real-world impact in healthcare, combined with its novelty as the first fully open clinical LLM pipeline and extensive validation, gives it higher potential impact.

vs. SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

gpt-5.2 · 5/18/2026

Paper 1 likely has higher scientific impact: it introduces the first fully open, end-to-end auditable pipeline for clinical LLM-CDSS, addressing a major barrier (reproducibility, provenance, decontamination, clinician validation) in a high-stakes domain with immediate real-world and regulatory relevance. It contributes reusable assets (audited corpus, training/eval protocol) and demonstrates competitive/SoTA performance across multiple base models, broadening adoption. Paper 2 is timely and useful as a benchmark, but its impact may be narrower (evaluation-only) and more incremental relative to the crowded agent-benchmark space.

vs. TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

claude-opus-4.6 · 5/18/2026

Fully Open Meditron addresses a critical gap in clinical AI—the lack of truly open, auditable LLM pipelines for medical decision support. Its contributions span data curation, training methodology, and evaluation protocols, with broad implications for reproducibility, regulatory compliance, and trust in healthcare AI. The work establishes new state-of-the-art results on fully open models and has potential to reshape how clinical LLMs are developed and validated. Paper 2, while technically sophisticated, addresses a narrower domain (microservice RCA) with more incremental advances in multi-agent LLM reasoning.

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/18/2026

Paper 1 provides a fully open and auditable pipeline for clinical LLMs, directly addressing the critical need for transparency and reproducibility in medical AI. Its practical utility, open-source release, and application to a high-stakes domain like healthcare suggest it will have widespread adoption and substantial real-world impact. While Paper 2 offers valuable theoretical insights into RL safety, Paper 1's immediate relevance to clinical decision support systems gives it broader cross-disciplinary potential.

vs. Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

gpt-5.2 · 5/18/2026

Paper 1 likely has higher scientific impact due to its strong novelty and field-level infrastructure contribution: a fully open, auditable end-to-end pipeline for clinical LLMs (data provenance, curation, decontamination, training, and evaluation), addressing a major barrier to medical AI deployment and reproducibility. Its applications (clinical decision support, regulatory/scientific auditing) are immediate and broadly relevant across medicine, NLP, and AI governance. Paper 2 is methodologically interesting and includes a real-user study, but is evaluated mainly in Overcooked-AI, making near-term real-world impact and cross-domain breadth less certain.

vs. SkillEvolver: Skill Learning as a Meta-Skill

gemini-3.1 · 5/18/2026

Paper 2 addresses a critical bottleneck in a high-stakes domain (healthcare) by providing the first fully open, end-to-end auditable pipeline for clinical LLMs. Its emphasis on transparency, rigorous clinician validation, and reproducibility sets a new standard for medical AI. While Paper 1 introduces an innovative approach to agent skill learning, Paper 2's direct real-world application, broad impact on medical AI research, and contribution to open science give it a higher potential for significant scientific and societal impact.

vs. BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents

gpt-5.2 · 5/18/2026

Paper 2 likely has higher scientific impact because it delivers a broadly reusable, standardized infrastructure for building and fairly evaluating biomedical research agents, addressing a pervasive reproducibility and comparability bottleneck (“per-paper engineering tax”). Its modular decoupling of evaluation layers, large coverage (147 benchmarks, 75 tools), and low-friction extensibility can catalyze many downstream studies across models, tasks, and tool-using agent paradigms. While Paper 1 is novel and important for auditability in clinical LLMs, its impact is more domain-narrow and tied to specific training pipelines/datasets than a field-wide evaluation substrate.

vs. Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

gemini-3.1 · 5/18/2026

Paper 1 introduces a foundational algorithmic innovation (GCPO) that addresses a critical bottleneck (exploration collapse) in reinforcement learning for LLMs. By shifting from competitive to cooperative policy optimization, it provides a generalizable solution to improve reasoning diversity and accuracy across all domains. While Paper 2 offers significant value for medical AI auditability, Paper 1's methodological breakthrough in RL optimization has a broader potential impact across the entire machine learning and AI community.

vs. AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

claude-opus-4.6 · 5/18/2026

The AI Co-Mathematician represents a paradigm shift in AI-assisted mathematical research, demonstrating practical utility in solving open problems and achieving SOTA on FrontierMath Tier 4 (48%). Its potential to accelerate mathematical discovery across all fields gives it extraordinary breadth of impact. While Fully Open Meditron makes important contributions to transparency and reproducibility in clinical LLMs—addressing a critical need—its improvements are incremental (e.g., +6.6 points) and focused on a narrower domain. The co-mathematician's novel agentic workflow paradigm and demonstrated ability to solve open problems suggest transformative potential.

vs. Back to the Beginning of Heuristic Design: Bridging Code and Knowledge with LLMs

claude-opus-4.6 · 5/18/2026

Fully Open Meditron addresses a critical gap in clinical AI—auditability and reproducibility of medical LLMs—with direct real-world healthcare applications. It introduces the first fully open pipeline for clinical LLM development, establishes new benchmarks, and has broad implications for regulatory compliance, trust, and safety in medical AI. While Paper 1 makes a solid conceptual contribution to automatic heuristic design with a novel top-down paradigm, Paper 2's impact spans multiple fields (medicine, AI safety, policy, reproducible science) and addresses urgent societal needs around transparent clinical AI systems.

vs. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

gemini-3.1 · 5/18/2026

Paper 2 addresses a critical bottleneck in clinical AI: opacity and lack of reproducibility. By providing the first fully open, end-to-end auditable pipeline for clinical LLMs, it sets a new standard for transparency, safety, and open science in medical AI. While Paper 1 offers a highly useful automation tool for neuroimaging workflows, Paper 2's foundational approach to clinical decision support systems has broader applicability and transformative potential across the entire healthcare AI community.

vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

gemini-3.1 · 5/18/2026

Paper 1 tackles a critical bottleneck in healthcare AI: the lack of auditable, fully open pipelines for Clinical Decision Support Systems. By providing a reproducible, clinician-vetted framework, it addresses crucial safety, transparency, and regulatory concerns in medicine. While Paper 2 offers a valuable benchmarking tool for web agents in e-commerce, Paper 1 has a significantly higher potential for profound real-world societal impact and addresses a more pressing scientific need for methodological rigor and trust in safety-critical domains.

vs. Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning

claude-opus-4.6 · 5/18/2026

Fully Open Meditron addresses a critical gap in clinical AI—auditability and reproducibility of LLM-based clinical decision support systems. Its impact spans healthcare AI policy, regulatory compliance, and clinical practice, with broad real-world applications. The fully open pipeline paradigm, clinician-audited corpus, and demonstrated state-of-the-art performance across multiple base models establish a replicable framework for the field. Paper 2 presents interesting theoretical work on deterministic event-graph world models, but its scope is narrower (symbolic reasoning benchmarks) with less immediate real-world applicability and a smaller potential user community.

vs. LatentRouter: Can We Choose the Right Multimodal Model Before Seeing Its Answer?

claude-opus-4.6 · 5/18/2026

Fully Open Meditron addresses a critical gap in clinical AI—auditability and reproducibility of LLM-based clinical decision support systems. Its contribution of the first fully open pipeline for medical LLMs has broader impact across healthcare, AI safety, and regulatory compliance. The work establishes new standards for transparency in medical AI, which is increasingly demanded by regulators and clinicians. While LatentRouter presents a technically interesting routing mechanism for multimodal models, its impact is narrower, focused on model selection optimization. Meditron's real-world clinical applications, methodological rigor with physician validation, and timeliness regarding AI transparency give it higher potential impact.