MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo

May 26, 2026

arXiv:2605.26567v1

cs.AI (primary)

#1245 of 2682 · Artificial Intelligence

Tournament Score: 1418 ± 42 (scale 1050 to 1800)

Win Rate: 58%

Rating: 7/10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Abstract

Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MedGuideX

1. Core Contribution

MedGuideX introduces a pipeline that converts clinical practice guidelines (CPGs) into executable Python functions representing decision trees, then uses these functions to generate both factual and counterfactual QA training data. The key insight is that CPGs contain procedural decision logic (condition-action rules) that can be formalized as executable functions, enabling: (a) deterministic label generation through function execution, (b) counterfactual data synthesis by intervening on variables, and (c) verifiable reward signals for reinforcement learning. The model is post-trained via SFT on mixed factual/counterfactual data followed by GRPO on factual QA, achieving a 10.28% relative improvement over the Qwen3.5-9B base model across four benchmarks.
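As a minimal sketch of this idea, and not the paper's actual code, the hypothetical function below encodes a toy condition-action rule. It is executed once to obtain a deterministic factual label and once more after intervening on a single variable to obtain a counterfactual label. The rule, variable names, thresholds, and recommendation strings are illustrative assumptions and are not taken from any guideline.

```python
from typing import Callable, Dict, Tuple

Patient = Dict[str, float]
Rule = Callable[[Patient], str]


def toy_guideline(patient: Patient) -> str:
    """Hypothetical decision tree; thresholds are made up, not clinical advice."""
    if patient["age"] >= 65 and patient["risk_score"] >= 2:
        return "recommend_treatment"
    if patient["risk_score"] >= 4:
        return "recommend_treatment"
    return "monitor"


def factual_example(rule: Rule, patient: Patient) -> Tuple[Patient, str]:
    """Label a case by executing the rule on the unmodified patient."""
    return patient, rule(patient)


def counterfactual_example(
    rule: Rule, patient: Patient, variable: str, new_value: float
) -> Tuple[Patient, str]:
    """Intervene on one variable (on a copy) and re-execute the rule."""
    edited = {**patient, variable: new_value}
    return edited, rule(edited)


patient = {"age": 70, "risk_score": 2}
_, factual_label = factual_example(toy_guideline, patient)
_, cf_label = counterfactual_example(toy_guideline, patient, "age", 40)
print(factual_label, cf_label)  # recommend_treatment monitor
```

Because the label always comes from running the function, changing one input and re-running it yields a matched pair of cases whose answers differ for a known reason.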

2. Methodological Rigor

The pipeline is well-formalized. The decision-tree representation (Equations 1-2) provides clear semantics, and the counterfactual generation framework (abduction-intervention-prediction, Equation 3) is principled. The executable functions serve dual purposes: data generation and reward computation during RL, which is an elegant design.
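The reward-computation role can be pictured with a small sketch. It assumes a binary reward that compares a model's final answer with the output of an executable rule. The rule, the normalization step, and all names are hypothetical, and the paper's actual reward definitions may differ.

```python
from typing import Callable, Dict

Patient = Dict[str, float]


def toy_rule(patient: Patient) -> str:
    """Hypothetical executable rule; the threshold is illustrative only."""
    return "treat" if patient["risk_score"] >= 3 else "monitor"


def normalize(answer: str) -> str:
    """Compare answers case-insensitively and ignore surrounding whitespace."""
    return answer.strip().lower()


def verifiable_reward(
    rule: Callable[[Patient], str], patient: Patient, model_answer: str
) -> float:
    """Return 1.0 when the model's answer matches the executed rule, else 0.0."""
    return 1.0 if normalize(model_answer) == normalize(rule(patient)) else 0.0


case = {"risk_score": 4}
rewards = [verifiable_reward(toy_rule, case, answer) for answer in (" Treat ", "monitor")]
print(rewards)  # [1.0, 0.0]
```

The same function that produced the training label also scores the rollout, so no separate learned reward model is needed in this sketch.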

Strengths in evaluation design:

Four diverse benchmarks spanning exam-style QA (MedQA), long-form case reasoning (MedCaseReasoning), EHR-based decision making (MIMIC-CDM-FI), and emergency diagnosis (ER-Reason)

No benchmark training data is used during MedGuideX training, ensuring fair evaluation

Paired answer transition analysis (Table 2) provides granular insight into where improvements originate

Blinded physician evaluation with inter-annotator overlap on 10 shared cases

Thorough ablation study examining data composition and training phase combinations
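The paired answer transition analysis mentioned above can be illustrated with a short sketch. It assumes per-question correctness flags for a base model and a post-trained model on the same questions. The function name and the toy data are hypothetical and do not reproduce the paper's Table 2.

```python
from collections import Counter
from typing import List


def transition_counts(base_correct: List[bool], tuned_correct: List[bool]) -> Counter:
    """Count how each question's outcome moves from the base to the tuned model."""
    if len(base_correct) != len(tuned_correct):
        raise ValueError("both models must be scored on the same questions")
    label = {True: "correct", False: "wrong"}
    return Counter(
        f"{label[b]}->{label[t]}" for b, t in zip(base_correct, tuned_correct)
    )


# Made-up outcomes for five questions.
base = [True, False, False, True, False]
tuned = [True, True, False, False, True]

counts = transition_counts(base, tuned)
net_gain = counts["wrong->correct"] - counts["correct->wrong"]
print(dict(counts), net_gain)
```

Separating fixed answers from newly broken ones shows whether an accuracy gain comes from broad improvement or from a large number of offsetting changes.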

Methodological concerns:

The LLM-dependent pipeline (extraction, validation, compilation) introduces potential error propagation that is not quantified. How many of the 2,793 final functions are semantically faithful to the original guidelines?

The physician evaluation involves only 2 physicians and 30 cases—a relatively small sample for drawing strong conclusions about clinical reasoning quality

The counterfactual reward (Equation 7) requires exact match on hidden variables, which seems overly strict and may explain why counterfactual RL underperformed factual RL in ablations

The paper does not report confidence intervals or statistical significance tests on benchmark results
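One way to address the missing significance tests, offered here as an assumption rather than anything the paper reports, is an exact McNemar test on paired per-question outcomes. The sketch takes the two discordant counts (questions only the base model answered correctly, and questions only the post-trained model answered correctly). The counts used below are invented for illustration.

```python
from math import comb


def mcnemar_exact_p(only_base_correct: int, only_tuned_correct: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_base_correct + only_tuned_correct
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(only_base_correct, only_tuned_correct)
    # Under the null hypothesis, each discordant pair favours either model
    # with probability 0.5, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented counts: 5 questions regressed, 15 questions were fixed.
p_value = mcnemar_exact_p(only_base_correct=5, only_tuned_correct=15)
print(f"p = {p_value:.4f}")  # p = 0.0414
```

The test uses only the questions on which the two models disagree, which suits benchmark comparisons where both models are run on the same items.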

3. Potential Impact

Direct applications: The framework provides a scalable method for converting authoritative medical knowledge into structured training supervision. This could be applied beyond the ~841 guidelines used here to larger guideline repositories, potentially covering more medical specialties.

Broader implications:

The executable-function-as-verifier paradigm could generalize to other domains with procedural decision logic (legal reasoning, financial compliance, engineering standards)

The counterfactual training approach for teaching models sensitivity to condition changes has applications in any domain requiring robust conditional reasoning

The physician evaluation showing 76.86% preference over GPT-5.0 on rationale quality (despite lower accuracy) suggests the approach improves interpretability—critical for clinical adoption

Limitations on real-world deployment: The authors appropriately note this is a research system. CPGs cover only a fraction of clinical scenarios, and many real-world cases require reasoning beyond what any single guideline encodes. The gap between guideline-based logic and actual clinical reasoning (which involves uncertainty, incomplete information, and competing guidelines) remains significant.

4. Timeliness & Relevance

This work addresses a current bottleneck: medical LLMs trained on unstructured clinical text acquire broad knowledge but lack stable, verifiable decision procedures. The approach arrives at an important moment when: (a) RL-based reasoning training is rapidly advancing (DeepSeek-R1, etc.), (b) there is growing demand for trustworthy medical AI, and (c) regulatory bodies increasingly require explainability in clinical decision support. The executable verification approach aligns well with emerging needs for auditable AI in healthcare.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated framing: treating CPGs as executable programs rather than text is conceptually clean and practically useful

The factual/counterfactual data generation is a genuine contribution—counterfactual training for clinical reasoning is underexplored

Strong experimental design with multiple baselines including several CPG-utilization methods (RAG, CPGPrompt, fine-tuning on raw CPGs, RL with process rewards)

The comparison against CPG baselines (Table 1, bottom section) effectively demonstrates that executable supervision outperforms alternative CPG utilization strategies

Comprehensive ablation study that justifies design choices

Notable Weaknesses:

Scale of supervision is modest: 2,793 executable functions yielding ~10K QA instances. It's unclear how performance scales with more guidelines

The 841 curated guidelines are restricted to US-based sources (CDC, PubMed), limiting generalizability to other healthcare contexts

Gains on MIMIC-CDM-FI are small (4.41%), and ER-Reason accuracy remains low across all models (~28-34%), suggesting the approach has limited effectiveness for certain clinical reasoning tasks

The physician evaluation compares against GPT-5.0 but not against other medical LLMs at similar scale, which would be more informative

No analysis of failure modes or systematic errors introduced by guideline-derived training

The paper acknowledges but does not address potential conflicts between different guidelines or guideline currency issues

Reproducibility: The pipeline depends heavily on LLM calls for extraction, validation, and data generation, making exact reproduction challenging. However, the overall framework is clearly described.

Overall Assessment

MedGuideX presents a creative and well-executed approach to an important problem. The core idea—transforming procedural medical knowledge into executable training supervision—is sound and has clear extensibility. The experimental evidence is convincing, though the scale of evaluation (particularly the human study) could be larger. The work makes a meaningful contribution to the intersection of knowledge-grounded training and medical AI, with the executable verification paradigm being its most lasting contribution.

Rating: 7/10

Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Generated May 27, 2026

Comparison History (26)

vs. Deconstructing Spatial Complexity: Hierarchical Decomposition for LLM Spatial Reasoning

claude-opus-4.6 · 5/28/2026

MedGuideX addresses a critical real-world problem in clinical AI with a novel pipeline that transforms clinical practice guidelines into executable decision logic for training medical LLMs. It demonstrates strong empirical results (10.28% improvement) validated by physician evaluation across multiple dimensions. The approach of using factual and counterfactual QA from structured guidelines is innovative and highly scalable. While Paper 1 presents solid technical contributions to spatial reasoning with MCTS-guided optimization, Paper 2 has broader immediate impact potential in healthcare AI, addresses a more pressing societal need, and offers a more generalizable methodology for incorporating structured expert knowledge into LLMs.

vs. A Policy-Driven Runtime Layer for Agentic LLM Serving

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to stronger novelty (a new serving-stack architecture with explicit primitives bridging agent frameworks and engines), broad applicability across many cross-cutting policies (caching, batching, fairness, safety, memoization), and timeliness as multi-agent LLM serving becomes a dominant workload. It also demonstrates concrete, systems-level gains on real workloads. Paper 2 is impactful in a high-stakes domain, but its approach is more incremental (structured guideline-to-data supervision) and narrower in scope, with moderate benchmark gains and domain-specific deployment constraints.

vs. Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement

gemini-3.1 · 5/28/2026

Paper 1 introduces a fundamental, domain-agnostic methodological breakthrough in prompt optimization. By enabling instance-level, compositional prompt generation through discrete codebooks, it solves brittleness in existing APO methods and significantly reduces prompt length. Its broad applicability across LLM workflows gives it a wider potential impact across multiple fields compared to Paper 2, which presents a valuable but domain-specific data generation pipeline for medical LLMs.

vs. CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

claude-opus-4.6 · 5/28/2026

CORE introduces a broadly applicable, novel learning paradigm (contrastive reflection for non-parametric self-improvement) that generalizes across reasoning tasks with strong efficiency gains over established methods like GRPO. Its contributions to sample efficiency, interpretability, and the general framework of LLM self-improvement have broader impact across multiple fields. MedGuideX, while valuable for clinical AI, addresses a narrower domain-specific problem (medical guideline internalization) with a more incremental contribution—transforming structured guidelines into training data. CORE's methodological novelty and cross-domain applicability give it higher potential impact.

vs. Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

gpt-5.2 · 5/28/2026

Paper 2 has higher potential impact due to a clearer, high-stakes real-world application (clinical decision support) and a novel, broadly reusable pipeline that converts clinical practice guidelines into executable logic to generate factual/counterfactual supervision. This directly targets reliability and faithfulness—key barriers to deployment—supported by benchmark gains and physician evaluation. Paper 1 is methodologically interesting and broadly applicable, but leverages model confidence (often poorly calibrated) as a training signal and appears more incremental relative to existing uncertainty-weighting and replay ideas. MedGuideX is also timely given regulatory and safety pressure in medical AI.

vs. PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

claude-opus-4.6 · 5/28/2026

MedGuideX introduces a novel methodology for internalizing executable clinical decision logic into LLMs, demonstrating significant improvements (10.28% relative accuracy gain) with physician-validated results. It addresses a fundamental challenge in medical AI—faithful clinical reasoning—with a generalizable pipeline applicable beyond medicine. Paper 1, while useful, is primarily a benchmark contribution for a narrow domain (petroleum engineering) without methodological innovation. Paper 2 has broader impact potential across healthcare AI, stronger novelty in its counterfactual training approach, and more rigorous validation including physician evaluation.

vs. Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and broadly applicable problem—bridging the gap between idealized training and noisy real-world deployment for LLM agents. Its framework (NoisyAgent) is domain-agnostic and applicable across diverse agent tasks, giving it broader impact potential. The finding that noise-augmented training also improves performance on clean benchmarks suggests a generalizable principle. Paper 2, while valuable for clinical AI, targets a narrower domain (medical guideline reasoning) with a more incremental contribution (structured data augmentation from CPGs). Paper 1's timeliness is also higher given the rapid proliferation of LLM agents.

vs. MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

gpt-5.2 · 5/27/2026

Paper 2 has higher likely impact due to direct, high-stakes real-world applicability in clinical decision support, a clear and timely problem (reliable medical reasoning), and a novel supervision source: converting guidelines into executable logic plus factual/counterfactual QA for training. It reports gains across multiple benchmarks and includes physician evaluation, strengthening rigor and practical relevance. Paper 1 is broadly relevant to agent design and skill reuse, but the contribution is more incremental within an active area and its validated impact is narrower and more dependent on benchmark/ecosystem adoption.

vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning

gpt-5.2 · 5/27/2026

Paper 2 has higher likely impact due to broader applicability and timeliness: scaling-out collective reasoning for long-horizon agent tasks generalizes across domains (software engineering, robotics, scientific discovery, operations), not just medicine. Its framework-level contribution (shared reasoning hub + SFT/RL training) can influence multi-agent architectures and evaluation paradigms widely. Paper 1 is innovative and high-value for clinical AI, but its impact is narrower to guideline-rich healthcare settings and depends on guideline availability/maintainability and clinical validation pathways. Overall, AgentFugue’s cross-field breadth and relevance to current agentic research give it higher estimated impact.

vs. Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical gap in medical AI by translating clinical guidelines into executable decision logic, significantly enhancing the reliability of medical LLMs. This has profound real-world implications for healthcare and patient outcomes. In contrast, Paper 2 presents a highly specialized, incremental quantization technique tailored to a specific video generation challenge. Paper 1 offers greater novelty, broader cross-disciplinary applicability, and stronger potential for significant societal and scientific impact.

vs. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it introduces a novel, generalizable pipeline that converts clinical guidelines into executable logic to generate factual/counterfactual supervision, directly improving model reliability in a high-stakes domain. The approach has clear real-world applicability (clinical decision support), strong methodological signals (multi-benchmark gains plus physician evaluation), and timely relevance given deployment pressures for medical LLMs. Paper 1 is valuable as benchmarking infrastructure for personalization/proactiveness, but its impact is more indirect (evaluation-focused) and may be narrower unless it becomes a widely adopted standard.

vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

gemini-3.1 · 5/27/2026

Paper 2 tackles a high-stakes application (clinical reasoning) by introducing a novel approach that translates clinical guidelines into executable decision logic for LLMs. It demonstrates rigorous evaluation, including a significant 10% performance gain on benchmarks and validation by domain experts (physicians). In contrast, Paper 1 presents a practical Python library and demo for entity linking, which, while useful for engineering pipelines, offers less fundamental scientific innovation and narrower broader impact compared to advancing reliable medical AI.

vs. Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

gpt-5.2 · 5/27/2026

Paper 2 has higher potential impact: it introduces a broadly applicable conceptual and formal framework (GEM) for long-term agent memory, framing it as a new data-management workload with state-trajectory correctness, new operators, and correctness conditions, plus negative results about record-level systems. This targets a timely, cross-cutting bottleneck for AI agents and databases, likely influencing both systems and ML communities and spawning follow-on work (engines, benchmarks, theory). Paper 1 is strong and practical but more domain-specific (clinical guidelines) and incremental within LLM adaptation.

vs. Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

claude-opus-4.6 · 5/27/2026

MedGuideX addresses a critical gap in medical AI by transforming clinical practice guidelines into executable decision logic for training LLMs, achieving significant improvements (10.28%) on clinical reasoning benchmarks with physician validation. Its approach of generating factual and counterfactual QA data from structured guidelines is novel and methodologically rigorous. The direct clinical applications, scalability of the training pipeline, and potential to improve healthcare decision-making give it broader real-world impact. POLAR is innovative for embodied agent personalization but addresses a narrower, less immediately impactful application domain.

vs. Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

gpt-5.2 · 5/27/2026

Paper 2 is more novel and broadly impactful: it operationalizes clinical practice guidelines into executable decision logic, generates factual/counterfactual supervision, and shows sizable benchmark gains plus physician-judged improvements in rationale quality. It targets a high-stakes, real-world domain (clinical reasoning) with clear application pathways and timeliness given rapid medical LLM deployment. Paper 1 is a careful robustness comparison but is narrower (single model/dataset, non-significant results) and primarily provides incremental diagnostic insight rather than a new capability or scalable method.

vs. A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

claude-opus-4.6 · 5/27/2026

MedGuideX presents a novel methodology for internalizing executable clinical decision logic from guidelines into LLMs, demonstrating significant improvements (10.28% relative accuracy gain) across four benchmarks with physician validation. It addresses a fundamental challenge in medical AI—faithful clinical reasoning—with a scalable, generalizable pipeline. Paper 2 contributes a useful speech dataset but is more incremental, covering only four conditions with a relatively straightforward benchmark. Paper 1's methodological innovation, stronger empirical results, and broader applicability to reliable medical AI give it substantially higher impact potential.

vs. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

gpt-5.2 · 5/27/2026

Paper 1 has higher likely scientific impact due to a more novel and broadly generalizable approach: converting clinical practice guidelines into executable decision logic to generate factual/counterfactual supervision for LLM post-training. This directly targets a central, high-stakes problem (faithful clinical reasoning) with clear methodological contribution transferable to many guideline-driven domains. It is timely for trustworthy medical AI and shows measurable benchmark gains plus physician evaluation. Paper 2 is rigorous and practical for a specialized domain, but its impact is narrower and more system-integration oriented than a broadly reusable learning paradigm.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

gpt-5.2 · 5/27/2026

Paper 2 introduces a broadly applicable conceptual and methodological contribution: it identifies “composition collapse,” shows aggregate multi-hop scores can be misleading, and proposes a double-gate evaluation protocol that decomposes gains into atomic stability, residual composition, and critical depth. This reframes how post-training and reasoning improvements should be measured across many LLM domains, with immediate relevance to current evaluation practices. Paper 1 is impactful for clinical NLP and guideline-based supervision, but its scope is narrower and more application-specific, whereas Paper 2’s insights and metrics can influence evaluation and training claims across fields.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to broader applicability: it targets a fundamental limitation in multi-turn dialogue RL (compounding distribution shift), offers a unifying framework (Calibrated Interactive RL) plus theoretical analysis, and addresses both policy- and simulator-induced shift—relevant across many interactive LLM applications. Its methodological rigor is strengthened by explicit theory-to-experiment linkage and claims of state-of-the-art performance on multiple tasks. Paper 1 is impactful in medicine, but its domain specificity narrows breadth despite strong real-world relevance.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact because it identifies a general, structural vulnerability in the dominant alignment paradigm (RLHF) with broad implications for safety, deployment, and future training methods across nearly all LLM applications. Its concept (alignment tampering) is novel and timely, provides empirical demonstrations across multiple bias/goal-seeking settings, and highlights limitations of existing mitigations—likely to motivate follow-up work in alignment, evaluation, dataset construction, and governance. Paper 1 is impactful for clinical NLP, but its scope is more domain-specific and incremental relative to broader LLM alignment research.