Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang
Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.
The paper addresses what the authors term the "Pulmonary Knowledge-to-Diagnosis Gap": the disconnect between LLMs performing well on medical knowledge QA tasks and actually reasoning over electronic medical records (EMRs) to produce accurate pulmonary diagnoses. Two primary contributions emerge:
LungKG: A structured pulmonary knowledge graph with 59,038 nodes and 164,308 edges spanning 15 entity types and 112 relation types. It organizes disease-symptom, pathogen-infection, examination-diagnosis, drug-contraindication, and treatment-condition relations into a directed graph.
Lung-R1: A two-stage training framework that first uses KG-constrained chain-of-thought (CoT) supervision for SFT, then applies KG-guided reinforcement learning with a composite reward function incorporating diagnosis correctness, graph faithfulness, and relation/path consistency.
The core idea is to use a domain-specific knowledge graph not at inference time (as in RAG) but as a training-time structural constraint for both SFT data generation and RL reward shaping. This is a meaningful methodological distinction from prior work.
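To make the reward-shaping idea concrete, here is a minimal sketch of a composite reward over the three components named above. It is not the paper's formulation: the toy graph `TOY_KG`, the weights in `RewardWeights`, and the proxy used for relation/path consistency are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# A knowledge-graph fact as (head, relation, tail); all names are illustrative.
Triple = tuple[str, str, str]

# Hypothetical toy graph standing in for LungKG.
TOY_KG: set[Triple] = {
    ("pneumonia", "has_symptom", "fever"),
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "confirmed_by", "chest_ct"),
}


@dataclass(frozen=True)
class RewardWeights:
    # Assumed weights; the paper's actual coefficients are not reproduced here.
    correctness: float = 0.5
    faithfulness: float = 0.3
    consistency: float = 0.2


def graph_faithfulness(chain: list[Triple], kg: set[Triple]) -> float:
    """Fraction of triples cited in the reasoning chain that exist in the graph."""
    if not chain:
        return 0.0
    return sum(t in kg for t in chain) / len(chain)


def path_consistency(chain: list[Triple], diagnosis: str) -> float:
    """Toy proxy: fraction of cited triples whose head is the predicted diagnosis."""
    if not chain:
        return 0.0
    return sum(head == diagnosis for head, _rel, _tail in chain) / len(chain)


def composite_reward(predicted: str, gold: str, chain: list[Triple],
                     kg: set[Triple], w: RewardWeights = RewardWeights()) -> float:
    """Weighted sum of outcome-level and process-level reward terms."""
    correct = 1.0 if predicted == gold else 0.0
    return (w.correctness * correct
            + w.faithfulness * graph_faithfulness(chain, kg)
            + w.consistency * path_consistency(chain, predicted))


chain = [
    ("pneumonia", "has_symptom", "fever"),     # present in TOY_KG
    ("pneumonia", "treated_with", "placebo"),  # absent from TOY_KG
]
reward = composite_reward("pneumonia", "pneumonia", chain, TOY_KG)
```

In this sketch a correct diagnosis supported by an unsupported triple still loses part of the faithfulness term, which is the kind of process-level penalty that inference-time retrieval alone does not provide.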
Strengths in design: The two-stage pipeline is well-motivated. KG-constrained CoT construction (Equations 1-4) provides a principled mechanism for generating graph-grounded reasoning chains. The RL reward decomposition (Equations 7-10) into diagnosis correctness, graph faithfulness, and relation/path consistency is sensible and interpretable. The inverse-degree sampling strategy for addressing long-tail knowledge coverage is a thoughtful design choice.
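For readers unfamiliar with inverse-degree sampling, the following sketch shows one common way to implement it: each node is drawn with probability proportional to the reciprocal of its degree, so sparsely connected (long-tail) entities are sampled more often than hubs. The function names and the normalization are assumptions; the paper's exact sampling rule may differ.

```python
import random
from collections import Counter


def inverse_degree_weights(edges):
    """Return a probability per node proportional to 1 / degree.

    `edges` is an iterable of (head, relation, tail) triples; degree counts
    both incoming and outgoing edges (an assumption made for this sketch).
    """
    degree = Counter()
    for head, _rel, tail in edges:
        degree[head] += 1
        degree[tail] += 1
    raw = {node: 1.0 / d for node, d in degree.items()}
    total = sum(raw.values())
    return {node: w / total for node, w in raw.items()}


def sample_nodes(edges, k, seed=0):
    """Draw k seed nodes (with replacement), favoring low-degree nodes."""
    weights = inverse_degree_weights(edges)
    nodes = sorted(weights)  # fixed order keeps sampling reproducible per seed
    rng = random.Random(seed)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=k)


# Hypothetical star graph: one hub "a" with three leaf neighbors.
toy_edges = [("a", "r", "b"), ("a", "r", "c"), ("a", "r", "d")]
toy_weights = inverse_degree_weights(toy_edges)
toy_sample = sample_nodes(toy_edges, k=5)
```

In the toy star graph the hub has degree 3 and each leaf has degree 1, so every leaf is three times as likely to be drawn as the hub.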
Evaluation concerns: The evaluation methodology has several notable weaknesses:
Ablation studies are informative but limited. The source ablation (KGQA vs. EMR) demonstrates complementarity, but the RL ablation improves EMR Diagnosis by only 0.0008 points (4.3575 → 4.3583), which is negligible.
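The scale of the RL ablation can be checked against the headline margin reported in the abstract. The short calculation below uses only figures quoted in this review and assumes that 4.3575 is the score without the RL stage and 4.3583 the score with it; the variable names are illustrative.

```python
from decimal import Decimal

# Scores quoted in the review; Decimal avoids floating-point rounding noise.
score_with_rl = Decimal("4.3583")
score_without_rl = Decimal("4.3575")      # assumed to be the pre-RL variant
margin_over_baseline = Decimal("0.1476")  # lead over strongest non-Lung-R1 system

rl_gain = score_with_rl - score_without_rl
strongest_baseline = score_with_rl - margin_over_baseline
rl_share_of_margin = rl_gain / margin_over_baseline

print(f"RL gain on EMR Diagnosis: {rl_gain}")
print(f"Implied strongest baseline score: {strongest_baseline}")
print(f"RL gain as share of total margin: {rl_share_of_margin:.2%}")
```

Under that reading, the RL stage accounts for well under 1% of the reported lead on EMR Diagnosis, so nearly all of the margin is already present before RL is applied.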
Clinical AI: The approach of encoding domain knowledge graphs into training signals rather than relying on inference-time retrieval is a promising paradigm for specialty medical AI. If validated more rigorously, this could influence how domain-specific medical LLMs are built for cardiology, oncology, or other specialties.
Knowledge graph construction: LungKG itself, with its annotation protocol (88.2% entity F-agreement, 83.8% relation F-agreement), could serve as a reusable resource, though the paper does not clearly commit to public release.
Methodological template: The KG-constrained CoT + KG-guided RL pipeline is generalizable beyond pulmonology. The reward decomposition into outcome-level and process-level components could inform medical RL more broadly.
Practical limitations: The system is text-only, excluding imaging, which is fundamental to pulmonary diagnosis. The training corpus is relatively small (15,147 SFT samples, 3,569 EMR cases). The authors appropriately note this is not a clinical decision-support tool.
The work addresses a genuine bottleneck: current medical LLMs excel at knowledge recall but struggle with patient-specific diagnostic reasoning. The timing is relevant given the rapid deployment of LLMs in clinical settings and growing recognition that benchmark performance on MedQA-style tests does not translate to clinical utility. The integration of knowledge graphs with RL-based alignment is timely given the post-DeepSeek-R1 wave of reasoning-focused LLM development.
The paper's framing occasionally overstates novelty. Medical KG-guided LLM training has precedents, and the claim of "first structured pulmonary knowledge graph" requires careful qualification. The competitive baselines include models orders of magnitude larger, which speaks to the approach's efficiency, but the absolute performance differences remain small. The clinical validation, while present, is insufficient for strong claims about diagnostic reliability.
Generated Jun 11, 2026
Paper 2 has higher estimated impact due to broader, more general applicability: a stateful, evidence-calibrated framework for open-ended scientific discovery can transfer across many domains (biology, materials, social science, ML-assisted discovery). It targets a timely, central limitation of autonomous research agents, namely claims that outrun the available evidence, and proposes an explicit mechanism (externalized state) that could influence agent design broadly. Paper 1 is novel and potentially high-impact clinically, but it is domain-specific (pulmonology) and its gains appear incremental; impact may hinge on deployment, regulation, and generalization beyond the constructed KG/benchmarks.
Paper 2 addresses a critical healthcare challenge with direct implications for patient outcomes. The introduction of LungKG provides a valuable, reusable resource for the medical AI community, significantly increasing its citation potential and breadth of impact. Additionally, applying KG-guided reinforcement learning to LLMs for diagnostic reasoning represents a highly timely and impactful methodological advancement compared to the AEC-focused compliance checking in Paper 1.
Paper 1 introduces a highly generalizable infrastructure that enables AI agents to operate complex scientific simulations across 14 Earth science domains. Its breadth of impact, bridging AI and climate/resource modeling, offers far wider real-world applicability and systemic innovation than Paper 2, which, despite its methodological rigor, is confined to a specific medical subfield (pulmonary diagnosis).
Paper 2 likely has higher impact due to broader methodological novelty and cross-domain applicability: it identifies a fundamental failure mode of entropy-based credit assignment in multimodal RL and proposes a principled, general token-selection fix (vision-anchored coupling) that can transfer across many vision-language reasoning tasks and model families. Its contribution is timely for RLVR in multimodal models and can influence training recipes widely. Paper 1 is strong and application-relevant, but is more domain-specific (pulmonology/EMR) and its gains, while meaningful, may have narrower adoption outside clinical NLP.
Paper 2 introduces a fundamentally novel paradigm for AI interpretability—bypassing explanation to directly forecast model behavior—which has broad applicability across all LRM applications. Its conceptual innovation (behavior forecasting as a learnable task) opens a new research direction in AI safety/trustworthiness with cross-domain impact. Paper 1, while technically solid, addresses a narrower clinical domain (pulmonary diagnosis) with an incremental combination of knowledge graphs and reinforcement learning. Paper 2's framework is more generalizable, timely given rapid LRM deployment, and likely to inspire significant follow-up work across multiple fields.
Paper 1 introduces a novel, reusable resource (LungKG) and achieves state-of-the-art results in a crucial clinical task (EMR-based diagnostic reasoning). In contrast, Paper 2 is a small exploratory study with statistically insignificant results and low inter-rater reliability, serving primarily as a pilot. Paper 1's concrete methodological advancements and clear empirical success give it a significantly higher potential for real-world application and broad scientific impact.
Paper 1 has higher likely scientific impact due to a concrete, technically novel ML contribution (domain knowledge graph + KG-constrained reasoning chains + KG-guided RL) with direct clinical decision-support potential and reusable resources (LungKG). It reports empirical SOTA gains across multiple benchmarks, indicating methodological rigor and immediate relevance to healthcare AI. Paper 2 is timely and important for AI governance, but its main output is a conceptual/regulatory framework with narrower scientific/technical novelty and less clearly generalizable empirical validation, yielding more limited cross-field methodological impact.
Paper 1 addresses a critical, high-stakes domain (medical diagnosis) by introducing a novel, large-scale pulmonary knowledge graph (LungKG) and a KG-guided LLM framework. Its focus on grounding diagnostic reasoning in EMR data directly tackles major limitations of LLMs in healthcare. While Paper 2 presents a strong open-source multimodal framework, Paper 1's creation of a foundational medical resource and its highly translational clinical applications give it a higher potential for significant scientific and societal impact.
Lung-R1 presents a novel knowledge graph (LungKG) with 59K nodes and 164K edges for pulmonary diagnosis, combined with KG-guided reinforcement learning—a concrete, reusable resource with clear clinical applications. It demonstrates state-of-the-art results on multiple benchmarks with rigorous evaluation across 20 systems. The direct medical application (pulmonary diagnosis from EMRs) has significant real-world impact potential. Paper 1, while interesting as an HCI study, has a smaller sample size (74 participants), narrower scope, and more exploratory findings about human-AI creative interaction without comparable methodological depth or breadth of impact.
Paper 1 offers a foundational contribution to LLM training and evaluation, addressing a critical bottleneck in the field (scalable, complex evaluation). Its methodology for using expert rubrics in RLVR yields significant, transferable improvements across general capabilities. While Paper 2 presents a valuable domain-specific application (pulmonary medicine) with direct clinical potential, Paper 1's broad applicability to foundational model development ensures a much wider and more pervasive scientific impact across the entire AI landscape.