Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

Haoyang Zeng, Yuanxi Fu, Rongzhen Li, Yuming Yang, Xiao Sun, Jingwang Huang, Gujie Shao, Guohui Xiang

Jun 10, 2026 · arXiv:2606.11675v1

cs.AI

#1839 of 3489 · Artificial Intelligence

Tournament Score: 1393 ± 49 (scale 1050 to 1800)

Win Rate: 55%

Rating: 5.5 / 10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 6.5

Abstract

Diagnosing pulmonary diseases requires integrating heterogeneous evidence amid phenotypic variability and cross-disease overlap. Although large language models (LLMs) have shown progress on pulmonary knowledge question answering (QA) and information-processing tasks, reliable pulmonary diagnosis requires patient-specific, relation-aware reasoning over electronic medical record (EMR) evidence rather than isolated knowledge recall. We define this gap between pulmonary knowledge and case-level diagnostic reasoning as the Pulmonary Knowledge-to-Diagnosis Gap. To address it, we introduce LungKG, the first structured pulmonary knowledge graph for diagnostic knowledge organization and record-grounded reasoning. LungKG contains 59,038 nodes and 164,308 edges across 15 entity types and 112 relation types, serving as both a reusable pulmonary knowledge resource and the foundation for LungKG-guided model adaptation. Built on LungKG, we propose Lung-R1, a LungKG-guided pulmonary LLM trained through KG-constrained reasoning-chain construction and KG-guided reinforcement learning. In a 20-system evaluation, Lung-R1-14B achieves state-of-the-art performance across Choice, Pulmonary-QA, and EMR Diagnosis, reaching an EMR Diagnosis score of 4.3583 and surpassing the strongest non-Lung-R1 baseline by 0.1476 points. These results demonstrate the value of LungKG-guided training for EMR-based pulmonary diagnosis.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Lung-R1

1. Core Contribution

The paper addresses what the authors term the "Pulmonary Knowledge-to-Diagnosis Gap": the disconnect between LLMs performing well on medical knowledge QA tasks and their ability to reason over electronic medical records (EMRs) to produce accurate pulmonary diagnoses. Two primary contributions emerge:

LungKG: A structured pulmonary knowledge graph with 59,038 nodes and 164,308 edges spanning 15 entity types and 112 relation types. It organizes disease-symptom, pathogen-infection, examination-diagnosis, drug-contraindication, and treatment-condition relations into a directed graph.

Lung-R1: A two-stage training framework that first uses KG-constrained chain-of-thought (CoT) supervision for SFT, then applies KG-guided reinforcement learning with a composite reward function incorporating diagnosis correctness, graph faithfulness, and relation/path consistency.

The core idea is to use a domain-specific knowledge graph not at inference time (as in RAG) but as a training-time structural constraint for both SFT data generation and RL reward shaping. This is a meaningful architectural distinction from prior work.
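To make the composite reward concrete, here is a minimal sketch of how diagnosis correctness, graph faithfulness, and relation/path consistency could be combined into one scalar. It is an illustration under stated assumptions, not the paper's Equations 7-10: the weights, the function names, the exact-match correctness check, and the triple-based definitions of faithfulness and consistency are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the paper's actual coefficients are not stated here.
    correctness: float = 0.6
    faithfulness: float = 0.2
    consistency: float = 0.2


def graph_faithfulness(chain_triples, kg_edges):
    """Fraction of (head, relation, tail) triples in the chain that exist in the KG."""
    if not chain_triples:
        return 0.0
    return sum(t in kg_edges for t in chain_triples) / len(chain_triples)


def path_consistency(chain_triples):
    """Fraction of consecutive triples where one triple's tail is the next one's head."""
    if len(chain_triples) < 2:
        return 1.0
    linked = sum(a[2] == b[0] for a, b in zip(chain_triples, chain_triples[1:]))
    return linked / (len(chain_triples) - 1)


def composite_reward(predicted, gold, chain_triples, kg_edges, w=RewardWeights()):
    """Weighted sum of an outcome-level term and two process-level terms."""
    correct = 1.0 if predicted == gold else 0.0  # assumed exact-match check
    return (
        w.correctness * correct
        + w.faithfulness * graph_faithfulness(chain_triples, kg_edges)
        + w.consistency * path_consistency(chain_triples)
    )


# Toy KG with made-up relation names, for illustration only.
kg = {
    ("fever", "symptom_of", "pneumonia"),
    ("pneumonia", "treated_by", "antibiotics"),
}
chain = [
    ("fever", "symptom_of", "pneumonia"),
    ("pneumonia", "treated_by", "antibiotics"),
]
reward = composite_reward("pneumonia", "pneumonia", chain, kg)
```

Separating the outcome term from the two process terms keeps each component inspectable, which is the interpretability property the assessment credits to the reward design.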

2. Methodological Rigor

Strengths in design: The two-stage pipeline is well-motivated. KG-constrained CoT construction (Equations 1-4) provides a principled mechanism for generating graph-grounded reasoning chains. The RL reward decomposition (Equations 7-10) into diagnosis correctness, graph faithfulness, and relation/path consistency is sensible and interpretable. The inverse-degree sampling strategy for addressing long-tail knowledge coverage is a thoughtful design choice.
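The inverse-degree sampling idea mentioned above can be sketched as follows. This is an assumption about the general technique (sampling probability proportional to 1/degree, so rarely connected entities are picked more often), not the paper's exact formula; the function names and the toy edges are hypothetical.

```python
import random
from collections import Counter


def inverse_degree_weights(edges):
    """Return a normalized sampling weight per node, proportional to 1 / degree."""
    degree = Counter()
    for head, _relation, tail in edges:
        degree[head] += 1
        degree[tail] += 1
    raw = {node: 1.0 / d for node, d in degree.items()}
    total = sum(raw.values())
    return {node: weight / total for node, weight in raw.items()}


def sample_seed_nodes(edges, k, seed=0):
    """Draw k seed nodes with replacement, favoring low-degree (long-tail) nodes."""
    weights = inverse_degree_weights(edges)
    nodes = sorted(weights)  # fixed order so a given seed is reproducible
    rng = random.Random(seed)
    return rng.choices(nodes, weights=[weights[n] for n in nodes], k=k)


# Toy edges with made-up relation names; "pneumonia" is the hub node.
toy_edges = [
    ("pneumonia", "has_symptom", "cough"),
    ("pneumonia", "has_symptom", "fever"),
    ("pneumonia", "caused_by", "rare_pathogen"),
]
weights = inverse_degree_weights(toy_edges)
seeds = sample_seed_nodes(toy_edges, k=5)
```

In this toy graph the hub has degree 3 and each leaf has degree 1, so the hub receives one third of the weight of any single leaf.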

Evaluation concerns: The evaluation methodology has several notable weaknesses:

The primary evaluation uses LLM-as-Judge scoring with five frozen judge models, which introduces systematic biases. While QWK with physician ratings is reported, the physician validation process is described only at a high level.

The evaluation suite is relatively small: 250 Choice questions, 250 Pulmonary-QA items, and 300 EMR cases. For a 20-system comparison, this limits statistical power.

The EMR Diagnosis improvement of 0.1476 points over the strongest baseline (Claude-Sonnet-4.5) on a 0-5 scale is modest. No confidence intervals or significance tests are reported, making it difficult to assess whether this difference is statistically meaningful.

The ICD-10 distribution of EMR cases is heavily skewed: 64.7% are influenza and pneumonia (J09-J18), limiting generalizability claims.

DeepSeek-R1 is used for both KG construction and training data generation, creating potential circularity concerns.

Ablation studies are informative but limited. The source ablation (KGQA vs. EMR) demonstrates complementarity, but the RL ablation shows only marginal improvement on EMR Diagnosis (4.3575 → 4.3583), which is essentially negligible.
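For readers unfamiliar with the QWK statistic mentioned in the evaluation concerns, the sketch below computes quadratic weighted kappa between two raters in plain Python. It assumes integer ratings on a 0-5 scale; the paper's exact rubric and aggregation across the five judges are not described here, so the example ratings are synthetic.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_rating=0, max_rating=5):
    """Quadratic weighted kappa for two equal-length lists of integer ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    n = max_rating - min_rating + 1
    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1
    total = len(rater_a)
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / total  # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        return 1.0  # both raters used one identical rating throughout
    return 1.0 - numerator / denominator


# Synthetic ratings, not data from the paper.
judge_scores = [0, 1, 2, 3, 4, 5]
physician_scores = [0, 1, 2, 3, 4, 5]
kappa = quadratic_weighted_kappa(judge_scores, physician_scores)
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.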

3. Potential Impact

Clinical AI: The approach of encoding domain knowledge graphs into training signals rather than relying on inference-time retrieval is a promising paradigm for specialty medical AI. If validated more rigorously, this could influence how domain-specific medical LLMs are built for cardiology, oncology, or other specialties.

Knowledge graph construction: LungKG itself, with its annotation protocol (88.2% entity F-agreement, 83.8% relation F-agreement), could serve as a reusable resource, though the paper does not clearly commit to public release.

Methodological template: The KG-constrained CoT + KG-guided RL pipeline is generalizable beyond pulmonology. The reward decomposition into outcome-level and process-level components could inform medical RL more broadly.

Practical limitations: The system is text-only, excluding imaging, which is fundamental to pulmonary diagnosis. The training corpus is relatively small (15,147 SFT samples, 3,569 EMR cases). The authors appropriately note this is not a clinical decision-support tool.

4. Timeliness & Relevance

The work addresses a genuine bottleneck: current medical LLMs excel at knowledge recall but struggle with patient-specific diagnostic reasoning. The timing is relevant given the rapid deployment of LLMs in clinical settings and growing recognition that benchmark performance on MedQA-style tests does not translate to clinical utility. The integration of knowledge graphs with RL-based alignment is timely given the post-DeepSeek-R1 wave of reasoning-focused LLM development.

5. Strengths & Limitations

Key Strengths:

Well-articulated problem framing (Knowledge-to-Diagnosis Gap)

Comprehensive 20-system benchmark comparison including frontier models (GPT-5, Claude-Sonnet-4.5)

Training-time KG integration is more principled than inference-time RAG for encoding structural knowledge

Thorough appendix with prompt templates, annotation protocols, and error analysis

The error taxonomy (Table 11) revealing that 69.83% of discrepancies are semantic equivalence or overprediction is clinically insightful

Notable Weaknesses:

The margin of improvement is thin and not statistically validated

Heavy reliance on LLM-as-Judge evaluation without robust inter-rater reliability analysis

The KG construction uses DeepSeek-R1, the same model family used for training data generation; potential data leakage or circularity is not discussed

No comparison with RAG-based baselines using LungKG at inference time, which would be the most natural ablation

Disease distribution is narrow (pneumonia-dominated), limiting claims about general pulmonary diagnosis

Reproducibility is uncertain — the EMR data cannot be shared, and the KG release status is unclear

The RL improvement is negligible on the primary EMR Diagnosis metric, undermining claims about KG-guided RL's value
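One way to address the missing statistical validation noted above would be a paired bootstrap over per-case scores. The sketch below is a generic percentile bootstrap, not a procedure from the paper; it assumes per-case scores for two systems on the same cases are available, and the scores shown are synthetic placeholders.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-case difference (a minus b)."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample cases with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper


# Synthetic per-case scores on a 0-5 scale, not results from the paper.
system_a = [4.5, 3.5, 5.0, 4.0, 2.5, 4.5]
system_b = [4.0, 3.0, 4.5, 3.5, 2.0, 4.0]
mean_diff, ci_low, ci_high = paired_bootstrap_ci(system_a, system_b)
```

If the resulting interval excludes zero, the observed gap is unlikely to be explained by case sampling alone. With only 300 EMR cases and a gap of 0.1476 points, such a check would show whether the reported margin is distinguishable from noise.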

Additional Observations

The paper's framing occasionally overstates novelty. Medical KG-guided LLM training has precedents, and the claim of "first structured pulmonary knowledge graph" requires careful qualification. The competitive baselines include models orders of magnitude larger, which speaks to the approach's efficiency, but the absolute performance differences remain small. The clinical validation, while present, is insufficient for strong claims about diagnostic reliability.

Rating: 5.5 / 10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 6.5

Generated Jun 11, 2026

Comparison History (29)

Lost vs. StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

Paper 2 has higher estimated impact due to broader, more general applicability: a stateful, evidence-calibrated framework for open-ended scientific discovery can transfer across many domains (biology, materials, social science, ML-assisted discovery). It targets a timely, central limitation of autonomous research agents—overclaiming vs evidence—and proposes an explicit mechanism (externalized state) that could influence agent design broadly. Paper 1 is novel and potentially high-impact clinically, but it is domain-specific (pulmonology) and its gains appear incremental; impact may hinge on deployment, regulation, and generalization beyond the constructed KG/benchmarks.

gpt-5.2 · Jun 11, 2026

Won vs. Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

Paper 2 addresses a critical healthcare challenge with direct implications for patient outcomes. The introduction of LungKG provides a valuable, reusable resource for the medical AI community, significantly increasing its citation potential and breadth of impact. Additionally, applying KG-guided reinforcement learning to LLMs for diagnostic reasoning represents a highly timely and impactful methodological advancement compared to the AEC-focused compliance checking in Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

Paper 1 introduces a highly generalizable infrastructure that enables AI agents to operate complex scientific simulations across 14 Earth science domains. Its breadth of impact, bridging AI and climate/resource modeling, offers far wider real-world applicability and systemic innovation than Paper 2, which, despite its methodological rigor, is confined to a specific medical subfield (pulmonary diagnosis).

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Paper 2 likely has higher impact due to broader methodological novelty and cross-domain applicability: it identifies a fundamental failure mode of entropy-based credit assignment in multimodal RL and proposes a principled, general token-selection fix (vision-anchored coupling) that can transfer across many vision-language reasoning tasks and model families. Its contribution is timely for RLVR in multimodal models and can influence training recipes widely. Paper 1 is strong and application-relevant, but is more domain-specific (pulmonology/EMR) and its gains, while meaningful, may have narrower adoption outside clinical NLP.

gpt-5.2 · Jun 11, 2026

Lost vs. Forecasting Future Behavior as a Learning Task

Paper 2 introduces a fundamentally novel paradigm for AI interpretability—bypassing explanation to directly forecast model behavior—which has broad applicability across all LRM applications. Its conceptual innovation (behavior forecasting as a learnable task) opens a new research direction in AI safety/trustworthiness with cross-domain impact. Paper 1, while technically solid, addresses a narrower clinical domain (pulmonary diagnosis) with an incremental combination of knowledge graphs and reinforcement learning. Paper 2's framework is more generalizable, timely given rapid LRM deployment, and likely to inspire significant follow-up work across multiple fields.

claude-opus-4-6 · Jun 11, 2026

Won vs. Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Paper 1 introduces a novel, reusable resource (LungKG) and achieves state-of-the-art results in a crucial clinical task (EMR-based diagnostic reasoning). In contrast, Paper 2 is a small exploratory study with statistically insignificant results and low inter-rater reliability, serving primarily as a pilot. Paper 1's concrete methodological advancements and clear empirical success give it a significantly higher potential for real-world application and broad scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

Paper 1 has higher likely scientific impact due to a concrete, technically novel ML contribution (domain knowledge graph + KG-constrained reasoning chains + KG-guided RL) with direct clinical decision-support potential and reusable resources (LungKG). It reports empirical SOTA gains across multiple benchmarks, indicating methodological rigor and immediate relevance to healthcare AI. Paper 2 is timely and important for AI governance, but its main output is a conceptual/regulatory framework with narrower scientific/technical novelty and less clearly generalizable empirical validation, yielding more limited cross-field methodological impact.

gpt-5.2 · Jun 11, 2026

Won vs. MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Paper 1 addresses a critical, high-stakes domain (medical diagnosis) by introducing a novel, large-scale pulmonary knowledge graph (LungKG) and a KG-guided LLM framework. Its focus on grounding diagnostic reasoning in EMR data directly tackles major limitations of LLMs in healthcare. While Paper 2 presents a strong open-source multimodal framework, Paper 1's creation of a foundational medical resource and its highly translational clinical applications give it a higher potential for significant scientific and societal impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Lung-R1 presents a novel knowledge graph (LungKG) with 59K nodes and 164K edges for pulmonary diagnosis, combined with KG-guided reinforcement learning—a concrete, reusable resource with clear clinical applications. It demonstrates state-of-the-art results on multiple benchmarks with rigorous evaluation across 20 systems. The direct medical application (pulmonary diagnosis from EMRs) has significant real-world impact potential. Paper 1, while interesting as an HCI study, has a smaller sample size (74 participants), narrower scope, and more exploratory findings about human-AI creative interaction without comparable methodological depth or breadth of impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. ComplexConstraints and Beyond: Expert Rubrics for RLVR

Paper 1 offers a foundational contribution to LLM training and evaluation, addressing a critical bottleneck in the field (scalable, complex evaluation). Its methodology for using expert rubrics in RLVR yields significant, transferable improvements across general capabilities. While Paper 2 presents a valuable domain-specific application (pulmonary medicine) with direct clinical potential, Paper 1's broad applicability to foundational model development ensures a much wider and more pervasive scientific impact across the entire AI landscape.

gemini-3.1-pro-preview · Jun 11, 2026
