Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

Yunhan Wang, Yuda Wang, Zhiying Tu, Mingqiang Song, Li Song, Kun Li, Dianhui Chu, Bolin Zhang

Jun 5, 2026 · arXiv:2606.06869v1

cs.AI

#3290 of 3489 · Artificial Intelligence

Tournament Score: 1227 ± 43 (scale 1050 to 1800)

Win Rate: 23%

Rating: 4.2 / 10

Significance 4 · Rigor 3.5 · Novelty 4.5 · Clarity 6.5

Abstract

Aim: Existing AI-assisted traditional Chinese medicine diagnostic tools suffer from opaque reasoning processes, passive interaction, and limited treatment plan presentation. This study proposes a knowledge-enhanced visual diagnostic system to improve the transparency and interpretability of syndrome differentiation and treatment. Methods: The system is built upon a Neo4j knowledge graph comprising 241 syndromes, 1,263 symptoms, and 2,485 relations. It incorporates a four-stage symptom matching pipeline (exact, semantic, fuzzy, and large language model verification), an information gain-driven proactive questioning strategy optimized with genetic algorithms, and a multimodal treatment presentation integrating artificial intelligence-generated illustrations, three-dimensional meridian-acupoint models, and evidence-based literature. Results: Knowledge graph constraints reduced non-standard outputs by 32%. Case studies validated the effectiveness of the interactive workflow across patient self-assessment, clinician-assisted diagnosis, and traditional Chinese medicine education. Automated paired-comparison evaluation across 30 cases further demonstrated significant improvements in diagnostic trust (Cohen's d = 1.82, p < 0.001), reduced cognitive load (improvements in four of five dimensions), and higher credibility of evidence-based references (4.21 vs. 2.95). Conclusions: The proposed system enhances the transparency of traditional Chinese medicine diagnostic reasoning and the interpretability of treatment plans through knowledge graph-driven visualization and multimodal interaction, offering a practical solution for trustworthy artificial intelligence-assisted traditional Chinese medicine applications.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper presents an integrated system for AI-assisted Traditional Chinese Medicine (TCM) diagnosis that addresses three identified gaps: opaque reasoning, passive interaction, and text-only treatment presentation. The system combines a Neo4j knowledge graph (241 syndromes, 1,263 symptoms, 2,485 relations) with a four-stage symptom matching pipeline (exact → semantic → fuzzy → LLM verification), an information-gain-driven proactive questioning strategy optimized via genetic algorithms, and multimodal treatment output including AI-generated illustrations, 3D meridian-acupoint models, and evidence-based literature references.
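
The cascade order described above (exact → semantic → fuzzy → LLM verification) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the vocabulary, the synonym table standing in for semantic search, the `llm_verify` callback, and the 0.8 fuzzy threshold are all hypothetical placeholders, since the paper's actual thresholds are not reported.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional, Tuple

# Hypothetical toy vocabulary; the KG described in the paper has 1,263 symptoms.
KG_SYMPTOMS = {"insomnia", "night sweats", "dry mouth", "fatigue"}
# Hypothetical synonym table standing in for an embedding-based semantic search.
SYNONYMS = {"can't sleep": "insomnia", "tired all the time": "fatigue"}


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from the standard library."""
    return SequenceMatcher(None, a, b).ratio()


def match_symptom(
    text: str,
    llm_verify: Callable[[str, str], bool],
    fuzzy_threshold: float = 0.8,  # assumed value, not taken from the paper
) -> Optional[Tuple[str, str]]:
    """Return (kg_symptom, stage), or None if no stage accepts the input."""
    query = text.strip().lower()

    # Stage 1: exact match against the KG vocabulary.
    if query in KG_SYMPTOMS:
        return query, "exact"

    # Stage 2: semantic match (here: a plain synonym lookup).
    if query in SYNONYMS:
        return SYNONYMS[query], "semantic"

    # Stage 3: fuzzy match; accept the closest term if it is similar enough.
    best = max(sorted(KG_SYMPTOMS), key=lambda s: similarity(query, s))
    if similarity(query, best) >= fuzzy_threshold:
        return best, "fuzzy"

    # Stage 4: the LLM only confirms or rejects an existing KG term,
    # so it cannot introduce a symptom outside the vocabulary.
    if llm_verify(query, best):
        return best, "llm_verified"
    return None
```

Each stage runs only when the cheaper one before it fails, so the LLM is consulted last and only to confirm a term the knowledge graph already contains.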

The novelty lies not in any single component but in the system-level integration: using a knowledge graph simultaneously as a "soft scaffold" for guiding multi-turn interaction and a "hard boundary" for constraining LLM outputs. The diagnosis-treatment dual-chain visualization architecture—where progressive KG subgraphs evolve across interaction rounds—is a thoughtful design contribution that makes the diagnostic reasoning process itself the object of explanation rather than just the final output.
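
To make the "soft scaffold" role concrete, the sketch below shows one way an information-gain criterion could pick the next question from a knowledge graph. It rests on simplifying assumptions that are not from the paper: each syndrome either has a symptom or does not, patients answer yes or no without error, and the syndrome and symptom names are invented. The genetic-algorithm optimization mentioned in the paper is omitted.

```python
import math
from typing import Dict, Set


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def information_gain(
    prior: Dict[str, float], kg: Dict[str, Set[str]], symptom: str
) -> float:
    """Expected entropy reduction from asking about one symptom."""
    # Split candidate syndromes by whether the KG links them to the symptom.
    yes = {s: p for s, p in prior.items() if symptom in kg[s]}
    no = {s: p for s, p in prior.items() if symptom not in kg[s]}
    gain = entropy(prior)
    for branch in (yes, no):
        mass = sum(branch.values())  # probability of this answer
        if mass > 0:
            posterior = {s: p / mass for s, p in branch.items()}
            gain -= mass * entropy(posterior)
    return gain


def next_question(
    prior: Dict[str, float], kg: Dict[str, Set[str]], asked: Set[str]
) -> str:
    """Pick the not-yet-asked symptom with the highest information gain."""
    candidates = set().union(*kg.values()) - asked
    return max(sorted(candidates), key=lambda s: information_gain(prior, kg, s))


# Hypothetical toy graph: syndrome -> symptoms linked to it.
TOY_KG = {
    "syndrome_a": {"night sweats", "fatigue"},
    "syndrome_b": {"night sweats", "fatigue"},
    "syndrome_c": {"fatigue"},
    "syndrome_d": {"chills"},
}
TOY_PRIOR = {name: 0.25 for name in TOY_KG}
```

With the uniform toy prior, asking about "night sweats" splits the four candidates evenly and yields one full bit, whereas "fatigue" or "chills" splits them three to one and yields less, so `next_question(TOY_PRIOR, TOY_KG, set())` selects "night sweats".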

2. Methodological Rigor

The evaluation methodology is the paper's most significant weakness. Several concerns:

Syndrome differentiation accuracy (RQ1): The reported Top-1 accuracy of 47.11% is modest, and the comparison baselines are limited—only DeepSeek R1 (a general-purpose model) and HuatuoGPT-Vision (a multimodal medical model) on a 147-label classification task. No TCM-specific baselines or established syndrome differentiation methods are compared. The authors acknowledge this is not meant to claim "comprehensive superiority," but the evaluation is too thin to convincingly demonstrate the system's diagnostic capability.

LLM-as-a-Judge framework (RQ2, RQ3): The reliance on Claude Sonnet 4.5 as an automated judge for trust, cognitive load, and evidence credibility is problematic. While the authors take steps to mitigate same-source bias (using a different vendor than the system model), fundamental concerns remain: (a) LLMs evaluating LLM-generated outputs may share systematic biases about what constitutes "good" reasoning; (b) having an LLM simulate a "rural resident with middle school education" for cognitive load assessment is a poor proxy for actual user testing; (c) the sample size of N=25-30 cases is small despite the statistical sophistication applied.

Statistical analysis: The statistical methods themselves are appropriate—Wilcoxon signed-rank tests, Benjamini-Hochberg correction, Cohen's d and rank-biserial r. The reported effect sizes are very large (Cohen's d = 1.82 overall for trust), which may partly reflect the artificial nature of the evaluation setup rather than genuine clinical impact. When an LLM judge compares "with KG visualization" vs. "without KG visualization," large effects are almost tautological—of course providing more information scores higher on transparency metrics.
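
For readers unfamiliar with two of the quantities named above, the following sketch shows how a paired-samples Cohen's d and Benjamini-Hochberg adjusted p-values are commonly computed. It assumes the paired variant of d (mean difference divided by the standard deviation of the differences); the paper may use a different variant, and the numbers in the usage note are made up for illustration.

```python
from statistics import mean, stdev
from typing import List, Sequence


def cohens_d_paired(with_kg: Sequence[float], without_kg: Sequence[float]) -> float:
    """Paired Cohen's d: mean of differences over their sample std deviation."""
    diffs = [a - b for a, b in zip(with_kg, without_kg)]
    return mean(diffs) / stdev(diffs)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

For example, `cohens_d_paired([4, 5, 5, 4], [2, 3, 2, 3])` divides a mean difference of 2 by a standard deviation of about 0.82, giving roughly 2.45. A d of that size arises whenever the differences are nearly constant across cases, which is one reason a judge that consistently favors the richer condition produces very large effect sizes.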

The 32% reduction in non-standard outputs is mentioned repeatedly but never formally defined or rigorously measured. What constitutes a "non-standard output"? Against what baseline? This claim lacks the methodological detail to be convincing.
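
As an illustration of the kind of definition that is missing, one hypothetical operationalization treats an output as "non-standard" when its syndrome label falls outside the knowledge graph vocabulary, and reports the relative drop in that rate. Nothing below comes from the paper: the definition, the function names, and the toy labels are assumptions, and the paper does not say whether its 32% is a relative or an absolute reduction.

```python
from typing import Sequence, Set


def non_standard_rate(outputs: Sequence[str], kg_vocabulary: Set[str]) -> float:
    """Share of generated labels that are not terms in the KG vocabulary."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    off_vocabulary = sum(1 for label in outputs if label not in kg_vocabulary)
    return off_vocabulary / len(outputs)


def relative_reduction(baseline_rate: float, constrained_rate: float) -> float:
    """Relative drop from the baseline rate to the constrained rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - constrained_rate) / baseline_rate


# Hypothetical toy data, not results from the paper.
VOCAB = {"syndrome_a", "syndrome_b"}
BASELINE = ["syndrome_a", "made_up_1", "made_up_2", "syndrome_b"]
CONSTRAINED = ["syndrome_a", "syndrome_b", "made_up_1", "syndrome_a"]
```

On the toy data the baseline rate is 0.5 and the constrained rate is 0.25, a relative reduction of 0.5. Reporting both underlying rates alongside the reduction would answer the baseline question raised above.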

3. Potential Impact

The system targets a genuine need in TCM practice—making syndrome differentiation reasoning transparent and treatment plans accessible. The three target user groups (patients, clinicians, students) are well-defined, and the multimodal approach to treatment plan presentation (especially the 3D acupoint model) addresses a real comprehension barrier.

However, the practical impact is constrained by several factors:

The knowledge graph covers only 241 of 1000+ documented syndrome patterns

No real clinical validation has been conducted

The system is Chinese-medicine-specific, limiting broader applicability

The reliance on Gemini 3.1 Pro creates vendor dependency and cost considerations

The broader methodological pattern—using KGs to constrain LLM outputs in medical settings—has transferable value to other clinical decision support systems. The four-layer matching pipeline and information-gain-driven questioning strategy are potentially reusable components.

4. Timeliness & Relevance

The paper addresses a timely intersection of LLMs in healthcare, knowledge graph grounding, and medical AI transparency—all active areas. The concern about LLM hallucination in medical contexts is well-documented, and the KG-as-constraint approach is a pragmatic response. The use of multimodal presentation (AI-generated images, 3D models) reflects current capabilities. However, the specific domain of TCM AI assistance is relatively niche, and the lack of standardized benchmarks in this space makes it difficult to assess progress.

5. Strengths & Limitations

Strengths:

Comprehensive system design addressing multiple real pain points simultaneously

Well-articulated design challenges (DC1-DC3) mapping to concrete solutions

The progressive KG visualization concept is pedagogically sound

Thorough supplementary materials with full evaluation prompts enabling reproducibility of the evaluation pipeline

Honest acknowledgment of limitations

Limitations:

No human user study—the entire evaluation relies on automated LLM judgment, which is insufficient for claims about trust, cognitive load, and comprehension

The Top-1 diagnostic accuracy is low (47.11%), raising questions about clinical utility

The CEMRs dataset cannot be publicly released, limiting reproducibility

The comparison with baselines is uneven (a general-purpose model and a general medical model vs. a KG-constrained system on a domain-specific task)

The system description reads more as an engineering report than a research contribution with generalizable insights

Key design parameters (thresholds δ, τ, and weights α, β, γ) are not reported or analyzed for sensitivity

Overall Assessment: This is a competent systems paper that integrates multiple components into a coherent TCM diagnostic tool. The engineering is thorough and the design rationale is well-articulated. However, the absence of real user evaluation fundamentally undermines the paper's core claims about trust, cognitive load, and clinical utility. The scientific contribution is primarily integrative rather than advancing any individual technique, and the evaluation does not yet meet the standard needed for confident claims about practical impact.

Rating: 4.2 / 10

Significance 4 · Rigor 3.5 · Novelty 4.5 · Clarity 6.5

Generated Jun 8, 2026

Comparison History (22)

Lost vs. Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Paper 2 addresses a broadly relevant and timely question about human-AI interaction in creative writing, with a novel gamified methodology that inverts typical AI-assistant paradigms. Its findings about when humans accept vs. resist AI suggestions have broad implications across HCI, cognitive science, and AI design. Paper 1, while technically sound, addresses a narrower domain (Traditional Chinese Medicine diagnostics) with incremental integration of existing techniques (knowledge graphs, LLMs), limiting its cross-disciplinary impact and audience reach.

claude-opus-4-6 · Jun 11, 2026

Won vs. Frequency-based Constrained Sampling for Interval Patterns

Paper 1 combines highly timely technologies (LLMs, knowledge graphs, multimodal AI) with a practical healthcare application (Traditional Chinese Medicine). Its focus on explainability, diagnostic trust, and cognitive load addresses critical barriers in AI medical adoption. While Paper 2 offers solid algorithmic advancements in pattern mining with mathematical proofs, Paper 1 has a higher potential for broad multidisciplinary impact, immediate real-world utility, and wider visibility in the current AI-driven research landscape.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Paper 2 presents a fully implemented system with empirical results, including statistically significant improvements (Cohen's d = 1.82, p < 0.001) across 30 cases, a concrete knowledge graph with specific data, and a validated multi-stage pipeline. It addresses a tangible clinical need in TCM diagnostics with measurable outcomes. Paper 1, while addressing an interesting theoretical problem (AI hivemind prevention), describes anticipated/expected results from a framework yet to be fully validated, making it more speculative. Paper 2's methodological rigor, completed evaluation, and direct clinical applicability give it higher near-term scientific impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. DyCon: Dynamic Reasoning Control via Evolving Difficulty Modeling

Paper 1 addresses a critical and highly timely challenge in foundational Large Language Models (inference-time reasoning efficiency and 'overthinking') with a training-free method. Its impact spans multiple broad domains like math, coding, and general QA. In contrast, Paper 2, while methodologically rigorous, focuses on a niche application (Traditional Chinese Medicine), significantly limiting its broader scientific and interdisciplinary impact compared to core LLM advancements.

gemini-3.1-pro-preview · Jun 8, 2026

Lost vs. Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

Paper 2 has higher likely scientific impact due to broader applicability and timeliness: open-vocabulary audio-visual event localization is a core multimodal ML problem relevant to video understanding, retrieval, surveillance, robotics, and human-computer interaction. Its methodological contributions (heterogeneous hierarchical graph, semantic constraints across temporal scales, gated fusion, and hyperbolic embedding with entailment regularization) are more generalizable beyond a single domain. Paper 1 is practically useful but is domain-specific (TCM), with impact constrained by clinical validation requirements and narrower transferability.

gpt-5.2 · Jun 8, 2026

Won vs. Bidirectional Search for Longest Paths: Case for Front-to-Front Heuristics

Paper 2 addresses highly timely and impactful challenges in AI healthcare, specifically interpretability, multimodal interaction, and knowledge grounding via LLMs and Knowledge Graphs. Its integration of these cutting-edge technologies offers broad potential applications beyond its specific domain. In contrast, Paper 1 presents a niche algorithmic improvement for longest-path search problems, which, while methodologically rigorous, has a narrower scope and more limited immediate real-world applicability.

gemini-3.1-pro-preview · Jun 8, 2026

Lost vs. OpenSkill: Open-World Self-Evolution for LLM Agents

OpenSkill addresses a fundamental challenge in LLM agent research—self-evolution without supervision—which has broad applicability across AI systems. Its novel framework for bootstrapping learning loops from open-world resources without target-task supervision represents a significant methodological innovation with wide cross-domain impact. Paper 1, while technically sound, addresses a narrower domain (TCM diagnostics) with incremental improvements combining existing techniques (knowledge graphs, LLMs, genetic algorithms). Paper 2's transferable skills across models and self-built verification approach have far greater potential to influence the broader AI research trajectory.

claude-opus-4-6 · Jun 8, 2026

Lost vs. Fog of Love: Engineering Virtuous Agent Behavior with Affinity-based Reinforcement Learning in a Game Environment

Paper 1 has higher potential scientific impact due to greater novelty and broader relevance: it advances affinity-based RL from toy domains to a complex multi-agent setting, addressing timely challenges in AI alignment, interpretability, and multi-objective cooperation/competition. The contribution is methodological and could generalize across RL, multi-agent systems, and value/virtue learning. Paper 2 is application-focused and potentially impactful in practice, but is narrower in domain (TCM), leans toward systems integration/engineering, and its evidence is largely case-study/evaluation-driven rather than introducing broadly reusable new learning methods.

gpt-5.2 · Jun 8, 2026

Won vs. Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks

Paper 2 presents a novel, implemented system with concrete technical contributions (knowledge graph, multi-stage matching pipeline, genetic algorithm optimization, multimodal visualization) and quantitative evaluation showing significant improvements. It addresses a specific, actionable problem in AI-assisted TCM diagnostics with measurable results. Paper 1 is primarily a review/synthesis of existing public datasets without original data collection or novel methodology, offering high-level policy recommendations rather than testable contributions. Paper 2's methodological rigor, system implementation, and empirical validation give it higher potential for scientific impact and follow-on research.

claude-opus-4-6 · Jun 8, 2026

Lost vs. Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces

Paper 1 introduces a novel combinatorial optimization framework (EP-HUBO) for evidence aggregation in LLM reasoning, bridging quantum-inspired optimization with chain-of-thought reasoning—a highly innovative cross-disciplinary contribution. It addresses a fundamental limitation of majority voting in LLMs with broad applicability beyond law. Paper 2, while practical, is a domain-specific engineering system for TCM diagnostics with limited generalizability. Paper 1's methodological novelty, connection to quantum computing paradigms, and potential to influence how LLM reasoning is aggregated across many domains gives it substantially higher impact potential.

claude-opus-4-6 · Jun 8, 2026
