A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization
Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu
Abstract
Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians, with an improvement of 12% to 60%, and (2) its effectiveness in assisting clinicians who work from medical records on real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.
AI Impact Assessments
Scientific Impact Assessment: "A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization"
1. Core Contribution
Hygieia is a multi-modal AI agent system that addresses two clinically important tasks: rare disease diagnosis and risk gene prioritization. Its key architectural innovations include: (1) a router-based system that distinguishes common from rare diseases and applies different diagnostic pipelines accordingly, (2) a verifier-corrector self-reflection mechanism for iterative output validation, (3) confidence estimation via majority voting, and (4) unification of disease diagnosis and gene prioritization within a single framework. The system integrates phenotypic features, genetic profiles, and clinical records, leveraging external knowledge retrieval (web search, PubMed, patient databases) alongside LLM reasoning.
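To make "confidence estimation via majority voting" concrete, here is a minimal sketch of the general technique, not the authors' implementation. It assumes the system is sampled several times on the same case and that each run returns one diagnosis string; the function name `majority_vote_confidence`, the lowercase normalization step, and the sample diagnoses are hypothetical choices made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def majority_vote_confidence(diagnoses: List[str]) -> Tuple[str, float]:
    """Return the most frequent diagnosis and the fraction of runs that agree.

    Assumption: each element is the top diagnosis from one independent run.
    Labels are normalized by trimming whitespace and lowercasing; a real
    system would more likely map them to ontology identifiers.
    """
    if not diagnoses:
        raise ValueError("need at least one sampled diagnosis")
    normalized = [d.strip().lower() for d in diagnoses]
    counts = Counter(normalized)
    # most_common(1) returns the label with the highest count;
    # on a tie, the label seen first wins.
    top_label, votes = counts.most_common(1)[0]
    return top_label, votes / len(normalized)


# Hypothetical outputs from five sampled runs on one case.
samples = [
    "Marfan syndrome",
    "marfan syndrome ",
    "Loeys-Dietz syndrome",
    "Marfan syndrome",
    "Ehlers-Danlos syndrome",
]
label, confidence = majority_vote_confidence(samples)
print(label, confidence)
```

In this sketch the confidence is simply the agreement rate among runs, so it measures output consistency rather than calibrated probability of correctness.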
The paper addresses a genuine clinical bottleneck—the "diagnostic odyssey" of rare diseases averaging 4-5 years—and the inability of existing AI systems to distinguish common from rare diseases, maintain output consistency, or provide interpretable reasoning chains.
2. Methodological Rigor
Strengths in evaluation breadth: The paper evaluates across seven datasets (MyGene2, four RareBench splits, RareArena, and in-house YSM/YNHH data), providing reasonable coverage of different data distributions. The inclusion of ablation studies examining base model choice, context information, and verifier contributions adds analytical depth.
Concerns:
3. Potential Impact
The rare disease space genuinely needs better diagnostic tools, and the integration of diagnosis with gene prioritization is clinically meaningful—physicians need both to order appropriate genetic testing and reach diagnoses. The router concept (separating common from rare disease pipelines) is practically sensible and addresses a real failure mode of existing systems.
The human-AI collaboration framework, while preliminary, points toward a valuable deployment paradigm where Hygieia serves as a verification tool for physician diagnoses, not just a standalone diagnostic engine. The case study showing correction of a misdiagnosis (Figure 6c) illustrates a compelling use case.
However, the translational path has obstacles: reliance on commercial APIs raises cost and privacy concerns for clinical deployment; the system's performance is tightly coupled to specific model versions (GPT-5, Claude Sonnet 4.5); and the lack of prospective clinical validation limits immediate clinical applicability.
4. Timeliness & Relevance
The paper is highly timely, arriving as LLM-based medical agents are proliferating but few have been rigorously applied to rare diseases with gene prioritization. The concurrent publication of DeepRare in Nature (reference [5]) demonstrates active competition in this space. Hygieia differentiates itself through its router mechanism, confidence estimation, and dual-task capability, though the competitive landscape is rapidly evolving.
The paper also arrives at a moment when the AI agent paradigm is maturing beyond simple prompting, which makes its multi-agent design with tool use and self-verification directly relevant to current work.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The architectural design is sensible but incremental—routing, retrieval-augmented generation, and self-verification are established techniques being combined in a domain-specific manner. The novelty lies more in the application context and integration than in individual components. The paper would benefit from a more rigorous statistical framework for the human comparison and a clearer characterization of failure modes and limitations of the confidence estimation approach.
Generated May 8, 2026
Comparison History (20)
Paper 1 demonstrates higher scientific impact by introducing a novel paradigm of autonomous AI-driven scientific modeling. By using LLM-guided tree search to generate executable forecasting code, it matches or outperforms gold-standard CDC ensembles in rigorous, prospective real-time trials. While Paper 2 presents an excellent clinical diagnostic tool, Paper 1's framework of autonomously translating complex theory into transparent, executable software solves a critical human labor bottleneck. This methodological breakthrough has transformational potential for automated hypothesis generation and modeling across multiple data-scarce scientific domains beyond public health.
While Paper 1 addresses crucial AI safety issues, Paper 2 demonstrates higher potential for profound real-world impact by tackling rare disease diagnosis. The integration of multi-modal clinical and genetic data, combined with rigorous real-world validation by top medical institutions, sets it apart. Achieving up to a 60% diagnostic improvement over physicians highlights a transformative application of AI in precision medicine. Paper 2's interdisciplinary breadth across AI, genomics, and clinical practice, along with its direct potential to save lives and reduce diagnostic odysseys, gives it superior estimated scientific and societal impact.
Paper 1 addresses a fundamental theoretical gap in causal fairness for generative AI—a rapidly expanding and consequential area. It provides a novel formal framework unifying causal fairness across standard ML and generative AI settings, with new decomposition results and identification conditions. This foundational contribution has broad applicability across all generative AI applications and high-stakes domains, likely influencing future fairness research methodology. Paper 2, while clinically valuable with strong practical results, is more application-specific (rare disease diagnosis) and represents an engineering integration of existing approaches rather than a new theoretical paradigm.
Paper 2 demonstrates profound real-world clinical utility by significantly improving rare disease diagnosis and risk gene prioritization over human physicians. Its multi-modal integration of genomics and clinical records offers broader, immediate impact across medicine and bioinformatics compared to Paper 1's more niche, though highly innovative, application in intracranial brain-computer interfaces.
Paper 2 likely has higher scientific impact due to its direct, high-stakes real-world application in rare disease diagnosis, multi-modal integration (phenotype/genomics/records), and demonstrated clinical validation with expert collaborators, including comparisons against physicians and real-world case utility—supporting translational adoption. Its breadth spans AI, genomics, clinical informatics, and healthcare delivery, and it is highly timely. Paper 1 is methodologically rigorous and novel for MoE theory/metrics, but its impact is more specialized to ML training diagnostics and may translate more slowly outside ML systems research.
Paper 1 likely has higher scientific impact due to its direct clinical relevance and potential to change real-world rare disease diagnosis workflows. It integrates multimodal clinical/genetic data, addresses hallucination, provides confidence estimates, and reports validation with clinicians plus large performance gains vs physicians—suggesting tangible translational impact and broad implications for precision medicine and genomics. Paper 2 is novel and rigorous for diffusion-model alignment, but its impact is more specialized to generative modeling; applications are strong yet generally less societally critical than improved rare disease diagnosis.
Paper 1 demonstrates higher potential impact due to its direct, validated application in clinical settings. While Paper 2 offers an innovative approach for materials discovery, Paper 1 addresses the critical real-world problem of rare disease diagnosis. By demonstrating a 12-60% diagnostic improvement over clinical experts from top medical institutions and effectively integrating multi-modal genomic and phenotypic data, Paper 1 promises immediate, life-saving applications in precision medicine. Its rigorous real-world validation gives it a profound edge in both immediate societal relevance and cross-disciplinary clinical impact.
Paper 1 addresses a critical real-world problem (rare disease diagnosis) and demonstrates profound clinical improvements (12-60%) over human physicians. Its practical validation with top medical institutions gives it exceptional translational scientific impact, whereas Paper 2, while methodologically rigorous, is primarily focused on LLM engineering and benchmarking.
Paper 1 introduces a mathematically grounded, universal, random-access vector quantization scheme tailored to KV-cache geometry, with provable gains and strong empirical results on LLMs. Its impact is broad and timely: reducing inference memory traffic directly affects deployment cost/latency across many models and systems, and the method is likely reusable beyond KV caches. Paper 2 has high application value but resembles an integration-heavy clinical agent; impact may be constrained by dataset/generalization, regulatory/clinical adoption hurdles, and reproducibility/validation requirements.
Paper 1 presents a highly impactful, clinically validated AI agent for rare disease diagnosis. Its multi-modal approach, direct collaboration with top medical schools, and demonstrated superior performance over physicians give it profound real-world applicability and societal value. While Paper 2 offers strong methodological improvements for LLM reasoning, Paper 1 addresses a critical bottleneck in healthcare with tangible, immediate benefits for patient outcomes.
Paper 2 (Hygieia) addresses a critical clinical need—rare disease diagnosis—with a concrete AI system demonstrating state-of-the-art performance and validated clinical utility with 12-60% improvement over physicians. It has immediate real-world applications in healthcare, rigorous multi-benchmark evaluation, and clinical validation with experts from major institutions. Paper 1 contributes valuable HCI/design insights about human-AI collaboration (intentmaking/sensemaking), but is based on a small formative study (n=11) with primarily conceptual/qualitative contributions, limiting its immediate measurable impact compared to Paper 2's direct clinical applications.
Paper 1 addresses a critical, life-saving application in rare disease diagnosis with a novel multi-modal AI agent. Its demonstrated real-world clinical validation and significant quantitative improvements over human physicians (12-60%) offer profound and immediate societal impact. While Paper 2 presents important theoretical work on AI security, Paper 1's direct clinical utility and methodological rigor give it a higher overall scientific and practical impact.
Paper 2 has higher potential impact due to direct, high-stakes real-world applicability in rare disease diagnosis, multimodal integration (phenotypes, genetics, records), and clinician-in-the-loop validation suggesting near-term translational relevance. If robustly evaluated, gains over physicians and deployment as decision support could influence clinical practice, biomedical AI, and genomics broadly. Paper 1 is novel and methodologically interesting for long-horizon time-series generation, but its impact is likely narrower (generative modeling/time-series) and more incremental relative to rapidly evolving diffusion/flow methods, with less immediate societal/clinical leverage.
Paper 1 likely has higher scientific impact due to stronger novelty and broader real-world applicability: it is multimodal (phenotype, genetics, clinical records), targets rare-disease diagnosis plus risk gene prioritization (a major unmet need), and explicitly addresses hallucination with a router/knowledge-enhanced design and calibrated confidence. The clinical validations across institutions and claims of substantial physician-comparison gains suggest meaningful translational potential. Paper 2’s iterative generate-judge-refine approach is useful but conceptually closer to existing self-refinement/critic LLM paradigms and is narrower (treatment-plan text quality) with higher safety/regulatory hurdles.
Hygieia addresses a critical clinical need—rare disease diagnosis—with a validated multi-modal AI system demonstrating 12-60% improvement over physicians and real-world clinical validation at top medical institutions. Its direct clinical applicability, state-of-the-art benchmarks, and practical impact on clinician workload give it broad, immediate real-world significance. Paper 2, while methodologically rigorous and offering important mechanistic insights about LLM failure modes, addresses a narrower interpretability question with primarily negative results (steering doesn't work), limiting its immediate practical impact despite its theoretical contributions.
Paper 2 likely has higher impact due to strong real-world applicability (rare disease diagnosis), multimodal integration, and reported gains over physician baselines with clinical collaborations—positioning it for translational adoption and broad relevance across medicine, genomics, and clinical NLP. Paper 1 is methodologically rigorous and novel in clarifying limits of linear steering despite decodability, with useful implications for LLM interpretability and reliability estimation, but its immediate practical impact is narrower and primarily within LLM mechanistic interpretability rather than direct clinical deployment.
Paper 2 presents a multi-modal AI system for rare disease diagnosis with direct, life-saving clinical applications. It demonstrates exceptional methodological rigor through clinical validation with top medical schools, showing a quantifiable improvement over physician accuracy (12%-60%). The integration of AI, genomics, and clinical data offers broader and more critical scientific impact compared to the urban planning data model presented in Paper 1.
Paper 1 presents a highly innovative, multi-modal AI system addressing a critical bottleneck in medicine: rare disease diagnosis. Its rigorous validation with top medical institutions, demonstrating a 12-60% performance improvement over human physicians, highlights immediate and transformative real-world clinical impact. In contrast, Paper 2 offers a valuable but less groundbreaking data standardization model for urban planning. Paper 1's integration of genomics, clinical AI, and precision medicine gives it a significantly higher ceiling for scientific and societal impact.
Hygieia demonstrates higher potential scientific impact through its direct clinical validation with medical experts from prestigious institutions, showing 12-60% improvement over physicians in rare disease diagnosis. It addresses a critical unmet need (rare disease diagnosis delays), integrates multiple data modalities, and shows immediate real-world clinical utility. While SpatialEpiBench provides a valuable benchmarking contribution for epidemic forecasting, its primary finding is negative (most methods underperform a simple baseline), which, while important, offers less transformative potential. Hygieia's combination of methodological innovation, clinical validation, and practical applicability gives it broader and more immediate impact.
Paper 1 likely has higher scientific impact due to a direct, high-stakes real-world application (rare disease diagnosis) with clear clinical utility, multimodal integration, and validation against physicians plus real clinical records—supporting translational adoption. Its potential to improve diagnostic accuracy, reduce time-to-diagnosis, and prioritize risk genes can affect healthcare outcomes broadly. Paper 2 is novel and timely for AI safety, but as a benchmark study its immediate practical deployment and cross-domain downstream impact may be narrower, and measured effects are relatively small (5.1% IC rate) though important.