A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization

Tianyu Liu, Wangjie Zheng, Rui Yang, Benny Kai Guo Loo, Hui Zhang, Jeffries Lauran, Jianlei Gu, Botao Yu

#81 of 2292 · Artificial Intelligence

Tournament Score: 1549±47
Win Rate: 85% (17 wins, 3 losses, 20 matches)
Rating: 6.2/10

Abstract

Accurate and timely diagnosis is essential for effective treatment, particularly in the context of rare diseases. However, current diagnostic workflows often lead to prolonged assessment times and low accuracy. To address these limitations, we introduce Hygieia, a multi-modal AI agent system designed to support precision disease diagnosis by integrating diverse data sources, including phenotypic features, genetic profiles, and clinical records. Hygieia features a router-based and knowledge-enhanced framework that mitigates hallucination and tailors diagnostic strategies to different disease categories. Notably, it prioritizes risk-related genomic factors for rare diseases and provides confidence scores to assist clinical decision-making. We conducted a comprehensive evaluation demonstrating that Hygieia achieves state-of-the-art performance across multiple diagnostic benchmarks. In collaboration with clinical experts from Yale School of Medicine and Duke-NUS Medical School, we further validated its practical utility by showing (1) Hygieia's superior diagnostic performance compared to physicians, with an improvement of 12% to 60%, and (2) its effectiveness in assisting clinicians with medical records for handling real-world cases. Our findings indicate that Hygieia not only enhances diagnostic accuracy and interpretability but also significantly reduces clinician workload, highlighting its potential as a valuable tool in clinical decision support systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization"

1. Core Contribution

Hygieia is a multi-modal AI agent system that addresses two clinically important tasks: rare disease diagnosis and risk gene prioritization. Its key architectural innovations include: (1) a router-based system that distinguishes common from rare diseases and applies different diagnostic pipelines accordingly, (2) a verifier-corrector self-reflection mechanism for iterative output validation, (3) confidence estimation via majority voting, and (4) unification of disease diagnosis and gene prioritization within a single framework. The system integrates phenotypic features, genetic profiles, and clinical records, leveraging external knowledge retrieval (web search, PubMed, patient databases) alongside LLM reasoning.
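The review states only that confidence is estimated via majority voting; it does not describe how. As a minimal sketch of the general idea, assuming the system samples several independent answers for the same case and reports the vote share of the most common one, the computation might look like the following. The function name, the normalisation step, and the sample answers are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote_confidence(answers):
    """Return (top_answer, confidence), where confidence is the vote share.

    Hypothetical helper: `answers` is a list of diagnosis strings obtained
    from repeated, independent runs on the same case.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    # Normalise so trivial casing/whitespace differences count as one answer.
    normalised = [a.strip().lower() for a in answers]
    counts = Counter(normalised)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(normalised)


# Made-up samples for illustration only.
samples = [
    "Marfan syndrome",
    "marfan syndrome",
    "Loeys-Dietz syndrome",
    "Marfan syndrome ",
    "Marfan syndrome",
]
answer, confidence = majority_vote_confidence(samples)
print(answer, confidence)
```

A vote share of this kind measures agreement between runs, not correctness, which is why the calibration concern raised later in the review matters.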

The paper addresses a genuine clinical bottleneck—the "diagnostic odyssey" of rare diseases averaging 4-5 years—and the inability of existing AI systems to distinguish common from rare diseases, maintain output consistency, or provide interpretable reasoning chains.

2. Methodological Rigor

Strengths in evaluation breadth: The paper evaluates across seven datasets (MyGene2, four RareBench splits, RareArena, and in-house YSM/YNHH data), providing reasonable coverage of different data distributions. The inclusion of ablation studies examining base model choice, context information, and verifier contributions adds analytical depth.

Concerns:

  • The human evaluation involves only three physicians and 23 questions (13 for diagnosis, 10 for gene prioritization). This sample is too small to support conclusions about a 12-60% improvement over physicians: statistical power is insufficient for robust claims about human-AI performance gaps, so the claimed improvements should be interpreted cautiously.
  • The constraint that "human experts are not allowed to use LLMs as assistants" creates an artificial comparison setting. In practice, physicians already use various computational tools; the comparison would be more meaningful if it reflected realistic clinical workflows.
  • The confidence estimation validation (Supplementary Figure 4) relies on a Mann-Whitney U test showing significance (p=9.02E-06), but the practical calibration quality is not thoroughly characterized (e.g., no calibration curves or expected calibration error metrics).
  • The system heavily depends on GPT-5-chat and Claude-Sonnet-4.5 as backbones—closed-source models whose behavior can change over time, creating reproducibility concerns the authors themselves acknowledge.
  • The Recall@K metric, while standard, doesn't fully capture clinical utility. A system might rank the correct disease second but provide a closely related diagnosis that would still guide appropriate clinical action.
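Two of the metrics named in the concerns above can be made concrete with a short sketch. Recall@K is computed here in its usual form (the fraction of cases whose true label appears among the top K ranked predictions). Expected calibration error is shown with equal-width confidence bins, which is one common choice; the paper does not report it, so the binning scheme, function names, and all data below are assumptions made for illustration.

```python
def recall_at_k(ranked_predictions, truths, k):
    """Fraction of cases whose true label appears in the top-k predictions."""
    hits = sum(
        1 for preds, truth in zip(ranked_predictions, truths) if truth in preds[:k]
    )
    return hits / len(truths)


def expected_calibration_error(confidences, correct, n_bins=5):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        # Bins are half-open [lo, hi); a confidence of exactly 1.0 goes in the last bin.
        idx = [
            i
            for i, c in enumerate(confidences)
            if (lo <= c < hi) or (b == n_bins - 1 and c == 1.0)
        ]
        if not idx:
            continue
        accuracy = sum(correct[i] for i in idx) / len(idx)
        mean_conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - mean_conf)
    return ece


# Made-up rankings: the true disease "A" is ranked 1st, 2nd, and 3rd.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truths = ["A", "A", "A"]
print(recall_at_k(ranked, truths, 1), recall_at_k(ranked, truths, 2))

# Made-up confidences paired with 1 (correct) or 0 (incorrect).
confidences = [0.9, 0.9, 0.5, 0.5]
correct = [1, 1, 1, 0]
print(expected_calibration_error(confidences, correct))
```

A significance test can show that confident answers are correct more often than unconfident ones, while a per-bin gap of this kind shows whether a stated confidence of 0.9 actually corresponds to roughly 90% accuracy.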

    3. Potential Impact

    The rare disease space genuinely needs better diagnostic tools, and the integration of diagnosis with gene prioritization is clinically meaningful—physicians need both to order appropriate genetic testing and reach diagnoses. The router concept (separating common from rare disease pipelines) is practically sensible and addresses a real failure mode of existing systems.

    The human-AI collaboration framework, while preliminary, points toward a valuable deployment paradigm where Hygieia serves as a verification tool for physician diagnoses, not just a standalone diagnostic engine. The case study showing correction of a misdiagnosis (Figure 6c) illustrates a compelling use case.

    However, the translational path has obstacles: reliance on commercial APIs raises cost and privacy concerns for clinical deployment; the system's performance is tightly coupled to specific model versions (GPT-5, Claude Sonnet 4.5); and the lack of prospective clinical validation limits immediate clinical applicability.

    4. Timeliness & Relevance

    The paper is highly timely, arriving as LLM-based medical agents are proliferating but few have been rigorously applied to rare diseases with gene prioritization. The concurrent publication of DeepRare in Nature (reference [5]) demonstrates active competition in this space. Hygieia differentiates itself through its router mechanism, confidence estimation, and dual-task capability, though the competitive landscape is rapidly evolving.

    The paper also arrives at a moment when the AI agent paradigm is maturing beyond simple prompting, making the multi-agent architecture with tool use and self-verification architecturally relevant.

    5. Strengths & Limitations

    Key Strengths:

  • Clinically motivated architecture that mirrors actual diagnostic workflows (triage → specialized assessment → verification)
  • Unified framework for diagnosis and gene prioritization—a meaningful advance over single-task systems
  • Comprehensive baseline comparisons covering open- and closed-source LLMs, fine-tuned models, and existing agent systems
  • Open-source code with IRB approval and documented prompts, supporting reproducibility
  • The case studies (Figures 3, 5, 6) provide qualitative evidence of superior reasoning depth
  • The Table 1 comparison clearly positions Hygieia's unique features against competitors

    Notable Limitations:

  • The human evaluation is underpowered (3 experts, 23 questions). The "12-60% improvement" headline claim rests on this thin evidence base.
  • Heavy dependence on proprietary APIs (GPT-5, Claude) undermines long-term reproducibility and clinical deployment feasibility
  • No prospective clinical validation or integration into actual clinical workflows
  • The gene prioritization evaluation acknowledges that "as we increase the pool of candidates, there is a diminishing advantage of Hygieia," suggesting the advantage may be narrow
  • Cost analysis (Supplementary Figure 8) shows Hygieia is cheaper than GPT-5-search but doesn't provide absolute per-case cost figures, which matter for clinical deployment
  • The paper's writing occasionally veers into promotional language rather than measured scientific claims (e.g., "next generation of AI-assisted medical decision systems"), which somewhat undermines credibility
  • No error analysis examining systematic failure modes—understanding where Hygieia fails would be as informative as where it succeeds
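To illustrate why a 13-question diagnosis set is considered underpowered, the sketch below computes a 95% Wilson score interval for an accuracy measured on 13 items. The question count comes from the review; the number of correct answers (8) is a hypothetical value chosen only for illustration and is not a result from the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical outcome: 8 of the 13 diagnosis questions answered correctly.
low, high = wilson_interval(8, 13)
print(f"accuracy = {8 / 13:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

With so few questions the interval spans well over 40 percentage points, so two accuracies that differ by a headline-sized margin can still have overlapping intervals.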

    Additional Observations

    The architectural design is sensible but incremental—routing, retrieval-augmented generation, and self-verification are established techniques being combined in a domain-specific manner. The novelty lies more in the application context and integration than in individual components. The paper would benefit from a more rigorous statistical framework for the human comparison and a clearer characterization of failure modes and limitations of the confidence estimation approach.

    Rating: 6.2/10
    Significance 6.5 · Rigor 5.5 · Novelty 5.5 · Clarity 6.5

    Generated May 8, 2026

    Comparison History (20)

    vs. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
    gemini-3.1 · 5/18/2026

    Paper 1 demonstrates higher scientific impact by introducing a novel paradigm of autonomous AI-driven scientific modeling. By using LLM-guided tree search to generate executable forecasting code, it matches or outperforms gold-standard CDC ensembles in rigorous, prospective real-time trials. While Paper 2 presents an excellent clinical diagnostic tool, Paper 1's framework of autonomously translating complex theory into transparent, executable software solves a critical human labor bottleneck. This methodological breakthrough has transformational potential for automated hypothesis generation and modeling across multiple data-scarce scientific domains beyond public health.

    vs. Revealing Interpretable Failure Modes of VLMs
    gemini-3.1 · 5/16/2026

    While Paper 1 addresses crucial AI safety issues, Paper 2 demonstrates higher potential for profound real-world impact by tackling rare disease diagnosis. The integration of multi-modal clinical and genetic data, combined with rigorous real-world validation by top medical institutions, sets it apart. Achieving up to a 60% diagnostic improvement over physicians highlights a transformative application of AI in precision medicine. Paper 2's interdisciplinary breadth across AI, genomics, and clinical practice, along with its direct potential to save lives and reduce diagnostic odysseys, gives it superior estimated scientific and societal impact.

    vs. Causal Bias Detection in Generative Artificial Intelligence
    claude-opus-4.6 · 5/16/2026

    Paper 1 addresses a fundamental theoretical gap in causal fairness for generative AI—a rapidly expanding and consequential area. It provides a novel formal framework unifying causal fairness across standard ML and generative AI settings, with new decomposition results and identification conditions. This foundational contribution has broad applicability across all generative AI applications and high-stakes domains, likely influencing future fairness research methodology. Paper 2, while clinically valuable with strong practical results, is more application-specific (rare disease diagnosis) and represents an engineering integration of existing approaches rather than a new theoretical paradigm.

    vs. CORTEG: Foundation Models Enable Cross-Modality Representation Transfer from Scalp to Intracranial Brain Recordings
    gemini-3.1 · 5/16/2026

    Paper 2 demonstrates profound real-world clinical utility by significantly improving rare disease diagnosis and risk gene prioritization over human physicians. Its multi-modal integration of genomics and clinical records offers broader, immediate impact across medicine and bioinformatics compared to Paper 1's more niche, though highly innovative, application in intracranial brain-computer interfaces.

    vs. Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher scientific impact due to its direct, high-stakes real-world application in rare disease diagnosis, multi-modal integration (phenotype/genomics/records), and demonstrated clinical validation with expert collaborators, including comparisons against physicians and real-world case utility—supporting translational adoption. Its breadth spans AI, genomics, clinical informatics, and healthcare delivery, and it is highly timely. Paper 1 is methodologically rigorous and novel for MoE theory/metrics, but its impact is more specialized to ML training diagnostics and may translate more slowly outside ML systems research.

    vs. FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling
    gpt-5.2 · 5/16/2026

    Paper 1 likely has higher scientific impact due to its direct clinical relevance and potential to change real-world rare disease diagnosis workflows. It integrates multimodal clinical/genetic data, addresses hallucination, provides confidence estimates, and reports validation with clinicians plus large performance gains vs physicians—suggesting tangible translational impact and broad implications for precision medicine and genomics. Paper 2 is novel and rigorous for diffusion-model alignment, but its impact is more specialized to generative modeling; applications are strong yet generally less societally critical than improved rare disease diagnosis.

    vs. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation
    gemini-3.1 · 5/16/2026

    Paper 1 demonstrates higher potential impact due to its direct, validated application in clinical settings. While Paper 2 offers an innovative approach for materials discovery, Paper 1 addresses the critical real-world problem of rare disease diagnosis. By demonstrating a 12-60% diagnostic improvement over clinical experts from top medical institutions and effectively integrating multi-modal genomic and phenotypic data, Paper 1 promises immediate, life-saving applications in precision medicine. Its rigorous real-world validation gives it a profound edge in both immediate societal relevance and cross-disciplinary clinical impact.

    vs. UniToolCall: Unifying Tool-Use Representation, Data, and Evaluation for LLM Agents
    gemini-3.1 · 5/16/2026

    Paper 1 addresses a critical real-world problem (rare disease diagnosis) and demonstrates profound clinical improvements (12-60%) over human physicians. Its practical validation with top medical institutions gives it exceptional translational scientific impact, whereas Paper 2, while methodologically rigorous, is primarily focused on LLM engineering and benchmarking.

    vs. FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
    gpt-5.2 · 5/16/2026

    Paper 1 introduces a mathematically grounded, universal, random-access vector quantization scheme tailored to KV-cache geometry, with provable gains and strong empirical results on LLMs. Its impact is broad and timely: reducing inference memory traffic directly affects deployment cost/latency across many models and systems, and the method is likely reusable beyond KV caches. Paper 2 has high application value but resembles an integration-heavy clinical agent; impact may be constrained by dataset/generalization, regulatory/clinical adoption hurdles, and reproducibility/validation requirements.

    vs. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
    gemini-3.1 · 5/16/2026

    Paper 1 presents a highly impactful, clinically validated AI agent for rare disease diagnosis. Its multi-modal approach, direct collaboration with top medical schools, and demonstrated superior performance over physicians give it profound real-world applicability and societal value. While Paper 2 offers strong methodological improvements for LLM reasoning, Paper 1 addresses a critical bottleneck in healthcare with tangible, immediate benefits for patient outcomes.

    vs. Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
    claude-opus-4.6 · 5/8/2026

    Paper 2 (Hygieia) addresses a critical clinical need—rare disease diagnosis—with a concrete AI system demonstrating state-of-the-art performance and validated clinical utility with 12-60% improvement over physicians. It has immediate real-world applications in healthcare, rigorous multi-benchmark evaluation, and clinical validation with experts from major institutions. Paper 1 contributes valuable HCI/design insights about human-AI collaboration (intentmaking/sensemaking), but is based on a small formative study (n=11) with primarily conceptual/qualitative contributions, limiting its immediate measurable impact compared to Paper 2's direct clinical applications.

    vs. Authorization Propagation in Multi-Agent AI Systems: Identity Governance as Infrastructure
    gemini-3.1 · 5/8/2026

    Paper 1 addresses a critical, life-saving application in rare disease diagnosis with a novel multi-modal AI agent. Its demonstrated real-world clinical validation and significant quantitative improvements over human physicians (12-60%) offer profound and immediate societal impact. While Paper 2 presents important theoretical work on AI security, Paper 1's direct clinical utility and methodological rigor give it a higher overall scientific and practical impact.

    vs. SDFlow: Similarity-Driven Flow Matching for Time Series Generation
    gpt-5.2 · 5/8/2026

    Paper 2 has higher potential impact due to direct, high-stakes real-world applicability in rare disease diagnosis, multimodal integration (phenotypes, genetics, records), and clinician-in-the-loop validation suggesting near-term translational relevance. If robustly evaluated, gains over physicians and deployment as decision support could influence clinical practice, biomedical AI, and genomics broadly. Paper 1 is novel and methodologically interesting for long-horizon time-series generation, but its impact is likely narrower (generative modeling/time-series) and more incremental relative to rapidly evolving diffusion/flow methods, with less immediate societal/clinical leverage.

    vs. TheraAgent: Self-Improving Therapeutic Agent for Precise and Comprehensive Treatment Planning
    gpt-5.2 · 5/8/2026

    Paper 1 likely has higher scientific impact due to stronger novelty and broader real-world applicability: it is multimodal (phenotype, genetics, clinical records), targets rare-disease diagnosis plus risk gene prioritization (a major unmet need), and explicitly addresses hallucination with a router/knowledge-enhanced design and calibrated confidence. The clinical validations across institutions and claims of substantial physician-comparison gains suggest meaningful translational potential. Paper 2’s iterative generate-judge-refine approach is useful but conceptually closer to existing self-refinement/critic LLM paradigms and is narrower (treatment-plan text quality) with higher safety/regulatory hurdles.

    vs. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
    claude-opus-4.6 · 5/8/2026

    Hygieia addresses a critical clinical need—rare disease diagnosis—with a validated multi-modal AI system demonstrating 12-60% improvement over physicians and real-world clinical validation at top medical institutions. Its direct clinical applicability, state-of-the-art benchmarks, and practical impact on clinician workload give it broad, immediate real-world significance. Paper 2, while methodologically rigorous and offering important mechanistic insights about LLM failure modes, addresses a narrower interpretability question with primarily negative results (steering doesn't work), limiting its immediate practical impact despite its theoretical contributions.

    vs. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
    gpt-5.2 · 5/8/2026

    Paper 2 likely has higher impact due to strong real-world applicability (rare disease diagnosis), multimodal integration, and reported gains over physician baselines with clinical collaborations—positioning it for translational adoption and broad relevance across medicine, genomics, and clinical NLP. Paper 1 is methodologically rigorous and novel in clarifying limits of linear steering despite decodability, with useful implications for LLM interpretability and reliability estimation, but its immediate practical impact is narrower and primarily within LLM mechanistic interpretability rather than direct clinical deployment.

    vs. Housing Potential Common Data Model and City Digital Twin
    gemini-3.1 · 5/8/2026

    Paper 2 presents a multi-modal AI system for rare disease diagnosis with direct, life-saving clinical applications. It demonstrates exceptional methodological rigor through clinical validation with top medical schools, showing a quantifiable improvement over physician accuracy (12%-60%). The integration of AI, genomics, and clinical data offers broader and more critical scientific impact compared to the urban planning data model presented in Paper 1.

    vs. Housing Potential Common Data Model and City Digital Twin
    gemini-3.1 · 5/8/2026

    Paper 1 presents a highly innovative, multi-modal AI system addressing a critical bottleneck in medicine: rare disease diagnosis. Its rigorous validation with top medical institutions, demonstrating a 12-60% performance improvement over human physicians, highlights immediate and transformative real-world clinical impact. In contrast, Paper 2 offers a valuable but less groundbreaking data standardization model for urban planning. Paper 1's integration of genomics, clinical AI, and precision medicine gives it a significantly higher ceiling for scientific and societal impact.

    vs. SpatialEpiBench: Benchmarking Spatial Information and Epidemic Priors in Forecasting
    claude-opus-4.6 · 5/8/2026

    Hygieia demonstrates higher potential scientific impact through its direct clinical validation with medical experts from prestigious institutions, showing 12-60% improvement over physicians in rare disease diagnosis. It addresses a critical unmet need (rare disease diagnosis delays), integrates multiple data modalities, and shows immediate real-world clinical utility. While SpatialEpiBench provides a valuable benchmarking contribution for epidemic forecasting, its primary finding is negative (most methods underperform a simple baseline), which, while important, offers less transformative potential. Hygieia's combination of methodological innovation, clinical validation, and practical applicability gives it broader and more immediate impact.

    vs. Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors
    gpt-5.2 · 5/8/2026

    Paper 1 likely has higher scientific impact due to a direct, high-stakes real-world application (rare disease diagnosis) with clear clinical utility, multimodal integration, and validation against physicians plus real clinical records—supporting translational adoption. Its potential to improve diagnostic accuracy, reduce time-to-diagnosis, and prioritize risk genes can affect healthcare outcomes broadly. Paper 2 is novel and timely for AI safety, but as a benchmark study its immediate practical deployment and cross-domain downstream impact may be narrower, and measured effects are relatively small (5.1% IC rate) though important.