SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann

May 5, 2026

arXiv:2605.04012v1

cs.AI (primary)

#15 of 2292 · Artificial Intelligence

Silver · Week 19, 2026

Tournament Score: 1597 ± 33 (scale 1050 to 1800)

Win Rate: 71%

Rating: 7.5/10

Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 7

Abstract

Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians during over 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview, eliciting additional symptom information before providing a diagnosis, perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SymptomAI

1. Core Contribution

SymptomAI represents the first large-scale, real-world evaluation of conversational AI for symptom assessment involving actual patients with genuine health concerns, rather than clinical vignettes or simulated scenarios. The paper makes three distinct contributions: (1) demonstrating that agentic interview strategies that actively elicit symptoms outperform passive, user-guided conversations (the current default of consumer LLMs); (2) showing that AI-generated differential diagnoses (DDx) are more accurate than those from board-certified clinicians given the same conversation transcripts; and (3) leveraging AI-generated diagnoses as phenotype labels for a phenome-wide association study (PheWAS) linking wearable biosignals to nearly 400 conditions across 500,000+ person-days of data.

The central insight—that structured symptom elicitation before diagnosis significantly improves accuracy—is both practically important and directly actionable. Given that ~20% of health-related LLM conversations involve symptom assessment, and most consumer LLMs default to user-guided interaction, this finding has immediate design implications for deployed systems.

2. Methodological Rigor

Strengths in design: The randomized five-arm study design comparing different prompting strategies (base, fixed canonical, flexible canonical, dynamic live, dynamic final) is well-structured and allows meaningful comparisons. The blinded clinical evaluation—where a third clinician ranks DDx lists without knowing their source—is a rigorous approach that minimizes bias.

Clinical evaluation: The involvement of three board-certified Family Medicine physicians with 35+ years of combined post-residency experience, conducting 250+ hours of annotation on 517 cases, provides substantial grounding. McNemar's test showing OR=2.47 (p<0.001) in favor of SymptomAI over clinicians is statistically robust, and the analysis across conversation quality strata strengthens this finding.
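For readers unfamiliar with how a paired comparison yields an odds ratio, the following is a minimal sketch of McNemar's test in its exact binomial form. It is an illustration only: the function name `mcnemar_exact` and the counts in the usage example are hypothetical, and the review does not state which variant of the test (exact or chi-square) the authors used. In a paired design, only the discordant cases matter: `b` counts cases where the AI list was judged correct and the clinician list was not, and `c` counts the reverse.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> tuple[float, float]:
    """Return (odds_ratio, two_sided_p) from discordant-pair counts.

    b: pairs where method A was correct and method B was not.
    c: pairs where method B was correct and method A was not.
    Concordant pairs (both right or both wrong) do not enter the test.
    """
    if b < 0 or c < 0:
        raise ValueError("counts must be non-negative")
    n = b + c
    if n == 0:
        raise ValueError("no discordant pairs to test")

    # The paired (conditional) odds ratio is the ratio of discordant counts.
    odds_ratio = b / c if c else float("inf")

    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    p_value = min(1.0, 2 * tail)
    return odds_ratio, p_value


# Hypothetical counts, NOT taken from the paper.
odds_ratio, p_value = mcnemar_exact(b=30, c=10)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")
```

Because the statistic depends only on discordant pairs, a large odds ratio can coexist with a high rate of agreement between the two methods, which is worth keeping in mind when interpreting the reported OR=2.47.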

Notable limitations in rigor:

The ground truth is self-reported diagnoses from healthcare providers, relayed second-hand by participants. This introduces substantial noise, which the authors acknowledge but cannot fully address.

The clinician baseline is disadvantaged: clinicians reviewed AI-conducted conversations rather than conducting their own interviews. This is an important confound—clinicians may elicit different information through different questioning strategies, and the information architecture of the conversation was optimized for the AI's reasoning, not for human clinical judgment.

Only 1,228 of 13,917 participants (8.8%) reported a diagnosis, and only 517 were clinically evaluated. While representativeness tests suggest minimal selection bias, the low reporting rate raises questions about how well the evaluated subset reflects the full cohort.

The auto-rater (LLM verifier) used to scale evaluation beyond the 517 clinically-reviewed cases, while validated (AUC=0.84, F1=0.92), introduces circular reasoning when Gemini-family models evaluate Gemini-generated outputs.
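The participant funnel behind these limitations can be checked with simple arithmetic. The sketch below uses only the three counts given in the abstract (13,917 randomized, 1,228 reporting a diagnosis, 517 clinician-reviewed); the variable names and the `pct` helper are illustrative, not from the paper.

```python
# Counts taken from the abstract.
randomized = 13_917   # participants randomized to an agent
reported = 1_228      # reported a clinician-provided diagnosis
reviewed = 517        # further evaluated by the clinician panel


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`."""
    return 100 * part / whole


print(f"reported / randomized: {pct(reported, randomized):.1f}%")
print(f"reviewed / reported:   {pct(reviewed, reported):.1f}%")
print(f"reviewed / randomized: {pct(reviewed, randomized):.1f}%")
```

The first ratio reproduces the 8.8% figure cited above. The last shows that the clinician-reviewed cases are under 4% of the randomized cohort, which is why the auto-rater carries so much weight in the scaled-up evaluation.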

3. Potential Impact

Direct clinical applications: The finding that structured symptom interviews significantly improve diagnostic accuracy could reshape how consumer-facing health AI products are designed. The 27.34% average accuracy improvement from eliciting additional information is substantial and practically meaningful.

Wearable integration: The PheWAS analysis linking AI-generated diagnoses with wearable biosignals is innovative. The strong associations for acute respiratory infections (OR>7 for influenza) suggest a pathway toward proactive health monitoring where physiological changes detected by wearables could trigger symptom assessment conversations before users self-initiate.

Public health implications: The scale (13,917 conversations) and real-world deployment context make this directly relevant to the emerging reality of millions using LLMs for health queries. The demonstration that current user-guided approaches are suboptimal provides evidence for regulatory and design guidance.

Adjacent fields: The methodology of using AI-generated labels for large-scale epidemiological analysis (PheWAS) could be broadly applicable beyond this specific use case, enabling population-level studies that would be cost-prohibitive with clinician-generated labels.

4. Timeliness & Relevance

This paper addresses a critical gap at the intersection of two major trends: the explosion of consumer LLM usage for health queries and the growing ecosystem of wearable health devices. With recent studies showing dramatic accuracy drops when laypeople interact with diagnostic AI compared to direct vignette processing (94.5% → 34.5%, Bean et al. 2025), demonstrating viable strategies for maintaining accuracy in real-world layperson interactions is highly timely.

5. Strengths & Limitations

Key strengths:

Unprecedented scale for real-world conversational AI symptom assessment evaluation

Rigorous blinded clinical comparison methodology

Randomized experimental design across multiple agent strategies

Novel integration of conversational AI with wearable biosignal analysis

Auxiliary general population validation (N=1,509) addressing selection bias concerns

Data and code availability commitments

Notable weaknesses:

Self-reported diagnoses as ground truth introduce systematic noise

The comparison with clinicians is asymmetric—clinicians review AI-directed conversations, not their own interviews

The study population consists of Fitbit users and skews female (68.3%), though auxiliary data partially mitigates this

Top-5 accuracy as the primary metric is generous; top-1 accuracy (reported in supplementary materials at ~40%) tells a more sobering story

The PheWAS analysis uses AI-generated labels as ground truth, creating a potentially circular validation when biosignal correlations are interpreted as supporting diagnostic accuracy

Limited analysis of safety outcomes (harm assessment appears only briefly in supplementary materials)

The study focuses predominantly on common, self-limiting conditions; performance on serious or rare conditions remains unclear
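The gap between top-5 and top-1 accuracy noted in the list above follows directly from how top-k accuracy is defined. The sketch below is a simplified illustration: it assumes exact string matching between a ranked DDx list and a single reference diagnosis, whereas the paper relied on clinician and auto-rater judgments of a match. The function name and the three example cases are hypothetical.

```python
def top_k_accuracy(ranked_lists: list[list[str]], truths: list[str], k: int) -> float:
    """Fraction of cases whose reference diagnosis is in the first k DDx entries."""
    if len(ranked_lists) != len(truths):
        raise ValueError("need one reference diagnosis per DDx list")
    if not truths:
        raise ValueError("no cases to score")
    hits = sum(truth in ddx[:k] for ddx, truth in zip(ranked_lists, truths))
    return hits / len(truths)


# Hypothetical cases, NOT taken from the paper.
ddx_lists = [
    ["influenza", "common cold", "covid-19"],          # reference ranked 1st
    ["migraine", "tension headache", "sinusitis"],     # reference ranked 3rd
    ["gastroenteritis", "food poisoning"],             # reference absent
]
references = ["influenza", "sinusitis", "appendicitis"]

print("top-1:", top_k_accuracy(ddx_lists, references, k=1))
print("top-5:", top_k_accuracy(ddx_lists, references, k=5))
```

Top-k accuracy can only stay the same or rise as k grows, so a top-5 headline figure is an upper bound on how often the first-listed diagnosis is right.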

Missing analyses: No stratification by condition severity or urgency is prominently featured, which is critical for understanding safety implications. The paper also lacks discussion of cases where AI diagnosis was confidently wrong, which is arguably more important than aggregate accuracy metrics for a safety-critical application.

Overall Assessment

This is an important contribution that advances the field from curated benchmarks to real-world evaluation at meaningful scale. The core finding about structured symptom elicitation is robust and actionable. However, the claims of superiority over clinicians require careful contextualization given the asymmetric evaluation design. The paper is most impactful as a demonstration of feasibility and a design guide for conversational health AI, rather than as definitive evidence of clinical-grade performance.

Rating: 7.5/10

Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 7

Generated May 6, 2026

Comparison History (42)

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

claude-opus-4.6 · 5/16/2026

The AI-assisted peer review paper addresses a fundamental bottleneck in scientific publishing affecting all fields, deployed at massive scale (22,977 papers at AAAI-26). Its impact is broader because it could transform how science itself is evaluated across all disciplines. While SymptomAI is impressive in medical AI with strong methodology and real-world deployment, the peer review paper's potential to reshape scientific infrastructure gives it wider cross-disciplinary impact. Both are rigorous large-scale deployments, but reforming peer review has cascading effects on all of science.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

gemini-3.1 · 5/15/2026

Paper 2 demonstrates extraordinary scale and real-world application, deploying a medical AI to nearly 14,000 users via Fitbit. Its methodological rigor—combining blinded clinical evaluations, large-scale randomized participant interactions, and integration with 500,000 days of wearable metrics—provides robust evidence for LLM efficacy in everyday healthcare. While Paper 1 offers a valuable novel approach to AI safety and evaluation, Paper 2's massive empirical scale, multi-modal health insights, and direct impact on public health technology give it a substantially higher potential scientific and societal impact.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gemini-3.1 · 5/15/2026

Paper 1 presents a foundational advance by training a model on an unprecedented scale of 200 million patients' claims data. Its impact is extremely broad, significantly improving clinical disease prediction, healthcare expenditure forecasting, and epidemiological causal inference. While Paper 2 offers valuable insights into real-world consumer AI triage, Paper 1 provides a highly scalable, population-level infrastructure with immense methodological rigor that avoids the limitations of self-reported data, making it likely to serve as a cornerstone resource across multiple biomedical and economic fields.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.6 · 5/15/2026

IatroBench introduces a novel, pre-registered benchmark exposing a fundamental and previously unquantified problem: AI safety measures causing iatrogenic harm through identity-contingent withholding. This challenges core assumptions in AI safety alignment and has immediate policy implications for how frontier models are deployed in healthcare. The finding that safety filters systematically harm the most vulnerable users (those who have exhausted referrals) is paradigm-shifting. Paper 2, while impressive in scale and practical value for symptom assessment, represents incremental progress in a well-studied area (AI-assisted diagnosis). IatroBench's cross-cutting implications for AI safety, medical ethics, and evaluation methodology give it broader and deeper impact.

vs. Emotion Concepts and their Function in a Large Language Model

gpt-5.2 · 5/15/2026

Paper 2 has higher likely scientific impact due to its large-scale real-world deployment (N=13,917), randomized comparison across multiple agents, and clinician-blinded evaluation, yielding strong evidence and immediate clinical/consumer-health applications. It also creates a valuable dataset linking symptom dialogues to wearable metrics across hundreds of conditions, enabling broader downstream research. Paper 1 is novel and timely for LLM interpretability/alignment, but its impact depends on generalizability beyond a single model and is less directly actionable; methodological and external-validity signals are weaker than Paper 2’s field study.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

gpt-5.2 · 5/15/2026

Paper 1 is more novel and broadly impactful: it proposes a general paradigm (multi-agent symbolic + metaheuristic “collective intelligence”) for discovering governing equations, a core scientific capability with applications across physics, biology, chemistry, and engineering. If validated, its claims of orders-of-magnitude extrapolation gains and strong interpretability address a major limitation of black-box AI, making it highly timely for scientific discovery. Paper 2 is methodologically strong and highly applicable clinically, but is more incremental (LLM-based symptom interviewing) and constrained by self-reported ground truth and domain specificity, limiting cross-field impact.

vs. Interestingness as an Inductive Heuristic for Future Compression Progress

claude-opus-4.6 · 5/15/2026

Paper 1 demonstrates higher scientific impact due to its large-scale real-world deployment (N=13,917), direct clinical relevance, and actionable findings showing AI agents outperform clinicians in differential diagnosis from symptom interviews. It addresses a massive practical need in healthcare accessibility, combines conversational AI with wearable data across 400+ conditions, and provides rigorous methodology with clinician-blinded evaluation. Paper 2 offers interesting theoretical contributions on interestingness as a heuristic for self-improving systems, but its impact is more niche and speculative, lacking immediate practical applications compared to Paper 1's transformative healthcare implications.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gpt-5.2 · 5/7/2026

Paper 2 has higher likely scientific impact due to its large-scale real-world deployment (N=13,917), randomized comparison across multiple agents, clinician-blinded benchmarking, and creation of a unique corpus linking symptom dialogues to longitudinal wearable signals across ~400 conditions. This yields immediate translational applications in digital health, clinical triage, and population monitoring, with broad relevance to ML, HCI, epidemiology, and medicine. Paper 1 is conceptually novel and cross-disciplinary, but its impact hinges on acceptance and reproducibility of abstract theoretical claims and validations, which typically diffuse more slowly.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gemini-3 · 5/6/2026

Paper 2 demonstrates exceptional methodological rigor and real-world scale through a randomized trial with nearly 14,000 participants, addressing a critical gap in medical AI by moving beyond curated vignettes to everyday symptom assessment. Its integration of LLM diagnostics with extensive wearable sensor data offers profound, immediate implications for digital health, clinical practice, and epidemiology. While Paper 1 presents a valuable methodological advance in materials discovery, Paper 2's unprecedented scale, rigorous clinician-backed validation, and direct societal applications give it a broader and more transformative potential scientific impact.

vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures

gemini-3 · 5/6/2026

Paper 1 represents a massive leap in medical AI by evaluating LLMs in real-world, uncurated settings with a massive randomized cohort (N=13,917). Its integration of conversational AI with continuous wearable data to map physiological shifts to specific conditions is highly innovative and has immense real-world implications for digital health. While Paper 2 offers significant methodological advancements in materials discovery, the unprecedented scale, rigorous clinical validation, and direct societal applicability of Paper 1 give it a higher potential for broad scientific and clinical impact.

vs. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

gemini-3 · 5/6/2026

Paper 1 demonstrates massive real-world impact by deploying a conversational medical AI to nearly 14,000 users, rigorously validating it against clinician performance, and integrating it with physiological wearable data. This interdisciplinary breakthrough bridges AI, public health, and human-computer interaction. While Paper 2 presents a significant methodological advancement in MARL, Paper 1's unprecedented scale, rigorous clinical benchmarking, and immediate societal applicability give it a higher broad scientific impact.

vs. EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

gpt-5.2 · 5/6/2026

Paper 2 likely has higher scientific impact due to strong methodological rigor and real-world relevance: a large randomized deployment (N=13,917) inside a consumer product, blinded clinician comparisons, and extensive clinician annotation, yielding clinically meaningful findings about interview strategy and diagnostic accuracy. It also creates a valuable real-world dialogue dataset and links symptom reports to >500,000 wearable-days across ~400 conditions, enabling downstream digital health research. Paper 1 is novel and broadly applicable, but relies on benchmark improvements in a fast-moving agent-framework space with less demonstrated real-world validation.

vs. Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

gpt-5.2 · 5/6/2026

Paper 2 likely has higher impact due to its scale (13,917 real-world users), rigorous evaluation (randomization, clinician panel annotation, blinded comparison), and broad applicability across digital health, conversational AI, and wearable physiology. It is timely given rapid deployment of LLMs in consumer health and offers both methodological insights (agentic interviewing improves DDx) and a large corpus enabling downstream research (symptom-dialogue + wearable associations across ~400 conditions). Paper 1 is novel and valuable for AV compliance, but is narrower in domain and appears less broadly generalizable beyond traffic-law settings.

vs. Towards Lawful Autonomous Driving: Deriving Scenario-Aware Driving Requirements from Traffic Laws and Regulations

gpt-5.2 · 5/6/2026

Paper 1 has higher impact potential due to its large-scale real-world deployment (N=13,917) and rigorous clinician-blinded comparative evaluation showing improved diagnostic performance over clinicians on the same dialogues. It contributes a valuable naturalistic symptom-dialogue corpus, demonstrates agentic interviewing benefits, and links conversational outcomes to longitudinal wearable physiology across hundreds of conditions—broadly relevant to digital health, clinical NLP, and wearable computing. Paper 2 is timely and useful for AV compliance, but its scope is narrower (jurisdiction-specific laws, scenario dataset size) and the scientific contribution is more incremental pipeline engineering than a large empirical demonstration with broad downstream biomedical implications.

vs. LLM Reasoning Is Latent, Not the Chain of Thought

gpt-5.2 · 5/6/2026

Paper 2 has higher likely impact due to a large randomized real-world deployment (N=13,917) with clinician-blinded comparisons and substantial annotation, yielding actionable evidence about conversational diagnostic agents and interview strategies. It also creates a valuable corpus and links symptom dialogues to wearable physiology across hundreds of conditions, enabling downstream clinical and digital-health research. Paper 1 is timely and conceptually important for LLM interpretability/reasoning research, but as a position/framework paper its immediate real-world application and empirical leverage are narrower than Paper 2’s scale and translational relevance.

vs. LLM Reasoning Is Latent, Not the Chain of Thought

gpt-5.2 · 5/6/2026

Paper 1 combines high novelty with strong real-world impact: a large-scale randomized deployment (N=13,917) in a consumer health app, clinician-blinded comparative evaluation, and a uniquely valuable dataset linking conversational symptom reports to wearable physiology across ~400 conditions. The methodological rigor and immediate clinical/consumer-health applications make it likely to influence both medical AI practice and future research. Paper 2 offers an important conceptual reframing of LLM reasoning, but as a position paper its impact depends on downstream adoption and lacks comparable empirical leverage or direct application.

vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

gemini-3 · 5/6/2026

Paper 1 conducts a massive, real-world deployment of an AI diagnostic agent with nearly 14,000 participants, demonstrating superior diagnostic accuracy compared to human clinicians and successfully linking outcomes to wearable health data. Its direct applications in telemedicine, public health, and consumer digital health provide a significantly broader and more immediate societal impact than the niche, albeit innovative, AI tooling advancements presented in Paper 2.

vs. AI scientists produce results without reasoning scientifically

claude-opus-4.6 · 5/6/2026

Paper 1 addresses a fundamental question about the epistemological validity of AI-driven scientific research—a rapidly growing practice with far-reaching consequences. Its findings (evidence ignored in 68% of traces, scaffold contributes only 1.5% of variance) challenge core assumptions about autonomous AI science and have broad implications across all scientific fields using LLM agents. The rigorous methodology (25,000+ runs, 8 domains, dual evaluation lenses) and the actionable conclusion that reasoning must become a training target will likely influence AI development, scientific policy, and epistemology. Paper 2, while practically valuable, addresses a narrower clinical application domain.

vs. Mechanical Conscience: A Mathematical Framework for Dependability of Machine Intelligenc

gemini-3 · 5/6/2026

Paper 2 presents a large-scale, real-world deployment of conversational AI in healthcare with nearly 14,000 users. Its empirical validation against clinical professionals and integration with wearable data demonstrate significant immediate real-world applicability, methodological rigor, and broad impact in medical AI. While Paper 1 introduces a novel theoretical framework for AI safety, Paper 2 provides concrete, evidence-based advancements in a high-stakes applied domain, likely leading to broader and more immediate scientific impact.

vs. End-to-end autonomous scientific discovery on a real optical platform

gpt-5.2 · 5/6/2026

Paper 1 is more novel and field-shaping: it demonstrates end-to-end autonomous discovery in a real physical system and experimentally validates a previously unreported mechanism, which is a rare, high-impact milestone. Its implications span AI agents, experimental physics, and potentially new optical computing hardware, giving broad cross-field impact and timeliness amid rapid interest in autonomous science and post-Moore hardware. Paper 2 is rigorous and highly applicable with a large real-world deployment, but it is more incremental (improving conversational symptom assessment) and constrained by self-reported ground truth.