SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
Joseph Breda, Fadi Yousif, Beszel Hawkins, Marinela Cotoi, Miao Liu, Ray Luo, Po-Hsuan Cameron Chen, Mike Schaekermann
Abstract
Language models excel at diagnostic assessments on curated medical case studies and vignettes, performing on par with, or better than, clinical professionals. However, existing studies focus on complex scenarios with rich context, making it difficult to draw conclusions about how these systems perform for patients reporting symptoms in everyday life. We deployed SymptomAI, a set of conversational AI agents for end-to-end patient interviewing and differential diagnosis (DDx), via the Fitbit app in a study that randomized participants (N=13,917) to interact with one of five AI agents. This corpus captures diverse communication and a realistic distribution of illnesses from a real-world population. A subset of 1,228 participants reported a clinician-provided diagnosis, and 517 of these were further evaluated by a panel of clinicians over more than 250 hours of annotation. SymptomAI DDx were significantly more accurate (OR = 2.47, p < 0.001) than those from independent clinicians given the same dialogue in a blinded randomized comparison. Moreover, agentic strategies that conduct a dedicated symptom interview to elicit additional symptom information before providing a diagnosis perform substantially better than baseline, user-guided conversations (p < 0.001). An auxiliary analysis on 1,509 conversations from a general US population panel validated that these results generalize beyond wearable device users. We used SymptomAI diagnoses as labels for all 13,917 participants to analyze over 500,000 days of wearable metrics across nearly 400 unique conditions. We identified strong associations between acute infections and physiological shifts (e.g., OR > 7 for influenza). While limited by self-reported ground truth, these results demonstrate the benefits of a dedicated and complete symptom interview compared to a user-guided symptom discussion, which is the default of most consumer LLMs.
AI Impact Assessments
Scientific Impact Assessment: SymptomAI
1. Core Contribution
SymptomAI represents the first large-scale, real-world evaluation of conversational AI for symptom assessment involving actual patients with genuine health concerns, rather than clinical vignettes or simulated scenarios. The paper makes three distinct contributions: (1) demonstrating that agentic interview strategies that actively elicit symptoms outperform passive, user-guided conversations (the current default of consumer LLMs); (2) showing that AI-generated differential diagnoses (DDx) are more accurate than those from board-certified clinicians given the same conversation transcripts; and (3) leveraging AI-generated diagnoses as phenotype labels for a phenome-wide association study (PheWAS) linking wearable biosignals to nearly 400 conditions across 500,000+ person-days of data.
The central insight—that structured symptom elicitation before diagnosis significantly improves accuracy—is both practically important and directly actionable. Given that ~20% of health-related LLM conversations involve symptom assessment, and most consumer LLMs default to user-guided interaction, this finding has immediate design implications for deployed systems.
2. Methodological Rigor
Strengths in design: The randomized five-arm study design comparing different prompting strategies (base, fixed canonical, flexible canonical, dynamic live, dynamic final) is well-structured and allows meaningful comparisons. The blinded clinical evaluation—where a third clinician ranks DDx lists without knowing their source—is a rigorous approach that minimizes bias.
Clinical evaluation: The involvement of three board-certified Family Medicine physicians with 35+ years of combined post-residency experience, conducting 250+ hours of annotation on 517 cases, provides substantial grounding. McNemar's test yields OR=2.47 (p<0.001) in favor of SymptomAI over clinicians, a statistically robust result, and the analysis across conversation quality strata strengthens this finding.
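For readers unfamiliar with how a paired comparison of this kind produces an odds ratio, the following is a minimal sketch of an exact McNemar test. In a paired design, only discordant cases (where exactly one of the two DDx lists was judged correct) carry information, and the odds ratio is the ratio of the two discordant counts. The function name `mcnemar_exact` and the counts in the usage line are hypothetical; they are not taken from the paper, and the paper may have used a different variant of the test (for example, the chi-square approximation).

```python
from math import comb


def mcnemar_exact(ai_only: int, clinician_only: int) -> tuple[float, float]:
    """Return (odds ratio, two-sided exact p-value) from discordant pair counts.

    ai_only        -- cases where only the AI DDx was judged correct
    clinician_only -- cases where only the clinician DDx was judged correct
    Cases where both or neither were correct do not enter the test.
    """
    n = ai_only + clinician_only
    if n == 0:
        raise ValueError("no discordant pairs; the test is undefined")

    # Under the null hypothesis, each discordant pair favors either side
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    k = min(ai_only, clinician_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2.0 * tail)

    # Conditional (matched-pairs) odds ratio: ratio of discordant counts.
    odds_ratio = ai_only / clinician_only if clinician_only else float("inf")
    return odds_ratio, p_value


# Hypothetical counts for illustration only, NOT the study's data.
odds_ratio, p_value = mcnemar_exact(ai_only=30, clinician_only=10)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.4f}")
```

Under this reading, an OR of 2.47 would mean that among cases where the two sources disagreed, the AI list was the correct one roughly 2.5 times as often as the clinician list.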
Notable limitations in rigor:
3. Potential Impact
Direct clinical applications: The finding that structured symptom interviews significantly improve diagnostic accuracy could reshape how consumer-facing health AI products are designed. The 27.34% average accuracy improvement from eliciting additional information is substantial and practically meaningful.
Wearable integration: The PheWAS analysis linking AI-generated diagnoses with wearable biosignals is innovative. The strong associations for acute respiratory infections (OR>7 for influenza) suggest a pathway toward proactive health monitoring where physiological changes detected by wearables could trigger symptom assessment conversations before users self-initiate.
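To make the reported association measure concrete, here is a minimal sketch of how an odds ratio and an approximate 95% confidence interval can be computed from a 2x2 table of a physiological shift against a condition label. The function name `odds_ratio_with_ci`, the choice of a Wald interval, and all counts are assumptions made for illustration; the paper's actual PheWAS pipeline, covariate adjustment, and data are not described here.

```python
import math


def odds_ratio_with_ci(shift_cond: float, shift_nocond: float,
                       noshift_cond: float, noshift_nocond: float,
                       z: float = 1.96) -> tuple[float, float, float]:
    """Return (odds ratio, CI lower, CI upper) for a 2x2 table.

    shift_cond     -- days with the physiological shift and the condition
    shift_nocond   -- days with the shift but without the condition
    noshift_cond   -- days without the shift but with the condition
    noshift_nocond -- days with neither
    """
    cells = [shift_cond, shift_nocond, noshift_cond, noshift_nocond]
    # Haldane-Anscombe correction: add 0.5 to every cell if any cell is zero.
    if min(cells) == 0:
        cells = [x + 0.5 for x in cells]
    a, b, c, d = cells

    odds_ratio = (a * d) / (b * c)
    # Wald standard error of log(OR).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log)
    upper = math.exp(math.log(odds_ratio) + z * se_log)
    return odds_ratio, lower, upper


# Hypothetical counts for illustration only, NOT the study's data.
or_, lo, hi = odds_ratio_with_ci(40, 60, 10, 90)
print(f"OR = {or_:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

An unadjusted table like this ignores confounders such as age, season, or device wear time, so it should be read only as an illustration of the metric, not of the analysis.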
Public health implications: The scale (13,917 conversations) and real-world deployment context make this directly relevant to the emerging reality of millions using LLMs for health queries. The demonstration that current user-guided approaches are suboptimal provides evidence for regulatory and design guidance.
Adjacent fields: The methodology of using AI-generated labels for large-scale epidemiological analysis (PheWAS) could be broadly applicable beyond this specific use case, enabling population-level studies that would be cost-prohibitive with clinician-generated labels.
4. Timeliness & Relevance
This paper addresses a critical gap at the intersection of two major trends: the explosion of consumer LLM usage for health queries and the growing ecosystem of wearable health devices. With recent studies showing dramatic accuracy drops when laypeople interact with diagnostic AI compared to direct vignette processing (94.5% → 34.5%, Bean et al. 2025), demonstrating viable strategies for maintaining accuracy in real-world layperson interactions is highly timely.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Missing analyses: No stratification by condition severity or urgency is prominently featured, which is critical for understanding safety implications. The paper also lacks discussion of cases where AI diagnosis was confidently wrong, which is arguably more important than aggregate accuracy metrics for a safety-critical application.
Overall Assessment
This is an important contribution that advances the field from curated benchmarks to real-world evaluation at meaningful scale. The core finding about structured symptom elicitation is robust and actionable. However, the claims of superiority over clinicians require careful contextualization given the asymmetric evaluation design. The paper is most impactful as a demonstration of feasibility and a design guide for conversational health AI, rather than as definitive evidence of clinical-grade performance.
Generated May 6, 2026
Comparison History (42)
The AI-assisted peer review paper addresses a fundamental bottleneck in scientific publishing affecting all fields, deployed at massive scale (22,977 papers at AAAI-26). Its impact is broader because it could transform how science itself is evaluated across all disciplines. While SymptomAI is impressive in medical AI with strong methodology and real-world deployment, the peer review paper's potential to reshape scientific infrastructure gives it wider cross-disciplinary impact. Both are rigorous large-scale deployments, but reforming peer review has cascading effects on all of science.
Paper 2 demonstrates extraordinary scale and real-world application, deploying a medical AI to nearly 14,000 users via Fitbit. Its methodological rigor—combining blinded clinical evaluations, large-scale randomized participant interactions, and integration with 500,000 days of wearable metrics—provides robust evidence for LLM efficacy in everyday healthcare. While Paper 1 offers a valuable novel approach to AI safety and evaluation, Paper 2's massive empirical scale, multi-modal health insights, and direct impact on public health technology give it a substantially higher potential scientific and societal impact.
Paper 1 presents a foundational advance by training a model on an unprecedented scale of 200 million patients' claims data. Its impact is extremely broad, significantly improving clinical disease prediction, healthcare expenditure forecasting, and epidemiological causal inference. While Paper 2 offers valuable insights into real-world consumer AI triage, Paper 1 provides a highly scalable, population-level infrastructure with immense methodological rigor that avoids the limitations of self-reported data, making it likely to serve as a cornerstone resource across multiple biomedical and economic fields.
IatroBench introduces a novel, pre-registered benchmark exposing a fundamental and previously unquantified problem: AI safety measures causing iatrogenic harm through identity-contingent withholding. This challenges core assumptions in AI safety alignment and has immediate policy implications for how frontier models are deployed in healthcare. The finding that safety filters systematically harm the most vulnerable users (those who have exhausted referrals) is paradigm-shifting. Paper 2, while impressive in scale and practical value for symptom assessment, represents incremental progress in a well-studied area (AI-assisted diagnosis). IatroBench's cross-cutting implications for AI safety, medical ethics, and evaluation methodology give it broader and deeper impact.
Paper 2 has higher likely scientific impact due to its large-scale real-world deployment (N=13,917), randomized comparison across multiple agents, and clinician-blinded evaluation, yielding strong evidence and immediate clinical/consumer-health applications. It also creates a valuable dataset linking symptom dialogues to wearable metrics across hundreds of conditions, enabling broader downstream research. Paper 1 is novel and timely for LLM interpretability/alignment, but its impact depends on generalizability beyond a single model and is less directly actionable; methodological and external-validity signals are weaker than Paper 2’s field study.
Paper 1 is more novel and broadly impactful: it proposes a general paradigm (multi-agent symbolic + metaheuristic “collective intelligence”) for discovering governing equations, a core scientific capability with applications across physics, biology, chemistry, and engineering. If validated, its claims of orders-of-magnitude extrapolation gains and strong interpretability address a major limitation of black-box AI, making it highly timely for scientific discovery. Paper 2 is methodologically strong and highly applicable clinically, but is more incremental (LLM-based symptom interviewing) and constrained by self-reported ground truth and domain specificity, limiting cross-field impact.
Paper 1 demonstrates higher scientific impact due to its large-scale real-world deployment (N=13,917), direct clinical relevance, and actionable findings showing AI agents outperform clinicians in differential diagnosis from symptom interviews. It addresses a massive practical need in healthcare accessibility, combines conversational AI with wearable data across 400+ conditions, and provides rigorous methodology with clinician-blinded evaluation. Paper 2 offers interesting theoretical contributions on interestingness as a heuristic for self-improving systems, but its impact is more niche and speculative, lacking immediate practical applications compared to Paper 1's transformative healthcare implications.
Paper 2 has higher likely scientific impact due to its large-scale real-world deployment (N=13,917), randomized comparison across multiple agents, clinician-blinded benchmarking, and creation of a unique corpus linking symptom dialogues to longitudinal wearable signals across ~400 conditions. This yields immediate translational applications in digital health, clinical triage, and population monitoring, with broad relevance to ML, HCI, epidemiology, and medicine. Paper 1 is conceptually novel and cross-disciplinary, but its impact hinges on acceptance and reproducibility of abstract theoretical claims and validations, which typically diffuse more slowly.
Paper 2 demonstrates exceptional methodological rigor and real-world scale through a randomized trial with nearly 14,000 participants, addressing a critical gap in medical AI by moving beyond curated vignettes to everyday symptom assessment. Its integration of LLM diagnostics with extensive wearable sensor data offers profound, immediate implications for digital health, clinical practice, and epidemiology. While Paper 1 presents a valuable methodological advance in materials discovery, Paper 2's unprecedented scale, rigorous clinician-backed validation, and direct societal applications give it a broader and more transformative potential scientific impact.
Paper 1 represents a massive leap in medical AI by evaluating LLMs in real-world, uncurated settings with a massive randomized cohort (N=13,917). Its integration of conversational AI with continuous wearable data to map physiological shifts to specific conditions is highly innovative and has immense real-world implications for digital health. While Paper 2 offers significant methodological advancements in materials discovery, the unprecedented scale, rigorous clinical validation, and direct societal applicability of Paper 1 give it a higher potential for broad scientific and clinical impact.
Paper 1 demonstrates massive real-world impact by deploying a conversational medical AI to nearly 14,000 users, rigorously validating it against clinician performance, and integrating it with physiological wearable data. This interdisciplinary breakthrough bridges AI, public health, and human-computer interaction. While Paper 2 presents a significant methodological advancement in MARL, Paper 1's unprecedented scale, rigorous clinical benchmarking, and immediate societal applicability give it a higher broad scientific impact.
Paper 2 likely has higher scientific impact due to strong methodological rigor and real-world relevance: a large randomized deployment (N=13,917) inside a consumer product, blinded clinician comparisons, and extensive clinician annotation, yielding clinically meaningful findings about interview strategy and diagnostic accuracy. It also creates a valuable real-world dialogue dataset and links symptom reports to >500,000 wearable-days across ~400 conditions, enabling downstream digital health research. Paper 1 is novel and broadly applicable, but relies on benchmark improvements in a fast-moving agent-framework space with less demonstrated real-world validation.
Paper 2 likely has higher impact due to its scale (13,917 real-world users), rigorous evaluation (randomization, clinician panel annotation, blinded comparison), and broad applicability across digital health, conversational AI, and wearable physiology. It is timely given rapid deployment of LLMs in consumer health and offers both methodological insights (agentic interviewing improves DDx) and a large corpus enabling downstream research (symptom-dialogue + wearable associations across ~400 conditions). Paper 1 is novel and valuable for AV compliance, but is narrower in domain and appears less broadly generalizable beyond traffic-law settings.
Paper 1 has higher impact potential due to its large-scale real-world deployment (N=13,917) and rigorous clinician-blinded comparative evaluation showing improved diagnostic performance over clinicians on the same dialogues. It contributes a valuable naturalistic symptom-dialogue corpus, demonstrates agentic interviewing benefits, and links conversational outcomes to longitudinal wearable physiology across hundreds of conditions—broadly relevant to digital health, clinical NLP, and wearable computing. Paper 2 is timely and useful for AV compliance, but its scope is narrower (jurisdiction-specific laws, scenario dataset size) and the scientific contribution is more incremental pipeline engineering than a large empirical demonstration with broad downstream biomedical implications.
Paper 2 has higher likely impact due to a large randomized real-world deployment (N=13,917) with clinician-blinded comparisons and substantial annotation, yielding actionable evidence about conversational diagnostic agents and interview strategies. It also creates a valuable corpus and links symptom dialogues to wearable physiology across hundreds of conditions, enabling downstream clinical and digital-health research. Paper 1 is timely and conceptually important for LLM interpretability/reasoning research, but as a position/framework paper its immediate real-world application and empirical leverage are narrower than Paper 2’s scale and translational relevance.
Paper 1 combines high novelty with strong real-world impact: a large-scale randomized deployment (N=13,917) in a consumer health app, clinician-blinded comparative evaluation, and a uniquely valuable dataset linking conversational symptom reports to wearable physiology across ~400 conditions. The methodological rigor and immediate clinical/consumer-health applications make it likely to influence both medical AI practice and future research. Paper 2 offers an important conceptual reframing of LLM reasoning, but as a position paper its impact depends on downstream adoption and lacks comparable empirical leverage or direct application.
Paper 1 conducts a massive, real-world deployment of an AI diagnostic agent with nearly 14,000 participants, demonstrating superior diagnostic accuracy compared to human clinicians and successfully linking outcomes to wearable health data. Its direct applications in telemedicine, public health, and consumer digital health provide a significantly broader and more immediate societal impact than the niche, albeit innovative, AI tooling advancements presented in Paper 2.
Paper 1 addresses a fundamental question about the epistemological validity of AI-driven scientific research—a rapidly growing practice with far-reaching consequences. Its findings (evidence ignored in 68% of traces, scaffold contributes only 1.5% of variance) challenge core assumptions about autonomous AI science and have broad implications across all scientific fields using LLM agents. The rigorous methodology (25,000+ runs, 8 domains, dual evaluation lenses) and the actionable conclusion that reasoning must become a training target will likely influence AI development, scientific policy, and epistemology. Paper 2, while practically valuable, addresses a narrower clinical application domain.
Paper 2 presents a large-scale, real-world deployment of conversational AI in healthcare with nearly 14,000 users. Its empirical validation against clinical professionals and integration with wearable data demonstrate significant immediate real-world applicability, methodological rigor, and broad impact in medical AI. While Paper 1 introduces a novel theoretical framework for AI safety, Paper 2 provides concrete, evidence-based advancements in a high-stakes applied domain, likely leading to broader and more immediate scientific impact.
Paper 1 is more novel and field-shaping: it demonstrates end-to-end autonomous discovery in a real physical system and experimentally validates a previously unreported mechanism, which is a rare, high-impact milestone. Its implications span AI agents, experimental physics, and potentially new optical computing hardware, giving broad cross-field impact and timeliness amid rapid interest in autonomous science and post-Moore hardware. Paper 2 is rigorous and highly applicable with a large real-world deployment, but it is more incremental (improving conversational symptom assessment) and constrained by self-reported ground truth.