Towards a General Intelligence and Interface for Wearable Health Data
Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari, Samy Abdel-Ghaffar, Marius Guerard, Kara Vaillancourt, Zhihan Zhang, Jake Garrison
Abstract
While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically infeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population-scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.
AI Impact Assessments
Scientific Impact Assessment: "Towards a General Intelligence and Interface for Wearable Health Data"
1. Core Contribution
This paper introduces SensorFM, a foundation model for wearable health data pretrained on over one trillion minutes of multimodal sensor data from 5 million participants. The core contributions are threefold: (1) establishing scaling laws for wearable sensor foundation models showing that joint scaling of model capacity and data volume yields predictable downstream improvements; (2) demonstrating that the learned representations generalize across 35 diverse health prediction tasks spanning cardiovascular, metabolic, mental health, sleep, lifestyle, and demographic domains; and (3) showing that integrating SensorFM predictions into a Personal Health Agent yields clinician-rated improvements in response quality comparable to using ground-truth clinical measurements.
The paper also introduces an LLM-driven "classroom" framework for automated downstream head design, conducting over 30,000 experiments to adapt embeddings to specific tasks, and demonstrates generative capabilities for missing data imputation that improve daily health metric estimation.
2. Methodological Rigor
Strengths in rigor:
Methodological concerns:
3. Potential Impact
Direct impact on digital health: This work could fundamentally shift wearable health analytics from bespoke, task-specific pipelines to a general-purpose embedding interface. The demonstration that simple linear probes on frozen embeddings outperform engineered feature baselines on 34/35 tasks is practically significant, dramatically lowering the barrier to developing new health applications.
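To make the "linear probe on frozen embeddings" idea concrete, the following is a minimal sketch, not the authors' pipeline. It assumes the pretrained encoder is treated as fixed and only a logistic-regression head is trained on its outputs. The embeddings here are synthetic random vectors standing in for real encoder outputs, and every name (`make_toy_embeddings`, `train_linear_probe`, `accuracy`) is hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_toy_embeddings(n, dim, true_w, rng):
    """Stand-in for frozen encoder outputs (assumption: synthetic data only)."""
    xs, ys = [], []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = sum(wi * xi for wi, xi in zip(true_w, x))
        xs.append(x)
        ys.append(1 if score > 0 else 0)
    return xs, ys


def train_linear_probe(xs, ys, lr=0.5, epochs=200, l2=1e-3):
    """Fit a logistic-regression head; the embeddings themselves are never updated."""
    dim = len(xs[0])
    n = len(xs)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            err = sigmoid(z) - y  # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient step with a small L2 penalty on the weights.
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    correct = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(xs)


rng = random.Random(0)
true_w = [rng.gauss(0.0, 1.0) for _ in range(8)]
train_x, train_y = make_toy_embeddings(200, 8, true_w, rng)
test_x, test_y = make_toy_embeddings(100, 8, true_w, rng)
w, b = train_linear_probe(train_x, train_y)
print(f"held-out accuracy: {accuracy(w, b, test_x, test_y):.2f}")
```

The point of the design is that only `dim + 1` parameters are trained per task, so adding a new health prediction task would not require touching the encoder.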
Clinical screening and risk stratification: The ability to predict cardiovascular risk scores, metabolic markers, and mental health screening scores from passively collected sensor data could enable population-level screening. The generative infilling capability (99.7% step count accuracy with 60 minutes of missing data) addresses a real-world limitation of intermittent sensor wear.
AI agent integration: The demonstration that SensorFM predictions are statistically non-inferior to ground-truth clinical labels when used by a health agent is particularly impactful for consumer health applications. This validates a pathway from raw sensor data to personalized health coaching.
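As an illustration of what a non-inferiority claim of this kind involves, here is a small sketch under stated assumptions. The assessment does not say which statistical procedure the paper used. The code below assumes a simple one-sided normal-approximation test on the difference in mean ratings, with a pre-specified margin. The ratings are made-up toy values, and `non_inferior` is a hypothetical helper.

```python
import math
from statistics import NormalDist, mean, stdev


def non_inferior(new_scores, ref_scores, margin, alpha=0.05):
    """Return (decision, lower_bound) for a one-sided non-inferiority check.

    The new condition is declared non-inferior when the lower confidence
    bound on mean(new) - mean(ref) stays above -margin.
    """
    diff = mean(new_scores) - mean(ref_scores)
    se = math.sqrt(
        stdev(new_scores) ** 2 / len(new_scores)
        + stdev(ref_scores) ** 2 / len(ref_scores)
    )
    z = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value
    lower = diff - z * se
    return lower > -margin, lower


# Toy clinician ratings on a 1-5 scale (illustrative only, not from the paper).
with_model_predictions = [4, 5, 4, 4, 5, 4, 5, 4, 4, 5]
with_ground_truth = [4, 5, 5, 4, 5, 4, 5, 4, 4, 5]

ok, lower = non_inferior(with_model_predictions, with_ground_truth, margin=1.0)
print(f"non-inferior at margin 1.0: {ok} (lower bound {lower:.2f})")
```

The conclusion depends on the chosen margin: with the same toy data, a much tighter margin such as 0.2 would not support non-inferiority, which is why the margin should be fixed before looking at the ratings.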
Broader ML impact: The agentic classroom framework for automated model adaptation, while not the primary contribution, introduces a practical methodology for efficiently adapting foundation model embeddings across many downstream tasks simultaneously.
4. Timeliness & Relevance
This paper arrives at a critical intersection of three trends: the maturation of foundation models, the explosive growth of consumer wearables (~500M+ devices globally), and increasing use of LLMs for health queries. The scarcity of labeled health data paired with wearable signals has been a persistent bottleneck; self-supervised pretraining at this scale directly addresses this limitation. The integration with LLM-based health agents anticipates the near-term convergence of sensing and conversational AI in consumer health products.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment: This is a landmark paper in wearable health AI that establishes scaling laws, demonstrates broad generalization, and validates end-to-end clinical utility. While the proprietary data and single-ecosystem evaluation limit reproducibility and generalizability claims, the scale of evidence and breadth of evaluation set a new standard for the field. The work is likely to accelerate both academic research and commercial development in digital health foundation models.
Generated May 22, 2026
Comparison History (43)
HealthFormer demonstrates higher scientific impact potential through its ability to simulate clinical interventions in silico, effectively creating 'clinical digital twins.' Its validation against 41 published randomized trials with strong agreement is remarkable and has direct translational implications for personalized medicine and drug development. While Paper 1 impressively scales pretraining data (1 trillion minutes, 5M participants) and introduces novel LLM agent integration, Paper 2's multimodal generative approach across 667 measurements with demonstrated intervention simulation capability addresses a more fundamental medical challenge—predicting individual treatment responses—with rigorous external validation across independent cohorts.
Paper 1 marks a historic milestone in AI by demonstrating an agent capable of end-to-end autonomous scientific discovery in the physical world, uncovering a novel physical mechanism. This paradigm shift—moving from AI as a tool to an independent researcher—has profound implications for accelerating discovery across all scientific disciplines. While Paper 2 offers a massive-scale foundation model with high clinical utility, Paper 1's conceptual breakthrough in autonomous empirical science represents a more fundamental leap in how research itself is conducted.
While Paper 2 offers a profound theoretical unification across multiple disciplines, Paper 1 represents an unprecedented empirical breakthrough in digital health. Pretraining on 5 million participants and 1 trillion minutes of data establishes a true foundation model for wearable health. Its direct integration with LLMs, broad applicability across 35 health prediction tasks, and clinical validation demonstrate massive potential for immediate real-world healthcare applications and label-efficient learning.
Paper 2 addresses a fundamental and timely question about whether AI agents can truly reason scientifically, finding critical epistemic failures across 25,000+ runs. This has broader impact because it challenges the rapidly growing field of autonomous AI-driven science, providing evidence that current LLM agents lack self-correcting reasoning. Its findings affect every domain deploying AI scientists and will likely influence AI training paradigms, evaluation standards, and policy. Paper 1, while impressive in scale and clinical utility, is more narrowly focused on wearable health and represents incremental (though significant) progress in foundation models for a specific domain.
Paper 1 has higher likely scientific impact due to its combination of population-scale data (5M participants; >1T minutes), a foundation-model paradigm for wearable health, and broad validation across 35 clinically relevant tasks plus clinician-rated safety/utility via a Personal Health Agent. The real-world application potential (health monitoring, risk prediction, personalized insights) is immediate and large, and the work is timely given the growth of wearables and foundation models. Paper 2 is methodologically novel for controllable/attribute-aware decoding, but its demonstrated impact appears narrower and less directly societally transformative.
Paper 1 introduces a field-defining foundation model trained on an unprecedented scale of wearable data (1 trillion minutes, 5M participants). Its broad applicability across 35 health prediction tasks and integration with LLM agents represent a massive leap in personalized digital health. While Paper 2 provides a crucial and timely critique of AI safety mechanisms, Paper 1's sheer scale, novelty in multimodal physiological representation, and potential to catalyze widespread downstream clinical applications give it a higher overall scientific and real-world impact.
Paper 2 likely has higher scientific impact due to its large-scale foundation model (trillion-minute pretraining, 5M participants), broad validation across 35 diverse tasks, and clear pathway to real-world deployment via a Personal Health Agent with clinician-rated evaluation. Its methodological scope (scaling laws, few-shot/label efficiency, generative metric estimation, automated head search) and applicability across many health domains suggest wide cross-field influence. Paper 1 is novel and timely in highlighting iatrogenic harm from safety behaviors, but its narrower domain, smaller scale, and primarily evaluative nature limit breadth and downstream adoption.
Paper 2 likely has higher impact due to unprecedented scale (5M participants, >1T minutes), strong timeliness in foundation models, and broad real-world applicability across many health domains with 35 tasks plus clinician-validated interface work. Its methodological rigor is supported by large-scale pretraining, systematic scaling results, diverse evaluations, and deployment-oriented validation. Paper 1 is highly novel for symbolic equation discovery and could be transformative for scientific modeling, but its impact may be narrower and more dependent on benchmarking breadth and adoption across disciplines.
While Paper 1 presents a massive, highly translational foundation model for wearable health, Paper 2 represents a paradigm shift in scientific methodology itself. Demonstrating the first end-to-end autonomous AI agent that proposes and experimentally validates a previously unreported physical mechanism on real hardware fundamentally changes how experimental science can be conducted. This breakthrough in AI-driven autonomous research offers profound cross-disciplinary implications and broader scientific impact than a domain-specific, albeit impressive, health model.
Paper 1 presents a deeply novel theoretical unification connecting Bayesian inference, game theory, and thermodynamics through a collective variational principle. This bridges multiple fundamental fields (physics, neuroscience, AI, economics) with formal mathematical proofs and falsifiable predictions validated across domains. While Paper 2 is impressive in scale and practical utility—building a foundation model for wearable health—it represents an incremental (though large-scale) application of existing paradigms (foundation models, self-supervised learning) to a specific domain. Paper 1's theoretical contribution has broader cross-disciplinary impact potential and conceptual novelty that could reshape foundational thinking across multiple sciences.
Paper 1 introduces a foundation model pretrained on an unprecedented scale of wearable data (5 million participants, 1 trillion minutes), demonstrating high impact through immediate real-world applications in predictive healthcare across 35 tasks. While Paper 2 provides a valuable critical evaluation of AI reasoning, Paper 1 represents a massive technological leap with direct, measurable benefits to public health, personalized medicine, and the broader application of AI in clinical settings.
HealthFormer demonstrates higher scientific impact through its ability to simulate clinical interventions in silico, validated against 41 randomized trial comparisons, and its multimodal integration of 667 measurements across seven physiological domains. Its clinical digital twin framework directly addresses drug development and personalized medicine needs. While Paper 2 impressively scales wearable data pretraining (5M participants, 1T+ minutes), its scope is narrower (wearable sensors only), and its downstream applications are more incremental. HealthFormer's intervention simulation capability—recovering individual-level biomarker changes and matching published trial results—represents a more transformative advance for clinical decision-making.
Paper 2 presents a foundation model pretrained on an unprecedented scale (1 trillion minutes, 5 million participants) for wearable health, addressing a critical gap in personalized medicine. Its breadth of impact spans 35 health prediction tasks across multiple domains, with practical clinical validation (1,860 clinician ratings). The combination of foundation model scaling laws for health sensors, few-shot learning capabilities, and integration with LLM agents for a Personal Health Agent represents a paradigm shift in digital health. While Paper 1 is highly innovative in symbolic equation discovery, Paper 2's massive scale, immediate clinical applicability, and broader societal impact give it higher potential impact.
Paper 2 is more novel methodologically: it unifies diffusion generation and random structure search into a single physically grounded sampling framework, addressing a core bottleneck (energy-landscape exploration) with clear, generalizable gains (order-of-magnitude cost reduction, OOD compositions). Its rigor is supported by cross-domain evaluation (molecules and crystals) and explicit coupling to physical forces, which strengthens reliability and adoption. The potential impact spans computational chemistry, materials science, and generative modeling, enabling faster discovery pipelines. Paper 1 is large-scale and highly applicable but is more a scaling/deployment integration of established paradigms.
Paper 2 likely has higher scientific impact due to stronger methodological rigor and direct relevance to regulatory-grade real-world evidence. It trains on a massive, widely used nationwide claims substrate, evaluates on >1,000 tasks with prospective and external validation, and demonstrates impact beyond prediction (expenditure forecasting and reduced bias in target trial emulation). These advances can broadly influence epidemiology, health economics, pharmacoepidemiology, and regulatory science. Paper 1 is highly innovative and large-scale, but relies on wearables with harder-to-validate endpoints and more consumer-facing applications, potentially limiting near-term cross-field adoption.
Paper 2 has higher likely scientific impact due to greater methodological novelty (aligned, partially observed multimodal generative modeling across biomolecular modalities) and broader cross-field reach (genomics, transcriptomics, protein science, structural biology, and molecular design). Its framework enables both prediction and constrained design, with clear translational relevance to therapeutics and variant interpretation, and introduces a reusable curated dataset (LORE). Paper 1 is impactful for digital health, but is more application-domain specific and relies heavily on scale and engineering of wearables/LLM agents rather than a fundamentally new modeling paradigm.
Paper 1 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), demonstrating broad applicability across 35 health tasks with few-shot learning, LLM agent integration, and a clinician-validated Personal Health Agent. Its novelty in scale, cross-domain health impact, and practical clinical relevance give it higher potential impact. Paper 2 makes important theoretical contributions to DPO/RLHF understanding, but addresses a narrower technical issue within the LLM alignment community. Paper 1's breadth across healthcare, AI, and wearable computing, combined with its massive scale and real-world deployment potential, suggests greater overall scientific impact.
Paper 2 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), addressing a fundamental challenge in digital health. Its breadth of impact spans 35 health prediction tasks across multiple domains, introduces novel LLM-agent integration for automated model search, and demonstrates clinical validation. The scale of data, generalizability across health domains, and practical applicability to personalized health monitoring give it transformative potential across healthcare, AI, and wearable technology. Paper 1, while theoretically rigorous in identifying DPO failure modes, addresses a narrower problem within the RLHF/alignment community.
Paper 1 presents a foundation model trained on an unprecedented scale of wearable data (5 million participants) with immediate, broad applications across 35 diverse health tasks. Its integration of LLMs for few-shot learning and clinician-validated personalized health agents offers massive, tangible real-world impact in healthcare and personalized medicine, slightly edging out the important but more specialized theoretical contributions of Paper 2 in formal AI safety verification.
Paper 2 presents a foundation model trained on an unprecedented scale (5 million participants) with immediate, high-impact applications across diverse healthcare domains (35 health prediction tasks). Its combination of massive-scale pretraining, few-shot learning, and clinical validation positions it to significantly transform wearable health tech, whereas Paper 1, while algorithmically novel, addresses a narrower scope within reinforcement learning.