Towards a General Intelligence and Interface for Wearable Health Data

Girish Narayanswamy, Maxwell A. Xu, A. Ali Heydari, Samy Abdel-Ghaffar, Marius Guerard, Kara Vaillancourt, Zhihan Zhang, Jake Garrison

#7 of 2292 · Artificial Intelligence
Tournament Score: 1643±31
Win Rate: 79% (34 wins, 9 losses, 43 matches)
Rating: 8.2/10

Abstract

While ubiquitous wearable sensors capture a wealth of behavioral and physiological information, effectively transforming these signals into personalized health insights is challenging. Specifically, converting low-level sensor data into representations capable of characterizing higher-level states is difficult due to high phenotypic diversity and variation in individual baseline health, physiology, and lifestyle factors. Moreover, collecting wearable data paired with health outcome annotations is laborious and expensive, and retrospective annotation remains practically unfeasible, contributing to a scarcity of data with high-quality labels. To overcome these limitations, we propose a foundation model for wearable health that is pretrained on more than one trillion minutes of unlabeled sensor signals drawn from a large cohort of five million participants. We demonstrate that the joint scaling of model capacity and pretraining data volume leads to systematic improvements in performance, as evaluated on a diverse set of 35 health prediction tasks, spanning cardiovascular, metabolic, sleep, and mental health, as well as lifestyle choices and demographic factors. We find that this population scale representation unlocks label-efficient few-shot learning and generative capabilities for robust daily metric estimation. To further leverage this learned representation, we deploy a classroom of LLM agents to autonomously search the space of downstream predictive heads built on the model embeddings, showing broad performance improvements that increase with LLM model capacity. Finally, we show how integrating these downstream predictors into a Personal Health Agent can support model responses that are more relevant, contextually aware, and safe, and we validate this via 1,860 ratings from a cohort of clinicians.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Towards a General Intelligence and Interface for Wearable Health Data"

1. Core Contribution

This paper introduces SensorFM, a foundation model for wearable health data pretrained on over one trillion minutes of multimodal sensor data from 5 million participants. The core contributions are threefold: (1) establishing scaling laws for wearable sensor foundation models showing that joint scaling of model capacity and data volume yields predictable downstream improvements; (2) demonstrating that the learned representations generalize across 35 diverse health prediction tasks spanning cardiovascular, metabolic, mental health, sleep, lifestyle, and demographic domains; and (3) showing that integrating SensorFM predictions into a Personal Health Agent yields clinician-rated improvements in response quality comparable to using ground-truth clinical measurements.

The paper also introduces an LLM-driven "classroom" framework for automated downstream head design, conducting over 30,000 experiments to adapt embeddings to specific tasks, and demonstrates generative capabilities for missing data imputation that improve daily health metric estimation.

2. Methodological Rigor

Strengths in rigor:

  • The pretraining corpus (1 trillion minutes, 5M participants) represents an unprecedented scale for wearable data, with a 50x increase over prior work. The four-orders-of-magnitude sweep across both data volume and model capacity provides convincing evidence for scaling laws.
  • Downstream evaluation uses data from three independent, IRB-approved prospective studies (N=13,985) with clinically meaningful endpoints including lab-verified biomarkers (HbA1c, HOMA-IR, triglycerides), standardized clinical screeners (PHQ-8, GAD-7, PSS), and self-reported diagnoses.
  • Five-fold cross-validation with person-independent splits and appropriate metric aggregation (logit-transformed AUC, z-transformed correlations) demonstrates statistical care.
  • The clinician evaluation (1,860 ratings, 4 board-certified physicians, blinded conditions, Wilcoxon signed-rank tests with Bonferroni correction) follows sound evaluation methodology.
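The cross-validation protocol noted above can be illustrated with a minimal Python sketch. It assumes a simple round-robin assignment of participants to folds and plain averaging in the transformed space; the function names and the fold values are hypothetical and are not taken from the paper.

```python
import math

def person_independent_folds(participant_ids, n_folds=5):
    """Assign rows to folds so that all rows of one participant share a fold."""
    unique_ids = sorted(set(participant_ids))
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique_ids)}
    return [fold_of[pid] for pid in participant_ids]

def mean_auc_logit(aucs, eps=1e-6):
    """Average per-fold AUCs in logit space, then map back to (0, 1)."""
    clipped = [min(max(a, eps), 1.0 - eps) for a in aucs]
    mean_logit = sum(math.log(a / (1.0 - a)) for a in clipped) / len(clipped)
    return 1.0 / (1.0 + math.exp(-mean_logit))

def mean_corr_fisher_z(corrs, eps=1e-6):
    """Average per-fold correlations with the Fisher z-transform (atanh)."""
    clipped = [min(max(r, -1.0 + eps), 1.0 - eps) for r in corrs]
    mean_z = sum(math.atanh(r) for r in clipped) / len(clipped)
    return math.tanh(mean_z)

# Made-up per-fold metrics, for illustration only.
fold_aucs = [0.70, 0.72, 0.69, 0.73, 0.71]
fold_corrs = [0.40, 0.44, 0.42, 0.46, 0.41]
pooled_auc = mean_auc_logit(fold_aucs)
pooled_corr = mean_corr_fisher_z(fold_corrs)
```

Averaging in the transformed space keeps the pooled value inside the valid range of the metric and reduces the distortion that arises when bounded statistics are averaged directly near their limits.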

Methodological concerns:

  • The frozen encoder with linear probes on PCA-50 reduced embeddings is a deliberately conservative evaluation strategy, but it leaves open whether full fine-tuning would reveal different scaling dynamics.
  • Many downstream labels are self-reported rather than clinically verified, which the authors acknowledge. The reliance on binary thresholds for continuous screening measures introduces noise.
  • The pretraining data is limited to Fitbit/Pixel Watch ecosystems, and the one-minute aggregate resolution discards potentially informative sub-second dynamics. Generalization to other device platforms remains untested.
  • The ICC values for clinician evaluations are moderate to poor on several dimensions (Justifiability ICC = -0.088), suggesting significant inter-rater disagreement that complicates interpretation.
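A negative ICC such as the one cited for Justifiability can look surprising, so the following Python sketch shows how it arises. The assessment does not say which ICC variant the paper used; this sketch assumes the one-way random-effects form, ICC(1,1), and the ratings are made up.

```python
def icc_one_way(ratings):
    """One-way random-effects ICC(1,1) for an n-items x k-raters table.

    Assumes every item is scored by the same number of raters and that
    the ratings are not all identical (otherwise the denominator is zero).
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    # Variance between items (scaled by the number of raters).
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    # Variance among raters within the same item.
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Raters agree perfectly and the items differ: ICC is 1.
agree = [[1, 1], [3, 3], [5, 5]]
# Four raters disagree within every item while the item means are equal:
# the ICC reaches its lower bound of -1 / (k - 1).
disagree = [[3, 4, 2, 5], [4, 2, 5, 3], [5, 3, 4, 2]]

icc_agree = icc_one_way(agree)
icc_disagree = icc_one_way(disagree)
```

Under this form, the estimate turns negative whenever raters differ more within an item than the items differ from each other, which is consistent with the reading that a value near zero or below indicates little usable agreement on that dimension.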

3. Potential Impact

Direct impact on digital health: This work could fundamentally shift wearable health analytics from bespoke, task-specific pipelines to a general-purpose embedding interface. The demonstration that simple linear probes on frozen embeddings outperform engineered feature baselines on 34/35 tasks is practically significant, dramatically lowering the barrier to developing new health applications.

Clinical screening and risk stratification: The ability to predict cardiovascular risk scores, metabolic markers, and mental health screening scores from passively collected sensor data could enable population-level screening. The generative infilling capability (99.7% step count accuracy with 60 minutes of missing data) addresses a real-world limitation of intermittent sensor wear.

AI agent integration: The demonstration that SensorFM predictions are statistically non-inferior to ground-truth clinical labels when used by a health agent is particularly impactful for consumer health applications. This validates a pathway from raw sensor data to personalized health coaching.

Broader ML impact: The agentic classroom framework for automated model adaptation, while not the primary contribution, introduces a practical methodology for efficiently adapting foundation model embeddings across many downstream tasks simultaneously.

4. Timeliness & Relevance

This paper arrives at a critical intersection of three trends: the maturation of foundation models, the explosive growth of consumer wearables (~500M+ devices globally), and increasing use of LLMs for health queries. The scarcity of labeled health data paired with wearable signals has been a persistent bottleneck; self-supervised pretraining at this scale directly addresses this limitation. The integration with LLM-based health agents anticipates the near-term convergence of sensing and conversational AI in consumer health products.

5. Strengths & Limitations

Key Strengths:

  • Unprecedented data scale and comprehensive evaluation breadth (35 tasks across 6 health domains)
  • Clean demonstration of scaling laws with near-linear improvement in both pretraining and downstream performance
  • Practical finding that demographic feature dependence decreases with model scale, suggesting implicit learning of physiological traits
  • End-to-end validation from pretraining through clinical evaluation of agent outputs
  • Thorough latent space analysis revealing meaningful structure (UMAP, SHAP, intrinsic dimensionality)

Notable Limitations:

  • Single device ecosystem (Fitbit/Pixel Watch) limits generalizability claims
  • One-minute temporal resolution forfeits fine-grained physiological information
  • Population demographics skew toward female, White/Caucasian wearable users
  • Static, single-turn agent evaluation doesn't capture real-world interactive dynamics
  • Absolute performance on some clinically important tasks (e.g., cardiovascular Dx AUC ~0.71, mental health correlations ~0.40-0.46) may be insufficient for standalone clinical utility
  • The paper emerges from Google, with proprietary data that cannot be reproduced externally, limiting independent verification

Overall Assessment: This is a landmark paper in wearable health AI that establishes scaling laws, demonstrates broad generalization, and validates end-to-end clinical utility. While the proprietary data and single-ecosystem evaluation limit reproducibility and generalizability claims, the scale of evidence and breadth of evaluation set a new standard for the field. The work is likely to accelerate both academic research and commercial development in digital health foundation models.

Rating: 8.2/10
Significance 8.5 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Generated May 22, 2026

Comparison History (43)

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.6 · 5/22/2026

    HealthFormer demonstrates higher scientific impact potential through its ability to simulate clinical interventions in silico, effectively creating 'clinical digital twins.' Its validation against 41 published randomized trials with strong agreement is remarkable and has direct translational implications for personalized medicine and drug development. While Paper 1 impressively scales pretraining data (1 trillion minutes, 5M participants) and introduces novel LLM agent integration, Paper 2's multimodal generative approach across 667 measurements with demonstrated intervention simulation capability addresses a more fundamental medical challenge—predicting individual treatment responses—with rigorous external validation across independent cohorts.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gemini-3.1 · 5/22/2026

    Paper 1 marks a historic milestone in AI by demonstrating an agent capable of end-to-end autonomous scientific discovery in the physical world, uncovering a novel physical mechanism. This paradigm shift—moving from AI as a tool to an independent researcher—has profound implications for accelerating discovery across all scientific disciplines. While Paper 2 offers a massive-scale foundation model with high clinical utility, Paper 1's conceptual breakthrough in autonomous empirical science represents a more fundamental leap in how research itself is conducted.

    vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics
    gemini-3.1 · 5/22/2026

    While Paper 2 offers a profound theoretical unification across multiple disciplines, Paper 1 represents an unprecedented empirical breakthrough in digital health. Pretraining on 5 million participants and 1 trillion minutes of data establishes a true foundation model for wearable health. Its direct integration with LLMs, broad applicability across 35 health prediction tasks, and clinical validation demonstrate massive potential for immediate real-world healthcare applications and label-efficient learning.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 5/22/2026

    Paper 2 addresses a fundamental and timely question about whether AI agents can truly reason scientifically, finding critical epistemic failures across 25,000+ runs. This has broader impact because it challenges the rapidly growing field of autonomous AI-driven science, providing evidence that current LLM agents lack self-correcting reasoning. Its findings affect every domain deploying AI scientists and will likely influence AI training paradigms, evaluation standards, and policy. Paper 1, while impressive in scale and clinical utility, is more narrowly focused on wearable health and represents incremental (though significant) progress in foundation models for a specific domain.

    vs. Conditional Attribute Estimation with Autoregressive Sequence Models
    gpt-5.2 · 5/22/2026

    Paper 1 has higher likely scientific impact due to its combination of population-scale data (5M participants; >1T minutes), a foundation-model paradigm for wearable health, and broad validation across 35 clinically relevant tasks plus clinician-rated safety/utility via a Personal Health Agent. The real-world application potential (health monitoring, risk prediction, personalized insights) is immediate and large, and the work is timely given the growth of wearables and foundation models. Paper 2 is methodologically novel for controllable/attribute-aware decoding, but its demonstrated impact appears narrower and less directly societally transformative.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a field-defining foundation model trained on an unprecedented scale of wearable data (1 trillion minutes, 5M participants). Its broad applicability across 35 health prediction tasks and integration with LLM agents represent a massive leap in personalized digital health. While Paper 2 provides a crucial and timely critique of AI safety mechanisms, Paper 1's sheer scale, novelty in multimodal physiological representation, and potential to catalyze widespread downstream clinical applications give it a higher overall scientific and real-world impact.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact due to its large-scale foundation model (trillion-minute pretraining, 5M participants), broad validation across 35 diverse tasks, and clear pathway to real-world deployment via a Personal Health Agent with clinician-rated evaluation. Its methodological scope (scaling laws, few-shot/label efficiency, generative metric estimation, automated head search) and applicability across many health domains suggest wide cross-field influence. Paper 1 is novel and timely in highlighting iatrogenic harm from safety behaviors, but its narrower domain, smaller scale, and primarily evaluative nature limit breadth and downstream adoption.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher impact due to unprecedented scale (5M participants, >1T minutes), strong timeliness in foundation models, and broad real-world applicability across many health domains with 35 tasks plus clinician-validated interface work. Its methodological rigor is supported by large-scale pretraining, systematic scaling results, diverse evaluations, and deployment-oriented validation. Paper 1 is highly novel for symbolic equation discovery and could be transformative for scientific modeling, but its impact may be narrower and more dependent on benchmarking breadth and adoption across disciplines.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gemini-3.1 · 5/22/2026

    While Paper 1 presents a massive, highly translational foundation model for wearable health, Paper 2 represents a paradigm shift in scientific methodology itself. Demonstrating the first end-to-end autonomous AI agent that proposes and experimentally validates a previously unreported physical mechanism on real hardware fundamentally changes how experimental science can be conducted. This breakthrough in AI-driven autonomous research offers profound cross-disciplinary implications and broader scientific impact than a domain-specific, albeit impressive, health model.

    vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics
    claude-opus-4.6 · 5/22/2026

    Paper 1 presents a deeply novel theoretical unification connecting Bayesian inference, game theory, and thermodynamics through a collective variational principle. This bridges multiple fundamental fields (physics, neuroscience, AI, economics) with formal mathematical proofs and falsifiable predictions validated across domains. While Paper 2 is impressive in scale and practical utility—building a foundation model for wearable health—it represents an incremental (though large-scale) application of existing paradigms (foundation models, self-supervised learning) to a specific domain. Paper 1's theoretical contribution has broader cross-disciplinary impact potential and conceptual novelty that could reshape foundational thinking across multiple sciences.

    vs. AI scientists produce results without reasoning scientifically
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a foundation model pretrained on an unprecedented scale of wearable data (5 million participants, 1 trillion minutes), demonstrating high impact through immediate real-world applications in predictive healthcare across 35 tasks. While Paper 2 provides a valuable critical evaluation of AI reasoning, Paper 1 represents a massive technological leap with direct, measurable benefits to public health, personalized medicine, and the broader application of AI in clinical settings.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.6 · 5/22/2026

    HealthFormer demonstrates higher scientific impact through its ability to simulate clinical interventions in silico, validated against 41 randomized trial comparisons, and its multimodal integration of 667 measurements across seven physiological domains. Its clinical digital twin framework directly addresses drug development and personalized medicine needs. While Paper 2 impressively scales wearable data pretraining (5M participants, 1T+ minutes), its scope is narrower (wearable sensors only), and its downstream applications are more incremental. HealthFormer's intervention simulation capability—recovering individual-level biomarker changes and matching published trial results—represents a more transformative advance for clinical decision-making.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    claude-opus-4.6 · 5/22/2026

    Paper 2 presents a foundation model pretrained on an unprecedented scale (1 trillion minutes, 5 million participants) for wearable health, addressing a critical gap in personalized medicine. Its breadth of impact spans 35 health prediction tasks across multiple domains, with practical clinical validation (1,860 clinician ratings). The combination of foundation model scaling laws for health sensors, few-shot learning capabilities, and integration with LLM agents for a Personal Health Agent represents a paradigm shift in digital health. While Paper 1 is highly innovative in symbolic equation discovery, Paper 2's massive scale, immediate clinical applicability, and broader societal impact give it higher potential impact.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    gpt-5.2 · 5/22/2026

    Paper 2 is more novel methodologically: it unifies diffusion generation and random structure search into a single physically grounded sampling framework, addressing a core bottleneck (energy-landscape exploration) with clear, generalizable gains (order-of-magnitude cost reduction, OOD compositions). Its rigor is supported by cross-domain evaluation (molecules and crystals) and explicit coupling to physical forces, which strengthens reliability and adoption. The potential impact spans computational chemistry, materials science, and generative modeling, enabling faster discovery pipelines. Paper 1 is large-scale and highly applicable but is more a scaling/deployment integration of established paradigms.

    vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact due to stronger methodological rigor and direct relevance to regulatory-grade real-world evidence. It trains on a massive, widely used nationwide claims substrate, evaluates on >1,000 tasks with prospective and external validation, and demonstrates impact beyond prediction (expenditure forecasting and reduced bias in target trial emulation). These advances can broadly influence epidemiology, health economics, pharmacoepidemiology, and regulatory science. Paper 1 is highly innovative and large-scale, but relies on wearables with harder-to-validate endpoints and more consumer-facing applications, potentially limiting near-term cross-field adoption.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gpt-5.2 · 5/22/2026

    Paper 2 has higher likely scientific impact due to greater methodological novelty (aligned, partially observed multimodal generative modeling across biomolecular modalities) and broader cross-field reach (genomics, transcriptomics, protein science, structural biology, and molecular design). Its framework enables both prediction and constrained design, with clear translational relevance to therapeutics and variant interpretation, and introduces a reusable curated dataset (LORE). Paper 1 is impactful for digital health, but is more application-domain specific and relies heavily on scale and engineering of wearables/LLM agents rather than a fundamentally new modeling paradigm.

    vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
    claude-opus-4.6 · 5/22/2026

    Paper 1 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), demonstrating broad applicability across 35 health tasks with few-shot learning, LLM agent integration, and clinician-validated Personal Health Agent. Its novelty in scale, cross-domain health impact, and practical clinical relevance give it higher potential impact. Paper 2 makes important theoretical contributions to DPO/RLHF understanding, but addresses a narrower technical issue within the LLM alignment community. Paper 1's breadth across healthcare, AI, and wearable computing, combined with its massive scale and real-world deployment potential, suggests greater overall scientific impact.

    vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
    claude-opus-4.6 · 5/22/2026

    Paper 2 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), addressing a fundamental challenge in digital health. Its breadth of impact spans 35 health prediction tasks across multiple domains, introduces novel LLM-agent integration for automated model search, and demonstrates clinical validation. The scale of data, generalizability across health domains, and practical applicability to personalized health monitoring give it transformative potential across healthcare, AI, and wearable technology. Paper 1, while theoretically rigorous in identifying DPO failure modes, addresses a narrower problem within the RLHF/alignment community.

    vs. Containment Verification: AI Safety Guarantees Independent of Alignment
    gemini-3.1 · 5/22/2026

    Paper 1 presents a foundation model trained on an unprecedented scale of wearable data (5 million participants) with immediate, broad applications across 35 diverse health tasks. Its integration of LLMs for few-shot learning and clinician-validated personalized health agents offers massive, tangible real-world impact in healthcare and personalized medicine, slightly edging out the important but more specialized theoretical contributions of Paper 2 in formal AI safety verification.

    vs. Implicit Safety Alignment from Crowd Preferences
    gemini-3.1 · 5/22/2026

    Paper 2 presents a foundation model trained on an unprecedented scale (5 million participants) with immediate, high-impact applications across diverse healthcare domains (35 health prediction tasks). Its combination of massive-scale pretraining, few-shot learning, and clinical validation positions it to significantly transform wearable health tech, whereas Paper 1, while algorithmically novel, addresses a narrower scope within reinforcement learning.