Jonathan F. Carter, Lionel Tarassenko
Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.
This paper introduces Hypnos, a multi-modal foundation model for physiological signals that adopts next-token prediction (NTP), the paradigm underlying modern LLMs, as its self-supervised objective. The key insight is that physiological signals, despite being continuous and stochastic, can be effectively compressed via residual vector quantization (RVQ) into discrete token streams, then modeled auto-regressively using an RQ-Transformer. This unifies generative modeling and representation learning into a single framework across eight sensing modalities (including EEG, ECG, EOG, EMG and respiratory signals) from over 20,000 overnight polysomnography recordings.
The paper makes a well-motivated argument against the two dominant SSL paradigms for physiological signals: (1) masked reconstruction struggles with stochastic signals where exact waveform reconstruction is ill-defined, and (2) contrastive learning requires positive-pair definitions that embed assumptions about semantic invariances — assumptions that are poorly understood for physiological data. The observation that cross-modal contrastive objectives may discard modality-unique information (citing Daunhawer et al.) is particularly compelling, given that different physiological modalities are recorded precisely because they provide complementary views.
The methodology is thorough and well-executed:
Architecture design: The two-stage approach (tokenization via RVQ, then auto-regressive modeling via RQ-Transformer) is well-justified. The decoupling of temporal and residual-depth modeling in the RQ-Transformer is computationally efficient, scaling attention with T rather than K·T. The causal tokenizer design enabling streaming inference is a practical but important detail.
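To make the tokenization stage concrete, here is a minimal sketch of residual vector quantization in plain Python. It is not the paper's tokenizer: the codebooks are random rather than learned, the zero vector appended to each codebook is a toy device so that adding a level can never increase the error, and all names (`rvq_encode`, `rvq_decode`, `make_toy_codebooks`) are hypothetical.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def rvq_encode(vec, codebooks):
    """Return one token per level; each level quantizes the residual left by the previous one."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens

def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def make_toy_codebooks(depth, size, dim, rng):
    """Random codebooks whose scale halves per level (assumption: stand-in for learned codebooks)."""
    codebooks = []
    for level in range(depth):
        scale = 0.5 ** level
        codes = [[rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(size - 1)]
        codes.append([0.0] * dim)  # toy device: a zero code means a level never hurts
        codebooks.append(codes)
    return codebooks

def sq_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
codebooks = make_toy_codebooks(depth=4, size=16, dim=8, rng=rng)
frame = [rng.uniform(-1.0, 1.0) for _ in range(8)]  # one latent frame of a signal
tokens = rvq_encode(frame, codebooks)

# Reconstruction error when decoding with only the first k levels, k = 1..4.
errors = [sq_error(frame, rvq_decode(tokens[:k], codebooks[:k])) for k in range(1, 5)]
print("tokens:", tokens)
print("error by depth:", errors)
```

Each time step therefore yields K tokens per modality (K = 4 in this toy). Flattening them into one sequence would give K·T tokens, so quadratic self-attention would compare roughly K² times as many token pairs as attention over T time steps. This is the saving the review attributes to separating temporal from residual-depth modeling, ignoring the cost of the smaller depth-wise model.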
Modality masking: The Chinese Restaurant Process-based group masking strategy is an elegant solution for handling missing modalities at test time, superior to naive random dropout. The ablation (Tables 6 and 10) validates this design choice convincingly, showing it maintains full-modality performance while significantly improving restricted-modality robustness.
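The review does not spell out how the group masking is sampled, so the following is only a sketch of one plausible reading: partition the modalities with a Chinese Restaurant Process, then keep or drop whole groups together. The concentration `alpha`, the per-group `keep_prob`, the rule that at least one group always survives, and the function names are all assumptions for illustration. Only the five modalities named in the summary are listed.

```python
import random

MODALITIES = ["EEG", "ECG", "EOG", "EMG", "respiratory"]

def crp_partition(items, alpha, rng):
    """Sample a random partition of items via the Chinese Restaurant Process."""
    groups = []
    for item in items:
        # An existing group is chosen in proportion to its size,
        # a new group in proportion to alpha (weights are normalized by rng.choices).
        weights = [len(g) for g in groups] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(groups):
            groups.append([item])
        else:
            groups[choice].append(item)
    return groups

def sample_visible_modalities(modalities, alpha, keep_prob, rng):
    """Drop whole CRP groups at once; always keep at least one group (assumed rule)."""
    groups = crp_partition(modalities, alpha, rng)
    kept = [g for g in groups if rng.random() < keep_prob]
    if not kept:
        kept = [rng.choice(groups)]
    return [m for g in kept for m in g]

rng = random.Random(0)
for _ in range(3):
    print(sample_visible_modalities(MODALITIES, alpha=1.0, keep_prob=0.5, rng=rng))
```

Compared with dropping each modality independently, group-wise dropping produces correlated patterns, such as several channels disappearing together, which may be closer to how sensors go missing in practice. A small `alpha` favors a few large groups and a large `alpha` favors many singletons.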
Evaluation: The evaluation is comprehensive and methodologically careful. The authors standardize comparison using linear and MLP probes (rather than allowing fine-tuning or complex downstream heads), which isolates representation quality. Statistical significance testing with Wilcoxon signed-rank tests and FDR correction is appropriate. The inclusion of both in-domain and held-out datasets, with external ECG benchmarks, strengthens the generalization claims. The few-shot analysis across eight cohorts (Tables 11-12) is exceptionally thorough.
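The FDR step mentioned above can be illustrated as follows. The review does not say which FDR procedure the authors used, so this sketch assumes Benjamini-Hochberg, the most common choice. The p-values are made up for illustration and are not taken from the paper. In practice they would come from paired tests such as `scipy.stats.wilcoxon` applied to per-recording scores of two models.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative p-values, one per benchmark comparison (not from the paper).
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
significant = [p <= 0.05 for p in adj_p]
print(adj_p)
print(significant)
```

In this toy case, 0.04 and 0.03 pass an uncorrected 0.05 threshold but fail after adjustment. Only the first comparison remains significant, which is why correction matters when many benchmarks are compared at once.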
Ablations: The paper includes well-designed ablations on residual depth (Kin vs Kout), tokenization length, modality masking strategies, model scaling, and context length. The finding that increasing output residual depth hurts representation quality while increasing input depth helps is a valuable practical insight.
Limitations: The compute usage (~8000 H100 GPU-hours) is reasonable. The scaling analysis beyond Base size is limited to unimodal variants due to compute constraints, which is a minor gap.
Clinical applications: The 100× label efficiency demonstrated in few-shot learning (matching supervised baselines on held-out MrOS with 1% data) has significant practical implications for sleep medicine, where expert annotation is expensive and scarce. The streaming inference capability at 1 Hz opens doors for real-time applications like closed-loop neuromodulation and remote patient monitoring.
Cross-domain transfer: The demonstration that a sleep-trained model surpasses a dedicated ECG foundation model (xECG) on daytime AF detection benchmarks is striking. This suggests the representations capture fundamental physiological dynamics beyond sleep-specific patterns, with potential applications across cardiology, neurology, and broader healthcare.
Methodological influence: By demonstrating that NTP works well for physiological signals, this paper may redirect the field away from contrastive and masked-reconstruction approaches. The framework is modular and could be extended to additional modalities (accelerometry, SpO2, blood pressure) or domains (ICU monitoring, ambulatory EEG).
Generative capabilities: While secondary to representation learning, the generative capability (Figure 7) enables synthetic data generation, signal imputation, and interpretability through generation, which could be valuable for data augmentation and clinical interpretation.
This work is highly timely. Foundation models for physiological signals are an active area with recent contributions (SleepFM in Nature Medicine 2026, OSF at ICML 2026). The paper addresses genuine bottlenecks: the difficulty of defining appropriate augmentations/positive pairs for physiological signals, the need for flexible multi-modal models that handle missing sensors, and the demand for streaming-capable inference. The success of NTP in language and audio makes its application to physiological signals a natural and overdue investigation.
This is a strong, well-executed paper that makes a compelling case for next-token prediction as a unified self-supervised objective for multi-modal physiological signals. The consistent performance improvements, label efficiency, and practical design make it likely to influence both the research community and clinical applications. The work is technically sound, thoroughly evaluated, and addresses a genuine need in the field.
Generated Jun 9, 2026
Paper 2 has higher likely impact due to immediate, broad real-world applicability and timeliness: a scalable foundation-model recipe (next-token prediction) validated on large multi-modal sleep datasets with strong downstream gains and cross-domain generalisation (e.g., atrial fibrillation). This makes it readily adoptable across healthcare ML and sensing. Paper 1 is more novel theoretically and could be influential in causal inference/AI safety, but its impact is narrower and hinges on community uptake of a new formalism and tooling; applications are less direct and likely longer-term.
Paper 2 (Hypnos) has broader scientific impact for several reasons: (1) It introduces a novel finding that next-token prediction—typically associated with language models—is effective for multi-modal physiological signal representation learning, challenging conventional masked-reconstruction and contrastive approaches. (2) It demonstrates practical clinical applications across sleep medicine, cardiology, and neurology with 100x less labeled data. (3) The multi-modal foundation model trained on 20,000+ recordings establishes a scalable paradigm for healthcare AI. (4) Cross-domain generalization (sleep to daytime ECG/atrial fibrillation) suggests broad transferability. Paper 1, while addressing an important LLM safety problem, is more narrowly scoped to AI interpretability/security.
While Paper 1 offers an innovative and highly impactful clinical application, Paper 2 addresses a critical, universal bottleneck in modern artificial intelligence: the evaluation of AI agent reliability and safety. By introducing a comprehensive framework of 12 metrics, Paper 2 provides essential infrastructure for the entire rapidly growing field of agentic AI. Its breadth of impact spans virtually all domains deploying AI, making it exceptionally timely and relevant for the broader scientific community compared to a domain-specific medical foundation model.
Paper 2 has higher potential impact because it proposes a novel, governance-enabling cryptographic verification primitive for frontier AI training, addressing a timely and widely recognized enforcement gap in AI regulation. If feasible, it would have broad cross-field implications (AI systems, cryptography/zkVMs, distributed systems, security, policy) and major real-world application in compliance and international agreements. Paper 1 is methodologically solid and practically useful for biomedical ML, but its innovation (next-token prediction for physiology) is more incremental relative to existing autoregressive foundation-model paradigms and its impact is narrower to health sensing.
Paper 2 presents a multi-modal foundation model for physiological signals with broad, high-stakes healthcare applications. Its demonstration that next-token prediction effectively models stochastic biological data, combined with strong cross-domain generalization to tasks like atrial fibrillation detection, offers massive potential impact across medicine and bio-machine learning. While Paper 1 provides a highly practical tool for computational scientists, Paper 2's methodological innovation and potential to transform clinical diagnostics give it a wider and more profound scientific impact.
Hypnos introduces a novel and well-validated foundation model for multi-modal physiological signals using next-token prediction, demonstrating strong results across diverse clinical tasks with 100x less labeled data and cross-domain generalization (sleep to cardiology). It addresses a clear gap in healthcare AI with immediate clinical applications. Paper 2, while useful, is primarily a benchmark contribution that reveals limitations of current continual learning approaches without proposing solutions. Paper 1's methodological innovation, clinical applicability, and demonstration that autoregressive pretraining works for physiological signals represent a more impactful scientific contribution.
Paper 1 reveals a fundamental paradox in current LLM alignment paradigms, challenging core assumptions in AI safety. Its theoretical formalization and empirical validation across major frontier models suggest an urgent need for structural paradigm shifts in AI alignment, offering broader and more critical immediate scientific impact than the applied medical modeling approach in Paper 2.
Paper 1 offers immediate, high-value real-world applications in healthcare by successfully adapting next-token prediction to continuous multi-modal physiological data. This scalable approach achieves strong generalization across diverse medical tasks with significantly less labeled data, promising broad and immediate clinical impact.
Paper 2 introduces Hypnos, a multi-modal sleep foundation model using next-token prediction for physiological signals—a novel application of autoregressive objectives to this domain. It demonstrates strong generalization across modalities and tasks (sleep staging with 100x less labeled data, atrial fibrillation detection), with broad healthcare applications spanning sleep medicine, cardiology, and neurology. The methodological insight that next-token prediction outperforms masked reconstruction and contrastive learning for stochastic physiological signals is broadly impactful. Paper 1 advances LLM agent self-evolution but operates in a more incremental, narrower niche within the already crowded LLM agent space.
Hypnos introduces a novel foundation model paradigm (next-token prediction) for multi-modal physiological signals, demonstrating strong results across sleep medicine, cardiology, and neurology with significantly less labeled data. It offers broad healthcare applications, methodological innovation (RQ-Transformer with residual vector quantization across 8 modalities), and cross-domain generalization. Paper 2, while timely for AI safety, is primarily a measurement/benchmarking study with trend extrapolation rather than a methodological contribution, and its impact is narrower, focused on AI safety monitoring policy rather than advancing scientific methodology.