Next-Token Prediction Learns Generalisable Representations of Sleep Physiology

Jonathan F. Carter, Lionel Tarassenko

Jun 8, 2026 · arXiv:2606.09605v1

cs.AI

#83 of 3489 · Artificial Intelligence

Tournament Score: 1553 ± 46 (scale: 1050 to 1800)

Win Rate: 80%

Rating: 8.2 / 10

Significance 8.5 · Rigor 8.5 · Novelty 7.5 · Clarity 9

Abstract

Foundation models offer a promising route to compress multi-modal physiological signals into compact representations of human health, with broad applications across sleep medicine, cardiology, neurology and other healthcare domains. Existing models have typically been trained with masked-reconstruction or contrastive objectives. However, masked reconstruction may be poorly suited to the stochastic nature of these signals, while contrastive approaches rely on positive-pair definitions despite the semantic invariances of physiological signals being poorly understood. In this work, we show that next-token prediction is a simple and scalable alternative. We develop Hypnos, a multi-modal sleep foundation model trained using eight different sensing modalities (e.g. EEG, ECG, respiratory signals) drawn from over 20,000 overnight polysomnography recordings. We tokenize each modality into streams of discrete tokens using residual vector quantization, then train a large auto-regressive RQ-Transformer to jointly predict the next token across all modalities in parallel. After training, Hypnos can be applied to continuous streams of sensor data from any subset of supported modalities, generating embeddings for downstream tasks. Across a range of benchmarks, Hypnos significantly outperforms existing foundation models. In sleep stage classification, we match the performance of strong supervised baselines on held-out test sets whilst using $100\times$ less labelled data. Hypnos even generalises to daytime physiology, surpassing a dedicated ECG foundation model at detecting atrial fibrillation. Our results demonstrate that next-token prediction is a strong self-supervised objective for representation learning from multi-modal physiological signals.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Next-Token Prediction Learns Generalisable Representations of Sleep Physiology"

1. Core Contribution

This paper introduces Hypnos, a multi-modal foundation model for physiological signals that adopts next-token prediction (NTP), the paradigm underlying modern LLMs, as its self-supervised objective. The key insight is that physiological signals, despite being continuous and stochastic, can be effectively compressed via residual vector quantization (RVQ) into discrete token streams, then modeled auto-regressively using an RQ-Transformer. This unifies generative modeling and representation learning in a single framework across eight sensing modalities (including EEG, ECG, EOG, EMG and respiratory signals) drawn from over 20,000 overnight polysomnography recordings.

The paper makes a well-motivated argument against the two dominant SSL paradigms for physiological signals: (1) masked reconstruction struggles with stochastic signals where exact waveform reconstruction is ill-defined, and (2) contrastive learning requires positive-pair definitions that embed assumptions about semantic invariances — assumptions that are poorly understood for physiological data. The observation that cross-modal contrastive objectives may discard modality-unique information (citing Daunhawer et al.) is particularly compelling, given that different physiological modalities are recorded precisely because they provide complementary views.

2. Methodological Rigor

The methodology is thorough and well-executed:

Architecture design: The two-stage approach (tokenization via RVQ, then auto-regressive modeling via RQ-Transformer) is well-justified. The decoupling of temporal and residual-depth modeling in the RQ-Transformer is computationally efficient: temporal attention runs over T positions rather than K·T, where T is the number of time steps and K the residual depth. The causal tokenizer design, which enables streaming inference, is a practical and important detail.
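To make the tokenization step concrete, the following is a minimal sketch of residual vector quantization in plain Python. It is not the Hypnos tokenizer: the codebooks are random toy values rather than learned ones, the 3-dimensional input frame is invented for illustration, and all function names (`rvq_encode`, `rvq_decode`, `make_toy_codebooks`) are hypothetical. It only shows the general mechanism in which each stage quantizes the residual left by the previous stages, so that one time step yields K code indices.

```python
import random

def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Return one code index per stage; each stage quantizes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - e for r, e in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected entries; passing only the first k codes gives a coarser result."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + e for o, e in zip(out, codebook[idx])]
    return out

def make_toy_codebooks(depth, size, dim, rng):
    """Assumption: random (not learned) codebooks whose scale halves at each stage.
    Entry 0 is the zero vector, so adding a stage can never increase the error."""
    books = []
    for k in range(depth):
        scale = 0.5 ** k
        book = [[0.0] * dim]
        book += [[rng.uniform(-scale, scale) for _ in range(dim)]
                 for _ in range(size - 1)]
        books.append(book)
    return books

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

rng = random.Random(0)
codebooks = make_toy_codebooks(depth=4, size=16, dim=3, rng=rng)
frame = [0.3, -0.7, 0.5]              # hypothetical latent vector for one time step
codes = rvq_encode(frame, codebooks)  # K = 4 discrete tokens for this time step
for k in range(1, len(codes) + 1):
    err = squared_error(frame, rvq_decode(codes[:k], codebooks))
    print(k, round(err, 4))           # error is non-increasing as depth grows
```

With K codes per time step, flattening all codes into a single sequence would give K·T positions for T time steps. Handling depth separately from time is what keeps the temporal sequence at length T.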

Modality masking: The Chinese Restaurant Process-based group masking strategy is an elegant solution for handling missing modalities at test time, superior to naive random dropout. The ablation (Table 6/10) validates this design choice convincingly, showing it maintains full-modality performance while significantly improving restricted-modality robustness.
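As a rough illustration of how a Chinese Restaurant Process can drive group-wise masking, the sketch below partitions a list of modalities into random groups and then keeps a random non-empty subset of those groups. The review does not describe the exact procedure used in the paper, so the keep/mask rule, the concentration parameter `alpha`, the function names and the five-item modality list are all assumptions made for illustration.

```python
import random

def crp_partition(items, alpha, rng):
    """Seat items one at a time: join an existing group with probability
    proportional to its size, or open a new group with probability
    proportional to alpha (alpha must be positive)."""
    groups = []
    for item in items:
        weights = [len(g) for g in groups] + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(groups):
            groups.append([item])      # open a new group
        else:
            groups[choice].append(item)
    return groups

def sample_visible_modalities(modalities, alpha, rng):
    """Assumed masking rule: partition modalities with a CRP, then keep a
    random non-empty subset of whole groups and mask everything else."""
    order = list(modalities)
    rng.shuffle(order)                 # seating order should not favour any modality
    groups = crp_partition(order, alpha, rng)
    n_keep = rng.randint(1, len(groups))
    kept = rng.sample(groups, n_keep)
    return sorted(m for g in kept for m in g)

rng = random.Random(1)
modalities = ["EEG", "EOG", "EMG", "ECG", "resp"]  # illustrative subset, not the full eight
for _ in range(3):
    print(sample_visible_modalities(modalities, alpha=1.0, rng=rng))
```

Compared with dropping each modality independently, masking whole groups makes both single-modality and near-full-modality inputs appear regularly during training, which is one plausible reason for the robustness the review describes.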

Evaluation: The evaluation is comprehensive and methodologically careful. The authors standardize comparison using linear and MLP probes (rather than allowing fine-tuning or complex downstream heads), which isolates representation quality. Statistical significance testing with Wilcoxon signed-rank tests and FDR correction is appropriate. The inclusion of both in-domain and held-out datasets, with external ECG benchmarks, strengthens the generalization claims. The few-shot analysis across eight cohorts (Tables 11-12) is exceptionally thorough.
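The multiple-comparison step can be sketched as follows. The review says only that FDR correction was applied, so the choice of the Benjamini-Hochberg procedure here is an assumption, and the p-values are made-up placeholders standing in for the outputs of paired tests such as the Wilcoxon signed-rank test.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values, one per task comparison (not taken from the paper).
raw = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(raw)
significant = [q <= 0.05 for q in adjusted]
print(adjusted, significant)
```

A comparison is then reported as significant only if its adjusted value falls below the chosen FDR level, 0.05 in this sketch.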

Ablations: The paper includes well-designed ablations on residual depth ($K_{in}$ vs $K_{out}$), tokenization length, modality masking strategies, model scaling, and context length. The finding that increasing output residual depth hurts representation quality while increasing input depth helps is a valuable practical insight.

Limitations: The compute usage (~8000 H100 GPU-hours) is reasonable. The scaling analysis beyond Base size is limited to unimodal variants due to compute constraints, which is a minor gap.

3. Potential Impact

Clinical applications: The 100× label efficiency demonstrated in few-shot learning (matching supervised baselines on held-out MrOS with 1% data) has significant practical implications for sleep medicine, where expert annotation is expensive and scarce. The streaming inference capability at 1 Hz opens doors for real-time applications like closed-loop neuromodulation and remote patient monitoring.

Cross-domain transfer: The demonstration that a sleep-trained model surpasses a dedicated ECG foundation model (xECG) on daytime AF detection benchmarks is striking. This suggests the representations capture fundamental physiological dynamics beyond sleep-specific patterns, with potential applications across cardiology, neurology, and broader healthcare.

Methodological influence: By demonstrating that NTP works well for physiological signals, this paper may redirect the field away from contrastive and masked-reconstruction approaches. The framework is modular and could be extended to additional modalities (accelerometry, SpO2, blood pressure) or domains (ICU monitoring, ambulatory EEG).

Generative capabilities: While secondary to representation learning, the generative capability (Figure 7) enables synthetic data generation, signal imputation, and interpretability through generation, which could be valuable for data augmentation and clinical interpretation.

4. Timeliness & Relevance

This work is highly timely. Foundation models for physiological signals are an active area with recent contributions (SleepFM in Nature Medicine 2026, OSF at ICML 2026). The paper addresses genuine bottlenecks: the difficulty of defining appropriate augmentations/positive pairs for physiological signals, the need for flexible multi-modal models that handle missing sensors, and the demand for streaming-capable inference. The success of NTP in language and audio makes its application to physiological signals a natural and overdue investigation.

5. Strengths & Limitations

Key strengths:

Strong, well-motivated conceptual argument for NTP over alternatives

Consistent state-of-the-art performance across diverse tasks, modalities, and datasets

Exceptional label efficiency (100× reduction)

Practical design: streaming inference, flexible modality handling, 1 Hz embedding rate

Comprehensive evaluation with appropriate statistical testing

Promising scaling behavior across model size and context length

Notable limitations:

No evaluation on longitudinal clinical outcomes (acknowledged by authors)

Multimodal scaling limited to Base size; larger models only explored in unimodal setting

Simple mean pooling for temporal/modality aggregation — more sophisticated approaches could yield further gains

Limited to eight specific modalities; generalization to novel sensor types untested

No comparison with diffusion-based approaches, which the authors themselves note as a viable alternative

The improvement on desaturation detection is relatively modest compared to other tasks

Pre-training data is limited to NSRR cohorts with known demographic biases
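For reference, the mean-pooling aggregation listed among the limitations can be sketched in a few lines. The pooling order (over time within each modality, then across modalities), the dictionary layout and the function name `mean_pool` are assumptions; the review does not specify these details.

```python
def mean_pool(embeddings):
    """embeddings maps modality name -> list of per-step vectors (T x D).
    Average over time within each modality, then average across modalities."""
    per_modality = []
    for frames in embeddings.values():
        dim = len(frames[0])
        per_modality.append(
            [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        )
    dim = len(per_modality[0])
    return [sum(v[d] for v in per_modality) / len(per_modality) for d in range(dim)]

# Toy 2-dimensional embeddings for two available modalities (illustrative values).
pooled = mean_pool({
    "EEG": [[1.0, 2.0], [3.0, 4.0]],
    "ECG": [[0.0, 0.0], [2.0, 2.0]],
})
print(pooled)
```

Because only the modalities present in the dictionary are averaged, the same function applies to any subset of sensors. A learned attention-based pooling would be one of the more sophisticated alternatives the limitation refers to.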

Overall Assessment

This is a strong, well-executed paper that makes a compelling case for next-token prediction as a unified self-supervised objective for multi-modal physiological signals. The consistent performance improvements, label efficiency, and practical design make it likely to influence both the research community and clinical applications. The work is technically sound, thoroughly evaluated, and addresses a genuine need in the field.

Rating: 8.2 / 10

Significance 8.5 · Rigor 8.5 · Novelty 7.5 · Clarity 9

Generated Jun 9, 2026

Comparison History (15)

Won vs. WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

Paper 2 has higher likely impact due to immediate, broad real-world applicability and timeliness: a scalable foundation-model recipe (next-token prediction) validated on large multi-modal sleep datasets with strong downstream gains and cross-domain generalisation (e.g., atrial fibrillation). This makes it readily adoptable across healthcare ML and sensing. Paper 1 is more novel theoretically and could be influential in causal inference/AI safety, but its impact is narrower and hinges on community uptake of a new formalism and tooling; applications are less direct and likely longer-term.

gpt-5.2 · Jun 10, 2026

Won vs. PRISM: Recovering Instruction Sets from Language Model Activations

Paper 2 (Hypnos) has broader scientific impact for several reasons: (1) It introduces a novel finding that next-token prediction—typically associated with language models—is effective for multi-modal physiological signal representation learning, challenging conventional masked-reconstruction and contrastive approaches. (2) It demonstrates practical clinical applications across sleep medicine, cardiology, and neurology with 100x less labeled data. (3) The multi-modal foundation model trained on 20,000+ recordings establishes a scalable paradigm for healthcare AI. (4) Cross-domain generalization (sleep to daytime ECG/atrial fibrillation) suggests broad transferability. Paper 1, while addressing an important LLM safety problem, is more narrowly scoped to AI interpretability/security.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Towards a Science of AI Agent Reliability

While Paper 1 offers an innovative and highly impactful clinical application, Paper 2 addresses a critical, universal bottleneck in modern artificial intelligence: the evaluation of AI agent reliability and safety. By introducing a comprehensive framework of 12 metrics, Paper 2 provides essential infrastructure for the entire rapidly growing field of agentic AI. Its breadth of impact spans virtually all domains deploying AI, making it exceptionally timely and relevant for the broader scientific community compared to a domain-specific medical foundation model.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Zero knowledge verification for frontier AI training is possible

Paper 2 has higher potential impact because it proposes a novel, governance-enabling cryptographic verification primitive for frontier AI training, addressing a timely and widely recognized enforcement gap in AI regulation. If feasible, it would have broad cross-field implications (AI systems, cryptography/zkVMs, distributed systems, security, policy) and major real-world application in compliance and international agreements. Paper 1 is methodologically solid and practically useful for biomedical ML, but its innovation (next-token prediction for physiology) is more incremental relative to existing autoregressive foundation-model paradigms and its impact is narrower to health sensing.

gpt-5.2 · Jun 9, 2026

Won vs. SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

Paper 2 presents a multi-modal foundation model for physiological signals with broad, high-stakes healthcare applications. Its demonstration that next-token prediction effectively models stochastic biological data, combined with strong zero-shot or few-shot generalization to tasks like atrial fibrillation detection, offers massive potential impact across medicine and bio-machine learning. While Paper 1 provides a highly practical tool for computational scientists, Paper 2's methodological innovation and potential to transform clinical diagnostics give it a wider and more profound scientific impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Hypnos introduces a novel and well-validated foundation model for multi-modal physiological signals using next-token prediction, demonstrating strong results across diverse clinical tasks with 100x less labeled data and cross-domain generalization (sleep to cardiology). It addresses a clear gap in healthcare AI with immediate clinical applications. Paper 2, while useful, is primarily a benchmark contribution that reveals limitations of current continual learning approaches without proposing solutions. Paper 1's methodological innovation, clinical applicability, and demonstration that autoregressive pretraining works for physiological signals represent a more impactful scientific contribution.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Safety Paradox: How Enhanced Safety Awareness Leaves LLMs Vulnerable to Posterior Attack

Paper 1 reveals a fundamental paradox in current LLM alignment paradigms, challenging core assumptions in AI safety. Its theoretical formalization and empirical validation across major frontier models suggest an urgent need for structural paradigm shifts in AI alignment, offering broader and more critical immediate scientific impact than the applied medical modeling approach in Paper 2.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 1 offers immediate, high-value real-world applications in healthcare by successfully adapting next-token prediction to continuous multi-modal physiological data. This scalable approach achieves strong generalization across diverse medical tasks with significantly less labeled data, promising broad and immediate clinical impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. OpenSkill: Open-World Self-Evolution for LLM Agents

Paper 2 introduces Hypnos, a multi-modal sleep foundation model using next-token prediction for physiological signals—a novel application of autoregressive objectives to this domain. It demonstrates strong generalization across modalities and tasks (sleep staging with 100x less labeled data, atrial fibrillation detection), with broad healthcare applications spanning sleep medicine, cardiology, and neurology. The methodological insight that next-token prediction outperforms masked reconstruction and contrastive learning for stochastic physiological signals is broadly impactful. Paper 1 advances LLM agent self-evolution but operates in a more incremental, narrower niche within the already crowded LLM agent space.

claude-opus-4-6 · Jun 9, 2026

Won vs. Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Hypnos introduces a novel foundation model paradigm (next-token prediction) for multi-modal physiological signals, demonstrating strong results across sleep medicine, cardiology, and neurology with significantly less labeled data. It offers broad healthcare applications, methodological innovation (RQ-Transformer with residual vector quantization across 8 modalities), and cross-domain generalization. Paper 2, while timely for AI safety, is primarily a measurement/benchmarking study with trend extrapolation rather than a methodological contribution, and its impact is narrower, focused on AI safety monitoring policy rather than advancing scientific methodology.

claude-opus-4-6 · Jun 9, 2026
