A multimodal and temporal foundation model for virtual patient representations at healthcare system scale

Andrew Zhang, Tong Ding, Sophia J. Wagner, Caiwei Tian, Ming Y. Lu, Rowland Pettit, Joshua E. Lewis, Alexandre Misrahi

Apr 20, 2026 · arXiv:2604.18570v1

cs.LG · cs.AI · cs.CL

#1 of 5717 · cs.LG

Gold · Week 17, 2026

Tournament Score: 1692 ± 28 (scale shown: 1050 to 1750)

Win Rate: 99% · Wins: 248 · Matches: 251

Rating: 8.2 / 10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 8.5

Abstract

Modern medicine generates vast multimodal data across siloed systems, yet no existing model integrates the full breadth and temporal depth of the clinical record into a unified patient representation. We introduce Apollo, a multimodal temporal foundation model trained and evaluated on over three decades of longitudinal hospital records from a major US hospital system, composed of 25 billion records from 7.2 million patients, representing 28 distinct medical modalities and 12 major medical specialties. Apollo learns a unified representation space integrating over 100 thousand unique medical events in our clinical vocabulary as well as images and clinical text. This "atlas of medical concepts" forms a computational substrate for modeling entire patient care journeys comprised of sequences of structured and unstructured events, which are compressed by Apollo into virtual patient representations. To assess the potential of these whole-patient representations, we created 322 prognosis and retrieval tasks from a held-out test set of 1.4 million patients. We demonstrate the generalized clinical forecasting potential of Apollo embeddings, including predicting new disease onset risk up to five years in advance (95 tasks), disease progression (78 tasks), treatment response (59 tasks), risk of treatment-related adverse events (17 tasks), and hospital operations endpoints (12 tasks). Using feature attribution techniques, we show that model predictions align with clinically-interpretable multimodal biomarkers. We evaluate semantic similarity search on 61 retrieval tasks, and moreover demonstrate the potential of Apollo as a multimodal medical search engine using text and image queries. Together, these modeling capabilities establish the foundation for computable medicine, where the full context of patient care becomes accessible to computational reasoning.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: Apollo — A Multimodal Temporal Foundation Model for Virtual Patient Representations

Core Contribution

Apollo represents the most comprehensive attempt to date at building a unified foundation model over the full breadth of the electronic health record. The key novelty lies in the integration of 28 distinct medical modalities — structured data (diagnoses, labs, medications, vitals, flowsheets), clinical text (progress notes, diagnostic reports), and medical images (pathology WSIs, blood smears, EM) — into a single temporally-aware transformer trained on 25 billion records from 7.2 million patients spanning 33 years. Unlike prior EHR foundation models (BEHRT, Med-BERT, MOTOR, Chronoformer) that are limited to structured data or curated ICU cohorts like MIMIC-IV, Apollo operates across the full continuum of inpatient and outpatient care from a real multi-institutional health system.

The architectural approach is pragmatic: modality-specific frozen encoders (GatorTron for text, TITAN/CONCH for pathology, DinoBloom for hematology) produce embeddings that are projected into a shared space, then processed by a temporal transformer with masked token modeling. Patient embeddings are generated by appending a masked diagnosis token at inference. This design isolates the temporal integration module from raw PHI — a meaningful privacy consideration.
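The input assembly described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the dimensions, modality names, random projection matrices, and zero-valued mask token are all hypothetical stand-ins for learned components, and the temporal transformer itself is omitted.

```python
import random

SHARED_DIM = 4  # hypothetical size of the shared embedding space


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a learned projection layer."""
    return [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix, vector):
    """Matrix-vector product: maps a frozen-encoder output into the shared space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def build_sequence(events, projections, mask_token):
    """Order events by time, project each one, then append the mask token."""
    ordered = sorted(events, key=lambda e: e["time"])
    tokens = [project(projections[e["modality"]], e["embedding"]) for e in ordered]
    tokens.append(list(mask_token))  # masked diagnosis token appended at inference
    return tokens


rng = random.Random(0)
# Hypothetical output sizes of the frozen modality encoders.
encoder_dims = {"text": 6, "pathology": 8, "lab": 3}
projections = {m: make_projection(d, SHARED_DIM, rng) for m, d in encoder_dims.items()}
mask_token = [0.0] * SHARED_DIM  # stand-in for a learned [MASK] embedding

events = [
    {"time": 3, "modality": "pathology", "embedding": [0.1] * 8},
    {"time": 1, "modality": "lab", "embedding": [0.5, 0.2, 0.9]},
    {"time": 2, "modality": "text", "embedding": [0.3] * 6},
]
sequence = build_sequence(events, projections, mask_token)
```

In the design the review describes, a temporal transformer would consume this sequence, and the output at the appended mask position would presumably serve as the patient embedding.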

Methodological Rigor

Strengths: The evaluation framework is exceptionally thorough — 322 tasks across 5 prognostic categories and retrieval, with 1.4M held-out patients. The Cox proportional hazards framework with IPCW-based metrics is statistically appropriate for right-censored survival data. Case-cohort sampling, proper train/validation/test separation, and re-estimation of baseline hazards on unmodified validation sets demonstrate careful methodological practice. Bootstrap confidence intervals (100 iterations) with formal significance testing add rigor. Calibration analysis across all tasks is commendable.
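To make the bootstrap step concrete, here is a minimal sketch of a percentile bootstrap interval with 100 resamples. It is a simplification under stated assumptions: it uses a plain AUROC on binary labels and ignores censoring, whereas the paper reportedly uses IPCW-based metrics; the function names and toy data are hypothetical.

```python
import random


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when one class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=100, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        value = auroc([labels[i] for i in idx], [scores[i] for i in idx])
        if value is not None:  # skip resamples that contain a single class
            stats.append(value)
    stats.sort()
    # Approximate percentiles by index into the sorted bootstrap statistics.
    lo = stats[int((alpha / 2) * (n_boot - 1))]
    hi = stats[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


# Toy data: 1 = event observed, 0 = no event; scores are hypothetical risk scores.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.35, 0.65]
point = auroc(labels, scores)
ci_low, ci_high = bootstrap_ci(labels, scores, n_boot=100)
```

With only 100 resamples the interval endpoints are coarse, which is one reason such intervals are usually reported alongside a formal significance test, as the review notes.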

Limitations: The primary baseline is age-sex Cox regression — an extremely weak comparator. While informative as a statistical reference, the absence of head-to-head comparisons with established clinical risk scores (ASCVD, CHA₂DS₂-VASc, Framingham, HEART) for well-studied endpoints limits clinical interpretability. The authors acknowledge this but argue computational infeasibility at scale. The supervised baseline (same architecture without pretraining) provides stronger evidence of pretraining value, but is only evaluated on 28 cancer progression tasks. The structured-only ablation (+0.025 mean AUROC from multimodal integration) shows a real but modest gain from unstructured modalities on the cancer benchmark, raising questions about cost-benefit tradeoffs.

The linear probing evaluation (frozen embeddings + Cox regression with PCA to 50 dimensions) is standard for foundation model assessment but likely understates the performance achievable with fine-tuning. The reported numbers should therefore be read as conservative estimates.
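For reference, the linear probe can be written in standard Cox notation. This is the textbook formulation, not a transcription of the paper's equations; the symbols $e_i$ (frozen patient embedding), $\mu$ and $W_{\mathrm{PCA}}$ (PCA mean and loading matrix), $\delta_i$ (event indicator), and $R(t_i)$ (risk set at time $t_i$) are notation assumed here.

```latex
z_i = W_{\mathrm{PCA}}^{\top}\,(e_i - \mu) \in \mathbb{R}^{50},
\qquad
h(t \mid z_i) = h_0(t)\,\exp\!\left(\beta^{\top} z_i\right)

\ell(\beta) = \sum_{i:\,\delta_i = 1}
  \left[ \beta^{\top} z_i - \log \sum_{j \in R(t_i)} \exp\!\left(\beta^{\top} z_j\right) \right]
```

Only $\beta$ (50 coefficients) and the baseline hazard $h_0(t)$ are fit per task, so downstream performance depends almost entirely on what the frozen embedding already encodes.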

Potential Impact

The paper establishes several capabilities with broad implications:

1. Clinical forecasting at scale: 261 prognostic tasks with strong performance (e.g., heart failure 0.88 AUROC, schizophrenia 0.92, type 2 diabetes 0.85) suggest genuine clinical utility for risk stratification, though deployment would require prospective validation.

2. Multimodal semantic retrieval: The ability to query a 1.4M patient database using text descriptions or pathology images is novel and practically valuable for cohort discovery and clinical trial matching. The glioblastoma image retrieval example (retrieving IDH-wildtype patients from an external TCGA slide) is particularly compelling.

3. Atlas of medical concepts: The unified embedding space showing clinically coherent clustering across modalities (Figure 2) has value for understanding medical ontology relationships and could serve as a foundation for downstream medical AI systems.

4. Interpretability: Both local (LOTO) and global (Integrated Gradients) attribution methods recover clinically plausible risk factors, though the authors appropriately caution these are associative, not causal.
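The retrieval capability in item 2 can be illustrated with a small nearest-neighbor sketch. It assumes cosine similarity over patient embeddings, which the review does not confirm; the patient ids, vectors, and function names are hypothetical, and a 1.4M-patient index would in practice need an approximate nearest-neighbor structure rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, database, k=2):
    """Return the k patient ids whose embeddings are most similar to the query."""
    ranked = sorted(database, key=lambda pid: cosine(query, database[pid]), reverse=True)
    return ranked[:k]


# Hypothetical patient embeddings; a text or image query would first be
# encoded into the same shared space by the corresponding encoder.
patients = {
    "patient_a": [0.9, 0.1, 0.0],
    "patient_b": [0.0, 1.0, 0.2],
    "patient_c": [0.8, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]
matches = top_k(query_embedding, patients, k=2)  # ["patient_a", "patient_c"]
```

Because queries and patients live in one space, the same lookup would serve text, image, or patient-to-patient search, which is what makes the unified embedding attractive for cohort discovery.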

Timeliness & Relevance

This work addresses a critical bottleneck: the fragmentation of clinical data across modality-specific silos. With healthcare generating massive volumes of data but using less than 3% of it, the need for unified representations is acute. The "virtual patient" concept parallels the "virtual cell" paradigm in computational biology, positioning this work at an emerging conceptual frontier. The scale of data (7.2M patients, 25B records) represents what is achievable within a single large health system, making this a realistic template for other institutions.

Strengths

Scale and comprehensiveness: No prior work integrates this many modalities at this scale from a real clinical system

Evaluation breadth: 322 tasks spanning disease onset, progression, treatment response, adverse events, and operations is unprecedented

Practical design choices: PHI isolation, frozen encoders for efficiency, and modality-specific masking are well-motivated

Transparency: Extensive supplementary tables with per-task metrics, calibration curves, and Kaplan-Meier plots for every task

Limitations

Single health system: All data from Mass General Brigham limits generalizability claims; the Northeastern US demographic skew is acknowledged

Weak baselines: Age-sex reference is insufficient for clinical benchmarking; no comparison with existing EHR foundation models

Associational, not causal: Treatment response predictions stratify within treated cohorts, not across treatments — a fundamental limitation for clinical decision-making

Frozen encoders: No end-to-end training of modality encoders likely leaves performance on the table

Imaging contribution unclear: With only 1.1M images among 25B records, the imaging modality's contribution is hard to isolate

No prospective validation or deployment study: All evaluations are retrospective

Code/data not yet available: Reproducibility cannot be verified

Overall Assessment

Apollo is a landmark systems-level contribution that demonstrates the feasibility and utility of whole-record multimodal patient modeling at healthcare-system scale. While individual task performances may not dramatically exceed specialized models, the primary achievement is the generality of a single model across 322 tasks. The work sets a new standard for the comprehensiveness of EHR foundation model evaluation, though stronger baselines and external validation would substantially strengthen the claims.

Rating: 8.2 / 10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 8.5

Generated Apr 21, 2026

Comparison History (251)

Won vs. Riemannian Metric Matching for Scalable Geometric Modeling of Distributions

Apollo represents a landmark-scale clinical foundation model integrating 28 modalities across 7.2 million patients over 30+ years, with 322 evaluation tasks spanning prognosis, retrieval, and clinical forecasting. Its breadth of real-world clinical applications—disease prediction, treatment response, adverse events—gives it transformative potential across healthcare. While Paper 1 makes a solid methodological contribution to geometric modeling with impressive speedups, its impact is narrower and more incremental within computational geometry. Paper 2's scale, multimodal integration, and direct clinical relevance position it for broader and deeper scientific impact.

claude-opus-4-6·Jun 15, 2026

Won vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Apollo represents a landmark contribution in clinical AI by integrating 28 medical modalities across 7.2 million patients over three decades into a unified foundation model. Its scale (25 billion records), breadth (322 evaluation tasks across 12 specialties), and practical clinical applications (disease prediction, treatment response, adverse events) give it enormous real-world impact potential. While Paper 1 provides valuable mechanistic insights into RL post-training for reasoning models, its scope is narrower (single 1.5B model, math reasoning). Paper 2's cross-disciplinary impact spanning AI and medicine, plus its potential to transform clinical practice, gives it substantially higher impact.

claude-opus-4-6·Jun 12, 2026

Won vs. Pretraining Recurrent Networks without Recurrence

Apollo represents a landmark contribution in clinical AI by integrating 28 medical modalities across 7.2 million patients into unified patient representations, demonstrating broad clinical utility across 322 tasks spanning prognosis, retrieval, and disease forecasting. Its scale, multimodal integration, and direct healthcare applications give it transformative real-world impact potential. Paper 2 presents an elegant methodological advance for RNN training, but its impact is narrower—primarily advancing efficient sequence model training. Apollo's breadth across medical specialties, unprecedented data scale, and immediate clinical relevance position it for higher scientific impact.

claude-opus-4-6·Jun 11, 2026

Lost vs. Attention by Synchronization in Coupled Oscillator Networks

Paper 2 is more novel and broadly impactful: it proposes a fundamentally different, physically realizable attention mechanism grounded in coupled-oscillator dynamics, with theoretical guarantees (unique, globally attractive fixed point) and relevance to neuromorphic/analog/energy-efficient computing across hardware and ML. Its cross-disciplinary bridge (dynamical systems, physics, and transformers) and timeliness for compute/energy constraints suggest wide uptake. Paper 1 is highly applied and potentially transformative in healthcare but is constrained by data access/replicability, system-specific biases, and domain-limited breadth despite strong scale and utility.

gpt-5.2·Jun 11, 2026

Won vs. Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

Paper 2 likely has higher scientific impact due to its large-scale, longitudinal multimodal dataset (25B records; 7.2M patients), broad clinical task suite (322 tasks), and clear, near-term real-world applications across forecasting, retrieval, and hospital operations. Its breadth spans multiple specialties and modalities, enabling reuse across many healthcare ML problems and potentially influencing clinical decision support and biomedical informatics. Paper 1 is highly novel and timely for AI alignment, but is narrower in application and depends on a specific model-organism setup, limiting immediate cross-field and translational impact compared to a healthcare-system-scale foundation model.

gpt-5.2·Jun 11, 2026

Won vs. Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

Paper 2 introduces a medical foundation model trained on an unprecedented scale of longitudinal, multimodal healthcare data. Its ability to create virtual patient representations and accurately predict clinical outcomes across over 300 tasks offers transformative real-world applications in medicine. While Paper 1 presents valuable methodological efficiency improvements for neural fields, Paper 2 has a much broader scope, immense scale, and direct societal impact, making its potential scientific and practical impact substantially higher.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. Learn from your own latents and not from tokens: A sample-complexity theory

Paper 1 presents a massive-scale, multimodal foundation model for clinical data with immediate and transformative real-world applications in healthcare. Its ability to integrate 30 years of diverse medical data to predict disease onset, progression, and treatment response promises widespread impact across clinical practice and research. While Paper 2 offers valuable theoretical insights into the sample complexity of self-supervised learning, Paper 1's unprecedented scale, comprehensive evaluation across hundreds of clinical tasks, and direct implications for improving human health give it a significantly higher potential for broad scientific and societal impact.

gemini-3.1-pro-preview·May 29, 2026

Won vs. Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

Paper 2 introduces a massive-scale, multimodal foundation model trained on over 7 million patients, applicable to hundreds of clinical tasks across medical specialties. Its broad generalizability and unprecedented scale represent a paradigm shift in computational medicine, offering significantly wider scientific and clinical impact than Paper 1's disease-specific, albeit highly valuable, early detection model.

gemini-3.1-pro-preview·May 29, 2026

Won vs. Recursive Flow Matching

Paper 1 likely has higher impact due to its large-scale, multimodal, longitudinal clinical foundation model spanning 7.2M patients and many modalities, enabling broad downstream applications (risk prediction, progression, treatment response, operations, retrieval/search) with direct real-world healthcare relevance. Its breadth across specialties and tasks suggests a general platform effect and strong timeliness given current momentum in clinical foundation models. Paper 2 is methodologically novel and broadly applicable to scientific simulation, but appears narrower in immediate societal deployment and validation scope relative to Paper 1’s system-scale clinical integration and extensive task suite.

gpt-5.2·May 27, 2026

Won vs. Aligning Few-Step Generative Models by Amortizing Sample-based Variational Inference

Apollo represents a landmark contribution in clinical AI by integrating 28 medical modalities across 7.2 million patients into a unified foundation model, addressing a fundamental challenge in healthcare. Its scale (25 billion records, 322 evaluation tasks, 12 specialties) and demonstrated utility for clinical forecasting, disease prediction, and multimodal medical search establish a new paradigm for 'computable medicine.' While Paper 2 makes a solid methodological contribution to generative model alignment with broad applicability, Paper 1's potential to transform clinical decision-making across all of medicine gives it substantially greater real-world impact and breadth.

claude-opus-4-6·May 27, 2026
