Simulating clinical interventions with a generative multimodal model of human physiology
Guy Lutsker, Gal Sapir, Jordi Merino, Smadar Shilo, Anastasia Godneva, Eli Meirom, Shie Mannor, Hagai Rossman
Abstract
Understanding how human health changes over time, and why responses to interventions vary between individuals, remains a central challenge in medicine. Here we present HealthFormer, a decoder-only transformer that models the human physiological trajectory generatively, by training on data from the Human Phenotype Project, a multi-visit cohort of over 15,000 deeply phenotyped individuals. We tokenise each participant's health trajectory across 667 measurements spanning seven domains: blood biomarkers, body composition, sleep physiology, continuous glucose monitoring, gut microbiome, wearable-derived physiology, and behaviour and medication exposure. We train HealthFormer to forecast individual physiological trajectories across these domains, and from this single generative objective a range of clinically relevant tasks can be expressed as queries on the model. We show that, without task-specific training, HealthFormer transfers to four independent cohorts and improves prediction for 27 of 30 incident-disease and mortality endpoints, exceeding established clinical risk scores in every comparison. We further show that the model can simulate interventions in silico: in a held-out personalised-nutrition trial, intervention-conditioned predictions recover individual six-month biomarker changes (e.g., Pearson r = 0.78 for diastolic blood pressure). Across 41 randomised intervention-outcome comparisons drawn from published trials, our results show that the predicted direction of effect agrees in every case, and the predicted mean falls within the reported 95% confidence interval in 30 cases. We position HealthFormer as an initial health world model, from which forecasting, risk stratification, and intervention-conditioned simulation arise as queries, providing a basis for clinical digital twins.
AI Impact Assessments
Scientific Impact Assessment: HealthFormer
1. Core Contribution
HealthFormer represents a conceptual shift from modeling electronic health records (discrete clinical events) to modeling continuous physiological trajectories across multiple biological systems. The key novelty is threefold: (a) a unified tokenization scheme that encodes 667 heterogeneous health measurements across seven domains into a single autoregressive sequence; (b) a decoder-only transformer trained with next-token prediction on ~64M tokens from >15,000 deeply phenotyped individuals; and (c) the demonstration that a single generative objective, without task-specific fine-tuning, enables forecasting, risk stratification, and intervention-conditioned simulation as downstream queries.
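To make point (a) concrete, here is a minimal sketch of how heterogeneous measurements could be flattened into one integer token sequence and paired up for next-token prediction. The assessment does not describe the paper's actual tokenizer, so everything here is an assumption for illustration: the quantile binning, the bin count `N_BINS`, the vocabulary layout, and the function names `make_bin_edges`, `tokenize`, and `next_token_pairs`.

```python
from bisect import bisect_right

N_BINS = 4  # hypothetical number of value bins per measurement


def make_bin_edges(values, n_bins=N_BINS):
    """Return n_bins - 1 interior cut points taken at empirical quantiles."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]


def tokenize(measurement_index, value, edges, n_bins=N_BINS):
    """Map a (measurement, value) pair to a single integer token id.

    Assumed layout: each measurement owns a contiguous block of n_bins ids,
    so measurements on different scales share one vocabulary.
    """
    bin_index = bisect_right(edges, value)  # falls in 0 .. n_bins - 1
    return measurement_index * n_bins + bin_index


def next_token_pairs(tokens):
    """Build the (context, target) pairs a next-token objective trains on."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Toy usage with made-up values (not data from the paper).
edges = make_bin_edges([1, 2, 3, 4, 5, 6, 7, 8])
sequence = [tokenize(0, 1, edges), tokenize(1, 5, edges), tokenize(2, 100, edges)]
pairs = next_token_pairs(sequence)
```

Under this layout, a decoder-only model sees each visit as an ordered run of ids and is trained to predict each id from the ones before it, which is the sense in which forecasting becomes a query on the same model.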
The most striking claim is that the model can simulate clinical interventions in silico. By appending treatment tokens to a participant's sequence, the model generates predicted biomarker trajectories that agree in direction with 41/41 published RCT comparisons and fall within the reported 95% CI in 30/41 cases. This positions the work at the intersection of digital twins, foundation models, and precision medicine.
2. Methodological Rigor
Strengths: The evaluation strategy is multi-layered and well-structured. The authors evaluate within-visit reconstruction, longitudinal prediction over ~2 years, zero-shot transfer to four independent cohorts (UK Biobank, NHANES, Framingham, PNP3), disease risk prediction across 30 endpoints, individual-level intervention prediction in a held-out nutrition trial, and population-level comparison against 41 RCT endpoints. The progressive context experiment (Fig. 4d) showing monotonic improvement with additional data is convincing. Negative control experiments (Supplementary Fig. S7) demonstrating pharmacological specificity strengthen the intervention claims.
Concerns: Several methodological issues warrant scrutiny. First, the synthetic population approach for RCT validation introduces a significant degree of freedom: synthetic cohorts are generated from published Table 1 statistics, not actual trial participants, making it impossible to assess whether the model captures individual-level heterogeneity or only population-level regression patterns. Second, the PNP3 "held-out" trial shares ascertainment criteria, CGM devices, and diet-logging conventions with HPP, weakening claims of true external validation. Third, the concordance metric (predicted mean within published 95% CI) is generous: many CIs in large trials are wide, and the predicted direction matching 41/41 comparisons could partly reflect learning that medications generally improve the conditions they treat. Fourth, the model architecture is relatively modest (139M parameters, 2 attention heads), which is practically advantageous but raises questions about whether the scaling analysis truly supports the claim that larger models would improve substantially. Fifth, the intervention encoding is crude: each medication class is a single categorical token with no dose information, and dosing frequency is represented only through the temporal spacing of tokens, which limits the clinical fidelity of simulated interventions.
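The third concern can be illustrated with a short sketch of the two concordance checks as they are described here: direction agreement and whether the predicted mean falls inside the reported 95% CI. The `Comparison` record, the function names, and all numbers below are hypothetical and are not taken from the paper or from any trial.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    predicted: float  # model-predicted mean change
    reported: float   # trial-reported mean change
    ci_low: float     # lower bound of the reported 95% CI
    ci_high: float    # upper bound of the reported 95% CI


def direction_agrees(c):
    """True when predicted and reported effects share a nonzero sign."""
    return c.predicted * c.reported > 0


def within_ci(c):
    """True when the predicted mean lies inside the reported interval."""
    return c.ci_low <= c.predicted <= c.ci_high


def summarize(comparisons):
    """Return (direction matches, within-CI matches, total comparisons)."""
    n_direction = sum(direction_agrees(c) for c in comparisons)
    n_within = sum(within_ci(c) for c in comparisons)
    return n_direction, n_within, len(comparisons)


# Made-up examples: the same prediction against a narrow and a wide interval,
# plus one prediction with the wrong sign.
examples = [
    Comparison(predicted=-5.0, reported=-8.0, ci_low=-10.0, ci_high=-6.0),
    Comparison(predicted=-5.0, reported=-8.0, ci_low=-14.0, ci_high=-2.0),
    Comparison(predicted=1.0, reported=-2.0, ci_low=-3.0, ci_high=-1.0),
]
counts = summarize(examples)
```

The first two examples share the same prediction and reported effect, yet only the one with the wider interval counts as concordant. A within-CI tally therefore depends on the precision of each trial as well as on the model's accuracy, and the direction check passes for any prediction with the correct sign, however small.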
3. Potential Impact
The practical implications are substantial. If the intervention-conditioned prediction framework proves reliable, it could accelerate hypothesis generation for clinical trials, enable personalized treatment selection, and support regulatory science through in silico evidence. The disease risk prediction results (improving on established clinical scores like Framingham CVD and PREVENT-ASCVD across all comparisons) suggest immediate utility for risk stratification in clinical settings.
However, the authors are appropriately cautious: the model learns associations, not causal effects. The gap between "the predicted direction matches RCT results" and "the model can reliably simulate what would happen to a specific patient under treatment X" is enormous. The systematic under-prediction of high-potency interventions (high-dose statins, semaglutide) and over-prediction of off-target effects (SGLT2i on blood pressure) reveals fundamental limitations tied to the observational training distribution.
4. Timeliness & Relevance
This work arrives at a critical inflection point where foundation models are being applied to health data (BEHRT, Foresight, recent Nature papers on generative EHR models) but have been limited to administrative medical records. The availability of deeply phenotyped cohorts like HPP creates a new data substrate. The clinical digital twin concept has attracted enormous interest but lacked concrete computational instantiations. HealthFormer provides perhaps the most complete prototype to date, though the authors wisely frame it as an "initial health world model" rather than a clinical digital twin.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Missing Analyses:
Summary
HealthFormer is an ambitious and well-executed study that advances the field by demonstrating that generative modeling of dense physiological trajectories enables a richer class of clinical queries than EHR-based approaches. The intervention simulation results, while not establishing causality, represent a meaningful proof-of-concept. The work's impact will ultimately depend on prospective validation and whether the gap between observational learning and causal intervention effects can be bridged through the extensions the authors propose.
Generated May 1, 2026
Comparison History (107)
HealthFormer demonstrates higher scientific impact potential through its ability to simulate clinical interventions in silico, effectively creating 'clinical digital twins.' Its validation against 41 published randomized trials with strong agreement is remarkable and has direct translational implications for personalized medicine and drug development. While Paper 1 impressively scales pretraining data (1 trillion minutes, 5M participants) and introduces novel LLM agent integration, Paper 2's multimodal generative approach across 667 measurements with demonstrated intervention simulation capability addresses a more fundamental medical challenge—predicting individual treatment responses—with rigorous external validation across independent cohorts.
HealthFormer demonstrates higher scientific impact through its ability to simulate clinical interventions in silico, validated against 41 randomized trial comparisons, and its multimodal integration of 667 measurements across seven physiological domains. Its clinical digital twin framework directly addresses drug development and personalized medicine needs. While Paper 2 impressively scales wearable data pretraining (5M participants, 1T+ minutes), its scope is narrower (wearable sensors only), and its downstream applications are more incremental. HealthFormer's intervention simulation capability—recovering individual-level biomarker changes and matching published trial results—represents a more transformative advance for clinical decision-making.
Paper 2 presents a foundational multimodal model of human physiology with immediate, transformative applications in personalized medicine, clinical risk prediction, and in silico clinical trials. Its ability to accurately simulate clinical interventions and outperform established risk scores demonstrates profound real-world utility and broad impact across medicine and AI. While Paper 1 introduces a valuable benchmark for AI capabilities in metascience, Paper 2's tangible clinical applications and potential to serve as a 'health world model' give it a higher potential for widespread scientific and societal impact.
Paper 2 likely has higher impact: it proposes a concrete “health world model” trained on large longitudinal multimodal human data, demonstrates strong transfer to multiple cohorts, improves many clinical endpoints over standard risk scores, and shows intervention-conditioned simulation validated against trials—clear translational potential (digital twins, risk stratification, treatment planning). Methodological contribution (generative trajectory tokenization across modalities) is broadly reusable in medicine and ML. Paper 1 is timely and novel as an evaluation benchmark for forecasting science, but is primarily diagnostic/measurement and may have less immediate real-world application breadth than clinical intervention simulation.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements and seven physiological domains. Its ability to simulate clinical interventions in silico, validated against published RCTs (41/41 correct direction of effect), and transfer to four independent cohorts for disease/mortality prediction without task-specific training, establishes a new framework for clinical digital twins. This has transformative potential across healthcare. Paper 2, while theoretically rigorous in characterizing DPO/RLHF equivalence conditions, addresses a more incremental technical issue within LLM alignment.
Paper 2 has higher potential impact: it introduces a generative “health world model” trained on large, longitudinal, multimodal human data and demonstrates broad, clinically relevant capabilities (forecasting, risk prediction, cross-cohort transfer, and intervention simulation) with quantitative validation against endpoints and RCTs—supporting real-world applications like digital twins and decision support. Methodological scope and cross-domain biomedical relevance are wide and timely. Paper 1 is valuable for LLM evaluation rigor and benchmarking, but its primary impact is narrower (AI eval) and less directly translational than a validated physiology model.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' that integrates seven physiological domains, transfers across independent cohorts, outperforms established clinical risk scores, and simulates clinical interventions with remarkable accuracy. Its breadth of impact spans medicine, clinical trials, digital twins, and personalized healthcare. While Paper 1 makes solid theoretical contributions to preference optimization in LLM alignment, Paper 2 addresses a far more consequential real-world problem with broader interdisciplinary impact and transformative potential for how clinical interventions are designed and personalized.
Paper 2 has higher potential impact due to a more novel and ambitious “health world model” that unifies multimodal longitudinal physiology, forecasting, risk prediction, and intervention-conditioned simulation. Its real-world applications (clinical risk stratification, digital twins, in silico trialing) are substantial and timely. Methodological rigor appears stronger via training on a large cohort, external cohort transfer, and validation against RCT effect directions/CIs. Breadth spans ML, clinical medicine, epidemiology, and digital health. Paper 1 is valuable for evaluation/benchmarking, but its applications and cross-field impact are narrower.
Paper 1 is more likely to have higher scientific impact: it introduces a novel generative “health world model” spanning 667 multimodal longitudinal measurements, demonstrates strong generalization to independent cohorts, and supports actionable clinical tasks (risk prediction and intervention-conditioned simulation) with quantitative validation against RCTs and established risk scores. Its real-world applications (digital twins, forecasting, trial simulation) are broad and timely for precision medicine. Paper 2 is valuable as an evaluation study of agentic auto-research and exposes important failure modes, but its primary impact is methodological/diagnostic within AI research workflows and is less immediately translational.
Paper 1 introduces a foundational model for human health trajectories with profound implications for personalized medicine, clinical trial simulation, and digital twins. Its ability to accurately simulate interventions and predict disease across multiple independent cohorts demonstrates a massive breadth of impact on healthcare and biology, arguably surpassing the narrower, albeit important, AI security focus of Paper 2.
Paper 1 likely has higher scientific impact: it introduces a large-scale generative “health world model” spanning multimodal longitudinal physiology, demonstrates strong cross-cohort transfer, and uniquely attempts intervention-conditioned simulation with agreement to RCT directions and many endpoints. This is highly novel for clinical digital twins, with substantial real-world applications in forecasting, risk stratification, and personalized intervention planning, and broad relevance across medicine, epidemiology, and multimodal ML. Paper 2 is timely and methodologically neat (training-free hallucination reduction), but its impact is narrower to LLM inference behavior and depends on benchmark validity/generalization.
Paper 1 presents a groundbreaking 'health world model' capable of simulating clinical interventions and predicting disease trajectories. Its unprecedented scale, multi-domain physiological data integration, and successful validation against real-world randomized trials represent a massive leap toward clinical digital twins. While Paper 2 offers a rigorous methodological improvement for LLM search algorithms, Paper 1's direct, highly validated applicability to personalized medicine and clinical trial simulation promises a more profound and immediate real-world impact on human health.
Paper 2 likely has higher scientific impact due to its strong real-world applicability (risk prediction and intervention simulation), broad relevance across medicine, epidemiology, and ML, and timely alignment with “digital twin”/foundation-model trends. It reports substantial empirical validation: large multimodal longitudinal cohort, transfer to independent cohorts, improvements across many endpoints, and comparisons to clinical risk scores plus intervention-effect checks against RCT literature. Paper 1 is novel and valuable for automated theorem proving and benchmarking, with potential long-term impact, but its immediate cross-domain societal impact and user base are narrower than clinical modeling.
While Paper 1 presents an innovative approach to ML infrastructure, Paper 2 tackles a fundamental challenge in clinical medicine with profound real-world implications. By successfully modeling human physiological trajectories and simulating clinical interventions, HealthFormer paves the way for personalized medicine and clinical digital twins. Its rigorous validation across independent cohorts, ability to outperform established clinical risk scores, and accurate prediction of trial outcomes demonstrate massive potential impact across AI, healthcare, and biology.
While Paper 1 provides valuable insights into LLM interpretability and the faithfulness of Chain-of-Thought reasoning, Paper 2 introduces a groundbreaking 'health world model' with massive potential for precision medicine. Its ability to simulate individual physiological trajectories and clinical interventions across diverse domains, validated against independent cohorts and published trials, offers far-reaching implications for healthcare, clinical trial design, and personalized medicine, giving it a broader and more transformative potential real-world impact.
Paper 1 introduces a generative foundation model for human physiology with profound implications for personalized medicine, diagnostics, and in-silico clinical trials. Its ability to accurately predict disease endpoints across independent cohorts and simulate intervention outcomes offers a highly transformative and broad real-world impact across healthcare and biology. While Paper 2 provides valuable mechanistic insights for AI safety, Paper 1 represents a more paradigm-shifting advancement in a critical applied domain.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements and 7 domains. Its ability to simulate clinical interventions in silico—validated against published RCTs with remarkable concordance—has transformative implications for drug development, clinical trial design, and personalized medicine. The breadth of validation (4 independent cohorts, 30 disease endpoints, 41 intervention comparisons) demonstrates exceptional rigor. Paper 2, while useful, addresses a narrower problem of automating algorithm discovery for scientific image processing, with more incremental impact.
HealthFormer addresses a central challenge in medicine—personalized health forecasting and intervention simulation—with broad clinical applications including risk stratification, digital twins, and in silico trial simulation. It demonstrates strong empirical validation across independent cohorts, 30 disease endpoints, and 41 randomized trial comparisons. Its potential to transform clinical decision-making, drug development, and personalized medicine gives it far greater real-world impact breadth than Paper 1, which, while methodologically interesting, addresses a more niche question about reasoning trace redundancy in language models with primarily theoretical implications.
HealthFormer represents a paradigm-shifting contribution to computational medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements. It demonstrates broad clinical utility—disease prediction surpassing established risk scores, transfer to independent cohorts, and in silico intervention simulation validated against published RCTs. The concept of clinical digital twins has transformative potential for personalized medicine, drug development, and healthcare delivery. Paper 2 addresses a narrower, more incremental problem (LLM conversation grounding verification) with modest accuracy improvements on small benchmarks.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements. Its ability to simulate clinical interventions in silico, validated against 41 randomized trial outcomes, and its zero-shot transfer to independent cohorts for disease/mortality prediction, establishes a new framework for clinical digital twins. This has transformative potential across medicine, drug development, and personalized healthcare. Paper 2, while practically useful, offers an incremental efficiency improvement to LLM synthetic data pipelines with narrower impact.