Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

Fan Ma, Yuntian Liu, Xiang Lan, Weipeng Zhou, Jun Ni, Mauro Giuffrè, Lingfei Qian, Xueqing Peng

May 4, 2026

arXiv:2605.02740v1

cs.AI (primary), cs.CL

#4 of 2292 · Artificial Intelligence

Gold · Week 19, 2026

Tournament Score: 1661±28 (scale 1050 to 1800)

Win Rate: 95% · Wins: 169 · Matches: 177

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Abstract

Evidence derived from large-scale real-world data (RWD) is increasingly informing regulatory evaluation and healthcare decision-making. Administrative claims provide population-scale, longitudinal records of healthcare utilization, expenditure, and detailed coding of diagnoses, procedures, and medications, yet their potential as a substrate for healthcare foundation models remains largely unexplored. Here we present ReClaim, a generative transformer trained from scratch on 43.8 billion medical events from more than 200 million enrollees in the MarketScan claims data spanning 2008-2022. ReClaim models longitudinal trajectories across diagnoses, procedures, medications, and expenditure, and was scaled to 140 million, 700 million, and 1.7 billion parameters. Across over 1,000 disease-onset prediction tasks, ReClaim achieved a mean AUC of 75.6%, substantially outperforming disease-specific LightGBM (66.3%) and the transformer-based Delphi model (69.4%), with the largest gains for rare diseases. These advantages held across retrospective and prospective evaluations and in external validation on two independent datasets. Performance improved monotonically with scale, and post-training added 13.8 percentage points over pre-training alone. Beyond disease prediction, ReClaim captured financial outcomes and improved real-world evidence (RWE) analyses: for healthcare expenditure forecasting it increased explained variance from 0.28 to 0.37 relative to LightGBM, and in a target trial emulation it reduced systematic bias by 72% on average relative to Delphi. Together, these results establish administrative claims as a scalable substrate for healthcare foundation models and show that learned representations generalize across time periods and data sources, supporting disease surveillance, expenditure forecasting, and RWE generation.

AI Impact Assessments (1 model)

Scientific Impact Assessment: ReClaim — Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

1. Core Contribution

ReClaim is a generative transformer foundation model trained from scratch on 43.8 billion medical events from over 200 million U.S. enrollees in the MarketScan claims database (2008–2022). The central novelty lies in three interconnected claims:

First, it establishes administrative claims as a viable substrate for healthcare foundation models, a domain previously dominated by EHR-based models. Claims data offer population-scale coverage, longitudinal continuity across providers, and defined enrollment windows—advantages the authors leverage effectively.

Second, ReClaim introduces an end-to-end framework that jointly models diagnoses, procedures, medications, and healthcare expenditure, expanding the scope beyond disease prediction to financial forecasting and causal inference support.

Third, the paper demonstrates that learned embeddings from foundation models can reduce confounding bias in target trial emulations—a particularly novel application that bridges representation learning and causal inference.

2. Methodological Rigor

Data scale and processing: The training corpus of 118M enrollees with careful preprocessing (ICD-9 to ICD-10 mapping, NDC-to-RxNorm harmonization, hierarchical vocabulary compression from 180K+ to ~21K tokens) is methodologically sound. The monthly temporal aggregation choice is justified by ablation studies showing it outperforms daily and weekly granularities.
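The two preprocessing ideas mentioned above, hierarchical vocabulary compression and monthly aggregation, can be illustrated with a minimal sketch. This is not the authors' pipeline: the back-off rule (shorten a rare ICD-10 code toward its three-character category), the `min_count` threshold, and all function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from datetime import date


def prefix_counts(codes):
    """Count every prefix (length >= 3) of each dot-stripped code."""
    counts = Counter()
    for code in codes:
        code = code.replace(".", "")
        for n in range(3, len(code) + 1):
            counts[code[:n]] += 1
    return counts


def compress_code(code, counts, min_count):
    """Hypothetical rule: shorten a rare code until its prefix is frequent
    enough, stopping at the three-character ICD-10 category."""
    code = code.replace(".", "")
    while len(code) > 3 and counts[code] < min_count:
        code = code[:-1]
    return code


def monthly_tokens(events, counts, min_count=2):
    """Group (date, code) events into calendar months of unique tokens."""
    buckets = defaultdict(set)
    for day, code in events:
        buckets[(day.year, day.month)].add(compress_code(code, counts, min_count))
    return {month: sorted(tokens) for month, tokens in sorted(buckets.items())}


events = [
    (date(2020, 1, 3), "E11.9"),
    (date(2020, 1, 20), "E11.65"),
    (date(2020, 2, 1), "E11.9"),
    (date(2020, 2, 5), "I10"),
]
counts = prefix_counts(code for _, code in events)
sequence = monthly_tokens(events, counts)
```

In this toy input the rare code E11.65 backs off to its category E11, while the more frequent E11.9 keeps full resolution, which is the general trade-off a compressed vocabulary makes.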

Evaluation breadth: The evaluation across 1,000+ disease endpoints, four benchmarks (retrospective, prospective, two external EHR datasets), and three task categories provides strong evidence of generalizability. The use of Holm-adjusted p-values and bootstrap confidence intervals is appropriate. The paired disease-level comparisons (rather than just aggregate metrics) strengthen the statistical analysis.
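For readers unfamiliar with the statistics named above, the following sketch shows the standard Holm step-down adjustment and a percentile bootstrap interval for a mean paired difference. It uses only the Python standard library; the function names, the example numbers, and the choice of 2,000 resamples are illustrative assumptions, not details taken from the paper.

```python
import random


def holm_adjust(pvalues):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        value = min(1.0, (m - rank) * pvalues[idx])
        # Enforce monotonicity across the ordered p-values.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


def bootstrap_mean_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


adjusted = holm_adjust([0.01, 0.04, 0.03])
# Made-up per-disease AUC differences (model minus baseline), for illustration.
lower, upper = bootstrap_mean_ci([0.05, 0.12, 0.08, 0.02, 0.10])
```

With paired disease-level differences, an interval that excludes zero supports a consistent advantage rather than one driven by a few endpoints.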

Baselines: Comparison against 1,208 disease-specific LightGBM models and Delphi (a recent transformer-based model published in Nature) provides meaningful benchmarks. However, the paper lacks comparison against several other relevant foundation models (e.g., MOTOR, CEHR-GPT, Curiosity at comparable scale). The LightGBM baseline uses bag-of-words features, discarding temporal information—this somewhat inflates the apparent advantage of sequence models.

Potential concerns: The post-training stage uses only 100K sequences, which is remarkably small. While the authors present this as a strength (data efficiency), it raises questions about whether the post-training improvements are fragile or would plateau/improve with more data. The case-control construction for disease prediction, where clinically related conditions are not excluded from controls, is acknowledged but potentially inflates AUC for hierarchically related endpoints. The expenditure prediction evaluation uses Monte Carlo sampling of 20 trajectories, but no sensitivity analysis on the number of samples is provided.
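The missing sensitivity analysis on the number of sampled trajectories could take roughly the following shape. The sampler here is a hypothetical stand-in (a lognormal draw) rather than the model's generative sampling, and the sample counts 5, 20, and 80 are arbitrary; the sketch only shows how the spread of a Monte Carlo estimate can be compared across sample counts.

```python
import random
import statistics


def sample_trajectory_cost(rng):
    """Hypothetical stand-in for the total expenditure of one sampled trajectory."""
    return rng.lognormvariate(8.0, 0.5)


def mc_estimate(rng, n_samples):
    """Monte Carlo estimate: average cost over n_samples sampled trajectories."""
    return statistics.fmean([sample_trajectory_cost(rng) for _ in range(n_samples)])


def estimate_spread(n_samples, n_repeats=200, seed=0):
    """Standard deviation of the estimate across repeated runs."""
    rng = random.Random(seed)
    estimates = [mc_estimate(rng, n_samples) for _ in range(n_repeats)]
    return statistics.pstdev(estimates)


spread_by_k = {k: estimate_spread(k) for k in (5, 20, 80)}
```

If the spread at 20 samples were already small relative to the differences between models, the choice of 20 would be easy to defend; if not, reported R² values could shift noticeably with the sampling budget.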

3. Potential Impact

Clinical applications: Disease prediction across 1,000+ conditions with a mean AUC of 75.6% and particular strength for rare diseases addresses a genuine unmet need. The rare disease advantage (mean AUC gain of +16.9pp over LightGBM) is especially noteworthy, as these conditions are precisely where specialized models fail due to data scarcity.

Health economics: The expenditure forecasting capability (R² improvement from 0.28 to 0.37) with identification of high-need, high-cost individuals has direct implications for health system planning, insurance risk adjustment, and resource allocation.

Causal inference: The 72% average reduction in systematic bias in target trial emulation is the most conceptually impactful finding. If validated more broadly, this approach could transform how observational studies handle unmeasured confounding—a fundamental challenge in pharmacoepidemiology and comparative effectiveness research. However, this is demonstrated in only one case study with relatively small sample sizes.

Industry relevance: Claims data are universally available across health systems, making this approach immediately scalable in ways that EHR-based models are not. This could accelerate adoption by payers, regulators (FDA's growing interest in RWE), and pharmaceutical companies.

4. Timeliness & Relevance

The paper arrives at a critical juncture. The FDA's 2024 guidance on using RWD for regulatory decision-making creates institutional demand for better claims-based analytics. Simultaneously, the success of foundation models in NLP and other domains has generated interest in healthcare applications, but most work has focused on EHRs. ReClaim fills a clear gap by demonstrating that claims data—despite lacking clinical granularity—encode sufficient signal for robust longitudinal modeling.

The scaling analysis (140M to 1.7B parameters) and demonstration of monotonic improvement align with the broader scaling laws discourse in AI, though the gains from scaling are modest compared to the dramatic gains from post-training.

5. Strengths & Limitations

Key strengths:

Unprecedented scale: 200M+ enrollees, 43.8B events, multiple model sizes

Comprehensive evaluation: temporal, cross-source, cross-task generalization

Novel RWE application: embedding-augmented propensity score estimation

Thoughtful tokenization design: hierarchical code compression, temporal anchoring, expenditure encoding

Post-training efficiency: massive gains from only 100K sequences

Notable limitations:

No comparison with other recent foundation models beyond Delphi (e.g., Curiosity, MOTOR)

External validation on EHR data shows expected degradation (~3-6pp), and the cross-modality gap is not deeply analyzed

The RWE case study is limited to one drug comparison with small cohort sizes (7,246 individuals)

No interpretability analysis beyond UMAP visualization

MarketScan represents commercially insured U.S. populations—generalizability to uninsured, international, or fundamentally different healthcare systems is unknown

The paper does not release model weights or code, limiting reproducibility

Absolute AUC values, while improved over baselines, remain moderate for clinical deployment in many disease categories

Additional observations: The 13.8pp gain from post-training versus the ~1.3pp gain from scaling (140M→1.7B) suggests that architectural and training strategy innovations may be more impactful than brute-force scaling for this domain. The expenditure tokenization scheme (scientific notation encoding) is an elegant solution to the dynamic range problem.
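To make the scientific-notation idea concrete, here is a minimal sketch of how a dollar amount might be split into an exponent token and a mantissa token. The token names, the two-significant-digit precision, and the zero token are assumptions; the paper's actual scheme may differ.

```python
import math


def encode_amount(amount, digits=2):
    """Encode a positive amount as [exponent token, mantissa token]."""
    if amount <= 0:
        return ["<COST_ZERO>"]
    exponent = math.floor(math.log10(amount))
    mantissa = round(amount / 10 ** exponent, digits - 1)
    if mantissa >= 10:  # rounding can push e.g. 9.96 up to 10.0
        mantissa /= 10
        exponent += 1
    return [f"<COST_EXP_{exponent}>", f"<COST_MAN_{mantissa:.{digits - 1}f}>"]


def decode_amount(tokens):
    """Invert encode_amount, up to the rounding applied to the mantissa."""
    if tokens == ["<COST_ZERO>"]:
        return 0.0
    exponent = int(tokens[0][len("<COST_EXP_"):-1])
    mantissa = float(tokens[1][len("<COST_MAN_"):-1])
    return mantissa * 10 ** exponent


tokens = encode_amount(1234.56)
approx = decode_amount(tokens)
```

Under these assumptions, amounts from cents to millions of dollars share a small fixed vocabulary (a handful of exponent tokens plus the mantissa tokens), at the cost of a bounded relative rounding error.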

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Generated May 5, 2026

Comparison History (177)

vs. Towards a General Intelligence and Interface for Wearable Health Data

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to stronger methodological rigor and direct relevance to regulatory-grade real-world evidence. It trains on a massive, widely used nationwide claims substrate, evaluates on >1,000 tasks with prospective and external validation, and demonstrates impact beyond prediction (expenditure forecasting and reduced bias in target trial emulation). These advances can broadly influence epidemiology, health economics, pharmacoepidemiology, and regulatory science. Paper 1 is highly innovative and large-scale, but relies on wearables with harder-to-validate endpoints and more consumer-facing applications, potentially limiting near-term cross-field adoption.

vs. Forecasting Scientific Progress with Artificial Intelligence

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to strong real-world applicability (regulatory RWE, surveillance, forecasting), large-scale novel dataset/modeling (43.8B events, up to 1.7B params), and extensive methodological validation (1,000+ tasks, prospective/retrospective, external datasets, scale/post-training ablations, bias reduction in target trial emulation). Its contributions can influence multiple fields—clinical informatics, epidemiology, health economics, and causal inference—at a timely moment for RWD-driven decision-making. Paper 1 is novel and broadly relevant but mainly diagnostic/benchmarking with less immediate downstream deployment impact.

vs. Forecasting Scientific Progress with Artificial Intelligence

claude-opus-4.6 · 5/22/2026

ReClaim demonstrates a large-scale foundation model trained on 43.8 billion medical events from 200M+ patients, showing strong empirical results across 1,000+ prediction tasks with clear scaling laws. It has immediate, concrete applications in disease prediction, expenditure forecasting, and causal inference for real-world evidence—areas of high practical and regulatory importance. Paper 2 introduces an interesting benchmark for AI scientific forecasting but primarily documents limitations of current models without offering solutions, making its impact more diagnostic than transformative. ReClaim's methodological contributions and healthcare applications give it broader and more actionable impact.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

claude-opus-4.6 · 5/21/2026

ReClaim represents a major advance in healthcare AI by training a foundation model on 43.8 billion medical events from 200M+ patients, demonstrating strong performance across 1,000+ prediction tasks, expenditure forecasting, and causal inference. Its scale, breadth of validation, and direct applicability to regulatory decision-making and real-world evidence generation give it enormous practical impact. Paper 2 makes a solid theoretical contribution clarifying DPO/RLHF equivalence conditions, but its scope is narrower and incremental relative to the rapidly evolving LLM alignment literature, where methods are frequently superseded.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

claude-opus-4.6 · 5/21/2026

ReClaim represents a major advance in healthcare AI by building the first large-scale foundation model on administrative claims data (43.8B events, 200M+ patients). Its demonstrated improvements across 1,000+ prediction tasks, expenditure forecasting, and bias reduction in trial emulations address critical real-world needs in healthcare and regulatory decision-making. The breadth of applications (disease surveillance, cost forecasting, RWE generation) and rigorous external validation suggest transformative potential. Paper 2, while theoretically insightful regarding DPO/RLHF equivalence, addresses a more incremental concern within AI alignment methodology with narrower practical impact.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gemini-3.1 · 5/21/2026

Paper 2 presents a population-scale foundation model trained on massive real-world healthcare data (200 million patients). Its ability to significantly improve disease prediction, healthcare expenditure forecasting, and clinical trial emulation demonstrates immense real-world utility and methodological rigor. While Paper 1 introduces a valuable AI benchmark, Paper 2's direct application to critical healthcare outcomes, its massive scale, and its breadth of impact across medicine, epidemiology, and AI give it a substantially higher potential scientific and societal impact.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.2 · 5/21/2026

Paper 2 likely has higher impact: it introduces a large-scale foundation model trained on unprecedented nationwide claims data and demonstrates strong, broadly useful improvements across >1,000 clinical prediction tasks, external validations, expenditure forecasting, and reduced bias in target trial emulation—directly enabling real-world evidence generation with clear healthcare and regulatory applications. Methodology appears rigorous (scale studies, prospective/retrospective tests, external datasets). Paper 1 is novel and timely as an evaluation benchmark, but its real-world impact is more indirect and primarily within LLM evaluation, whereas Paper 2 can influence clinical research, health economics, epidemiology, and policy.

vs. How Far Are We From True Auto-Research?

gemini-3.1 · 5/20/2026

Paper 1 introduces a massive-scale foundation model for healthcare, demonstrating immediate and transformative real-world applications in disease prediction and health economics. Its population-scale validation offers profound, cross-disciplinary impact. Paper 2, while highly relevant to AI methodology, is primarily an evaluative benchmark highlighting current limitations in auto-research, making Paper 1's tangible contributions more impactful.

vs. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

claude-opus-4.6 · 5/20/2026

ReClaim represents a major foundation model contribution trained on 43.8 billion medical events from 200M+ patients, demonstrating substantial improvements across 1,000+ prediction tasks, expenditure forecasting, and causal inference. It opens a new substrate (administrative claims) for healthcare AI at population scale with clear regulatory and clinical applications. Paper 2 addresses an important AI safety concern (hallucination-to-action conversion) with a novel architectural solution, but targets a narrower problem domain. ReClaim's breadth of impact across healthcare, policy, and AI methodology, combined with its massive empirical validation, gives it higher potential scientific impact.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

gemini-3.1 · 5/19/2026

Paper 1 leverages a massive, unprecedented dataset (200 million enrollees) to build a healthcare foundation model with clear, immediate real-world applications in disease prediction, trial emulation, and expenditure forecasting. Its scale, rigorous external validation, and potential to transform population health and healthcare economics give it a broader and more profound societal and scientific impact compared to the algorithmic improvements in LLM hallucination reduction presented in Paper 2.

vs. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

claude-opus-4.6 · 5/18/2026

ReClaim represents a landmark contribution by training the first large-scale foundation model on administrative claims data covering 200M+ patients. Its demonstrated improvements across 1,000+ disease prediction tasks, expenditure forecasting, and bias reduction in trial emulation have immediate, broad real-world healthcare applications. The scale of data, rigorous validation (including external datasets and prospective evaluation), and direct relevance to regulatory decision-making give it enormous potential impact. While SMCEvolve offers elegant theoretical contributions to LLM-driven program search, its impact is more incremental and narrower in scope compared to ReClaim's potential to transform healthcare analytics and evidence generation.

vs. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

gemini-3.1 · 5/16/2026

Paper 1 presents a massive-scale foundation model trained on 200 million patients, demonstrating highly impactful real-world applications in disease prediction, epidemiology, and healthcare expenditure forecasting. Its unprecedented scale and significant performance improvements offer broader societal and interdisciplinary impact compared to Paper 2, which, while valuable for AI development, primarily provides a benchmark for the more niche intersection of automated reasoning and formal mathematics.

vs. VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact: it introduces a large-scale foundation model trained from scratch on unprecedented nationwide claims-scale longitudinal data, demonstrates broad improvements across 1,000+ clinical prediction tasks, external validations, and shows downstream utility for expenditure forecasting and reducing bias in target trial emulation—directly affecting healthcare research, policy, and RWE generation. Its applications span epidemiology, health economics, clinical risk modeling, and regulatory science. Paper 2 is innovative for systems engineering, but its impact is narrower and more incremental relative to existing serving stacks.

vs. SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

gemini-3.1 · 5/15/2026

Paper 1 presents a foundational advance by training a model on an unprecedented scale of 200 million patients' claims data. Its impact is extremely broad, significantly improving clinical disease prediction, healthcare expenditure forecasting, and epidemiological causal inference. While Paper 2 offers valuable insights into real-world consumer AI triage, Paper 1 provides a highly scalable, population-level infrastructure with immense methodological rigor that avoids the limitations of self-reported data, making it likely to serve as a cornerstone resource across multiple biomedical and economic fields.

vs. Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

gpt-5.2 · 5/15/2026

Paper 1 likely has higher scientific impact due to its large-scale, novel foundation-model training on nationwide claims data (43.8B events, 200M+ enrollees), strong empirical gains across 1,000+ clinical tasks, external validations, and direct relevance to real-world healthcare applications (disease surveillance, expenditure forecasting, and bias reduction in target trial emulation). Its breadth spans clinical prediction, health economics, and regulatory-grade RWE. Paper 2 is methodologically interesting for interpretability of reasoning traces, but its immediate real-world applicability and cross-domain impact are less direct and likely narrower.

vs. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

claude-opus-4.6 · 5/15/2026

ReClaim represents a fundamentally new foundation model for healthcare built on an unprecedented scale of 43.8 billion medical events from 200M+ patients. It addresses a critical gap—applying foundation models to administrative claims data—with broad applications in disease prediction, expenditure forecasting, and causal inference (target trial emulation). The demonstrated improvements across 1,000+ prediction tasks, external validation, and scaling laws provide strong methodological rigor. Its impact spans AI, medicine, health economics, and regulatory science. Paper 1, while practically useful, offers an incremental efficiency improvement to existing synthetic data pipelines with narrower scope.

vs. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

gemini-3.1 · 5/15/2026

Paper 2 introduces a population-scale healthcare foundation model with massive real-world implications for disease prediction, expenditure forecasting, and trial emulation. The unprecedented scale of the dataset (200 million enrollees) and substantial performance improvements across multiple critical clinical and operational tasks give it significantly higher potential for broad societal and scientific impact compared to Paper 1's narrower focus on LLM tool-use mechanics.

vs. Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

gemini-3.1 · 5/12/2026

Paper 1 develops a massive-scale foundation model trained on data from 200 million patients, demonstrating substantial improvements in disease prediction, expenditure forecasting, and clinical trial emulation. Its direct applications in population health, healthcare economics, and real-world evidence generation provide immense real-world utility and broad interdisciplinary impact. In contrast, Paper 2 offers a valuable but narrower evaluation framework for mathematical reasoning in existing LLMs, making Paper 1's scientific and societal implications significantly higher.

vs. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

claude-opus-4.6 · 5/11/2026

ReClaim represents a major foundation model contribution trained on 43.8 billion medical events from 200M+ patients, demonstrating substantial improvements across 1,000+ prediction tasks, expenditure forecasting, and causal inference. Its direct applications to regulatory decision-making, disease surveillance, and healthcare economics give it enormous real-world impact potential. Paper 2, while offering valuable insights into LLM reasoning mechanisms (myopic planning), is more narrowly focused on understanding existing model behavior in a game domain. Paper 1's methodological scale, practical utility, and breadth of healthcare applications position it for higher scientific impact.

vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

gemini-3 · 5/7/2026

Paper 2 offers a fundamental theoretical unification of Bayesian inference, game theory, and thermodynamics. While Paper 1 presents an impressive and highly practical engineering achievement in healthcare, Paper 2's potential to create a foundational paradigm shift across multiple scientific disciplines—including physics, biology, economics, and AI—gives it broader scientific impact. Its falsifiable predictions regarding collective intelligence could redefine how we understand complex systems and emergent behavior globally, whereas Paper 1's impact is largely confined to medical informatics.