From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou

May 16, 2026

arXiv:2605.16927v1

cs.AI (primary)

#254 of 2292 · Artificial Intelligence

Tournament Score: 1508 ± 46 (scale: 1050 to 1800)

Win Rate: 78%

Rating: 5.5 / 10

Significance 6.5 · Rigor 5.5 · Novelty 4.5 · Clarity 6

Abstract

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This is a review/framework paper that organizes the field of intervention-aware disease trajectory modeling around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process). The central thesis is that clinical AI must move beyond static risk prediction toward dynamic, causally grounded trajectory modeling that explicitly accounts for treatment-confounder feedback, irregular observation, and the closed-loop nature of clinical decision-making.

The claimed novelty is the "first unified framework" bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time. In practice, this is primarily a conceptual synthesis rather than a methodological invention—the paper organizes existing work (g-methods, neural ODEs/CDEs, temporal point processes, balanced representation learning, etc.) into a coherent taxonomy and maps each method family to the specific component it addresses.

2. Methodological Rigor

The literature search follows a systematic protocol (PRISMA-style, 1,241 initial records → 119 retained), with dual-reviewer screening and a third adjudicator. This is laudable for a review paper. However, the search terms are notably broad and somewhat biased toward deep learning terminology ("large model," "foundation model," "transformer," "LLM"), which may underrepresent classical biostatistical contributions to dynamic treatment regimes that predate the AI framing.

The conceptual framework is well-structured. The four research questions form a logical dependency chain (target definition → identifiability → estimation → translation). Box 1 on longitudinal causal inference is a genuinely useful pedagogical contribution that clearly explains sequential exchangeability, positivity, g-formula, and MSMs for an AI audience. The method selection matrix (Table 5) is pragmatically valuable.
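For readers unfamiliar with the terms named above, the following block states them in their standard textbook form for discrete time. It is a reference sketch only: the notation (treatment history \bar{A}_t, covariate history \bar{L}_t, potential outcome Y^{\bar{a}}, stabilized weight SW) is an assumption chosen here and may differ from what the paper's Box 1 uses.

```latex
% Standard textbook statements for t = 0..T; notation is assumed here, not taken from the paper.
% \bar{A}_t: treatment history, \bar{L}_t: covariate history, Y^{\bar{a}}: potential outcome.
% At t = 0 the conditioning histories indexed by t-1 are empty.
\begin{align*}
&\text{Sequential exchangeability:} && Y^{\bar{a}} \perp\!\!\!\perp A_t \mid \bar{L}_t,\ \bar{A}_{t-1} = \bar{a}_{t-1} \quad \text{for all } t \\
&\text{Positivity:} && P(A_t = a_t \mid \bar{L}_t = \bar{l}_t,\ \bar{A}_{t-1} = \bar{a}_{t-1}) > 0
   \ \text{ whenever } P(\bar{L}_t = \bar{l}_t,\ \bar{A}_{t-1} = \bar{a}_{t-1}) > 0 \\
&\text{g-formula:} && E[Y^{\bar{a}}] = \sum_{\bar{l}} E[Y \mid \bar{A} = \bar{a},\ \bar{L} = \bar{l}]
   \prod_{t=0}^{T} P(L_t = l_t \mid \bar{L}_{t-1} = \bar{l}_{t-1},\ \bar{A}_{t-1} = \bar{a}_{t-1}) \\
&\text{Marginal structural model:} && E[Y^{\bar{a}}] = g(\bar{a}; \beta), \qquad
   SW = \prod_{t=0}^{T} \frac{P(A_t \mid \bar{A}_{t-1})}{P(A_t \mid \bar{L}_t,\ \bar{A}_{t-1})}
\end{align*}
```

The g-formula standardizes over the covariate history that would arise under the treatment sequence \bar{a}, while the MSM reaches the same target by reweighting observed patients with SW. Both rely on the first two conditions holding.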

However, the paper lacks quantitative synthesis—no meta-analytic summaries, no systematic comparison of method performance across standardized benchmarks. The discussion of each method family remains at a descriptive level, and the "unified framework" is primarily conceptual rather than formally specified (no unified mathematical model is actually proposed).

3. Potential Impact

The paper addresses a genuine gap: the disconnect between predictive ML models trained on observational care logs and the causal questions clinicians actually need answered. By explicitly framing trajectory modeling as requiring joint treatment of disease dynamics, treatment assignment, and observation processes, it provides a conceptual checklist that could improve research practice.

The "decision-grade evidence" framing is particularly valuable. The staged evaluation pipeline (factual fidelity → counterfactual identifiability → policy validity → deployment safety) with the sepsis worked example offers actionable guidance. If adopted, this could reduce the common failure mode of papers that achieve high AUROC on logged data while making implicit causal claims.
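One way to read the staged pipeline is as a sequence of gates, where a later stage only counts if every earlier stage has passed. The sketch below illustrates that reading in Python. The stage names come from the text above; the function name `strongest_supported_claim`, the gating rule, and the mapping from stages to claims (in particular "monitored deployment") are hypothetical choices made for illustration, not the paper's specification.

```python
# Stage order as listed in the review; the gating logic below is an illustrative assumption.
STAGES = (
    "factual fidelity",
    "counterfactual identifiability",
    "policy validity",
    "deployment safety",
)

# Hypothetical mapping from the last stage passed to the strongest claim it supports.
CLAIMS = {
    None: "no claim supported",
    "factual fidelity": "factual forecasting",
    "counterfactual identifiability": "counterfactual estimation",
    "policy validity": "policy evaluation",
    "deployment safety": "monitored deployment",
}


def strongest_supported_claim(passed: dict) -> str:
    """Return the strongest claim backed by an unbroken run of passed stages.

    `passed` maps stage name -> bool. A missing stage counts as not passed,
    and a pass at a later stage is ignored if an earlier stage failed.
    """
    last_passed = None
    for stage in STAGES:
        if not passed.get(stage, False):
            break  # stop at the first failed or missing gate
        last_passed = stage
    return CLAIMS[last_passed]


if __name__ == "__main__":
    # High predictive accuracy on logged data alone supports only a forecasting claim.
    report = {"factual fidelity": True, "counterfactual identifiability": False}
    print(strongest_supported_claim(report))
```

Under this reading, a model that performs well on logged data but fails the identifiability stage is limited to a forecasting claim, which matches the failure mode the paragraph above describes.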

The impact is likely strongest for: (a) ML researchers entering clinical trajectory modeling who need to understand causal requirements; (b) clinical informaticists seeking to bridge prediction and decision support; (c) reviewers evaluating trajectory modeling papers who need a framework for assessing claim strength.

4. Timeliness & Relevance

The paper is well-timed. The recent Nature publication on generative transformers for disease trajectories (Bica et al., 2025), growing interest in "medical world models," and the proliferation of foundation models for EHR data all create demand for a synthesis that clarifies what these models can and cannot claim. The emphasis on the gap between predictive accuracy and causal validity is increasingly recognized but rarely articulated this systematically.

The "world model" framing in the title is attention-grabbing but only loosely connected to the actual content: the paper is more about longitudinal causal inference than about world models in the reinforcement learning sense. Figure 1 sketches this connection, but the paper does not engage deeply with the world model literature from AI/robotics.

5. Strengths & Limitations

Strengths:

Comprehensive taxonomy that bridges traditionally siloed communities (biostatistics, causal inference, deep learning, clinical informatics)

Clear articulation of the claim hierarchy (forecasting → counterfactual → policy) and the conditions each requires

Box 1 is an excellent standalone educational resource

Table 5 (method selection matrix) provides genuinely actionable guidance

Explicit attention to observation process bias—often ignored in trajectory modeling surveys

The discussion of evaluation matching claim strength is a significant conceptual contribution

Limitations:

The "first unified framework" claim is overstated—the framework is primarily organizational/taxonomic rather than formally specified. Prior work by Bica et al. (2021), Hernán & Robins, and others covered much of this ground

No new empirical results, benchmarks, or datasets are contributed

The world model analogy is underdeveloped—the connection between clinical trajectory modeling and the world model paradigm (Ha & Schmidhuber, Dreamer, etc.) is superficial

The paper is very long (~30 pages) with considerable redundancy between sections (the same key points about treatment-confounder feedback and observation bias are repeated multiple times)

Limited engagement with practical deployment challenges: computational costs, real-time inference, EHR integration, regulatory pathways

The search strategy may miss important work in biostatistics and pharmacoepidemiology that doesn't use AI-centric terminology

Some important method families receive superficial treatment (e.g., structural nested models, doubly robust estimators beyond brief mentions)

6. Additional Observations

The paper reads as a position piece wrapped in review paper formatting. Its strongest contribution is the conceptual reframing and the evaluation framework, not the literature synthesis per se. The writing quality is generally good but could be more concise—many paragraphs restate the same core message. The large author list across many institutions suggests a collaborative effort, but the integration sometimes feels uneven.

The paper's impact will ultimately depend on whether the community adopts its evaluation framework. If the staged evidence pipeline becomes a standard checklist for trajectory modeling papers, the impact could be substantial. If it remains one of many survey papers, the impact will be more modest.

Rating: 5.5 / 10

Significance 6.5 · Rigor 5.5 · Novelty 4.5 · Clarity 6

Generated May 19, 2026

Comparison History (23)

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/21/2026

Paper 1 introduces GRAM, a novel technical framework that combines recursive reasoning with generative probabilistic modeling, addressing fundamental limitations of deterministic recursive reasoning models. It offers concrete methodological contributions (stochastic latent trajectories, amortized variational inference) with demonstrated improvements on reasoning tasks and inference-time scaling. Paper 2 is a review/synthesis paper that organizes existing work into a unified framework for clinical trajectory modeling. While comprehensive and valuable, review papers typically have less direct scientific impact than papers introducing novel methods. GRAM's contributions to neural reasoning architectures have broad applicability across AI research.

vs. GIM: Evaluating models via tasks that integrate multiple cognitive domains

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical and fundamental bottleneck in clinical AI—transitioning from static, confounded risk prediction to causal, intervention-aware dynamic trajectories. Its unified framework bridges deep learning, causal inference, and policy evaluation, offering profound implications for safe, real-world healthcare applications. While Paper 2 provides a valuable and timely LLM benchmark, Paper 1's potential to fundamentally change clinical decision-making and patient outcomes represents a deeper scientific and societal impact.

vs. Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

claude-opus-4.6 · 5/19/2026

Paper 1 presents the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation for clinical AI, addressing fundamental challenges in clinical decision-making. Its breadth of impact spans medicine, causal inference, and machine learning, with direct implications for learning health systems and individualized treatment. Paper 2, while insightful about LLM negotiation limitations, addresses a narrower question with less transformative potential. Paper 1's methodological synthesis and its relevance to life-critical healthcare applications give it substantially greater scientific impact.

vs. NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

claude-opus-4.6 · 5/19/2026

NeuroMAS introduces a novel conceptual bridge between neural network architectures and multi-agent LLM systems, offering both theoretical foundations and empirical validation. Its framing of multi-agent design as architecture design (with depth, width, connectivity as scalable dimensions) opens a new research direction with broad applicability. The finding about progressive growth enabling scaling is practically important. Paper 1, while a comprehensive and valuable review/framework for clinical AI, is a synthesis of existing methods rather than introducing new methodology, limiting its direct impact despite addressing an important problem.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical, systemic bottleneck in clinical AI—transitioning from static prediction to dynamic, causal-aware trajectory modeling. By providing a unified framework for intervention-aware clinical decision-making, it has profound implications for patient outcomes and the safe deployment of AI in healthcare. While Paper 1 offers strong technical advancements in chemical LLMs, Paper 2's synthesis of causal inference and clinical AI is likely to shape broader research paradigms and policies across the high-impact medical domain.

vs. From Prompts to Protocols: An AI Agent for Laboratory Automation

gemini-3.1 · 5/19/2026

While Paper 1 offers a rigorous methodological framework for clinical AI, Paper 2 presents a transformative tool for experimental sciences. By enabling natural language control of lab automation, Paper 2 has the potential to dramatically accelerate discovery across diverse fields such as chemistry, biology, and materials science. Its ability to lower the barrier to autonomous experimentation provides a broader and more immediate scientific impact across multiple disciplines.

vs. F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text

gpt-5.2 · 5/19/2026

Paper 2 has higher impact potential: it proposes a unified, intervention-aware framework for clinical trajectory prediction that directly addresses core methodological obstacles (time-varying confounding, treatment feedback, informative observation) and links forecasting, counterfactual estimation, and policy evaluation with identifiability and evaluation guidance. Its applications (decision-grade clinical AI, policy stress-testing, safer learning health systems) are broad and timely, spanning clinical informatics, causal inference, and ML. Paper 1 is a narrower, incremental architecture combination on a specific dataset/domain, with more limited cross-field influence.

vs. Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

gemini-3.1 · 5/19/2026

Paper 2 offers a comprehensive, unifying framework for a critical, high-stakes challenge: clinical decision-making and causal trajectory modeling in healthcare AI. By bridging forecasting, counterfactual estimation, and policy evaluation while addressing complex confounding biases, it has the potential to fundamentally shift how clinical AI models are developed and evaluated. In contrast, Paper 1 presents an incremental algorithmic tweak (shared backbone PPO) applied to a specific, narrower domain (multi-UAV coverage). Paper 2's broad applicability across medicine, causal inference, and machine learning ensures a much wider and more profound scientific impact.

vs. Dynamics of collective creativity in AI art competitions

gpt-5.2 · 5/19/2026

Paper 1 has higher potential impact due to its methodological and translational reach: it proposes a unified, intervention-aware framework linking forecasting, counterfactual trajectory estimation, and policy evaluation while explicitly handling time-varying confounding and informative observation—core barriers to clinically actionable AI. Its applications (treatment-sensitive predictions, policy stress-testing, safer closed-loop learning health systems) are high-stakes and broadly relevant across biostatistics, causal inference, ML, and healthcare delivery. Paper 2 is novel and well-powered empirically, but its impact is more domain-specific (computational social science/creativity) and less likely to reshape high-consequence decision pipelines.

vs. Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

claude-opus-4.6 · 5/19/2026

Paper 2 presents the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation in clinical AI, addressing fundamental challenges (treatment confounding, observation bias) that affect real-world healthcare. As a comprehensive review/framework paper, it has broader interdisciplinary impact spanning clinical medicine, causal inference, and machine learning. Paper 1 offers a novel technical contribution (entropy-gradient inversion) for LRM optimization, but is more narrowly focused on reasoning model training. Paper 2's potential to reshape clinical AI deployment and decision-making gives it wider and more lasting impact.

vs. Learning Quantifiable Visual Explanations Without Ground-Truth

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to its broad, timely synthesis of dynamic, intervention-aware clinical prediction—directly targeting major real-world deployment barriers (treatment confounding feedback, informative/irregular observation, identifiability). As a unifying framework/review bridging forecasting, counterfactual trajectories, and policy evaluation with concrete evaluation/validation guidance, it can influence multiple subfields (clinical ML, causal inference, time-series modeling, health policy) and shape standards for “decision-grade” evidence. Paper 1 is novel and useful for XAI evaluation, but is narrower in application scope and ecosystem-level influence.

vs. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact: it provides a unifying, intervention-aware framework for clinical prediction that explicitly tackles treatment feedback, time-varying confounding, and informative observation—central barriers to deploying reliable clinical AI. Its applications (treatment-sensitive forecasting, counterfactual trajectories, policy evaluation, safer learning health systems) are high-stakes and broadly relevant across medicine, causal inference, and ML. As a Review, it can shape research agendas and evaluation standards across multiple subfields. Paper 1 is novel and rigorous within LLM efficiency, but its impact is narrower and more incremental relative to the broader, decision-grade clinical framing in Paper 2.

vs. Voices in the Loop: Mapping Participatory AI

claude-opus-4.6 · 5/19/2026

Paper 2 presents the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation in clinical AI, addressing fundamental methodological challenges (treatment confounding, observation bias, time-varying confounders). It has broader impact across clinical AI, causal inference, and healthcare decision-making, with direct applications to individualized treatment and learning health systems. Paper 1, while valuable for mapping participatory AI, is primarily a repository/atlas contribution with more limited methodological novelty and narrower downstream research utility.

vs. Allegory of the Cave: Measurement-Grounded Vision-Language Learning

gemini-3.1 · 5/19/2026

Paper 2 presents a paradigm-shifting framework for clinical AI, addressing critical flaws in static prediction by integrating causal inference and dynamic trajectory modeling. Its potential to reshape clinical decision-making offers broader societal applications, life-saving potential, and higher interdisciplinary impact compared to Paper 1, which focuses on a narrower, albeit technically novel, improvement in vision-language model processing.

vs. Stateful Reasoning via Insight Replay

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact: it proposes a unified, intervention-aware framework linking forecasting, counterfactual trajectory estimation, and policy evaluation while explicitly modeling treatment assignment and observation processes—key blockers to clinical deployment. Its real-world applicability to decision-grade medicine, methodological emphasis on identifiability, bias, and validation, and breadth spanning causal inference, time-series, and health systems suggest durable cross-field influence. Paper 2 is timely and useful for LLM test-time reasoning, but appears more incremental (a prompting/decoding strategy) with narrower domain impact and less foundational methodological shift.

vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel, concrete method (LGBO) with theoretical guarantees and empirical validation including wet-lab experiments, addressing a practical bottleneck in scientific discovery. It introduces a new mechanism (region-lifted preferences) integrating LLMs into Bayesian optimization with broad applicability across physics, chemistry, biology, and materials science. While Paper 1 provides a valuable unified review framework for clinical trajectory modeling, it is primarily a synthesis/review rather than introducing a new method. Paper 2's combination of theoretical novelty, cross-domain empirical results, and timeliness (LLM integration) gives it higher near-term impact potential.

vs. Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

gpt-5.2 · 5/19/2026

Paper 2 has higher potential scientific impact due to its broad, timely scope and cross-field relevance: it unifies trajectory forecasting, counterfactual estimation, and policy evaluation while explicitly addressing treatment-feedback, time-varying confounding, and informative observation—core obstacles to clinically valid AI. The proposed framework can guide methodology, evaluation standards, and deployment practices across many clinical domains, influencing both research and healthcare policy. Paper 1 is methodologically solid and high-impact industrially, but is more domain-specific (music search) and largely an adaptation/engineering advance rather than a field-shaping synthesis.

vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to a novel, mechanistic framing ("Safety Geometry Collapse") with clear, testable metrics and a demonstrated causal intervention, plus an immediately deployable, training-free inference method (ReGap) validated on multiple benchmarks. Its applications are timely and broad—improving safety of widely used multimodal LLMs across domains—and it offers a general representation-level perspective that can influence both safety research and model design. Paper 1 is a valuable, rigorous synthesis for clinical AI, but as a review/framework it is less directly transformative than a new method with strong empirical validation.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

gpt-5.2 · 5/19/2026

Paper 1 has higher estimated scientific impact due to broader cross-field relevance and real-world applicability: it targets clinical decision-making, where intervention-aware trajectory modeling can directly affect patient outcomes and health-system policy. Its unified framework integrating forecasting, counterfactual estimation, and policy evaluation while explicitly addressing treatment/confounding/observation bias is timely and methodologically consequential, potentially shaping evaluation standards and deployment practices. Paper 2 is technically strong with clear novelty and strong benchmarks in planning, but its impact is more field-specific (AI planning) and less immediately societally transformative.

vs. Learning Lifted Action Models from Traces with Minimal Information About Actions and States

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact: it offers a unified, decision-centric framework connecting forecasting, counterfactual trajectory estimation, and policy evaluation while explicitly modeling treatment assignment and observation processes—key limitations in current clinical ML. Its real-world applicability is strong (treatment-sensitive predictions, policy stress-testing, safer deployment in learning health systems) and highly timely given rapid clinical AI adoption and concerns about bias/causal validity. Paper 2 is methodologically rigorous and novel in planning/AI, but its applications are narrower and more specialized, likely yielding a smaller cross-field footprint than a broadly relevant clinical AI synthesis.