Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang
Abstract
As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.
AI Impact Assessments (3 models)
Scientific Impact Assessment: "Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond"
1. Core Contribution
This paper proposes a "levels × laws" taxonomy for world modeling in agentic AI, organized along two axes: three capability levels (L1 Predictor → L2 Simulator → L3 Evolver) and four governing-law regimes (physical, digital, social, scientific). The central thesis is that the fragmented use of "world model" across communities—reinforcement learning, video generation, web agents, scientific discovery—can be unified through a capability-based lens that abstracts away modality-specific concerns.
The L1/L2/L3 hierarchy offers testable boundary conditions: L1 covers one-step local prediction; L2 extends to multi-step rollouts satisfying coherence, intervention sensitivity, and constraint consistency; L3 adds evidence-driven model revision. The paper synthesizes over 400 works, summarizes 100+ representative systems, and maps them onto this two-dimensional coordinate system.
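To make the three levels concrete, the following is a minimal sketch on a toy one-dimensional system. It is not taken from the paper: the `ToyWorldModel` class, its linear dynamics, the `gain` parameter, and the state bounds are all assumptions chosen for illustration. Note also that `revise` only refits a parameter, which arguably sits closer to online adaptation than to the structural revision the paper associates with L3; it is included only to show where a revision step would sit relative to prediction and rollout.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyWorldModel:
    """Hypothetical 1-D model (assumption): next_state = state + gain * action."""
    gain: float = 1.0

    def predict(self, state: float, action: float) -> float:
        # L1 Predictor: a single one-step local transition.
        return state + self.gain * action

    def rollout(self, state: float, actions: List[float],
                lower: float = -10.0, upper: float = 10.0) -> List[float]:
        # L2 Simulator: compose one-step predictions into a multi-step,
        # action-conditioned trajectory. Clipping to [lower, upper] stands in
        # for a domain-law constraint (the bounds are an assumption).
        trajectory = [state]
        for action in actions:
            state = min(max(self.predict(state, action), lower), upper)
            trajectory.append(state)
        return trajectory

    def revise(self, state: float, action: float, observed: float,
               tol: float = 1e-6) -> bool:
        # L3-style step (simplified): if the prediction disagrees with new
        # evidence, refit the model. Here only the gain is re-estimated.
        error = observed - self.predict(state, action)
        if abs(error) > tol and action != 0:
            self.gain = (observed - state) / action
            return True
        return False


model = ToyWorldModel(gain=1.0)
one_step = model.predict(0.0, 2.0)               # L1 query
trajectory = model.rollout(0.0, [1.0, 1.0, 1.0])  # L2 query
revised = model.revise(0.0, 2.0, 4.0)             # evidence contradicts gain=1.0
```

In this sketch the level boundaries correspond to which methods a system can support: `predict` alone is L1, `predict` plus a constraint-respecting `rollout` is L2, and the addition of a `revise` step triggered by failed predictions points toward L3.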
The most distinctive conceptual contribution is the formalization of L3 as a separate capability level, arguing that evidence-driven model revision (design–execute–observe–reflect loops) is qualitatively different from L2 simulation. This draws explicitly from philosophy of science (Lakatos's hard core vs. protective belt, Duhem-Quine holism) to motivate when parameter updates suffice versus when structural revision is needed.
2. Methodological Rigor
Because this is a position-driven survey rather than an empirical paper, its rigor must be evaluated differently. The paper's formal apparatus is coherent: the POMDP-grounded notation system is clean, the L1/L2/L3 definitions have explicit mathematical formulations (Equations 1-4, the L2 trajectory query with compatibility term ϕ_c(τ), the L3 revision operator), and the boundary conditions between levels are stated as testable criteria.
However, several concerns emerge. First, the boundary conditions, while conceptually crisp, are not operationally validated: no experiment demonstrates that the proposed tests actually discriminate levels in practice. The paper acknowledges this ("this paper does not introduce a new benchmark"), but the omission limits the taxonomy's immediate empirical utility. Second, the L3 category is arguably the weakest: the examples cited (CAMEO, A-Lab, FunSearch) are heterogeneous systems that the authors themselves acknowledge only partially satisfy L3 criteria (Table 8 shows many systems missing the "Reflect" step). The conceptual boundary between sophisticated L2 online adaptation and genuine L3 revision remains somewhat fuzzy despite the formal definitions. Third, the philosophical motivations, while intellectually stimulating, are acknowledged as "heuristic rather than historical or one-to-one," and some mappings (e.g., Plato's Cave to epistemic drift) feel more illustrative than substantive.
3. Potential Impact
The paper's greatest potential impact lies in its community-bridging function. By providing a shared vocabulary across RL, computer vision, NLP, robotics, and AI-for-science, it could reduce the conceptual fragmentation that currently impedes cross-pollination. The governing-law regime axis is particularly useful: it explains *why* techniques that work in one domain fail in another (e.g., physical constraint verification is analytically tractable while social constraints are reflexive and normative).
Specific high-impact elements include:
The paper could influence adjacent fields through the L3 framing for scientific discovery, which connects autonomous experimentation to world-modeling theory in a way that neither community has articulated clearly.
4. Timeliness & Relevance
The paper is exceptionally timely. The explosion of LLM-based agents, video world models (Sora, Genie), and autonomous scientific laboratories creates an urgent need for conceptual unification. The debate over whether generative models are "genuine world simulators" versus "plausible generators" is active and unresolved; this paper's capability-based framing offers a more precise way to state the question. The practical question of when to trust a world model for planning—and when model revision is needed—is increasingly critical as deployment scales.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's "position-driven survey" framing is both a strength and a limitation. It avoids the blandness of pure surveys while maintaining comprehensiveness, but the normative claims (e.g., "the future of agentic AI lies not in larger predictors") are stated more strongly than the evidence warrants. The GitHub repository and project page suggest community-building ambitions that could amplify impact if maintained.
The paper would benefit from a compact "diagnostic decision tree" that practitioners could apply to classify their own systems, making the taxonomy immediately actionable rather than retrospectively descriptive.
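As a rough illustration of what such a decision tree might look like, the sketch below encodes the boundary conditions summarized earlier in this assessment (one-step prediction for L1; multi-step rollouts with coherence, intervention sensitivity, and constraint consistency for L2; evidence-driven revision for L3). It is not the paper's procedure: the `SystemProfile` fields, the `classify_level` function, and the assumption that the levels are strictly cumulative are all hypothetical, and how each boolean would be established in practice is exactly the part that remains unvalidated.

```python
from dataclasses import dataclass


@dataclass
class SystemProfile:
    """Hypothetical checklist of capabilities for a candidate system."""
    one_step_prediction: bool
    multi_step_rollout: bool
    coherence: bool
    intervention_sensitivity: bool
    constraint_consistency: bool
    evidence_driven_revision: bool


def classify_level(profile: SystemProfile) -> str:
    # Assumption: levels are cumulative, so each level requires the one below.
    if not profile.one_step_prediction:
        return "below L1"
    meets_l2 = (
        profile.multi_step_rollout
        and profile.coherence
        and profile.intervention_sensitivity
        and profile.constraint_consistency
    )
    if not meets_l2:
        return "L1 Predictor"
    if not profile.evidence_driven_revision:
        return "L2 Simulator"
    return "L3 Evolver"


# Example: a system with sound rollouts but no revision loop.
simulator_like = SystemProfile(True, True, True, True, True, False)
level = classify_level(simulator_like)
```

A real diagnostic would replace each boolean with a measurable test, which is where the boundary between L2 online adaptation and L3 revision would need to be pinned down.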
Generated Apr 27, 2026
Comparison History (38)
Paper 1 provides a comprehensive taxonomy and roadmap unifying a highly active and fragmented field (AI agents and world models). Foundational survey papers of this scale in rapidly growing areas typically achieve immense citation impact and shape future research directions across multiple disciplines. While Paper 2 offers a rigorous and valuable methodological contribution to causal inference, Paper 1 has significantly broader appeal, timeliness, and potential to impact a wider array of scientific and engineering communities.
Paper 2 likely has higher scientific impact: it proposes a unifying “levels × laws” taxonomy that bridges multiple major communities (model-based RL, generative video, web/GUI agents, multi-agent simulation, and AI-for-science), synthesizes 400+ works, and offers evaluation principles plus a reproducible package, positioning it to shape shared terminology, benchmarks, and research agendas across these communities at an opportune time. Paper 1 is more applied and benchmark-driven with strong engineering value, but its impact may be narrower (agent framework ecosystem) and more sensitive to rapid tooling churn.
Paper 2 is likely to have higher scientific impact because it provides a unifying taxonomy (“levels × laws”), a broad synthesis of 400+ works across multiple subfields, and concrete evaluation principles/roadmap that can standardize terminology and benchmarking across communities. Its breadth (physical/digital/social/scientific regimes) and timeliness for agent/world-model research suggest cross-field influence. Paper 1 is a strong engineering contribution with promising results, but its impact may be narrower and more contingent on adoption and reproducibility of a specific framework and benchmark gains.
Paper 1 likely has higher impact: it offers a unifying “levels × laws” taxonomy spanning physical/digital/social/scientific regimes, synthesizes 400+ works, and proposes evaluation principles and a reproducible package—assets that can standardize a fast-moving field and influence many subcommunities (model-based RL, generative video, GUI/web agents, multi-agent sims, AI4science). Its applications are broad and timely given agentic AI trends. Paper 2 is more conceptually novel and rigorous in a narrower niche (collective agency via causal games/abstraction), but its immediate cross-field uptake and tooling appear more limited.
Paper 2 presents a comprehensive taxonomy and roadmap for agentic world modeling, synthesizing over 400 works across multiple AI subfields (MBRL, video generation, web agents, social simulation, scientific discovery). Its breadth of impact is substantially larger, connecting previously isolated communities and addressing a central bottleneck in AI agent development. While Paper 1 offers a novel theoretical contribution using evolutionary game theory to explain shortcut learning—which is rigorous and interesting—its scope is narrower. Paper 2's timeliness, given the rapid growth of AI agents, and its potential to shape research directions across multiple fields give it higher estimated impact.
Paper 1 likely has higher impact: it introduces a concrete, novel algorithm (JACTUS) that jointly optimizes compression and adaptation with clear methodological contributions (covariance estimation, union-of-subspaces projection, global rank allocation) and strong empirical results across vision and language, enabling immediate real-world deployment benefits (smaller, tunable models without full frozen weights). Paper 2 is timely and broad, but is primarily a taxonomy/survey/roadmap; its impact depends on downstream adoption of proposed evaluation principles rather than demonstrating a new, validated method.
Paper 2 provides a foundational taxonomy and roadmap for a rapidly growing and highly interdisciplinary field (Agentic World Modeling). By synthesizing over 400 works and proposing a new evaluation framework across physical, digital, social, and scientific domains, it has the potential to unify disparate research communities and guide future agendas. While Paper 1 offers a strong, rigorous methodological improvement for model compression and tuning, Paper 2's immense breadth of impact and timely conceptualization give it a higher potential for widespread scientific influence and citations.
Paper 1 introduces a comprehensive taxonomy for agentic world models that synthesizes 400+ works across multiple major AI subfields (MBRL, video generation, web agents, social simulation, scientific discovery). Its breadth of impact spans robotics, software agents, multi-agent systems, and scientific AI. The structured 'levels x laws' framework provides conceptual infrastructure likely to be widely adopted. Paper 2 addresses an important but narrower policy/governance question about AI-authored publications. While timely, its impact is largely confined to academic publishing practices rather than driving new scientific or technical advances.
Paper 2 likely has higher scientific impact due to broader cross-field reach and timeliness: it proposes a unifying taxonomy (“levels × laws”) for agentic world models spanning physical, digital, social, and scientific regimes, synthesizes 400+ works, and offers evaluation principles and a reproducible package—potentially shaping research agendas across AI, robotics, HCI, and AI-for-science. Paper 1 is more methodologically concrete with strong domain applications, but its impact is narrower (biomedical text/table/figure mining) and current performance (F1=0.32) suggests incremental near-term adoption.
Paper 1 likely has higher impact: it proposes a broad, integrative taxonomy (“levels × laws”), synthesizes a very large body of work, and offers evaluation principles and a reproducible package—positioning it to shape research agendas across model-based RL, generative modeling, agentic software, multi-agent simulation, and AI for science. Its breadth and timeliness (agentic systems) increase cross-field influence. Paper 2 is novel and useful for reproducibility, but is narrower in scope, with more limited methodological depth and likely impact confined to inference/reproducibility practices rather than multiple subfields.
Paper 1 presents a comprehensive taxonomy and roadmap unifying the rapidly growing field of world models and agentic AI. By synthesizing over 400 works and establishing a common framework across diverse domains, it is likely to shape future research directions, terminology, and evaluation standards across multiple communities. This foundational scope typically yields broader, longer-lasting scientific impact and higher citation counts than the specific, albeit highly effective, algorithmic improvements presented in Paper 2.
Paper 1 provides a comprehensive taxonomic framework synthesizing 400+ works across multiple AI subfields, connecting isolated research communities and charting a roadmap for agentic world modeling. Its breadth of impact spans model-based RL, video generation, web agents, social simulation, and scientific discovery. While Paper 2 presents a solid engineering contribution with strong benchmark results for agent memory, it addresses a narrower problem. Paper 1's unifying framework, governance considerations, and evaluation principles have broader potential to shape research directions across multiple fields.
Paper 2 has higher potential scientific impact due to its massive breadth and unifying framework. While Paper 1 offers a valuable technical methodology for mechanistic interpretability, Paper 2 synthesizes over 400 works across rapidly growing fields (RL, web agents, scientific discovery) to establish a foundational taxonomy for Agentic World Modeling. By defining terminology, categorizing capabilities, and outlining a roadmap, Paper 2 is positioned to become a heavily cited, field-defining survey that shapes the trajectory of multiple AI sub-communities moving toward agentic systems.
Paper 2 provides a foundational taxonomy and comprehensive synthesis of over 400 works across multiple rapidly growing domains (RL, web agents, scientific discovery). As a unifying framework for world models, it has significantly broader applicability, higher potential to shape future research agendas, and will likely garner more citations across diverse communities compared to Paper 1's specific methodological contribution to mechanistic interpretability.
Paper 2 provides a foundational taxonomy, synthesis of over 400 works, and a roadmap for 'world models,' a central and highly active concept in AI. Its broad scope across physical, digital, social, and scientific domains gives it immense potential to unify communities and shape future research, leading to broader impact. Paper 1 introduces a valuable but narrower benchmark for agent retrieval, which has a more limited scope compared to the foundational framework proposed in Paper 2.
Paper 1 presents novel empirical findings about internal emotion representations in LLMs that causally influence behavior, including alignment-critical behaviors like reward hacking and sycophancy. This is a groundbreaking mechanistic interpretability result with direct implications for AI safety—a high-priority area. While Paper 2 is a comprehensive survey/taxonomy of agentic world modeling synthesizing 400+ works, surveys generally have less transformative impact than novel empirical discoveries. Paper 1's findings open a new research direction connecting emotion concepts to misalignment, with immediate practical implications for making AI systems safer.
Paper 1 addresses a critical bottleneck in autonomous agent deployment—judgment under uncertainty—by introducing a novel, game-proof benchmark (HiL-Bench) and metric (Ask-F1). It also demonstrates that this judgment is trainable via RL. While Paper 2 provides a valuable comprehensive taxonomy and survey of world models, Paper 1 offers a direct, actionable empirical contribution that introduces a new paradigm for evaluating and training human-in-the-loop agentic behavior, likely driving immediate advancements in agent reliability and safety.
Paper 1 likely has higher scientific impact: it proposes a unifying “levels × laws” taxonomy for agentic world models, synthesizes >400 works across multiple AI subfields, and offers evaluation principles and reproducible tooling—positioning it as a broadly reusable framework with cross-domain relevance (robotics, web agents, multi-agent systems, scientific discovery). Its breadth and timeliness with rapidly expanding agent research increase downstream citations and influence. Paper 2 is methodologically solid and clinically relevant, but is narrower in scope (AD prognosis) and more incremental (Transformer fusion + Deep Markov dynamics) within an established digital-twin literature.
Paper 2 provides a comprehensive taxonomic framework ('levels x laws') for agentic world modeling, synthesizing 400+ works across multiple major AI subfields (MBRL, video generation, web agents, social simulation, scientific discovery). Its breadth of impact is substantially larger, connecting isolated research communities and providing architectural guidance and evaluation principles. While Paper 1 introduces a useful and practical protocol (epistemic blinding) addressing an important LLM auditing problem, it targets a narrower methodological concern. Paper 2's survey-and-framework nature positions it as a foundational reference that will likely accumulate more citations across diverse AI research areas.
Paper 1 exposes a critical, measurable flaw in current AI safety paradigms where alignment filters inadvertently cause harm by withholding life-saving information. Its rigorous, pre-registered empirical methodology and urgent real-world life-safety implications offer profound impact, likely forcing an immediate paradigm shift in AI alignment and healthcare AI. While Paper 2 provides a comprehensive and valuable taxonomy for world models, Paper 1 presents highly novel empirical findings that directly challenge and reshape existing AI safety assumptions.