Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

SeungWon Seo, DongHeun Han, SeongRae Noh, HyeongYeop Kang

May 16, 2026

arXiv:2605.16725v1

cs.AI (primary)

#652 of 2292 · Artificial Intelligence

Tournament Score: 1456 ± 44

Win Rate: 64%

Rating: 6.5 / 10

Significance 6.5 · Rigor 5.5 · Novelty 7.5 · Clarity 7.5

Abstract

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models"

1. Core Contribution

The paper addresses a specific but important gap at the intersection of executable world models and online learning without semantic priors. The main contributions are threefold:

Baba in Wonderland benchmark: A controlled variant of Baba Is You where rule-property labels (STOP, WIN, PUSH, etc.) are remapped to semantically unrelated words (EAT, SHRINK, GROW, etc.), preserving dynamics while removing lexical shortcuts that LLMs exploit.

Alice system: A closed-loop architecture where failed program updates are treated as structural signal. When a candidate code revision explains a new transition but breaks previously explained ones, this "preservation conflict" reveals that the current program conflated distinct dynamics. These conflicts induce hypothesis classes that serve dual purposes: (1) compact, class-stratified counterexamples for LLM-based program updates, and (2) guidance for exploration via embedding novelty and class-rarity scoring.

The core insight — that rejected updates contain information about latent dynamics partitions — is elegant and well-motivated. This transforms a typically discarded failure signal into a reusable organizational structure.
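The benchmark construction described above can be illustrated with a small Python sketch. It assumes a fixed one-to-one renaming of rule-property words; the specific pairing (STOP to EAT, and so on) and the helper names `to_wonderland` and `blocks_movement` are hypothetical and are not taken from the paper. The point is only that the simulator's behaviour depends on the underlying property, while the agent sees the scrambled surface word.

```python
# Illustrative only: this pairing is an assumption, not the paper's actual table.
SURFACE = {"STOP": "EAT", "WIN": "SHRINK", "PUSH": "GROW"}
# Inverse map, used here to stand in for the unchanged simulator internals.
CANONICAL = {surface: prop for prop, surface in SURFACE.items()}


def to_wonderland(rule):
    """Rewrite a rule such as ('WALL', 'IS', 'STOP') with scrambled property words."""
    return tuple(SURFACE.get(token, token) for token in rule)


def blocks_movement(rule):
    """Toy simulator check: behaviour depends on the canonical property only."""
    _noun, _verb, prop = rule
    return CANONICAL.get(prop, prop) == "STOP"


original = ("WALL", "IS", "STOP")
scrambled = to_wonderland(original)
same_dynamics = blocks_movement(original) == blocks_movement(scrambled)
```

Because the renaming is a bijection, every rule keeps exactly one meaning, so any drop in performance on the scrambled variant can be attributed to the loss of lexical cues rather than to changed dynamics.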

2. Methodological Rigor

Strengths: The formalization is clean. The problem is cast as online executable hypothesis refinement in a deterministic MDP, with well-defined acceptance criteria (new transition explained + all previously explained transitions preserved). The hypothesis-class refinement is monotonic and principled — classes only split, never merge, providing a consistent refinement trajectory.
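The acceptance criterion and split-only refinement summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Alice's implementation: programs are modelled as plain callables `program(state, action)`, transitions as `(state, action, next_state)` tuples, and the function names and the two-way split rule in `split_class` are hypothetical.

```python
def explains(program, transition):
    """A program explains a transition if it reproduces the observed next state."""
    state, action, next_state = transition
    try:
        return program(state, action) == next_state
    except Exception:
        return False


def check_update(current, candidate, new_transition, history):
    """Accept only if the new transition is explained and nothing previously explained is lost."""
    explained = [t for t in history if explains(current, t)]
    conflicts = [t for t in explained if not explains(candidate, t)]
    accepted = explains(candidate, new_transition) and not conflicts
    return accepted, conflicts


def split_class(classes, class_id, conflicts):
    """Assumed refinement rule: split one class into kept and lost members; never merge."""
    members = classes[class_id]
    lost = [t for t in members if t in conflicts]
    kept = [t for t in members if t not in conflicts]
    if not lost or not kept:
        return dict(classes)  # nothing to separate
    refined = {cid: list(ts) for cid, ts in classes.items() if cid != class_id}
    refined[class_id + ".kept"] = kept
    refined[class_id + ".lost"] = lost
    return refined


# Toy 1-D world: "right" moves one cell, except at cell 3 where movement is blocked.
history = [(0, "right", 1), (1, "right", 2), (2, "stay", 2)]
new_transition = (3, "right", 3)
current = lambda s, a: s + 1 if a == "right" else s
too_blunt = lambda s, a: s  # explains the blocked move but forgets ordinary moves
refined_program = lambda s, a: s if (a != "right" or s == 3) else s + 1

accepted, conflicts = check_update(current, too_blunt, new_transition, history)
classes = split_class({"root": history}, "root", conflicts)
accepted_refined, conflicts_refined = check_update(
    current, refined_program, new_transition, history
)
```

In this toy case the rejected candidate is what separates ordinary moves from the rest of the history, which is the sense in which a failed update carries structural signal.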

The information-theoretic justification of the frontier score (Appendix A.3) connecting embedding novelty to entropy and class coverage to mutual information is a nice theoretical grounding, though it remains an approximation.
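The exact form of the frontier score is not reproduced in this assessment, so the following Python sketch is only an assumed stand-in: it combines a nearest-neighbour embedding-novelty term with an inverse-frequency class-rarity term using a hypothetical mixing weight. The function names, the additive combination, and the `weight` parameter are illustrative assumptions.

```python
import math


def embedding_novelty(candidate, seen, k=1):
    """Mean Euclidean distance to the k nearest previously seen embeddings."""
    if not seen:
        return 1.0  # assumed default when nothing has been observed yet
    nearest = sorted(math.dist(candidate, s) for s in seen)[:k]
    return sum(nearest) / len(nearest)


def class_rarity(class_id, class_counts):
    """Rarer hypothesis classes receive a larger bonus, in the range [0, 1]."""
    total = sum(class_counts.values())
    if total == 0:
        return 1.0
    return 1.0 - class_counts.get(class_id, 0) / total


def frontier_score(candidate, class_id, seen, class_counts, weight=0.5):
    """Assumed additive mix of novelty and rarity; not the paper's formula."""
    novelty = embedding_novelty(candidate, seen)
    rarity = class_rarity(class_id, class_counts)
    return weight * novelty + (1.0 - weight) * rarity


seen = [(0.0, 0.0), (1.0, 0.0)]
counts = {"A": 9, "B": 1}
near_common = frontier_score((0.0, 0.1), "A", seen, counts)
far_rare = frontier_score((3.0, 4.0), "B", seen, counts)
```

Under these assumptions a transition that is both far from seen embeddings and assigned to an underrepresented class ranks above one that is close and common, which matches the qualitative behaviour the review attributes to class-aware exploration.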

Concerns: The experimental evaluation, while showing strong results, has notable limitations:

Most results appear to be single runs ("each reported number reflects one run with a fixed configuration"), which is problematic given the stochastic nature of LLM-based code generation. No confidence intervals or variance estimates are provided.

The comparison baseline set is limited. Only WorldCoder is compared in the online setting; GIF-MCTS and CWM are offline-only comparisons.

The environment, while complex, is still deterministic, discrete, and symbolic. The authors acknowledge this limitation honestly.

The reliance on GPT-5.4 (a frontier model) makes reproducibility uncertain and cost-prohibitive for many researchers.
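The variance reporting that the first concern asks for could be as simple as the following Python sketch, assuming several independent runs per configuration were available. The helper name `summarize_runs` is hypothetical, and the scores in the usage line are synthetic placeholders, not results from the paper.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation, standard error) over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return mean, sd, sd / math.sqrt(len(scores))


# Synthetic placeholder scores, used only to show the call.
mean, sd, sem = summarize_runs([0.5, 0.6, 0.7])
```

With only a handful of runs the standard error is itself noisy, so it is better read as a rough indication of run-to-run spread than as a tight confidence interval.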

The ablation study is well-designed, isolating hypothesis-class refinement (Single-Class vs. Root-Class vs. Alice) and exploration components (BFS, w/o r_h, w/o r_C). Both ablations clearly demonstrate the value of each component.

3. Potential Impact

Direct impact: The work establishes a concrete methodology for building executable world models when pretrained semantic priors are unreliable. This is relevant for novel domains, opaque simulators, and environments with unfamiliar terminology — scenarios where executable models are most valuable.

Broader implications:

The "failed updates as structural signal" principle could generalize beyond world models to any iterative code synthesis task where preservation constraints matter (e.g., automated program repair, incremental API synthesis).

The benchmark design philosophy — preserving dynamics while scrambling semantics — is a clean experimental methodology that could be applied to other domains to test whether systems truly learn dynamics vs. exploit surface semantics.

The hypothesis-class refinement could inform active learning and curriculum design in other program synthesis contexts.

Limitations on impact: The approach is tightly coupled to deterministic, discrete environments where exact execution checking is feasible. Extension to stochastic, continuous, or partially observable domains (as the authors note) would require fundamentally different notions of "explanation" and "preservation."

4. Timeliness & Relevance

The paper is highly timely. As LLMs are increasingly used as world models and code generators, understanding when they succeed due to genuine dynamics understanding vs. semantic shortcuts is critical. The "semantic inertia" problem (cited as [35]) is gaining recognition, and this work provides both a diagnostic tool (Baba in Wonderland) and a solution approach (Alice).

The executable world model paradigm is growing (WorldCoder, GIF-MCTS, CWM, PoE-World), and this paper addresses a genuine gap — none of the prior methods handle prior misalignment well, as demonstrated empirically.

5. Strengths & Limitations

Key strengths:

The central insight (preservation conflicts as structural signal) is novel, well-articulated, and practically useful

Clean experimental design that isolates the semantic-shortcut problem

The dual use of hypothesis classes for both update evidence selection and exploration guidance is an efficient architectural choice

Thorough ablations and additional experiments (backbone variation, hyperparameter sensitivity, qualitative failure analysis)

Honest discussion of limitations and failure cases

Notable weaknesses:

Single-run evaluation undermines statistical confidence in the reported numbers

The environment scope is narrow (one game family, deterministic, discrete, symbolic)

Heavy dependence on frontier LLM capabilities (GPT-5.4) — the approach's effectiveness with smaller models drops substantially (Table 9)

The budget of 100 LLM calls and the level count (32+8) are relatively small in scale

The heuristic dynamics discovery used for Balanced Acc. evaluation introduces evaluation-side assumptions that could be scrutinized

Scalability concerns: The full preservation check (re-running the program on all previously explained transitions) grows linearly with dataset size. While the paper manages this in a small-scale setting, scaling to larger environments may require approximate preservation checking.
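One possible form of approximate preservation checking is sketched below in Python. It is not described in the paper; it is an assumed approach that checks a candidate on a class-stratified sample of previously explained transitions instead of all of them. The names `stratified_sample` and `approx_preserved` and the `per_class` parameter are hypothetical.

```python
import random


def stratified_sample(classes, per_class, rng):
    """Draw up to `per_class` transitions from each hypothesis class."""
    sample = []
    for members in classes.values():
        k = min(per_class, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def approx_preserved(candidate, classes, per_class=2, seed=0):
    """Check the candidate on a sample only; may miss conflicts outside the sample."""
    rng = random.Random(seed)
    sample = stratified_sample(classes, per_class, rng)
    failures = [t for t in sample if candidate(t[0], t[1]) != t[2]]
    return not failures, failures


classes = {
    "move": [(0, "right", 1), (1, "right", 2), (2, "right", 3)],
    "blocked": [(3, "right", 3)],
}
good = lambda s, a: s if s == 3 else s + 1
bad = lambda s, a: s + 1  # ignores the blocked case

ok_good, failures_good = approx_preserved(good, classes)
ok_bad, failures_bad = approx_preserved(bad, classes)
```

Sampling per class keeps rare dynamics represented in every check, so the cost scales with the number of classes rather than the size of the history. The trade-off is that acceptance becomes probabilistic rather than exact.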

Summary

This paper presents a creative and well-motivated approach to a real problem in executable world-model learning. The insight about leveraging failed updates is genuinely novel and the benchmark design is clean. However, the limited experimental scale, single-run evaluation, and restriction to deterministic symbolic environments temper the strength of the empirical claims. The work opens an interesting research direction but would benefit from broader environmental validation and stronger statistical evidence.

Rating: 6.5 / 10

Significance 6.5 · Rigor 5.5 · Novelty 7.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (22)

vs. Latent Action Reparameterization for Efficient Agent Inference

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: inference efficiency and long action horizons. By proposing a learned latent action space, it offers a highly scalable and broadly applicable solution to reduce compute costs while maintaining performance. Paper 2 presents an interesting approach to executable world models and prior misalignment, but its focus is more niche and theoretical. Given the widespread deployment of LLM agents and the pressing need for efficiency improvements, Paper 1 has higher potential for immediate real-world application and broader impact across the AI community.

vs. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

gemini-3.1 · 5/20/2026

Paper 2 addresses a critical and highly relevant bottleneck in the rapidly growing field of LLM agents: the degradation of self-evolving skill libraries over time. By formalizing 'library drift' and providing concrete, generalizable diagnostics and verified fixes, it offers immediate, broad impact for agentic AI systems. While Paper 1 presents an innovative approach to world model discovery, its evaluation is currently limited to a specific game environment, whereas Paper 2 tackles a ubiquitous systemic failure in modern autonomous agents with clear empirical improvements.

vs. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in LLM-based multi-agent systems: handling conflicting information and preventing error propagation. Its signed graph modeling approach is highly innovative and has broad, immediate real-world applications across numerous domains utilizing LLMs. While Paper 2 presents a novel method for executable world models, its evaluation in a specific puzzle-game environment suggests a narrower immediate impact compared to the highly relevant and widely applicable multi-agent framework proposed in Paper 1.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gpt-5.2 · 5/19/2026

Paper 2 is more novel and broadly impactful: it tackles online self-supervised discovery of executable dynamics under severe prior misalignment, introducing a principled conflict-driven learning mechanism that can generalize to world-model learning, program induction, and agent exploration beyond the specific benchmark. Its methodological contribution (using preservation conflicts to refine hypothesis classes and guide exploration) is research-facing and likely to influence multiple fields (RL, foundation models, formal methods). Paper 1 is highly applicable and timely for enterprise automation, but is more systems-integration/domain-specific and thus likely narrower in cross-field scientific impact.

vs. Dynamics of collective creativity in AI art competitions

claude-opus-4.6 · 5/19/2026

Paper 2 has higher estimated scientific impact due to its broader interdisciplinary relevance spanning cultural evolution, creativity research, human-AI interaction, and computational social science. It analyzes a large empirical dataset (130K+ images) revealing fundamental dynamics of collective creativity in human-AI systems—a timely topic with growing real-world significance. Its findings about cultural attractors, the paradox of novelty preference vs. remixing behavior, and group-size effects have implications across multiple fields. Paper 1, while technically innovative, addresses a narrower problem in executable world models within a specific game environment, limiting its breadth of impact.

vs. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

gpt-5.2 · 5/19/2026

Paper 2 is more novel and broadly impactful: it tackles online self-supervised discovery of executable world models under severe prior misalignment, a central obstacle for robust autonomous agents beyond language priors. The proposed closed-loop mechanism (using preservation conflicts to refine hypothesis classes and drive exploration) is conceptually innovative and potentially applicable across robotics, planning, program induction, and scientific discovery. While Paper 1 is timely and practical for LLM agents, its contributions are more engineering/system-level and narrower in scope. Paper 2’s ideas could influence multiple fields and future agent foundations.

vs. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a more fundamental and broadly applicable contribution—online self-supervised discovery of executable world models from interaction alone, addressing the deep problem of prior misalignment in program synthesis for planning. This has broad implications across AI, robotics, and reinforcement learning. Paper 1, while solid and useful for nanomedicine literature mining, is more domain-specific and incremental (applying existing techniques like embeddings, graph analysis, and LLM workflows to a specific literature corpus). Paper 2's methodological novelty (treating failed updates as structural signal, preservation conflicts) represents a more transferable conceptual advance.

vs. A Global-Local Graph Attention Network for Traffic Forecasting

gemini-3.1 · 5/19/2026

Paper 2 addresses fundamental AI challenges in world modeling, symbolic reasoning, and self-supervised learning, offering a novel approach to discovering environment dynamics without relying on semantic shortcuts. This has broad implications for reinforcement learning and general AI. In contrast, Paper 1 presents a relatively incremental architectural improvement (global-local attention) tailored to a specific application (traffic forecasting), resulting in narrower potential scientific impact.

vs. PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical bottleneck in LLM multi-turn agent optimization by improving reward assignment for GRPO, a highly relevant and timely topic in modern AI. Its approach operates without external model calls or ground-truth dependencies, offering immediate practical utility for scaling LLM reasoning. In contrast, Paper 1 focuses on executable world models in a specific puzzle game environment, which, while theoretically novel, has a narrower scope and less immediate real-world applicability compared to LLM reinforcement learning.

vs. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel framework (Alice) for online self-supervised discovery of executable world models under prior misalignment—a fundamental challenge in AI. Its approach of treating failed updates as structural signal and using preservation conflicts for hypothesis refinement is methodologically innovative and broadly applicable to program synthesis, model-based RL, and world modeling. Paper 2 contributes a domain-specific benchmark (gaming short-video frame search) that, while useful for evaluating multimodal agents, is narrower in scope, more incremental in nature, and primarily serves as an evaluation resource rather than advancing new methodology with broad theoretical implications.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

claude-opus-4.6 · 5/19/2026

ChemVA addresses a practical, high-demand problem—enabling LLMs to understand chemical reaction diagrams—with clear benchmarks (92% accuracy, ~20pp gains across 9 LLMs) and broad applicability in chemistry, drug discovery, and scientific literature mining. The introduction of OCRD-Bench provides a lasting community resource. Paper 2, while intellectually interesting in studying executable world models under prior misalignment, targets a narrower AI/RL niche (a variant of Baba Is You) with less immediate real-world applicability and a smaller potential user community.

vs. Mitigating Cognitive Bias in RLHF by Altering Rationality

gpt-5.2 · 5/19/2026

Paper 2 is more novel and broadly impactful: it tackles online discovery of executable world dynamics under prior misalignment (no rewards, no rule text, unreliable lexical priors), a harder and less-studied setting than adjusting RLHF rationality. Its closed-loop use of preservation conflicts to refine hypothesis classes and drive exploration is an innovative methodological contribution with potential applications in robotics, autonomous agents, program induction, and planning. The Baba-in-Wonderland benchmark and ablations suggest solid rigor and timely relevance to world-model-based agents. Paper 1 is useful but more incremental within established RLHF pipelines.

vs. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a novel online self-supervised paradigm for learning executable world models under explicit prior/lexical misalignment, with a closed-loop mechanism (preservation conflicts) that couples model revision and exploration. This advances core problems in model-based RL, program induction, and robust representation learning, with broad applications in planning, interpretability, and agents in changing domains. Paper 1 is timely and important for LLM evaluation practice, but is more diagnostic/measurement-focused and narrower to LLM-as-judge safety pipelines. Both seem rigorous, but Paper 2’s approach is more broadly generalizable.

vs. Primal-Dual Guided Decoding for Constrained Discrete Diffusion

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a broadly applicable, inference-time constrained decoding method for discrete diffusion with theoretical guarantees, no retraining, and no extra model calls—properties that ease adoption. Its applicability spans multiple high-value domains (text, molecules, music) and aligns with timely demand for controllable generative models, suggesting wide cross-field uptake. Paper 1 is novel and rigorous within executable world-model learning under prior misalignment, but its impact may be narrower and more benchmark-specific, with less immediate transfer to mainstream generative/modeling workflows.

vs. HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

gpt-5.2 · 5/19/2026

Paper 1 is more novel and broadly impactful: it tackles online self-supervised discovery of executable world models under prior/lexical misalignment, introducing a principled mechanism (preservation conflicts) for structural hypothesis refinement and exploration. This advances core problems in model-based RL, program induction, and robust representation learning, with applications in planning, robotics, and interpretable/controllable agents. The misalignment setting is timely for robustness and generalization. Paper 2 is a solid applied NLP contribution, but personality prediction is narrower, more domain-specific, and often limited by dataset/label validity and ethical constraints, reducing likely long-term scientific impact.

vs. TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

claude-opus-4.6 · 5/19/2026

TTE-Flash addresses a widely relevant problem—reducing inference cost of reasoning-enhanced multimodal embeddings—with broad applicability across retrieval, classification, and video understanding tasks. Its latent think tokens replacing explicit CoT is a practical innovation with immediate utility for scaling multimodal systems. Paper 2, while intellectually interesting in its niche of executable world models and the Baba Is You domain, targets a narrower community and a more specialized problem (prior misalignment in program synthesis for game environments), limiting its breadth of impact.

vs. Revealing Interpretable Failure Modes of VLMs

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to stronger novelty and broader relevance: it tackles online self-supervised discovery of executable world models under explicit prior/lexical misalignment, a hard setting central to robust agent learning and program induction. The closed-loop mechanism that uses preservation conflicts to refine hypothesis classes and drive exploration is a distinctive methodological contribution with likely applicability beyond the specific benchmark (e.g., sim-to-real, language-misaligned environments, model-based RL). Paper 1 is timely and useful for VLM safety, but is more domain-/model-specific and primarily diagnostic rather than introducing a generally reusable learning paradigm.

vs. A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a fundamentally novel paradigm for executable world model learning—using preservation conflicts as structural signal for online dynamics discovery without reward or lexical priors. This addresses a deeper, more general problem in AI (world modeling, program synthesis, exploration) with broader implications for planning and reasoning systems. Paper 2, while methodologically sound, addresses a more incremental improvement in sleep stage classification through conflict-aware multi-modal fusion, which is a narrower application domain with less transformative potential across the field.

vs. Property-Guided LLM Program Synthesis for Planning

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it tackles a harder and more broadly relevant problem (online world-model/dynamics discovery under prior misalignment) with applications to robotics, games, and agents learning in novel environments. The approach (conflict-driven refinement + class-aware exploration) addresses key failure modes in executable model learning and is timely for foundation-model agents. Paper 1 is methodologically strong and practical for planning/program synthesis, but its applicability depends on having verifiable formal properties, making its impact narrower despite clear efficiency gains.

vs. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel algorithmic contribution (Alice) addressing the fundamental problem of learning executable world models under prior misalignment, with a principled mechanism for using preservation conflicts as structural signal. This tackles a core challenge in AI—learning dynamics from interaction without semantic priors—with broader implications for planning, model-based RL, and program synthesis. Paper 2 is primarily a systems-level analysis of integrating existing LLM agent paradigms in a specific framework, offering incremental engineering insights but limited methodological novelty. Its contributions are more descriptive and application-specific, with narrower scientific impact.