Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen
Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.
MiraBench introduces a hierarchical evaluation framework that shifts the assessment of robotic world models from visual fidelity to action-conditioned reliability — whether predictions are physically plausible, faithful to commanded actions, and correctly preserve failure outcomes. The benchmark decomposes reliability into three nested levels: Physics Adherence (reference-free physical consistency), Action-Following Fidelity (whether outputs respect task-relevant inputs), and Optimism Bias Detection (whether models hallucinate success under failure-inducing actions).
The most novel conceptual contribution is the formalization of Optimism Bias — a systematic tendency of world models to predict successful outcomes when conditioned on actions that should produce failure. This is grounded in the observation that robot learning datasets are overwhelmingly composed of successful demonstrations, inducing a strong success prior that overrides contradictory action signals. The paper formalizes this as a measurable quantity (Eq. 3) and constructs a perturbation taxonomy of six physically interpretable failure modes to probe it.
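To make the idea of a measurable optimism bias concrete, the following is a minimal sketch, not the paper's Eq. 3, whose exact form is not reproduced here. It assumes optimism bias can be summarized as the fraction of failure-inducing rollouts that the world model nonetheless renders as successes. The `Rollout` type, its field names, and the perturbation labels are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    perturbation: str        # hypothetical label for the injected failure mode
    should_fail: bool        # ground truth: the commanded action should not succeed
    predicted_success: bool  # judged outcome of the model's generated future


def optimism_bias_rate(rollouts):
    """Fraction of failure-inducing rollouts that were judged as successes.

    Assumed definition for illustration only; the benchmark's own
    formalization may differ.
    """
    failing = [r for r in rollouts if r.should_fail]
    if not failing:
        raise ValueError("no failure-inducing rollouts to evaluate")
    return sum(r.predicted_success for r in failing) / len(failing)


# Toy data: three failure-inducing rollouts, one nominal rollout.
rollouts = [
    Rollout("perturbation_a", True, True),   # should fail, model shows success
    Rollout("perturbation_a", True, False),  # should fail, model shows failure
    Rollout("perturbation_b", True, True),   # should fail, model shows success
    Rollout("nominal", False, True),         # excluded: action should succeed
]
rate = optimism_bias_rate(rollouts)
```

Under this assumed definition, a rate near 0 means failures are preserved, while a rate near 1 means the model almost always hallucinates success regardless of the action.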
Direct impact on robotics: The benchmark addresses a genuine and underappreciated problem. If world models are to serve as scalable simulators for robot learning, their fidelity to action conditioning — especially under failure — is paramount. The finding that visual quality is a poor proxy for action fidelity (Finding 1) and that optimism bias is pervasive (Finding 3) should influence how the community evaluates and trains world models.
Implications for training paradigms: The result that success-only post-training improves action-following but degrades failure preservation (Finding 2) has direct implications for training data curation and objective design. This motivates contrastive action-outcome learning and failure-aware curricula.
Broader evaluation methodology: The hierarchical diagnostic structure — where lower-level failures preclude meaningful higher-level assessment — is a principled design that could influence benchmark construction in adjacent domains (autonomous driving world models, embodied navigation).
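The gating logic implied by such a hierarchy can be sketched as follows. This is an illustrative reading rather than the benchmark's implementation: the level keys, the higher-is-better score convention, and the thresholds are all assumptions introduced here.

```python
# Levels in order of increasing demand (names are assumed keys, not an official API).
LEVELS = ("physics_adherence", "action_following", "optimism_bias")


def gated_scores(raw, thresholds):
    """Report each level's score, or None once a lower level has failed.

    raw:        dict mapping level name -> score in [0, 1], higher is better (assumed)
    thresholds: dict mapping level name -> minimum score needed to unlock
                the next level (hypothetical values)
    """
    out = {}
    blocked = False
    for level in LEVELS:
        if blocked:
            # A lower level failed, so this level is not meaningfully assessable.
            out[level] = None
            continue
        out[level] = raw[level]
        if level in thresholds and raw[level] < thresholds[level]:
            blocked = True
    return out


raw = {"physics_adherence": 0.4, "action_following": 0.8, "optimism_bias": 0.3}
thresholds = {"physics_adherence": 0.5, "action_following": 0.5}
report = gated_scores(raw, thresholds)
```

In this toy case the physics score falls below its threshold, so the two higher levels are reported as `None` instead of being compared against other models.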
Limitations to impact: The benchmark is currently restricted to tabletop manipulation with short-horizon contact dynamics. The perturbation taxonomy, while extensible, covers a relatively narrow slice of possible failures. The reliance on VLM-based evaluation introduces its own reliability ceiling, and the 12-model evaluation, while broad, is dominated by variants of a few architectures (the Cosmos/DreamDojo and Wan families).
This work is highly timely. World models for robotics (DreamDojo, Cosmos, UniSim, IRASim) are rapidly proliferating, and there is genuine community need for evaluation that goes beyond FVD/SSIM. The paper correctly identifies that existing benchmarks (WorldSimBench, WorldModelBench, WorldScore, WorldArena) do not systematically probe failure-regime fidelity. The optimism bias concept fills a clear conceptual gap: the observation that success-dominated training creates systematically biased simulators is important and likely to influence future work on data collection, training objectives, and evaluation standards.
MiraBench makes a meaningful conceptual and empirical contribution by identifying and formalizing optimism bias in robotic world models and providing a structured diagnostic benchmark. The work is timely, methodologically detailed, and produces actionable findings. Its primary limitations are in scope (tabletop manipulation only), evaluator reliability for certain model types, and the absence of downstream policy validation.
Generated May 29, 2026
Paper 2 has higher potential impact due to a more foundational contribution: it proposes a theorem-level explanation (attention bottleneck) for limits of extended reasoning, introduces new metrics, and quantifies a “deterministic horizon” with broad relevance to LLM/agent design. Its implications generalize across many models and domains and directly motivate hybrid tool-augmented architectures, a timely direction. Paper 1 is valuable and rigorous as a robotics benchmark with a large annotation effort, but its impact is more domain-specific (robotic world models) and primarily evaluative rather than yielding general theoretical limits.
Paper 1 introduces a critical benchmark for robotic world models, addressing a major blind spot in current evaluations (visual fidelity vs. physical/action adherence). Benchmarks in rapidly expanding fields like this often have broad, long-lasting impact by steering future research. Paper 2 presents a clever and useful methodological improvement for reward modeling in formal mathematics, but its scope and potential audience are more specialized compared to the foundational evaluation shift proposed in Paper 1.
MiraBench addresses a critical gap in robotic world model evaluation by introducing a systematic benchmark with hierarchical evaluation criteria, a large human-annotated corpus (16K+ judgments), and evaluation of 12 model configurations. Its findings (visual fidelity is a poor proxy for action fidelity, increasing model scale does not reliably improve action following, and optimism bias is pervasive) have broad implications for robotics, simulation, and embodied AI. While TriLens offers a useful hallucination detection method, it is more incremental (extending logit-lens analysis) and narrower in scope. Because MiraBench is a benchmark, it can shape future research directions across the growing field of robotic world models.
Paper 2 introduces a foundational benchmark for an emerging and critical field (robotic world models), exposing systemic flaws like optimism bias and the disconnect between visual and action fidelity. Benchmarks that redefine evaluation metrics typically dictate future research directions and gather high citations, offering broader field-level impact than the specific algorithmic improvement presented in Paper 1.
MiraBench addresses a critical gap in evaluating robotic world models by shifting focus from visual fidelity to action-conditioned reliability, a fundamental requirement for deploying world models in robotics. Its rigorous benchmark with 16,000+ human annotations, evaluation of 12 model configurations, and novel findings (e.g., pervasive optimism bias, scale not reliably improving action following) provide actionable insights for the rapidly growing field of world models for robotics. Paper 2, while ambitious in automating the scientific lifecycle, presents a system-level contribution with less rigorous evaluation methodology and enters a crowded space of AI-for-science agents with unclear empirical validation of its claims.
Paper 2 likely has higher impact: it introduces a new benchmark and dataset (16k human judgments) targeting an under-evaluated but crucial property—action-conditioned reliability of robotic world models—yielding broadly applicable diagnostics and findings across 12 diverse systems (open/closed, text/vector, scales). Benchmarks often become community standards, influencing many downstream works in robotics, world modeling, and evaluation. Paper 1 is novel in optimizer-centric robustness for LLM safety, but its scope is narrower (alignment robustness) and may see more incremental adoption compared to a widely usable evaluation suite.
Paper 1 identifies a novel, previously undocumented failure mode ('unfaithful capitulation') in reasoning LLMs where chain-of-thought remains correct but the final answer flips under adversarial pressure. This has broad implications for AI safety, deployment reliability, and alignment research—fields of intense current interest. The mechanistic dissociation between trace and answer is a conceptually rich finding with immediate practical relevance for all deployed reasoning models. Paper 2, while rigorous and valuable for robotics, addresses a more domain-specific evaluation gap. Paper 1's timeliness, novelty, and breadth of impact across the rapidly growing reasoning-model ecosystem give it higher potential impact.
Paper 2 likely has higher impact: it introduces a concrete, scalable benchmark with a clear evaluation target (action-conditioned reliability) plus a sizable human-annotated dataset (16k judgments) and broad coverage (12 model configurations, open/closed, text/vector). Benchmarks often become community standards, driving measurable progress across robotics, simulation, and generative modeling, making applications and cross-field uptake strong and timely. Paper 1 is conceptually interesting for multi-agent scientific workflows, but impact hinges on adoption of a specific framework and case studies appear narrower and harder to standardize.
MiraBench addresses a fundamental gap in evaluating robotic world models by shifting focus from visual fidelity to action-conditioned reliability. It introduces a systematic benchmark with 16,000+ human annotations, evaluates 12 model configurations, and reveals important findings (e.g., visual fidelity ≠ action fidelity, optimism bias is pervasive). This has broad impact across robotics, simulation, and model evaluation. Paper 2 addresses an important but narrower problem—over-search in agentic LLM systems—with an incremental RL-based solution. While practical, it targets efficiency optimization rather than exposing fundamental evaluation shortcomings in a rapidly growing field.
Paper 2 likely has higher impact: it targets a timely bottleneck for deploying generative world models in robotics—action-conditioned reliability—directly tied to safety, sim-to-real transfer, and autonomous decision-making. Its evaluation axes (physics adherence, action-following, optimism bias) map to real-world failure modes and can influence both benchmarking and model design across robot learning and generative modeling. The large human-annotated corpus and broad model coverage support methodological rigor and adoption. Paper 1 is novel and rigorous, but is primarily diagnostic for LLM text-only spatial reasoning with narrower immediate application.