MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen

May 28, 2026 · arXiv:2605.29360v1

cs.AI

#267 of 3539 · Artificial Intelligence

Tournament Score: 1514 ± 44 (scale: 1050 to 1800)

Win Rate: 83%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 7

Abstract

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MiraBench

1. Core Contribution

MiraBench introduces a hierarchical evaluation framework that shifts the assessment of robotic world models from visual fidelity to action-conditioned reliability — whether predictions are physically plausible, faithful to commanded actions, and correctly preserve failure outcomes. The benchmark decomposes reliability into three nested levels: Physics Adherence (reference-free physical consistency), Action-Following Fidelity (whether outputs respect task-relevant inputs), and Optimism Bias Detection (whether models hallucinate success under failure-inducing actions).

The most novel conceptual contribution is the formalization of Optimism Bias — a systematic tendency of world models to predict successful outcomes when conditioned on actions that should produce failure. This is grounded in the observation that robot learning datasets are overwhelmingly composed of successful demonstrations, inducing a strong success prior that overrides contradictory action signals. The paper formalizes this as a measurable quantity (Eq. 3) and constructs a perturbation taxonomy of six physically interpretable failure modes to probe it.
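To make the idea of a measurable optimism-bias quantity concrete, the sketch below estimates it as the fraction of failure-inducing rollouts in which the predicted video still shows success. This is an assumed definition for illustration only and is not claimed to match the paper's Eq. 3; the `Rollout` record, its field names, and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One world-model prediction (hypothetical record, not the paper's schema)."""
    perturbation: str        # e.g. "premature_release"
    should_fail: bool        # True if the commanded action should physically fail
    predicted_success: bool  # True if the predicted video shows task success


def optimism_bias_rate(rollouts):
    """Fraction of failure-inducing rollouts where the model still predicts success."""
    failing = [r for r in rollouts if r.should_fail]
    if not failing:
        raise ValueError("no failure-inducing rollouts to score")
    return sum(r.predicted_success for r in failing) / len(failing)


# Made-up sample: three perturbed rollouts, one unperturbed control.
rollouts = [
    Rollout("premature_release", True, True),
    Rollout("carry_slip", True, False),
    Rollout("wrist_tilt", True, True),
    Rollout("none", False, True),
]
print(optimism_bias_rate(rollouts))
```

In this toy sample two of the three perturbed rollouts are predicted as successes, so the rate is 2/3; the unperturbed control is ignored because it carries no failure signal.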

2. Methodological Rigor

Strengths in methodology:

The perturbation taxonomy (Appendix B) is carefully designed with physically interpretable failure modes (grip force insufficiency, premature release, carry slip, contact oscillation, wrist tilt, approach overshoot), each targeting specific joint groups with clear expected physical consequences.

The human annotation corpus (906 videos, 16,704 judgments) is substantial and well-structured across four modules with detailed rubrics (Appendix G). The 16-indicator physical consistency protocol and 18-question optimism bias diagnosis provide unusually fine-grained supervision.

VLM evaluators are validated against human annotations, achieving >85% agreement, with transparent scoring pipelines (e.g., the physics-law evaluator uses explicit kinematic computation rather than LLM judgment for 90% of its score).
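As an illustration of what an explicit kinematic check could look like for the free-fall regime, the sketch below fits a constant acceleration to tracked drop distances and compares it with g. It assumes the object is released from rest and that positions are already converted to metres; the function names and the 20% tolerance are hypothetical and are not taken from the paper's evaluator.

```python
G = 9.81  # gravitational acceleration in m/s^2


def fitted_acceleration(times, drops):
    """Least-squares fit of a in drop = 0.5 * a * t**2 (release from rest assumed)."""
    num = sum(d * t * t for t, d in zip(times, drops))
    den = sum(t ** 4 for t in times)
    if den == 0:
        raise ValueError("need at least one sample with t > 0")
    return 2.0 * num / den


def free_fall_consistent(times, drops, rel_tol=0.2):
    """True if the fitted acceleration is within rel_tol of g (tolerance is a guess)."""
    a = fitted_acceleration(times, drops)
    return abs(a - G) / G <= rel_tol


times = [0.0, 0.1, 0.2, 0.3]
plausible = [0.5 * G * t * t for t in times]  # ideal free fall
floaty = [0.5 * 3.0 * t * t for t in times]   # falls far too slowly
print(free_fall_consistent(times, plausible), free_fall_consistent(times, floaty))
```

A rule of this kind is deterministic and auditable, which is the property the review credits when it notes that most of the physics-law score does not depend on LLM judgment.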

Methodological concerns:

The Physics-Law Compliance module currently covers only two motion regimes (free-fall and horizontal push), which is narrow relative to the diversity of manipulation physics.

The human annotation corpus, while large in total judgments, covers only 4 models for the full cross-model analysis, with several evaluation levels targeting only DreamDojo-14B. This limits the generalizability of human-grounded validation.

The VLM evaluator for optimism bias achieves only 75% accuracy on Happy Horse (Table 10), with Y Recall of just 33.3%, suggesting the automated pipeline may systematically undercount bias for certain model types.

Inter-annotator agreement is described only as spot-checking on 10% samples rather than formal IAA metrics (Cohen's κ or similar), which weakens claims about annotation reliability.
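The two evaluator concerns above can be made concrete with a small calculation. The sketch below shows how 75% accuracy can coexist with 33.3% recall on the positive ("Y") class when positives are rare, and how Cohen's κ corrects raw agreement for chance. The label lists are hypothetical counts chosen only to reproduce those two ratios; they are not the data behind Table 10.

```python
def accuracy(truth, pred):
    """Share of items where the two label lists agree."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def recall(truth, pred, positive="Y"):
    """Share of true positives that the predictor also labels positive."""
    hits = sum(t == positive and p == positive for t, p in zip(truth, pred))
    total = sum(t == positive for t in truth)
    return hits / total


def cohen_kappa(a, b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    p_exp = sum((a.count(l) / n) * (b.count(l) / n) for l in set(a) | set(b))
    if p_exp == 1:
        return 1.0  # both raters used a single identical label
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical sample: 3 biased ("Y") and 9 unbiased ("N") videos per the human labels.
human = ["Y"] * 3 + ["N"] * 9
vlm = ["Y", "N", "N"] + ["N"] * 8 + ["Y"]

print(accuracy(human, vlm))     # 9 of 12 agree
print(recall(human, vlm))       # 1 of 3 positives found
print(cohen_kappa(human, vlm))  # chance-corrected agreement
```

On this sample κ works out to 0.25 despite 75% raw agreement, which is why a formal agreement metric would say more about annotation reliability than spot-checking alone.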

3. Potential Impact

Direct impact on robotics: The benchmark addresses a genuine and underappreciated problem. If world models are to serve as scalable simulators for robot learning, their fidelity to action conditioning — especially under failure — is paramount. The finding that visual quality is a poor proxy for action fidelity (Finding 1) and that optimism bias is pervasive (Finding 3) should influence how the community evaluates and trains world models.

Implications for training paradigms: The result that success-only post-training improves action-following but degrades failure preservation (Finding 2) has direct implications for training data curation and objective design. This motivates contrastive action-outcome learning and failure-aware curricula.

Broader evaluation methodology: The hierarchical diagnostic structure — where lower-level failures preclude meaningful higher-level assessment — is a principled design that could influence benchmark construction in adjacent domains (autonomous driving world models, embodied navigation).

Limitations to impact: The benchmark is currently restricted to tabletop manipulation with short-horizon contact dynamics. The perturbation taxonomy, while extensible, covers a relatively narrow slice of possible failures. The reliance on VLM-based evaluation introduces its own reliability ceiling, and the 12-model evaluation, while broad, is dominated by variants of a few architectures (Cosmos/DreamDojo family + Wan family).
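The hierarchical structure noted under "Broader evaluation methodology" can be pictured as a gated pipeline: a level is reported only if every level below it passed. The sketch below is one possible reading of that design, not the benchmark's actual scoring code; the level keys, the assumption that every score is normalized to [0, 1] with higher meaning better, and the thresholds are all hypothetical.

```python
LEVELS = ("physics_adherence", "action_following", "optimism_bias")


def hierarchical_report(scores, thresholds):
    """Walk the levels in order; withhold a level once a lower one fell below threshold."""
    report = {}
    gate_open = True
    for level in LEVELS:
        if not gate_open:
            report[level] = None  # not meaningful: a lower level already failed
            continue
        report[level] = scores[level]
        gate_open = scores[level] >= thresholds[level]
    return report


# Made-up scores: physics passes, action following fails, so level 3 is withheld.
scores = {"physics_adherence": 0.9, "action_following": 0.4, "optimism_bias": 0.7}
thresholds = {level: 0.5 for level in LEVELS}
print(hierarchical_report(scores, thresholds))
```

Withholding the higher-level score, rather than reporting it with a caveat, keeps a model that ignores its actions from appearing well calibrated on failure cases by accident.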

4. Timeliness & Relevance

This work is highly timely. World models for robotics (DreamDojo, Cosmos, UniSim, IRASim) are rapidly proliferating, and there is genuine community need for evaluation that goes beyond FVD/SSIM. The paper correctly identifies that existing benchmarks (WorldSimBench, WorldModelBench, WorldScore, WorldArena) do not systematically probe failure-regime fidelity. The optimism bias concept fills a clear conceptual gap: the observation that success-dominated training creates systematically biased simulators is important and likely to influence future work on data collection, training objectives, and evaluation standards.

5. Key Strengths

Conceptual clarity: The formalization of optimism bias and the hierarchical decomposition of reliability provide clean abstractions that the community can build upon.

Comprehensive annotation corpus: 16,704 structured judgments across fine-grained indicators represent a significant annotation effort with reuse potential.

Actionable findings: The three central findings (visual≠action fidelity, scale≠reliability, pervasive optimism bias) are concrete and falsifiable, with clear implications for model development.

Dual modality evaluation: Testing both vector-conditioned and text-conditioned models within the same framework reveals conditioning-type-specific failure modes.

6. Notable Weaknesses

Limited task diversity: Tabletop manipulation only; claims about "robotic world models" are overreaching relative to the evaluation scope.

Evaluator ceiling: The automated evaluators, while validated, have substantial error rates on certain model types (e.g., 33% Y recall on Happy Horse), potentially biasing cross-model comparisons.

No downstream validation: The paper does not demonstrate that MiraBench scores predict downstream policy performance, which would be the ultimate validation of the benchmark's utility.

Reproducibility of closed-source models: Several evaluated models (Happy Horse, WanX, Kling) are API-based, limiting full reproducibility.

Summary

MiraBench makes a meaningful conceptual and empirical contribution by identifying and formalizing optimism bias in robotic world models and providing a structured diagnostic benchmark. The work is timely, methodologically detailed, and produces actionable findings. Its primary limitations are in scope (tabletop manipulation only), evaluator reliability for certain model types, and the absence of downstream policy validation.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 7

Generated May 29, 2026

Comparison History (24)

Lost vs. The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

Paper 2 has higher potential impact due to a more foundational contribution: it proposes a theorem-level explanation (attention bottleneck) for limits of extended reasoning, introduces new metrics, and quantifies a “deterministic horizon” with broad relevance to LLM/agent design. Its implications generalize across many models and domains and directly motivate hybrid tool-augmented architectures, a timely direction. Paper 1 is valuable and rigorous as a robotics benchmark with a large annotation effort, but its impact is more domain-specific (robotic world models) and primarily evaluative rather than yielding general theoretical limits.

gpt-5.2·Jun 2, 2026

Won vs. Expected Value Alignment for Generative Reward Modeling in Formal Mathematics Verification

Paper 1 introduces a critical benchmark for robotic world models, addressing a major blind spot in current evaluations (visual fidelity vs. physical/action adherence). Benchmarks in rapidly expanding fields like this often have broad, long-lasting impact by steering future research. Paper 2 presents a clever and useful methodological improvement for reward modeling in formal mathematics, but its scope and potential audience are more specialized compared to the foundational evaluation shift proposed in Paper 1.

gemini-3.1-pro-preview·Jun 2, 2026

Won vs. TriLens: Per-Layer Logit-Lens Entropy for White-Box Hallucination Detection

MiraBench addresses a critical gap in robotic world model evaluation by introducing a systematic benchmark with hierarchical evaluation criteria, a large human-annotated corpus (16K+ judgments), and evaluation of 12 model configurations. Its findings—that visual fidelity poorly proxies action fidelity, scaling doesn't help, and optimism bias is pervasive—have broad implications for robotics, simulation, and embodied AI. While TriLens offers a useful hallucination detection method, it is more incremental (extending logit-lens analysis) and narrower in scope. MiraBench's benchmark nature means it can shape future research directions across the growing robotic world models field.

claude-opus-4-6·Jun 2, 2026

Won vs. Learning Agent-Compatible Context Management for Long-Horizon Tasks

Paper 2 introduces a foundational benchmark for an emerging and critical field (robotic world models), exposing systemic flaws like optimism bias and the disconnect between visual and action fidelity. Benchmarks that redefine evaluation metrics typically dictate future research directions and gather high citations, offering broader field-level impact than the specific algorithmic improvement presented in Paper 1.

gemini-3.1-pro-preview·Jun 1, 2026

Won vs. AutoSci: A Memory-Centric Agentic System for the Full Scientific Research Lifecycle

MiraBench addresses a critical gap in evaluating robotic world models by shifting focus from visual fidelity to action-conditioned reliability—a fundamental requirement for deploying world models in robotics. Its rigorous benchmark with 16,000+ human annotations, evaluation of 12 model configurations, and novel findings (e.g., optimism bias, scale not improving action fidelity) provide actionable insights for the rapidly growing field of world models for robotics. Paper 2, while ambitious in automating the scientific lifecycle, presents a system-level contribution with less rigorous evaluation methodology and enters a crowded space of AI-for-science agents with unclear empirical validation of its claims.

claude-opus-4-6·Jun 1, 2026

Won vs. Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Paper 2 likely has higher impact: it introduces a new benchmark and dataset (16k human judgments) targeting an under-evaluated but crucial property—action-conditioned reliability of robotic world models—yielding broadly applicable diagnostics and findings across 12 diverse systems (open/closed, text/vector, scales). Benchmarks often become community standards, influencing many downstream works in robotics, world modeling, and evaluation. Paper 1 is novel in optimizer-centric robustness for LLM safety, but its scope is narrower (alignment robustness) and may see more incremental adoption compared to a widely usable evaluation suite.

gpt-5.2·May 29, 2026

Lost vs. The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Paper 1 identifies a novel, previously undocumented failure mode ('unfaithful capitulation') in reasoning LLMs where chain-of-thought remains correct but the final answer flips under adversarial pressure. This has broad implications for AI safety, deployment reliability, and alignment research—fields of intense current interest. The mechanistic dissociation between trace and answer is a conceptually rich finding with immediate practical relevance for all deployed reasoning models. Paper 2, while rigorous and valuable for robotics, addresses a more domain-specific evaluation gap. Paper 1's timeliness, novelty, and breadth of impact across the rapidly growing reasoning-model ecosystem give it higher potential impact.

claude-opus-4-6·May 29, 2026

Won vs. Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Paper 2 likely has higher impact: it introduces a concrete, scalable benchmark with a clear evaluation target (action-conditioned reliability) plus a sizable human-annotated dataset (16k judgments) and broad coverage (12 model configurations, open/closed, text/vector). Benchmarks often become community standards, driving measurable progress across robotics, simulation, and generative modeling, making applications and cross-field uptake strong and timely. Paper 1 is conceptually interesting for multi-agent scientific workflows, but impact hinges on adoption of a specific framework and case studies appear narrower and harder to standardize.

gpt-5.2·May 29, 2026

Won vs. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

MiraBench addresses a fundamental gap in evaluating robotic world models by shifting focus from visual fidelity to action-conditioned reliability. It introduces a systematic benchmark with 16,000+ human annotations, evaluates 12 model configurations, and reveals important findings (e.g., visual fidelity ≠ action fidelity, optimism bias is pervasive). This has broad impact across robotics, simulation, and model evaluation. Paper 2 addresses an important but narrower problem—over-search in agentic LLM systems—with an incremental RL-based solution. While practical, it targets efficiency optimization rather than exposing fundamental evaluation shortcomings in a rapidly growing field.

claude-opus-4-6·May 29, 2026

Won vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Paper 2 likely has higher impact: it targets a timely bottleneck for deploying generative world models in robotics—action-conditioned reliability—directly tied to safety, sim-to-real transfer, and autonomous decision-making. Its evaluation axes (physics adherence, action-following, optimism bias) map to real-world failure modes and can influence both benchmarking and model design across robot learning and generative modeling. The large human-annotated corpus and broad model coverage support methodological rigor and adoption. Paper 1 is novel and rigorous, but is primarily diagnostic for LLM text-only spatial reasoning with narrower immediate application.

gpt-5.2·May 29, 2026
