Yuhang Huang, Xuan Lv, Junyan Xu, Zhiyuan Yu, Jiazhao Zhang, Ruizhen Hu, Wancheng Feng, Shilong Zou
World foundation models (WFMs) are powerful simulators, yet they predominantly operate in a single-view setting and lack the multi-view 3D consistency required for robotic manipulation. While robotic systems rely on multiple cameras (egocentric, eye-to-hand, and wrist-mounted) for policy learning, current multi-view world models simply concatenate view tokens without explicit geometric reasoning. This causes cross-view object drift, depth inconsistency, and texture misalignment. We trace these failures to two deficiencies: the absence of an explicit inter-view communication mechanism and the lack of a 3D geometric prior. We argue that resolving both simultaneously is necessary and sufficient. To address this, we present PAIWorld, a framework that augments diffusion-transformer world models via three core components: (1) Geometry-Aware Cross-View Attention blocks that establish an explicit pathway across views, (2) Geometric Rotary Position Embedding that encodes camera ray directions and extrinsic poses into the attention mechanism, and (3) Latent 3D-REPA, which distills 3D-aware features from frozen 3D foundation models to ensure 3D consistency. Built upon a DiT-based world foundation model, PAIWorld achieves state-of-the-art multi-view 3D consistency on robotic manipulation benchmarks, ranking 1st on the WorldArena leaderboard and 2nd on the AgiBot-Challenge2026 leaderboard, while enabling downstream applications such as model-based planning, world action models, and multi-view policy post-training.
PAIWorld addresses a genuine and important gap in world foundation models (WFMs): the lack of multi-view 3D consistency for robotic manipulation. The paper identifies two root causes—absence of inter-view communication and absence of 3D geometric priors—and argues that both must be addressed simultaneously. The framework introduces three modular components atop a DiT-based backbone: (1) Geometry-Aware Cross-View Attention for explicit inter-view feature exchange, (2) Geometric Rotary Position Embedding (Geo-RoPE) encoding camera ray directions and extrinsic poses into attention, and (3) Latent 3D-REPA, a token-relation distillation loss aligning intermediate DiT features with frozen 3D foundation model representations (Depth Anything 3).
The problem formulation is well-motivated: robotic systems inherently use multiple cameras (wrist, egocentric, eye-to-hand), and naive token concatenation without geometric reasoning leads to cross-view drift, depth inconsistency, and texture misalignment. The "two-pillar" framing—architectural pathway plus geometric objective—is conceptually clean and well-argued.
Strengths in design: The split-RoPE mechanism (ray subspace for pixel-level geometric correspondence, pose subspace for view-level identity) is a thoughtful design choice that prevents interference between spatially-varying and spatially-uniform signals. The anchor-sampling strategy for Latent 3D-REPA reduces computational complexity from quadratic to linear while preserving gradient signal quality. The AdaLN-Zero gating initialization preserves pretrained weights at step zero, enabling stable fine-tuning.
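The review does not give the exact form of the Latent 3D-REPA loss, so the following is only a rough sketch of why anchor sampling would make the relation cost linear in the number of tokens. It assumes cosine-similarity relations and a mean-squared mismatch between student and teacher relation matrices. The function names, tensor shapes, and anchor count are hypothetical and not taken from the paper.

```python
import numpy as np


def anchor_relation(tokens: np.ndarray, anchor_idx: np.ndarray) -> np.ndarray:
    """Cosine similarity between every token and a small set of anchor tokens.

    tokens: (N, D) features; anchor_idx: (K,) integer indices into the N tokens.
    Returns an (N, K) relation matrix rather than the full (N, N) one.
    """
    norms = np.maximum(np.linalg.norm(tokens, axis=1, keepdims=True), 1e-8)
    unit = tokens / norms
    return unit @ unit[anchor_idx].T


def anchor_relation_loss(student: np.ndarray, teacher: np.ndarray,
                         num_anchors: int, rng: np.random.Generator) -> float:
    """Mean-squared mismatch between student and teacher token-anchor relations.

    Assumes both inputs have the same token count N, spatially aligned;
    their feature widths may differ because only relations are compared.
    """
    n = student.shape[0]
    anchor_idx = rng.choice(n, size=num_anchors, replace=False)
    diff = anchor_relation(student, anchor_idx) - anchor_relation(teacher, anchor_idx)
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
student = rng.normal(size=(256, 32))  # stand-in for intermediate DiT tokens
teacher = rng.normal(size=(256, 48))  # stand-in for frozen 3D-model tokens

loss = anchor_relation_loss(student, teacher, num_anchors=16, rng=rng)
self_loss = anchor_relation_loss(teacher, teacher, num_anchors=16, rng=rng)
relation_shape = anchor_relation(student, np.arange(16)).shape
```

With K anchors fixed, each relation matrix has N x K entries, so memory and compute grow linearly with N instead of quadratically. Comparing relations rather than raw features also means that, in this sketch, no projection layer is needed between the two feature widths.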
Ablation study: The ablation in Table 4 is the paper's most convincing analytical element. It demonstrates a super-additive effect: the individual MEt3R improvements are 0.93 (pathway only) and 0.72 (objective only), while using both together yields 2.64, well above their sum of 1.65. This directly supports the central thesis that both pillars are necessary.
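The super-additivity arithmetic can be checked directly. The snippet below uses only the three MEt3R improvement figures quoted in this review; the variable names are illustrative.

```python
# MEt3R improvements as quoted in this review of Table 4.
pathway_only = 0.93    # cross-view pathway alone
objective_only = 0.72  # 3D objective alone
combined = 2.64        # both components together

# If the two effects were merely additive, the combined gain would equal this sum.
additive_expectation = round(pathway_only + objective_only, 2)
surplus = round(combined - additive_expectation, 2)
is_super_additive = combined > additive_expectation

print(f"sum of parts = {additive_expectation:.2f}, "
      f"combined = {combined:.2f}, surplus = {surplus:.2f}")
```

The combined gain exceeds the additive expectation by 0.99, which is larger than either individual improvement.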
Concerns: The paper lacks several important experimental elements. There is no ablation separating the contribution of Cross-View Attention from that of Geo-RoPE; the two are always bundled together as "CVA." The ablation is conducted on only one benchmark (AgiBot-World), leaving open whether the super-additivity holds across settings. There is no computational cost analysis: the paper does not report inference-time overhead, memory costs, or throughput comparisons. For a 14B-parameter model trained on 200 H200 GPUs for 7 days, reproducibility is a significant concern. The training dataset composition (2.5M clips from five sources with specific mixing ratios) is described but not justified experimentally.
Immediate applications: The paper demonstrates competitive results on two active leaderboards (WorldArena rank 1, AgiBot-Challenge2026 rank 2), validating practical utility. The downstream applications mentioned—model-based planning, world action models, and policy post-training—are highly relevant but only briefly discussed without thorough experimental validation.
Broader influence: The three proposed components are described as "plug-and-play" for any DiT-based world model, which could accelerate adoption. The Geo-RoPE mechanism, in particular, provides a clean interface for injecting camera geometry into transformer attention and could find use beyond world models—in multi-view video understanding, multi-camera surveillance, or autonomous driving systems.
Limitations on impact: The paper's impact is somewhat constrained by its narrow evaluation domain (robotic manipulation) and the lack of demonstration on truly diverse embodiments or environments. The downstream task evaluations (planning, policy learning) are mentioned in the abstract and conclusion but not substantively demonstrated with quantitative results, which significantly weakens the practical impact claims.
The paper is highly timely. World foundation models are a rapidly growing area (Cosmos, Sora, Wan), and the robotics community is actively seeking ways to leverage video generation for policy learning. The multi-view consistency problem is a genuine bottleneck—real robot setups use 2-4 cameras, and single-view world models are insufficient. The emergence of dedicated benchmarks (WorldArena, AgiBot-Challenge2026) in 2025-2026 signals community demand for exactly this capability. The use of Depth Anything 3 as a frozen 3D prior is a smart choice that leverages the most recent geometric foundation models.
Missing comparisons: The paper does not compare against multi-view generation methods like SyncDreamer or MVDiffusion adapted to the video setting, even qualitatively. While the authors argue these methods target different regimes (object-centric, static), a more direct comparison would strengthen the claims.
PAIWorld makes a solid engineering contribution to an important and timely problem. The two-pillar framework is well-motivated and the ablation evidence for super-additivity is compelling. However, the paper's impact is limited by shallow downstream evaluation, lack of computational analysis, and moderate individual component novelty. The strong leaderboard rankings provide validation, but the margins are not always decisive.
Generated Jun 18, 2026
Paper 1 tackles embodied AI's greatest bottleneck, data scarcity, by unlocking internet-scale, unlabeled egocentric video for robotic policy learning. By extracting geometric trajectories from monocular video to train flow-matching VLAs, it enables scalable training without expensive physical data collection. Coupled with a massive new 250k-scene benchmark and impressive real-world improvements (a 150% increase in success), VEGA's paradigm-shifting approach to data scaling offers broader transformative scientific impact than Paper 2's architectural refinements, providing a direct pathway to generalist robotic navigation.
While Paper 1 offers practical efficiency improvements for VLA models through pruning, Paper 2 introduces foundational architectural innovations to solve a critical bottleneck in world models: multi-view 3D consistency. By embedding explicit geometric reasoning and 3D priors into diffusion transformers, PAIWorld fundamentally advances the capabilities of world foundation models for simulation, planning, and policy learning. This conceptual leap in building reliable, 3D-aware robotic simulators promises a broader, more transformative scientific impact on the trajectory of autonomous robotics and spatial AI.
Paper 1 likely has higher scientific impact due to greater novelty and broader implications: it tackles a core limitation of world foundation models—multi-view 3D consistency—via explicit geometric cross-view communication, pose-aware embeddings, and distillation from 3D foundation models. This can influence simulation, planning, model-based RL, and multi-camera robotic learning broadly, making it timely and cross-disciplinary (vision, graphics, robotics, generative modeling). Paper 2 is methodologically clean and practically useful, but is a more incremental improvement (frequency-domain/regularization) with narrower conceptual reach.
PAIWorld addresses a fundamental limitation of world foundation models, multi-view 3D consistency, which is critical for robotic manipulation. It introduces novel architectural components (Geometry-Aware Cross-View Attention, Geometric RoPE, Latent 3D-REPA) with broad applicability across robotics, generative modeling, and embodied AI. Its state-of-the-art results on competitive leaderboards and its support for downstream applications such as model-based planning suggest wide adoption potential. Paper 1, while practical, presents an incremental geometric correction method for a narrower teleoperation problem using well-established mathematical principles.
Paper 2 demonstrates higher potential scientific impact by developing a foundational world model that addresses a critical bottleneck: multi-view 3D consistency. Its architectural innovations serve as a general-purpose simulator applicable across diverse robotic platforms. While Paper 1 presents an impressive humanoid system using predictive touch, Paper 2's foundational nature, benchmark leadership, and ability to enhance broad downstream applications promise wider adoption and broader methodological influence across the entire embodied AI community.
Paper 2 likely has higher impact due to broader scope and foundational contributions: it advances multi-view, 3D-consistent world foundation models with explicit geometric priors and inter-view communication, enabling many downstream uses (planning, action modeling, policy training) across robotic platforms and camera setups. Its components (geometry-aware attention, pose/ray embeddings, 3D feature distillation) are broadly reusable beyond a specific VLA. Paper 1 is highly compelling and strong experimentally, but is more specialized (dynamic-object latency mitigation via token prediction around a frozen VLA), limiting cross-domain breadth.
PAIWorld addresses a fundamental challenge in world foundation models—multi-view 3D consistency for robotic manipulation—with a principled geometric framework (cross-view attention, geometric positional embeddings, 3D representation alignment). It achieves state-of-the-art results on competitive leaderboards and enables multiple downstream applications. The work has broader impact potential as world models are a rapidly growing area central to robotics, video generation, and embodied AI. Paper 2 presents a useful but more incremental contribution—converting single-arm demos to dual-arm execution via LLM coordination—with narrower scope and applicability.
Paper 2 has higher impact potential due to a more broadly applicable methodological advance: improving multi-view 3D-consistent world foundation models directly benefits many manipulation pipelines (planning, policy learning, simulation, action models) across robotics and generative modeling. Its explicit geometric mechanisms and distillation from 3D foundation models suggest stronger novelty and transferability, and the reported top leaderboard performance indicates rigorous, timely benchmarking. Paper 1 is valuable for bipedal locomotion design and toe actuation validation, but its scope and cross-field reach are narrower and results are primarily simulation-based.
Paper 2 likely has higher impact due to broader novelty and scope: it proposes architectural components (geometry-aware cross-view attention, geometric RoPE, and 3D feature distillation) to address a fundamental limitation of multi-view world models, 3D consistency, that is central to robotics and simulation. This enables multiple downstream uses (planning, WAMs, policy training) and is timely given the rapid growth of WFMs. Paper 1 is elegant and practical (training-free inference-time consistency) but is more incremental and primarily boosts existing diffusion policies rather than redefining multi-view world modeling.
Paper 2 addresses a critical bottleneck in general embodied AI: multi-view 3D consistency in world foundation models. Its rigorous methodology solves fundamental geometric and spatial alignment issues, achieving SOTA on major benchmarks like WorldArena. This enables broad downstream applications across robotic manipulation, planning, and policy training. In contrast, while Paper 1 presents an innovative approach to real-time human-robot musical co-creation, its scope is heavily domain-specific. Paper 2's foundational nature and wide applicability across general robotics give it significantly higher potential for widespread scientific and real-world impact.