Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang
We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.
UMA introduces 3D object motion trajectories as a unified intermediate representation that bridges visuomotor control and dynamics modeling within a single model. The key insight is that object motion and robot actions are co-evolving variables that can be jointly modeled under a masked generative objective, where different mask patterns determine different inference modes (action generation, dynamics prediction, task adaptation) from the same pretrained weights.
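To make the mask-as-mode idea concrete, here is a minimal sketch. It is not the authors' implementation. It assumes a flat token sequence with motion tokens first and action tokens second, and it assumes that dynamics prediction means generating motion from given actions. The names `build_target_mask`, `"control"`, and `"dynamics"` are hypothetical. A `True` entry marks a token the model must generate; `False` marks a token supplied as conditioning.

```python
def build_target_mask(mode, n_motion, n_action):
    """Return one flag per token: True = generate it, False = it is given.

    Assumed (hypothetical) layout: motion tokens first, then action tokens.
    """
    if mode == "control":      # motion is given, actions are generated
        motion_is_target, action_is_target = False, True
    elif mode == "dynamics":   # actions are given, motion is generated
        motion_is_target, action_is_target = True, False
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [motion_is_target] * n_motion + [action_is_target] * n_action


# The same (hypothetical) model would be called with either mask.
control_mask = build_target_mask("control", n_motion=2, n_action=3)
dynamics_mask = build_target_mask("dynamics", n_motion=2, n_action=3)
```

Under this reading, switching inference mode changes only the mask passed to the model, never the weights.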
The paper solves a concrete problem: existing motion-based methods are siloed—motion-conditioned policies require paired motion-action labels and cannot use action-free video, while motion-prediction models need separate controllers and cannot incorporate action labels. UMA unifies both directions through masked autoencoding over motion and action token sequences, enabling heterogeneous data sources (human videos, real robot data, simulated data) to contribute supervision proportional to what labels they carry.
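The idea that each data source contributes supervision only for the labels it carries can be sketched as a loss that averages over available modalities. This is an illustrative assumption about the mechanism, not the paper's loss. The function name `masked_loss` and the per-token error lists are hypothetical, and the error values are made up for the example.

```python
def masked_loss(motion_errors, action_errors, has_motion, has_action):
    """Average per-token errors over the modalities that actually have labels.

    A modality without labels (e.g. actions in action-free human video)
    contributes nothing to the loss.
    """
    supervised = []
    if has_motion:
        supervised.extend(motion_errors)
    if has_action:
        supervised.extend(action_errors)
    if not supervised:
        return 0.0
    return sum(supervised) / len(supervised)


# Hypothetical per-token errors for one training sample.
motion_errors = [1.0, 3.0]
action_errors = [10.0]

# Action-free human video: only the motion term is counted.
human_video_loss = masked_loss(motion_errors, action_errors, True, False)
# Robot demonstration with both labels: both terms are counted.
robot_demo_loss = masked_loss(motion_errors, action_errors, True, True)
```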
Three specific innovations stand out: (1) extending hindsight experience replay from single-state goals to motion contexts, eliminating the need for manual task annotations during pretraining; (2) a contrastive objective that disentangles task intent from scene geometry in the learned task latent; and (3) Masked DiT blocks with dual adaptive layer normalization that separately modulate target (denoised) and given (conditioning) tokens.
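The third item, dual adaptive layer normalization, can be illustrated with a small sketch. It assumes the common DiT-style modulation form y = (1 + scale) * LN(x) + shift, with one (scale, shift) pair for target tokens and another for given tokens. The paper's exact parameterization is not stated in this review, so the function names and the fixed modulation values below are hypothetical. In a real model the pairs would be predicted from conditioning signals rather than hard-coded.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dual_adaln(tokens, is_target, target_mod, given_mod):
    """Modulate each token with the (scale, shift) pair for its role.

    target_mod / given_mod are (scale, shift) tuples; hypothetical constants
    here, standing in for values a network would predict.
    """
    out = []
    for x, tgt in zip(tokens, is_target):
        scale, shift = target_mod if tgt else given_mod
        out.append([(1.0 + scale) * v + shift for v in layer_norm(x)])
    return out


tokens = [[1.0, 3.0], [1.0, 3.0]]           # two identical token vectors
modulated = dual_adaln(tokens, [True, False],
                       target_mod=(1.0, 0.0), given_mod=(0.0, 0.5))
```

The two input tokens are identical, yet they leave the block with different values because one is treated as a target and the other as conditioning.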
The experimental design is reasonably thorough. The paper evaluates on three real-world tasks (insertion, sweeping, folding) spanning rigid objects, tool use, and deformable materials—covering meaningfully different manipulation challenges. Each task uses 20 randomized trials. Comparisons are made against appropriate baselines: COIL for motion-conditioned control, UVA for joint video-action modeling, PointWorld for dynamics prediction, and π0.5 for few-shot adaptation.
The ablation study is well-structured, addressing data source contributions (removing simulation vs. human videos), the contrastive objective, and architectural choices (dense attention, separate heads, action-only). The finding that simulation primarily contributes motion-action coupling while human videos contribute task diversity is an informative decomposition.
However, there are methodological concerns. The evaluation scale is small: with 20 trials per task across 3 tasks, a single trial shifts a task's success rate by 5 percentage points, which is too coarse to support strong conclusions. The simulated ablations use 100 episodes but cover only 3 tasks. The paper reports no statistical significance tests or confidence intervals. The claim of "consistently outperforms state-of-the-art baselines specialized for each inference mode" is therefore strong relative to the limited evaluation scope. The scalability analysis (Appendix E) shows promising trends, but across only 6 configurations.
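To show how wide the uncertainty is at this sample size, the sketch below computes a 95% Wilson score interval for a binomial success rate. The count of 16 successes out of 20 is a hypothetical illustration, not a result reported in the paper; only the trial count of 20 comes from the evaluation described above.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials
                         + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical outcome: 16 of 20 trials succeed (80% observed success).
low, high = wilson_interval(16, 20)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

For this hypothetical 80% result the interval spans roughly 0.58 to 0.92, so two methods whose observed success rates differ by 10 to 15 points at n = 20 would have heavily overlapping intervals.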
The data pipeline relies on a cascade of off-the-shelf models (MegaSaM, UniDepth, SAM3, TAPIP3D, GPT-5-nano), making the approach dependent on the quality of these components. Error propagation through this pipeline is not analyzed.
Direct impact on robot learning: The unified treatment of motion and action could shift how the field thinks about data utilization. The ability to incorporate action-free human videos alongside robot demonstrations addresses a real data bottleneck. The soft prompt tuning mechanism for task adaptation (optimizing only the task latent) is practically appealing for deployment scenarios.
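The appeal of optimizing only the task latent can be seen in a toy sketch: the model weights stay frozen and gradient descent updates a small latent vector until the frozen model reproduces the demonstration targets. Everything here is an assumption for illustration. The linear "model", the function `tune_task_latent`, and the numbers are hypothetical and far simpler than the actual architecture.

```python
def tune_task_latent(frozen_w, targets, dim, steps=200, lr=0.1):
    """Fit a latent z so that frozen_w @ z matches targets.

    Only z is updated; frozen_w is read but never modified.
    Loss: 0.5 * sum_i (pred_i - target_i)^2.
    """
    z = [0.0] * dim
    for _ in range(steps):
        preds = [sum(w * zj for w, zj in zip(row, z)) for row in frozen_w]
        errs = [p - t for p, t in zip(preds, targets)]
        # Gradient w.r.t. z_j is sum_i err_i * W[i][j].
        grad = [sum(e * row[j] for e, row in zip(errs, frozen_w))
                for j in range(dim)]
        z = [zj - lr * g for zj, g in zip(z, grad)]
    return z


frozen_w = [[1.0, 0.0], [0.0, 2.0]]   # stands in for pretrained weights
demo_targets = [1.0, 4.0]             # stands in for few-shot demonstrations
task_latent = tune_task_latent(frozen_w, demo_targets, dim=2)
```

Because only the latent changes, the adapted artifact per task is a short vector rather than a new copy of the model.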
Representation design: The argument for 3D object motion as an embodiment-agnostic intermediate representation is compelling and could influence future architecture designs. Unlike pixel-space world models, motion trajectories share the coordinate frame with robot actions and are more compact.
Cross-domain transfer: The heterogeneous training paradigm—where different data sources contribute different loss terms through masking—is a clean formulation that could generalize beyond this specific architecture.
Limitations on broader impact: The approach still requires calibrated camera-to-base extrinsics for deployment, limiting plug-and-play usage. The single-observation conditioning (no temporal history) is acknowledged as a limitation. New embodiments still require embodiment-specific action heads with compatible robot data, which partially undermines the cross-embodiment promise.
This work arrives at a moment when the field is actively debating the right representation for robot foundation models—between end-to-end VLA approaches, video world models, and structured intermediate representations. UMA provides evidence that structured 3D motion representations can outperform both pixel-based joint models (UVA) and language-conditioned VLAs (π0.5) in manipulation settings, contributing a data point to this ongoing debate.
The problem of leveraging heterogeneous data sources (human videos, cross-embodiment data, simulation) is among the most pressing in robot learning, and the mask-based formulation provides an elegant solution to the missing-modality problem.
UMA presents a conceptually clean framework that addresses a genuine gap in robot learning: unifying dynamics modeling and visuomotor control through 3D object motion. The masking-based approach to heterogeneous data is elegant, and the experimental evidence, while limited in scale, is consistent across settings. The main contribution is architectural and representational rather than a dramatic gain in performance numbers, but it opens a promising research direction. The paper would be strengthened by larger-scale evaluation and more rigorous statistical analysis.
Generated Jun 16, 2026
Qwen-RobotManip demonstrates higher potential impact due to its massive scale (38,100-hour pretraining corpus across 15 platforms), comprehensive alignment framework spanning representation, motion, and behavioral dimensions, and strong empirical results substantially outperforming state-of-the-art including π0.5 across multiple benchmarks and real-robot platforms. While UMA presents an elegant unified motion-action interface with novel masked generative training, Qwen-RobotManip's scale, breadth of validation, emergent generalization capabilities, and practical demonstration across diverse real robots position it for broader impact on the robotics foundation model field.
EgoInfinity addresses a fundamental bottleneck in robot learning—converting internet-scale human videos into actionable robot training data—with a comprehensive, modular engine covering perception through real-robot execution. Its web-scale approach, cross-embodiment generalization, and real-robot validation across diverse tasks (grasping, cutting, wiping, pouring) suggest broader practical impact. While UMA offers an elegant unified framework for heterogeneous robot learning via masked generative objectives, EgoInfinity's data engine paradigm has greater potential to unlock open-world robot learning at scale, addressing the critical data scarcity problem with demonstrated real-world results.
Paper 1 introduces a highly novel, unified framework using 3D object motion trajectories as a universal interface for heterogeneous robot learning. This approach elegantly bridges the gap between diverse data sources (human videos, simulation, different robot morphologies) without requiring manual annotations. By providing a scalable way to leverage diverse, uncurated data for both control and dynamics modeling, it addresses a fundamental bottleneck in embodied AI, likely leading to broader adoption and greater long-term impact across the field than Paper 2's specific regularization method.
Paper 2 introduces a foundational approach for generalized robot learning across heterogeneous data sources (human videos, simulation, robots). This tackles a core challenge in scaling AI for robotics, offering broad scientific impact and applicability across multiple domains. Paper 1, while demonstrating a highly practical real-world application in apparel automation, is primarily a systems engineering and deployment case study, limiting its theoretical novelty and cross-domain impact compared to the algorithmic advancements of Paper 2.
Paper 1 addresses a foundational bottleneck in robotics: leveraging diverse, heterogeneous data (human, simulation, varied robots) without manual annotations. By using 3D motion as a universal interface, it paves the way for scalable, general-purpose robot foundation models. While Paper 2 presents an excellent architectural improvement and benchmark for short-term memory, Paper 1's unified pretraining paradigm has a broader scope and aligns more closely with the highly impactful trend of scaling generalist models in robotics.
Paper 2 introduces a paradigm-shifting concept—ex novo hardware generation in robots via fluidics and photopolymerization. While Paper 1 offers highly timely and robust advancements in robot learning algorithms, Paper 2 pioneers a fundamentally new interdisciplinary capability that bridges materials science, chemistry, and robotics. This biologically-inspired approach to dynamic material restructuring allows robots to physically grow new sensors during operation, offering unprecedented novelty and long-term disruptive scientific impact across multiple fields compared to the software-based algorithmic improvements in Paper 1.
SOLE-R1 addresses a fundamental bottleneck in robot RL—the reward specification problem—by enabling zero-shot online RL without ground-truth rewards, demonstrations, or task-specific tuning across 24 unseen tasks. Its ability to outperform GPT-5 and Gemini-3-Pro as reward models while resisting reward hacking represents a significant breakthrough. The approach could broadly impact how robots learn new tasks autonomously. While UMA's unified motion-action framework is innovative and useful for heterogeneous robot learning, SOLE-R1's contribution is more transformative in potentially eliminating manual reward engineering, a long-standing challenge in robotics RL.
Paper 2 has higher potential impact because it challenges a fundamental assumption in robot learning: that real-world data is required for sim-to-real transfer. By achieving robust zero-shot physical manipulation solely through large-scale synthetic data, it provides a highly scalable paradigm for embodied AI. Furthermore, releasing an open-source procedural generation engine and a 1.8-million trajectory dataset will democratize robotics research and catalyze immense follow-up work. While Paper 1 offers an elegant unified algorithmic approach, Paper 2's empirical breakthrough, superior real-world performance over strong baselines, and massive open-source contributions will likely drive a larger field-wide paradigm shift.
UMA presents a more fundamentally novel contribution by unifying visuomotor control and dynamics modeling through 3D motion trajectories as a shared interface, with a single masked generative framework supporting multiple inference modes. Its ability to bridge heterogeneous data sources (robot demos, human videos, simulated data) without manual task annotations represents a deeper architectural innovation. While Robometer makes a valuable contribution in reward modeling with its dual-objective approach and large-scale dataset (RBM-1M), UMA's unified framework has broader potential to reshape how robot learning systems are designed, pretrained, and deployed across embodiments and tasks.
LDA-1B demonstrates higher potential impact through its massive scale (1B parameters, 30k hours of data in EI-30k dataset), concrete performance gains over strong baselines like π0.5, and practical contributions including a standardized large-scale dataset. While UMA introduces elegant unified motion-action modeling, LDA-1B addresses the critical scaling challenge more directly, shows larger empirical improvements on diverse task categories, and introduces the valuable insight of leveraging low-quality data. The dataset contribution (EI-30k) alone could catalyze significant follow-up research in robot foundation models.