Unified Motion-Action Modeling for Heterogeneous Robot Learning

Yunhao Cao, Shitong Liu, Chao Feng, Meryl Zhang, Xuanchen Lu, Andrew Owens, Kuan Fang

Jun 15, 2026 · arXiv:2606.16917v1

cs.RO

#57 of 3949 · Robotics

Tournament Score: 1574 ± 38

Win Rate: 67%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.8 · Clarity 8

Abstract

We present Unified Motion-Action (UMA) Model, an approach that uses 3D object motion trajectories as a shared interface to bridge visuomotor control and dynamics modeling. UMA treats object motion and robot actions as co-evolving variables under a masked generative objective, in which the mask pattern determines both the supervision regime during pretraining and the inference mode at deployment. Using hindsight-relabeled motion contexts and a contrastive objective that disentangles task intent from scene geometry, UMA enables multi-task pretraining across heterogeneous data sources without requiring manually annotated task instructions. At deployment, the same pretrained parameters support motion-conditioned visuomotor control, motion-based dynamics modeling, and task adaptation from few-shot demonstrations. Pretrained on a mixture of robot demonstrations, human videos, and simulated data, UMA consistently outperforms state-of-the-art baselines specialized for each inference mode.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Unified Motion-Action Modeling for Heterogeneous Robot Learning

1. Core Contribution

UMA introduces 3D object motion trajectories as a unified intermediate representation that bridges visuomotor control and dynamics modeling within a single model. The key insight is that object motion and robot actions are co-evolving variables that can be jointly modeled under a masked generative objective, where different mask patterns determine different inference modes (action generation, dynamics prediction, task adaptation) from the same pretrained weights.

The paper solves a concrete problem: existing motion-based methods are siloed—motion-conditioned policies require paired motion-action labels and cannot use action-free video, while motion-prediction models need separate controllers and cannot incorporate action labels. UMA unifies both directions through masked autoencoding over motion and action token sequences, enabling heterogeneous data sources (human videos, real robot data, simulated data) to contribute supervision proportional to what labels they carry.
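As a rough illustration of how masking could let each data source contribute only the supervision it carries, the sketch below maps sources to the label groups they provide and computes which masked groups can receive a loss. The group names, the `SOURCE_LABELS` table, and the `supervised_groups` helper are hypothetical, chosen for illustration; they are not taken from the paper's implementation.

```python
# Hypothetical label inventory per data source, following the description above:
# human videos are action-free, robot and simulated data carry both labels.
SOURCE_LABELS = {
    "human_video": {"motion"},
    "robot_demo": {"motion", "action"},
    "simulation": {"motion", "action"},
}


def supervised_groups(source, masked_groups):
    """Return the masked token groups that can actually be supervised.

    A group contributes a loss term only if it was masked (so the model
    must generate it) and the data source provides a label for it.
    """
    return set(masked_groups) & SOURCE_LABELS[source]


for source in SOURCE_LABELS:
    print(source, sorted(supervised_groups(source, {"motion", "action"})))
```

Under these assumptions, an action-free human video still supervises the motion tokens, while the action term is simply absent from its loss.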

Three specific innovations stand out: (1) extending hindsight experience replay from single-state goals to motion contexts, eliminating the need for manual task annotations during pretraining; (2) a contrastive objective that disentangles task intent from scene geometry in the learned task latent; and (3) Masked DiT blocks with dual adaptive layer normalization that separately modulate target (denoised) and given (conditioning) tokens.
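The review does not spell out the form of the contrastive objective. One common choice for objectives of this kind is InfoNCE; the sketch below implements that generic loss and is not necessarily the loss used in UMA. Treating "same task, different scene" as the positive pair and "different task" as negatives is an assumption made here for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor task latent.

    anchor, positive: latents assumed to encode the same task in different scenes.
    negatives: latents assumed to encode other tasks.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]


aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, misaligned)
```

The loss is small when the anchor is closer to its positive than to any negative, which is the pressure that would push scene-specific information out of the task latent.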

2. Methodological Rigor

The experimental design is reasonably thorough. The paper evaluates on three real-world tasks (insertion, sweeping, folding) spanning rigid objects, tool use, and deformable materials—covering meaningfully different manipulation challenges. Each task uses 20 randomized trials. Comparisons are made against appropriate baselines: COIL for motion-conditioned control, UVA for joint video-action modeling, PointWorld for dynamics prediction, and π0.5 for few-shot adaptation.

The ablation study is well-structured, addressing data source contributions (removing simulation vs. human videos), the contrastive objective, and architectural choices (dense attention, separate heads, action-only). The finding that simulation primarily contributes motion-action coupling while human videos contribute task diversity is an informative decomposition.

However, there are methodological concerns. The evaluation scale is relatively small—20 trials per task for 3 tasks is modest for drawing strong conclusions. The simulated ablations use 100 episodes but only 3 tasks. The paper lacks statistical significance testing or confidence intervals. The claim of "consistently outperforms state-of-the-art baselines specialized for each inference mode" is strong given the limited evaluation scope. The scalability analysis (Appendix E) shows promising trends but with only 6 configurations.
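To make the concern about 20-trial evaluations concrete, the sketch below computes a Wilson score interval for a binomial success rate. The count of 16 successes is a made-up example, not a number from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / trials + z * z / (4 * trials * trials)
    ) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: 16 successes in 20 trials.
low, high = wilson_interval(16, 20)
print(f"{low:.3f} to {high:.3f}")
```

For this hypothetical 80% success rate the interval runs from roughly 0.58 to 0.92, so two methods that differ by 10 to 15 points over 20 trials cannot be cleanly separated without more runs.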

The data pipeline relies on a cascade of off-the-shelf models (MegaSaM, UniDepth, SAM3, TAPIP3D, GPT-5-nano), making the approach dependent on the quality of these components. Error propagation through this pipeline is not analyzed.

3. Potential Impact

Direct impact on robot learning: The unified treatment of motion and action could shift how the field thinks about data utilization. The ability to incorporate action-free human videos alongside robot demonstrations addresses a real data bottleneck. The soft prompt tuning mechanism for task adaptation (optimizing only the task latent) is practically appealing for deployment scenarios.
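The idea of adapting by optimizing only the task latent can be sketched with a toy model: a fixed linear map stands in for the frozen pretrained network, and gradient descent updates the latent alone. The linear model, the `adapt_task_latent` helper, and all numbers are illustrative assumptions and bear no relation to UMA's actual architecture.

```python
def adapt_task_latent(frozen_w, demos, latent_dim, lr=0.1, steps=200):
    """Fit a task latent z so that frozen_w @ z matches the demo targets.

    frozen_w: list of rows (output_dim x latent_dim); never modified.
    demos: list of target output vectors from few-shot demonstrations.
    """
    z = [0.0] * latent_dim
    for _ in range(steps):
        grad = [0.0] * latent_dim
        for target in demos:
            pred = [sum(w * zi for w, zi in zip(row, z)) for row in frozen_w]
            err = [p - t for p, t in zip(pred, target)]
            # Gradient of the mean squared error with respect to z only.
            for j in range(latent_dim):
                grad[j] += 2 * sum(e * row[j] for e, row in zip(err, frozen_w)) / len(demos)
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z


frozen_w = [[1.0, 0.0], [0.0, 2.0]]  # stand-in for frozen pretrained weights
demos = [[1.0, 2.0]]                 # one hypothetical demonstration target
z = adapt_task_latent(frozen_w, demos, latent_dim=2)
print(z)
```

Only `z` changes during adaptation, so the number of tuned parameters equals the latent dimension regardless of how large the frozen model is.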

Representation design: The argument for 3D object motion as an embodiment-agnostic intermediate representation is compelling and could influence future architecture designs. Unlike pixel-space world models, motion trajectories share the coordinate frame with robot actions and are more compact.

Cross-domain transfer: The heterogeneous training paradigm—where different data sources contribute different loss terms through masking—is a clean formulation that could generalize beyond this specific architecture.

Limitations on broader impact: The approach still requires calibrated camera-to-base extrinsics for deployment, limiting plug-and-play usage. The single-observation conditioning (no temporal history) is acknowledged as a limitation. New embodiments still require embodiment-specific action heads with compatible robot data, which partially undermines the cross-embodiment promise.
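The calibration requirement amounts to knowing the rigid transform that maps camera-frame points into the robot base frame. A minimal sketch of that mapping, with made-up extrinsics and track points, could look like this:

```python
def camera_to_base(points_cam, rotation, translation):
    """Apply p_base = R * p_cam + t to each 3D point."""
    points_base = []
    for p in points_cam:
        points_base.append(tuple(
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        ))
    return points_base


# Hypothetical extrinsics: camera rotated 90 degrees about the base z-axis,
# offset 0.5 m along base x and 1.0 m along base z.
R = [[0.0, -1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.5, 0.0, 1.0]

# Two consecutive positions of one tracked object point, in the camera frame.
track_cam = [(1.0, 0.0, 0.0), (1.0, 0.1, 0.0)]
track_base = camera_to_base(track_cam, R, t)
print(track_base)
```

An error in `R` or `t` shifts every motion point relative to the robot's action frame, which is why uncalibrated cameras cannot be dropped in directly.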

4. Timeliness & Relevance

This work arrives at a moment when the field is actively debating the right representation for robot foundation models—between end-to-end VLA approaches, video world models, and structured intermediate representations. UMA provides evidence that structured 3D motion representations can outperform both pixel-based joint models (UVA) and language-conditioned VLAs (π0.5) in manipulation settings, contributing a data point to this ongoing debate.

The problem of leveraging heterogeneous data sources (human videos, cross-embodiment data, simulation) is among the most pressing in robot learning, and the mask-based formulation provides an elegant solution to the missing-modality problem.

5. Strengths & Limitations

Strengths:

Elegant unification through masking: the same architecture and weights serve three distinct inference modes

Principled approach to heterogeneous data utilization where each source contributes what it can

Hindsight motion relabeling eliminates manual task annotation, improving scalability

The contrastive objective's role is convincingly demonstrated (30-60% drop without it in few-shot)

The cross-domain analysis reveals complementary roles of data sources

Multimodal task conditioning (language, goal images) emerges naturally from the disentangled task latent

Limitations:

Evaluation scale is small (3 real tasks, 20 trials each)

No comparison with recent VLA models beyond π0.5 on the few-shot setting

The data pipeline complexity (5+ external models) raises reproducibility and robustness concerns

Camera-frame representation requires extrinsic calibration

The 82% execution failure rate suggests motion-action coupling still needs significant improvement

Limited analysis of when/why dynamics prediction helps versus direct policy execution

Overall Assessment

UMA presents a conceptually clean framework that addresses a genuine gap in robot learning—unifying dynamics modeling and visuomotor control through 3D object motion. The masking-based approach to heterogeneous data is elegant and the experimental evidence, while limited in scale, is consistent across settings. The main contribution is architectural and representational rather than achieving dramatic performance numbers, but it opens a promising research direction. The paper would be strengthened by larger-scale evaluation and more rigorous statistical analysis.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7.8 · Clarity 8

Generated Jun 16, 2026

Comparison History (36)

Lost vs. Qwen-RobotManip Technical Report: Alignment Unlocks Scale for Robotic Manipulation Foundation Models

Qwen-RobotManip demonstrates higher potential impact due to its massive scale (38,100-hour pretraining corpus across 15 platforms), comprehensive alignment framework spanning representation, motion, and behavioral dimensions, and strong empirical results substantially outperforming state-of-the-art including π0.5 across multiple benchmarks and real-robot platforms. While UMA presents an elegant unified motion-action interface with novel masked generative training, Qwen-RobotManip's scale, breadth of validation, emergent generalization capabilities, and practical demonstration across diverse real robots position it for broader impact on the robotics foundation model field.

claude-opus-4-6 · Jun 17, 2026

Lost vs. EgoInfinity: A Web-Scale 4D Hand-Object Interaction Data Engine for Any-View Robot Retargeting and Video-to-Action Robot Learning

EgoInfinity addresses a fundamental bottleneck in robot learning—converting internet-scale human videos into actionable robot training data—with a comprehensive, modular engine covering perception through real-robot execution. Its web-scale approach, cross-embodiment generalization, and real-robot validation across diverse tasks (grasping, cutting, wiping, pouring) suggest broader practical impact. While UMA offers an elegant unified framework for heterogeneous robot learning via masked generative objectives, EgoInfinity's data engine paradigm has greater potential to unlock open-world robot learning at scale, addressing the critical data scarcity problem with demonstrated real-world results.

claude-opus-4-6 · Jun 17, 2026

Won vs. ALAM: Algebraically Consistent Latent Transitions for Vision-Language-Action Models

UMA introduces a highly novel, unified framework using 3D object motion trajectories as a universal interface for heterogeneous robot learning. This approach elegantly bridges the gap between diverse data sources (human videos, simulation, different robot morphologies) without requiring manual annotations. By providing a scalable way to leverage diverse, uncurated data for both control and dynamics modeling, it addresses a fundamental bottleneck in embodied AI, likely leading to broader adoption and greater long-term impact across the field than ALAM's specific regularization method.