Haoqi Yuan, Zhixuan Liang, Anzhe Chen, Ye Wang, Haoyang Li, Pei Lin, Yiyang Huang, Zixing Lei
Foundation models in language and multimodality achieve strong generalization by aligning heterogeneous data under a unified formulation and training at scale. In this report, we investigate whether this scaling recipe can be applied to robotic manipulation to achieve genuine generalization. This is challenging because, unlike text, manipulation data is heterogeneous by nature, expensive to collect, and narrow in diversity, making alignment and scale simultaneously difficult. We present Qwen-RobotManip, a generalizable Vision-Language-Action foundation model built on Qwen-VL. Qwen-RobotManip introduces a unified alignment framework across the representation, motion, and behavioral dimensions of manipulation, making large-scale multi-source training coherent rather than conflicting. This alignment capability in turn enables Qwen-RobotManip to absorb manipulation data at a scale that prior training regimes could not sustain. A human-to-robot synthesis pipeline converts egocentric hand demonstrations into robot trajectories across 15 platforms, and a rigorous curation pipeline harmonizes heterogeneous datasets. Using only open-source datasets and human videos without proprietary data collection, Qwen-RobotManip constructs a ~38,100-hour pretraining corpus and exhibits emergent generalization capabilities, including zero-shot instruction following, robustness to perturbations, reactive error recovery, and cross-embodiment transfer. We find that standard benchmarks fail to capture pretraining quality and instead adopt OOD settings including RoboCasa365, LIBERO-Plus, EBench, RoboTwin-Clean2Rand, RoboTwin-IF, and RoboTwin-XE. Qwen-RobotManip substantially outperforms prior state-of-the-art models, including π0.5, across all OOD settings, ranks 1st in RoboChallenge with a 20% relative improvement, and is validated on real-robot platforms including AgileX ALOHA, Franka, UR, and ARX.
Qwen-RobotManip presents a Vision-Language-Action (VLA) foundation model built on Qwen-VL that addresses the fundamental tension between alignment and scale in robotic manipulation. The central thesis—that alignment must precede scale for cross-embodiment training—is articulated through three complementary mechanisms: (1) a canonical 80-dimensional state-action representation with per-dimension binary masking, (2) camera-frame delta pose parameterization that makes visually similar motions numerically proximate across embodiments, and (3) an in-context policy adaptation mechanism using intra-episode execution history as an implicit embodiment identifier.
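To make the first mechanism concrete, the following is a minimal sketch of a canonical vector with a per-dimension binary mask. Only the 80-dimensional size and the idea of masking come from the text above; the slot layout in `SLOTS`, the embodiment names, and the masked MSE loss are illustrative assumptions, not the paper's actual design.

```python
from typing import Dict, List, Tuple

CANONICAL_DIM = 80  # from the review: canonical 80-dimensional state-action vector

# HYPOTHETICAL slot layout: which canonical indices each embodiment fills.
# The real layout is not given in the text above.
SLOTS: Dict[str, List[int]] = {
    "single_arm_7dof": list(range(0, 7)) + [14],     # 7 joints + 1 gripper
    "bimanual_7dof": list(range(0, 14)) + [14, 15],  # 2 x 7 joints + 2 grippers
}


def to_canonical(embodiment: str, values: List[float]) -> Tuple[List[float], List[int]]:
    """Scatter an embodiment-specific vector into the canonical vector.

    Returns (vector, mask); mask[i] == 1 marks a dimension that is defined
    for this embodiment, and 0 marks padding.
    """
    slots = SLOTS[embodiment]
    if len(values) != len(slots):
        raise ValueError(f"expected {len(slots)} values, got {len(values)}")
    vector = [0.0] * CANONICAL_DIM
    mask = [0] * CANONICAL_DIM
    for index, value in zip(slots, values):
        vector[index] = value
        mask[index] = 1
    return vector, mask


def masked_mse(pred: List[float], target: List[float], mask: List[int]) -> float:
    """Mean squared error over valid dimensions only (assumed loss form)."""
    valid = sum(mask)
    if valid == 0:
        return 0.0
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    return total / valid
```

With a mask like this, padded dimensions contribute nothing to the loss, so embodiments with different degrees of freedom can share one output head without penalizing predictions on dimensions they do not have.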
The paper also contributes a human-to-robot synthesis pipeline that converts egocentric hand demonstrations into robot trajectories across 15 platforms, enabling construction of a ~38,100-hour pretraining corpus from entirely open-source data. This is significant: it demonstrates that proprietary data collection may not be the bottleneck for manipulation foundation models, provided proper synthesis and curation infrastructure exists.
The technical approach is methodologically thorough. The five-stage data curation pipeline (sudden change detection, state-action trend alignment, extreme value filtering, FK consistency, base frame alignment) plus three cross-modal checks represents one of the most comprehensive data quality pipelines published for robotic learning. The discovery that 81% of RoboMIND UR-type episodes failed state-action trend alignment illustrates the severity of data quality issues in existing corpora.
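Two of the curation stages named above, sudden change detection and state-action trend alignment, can be sketched for a single joint as follows. This is an assumed reading of those stage names, not the paper's implementation: the thresholds `max_step` and `min_agreement`, and the convention that `actions[t]` is the commanded target for `states[t + 1]`, are hypothetical.

```python
from typing import List


def has_sudden_change(series: List[float], max_step: float) -> bool:
    """Flag a series whose consecutive samples jump by more than max_step."""
    return any(abs(b - a) > max_step for a, b in zip(series, series[1:]))


def trend_agreement(states: List[float], actions: List[float], eps: float = 1e-6) -> float:
    """Fraction of steps where the state moves in the direction the action implies.

    ASSUMPTION: actions[t] is the commanded target for states[t + 1], so the
    commanded change is actions[t] - states[t]. Steps with a negligible
    commanded change are skipped.
    """
    agree = 0
    counted = 0
    for t in range(len(states) - 1):
        commanded = actions[t] - states[t]
        observed = states[t + 1] - states[t]
        if abs(commanded) < eps:
            continue
        counted += 1
        if commanded * observed > 0:
            agree += 1
    return agree / counted if counted else 1.0


def keep_episode(
    states: List[float],
    actions: List[float],
    max_step: float = 0.5,       # hypothetical threshold
    min_agreement: float = 0.8,  # hypothetical threshold
) -> bool:
    """Keep an episode only if it passes both checks."""
    if has_sudden_change(states, max_step) or has_sudden_change(actions, max_step):
        return False
    return trend_agreement(states, actions) >= min_agreement
```

An episode whose recorded actions consistently point away from the observed state motion, for example because of a sign or frame mismatch in the logs, would fail the second check, which is the kind of failure the 81% figure above describes.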
The ablation studies are particularly well-designed. The data scaling experiment (Figures 18-19), which trains on nested subsets from 1% to 100% of the data, provides convincing evidence that aligned representations exhibit log-linear scaling while misaligned ones do not. The controlled comparison of three action-space designs, on both validation MSE and downstream task performance, strengthens the causal claim that the alignment framework is necessary.
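One way to test a log-linear scaling claim is to fit loss against the logarithm of the data fraction and inspect the slope and goodness of fit. The sketch below does this with ordinary least squares. The function name is hypothetical, and any numbers passed to it in this note are made up for illustration, not results from the paper.

```python
import math
from typing import List, Tuple


def fit_log_linear(fractions: List[float], losses: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit of loss = intercept + slope * log10(fraction).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(f) for f in fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Made-up example values (NOT from the paper): data fractions 1%, 10%, 100%.
slope, intercept, r_squared = fit_log_linear([0.01, 0.1, 1.0], [0.30, 0.20, 0.10])
```

A clearly negative slope with an R² near 1 is what log-linear scaling would look like under this fit; a curve that plateaus or fluctuates would show a slope near zero or a low R².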
However, some methodological concerns exist. The stochastic context sampling strategy, while well-motivated, lacks formal analysis of convergence properties. The claim of "emergent" capabilities (retry behavior, error recovery) is largely observational rather than rigorously quantified. The 9:1 ratio of robot-to-VL data and λ=0.1 loss weighting appear to be heuristically chosen without systematic justification.
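For readers who want to see what these two hyperparameters control, here is a minimal sketch of a 9:1 source sampler and a weighted loss. The review does not say which loss term λ multiplies, so applying it to the vision-language term is an assumption made purely for illustration, and the function names are hypothetical.

```python
import random
from typing import List

ROBOT_TO_VL = (9, 1)  # sampling ratio quoted in the review
LAMBDA = 0.1          # loss weight quoted in the review


def sample_sources(num_samples: int, seed: int = 0) -> List[str]:
    """Draw source labels so robot:VL samples follow the 9:1 ratio in expectation."""
    rng = random.Random(seed)
    return rng.choices(["robot", "vl"], weights=ROBOT_TO_VL, k=num_samples)


def total_loss(action_loss: float, vl_loss: float, lam: float = LAMBDA) -> float:
    """Combine the two training losses.

    ASSUMPTION: lambda scales the vision-language term. The review does not
    state which term it weights.
    """
    return action_loss + lam * vl_loss
```

A systematic justification of the kind the review asks for would sweep both values, for example mixing ratios from 1:1 to 19:1 and λ across an order of magnitude, and report downstream success for each setting.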
Immediate impact on VLA research: The paper's strongest contribution may be its critique of evaluation methodology. The demonstration that models without pretraining match pretrained ones on standard benchmarks (Figure 4) is a wake-up call for the field. The proposed OOD benchmarks—RoboTwin-IF for instruction following and RoboTwin-XE for cross-embodiment transfer—address genuine gaps.
Human-to-robot synthesis at scale: Converting ~1,933 hours of egocentric video into ~24,808 hours across 15 embodiments provides a template for scalable data generation. The pipeline components (retargeting, SAM3 segmentation, ProPainter inpainting, base pose optimization, depth compositing) are individually reproducible.
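The quoted hour counts imply an expansion factor that is easy to check. The short calculation below uses only the figures stated in this review; treating the synthesized hours as a subset of the ~38,100-hour corpus is an assumption.

```python
human_hours = 1_933         # egocentric video hours quoted above
synthesized_hours = 24_808  # robot trajectory hours quoted above
corpus_hours = 38_100       # total pretraining corpus quoted above
num_embodiments = 15

# Robot-hours produced per human-hour.
expansion = synthesized_hours / human_hours

# Upper bound if every clip were retargeted to every embodiment.
full_fanout_hours = human_hours * num_embodiments

# Share of the corpus, ASSUMING all synthesized hours are counted in it.
corpus_share = synthesized_hours / corpus_hours

print(f"expansion factor: {expansion:.1f}x")            # 12.8x
print(f"full fan-out bound: {full_fanout_hours} hours")  # 28995 hours
print(f"share of corpus: {corpus_share:.1%}")            # 65.1%
```

The expansion factor of about 12.8× sits below the 15× that a full fan-out would give, which suggests that not every clip yields a usable trajectory on every embodiment, although the review does not state the reason.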
Cross-embodiment transfer: The camera-frame delta pose representation enabling 3.2× improvement over π0.5 in zero-shot cross-embodiment transfer (23.9% vs 7.5%) could influence how future VLA models parameterize actions.
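The intuition behind a camera-frame delta can be illustrated with translation only. In the sketch below, two robots with differently oriented base frames make the same motion as seen by the camera: their base-frame deltas differ, but their camera-frame deltas coincide. The function names, the example poses, and the omission of the rotational part of the pose are simplifying assumptions, not the paper's parameterization.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def rot_z(theta: float) -> Matrix:
    """Rotation matrix for an angle theta (radians) about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(m: Matrix) -> Matrix:
    return [list(row) for row in zip(*m)]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def camera_frame_delta(r_base_cam: Matrix, p_now_base: Vector, p_next_base: Vector) -> Vector:
    """Express the end-effector displacement in the camera frame.

    r_base_cam is the camera orientation expressed in the robot base frame,
    so its transpose maps base-frame vectors into the camera frame.
    """
    delta_base = [b - a for a, b in zip(p_now_base, p_next_base)]
    return matvec(transpose(r_base_cam), delta_base)


# Robot A: base frame aligned with the camera (hypothetical poses, metres).
delta_a = camera_frame_delta(rot_z(0.0), [0.30, 0.00, 0.20], [0.35, 0.00, 0.20])

# Robot B: camera rotated 90 degrees about z relative to the base, so the
# same on-screen motion is a +y move in B's base frame.
delta_b = camera_frame_delta(rot_z(math.pi / 2), [0.10, 0.40, 0.20], [0.10, 0.45, 0.20])
```

Both `delta_a` and `delta_b` come out as roughly 5 cm along the camera x-axis, which is the sense in which visually similar motions become numerically close across embodiments.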
Real-world validation: Deployment across four physical platforms (AgileX ALOHA, Franka, UR, ARX) with 88.6% ID and 87.5% OOD success rates on ALOHA, plus first place on RoboChallenge Table30-v1, provides credible evidence of practical utility.
This work arrives at a critical inflection point for VLA models. The field has seen rapid model releases (π0, π0.5, GR00T, StarVLA) but growing concern about whether reported progress reflects genuine generalization. By systematically demonstrating that standard benchmarks fail to capture pretraining quality, and by showing that alignment is a prerequisite for—not merely complementary to—data scaling, the paper addresses perhaps the most pressing conceptual bottleneck in the field.
The open-source data approach is particularly timely given ongoing debates about data access and proprietary advantages in robotics.
The dual-stream co-training that prevents catastrophic forgetting is not novel per se (it is cited from prior work), but its systematic validation across training stages (pre-training vs. post-training) with clear ablations adds value. The observation that, with camera-frame alignment, the model performs better under background randomization than on clean backgrounds (74.6% vs 73.2%) is insightful: it suggests that the pretraining distribution has shifted the model's sense of "normalcy" away from artificial cleanliness.
The paper's scale (44 pages, comprehensive experiments) reflects an industry lab effort that academic groups would struggle to replicate, though the open-source data commitment partially democratizes access.
Generated Jun 17, 2026
Paper 2 (Qwen-VLA) likely has higher scientific impact because it proposes a broader unification: a single VLA framework spanning manipulation, navigation, and trajectory prediction across environments and embodiments, with explicit embodiment-aware prompting and a DiT-based continuous action decoder. This wider task scope increases cross-field relevance (robotics, embodied AI, VL reasoning, generative trajectory modeling) and real-world applicability beyond manipulation alone. While Paper 1 is strong and rigorous for scalable manipulation alignment, Paper 2’s generality and multi-domain benchmarks suggest greater breadth and longer-term influence.
Qwen-RobotManip demonstrates broader scientific impact through its unified VLA foundation model achieving state-of-the-art results across multiple benchmarks and real-robot platforms. Its alignment framework across representation, motion, and behavioral dimensions addresses a fundamental challenge in scaling robotic manipulation. The 38,100-hour pretraining corpus, emergent generalization capabilities (zero-shot instruction following, error recovery, cross-embodiment transfer), and substantial improvements over strong baselines like π0.5 represent a significant advance. While EgoInfinity presents a valuable data engine for video-to-action conversion, Qwen-RobotManip's end-to-end foundation model approach with demonstrated generalization has broader downstream impact for the robotics community.
Paper 2 introduces a broader alignment framework that enables scaling robotic foundation models across 15 platforms and 38,100 hours of heterogeneous data. Its emergent zero-shot capabilities, cross-embodiment transfer, and extensive benchmarking on out-of-distribution tasks suggest a broader scientific impact and wider applicability than Paper 1's narrower focus on dexterous hands, despite Paper 1's valuable scaling law discovery.
Paper 2 addresses a fundamental bottleneck in AI and robotics—scaling foundation models for physical manipulation across heterogeneous data and diverse embodiments. Its potential to generalize across multiple robotic platforms using human videos and open-source data offers a much broader impact and real-world applicability compared to the niche, albeit impressive, hardware milestone of the microrobot in Paper 1.
Qwen-RobotManip demonstrates higher potential impact due to its massive scale (38,100-hour pretraining corpus across 15 platforms), comprehensive alignment framework spanning representation, motion, and behavioral dimensions, and strong empirical results substantially outperforming state-of-the-art including π0.5 across multiple benchmarks and real-robot platforms. While UMA presents an elegant unified motion-action interface with novel masked generative training, Qwen-RobotManip's scale, breadth of validation, emergent generalization capabilities, and practical demonstration across diverse real robots position it for broader impact on the robotics foundation model field.
Qwen-RobotManip presents a more comprehensive foundation model approach with a unified alignment framework across multiple dimensions, demonstrated generalization across 15 platforms and multiple benchmarks, and a massive 38,100-hour training corpus. It achieves state-of-the-art across all OOD settings and ranks 1st in RoboChallenge. While MolmoB0T makes a strong contribution showing sim-only zero-shot transfer is viable, Qwen-RobotManip's broader scope—covering cross-embodiment transfer, error recovery, and instruction following—alongside its systematic alignment methodology, positions it for greater impact on the field's trajectory toward general-purpose robotic manipulation foundation models.
Paper 2 (Open-H-Embodiment) has higher potential scientific impact due to several factors: (1) It addresses a critical, underserved domain—medical robotics—where data scarcity is a fundamental bottleneck, making it a first-of-its-kind large-scale open dataset spanning 49+ institutions and multiple surgical platforms. (2) It enables two novel foundation models (GR00T-H and Cosmos-H-Surgical-Simulator) with direct clinical relevance. (3) The breadth of impact is exceptional—spanning surgical robotics, world modeling, synthetic data generation, and healthcare access. (4) As open infrastructure for the medical robotics community, it has multiplicative downstream impact. While Paper 1 is impressive in general manipulation, it operates in a more crowded space with incremental advances over existing VLA models.
While Paper 1 presents impressive scaling for general robotic manipulation, Paper 2 tackles a critical bottleneck in a high-stakes, historically siloed domain: medical robotics. By releasing the largest multi-embodiment dataset across 49 institutions, along with the first open VLA and world models for surgery, Paper 2 provides transformative infrastructure that will democratize and accelerate life-saving autonomous medical robotics research.
Paper 2 likely has higher impact: it proposes a unified alignment framework enabling large-scale multi-source training of a vision-language-action foundation model, demonstrates broad emergent capabilities (zero-shot instruction following, recovery, cross-embodiment transfer), and validates across many OOD benchmarks and multiple real robot platforms using mostly open data. This positions it as a general-purpose manipulation foundation model with wide applicability and timeliness. Paper 1 is novel and rigorous for scalable reward modeling with a large dataset, but its scope is narrower (reward learning) and likely affects fewer downstream paradigms than an end-to-end VLA foundation model.
Paper 2 likely has higher scientific impact due to its broader scope and scalability: a unified alignment framework enabling large-scale, multi-source vision-language-action pretraining (~38k hours) across 15 platforms with open datasets, yielding cross-embodiment transfer and strong OOD/generalization results plus multiple real-robot validations. This positions it as a foundation model infrastructure that can influence many downstream robotics tasks and benchmarks. Paper 1 is highly novel in reward-only VLM-driven online RL and robustness to reward hacking, but its impact may be narrower (reward modeling for RL) and less ecosystem-shaping than a scalable manipulation foundation model.