Gaotian Wang, Kejia Ren, Andrew Morgan, Yiting Chen, Howard H. Qian, Podshara Chanrungmaneekul, Kaiyu Hang
Internet videos constitute the largest reservoir of embodied human manipulation knowledge, yet converting arbitrary RGB footage into actionable robot training data remains a major bottleneck. Existing lab- or factory-collected datasets are narrow in scale and diversity, limiting open-world robot learning. Instead of proposing a static dataset, we introduce EgoInfinity, a universal 4D hand-object interaction data engine that enables web-scale data generation for robot retargeting and learning. EgoInfinity is a modular engine integrating perception, segmentation, reconstruction, interaction-aware refinement, and retargeting to automate this traditionally unscalable video-to-action problem without human-in-the-loop annotation. Its modular design lets the engine continuously benefit from advances in any incorporated component. With EgoInfinity, in-the-wild human manipulation videos are lifted into agent-agnostic, metric 4D hand-object representations, including hand trajectories, 6-DoF object poses, and contact-relevant states. Rather than naively connecting standalone components, EgoInfinity combines cross-module metric calibration with interaction-aware refinement to improve physical reliability, reducing drift and contact inconsistencies common in pure visual reconstruction. We further propose a novel motion retargeter that compiles the recovered 3D hand motions into executable joint trajectories for diverse robot morphologies, enabling video-to-action retargeting on any robot from arbitrary viewpoints and shot sizes (e.g., the human body is only partially visible). We validate EgoInfinity across perception fidelity, kinematic feasibility, contact consistency, cross-embodiment generalization, and real-robot skill acquisition (e.g., grasping, cutting, wiping, and pouring), demonstrating a scalable bridge from internet videos to executable robot behavior for open-world robot learning.
EgoInfinity addresses a fundamental bottleneck in robot learning: the conversion of abundant, unstructured internet video into structured, metric, robot-actionable manipulation data. Rather than proposing a fixed dataset, the authors present a modular data engine that chains together perception, segmentation, reconstruction, interaction-aware refinement, and cross-embodiment retargeting modules. The key outputs are agent-agnostic 4D representations (metric hand trajectories, 6-DoF object poses, contact states, and reconstructed object geometry), recovered from arbitrary RGB footage without wearables, depth sensors, or human annotation.
The second major contribution is an SE(3)-equivariant neural root-frame estimator for cross-embodiment retargeting. Using vector neuron layers and flow-matching conditional generation, it estimates feasible robot root transformations from partial hand observations, enabling retargeting to diverse robot morphologies (humanoids, dual-arm setups, dexterous hands) from videos where only hands are visible.
Strengths in pipeline design: The interaction-aware refinement is a well-motivated design choice. Rather than naively chaining off-the-shelf modules, EgoInfinity introduces cross-module metric calibration (unified scale via MOGE-2, shared camera frame via GeoCalib) and state-dependent pose refinement. The six-state interaction classifier with Schmitt-trigger hysteresis, morphological smoothing, and dominant-hand resolution reflects careful engineering for robustness.
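To illustrate why Schmitt-trigger hysteresis helps here, the following is a minimal sketch of a two-threshold contact detector. It is not the authors' six-state classifier: the binary contact state, the function name `schmitt_contact`, the hand-object distance signal, and both threshold values are assumptions made purely for illustration.

```python
from typing import Iterable, List


def schmitt_contact(distances_m: Iterable[float],
                    enter_below: float = 0.01,
                    exit_above: float = 0.03) -> List[bool]:
    """Label each frame as contact (True) or no contact (False).

    Hypothetical helper: contact starts only when the hand-object distance
    drops below `enter_below`, and ends only when it rises above
    `exit_above`. Values between the two thresholds keep the previous state.
    """
    if enter_below >= exit_above:
        raise ValueError("enter_below must be smaller than exit_above")
    in_contact = False
    labels = []
    for d in distances_m:
        if not in_contact and d < enter_below:
            in_contact = True
        elif in_contact and d > exit_above:
            in_contact = False
        labels.append(in_contact)
    return labels


# Illustrative (made-up) distances in metres that hover around 2 cm.
signal = [0.05, 0.02, 0.008, 0.02, 0.012, 0.025, 0.04, 0.02]
labels = schmitt_contact(signal)
print(labels)
```

A single threshold at 2 cm would toggle the state on almost every frame of this signal, whereas the two-threshold version yields one contiguous contact segment.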
Retargeter architecture: The SE(3)-equivariant design is principled—the Vector Neuron formulation guarantees that camera-frame changes propagate correctly. The flow-matching formulation over SO(3) × ℝ³ elegantly handles the multimodality inherent in root-frame estimation from partial observations. Training entirely in simulation (MuJoCo) with extensive augmentations (tracking noise, occlusion, gravity dropout) is pragmatic.
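The equivariance property referred to above can be checked numerically on a toy layer. The sketch below is not the paper's network: it assumes a bare Vector Neuron style linear layer that mixes channels of 3D vector features, uses random weights, and covers only the rotation part of SE(3). All names are hypothetical.

```python
import numpy as np


def random_rotation(rng: np.random.Generator) -> np.ndarray:
    """Draw a random 3x3 rotation matrix via QR decomposition."""
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]  # flip one axis so that det(q) = +1
    return q


def vn_linear(weight: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Mix channels of vector features.

    features: (C_in, 3) array, one 3D vector per channel.
    weight:   (C_out, C_in) array acting on the channel axis only.
    """
    return weight @ features


rng = np.random.default_rng(0)
V = rng.standard_normal((8, 3))   # 8 input channels of 3D vectors
W = rng.standard_normal((4, 8))   # maps 8 channels to 4 channels
R = random_rotation(rng)

# Rotating the input and then applying the layer ...
rotate_then_layer = vn_linear(W, V @ R.T)
# ... matches applying the layer and then rotating the output.
layer_then_rotate = vn_linear(W, V) @ R.T
equivariant = np.allclose(rotate_then_layer, layer_then_rotate)

# A per-coordinate ReLU, by contrast, does not commute with rotations.
relu_gap = np.abs(np.maximum(V @ R.T, 0) - np.maximum(V, 0) @ R.T).max()
print(equivariant, relu_gap)
```

The layer commutes with rotations because the weights act only on the channel axis and the rotation acts only on the coordinate axis. The elementwise ReLU fails the same check, which is why Vector Neuron networks use specially constructed nonlinearities instead.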
Weaknesses in evaluation: The experimental validation is notably thin relative to the ambition of the claims. The quantitative results (Table 2) report only IK-level metrics (success rate, position/orientation error, manipulability) across three robots, without comparison to any baseline retargeting method. There is no systematic ablation of the interaction-aware refinement against a naive pipeline. The 106-clip curated dataset is tiny relative to the "web-scale" framing. Real-robot experiments demonstrate only a handful of skills (grasping, cutting, wiping, pouring) with no quantitative success rates, no comparison to teleoperation baselines, and no statistical significance analysis. The grasping policy experiments (Fig. 12) show only three object instances with qualitative rollouts.
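For context on the manipulability metric mentioned above, a common choice is the Yoshikawa measure, w = sqrt(det(J Jᵀ)), which equals the product of the Jacobian's singular values when J has at least as many columns as rows. The review does not state which definition the paper uses, so the sketch below is only an assumption, shown on a hypothetical planar two-link arm rather than any of the evaluated robots.

```python
import numpy as np


def planar_2link_jacobian(q1: float, q2: float,
                          l1: float = 1.0, l2: float = 1.0) -> np.ndarray:
    """Position Jacobian of a planar two-link arm (hypothetical example)."""
    s1, c1 = np.sin(q1), np.cos(q1)
    s12, c12 = np.sin(q1 + q2), np.cos(q1 + q2)
    return np.array([[-l1 * s1 - l2 * s12, -l2 * s12],
                     [l1 * c1 + l2 * c12, l2 * c12]])


def yoshikawa_manipulability(jacobian: np.ndarray) -> float:
    """Yoshikawa measure, computed as the product of singular values.

    Valid for Jacobians with rows <= columns, as in this example.
    """
    return float(np.prod(np.linalg.svd(jacobian, compute_uv=False)))


# Elbow bent at 90 degrees: far from singular.
w_bent = yoshikawa_manipulability(planar_2link_jacobian(0.3, np.pi / 2))
# Arm fully stretched: singular, so manipulability collapses to zero.
w_straight = yoshikawa_manipulability(planar_2link_jacobian(0.3, 0.0))
print(w_bent, w_straight)
```

For this arm the measure reduces to |l1 · l2 · sin(q2)|, so it peaks with the elbow at 90 degrees and vanishes when the arm is fully extended. A retargeted trajectory with low manipulability therefore sits close to a kinematic singularity.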
The paper also lacks any quantitative evaluation of reconstruction accuracy against ground truth (e.g., on HOT3D or OakInk2 benchmarks where ground truth exists), which would be the most direct way to validate the engine's outputs.
High-impact direction: The vision of converting internet video at scale into robot training data is compelling and timely. If the engine delivers on its promise, it could dramatically reduce the data bottleneck for robot learning. The modular, continuously upgradeable design is strategically sound—as foundation models improve, the engine improves without redesign.
Practical limitations temper impact: The static-camera assumption excludes the vast majority of internet video (handheld, body-mounted, dynamic filming). The paper acknowledges this limitation, but it still significantly constrains actual web-scale applicability. The lack of precise contact-level alignment, tactile information, and force consistency means the outputs may be insufficient for the contact-rich manipulation tasks that are the frontier of robot learning. The retargeter requires per-robot retraining, adding friction to adoption.
Downstream utility is underexplored: The paper's strongest potential impact would be demonstrated through a large-scale policy learning experiment (e.g., training a generalist manipulation policy on thousands of EgoInfinity-processed videos and showing improved performance over lab-only data). This experiment is absent, leaving the scalability claim largely aspirational.
The paper is well-timed. The convergence of strong monocular depth (MOGE-2), open-vocabulary segmentation (SAM-3), hand reconstruction (WiLOR), and object reconstruction (SAM-3D) makes this integration feasible now in a way it wasn't two years ago. The community's push toward generalist robot policies (RT-X, Open X-Embodiment) creates demand for diverse, scalable data sources. The "data engine" framing (vs. static dataset) aligns with the emerging paradigm shift in how the field thinks about training data.
EgoInfinity presents an ambitious, well-engineered system for a genuinely important problem. The modular pipeline architecture and equivariant retargeter are technically sound contributions. However, the paper's impact is limited by a significant gap between its ambitious claims (web-scale, any-view, open-world) and the modest experimental validation provided. The work would be substantially strengthened by ground-truth benchmarking, baseline comparisons, and—most critically—a demonstration that scaled engine output actually improves downstream robot learning performance. As presented, it is a promising infrastructure paper with solid engineering but incomplete validation.
Generated Jun 17, 2026
Qwen-RobotManip demonstrates broader scientific impact through its unified VLA foundation model achieving state-of-the-art results across multiple benchmarks and real-robot platforms. Its alignment framework across representation, motion, and behavioral dimensions addresses a fundamental challenge in scaling robotic manipulation. The 38,100-hour pretraining corpus, emergent generalization capabilities (zero-shot instruction following, error recovery, cross-embodiment transfer), and substantial improvements over strong baselines like π0.5 represent a significant advance. While EgoInfinity presents a valuable data engine for video-to-action conversion, Qwen-RobotManip's end-to-end foundation model approach with demonstrated generalization has broader downstream impact for the robotics community.
EgoInfinity addresses a fundamental bottleneck in robot learning—converting internet-scale human videos into actionable robot training data—with a comprehensive, modular engine covering perception through real-robot execution. Its web-scale approach, cross-embodiment generalization, and real-robot validation across diverse tasks (grasping, cutting, wiping, pouring) suggest broader practical impact. While UMA offers an elegant unified framework for heterogeneous robot learning via masked generative objectives, EgoInfinity's data engine paradigm has greater potential to unlock open-world robot learning at scale, addressing the critical data scarcity problem with demonstrated real-world results.
Paper 1 (EgoInfinity) addresses a fundamental bottleneck in embodied AI, the scarcity of scalable, diverse robot training data, by unlocking web-scale internet videos for robot learning. This has the potential to revolutionize general-purpose robotics, much as large-scale datasets transformed NLP and computer vision. While Paper 2 represents a significant milestone in microrobotics hardware and control, Paper 1's software-driven approach offers broader, more immediate impacts across the entire field of open-world robot learning and cross-embodiment generalization.
Paper 1 provides a monumental infrastructural contribution by breaking severe data silos in medical robotics. By releasing the largest multi-institution, multi-embodiment dataset alongside pioneering foundation and world models, it directly catalyzes advancements in a high-stakes, life-saving domain. While Paper 2 (EgoInfinity) offers an innovative pipeline for general robot learning, Paper 1 addresses a fundamental, critical bottleneck in healthcare automation, promising profound real-world clinical applications and unprecedented democratization of surgical robotics research.
MolmoBot challenges the fundamental assumption that real-world data is necessary for sim-to-real transfer, demonstrating effective zero-shot manipulation from purely simulated data at scale. Its fully open-source pipeline, 1.8M trajectory dataset, multiple policy architectures, and strong real-world results (79.2% vs π₀.₅'s 39.2%) make it highly impactful. While EgoInfinity is innovative in converting internet videos to robot actions, MolmoBot's paradigm shift—proving simulation alone suffices—has broader implications for democratizing robot learning and could reshape the field's approach to data collection.
Paper 2 (Robometer) likely has higher scientific impact due to its broadly applicable reward-modeling formulation, large-scale dataset (RBM-1M, 1M+ trajectories), and direct relevance to scaling robot learning with abundant failure data, an increasingly central bottleneck for RL and imitation learning. Preference + progress supervision is methodologically clean and can transfer across tasks/embodiments, potentially benefiting many labs beyond those focused on video-to-action pipelines. Paper 1 (EgoInfinity) is innovative and impactful for leveraging internet video, but depends heavily on complex perception/reconstruction components whose reliability may limit adoption and generalization.
Paper 2 (EgoInfinity) presents a universal engine to convert arbitrary internet videos into actionable 4D robot training data. By fully automating the video-to-action pipeline without human-in-the-loop annotations and enabling cross-embodiment retargeting from any viewpoint, it fundamentally addresses the robotic data bottleneck. While Paper 1 demonstrates impressive scaling laws on a specific dataset, Paper 2 provides a methodological breakthrough that unlocks the entirety of web video for open-world robot learning, promising broader, more sustained impact across the field.
Paper 2 (EgoInfinity) likely has higher scientific impact due to broader applicability and scalability: a web-scale, modular 4D data engine can continuously improve with component advances and can feed many downstream paradigms (imitation, offline RL, sim-to-real, foundation policy training) across robots and tasks. Its real-world pathway, turning arbitrary internet video into metric hand/object states and executable retargeted trajectories, addresses a central bottleneck for open-world robotics. Paper 1 is highly novel for reward modeling and robust on-robot RL, but is more specific to VLM-based reward shaping and may generalize less broadly than a general-purpose data infrastructure.
Open-H-Embodiment addresses a critical data gap in medical robotics with a massive multi-institutional, multi-embodiment dataset spanning 49+ institutions. It demonstrates two foundation models (GR00T-H and Cosmos-H-Surgical-Simulator) with concrete benchmarks showing meaningful task completion. The medical domain impact is enormous given healthcare's societal importance. While EgoInfinity is innovative in converting internet videos to robot actions, Open-H-Embodiment creates essential research infrastructure for an underserved high-impact domain, establishes community-wide data sharing norms, and enables both policy learning and world modeling—likely catalyzing broader follow-on research.
Paper 2 (EgoInfinity) has higher potential impact due to its web-scale leverage: turning ubiquitous internet RGB video into metric 4D hand-object trajectories and retargetable robot actions addresses a core data bottleneck for open-world robotics. The modular “engine” framing can continuously improve with better perception components and generalizes across robot morphologies and viewpoints, broadening applicability across robotics, vision, and graphics. While Paper 1 is methodologically strong (1B-scale model + standardized 30k-hour dataset) and impactful within foundation-model training, it still depends on curated embodied logs, whereas Paper 2 opens a much larger, timelier data source.