Kevin Yuanbo Wu, Tianxing Zhou, Isaac Tu, Billy Yan, Irmak Guzey, David Fouhey, Dandan Shan, Lerrel Pinto
Humans can grasp objects effortlessly, whereas multi-fingered robots are far from this level of generality. We argue that the most natural source of robot grasping data is from humans, who pick up thousands of objects every day. We present HUG, a flow-matching model that generates diverse human grasps for any user-specified object in a single RGB-D image captured from a stereo camera. Using smart glasses, we first collect 1M-HUGs, an egocentric dataset of human grasps spanning 1M frames (27.8 hrs) and 6,707 object instances across 41 buildings. Next, to model the distribution of natural human grasps, our novel flow-matching model fuses RGB and depth observations to output a grasp parameterized by wrist translation, wrist rotation, and MANO hand pose. Predicted grasps can be retargeted to various robot hands, enabling zero-shot grasping in everyday scenes. To standardize evaluation, we build a new simulated benchmark, HUG-Bench, of 90 unseen objects from five geometric categories and various sizes, with metric-scale 3D meshes. We evaluate HUG in the real world on the 30-object test set of HUG-Bench across multiple stereo cameras, robot embodiments, and household environments. HUG outperforms the state-of-the-art grasping baselines by +23% and +34% on our challenging object set. Code, data, benchmark, checkpoints, and an interactive demo are released on our website: https://grasping.io/
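As a rough illustration of the grasp parameterization and flow-matching sampling described in the abstract, the sketch below integrates a velocity field from Gaussian noise to a grasp vector with Euler steps. It is a minimal sketch under stated assumptions, not the authors' implementation: the 3 + 3 + 45 layout (axis-angle wrist rotation and 45 MANO articulation parameters), the function names, and the closed-form `toy_velocity` field are all hypothetical. In the actual model the velocity would come from a learned network conditioned on the RGB and depth observations.

```python
import random

# Assumed layout of the grasp vector (not confirmed by the source):
# 3 wrist translation + 3 wrist rotation (axis-angle) + 45 MANO articulation values.
TRANS_DIM, ROT_DIM, POSE_DIM = 3, 3, 45
GRASP_DIM = TRANS_DIM + ROT_DIM + POSE_DIM


def toy_velocity(x, t, target):
    """Stand-in for a learned velocity network: straight-line flow toward a fixed target."""
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_grasp(velocity_fn, num_steps=10, seed=0):
    """Draw Gaussian noise and integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(GRASP_DIM)]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays below 1, so the toy field never divides by zero
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def split_grasp(x):
    """Slice the flat vector into the three named components."""
    return {
        "wrist_translation": x[:TRANS_DIM],
        "wrist_rotation": x[TRANS_DIM:TRANS_DIM + ROT_DIM],
        "hand_pose": x[TRANS_DIM + ROT_DIM:],
    }


# Hypothetical target grasp used only to drive the toy velocity field.
target = [0.1, -0.2, 0.5] + [0.0, 0.0, 1.57] + [0.0] * POSE_DIM
grasp = sample_grasp(lambda x, t: toy_velocity(x, t, target))
parts = split_grasp(grasp)
```

With the toy field every sample collapses onto the single target. A trained model would instead map different noise draws to different plausible grasps, which is where the diversity mentioned in the abstract would come from.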
HUG presents a complete pipeline for dexterous robot grasping that is trained *entirely* on human grasp data, with no robot demonstrations and no simulation-generated grasps. The key insight is that humans are the most natural and scalable source of grasping data, and modern egocentric sensors (Aria Gen 2 smart glasses) now provide sufficiently accurate RGB-D and hand tracking to make this practical. The system has three tightly integrated contributions: the 1M-HUGs egocentric dataset, the HUG flow-matching grasp model, and the HUG-Bench evaluation benchmark.
The paradigm shift from "generate grasps in simulation or teleoperate robots" to "observe humans grasping in the wild and retarget" is the paper's central conceptual contribution.
The experimental design is notably thorough. The ablation study systematically isolates the contribution of each component: the 3D loss is shown to be critical (+40 points in test success rate), point painting and cropping each contribute ~15 points, and the dual-modality design is well-motivated by complementary failure modes (RGB-only input produces spatially inaccurate grasps, while point-cloud-only input lacks semantics). The data scaling curve shows no saturation at 1M frames, suggesting the approach is data-bound rather than capacity-bound.
The evaluation protocol is commendably rigorous: the best checkpoint is selected on val objects in simulation and deployed directly on unseen test objects in the real world with no per-object tuning—300 consecutive trials with no retries. The failure mode analysis (Figure 11) adds transparency. The human grasp oracle provides a meaningful ceiling, and the paper honestly reports that it falls short of 100% due to tracking noise and open-loop execution.
One concern is that the baselines are somewhat heterogeneous: Dex1B follows a different paradigm (simulation-generated grasps), and CAP is a parallel-jaw gripper method. A comparison against a method that also learns from human demonstrations but uses a different architecture, or against RL-based dexterous grasping methods evaluated on the same objects, would strengthen the claims. That said, the margins of +23% over Dex1B and +34% over CAP are substantial.
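One way to judge whether a success-rate margin is substantial for a given trial budget is to put a confidence interval around each measured rate. The sketch below computes a Wilson score interval for a binomial success rate. The counts in the usage line are hypothetical and are not taken from the paper, and how the 300 real-world trials are divided among methods is not stated here, so the per-method trial count is an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical example: 150 successes in 300 trials (not a reported result).
low, high = wilson_interval(150, 300)
```

For these hypothetical counts the interval spans roughly 0.44 to 0.56. Two methods evaluated with this many trials each would need to differ by more than the combined interval widths before the gap could be read as clearly outside sampling noise.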
Immediate applications: The zero-shot cross-embodiment transfer (demonstrated on Ability and WUJI hands across xArm and YOR platforms with different cameras) is directly useful for deploying dexterous manipulation in household, logistics, and assistive robotics settings.
Data collection paradigm: The "wear glasses and grasp objects" protocol could fundamentally change how robot learning data is collected. It requires no robot hardware, no teleoperation expertise, and scales to arbitrary environments. If the community adopts this, it could accelerate dexterous manipulation research significantly.
Benchmark contribution: HUG-Bench fills an important gap—most dexterous grasping benchmarks are simulation-only or use a narrow set of YCB-like objects. The 90-object set spanning challenging geometries (articulated, very flat, large) with purchasable real objects and metric meshes could become a standard evaluation suite.
Broader influence: The work connects computer vision (egocentric sensing, hand reconstruction), generative modeling (flow matching), and robotics (retargeting, deployment) in a cohesive pipeline, potentially influencing all three communities.
This paper arrives at an excellent time. The convergence of (a) consumer-grade egocentric sensors with calibrated depth and hand tracking, (b) anthropomorphic robot hands becoming commercially available, and (c) learned retargeting methods maturing creates a window where this approach becomes feasible. The paper explicitly acknowledges and leverages this convergence. The field has been struggling with the sim-to-real gap for dexterous manipulation; learning from real human data sidesteps this entirely.
Scalability: The unsaturated scaling curve is promising—performance should improve with more data, which is cheap to collect. The 10-hour training time on 2 GPUs is reasonable.
HUG represents a compelling paradigm shift in dexterous grasping: replacing simulation-heavy or teleoperation-heavy pipelines with scalable human observation. The execution is thorough across dataset, method, and benchmark contributions, with strong real-world results and honest limitations analysis. The open-source release of the full stack maximizes potential for adoption.
Generated Jun 16, 2026
World Engine addresses a fundamental safety limitation in autonomous driving—the scarcity of long-tail safety-critical events—through a novel post-training paradigm using synthesized interactions. Its demonstrated real-world deployment on a production-scale system with measurable on-road improvements gives it exceptional practical impact. The concept of post-training for embodied AI parallels the transformative RLHF paradigm in LLMs, suggesting broad methodological influence. While HUG is a strong contribution to robotic grasping with impressive data collection, World Engine's potential to reshape autonomous driving safety practices and its paradigm-level framing give it higher estimated impact.
Paper 2 offers higher potential impact due to its introduction of a massive dataset (1M-HUGs) and a standardized benchmark (HUG-Bench). While Paper 1 presents a strong algorithmic innovation for VLA models, creating novel, large-scale datasets and standardized benchmarks historically catalyzes broader community adoption, drives subsequent research, and yields higher citation counts. Furthermore, Paper 2's approach of zero-shot retargeting human grasps to multiple diverse robot embodiments provides highly versatile real-world applicability across various hardware platforms.
Paper 1 addresses the fundamental robotics challenge of universal grasping by introducing a massive, novel egocentric dataset (1M-HUGs) and a flow-matching model capable of zero-shot transfer to various robot hands. The release of a large-scale dataset, benchmark, and interactive demo is highly likely to catalyze broad follow-on research. While Paper 2 offers significant methodological advancements in scaling influence functions for VLA data curation, Paper 1's contribution to universally applicable grasping and data resources gives it a wider potential impact across the robotics community.
Paper 2 likely has higher impact due to broader applicability and stronger ecosystem contribution: a large-scale (1M frames) real-world human grasp dataset, a novel flow-matching grasp generator, a standardized benchmark (HUG-Bench), and public release of code/data/models. This can accelerate progress across robotics, computer vision, representation learning, and sim-to-real manipulation. Paper 1 is technically solid and relevant but targets a narrower capability (quadrotor fall recovery with bidirectional thrust) and is less likely to generalize across many tasks or communities compared to a foundational dataset+benchmark+model for universal grasping.
Paper 2 has higher potential impact due to stronger novelty (flow-matching for human-grasp distribution modeling), larger and broadly useful resources (1M-frame egocentric dataset + standardized benchmark + checkpoints), and wider cross-field relevance (robotics manipulation, human–robot interaction, vision, generative modeling). It demonstrates real-world generalization across cameras, robot hands, and environments with substantial gains over baselines, suggesting methodological rigor and practical applicability. Paper 1 is valuable for autonomous driving simulation pipelines, but its scope is narrower and the reported accuracy gains are more incremental relative to existing mapping/HD map generation work.
Paper 2 presents a massive new egocentric dataset (1M frames) and a novel flow-matching model for human universal grasping, outperforming SOTA by significant margins. The release of the dataset, benchmark, and code will likely drive extensive follow-up research in robotic manipulation. Paper 1 is a valuable infrastructure toolchain, but Paper 2 offers a more foundational breakthrough in zero-shot grasping with broader immediate utility to the community.
Paper 2 (HUG) has higher potential scientific impact due to several factors: (1) It addresses a fundamental robotics challenge—universal grasping—with a novel human-centric data collection paradigm using smart glasses, which is highly innovative. (2) The scale of contribution is substantial: a 1M-frame dataset, a new benchmark (HUG-Bench), a flow-matching model, and cross-embodiment transfer. (3) It demonstrates strong real-world results (+23-34% over SOTA) across multiple robot hands and environments. (4) The breadth of impact spans computer vision, robotics, and human-robot interaction. (5) Open-sourced assets maximize community adoption. Paper 1 is rigorous but more incremental in scope.
Paper 1 tackles a fundamental challenge in robotics (universal grasping) by introducing a massive dataset of 1M human grasps, a novel flow-matching model, and a standardized benchmark. Its open-source release and significant performance improvements over state-of-the-art baselines suggest a high potential for broad impact across robotic manipulation and human-robot interaction. Paper 2, while presenting a solid open-vocabulary mapping system, appears more incremental, building on existing VLMaps architectures with dual-memory and confidence cues.
Paper 2 (HUG) likely has higher impact due to a large-scale, broadly reusable dataset (1M egocentric human grasps), a concrete generative grasping model, and strong emphasis on standardized evaluation (new benchmark) plus full public release of data/code/checkpoints. This combination accelerates follow-on work across robotics, computer vision, human sensing, and hand-object interaction, with immediate real-world applicability via retargeting to multiple robot hands. Paper 1 is novel in architecturally repurposing geometric foundation models for manipulation, but its impact may depend more on access to specific pretrained GFMs and task setups, and it offers less community infrastructure.
Paper 2 likely has higher impact due to a large-scale, uniquely sourced egocentric human grasp dataset (1M frames across thousands of objects/environments), a broadly applicable generative grasp model with robot retargeting enabling zero-shot deployment, and strong end-to-end validation (new benchmark, sim+real evaluations, multi-embodiment/camera tests, sizable gains over SOTA) plus full public release. Its contributions span robotics, vision, generative modeling, and human hand pose/interaction. Paper 1 is valuable and timely but is more incremental—primarily a careful scaling/conditioning study and training recipe within diffusion imitation learning.