Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang
Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of the foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date. It spans more than 49 institutions and multiple robotic platforms, including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, and covers surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H, the first open foundation vision-language-action model for medical robotics, is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others), and it achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.
Open-H-Embodiment addresses a well-recognized structural bottleneck in autonomous surgical robotics: the absence of a large-scale, multi-embodiment, openly shared dataset of medical robotic video with synchronized kinematics. The paper delivers three interrelated contributions: (1) a 770-hour dataset spanning 20 robotic platforms, 49+ institutions, and 119 constituent datasets; (2) GR00T-H, a vision-language-action (VLA) model post-trained on this corpus that achieves the first reported end-to-end autonomous suturing completion on the SutureBot benchmark; and (3) Cosmos-H-Surgical-Simulator, the first multi-embodiment action-conditioned world model for surgical simulation.
The core insight — that the scaling laws validated in general-purpose robotics (via Open-X-Embodiment and successors) should transfer to the medical domain given sufficient domain-specific data — is not itself novel, but the execution is significant. The paper correctly identifies that general-purpose VLAs fail on surgical tasks not due to architectural limitations but because of a domain gap that cannot be bridged without in-domain data. Open-H is the first serious attempt to close this gap at scale.
Dataset construction is handled with considerable care. The adoption of LeRobot v2.1 format with healthcare-specific schema extensions, structured READMEs documenting platform nuances (clutching, cable wear, operator skill), and the explicit acknowledgment of data heterogeneity challenges are commendable engineering decisions. The five-tier realism spectrum (simulation through clinical) is well-motivated.
GR00T-H evaluation follows a reasonably rigorous protocol. The end-to-end suturing evaluation uses a per-setup matched design where all models attempt the same configuration. Statistical reporting includes Clopper-Pearson confidence intervals and Fisher's exact tests with Holm-Bonferroni correction. Several concerns nonetheless arise.
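For readers unfamiliar with the statistical reporting named above, the following is a minimal standard-library sketch of the three procedures: a Clopper-Pearson interval, a two-sided Fisher's exact test, and a Holm-Bonferroni adjustment. It is not the paper's analysis code. The function names are illustrative, and the trial counts (5 of 20 versus 0 of 20) are hypothetical, chosen only to match the reported 25% versus 0% rates because the trial count is not stated here.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def _bisect(f, lo=0.0, hi=1.0, iters=100):
    """Root of a monotonically decreasing function f on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound solves P(X <= k | p) = alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x successes in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

def holm_bonferroni(pvals):
    """Holm step-down adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted, running_max = [0.0] * m, 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * pvals[i]))
        adjusted[i] = running_max
    return adjusted

# Hypothetical counts for illustration only: 5/20 vs. 0/20 successes.
k_a, n_a, k_b, n_b = 5, 20, 0, 20
ci_a = clopper_pearson(k_a, n_a)
ci_b = clopper_pearson(k_b, n_b)
p_value = fisher_exact_two_sided(k_a, n_a - k_a, k_b, n_b - k_b)
# The other two p-values are made up to show the multiple-comparison step.
adjusted = holm_bonferroni([p_value, 0.20, 0.01])
```

Under these assumed counts the unadjusted comparison falls just below 0.05, while the Holm-adjusted value does not. This illustrates why the number of trials behind a headline rate such as 25% versus 0% matters when interpreting significance.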
Cosmos-H-Surgical-Simulator evaluation is acknowledged as preliminary. The reliance on L1 and SSIM metrics for action-conditioned generation is a known limitation — these metrics don't capture surgical-specific fidelity (instrument position accuracy, tissue interaction plausibility). The paper is transparent about this gap.
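The limitation of pixel-level metrics can be made concrete with a toy example. The sketch below is purely illustrative and is not drawn from the paper: it builds synthetic 64x64 grayscale frames in which a 2-pixel-wide bar stands in for an instrument, then compares mean L1 error for a frame with the instrument displaced by 10 pixels against a frame with the instrument in the correct place but a small uniform brightness offset. All names and values are assumptions, and SSIM is not computed here.

```python
def make_frame(bar_col, size=64, background=0.4, tool=0.9, offset=0.0):
    """Grayscale frame with a 2-pixel-wide vertical bar standing in for a tool."""
    return [[(tool if bar_col <= c < bar_col + 2 else background) + offset
             for c in range(size)]
            for _ in range(size)]

def mean_l1(a, b):
    """Mean absolute per-pixel difference between two equally sized frames."""
    total = sum(abs(x - y) for row_a, row_b in zip(a, b)
                for x, y in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))

truth = make_frame(bar_col=20)
misplaced = make_frame(bar_col=30)              # tool 10 px off target
brighter = make_frame(bar_col=20, offset=0.05)  # tool correct, frame brighter

l1_misplaced = mean_l1(truth, misplaced)  # only 256 of 4096 pixels differ
l1_brighter = mean_l1(truth, brighter)    # every pixel differs slightly
```

In this toy setup the misplaced instrument scores a lower L1 error than the correctly placed but slightly brighter frame, because a thin instrument occupies few pixels. A task-specific measure such as instrument tip position error would rank the two frames the other way.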
Dataset composition bias is a significant concern: CMR Versius clinical data accounts for ~65% of total hours (499/770), creating a heavily skewed distribution. While sampling caps at 20% during GR00T-H training partially mitigate this, the effective diversity is less than the headline numbers suggest.
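The arithmetic behind this concern can be sketched as follows. The hour totals (499 of 770) and the 20% cap come from the text above. The capping rule itself is an assumption: the sketch clamps one source's sampling probability to the cap and distributes the remaining mass over the other sources in proportion to their hours. The paper's actual sampling mechanism may differ, and the function and key names are hypothetical.

```python
def cap_one_source(hours, name, cap):
    """Sampling weights proportional to hours, with `name` limited to `cap`.

    Assumed rule: if the named source exceeds the cap, it is set to the cap
    and the other sources share the remainder in proportion to their hours.
    """
    total = sum(hours.values())
    if hours[name] / total <= cap:
        return {k: h / total for k, h in hours.items()}
    rest_hours = total - hours[name]
    weights = {k: (1 - cap) * h / rest_hours
               for k, h in hours.items() if k != name}
    weights[name] = cap
    return weights

# Hours from the review; all non-Versius data is lumped into one group.
hours = {"versius_clinical": 499, "all_other_sources": 271}
natural_share = hours["versius_clinical"] / sum(hours.values())  # about 0.648
weights = cap_one_source(hours, "versius_clinical", cap=0.20)

# How much more often an hour of non-Versius data is sampled than an hour
# of Versius data once the cap is applied.
per_hour_ratio = ((weights["all_other_sources"] / hours["all_other_sources"])
                  / (weights["versius_clinical"] / hours["versius_clinical"]))
```

Under this assumed rule, each hour of non-Versius data is sampled roughly seven times as often as each hour of Versius data. The cap therefore rebalances what the model sees during training, but it does not add diversity to the underlying 271 hours.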
The potential impact is substantial across multiple dimensions:
Infrastructure contribution: The dataset itself, released under CC-BY-4.0 with model weights and code publicly available, could become foundational infrastructure analogous to what ImageNet was for vision or Open-X-Embodiment for general robotics. The multi-institutional, multi-platform nature lowers barriers for researchers without access to specific surgical platforms.
Clinical trajectory: While far from clinical deployment (the paper appropriately notes this), the work establishes a credible research pipeline from data collection through foundation model training to policy evaluation. The ex vivo results on pork belly tissue represent meaningful progress toward clinical relevance.
World modeling: Cosmos-H-Surgical-Simulator opens a new direction — using world models for in silico policy evaluation could dramatically reduce the cost of surgical robot policy development, which currently requires expensive physical robot time.
Community building: The 49-institution consortium itself is a significant organizational achievement that could catalyze sustained collaboration in a field historically fragmented by proprietary platforms and institutional silos.
The paper arrives at an inflection point where foundation models have demonstrated clear scaling benefits in general robotics, but surgical robotics remains locked out due to data scarcity. The projected surgeon shortage provides genuine urgency. The timing relative to the explosion of VLA models (OpenVLA, π0, GR00T-N1) makes this contribution directly actionable — the community now has both the architectures and (with Open-H) the data to explore surgical foundation models.
Overall Assessment: This is a high-impact community resource paper that addresses a genuine bottleneck. The dataset contribution alone justifies publication at a top venue. The foundation model results, while preliminary, provide compelling evidence that domain-specific pretraining matters for surgical robotics. The work's greatest impact will likely be measured not by the specific models presented, but by what the community builds on this infrastructure over the coming years.
Generated Apr 24, 2026
Paper 2 likely has higher scientific impact because it provides critical open infrastructure: a uniquely large, multi-institution, multi-platform medical robotics dataset with synchronized kinematics, addressing a major bottleneck for the field. Its real-world relevance is high (clinical robotics), and openness enables broad downstream reuse across robot learning, simulation, benchmarking, and foundation model research. The accompanying foundation models (VLA and world model) demonstrate immediate utility and set baselines. Paper 1 is methodologically novel and broadly applicable, but its impact is more contingent on adoption and reproducibility than a widely shared dataset plus demonstrated models.
Paper 2 introduces a massive, multi-institutional dataset and the first foundation models for medical robotics. By solving a critical data bottleneck in a high-stakes domain like surgery, it is highly likely to catalyze widespread downstream research and real-world clinical applications. While Paper 1 offers a valuable algorithmic contribution to robot memory, the sheer scale, domain importance, and field-enabling nature of Paper 2's dataset and models give it a significantly higher potential for broad scientific and societal impact.
Paper 1 addresses a critical bottleneck in a high-stakes domain (medical robotics) by releasing an unprecedented, multi-institutional, multi-embodiment dataset. Coupled with the introduction of the first open foundation and world models for medical robotics, it provides essential infrastructure likely to spur broad advancements in surgical automation, making its potential real-world and scientific impact exceptionally high.
Paper 2 likely has higher scientific impact due to its unique, community-enabling contribution: the largest open multi-institution, multi-platform medical robotics dataset with synchronized kinematics—critical infrastructure in a data-scarce, high-stakes domain. Its immediate real-world relevance (healthcare), breadth across embodiments/procedures, and demonstration via two foundational models (VLA and action-conditioned world model) increase downstream reuse and cross-field influence (robot learning, simulation, medical AI). Paper 1 is strong technically, but appears more incremental within a crowded embodied-foundation-model race and less domain-transformative.
Paper 1 likely has higher impact due to its creation of a large, open, multi-institution, multi-embodiment medical robotics dataset—critical infrastructure that can catalyze broad downstream work (foundation models, benchmarking, simulation, data generation) in a high-stakes domain. Its real-world relevance (medical robotics) and openness increase reproducibility and community adoption. While Paper 2 offers a strong algorithmic advance with impressive benchmarks, it is more incremental and narrower in scope; Paper 1’s dataset + models can shift the field’s trajectory and enable many future methods.
Paper 1 has higher potential scientific impact because it addresses a fundamental bottleneck in a high-stakes field: the lack of open, large-scale datasets in medical robotics. By providing data spanning 49 institutions and multiple platforms, along with two successful foundation models, it acts as critical infrastructure that will likely catalyze widespread research and innovation in autonomous surgery. While Paper 2 offers a strong architectural advancement in general embodied AI, releasing an unprecedented dataset in a notoriously siloed domain like medical robotics typically yields broader, field-defining impact.
Paper 1 provides a foundational dataset and models that address a critical bottleneck in medical robotics. By open-sourcing large-scale, multi-platform data and introducing the first medical VLA and world models, it serves as vital infrastructure likely to catalyze widespread research and accelerate real-world healthcare applications. While Paper 2 offers a strong methodological advance in dexterous manipulation, Paper 1 has a broader, field-defining impact.
Paper 1 is likely higher impact because it delivers broadly enabling infrastructure: the largest open multi-institution, multi-embodiment medical robotics dataset plus two foundation/world models that can catalyze many downstream studies. Its real-world relevance is high (medical robotics), timeliness aligns with foundation-model/data scaling trends, and breadth spans robot learning, simulation, and world modeling across platforms. Paper 2 is methodologically innovative for single-video dexterous skill acquisition, but is narrower in scope/applicability and less likely to reshape the field as a shared dataset + benchmarks + baseline models.
Open-H-Embodiment represents a larger-scale infrastructure contribution spanning 49+ institutions, multiple robotic platforms, and two foundation models (GR00T-H and Cosmos-H-Surgical-Simulator) that are firsts in medical robotics. Its impact is broader: it addresses a fundamental data bottleneck in autonomous medical robotics, provides an open large-scale dataset as community infrastructure, and demonstrates both vision-language-action models and world models for surgical simulation. While RoboDream presents an innovative approach to synthetic data generation, Open-H-Embodiment's multi-institutional scale, clinical relevance, and potential to catalyze an entire subfield give it higher estimated impact.
Open-H-Embodiment represents a transformative contribution to medical robotics by addressing a critical infrastructure gap—the lack of large-scale, multi-embodiment open datasets. Spanning 49+ institutions and multiple robotic platforms, it enables foundation models (GR00T-H and Cosmos-H) that achieve unprecedented results in surgical tasks. Its breadth of impact spans medical AI, robot learning, and world modeling, with direct real-world clinical implications. Paper 2 offers a useful methodological improvement (CoFi) for compositional diffusion planning, but is more incremental—an inference-time sampling strategy improving efficiency and coherence over existing baselines, with narrower scope of impact.