Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang

Apr 22, 2026 · arXiv:2604.21017v2

cs.RO · cs.AI

#1 of 3760 · Robotics

Tournament Score: 1705 ± 35

Win Rate: 100%

Rating: 8.5/10 (Significance 9, Rigor 7, Novelty 8, Clarity 8)

Abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Open-H-Embodiment

1. Core Contribution

Open-H-Embodiment addresses a fundamental infrastructure gap in medical robotics: the absence of a large-scale, multi-embodiment, openly shared dataset with synchronized video and kinematics. The paper delivers three interlinked contributions: (1) a 770-hour dataset spanning 20 robotic platforms, 49+ institutions, and 119 constituent datasets; (2) GR00T-H, a vision-language-action (VLA) model post-trained on this corpus that achieves the first reported end-to-end task completion on the SutureBot suturing benchmark (25% vs. 0% for all baselines); and (3) Cosmos-H-Surgical-Simulator, the first multi-embodiment action-conditioned world model for surgical simulation from a single checkpoint.

The paper's central thesis—that the scaling laws demonstrated in general-purpose robotics (Open-X-Embodiment, RT-2-X) can transfer to the medical domain given sufficient diverse data—is directly tested and supported by the experimental results. This is a significant conceptual validation because the surgical domain has unique challenges (deformable tissues, endoscopic optics, irreversible actions, regulatory constraints) that could plausibly invalidate scaling assumptions from rigid-object manipulation.

2. Methodological Rigor

Dataset construction is handled with considerable care. The adoption of LeRobot v2.1 format, structured READMEs documenting platform-specific nuances (clutching mechanisms, operator skill levels), and per-dataset normalization statistics demonstrate attention to the practical challenges of multi-institution data aggregation. The five-tier realism spectrum (simulation through clinical) is well-structured for progressive validation.
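Per-dataset normalization matters here because contributors may log kinematics in different units and ranges. The following is a minimal sketch of the idea, not the consortium's actual pipeline: the dataset names, the two-dimensional action vectors, and the `compute_stats` / `normalize` helpers are all hypothetical, and the real LeRobot metadata layout is not reproduced.

```python
import math

def compute_stats(rows):
    """Per-dimension mean and (population) standard deviation."""
    n = len(rows)
    dims = len(rows[0])
    mean = [sum(r[d] for r in rows) / n for d in range(dims)]
    var = [sum((r[d] - mean[d]) ** 2 for r in rows) / n for d in range(dims)]
    return {"mean": mean, "std": [math.sqrt(v) for v in var]}

def normalize(vec, stats, eps=1e-8):
    """Z-score one vector using the statistics of its own dataset."""
    return [(x - m) / (s + eps)
            for x, m, s in zip(vec, stats["mean"], stats["std"])]

# Hypothetical action vectors: same motion, logged in different units.
datasets = {
    "platform_a": [[0.010, 0.020], [0.030, 0.040], [0.050, 0.060]],  # metres (assumed)
    "platform_b": [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],        # millimetres (assumed)
}

# One set of statistics per constituent dataset, never pooled across datasets.
stats = {name: compute_stats(rows) for name, rows in datasets.items()}
normalized = {
    name: [normalize(r, stats[name]) for r in rows]
    for name, rows in datasets.items()
}
```

After normalization the two hypothetical platforms land on the same scale, which is the property a shared policy needs when it trains on mixed sources.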

GR00T-H evaluation is reasonably thorough. The paper evaluates across four dimensions: end-to-end task completion (n=20 per model), out-of-distribution generalization (n=30), data efficiency (33% vs. 100% fine-tuning data), and multi-embodiment transfer (three platforms). Statistical reporting includes Clopper-Pearson confidence intervals and Fisher's exact tests with Holm-Bonferroni correction. The per-setup evaluation protocol for end-to-end suturing (holding setup constant across models) strengthens internal validity.

However, there are methodological concerns. The 25% end-to-end success rate, while the only non-zero result, is based on 5/20 trials—a small absolute number. The LingBot-VA baseline was evaluated in a separate session with "best effort" to mirror conditions, introducing potential confounds. The ex vivo evaluation (64% average across 29 subtasks) tests individual subtask success but doesn't report end-to-end completion rates on biological tissue, which would be the most clinically meaningful metric.
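To make the small-sample concern concrete, the sketch below computes a Clopper-Pearson (exact binomial) interval for 5 successes in 20 trials using only the standard library. The bisection solver is an implementation choice made here for self-containment; the paper's own tooling is not known, and libraries typically obtain the same bounds from beta-distribution quantiles.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a proportion."""
    def solve(f, target):
        # f is monotonically decreasing in p on [0, 1]; bisect for f(p) == target.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if f(mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: P(X >= k | p) = alpha/2, i.e. P(X <= k-1 | p) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper bound: P(X <= k | p) = alpha/2.
    upper = 1.0 if k == n else solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

ci_5_of_20 = clopper_pearson(5, 20)   # the non-zero result discussed above
ci_0_of_20 = clopper_pearson(0, 20)   # a baseline with no successes
```

For 5/20 the 95% interval runs from roughly 9% to 49%, which shows how wide the uncertainty remains around the 25% point estimate.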

Cosmos-H-Surgical-Simulator evaluation relies on L1 and SSIM metrics against ground-truth replay, which the authors themselves acknowledge are limited—they don't capture surgery-specific fidelity (instrument positioning, tool-tissue interaction plausibility). The absence of closed-loop evaluation or domain-specific metrics weakens the world model contribution relative to the dataset and VLA contributions.
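The two pixel-level metrics can be sketched as follows. This is a simplified illustration: the SSIM here treats the whole frame as a single window, whereas standard implementations average over local sliding windows, and the tiny 2x2 "frames" are invented for the example.

```python
def l1_error(a, b):
    """Mean absolute pixel difference between two equally sized frames."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)

def global_ssim(a, b, data_range=255.0):
    """SSIM over the whole frame as one window (a simplification)."""
    x = [v for row in a for v in row]
    y = [v for row in b for v in row]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((v - mx) ** 2 for v in x) / n
    vy = sum((v - my) ** 2 for v in y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y)) / n
    c1 = (0.01 * data_range) ** 2   # stabilizers from the usual SSIM definition
    c2 = (0.03 * data_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

# Hypothetical ground-truth frame and a prediction that is uniformly 10 levels brighter.
truth = [[0.0, 50.0], [100.0, 150.0]]
pred = [[10.0, 60.0], [110.0, 160.0]]

l1 = l1_error(truth, pred)
ssim = global_ssim(truth, pred)
```

Both numbers summarize pixel agreement only. Neither says whether an instrument tip is in the right place, which is the gap the assessment points to.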

3. Potential Impact

Immediate impact: The dataset itself, released under CC-BY-4.0, could become foundational infrastructure for the surgical robotics community, analogous to what ImageNet was for computer vision or Open-X-Embodiment for general robotics. The 770-hour scale represents a >35x increase over the largest prior dataset (ImitateCholec, ~20 hours), and the multi-embodiment nature is entirely novel.

Downstream applications: Beyond VLA training and world modeling, the authors correctly note the dataset supports monocular depth estimation, instrument segmentation, procedural phase recognition, and language-conditioned planning. The clinical procedure annotations (phase and gesture labels for hundreds of hours of Versius data) add value for surgical workflow analysis.

Industry and clinical pathway: The involvement of major surgical robotics companies (CMR Surgical contributing 489 hours, plus Intuitive/dVRK, Virtual Incision, Rob Surgical, Moon Surgical) signals potential for translational impact, though the paper is appropriately cautious about clinical deployment timelines.

Broader field influence: This work could catalyze a paradigm shift in surgical robotics from single-institution, single-platform research toward community-driven data sharing, analogous to the shift seen in NLP and general robotics.
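The scale figures quoted in this section can be checked with a few lines of arithmetic. The inputs are the hour counts given in this assessment; the ImitateCholec figure is itself approximate, so the ratio should be read as a rough estimate.

```python
total_hours = 770.0          # dataset size quoted in this assessment
prior_largest_hours = 20.0   # approximate size quoted for ImitateCholec
versius_hours = 489.0        # CMR Surgical contribution quoted in this assessment

scale_ratio = total_hours / prior_largest_hours
versius_share = versius_hours / total_hours

print(f"scale ratio: {scale_ratio:.1f}x")
print(f"Versius share of hours: {versius_share:.1%}")
```

With these inputs the ratio comes to 38.5x, consistent with the ">35x" claim, and the Versius share comes to about 63.5%, close to the 65% cited under limitations.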

4. Timeliness & Relevance

The paper arrives at a critical inflection point. Foundation models have transformed general robotics (OpenVLA, π0, GR00T-N1), but SutureBot recently demonstrated their complete failure on surgical tasks. This paper directly addresses that documented failure with a data-centric solution. The projected surgeon shortage (13,000–86,000 by 2036) provides genuine clinical urgency. The convergence of VLA architectures, video foundation models, and the demonstrated need for domain-specific pretraining makes this contribution highly timely.

5. Strengths & Limitations

Key Strengths:

Scale and diversity unprecedented in surgical robotics (770h, 20 platforms, 49+ institutions)

Open release of dataset, model weights, and code under permissive licenses

Demonstrated transfer benefits across multiple evaluation axes (task success, data efficiency, generalization, multi-embodiment)

Practical engineering contributions (data formatting standards, structured documentation templates)

The world model contribution opens a novel research direction in multi-embodiment surgical simulation

Notable Limitations:

Dataset distribution is heavily skewed: CMR Versius contributes 65% of hours, mostly clinical video that may have limited kinematic diversity

The 25% end-to-end success rate, while state-of-the-art, remains far from clinical relevance

No in vivo policy evaluation; all GR00T-H results are on phantoms or ex vivo tissue

World model evaluation metrics are acknowledged as inadequate for the domain

The massive author list (180+ authors) makes individual contribution assessment difficult, though this is inherent to community datasets

Privacy and de-identification protocols for clinical data receive minimal discussion

The causal relationship between Open-H post-training and improved performance could be better isolated (e.g., ablations on dataset composition, scale curves)

6. Additional Observations

The paper represents a significant organizational achievement: coordinating data collection across 49+ institutions with heterogeneous hardware and regulatory environments. The practical contribution of establishing data standards, conversion tools, and documentation templates may have impact beyond the specific models trained. The observation about hardware drift robustness (GR00T-H maintaining performance despite instrument wear) is a practically important finding for real-world deployment.

Rating: 8.5/10

Significance 9 · Rigor 7 · Novelty 8 · Clarity 8

Generated May 5, 2026

Comparison History (65)

Won vs. EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Open-H-Embodiment presents a massive, multi-institutional, multi-embodiment open dataset for medical robotics—addressing a fundamental infrastructure gap—along with two foundation models (a vision-language-action model and a world model) demonstrating concrete surgical task completion. Its scale (49+ institutions, multiple platforms), domain importance (medical robotics), and role as community infrastructure give it broader and deeper potential impact. While EgoEngine introduces a clever framework for converting human videos to robot demonstrations (a notable contribution), its scope and potential to reshape a field are comparatively narrower than the foundational data and model infrastructure Paper 1 provides.

claude-opus-4-6 · Jun 12, 2026

Won vs. Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

While Paper 1 presents a strong general embodied foundation model, Paper 2 tackles a critical data bottleneck in a high-stakes domain by releasing the largest open medical robotics dataset. By democratizing data access and providing pioneering baseline models for surgical robotics, Paper 2 has immense potential for life-saving real-world applications and will likely catalyze a new subfield of medical foundation models.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Paper 2 addresses a fundamental data bottleneck in a high-stakes, high-impact domain (medical robotics) by introducing the largest open, multi-embodiment dataset to date. By also providing the first open foundation VLA model and action-conditioned world model for surgery, it provides critical infrastructure likely to catalyze widespread research and real-world clinical applications. While Paper 1 offers a strong algorithmic improvement for general robotic manipulation, the scale, domain importance, and foundational nature of Paper 2 give it a significantly higher potential for broad scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Paper 2 introduces Open-H-Embodiment, a landmark large-scale open dataset for medical robotics spanning 49+ institutions and multiple platforms—addressing a critical data bottleneck. Its impact is broader: it provides foundational infrastructure for an entire research community, enables two novel foundation models (first VLA and first action-conditioned world model for medical robotics), and targets high-stakes healthcare applications. While Paper 1 presents a technically strong unified model architecture, Paper 2's open dataset contribution, cross-institutional scale, and potential to democratize medical robotics research represent a more transformative and lasting scientific contribution.

claude-opus-4-6 · Jun 10, 2026

Won vs. Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Open-H-Embodiment introduces critical shared infrastructure (the largest open medical robotic dataset spanning 49+ institutions and multiple platforms) along with two foundation models (GR00T-H and Cosmos-H-Surgical-Simulator) that are firsts in their domain. Its breadth of impact is enormous: it addresses a fundamental data bottleneck in medical robotics, enables community-wide research, and has direct clinical implications for patient outcomes and healthcare access. While Paper 2 presents a clever embodiment-gap solution for dexterous manipulation with strong results, its scope is narrower. Paper 1's combination of large-scale open data, multi-embodiment coverage, and medical domain significance gives it higher potential impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Paper 2 addresses a critical data bottleneck in a high-impact domain—medical robotics—by providing a large-scale, multi-embodiment dataset and introducing two novel foundation models. While Paper 1 offers an innovative sim-to-real framework for general manipulation, Paper 2's dataset is likely to serve as foundational infrastructure for a broad research community. By democratizing data and establishing new baselines for surgical foundation models, Paper 2 has the potential to catalyze massive downstream research and real-world healthcare applications, giving it a higher estimated scientific impact.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Paper 1 likely has higher scientific impact due to its creation of critical open infrastructure: a uniquely large, multi-institution, multi-robot medical robotics dataset with synchronized video+kinematics, enabling broad community reuse and benchmarking. It also demonstrates two foundational models (vision-language-action and action-conditioned world model) across multiple embodiments in a high-stakes domain with clear real-world relevance and timeliness. Paper 2 is innovative for scalable synthetic data via compositional world models, but its impact may be narrower and more dependent on downstream adoption compared to a major open dataset and foundation-model suite in medical robotics.

gpt-5.2 · Jun 2, 2026

Won vs. World-Task Factorization for Robot Learning

Paper 2 introduces a massive, multi-embodiment dataset and the first foundation models for medical robotics, directly solving a critical data bottleneck in the field. While Paper 1 offers an elegant theoretical framework for general robotics, large-scale open datasets historically drive massive paradigm shifts and follow-on research in AI. Furthermore, the direct application to surgery provides immense real-world societal value, giving Paper 2 a broader and more transformative potential scientific impact.

gemini-3.1-pro-preview · Jun 2, 2026

Won vs. Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

Paper 1 likely has higher impact due to its large, open, multi-institution, multi-embodiment medical robotics dataset plus two enabling foundation/world models—critical infrastructure that can catalyze broad downstream research and real-world clinical translation. Its applications (surgery, ultrasound, endoscopy) are highly consequential and timely, and open data addresses a key bottleneck for the field. Paper 2 is a strong methodological contribution with broad applicability and efficiency gains, but it is primarily an inference-time algorithmic improvement that may be more incremental and less transformative than a major open dataset + benchmarks in a high-stakes domain.

gpt-5.2 · Jun 2, 2026

Won vs. Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Paper 2 likely has higher scientific impact because it delivers broadly enabling infrastructure: the largest open multi-institution, multi-platform medical robotics dataset with synchronized kinematics, addressing a key bottleneck (data scarcity/closed access) and catalyzing community-wide progress. It also demonstrates strong downstream utility via two foundation models (VLA and action-conditioned world model) and directly targets a high-stakes, timely application area (medical robotics) with clear translational potential. Paper 1 is technically innovative and strong, but is more incremental within a crowded general VLA space and less uniquely enabling.

gpt-5.2 · May 29, 2026
