Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

Open-H-Embodiment Consortium: Nigel Nelson, Juo-Tung Chen, Jesse Haworth, Xinhao Chen, Lukas Zbinden, Dianye Huang

Apr 22, 2026 · arXiv:2604.21017v1

cs.RO · cs.AI

Frozen v1: this version was superseded on arXiv by v2. Stats reflect the state at freeze time.

#2 of 3810 · Robotics

Gold · Week 17, 2026

Tournament Score: 1703 ± 32 (axis range 1050 to 1800)

Win Rate: 99%

Rating: 8.5 / 10 (Significance 9, Rigor 6.5, Novelty 7.5, Clarity 8)

Abstract

Autonomous medical robots hold promise to improve patient outcomes, reduce provider workload, democratize access to care, and enable superhuman precision. However, autonomous medical robotics has been limited by a fundamental data problem: existing medical robotic datasets are small, single-embodiment, and rarely shared openly, restricting the development of foundation models that the field needs to advance. We introduce Open-H-Embodiment, the largest open dataset of medical robotic video with synchronized kinematics to date, spanning more than 49 institutions and multiple robotic platforms including the CMR Versius, Intuitive Surgical's da Vinci, da Vinci Research Kit (dVRK), Rob Surgical BiTrack, Virtual Incision's MIRA, Moon Surgical Maestro, and a variety of custom systems, spanning surgical manipulation, robotic ultrasound, and endoscopy procedures. We demonstrate the research enabled by this dataset through two foundation models. GR00T-H is the first open foundation vision-language-action model for medical robotics, which is the only evaluated model to achieve full end-to-end task completion on a structured suturing benchmark (25% of trials vs. 0% for all others) and achieves 64% average success across a 29-step ex vivo suturing sequence. We also train Cosmos-H-Surgical-Simulator, the first action-conditioned world model to enable multi-embodiment surgical simulation from a single checkpoint, spanning nine robotic platforms and supporting in silico policy evaluation and synthetic data generation for the medical domain. These results suggest that open, large-scale medical robot data collection can serve as critical infrastructure for the research community, enabling advances in robot learning, world modeling, and beyond.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: Open-H-Embodiment

1. Core Contribution

Open-H-Embodiment addresses a well-recognized structural bottleneck in autonomous surgical robotics: the absence of a large-scale, multi-embodiment, openly shared dataset of medical robotic video with synchronized kinematics. The paper delivers three interrelated contributions: (1) a 770-hour dataset spanning 20 robotic platforms, 49+ institutions, and 119 constituent datasets; (2) GR00T-H, a vision-language-action (VLA) model post-trained on this corpus that achieves the first reported end-to-end autonomous suturing completion on the SutureBot benchmark; and (3) Cosmos-H-Surgical-Simulator, the first multi-embodiment action-conditioned world model for surgical simulation.

The core insight — that the scaling laws validated in general-purpose robotics (via Open-X-Embodiment and successors) should transfer to the medical domain given sufficient domain-specific data — is not itself novel, but the execution is significant. The paper correctly identifies that general-purpose VLAs fail on surgical tasks not due to architectural limitations but because of a domain gap that cannot be bridged without in-domain data. Open-H is the first serious attempt to close this gap at scale.

2. Methodological Rigor

Dataset construction is handled with considerable care. The adoption of LeRobot v2.1 format with healthcare-specific schema extensions, structured READMEs documenting platform nuances (clutching, cable wear, operator skill), and the explicit acknowledgment of data heterogeneity challenges are commendable engineering decisions. The five-tier realism spectrum (simulation through clinical) is well-motivated.

GR00T-H evaluation follows a reasonably rigorous protocol. The end-to-end suturing evaluation uses a per-setup matched design where all models attempt the same configuration. Statistical reporting includes Clopper-Pearson confidence intervals and Fisher's exact tests with Holm-Bonferroni correction. However, several concerns arise:

The 25% end-to-end success rate (5/20) for GR00T-H, while the only non-zero result, is modest, and with only 20 trials the confidence interval around it is wide.

LingBot-VA was evaluated in separate sessions rather than matched conditions, weakening the comparison.

The ex vivo evaluation (64% average across 29 subtasks, n=10 per subtask) demonstrates capability but with notable failures in clinically critical steps (cutting at 20-30%).

ACT's underperformance relative to its originally reported results is attributed to hardware drift, which complicates interpretation — it simultaneously argues for GR00T-H's robustness while undermining the fairness of the comparison.
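The first concern above, the wide interval around 5/20, can be made concrete with a minimal sketch of an exact (Clopper-Pearson) interval. This is an illustration written for this assessment, not the paper's analysis code: the function names and the bisection solver are assumptions, and only the 5-of-20 count comes from the text.

```python
from math import comb


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(g, target: float, iters: int = 100) -> float:
    """Bisection on [0, 1] for g(p) = target, where g is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Exact two-sided (1 - alpha) interval for a binomial proportion."""
    # P(X >= k) = 1 - P(X <= k - 1) rises with p; the lower bound sets it to alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # P(X <= k) falls with p; the upper bound sets it to alpha / 2,
    # which is the same as P(X > k) = 1 - alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


# 5 successes in 20 end-to-end trials, as reported for GR00T-H.
low, high = clopper_pearson(5, 20)
print(f"5/20 = {5 / 20:.0%}, 95% interval from {low:.1%} to {high:.1%}")
```

Under these assumptions the 95% interval runs from roughly 9% to 49%, so the data are compatible with a true success rate anywhere from rare to about one in two.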

Cosmos-H-Surgical-Simulator evaluation is acknowledged as preliminary. The reliance on L1 and SSIM metrics for action-conditioned generation is a known limitation — these metrics don't capture surgical-specific fidelity (instrument position accuracy, tissue interaction plausibility). The paper is transparent about this gap.
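As a rough illustration of why pixel-level metrics can miss instrument-position errors, the sketch below computes mean L1 error and a simplified single-window SSIM on toy grayscale frames. Everything here is an assumption made for illustration: the paper's evaluation code is not shown in this assessment, standard SSIM uses local sliding windows rather than one global window, and the 8x8 frames with a single bright "instrument tip" pixel are invented.

```python
from statistics import fmean

Frame = list[list[float]]  # grayscale frame, pixel values in [0, 1]


def flatten(frame: Frame) -> list[float]:
    return [px for row in frame for px in row]


def l1_error(pred: Frame, target: Frame) -> float:
    """Mean absolute per-pixel difference."""
    return fmean(abs(a - b) for a, b in zip(flatten(pred), flatten(target)))


def global_ssim(pred: Frame, target: Frame, data_range: float = 1.0) -> float:
    """SSIM formula applied once over the whole frame (no sliding window)."""
    x, y = flatten(pred), flatten(target)
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = fmean(x), fmean(y)
    vx = fmean((a - mx) ** 2 for a in x)
    vy = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))


def frame_with_tip(row: int, col: int, size: int = 8) -> Frame:
    """Toy frame: dark background with one bright pixel as the tool tip."""
    frame = [[0.0] * size for _ in range(size)]
    frame[row][col] = 1.0
    return frame


target = frame_with_tip(2, 2)
shifted = frame_with_tip(5, 5)  # tip displaced by three pixels diagonally

print("L1, identical frames:", l1_error(target, target))
print("L1, displaced tip:   ", l1_error(shifted, target))
print("SSIM, identical:     ", global_ssim(target, target))
print("SSIM, displaced tip: ", global_ssim(shifted, target))
```

In this toy case the L1 error stays small because only two of 64 pixels differ, even though the tip has moved by a distance that would matter for a surgical task. This is the kind of gap a task-specific metric such as instrument position accuracy would be needed to expose.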

Dataset composition bias is a significant concern: CMR Versius clinical data accounts for ~65% of total hours (499/770), creating a heavily skewed distribution. While sampling caps at 20% during GR00T-H training partially mitigate this, the effective diversity is less than the headline numbers suggest.
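The arithmetic behind this concern can be sketched as follows. The hour totals (499 of 770) and the 20% cap come from the text above; lumping every non-Versius source into one group, the function names, and the way the cap redistributes weight are assumptions, since the actual GR00T-H sampling scheme is not described here.

```python
def natural_shares(hours: dict[str, float]) -> dict[str, float]:
    """Sampling weights proportional to recorded hours."""
    total = sum(hours.values())
    return {name: h / total for name, h in hours.items()}


def cap_source(hours: dict[str, float], source: str, cap: float) -> dict[str, float]:
    """Limit one source to `cap` and spread the rest proportionally to hours."""
    shares = natural_shares(hours)
    if shares[source] <= cap:
        return shares
    rest_hours = sum(h for name, h in hours.items() if name != source)
    return {
        name: cap if name == source else (1 - cap) * h / rest_hours
        for name, h in hours.items()
    }


# Hypothetical two-way split: Versius clinical data vs. everything else.
HOURS = {"CMR Versius clinical": 499.0, "all other sources": 770.0 - 499.0}

before = natural_shares(HOURS)
after = cap_source(HOURS, "CMR Versius clinical", cap=0.20)
for name in HOURS:
    print(f"{name}: {before[name]:.1%} of hours, {after[name]:.1%} of samples")
```

Under this reading, capping moves Versius from about 65% of the hours to 20% of the sampled batches, but the remaining 271 hours are then drawn roughly three times more often per hour than they would be under proportional sampling, which is one reason effective diversity still trails the headline numbers.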

3. Potential Impact

The potential impact is substantial across multiple dimensions:

Infrastructure contribution: The dataset itself, released under CC-BY-4.0 with model weights and code publicly available, could become foundational infrastructure analogous to what ImageNet was for vision or Open-X-Embodiment for general robotics. The multi-institutional, multi-platform nature lowers barriers for researchers without access to specific surgical platforms.

Clinical trajectory: While far from clinical deployment (the paper appropriately notes this), the work establishes a credible research pipeline from data collection through foundation model training to policy evaluation. The ex vivo results on pork belly tissue represent meaningful progress toward clinical relevance.

World modeling: Cosmos-H-Surgical-Simulator opens a new direction — using world models for in silico policy evaluation could dramatically reduce the cost of surgical robot policy development, which currently requires expensive physical robot time.

Community building: The 49-institution consortium itself is a significant organizational achievement that could catalyze sustained collaboration in a field historically fragmented by proprietary platforms and institutional silos.

4. Timeliness & Relevance

The paper arrives at an inflection point where foundation models have demonstrated clear scaling benefits in general robotics, but surgical robotics remains locked out due to data scarcity. The projected surgeon shortage provides genuine urgency. The timing relative to the explosion of VLA models (OpenVLA, π0, GR00T-N1) makes this contribution directly actionable — the community now has both the architectures and (with Open-H) the data to explore surgical foundation models.

5. Strengths & Limitations

Key Strengths:

Scale and diversity unprecedented in medical robotics (770 hours, 20 platforms, 49+ institutions)

Complete open release of data, models, and code — unusual for medical robotics

Demonstrates concrete downstream utility through two distinct foundation model applications

Thoughtful engineering of data standardization and documentation practices

Honest assessment of limitations and failure modes

Notable Limitations:

Heavy skew toward Versius clinical data (65% by duration) limits effective multi-embodiment diversity

Clinical data lacks kinematics ground truth for most procedures (the Versius contribution provides kinematics but limited camera viewpoints)

End-to-end success rates remain low (25% on SutureBot, 20-30% for cutting steps in ex vivo)

The dataset predominantly contains successful demonstrations; lack of failure data limits world model training

Cross-embodiment transfer benefits are demonstrated but the mechanism is not deeply analyzed — whether performance gains come from data diversity, scale, or domain-specific priors remains unclear

The comparison with LingBot-VA under different conditions weakens one of the four baseline comparisons

No in vivo animal evaluation, making clinical relevance claims preliminary

The paper's massive author list and multi-institutional nature, while reflecting genuine collaboration, makes it difficult to assess individual contribution quality

Overall Assessment: This is a high-impact community resource paper that addresses a genuine bottleneck. The dataset contribution alone justifies publication at a top venue. The foundation model results, while preliminary, provide compelling evidence that domain-specific pretraining matters for surgical robotics. The work's greatest impact will likely be measured not by the specific models presented, but by what the community builds on this infrastructure over the coming years.

Rating: 8.5 / 10

Significance 9 · Rigor 6.5 · Novelty 7.5 · Clarity 8

Generated Apr 24, 2026

Comparison History (76)

Won vs. Unified Motion-Action Modeling for Heterogeneous Robot Learning

Paper 2 likely has higher scientific impact because it provides critical open infrastructure: a uniquely large, multi-institution, multi-platform medical robotics dataset with synchronized kinematics, addressing a major bottleneck for the field. Its real-world relevance is high (clinical robotics), and openness enables broad downstream reuse across robot learning, simulation, benchmarking, and foundation model research. The accompanying foundation models (VLA and world model) demonstrate immediate utility and set baselines. Paper 1 is methodologically novel and broadly applicable, but its impact is more contingent on adoption and reproducibility than a widely shared dataset plus demonstrated models.

gpt-5.2·Jun 16, 2026

Won vs. Scaling Short-Term Memory of Visuomotor Policies for Long-Horizon Tasks

Paper 2 introduces a massive, multi-institutional dataset and the first foundation models for medical robotics. By solving a critical data bottleneck in a high-stakes domain like surgery, it is highly likely to catalyze widespread downstream research and real-world clinical applications. While Paper 1 offers a valuable algorithmic contribution to robot memory, the sheer scale, domain importance, and field-enabling nature of Paper 2's dataset and models give it a significantly higher potential for broad scientific and societal impact.

gemini-3.1-pro-preview·Jun 16, 2026

Won vs. EgoEngine: From Egocentric Human Videos to High-Fidelity Dexterous Robot Demonstrations

Paper 1 addresses a critical bottleneck in a high-stakes domain (medical robotics) by releasing an unprecedented, multi-institutional, multi-embodiment dataset. Coupled with the introduction of the first open foundation and world models for medical robotics, it provides essential infrastructure likely to spur broad advancements in surgical automation, making its potential real-world and scientific impact exceptionally high.

gemini-3.1-pro-preview·Jun 12, 2026

Won vs. Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Paper 2 likely has higher scientific impact due to its unique, community-enabling contribution: the largest open multi-institution, multi-platform medical robotics dataset with synchronized kinematics—critical infrastructure in a data-scarce, high-stakes domain. Its immediate real-world relevance (healthcare), breadth across embodiments/procedures, and demonstration via two foundational models (VLA and action-conditioned world model) increase downstream reuse and cross-field influence (robot learning, simulation, medical AI). Paper 1 is strong technically, but appears more incremental within a crowded embodied-foundation-model race and less domain-transformative.

gpt-5.2·Jun 11, 2026

Won vs. World Pilot: Steering Vision-Language-Action Models with World-Action Priors

Paper 1 likely has higher impact due to its creation of a large, open, multi-institution, multi-embodiment medical robotics dataset—critical infrastructure that can catalyze broad downstream work (foundation models, benchmarking, simulation, data generation) in a high-stakes domain. Its real-world relevance (medical robotics) and openness increase reproducibility and community adoption. While Paper 2 offers a strong algorithmic advance with impressive benchmarks, it is more incremental and narrower in scope; Paper 1’s dataset + models can shift the field’s trajectory and enable many future methods.

gpt-5.2·Jun 11, 2026

Won vs. World-Language-Action Model for Unified World Modeling, Language Reasoning, and Action Synthesis

Paper 1 has higher potential scientific impact because it addresses a fundamental bottleneck in a high-stakes field: the lack of open, large-scale datasets in medical robotics. By providing data spanning 49 institutions and multiple platforms, along with two successful foundation models, it acts as critical infrastructure that will likely catalyze widespread research and innovation in autonomous surgery. While Paper 2 offers a strong architectural advancement in general embodied AI, releasing an unprecedented dataset in a notoriously siloed domain like medical robotics typically yields broader, field-defining impact.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. Dexterous Point Policy: Learning Point-based Dexterous Hand Policies from Human Demonstrations

Paper 1 provides a foundational dataset and models that address a critical bottleneck in medical robotics. By open-sourcing large-scale, multi-platform data and introducing the first medical VLA and world models, it serves as vital infrastructure likely to catalyze widespread research and accelerate real-world healthcare applications. While Paper 2 offers a strong methodological advance in dexterous manipulation, Paper 1 has a broader, field-defining impact.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. Video2Sim2Real: Full-Stack Autonomous Dexterous Skill Acquisition from a Single Human Video

Paper 1 is likely higher impact because it delivers broadly enabling infrastructure: the largest open multi-institution, multi-embodiment medical robotics dataset plus two foundation/world models that can catalyze many downstream studies. Its real-world relevance is high (medical robotics), timeliness aligns with foundation-model/data scaling trends, and breadth spans robot learning, simulation, and world modeling across platforms. Paper 2 is methodologically innovative for single-video dexterous skill acquisition, but is narrower in scope/applicability and less likely to reshape the field as a shared dataset + benchmarks + baseline models.

gpt-5.2·Jun 9, 2026

Won vs. RoboDream: Compositional World Models for Scalable Robot Data Synthesis

Open-H-Embodiment represents a larger-scale infrastructure contribution spanning 49+ institutions, multiple robotic platforms, and two foundation models (GR00T-H and Cosmos-H-Surgical-Simulator) that are firsts in medical robotics. Its impact is broader: it addresses a fundamental data bottleneck in autonomous medical robotics, provides an open large-scale dataset as community infrastructure, and demonstrates both vision-language-action models and world models for surgical simulation. While RoboDream presents an innovative approach to synthetic data generation, Open-H-Embodiment's multi-institutional scale, clinical relevance, and potential to catalyze an entire subfield give it higher estimated impact.

claude-opus-4-6·Jun 2, 2026

Won vs. Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

Open-H-Embodiment represents a transformative contribution to medical robotics by addressing a critical infrastructure gap—the lack of large-scale, multi-embodiment open datasets. Spanning 49+ institutions and multiple robotic platforms, it enables foundation models (GR00T-H and Cosmos-H) that achieve unprecedented results in surgical tasks. Its breadth of impact spans medical AI, robot learning, and world modeling, with direct real-world clinical implications. Paper 2 offers a useful methodological improvement (CoFi) for compositional diffusion planning, but is more incremental—an inference-time sampling strategy improving efficiency and coherence over existing baselines, with narrower scope of impact.

claude-opus-4-6·Jun 2, 2026
