MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents

Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang

Jun 2, 2026

arXiv:2606.03203v1

cs.AI(primary)

#611 of 3355 · Artificial Intelligence

Tournament Score: 1473 ± 42 (scale 1050 to 1800)

Win Rate: 76%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8

Abstract

Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.

AI Impact Assessments (1 model)

Scientific Impact Assessment: MedCUA-Bench

1. Core Contribution

MedCUA-Bench is the first interactive benchmark specifically designed to evaluate computer-use agents (CUAs) on clinical graphical user interfaces. The benchmark covers 18 clinical scenarios across 10 medical domains (outpatient, inpatient, ICU, nursing, imaging, etc.), comprising 432 task instances. Three key design innovations distinguish it from prior work:

Paired intent/step goals: Each task has both a high-level clinical objective (what a clinician would delegate) and a step-by-step procedural decomposition, enabling researchers to disentangle clinical reasoning failures from UI execution failures.

Safety-aware evaluation: Beyond binary task completion, a deterministic checker evaluates five clinical safety dimensions (patient identity, data accuracy, information fidelity, record integrity, workflow safety) with severity-weighted penalties that can yield negative rewards for harmful completions.

Three-tier fidelity: The benchmark spans synthetic HTML reconstructions of clinical workstations, a real OpenEMR EHR instance, and real OHIF imaging viewers, balancing reproducibility with authenticity.
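
The safety-aware evaluation described above can be illustrated with a minimal sketch of a severity-weighted reward. The five dimension names come from the review text; the weight values, the `Violation` type, and the `episode_reward` function are hypothetical illustrations, not the paper's actual checker or reward formula.

```python
from dataclasses import dataclass
from typing import Sequence

# The five safety dimensions named in the review.
SAFETY_DIMENSIONS = (
    "patient_identity",
    "data_accuracy",
    "information_fidelity",
    "record_integrity",
    "workflow_safety",
)

# ASSUMPTION: illustrative weights only; the paper's real values are not given here.
SEVERITY_WEIGHTS = {"minor": 0.25, "major": 0.5, "critical": 2.0}


@dataclass(frozen=True)
class Violation:
    dimension: str  # one of SAFETY_DIMENSIONS
    severity: str   # a key of SEVERITY_WEIGHTS


def episode_reward(task_completed: bool, violations: Sequence[Violation]) -> float:
    """Completion credit (1 or 0) minus the summed severity penalties."""
    for v in violations:
        if v.dimension not in SAFETY_DIMENSIONS:
            raise ValueError(f"unknown safety dimension: {v.dimension}")
    penalty = sum(SEVERITY_WEIGHTS[v.severity] for v in violations)
    return (1.0 if task_completed else 0.0) - penalty


# A completed task that touched the wrong patient nets a negative reward.
harmful = episode_reward(True, [Violation("patient_identity", "critical")])
clean = episode_reward(True, [])
print(harmful, clean)
```

Under these assumed weights, a single critical violation outweighs the completion credit, which is the property the review highlights: a harmful completion scores worse than a failure.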

2. Methodological Rigor

The experimental design is thorough, evaluating 23 vision-capable agents under a consistent screenshot-only protocol with no DOM/accessibility tree access. The evaluation framework is well-structured:

Strengths in methodology:

The deterministic checker avoids the noise and cost of LLM-judge evaluation, enhancing reproducibility.

The severity-weight sensitivity analysis (Appendix L) demonstrates that agent rankings are robust across 10 alternative weighting schemes (Spearman ρ ≥ 0.957).

The failure-mode taxonomy (five mutually exclusive categories) provides genuinely actionable diagnostic information, revealing that closed-source failures are dominated by exploration timeouts while open-source failures are split between format lockouts and heavy loops.

The human baseline pilot, while limited (n=24, single operator), provides a meaningful reference point at 83.3% success.
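
The ranking-robustness check mentioned above relies on Spearman's rank correlation between agent scores under two weighting schemes. The sketch below shows that computation in plain Python; the agent scores are made-up placeholders, and the helper names `average_ranks` and `spearman_rho` are hypothetical, not taken from the paper's code.

```python
from statistics import mean


def average_ranks(scores):
    """Return 1-based ranks, giving tied scores the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the rank vectors (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# ASSUMPTION: placeholder scores for four agents under two weighting schemes.
scheme_a = [0.54, 0.31, 0.16, 0.02]
scheme_b = [0.50, 0.33, 0.01, 0.05]  # the two weakest agents swap places
print(round(spearman_rho(scheme_a, scheme_b), 3))  # 0.8
```

A value near 1 means the ordering of agents barely changes when the weights change, which is what the reported ρ ≥ 0.957 across 10 schemes indicates.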

Methodological concerns:

The 15 synthetic scenarios are HTML reconstructions, not the actual clinical software. While the paper acknowledges this as an upper bound, the gap between synthetic and real-system performance (54.7% vs. 8.3% for GPT-5.4) raises questions about the ecological validity of synthetic-tier results.

Single-run evaluation per task provides only point estimates without variance characterization.

The safety checker is currently under-exercised (0 critical, 53 major, 15 minor violations across 9,936 episodes) because most agents fail before reaching safety-critical actions. This makes the safety evaluation more of a forward-looking design feature than a current discriminator.

The human baseline uses only one annotator on a small stratified sample, limiting its interpretive power.
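
A quick arithmetic check makes the "under-exercised" point concrete. It assumes the 9,936 episodes are one run of each of the 432 tasks by each of the 23 agents, which matches the single-run protocol noted above; the variable names are illustrative.

```python
tasks = 432
agents = 23
episodes = tasks * agents  # one run per task per agent (assumed)

# Violation counts as quoted in the review.
violations = {"critical": 0, "major": 53, "minor": 15}
total_violations = sum(violations.values())

# Violations per episode, not the share of episodes with a violation:
# a single episode could in principle trigger more than one.
rate = total_violations / episodes
print(f"{episodes} episodes, {total_violations} violations, {rate:.2%} per episode")
```

With fewer than one violation per hundred episodes and no critical ones, the safety dimensions currently separate agents far less than raw task completion does.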

3. Potential Impact

Immediate impact: The benchmark fills a genuine gap at the intersection of CUA research and healthcare AI. The finding that even GPT-5.4 reaches only 54.2% overall and drops to 8.3% on real OpenEMR is a sobering reality check that should temper enthusiasm about near-term clinical CUA deployment. This quantification of the capability gap is inherently valuable for the community.

Broader implications:

The paired goal design could be adopted by other domain-specific benchmarks to separate reasoning from execution evaluation.

The safety-aware reward function with severity weighting provides a template for any high-stakes domain (legal, financial, aviation).

The benchmark could catalyze development of clinical-domain-specific fine-tuning for CUAs.

Limitations on impact:

The benchmark is somewhat narrow in its representation of real clinical systems (only OpenEMR and OHIF are real software). Most deployed clinical systems (Epic, Cerner, MEDITECH) remain inaccessible.

The 432-task scale, while reasonable for a first benchmark, may not capture the full diversity of clinical computing workflows.

4. Timeliness & Relevance

This work is highly timely. CUAs are rapidly advancing (GPT-5, Claude Opus 4.7, etc. are all 2026 models), and healthcare is frequently cited as a high-value application domain. The paper directly addresses a critical validation gap: while general CUA benchmarks proliferate, none adequately stress-test agents in clinical environments. Clinician burnout from EHR documentation is a well-documented problem (Sinsky et al., 2016), making the automation of clinical screen-based work an active area of interest. Establishing that current agents are nowhere near reliable enough for clinical deployment—before premature deployment occurs—is a valuable contribution.

5. Strengths & Limitations

Key strengths:

The four-part motivation (domain knowledge, distinctive UI design, unavailable test environments, safety validation) is clearly articulated and well-supported.

The analysis is multi-faceted: overall performance, page fidelity tiers, goal granularity effects, failure-mode decomposition, and safety profiling provide a comprehensive picture.

The finding that newer/larger open-source models don't necessarily outperform older ones (Qwen3.5-27B at 2.3% vs. Qwen2.5-VL-32B at 16.2%) yields insights about what capabilities matter for clinical GUI interaction.

The BrowserGym integration and deterministic evaluation should make the benchmark highly reproducible.

Notable weaknesses:

The synthetic environments, while carefully reconstructed, are fundamentally approximations. The dramatic performance drop on real OpenEMR (≤8.3% for all models) suggests that synthetic-tier results may overstate agent capability.

The physician involvement (two practitioners) in benchmark design, while valuable, is limited and could benefit from broader clinical input.

No fine-tuning or prompting optimization is explored; reporting only zero-shot performance on a standardized harness may understate what purpose-built clinical CUAs could achieve.

The paper does not explore multi-modal clinical scenarios where agents must integrate information across multiple screens simultaneously, which is common in real clinical practice.

Overall Assessment

MedCUA-Bench makes a well-motivated and well-executed contribution to an important gap in AI evaluation. Its primary value lies in establishing a reproducible baseline for clinical CUA capability, introducing the paired goal and safety-aware evaluation methodology, and providing concrete evidence that current agents are far from clinically reliable. The benchmark design is thoughtful and the analysis is thorough, though the reliance on synthetic environments for 15 of 18 scenarios and the limited safety signal under current agent capabilities are notable limitations.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 8

Generated Jun 3, 2026

Comparison History (21)

vs. PersistBench: When Should Long-Term Memories Be Forgotten by LLMs?

claude-opus-4.6 · 6/6/2026

MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare, a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark with 18 clinical scenarios, evaluates 23 agents, includes safety dimensions specific to clinical use, and reveals stark performance gaps (best model at 54.2% strict success, open-source agents averaging 2.5%). Its methodological rigor (deterministic checkers, intent vs. step-level evaluation, real EMR testing) and breadth across 10 medical domains give it broader impact potential. PersistBench identifies important memory safety risks but addresses a narrower problem with less immediate real-world consequence.

vs. Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization

gpt-5.2 · 6/6/2026

Paper 2 likely has higher scientific impact because it introduces a broadly useful, timely benchmark and evaluation methodology for a high-stakes real-world domain (clinical UI automation), enabling reproducible comparison across many agents and spurring measurable progress. Its deterministic checker plus safety dimensions directly address deployment-critical validation gaps, and the benchmark can influence multiple fields (agentic AI, HCI, medical informatics, safety). Paper 1 is novel and rigorous, but its framework is narrower in immediate community-wide adoption than a benchmark that can become a standard.

vs. No Need to Train Your RDB Foundation Model

claude-opus-4.6 · 6/6/2026

MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare—a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark with safety evaluation dimensions, tests 23 agents, and reveals significant performance gaps that will drive future research. The clinical safety focus and reproducibility make it highly impactful. Paper 2, while technically sound in extending ICL to relational databases without training, addresses a narrower technical problem with less broad societal impact. The healthcare AI safety angle of Paper 1 gives it greater timeliness and cross-disciplinary relevance.

vs. PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact: it targets a high-stakes, highly regulated real-world domain (clinical UI automation) with clear safety requirements, offering immediate applicability to evaluation and model development. Its benchmark design addresses a major gap (medical GUIs absent from existing agent benchmarks), provides deterministic checking plus multi-dimensional clinical safety metrics, and includes a realistic external validation (OpenEMR) showing strong generalization gaps—useful for the broader agent community. Paper 1 is timely and rigorous, but negotiation benchmarking is less safety-critical and more crowded, with narrower direct deployment pathways.

vs. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact because it introduces a timely, broadly relevant benchmark for evaluating computer-use agents in high-stakes clinical GUI settings, with safety-oriented metrics and realistic tasks. Benchmarks often catalyze rapid progress across many groups and methods, and this one targets an underrepresented but societally important domain (healthcare automation). Its methodological rigor (deterministic checker, step-/intent-level goals, safety dimensions, multi-agent evaluation, real-system testing) and cross-field relevance (HCI, agentic AI, medical informatics, safety) suggest wide adoption. Paper 1 is innovative but more niche to RTL synthesis.

vs. Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to immediate real-world relevance and adoption potential: it introduces a practical, reproducible benchmark targeting a high-stakes domain (clinical GUIs) with safety-oriented evaluation, enabling standardized comparison across many agents and catalyzing progress. Its methodology (interactive tasks, deterministic checker, intent vs step goals, real-system validation on OpenEMR) supports rigorous empirical research. Paper 1 is more theoretically novel, but its impact may be narrower and slower to translate, especially given negative/impossibility results for common classification settings.

vs. BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

gemini-3.1 · 6/5/2026

Paper 2 introduces a critical benchmark in a high-stakes, real-world domain (clinical healthcare), addressing safety and reliability in medical UI automation. Benchmarks often drive significant community effort and standard-setting. While Paper 1 presents a strong algorithmic innovation for geometry, Paper 2's focus on healthcare automation presents broader societal applications, addresses a clear gap in evaluating clinical AI agents, and provides a foundational testbed for future research.

vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment

gpt-5.2 · 6/3/2026

Paper 2 is more novel and timely by introducing a domain-specific, interactive benchmark for clinical computer-use agents with screenshot-only access, paired intent/step goals, and deterministic evaluation including multiple safety dimensions. Its real-world applicability is high given clinical UI automation needs and stringent safety requirements, and it provides a reusable testbed likely to become a standard for evaluating medical GUI agents. Paper 1 is useful and accessible, but multi-metric LLM evaluation pipelines are crowded, and its contribution appears more incremental compared to a new benchmark revealing major capability gaps in a high-stakes domain.

vs. Harness-1: Reinforcement Learning for Search Agents with State-Externalizing Harnesses

claude-opus-4.6 · 6/3/2026

Harness-1 introduces a novel architectural principle—externalizing state management from the RL policy into a structured harness—that is broadly applicable beyond retrieval to many agent settings. It demonstrates strong empirical results across 8 benchmarks with clear generalization, and the core idea of separating semantic decisions from bookkeeping could influence RL-based agent design broadly. MedCUA-Bench is a valuable benchmark contribution for an important niche (clinical computer-use agents), but benchmarks typically have narrower impact than new training paradigms, and the domain is relatively specialized compared to Harness-1's generalizable methodology.

vs. SafeMCP: Proactive Power Regulation for LLM Agent Defense via Environment-Grounded Look-Ahead Reasoning

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact: it proposes a general, deployable server-side defense for MCP-based LLM agents addressing timely, high-stakes agent safety (power-seeking/tool acquisition). The approach (environment-grounded look-ahead with proactive filtering + intervention) is broadly applicable across domains and aligns with current concerns in AI governance and tool-using agents, increasing real-world adoption potential. It also includes a training pipeline and evaluation on multiple established safety benchmarks. Paper 1 is valuable but narrower (clinical UI benchmarking) and primarily infrastructural, with impact concentrated in healthcare agent evaluation.

vs. Recognize Your Orchestrator: An Entropy Dynamics Perspective for LLM Multi-Agent Systems

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it introduces a concrete, reusable benchmark in a high-stakes domain (clinical UI automation) with clear real-world applications and safety-oriented evaluation. The dataset/benchmark can become a community standard, enabling broad, comparable progress across agents and model types, and is timely given rapid growth of computer-use agents. Paper 1 is conceptually novel but more speculative: the mean-field “entropy dynamics” framing and “reasoning trap” insight may be valuable, yet impact depends on adoption and validation beyond fitted trajectories, with less immediate applicability and standardization potential.

vs. ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents

claude-opus-4.6 · 6/3/2026

MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare, a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark (18 scenarios, 10 domains, 23 agents evaluated) with novel safety-aware evaluation dimensions specific to clinical use. The finding that even the best model achieves only 54.2% strict success (and that all models stay below 9% on the real OpenEMR instance) highlights a significant and actionable research gap. Its breadth of impact spans AI safety, clinical informatics, and agent research. ToolGate, while technically sound, addresses a narrower efficiency optimization problem with more incremental contributions.

vs. COMAP: Co-Evolving World Models and Agent Policies for LLM Agents

gemini-3.1 · 6/3/2026

Paper 1 introduces a foundational algorithmic framework (COMAP) for co-evolving world models and agent policies, applicable across diverse domains like embodied planning, web navigation, and tool use. While Paper 2 provides a highly valuable domain-specific medical benchmark, Paper 1's methodology addresses a core challenge in general AI agent design, offering broader theoretical impact and wider applicability across the machine learning landscape.

vs. Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

gpt-5.2 · 6/3/2026

Paper 2 has higher estimated impact due to broader applicability beyond a single domain: diagnosing and improving LLM tool/API use affects code generation, agents, software engineering, and LLM alignment/reliability. Its automated pipeline (discovering novel APIs, generating executable tasks, and fine-grained failure diagnostics) is methodologically strong and scalable across libraries and models, enabling continuous evaluation as APIs evolve. The findings on retrieval vs. parametric adaptation and non-interchangeable knowledge components are actionable and timely for current agentic coding systems. Paper 1 is valuable but more domain-specific and constrained by clinical UI availability.

vs. Solipsistic Superintelligence is Unlikely to be Cooperative

gemini-3.1 · 6/3/2026

Paper 1 provides a rigorous, immediately applicable benchmark for a critical real-world domain (healthcare), addressing a tangible gap in evaluating clinical AI safety. While Paper 2 offers a broad theoretical perspective on AI alignment, Paper 1 demonstrates superior methodological rigor through its interactive testbed, deterministic safety checking, and empirical baseline evaluation across 23 agents, guaranteeing immediate, measurable scientific impact and practical utility.

vs. Before the Model Learns the Bug: Fuzzing RLVR Verifiers

gemini-3.1 · 6/3/2026

Paper 2 introduces a rigorous, domain-specific benchmark for a high-stakes, real-world application (clinical computer-use agents). Benchmarks historically drive significant community progress and citations. Furthermore, it spans multiple fields (AI, healthcare, HCI) and incorporates critical safety dimensions. While Paper 1 addresses an important emerging issue in RL, Paper 2 offers broader interdisciplinary impact and tackles a major bottleneck in medical AI deployment.

vs. GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact due to timeliness and real-world applicability: validating computer-use agents in clinical GUIs directly affects healthcare safety, workflow automation, and regulatory evaluation. Its interactive, screenshot-only design plus deterministic checking and explicit safety dimensions improves methodological rigor and reproducibility for a high-stakes domain. The benchmark spans multiple medical domains and evaluates many agents, enabling broad adoption across ML, HCI, and clinical informatics. Paper 1 is valuable for AI-in-math evaluation, but its narrower domain (graph theory) and less direct deployment pathway likely limit cross-field and societal impact.

vs. Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

claude-opus-4.6 · 6/3/2026

MedCUA-Bench addresses a critical gap in evaluating AI agents for clinical software automation—a high-stakes, rapidly growing area. It introduces a novel benchmark spanning 18 clinical scenarios with safety-aware evaluation, tested across 23 agents, revealing significant reliability gaps. Its breadth of impact spans AI safety, healthcare informatics, and human-computer interaction. Paper 2, while methodologically sound, addresses a narrower educational technology problem (automated CS1 grading) with incremental improvements over existing approaches, limiting its cross-disciplinary impact and novelty.

vs. StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

claude-opus-4.6 · 6/3/2026

MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare, a high-stakes domain with enormous real-world impact. It introduces a novel benchmark covering 18 clinical scenarios with safety evaluation dimensions, revealing a significant performance gap (best model at 54.2% strict success, open-source agents averaging 2.5%) that will drive substantial future research. Its breadth across 10 medical domains, evaluation of 23 agents, and focus on clinical safety make it highly relevant and timely. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with more incremental contributions.

vs. Don't Gamble, GAMBLe: An Analytical Framework for AI-Driven Research Systems

gemini-3.1 · 6/3/2026

Paper 2 offers a foundational analytical framework for AI-Driven Research Systems, broadly applicable across multiple scientific and engineering domains. Its extensive methodological rigor and ability to challenge assumptions about frontier models provide a high breadth of impact. Paper 1, while highly relevant and addressing an important clinical gap, is a domain-specific benchmark with narrower applicability compared to the generalized framework presented in Paper 2.