MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
Jia Yu, Zilong Wang, Xinyang Jiang, Dongsheng Li, Shuo Wang
Abstract
Computer-use agents could automate repetitive screen-based clinical work, but their reliability in medical graphical user interfaces remains largely unvalidated. Existing benchmarks focus on general web or desktop tasks and underrepresent medical software, which requires domain knowledge, exhibits markedly different UI design from mainstream applications, lacks public testing environments, and demands safety validation beyond task completion. We introduce MedCUA-Bench, an interactive benchmark for clinical computer-use agents. It covers 18 clinical scenarios across 10 medical domains, reconstructed from real product manuals and open-source medical systems to capture authentic clinical interfaces while avoiding licensing and privacy constraints. Each task ships with paired intent- and step-level goals to disentangle clinical reasoning from UI execution, and is evaluated by a deterministic checker over task completion and five clinical safety dimensions. Across 23 agents, the best closed-source model reaches 54.2% strict success, while all models remain below 9% on the real OpenEMR. Open-source agents average only 2.5%, with the best reaching 16.2%. MedCUA-Bench exposes the gap between current agents and reliable clinical software use, providing a reproducible testbed for future research.
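The evaluation described in the abstract, a deterministic checker over task completion and five safety dimensions, might be sketched roughly as follows. This is a minimal illustration and not the benchmark's actual code: the names `EpisodeResult`, `check_episode`, `strict_success`, and the placeholder dimension labels are hypothetical, and the reading of "strict success" as "task completed with no safety violation" is an assumption, since the abstract does not define the term.

```python
from dataclasses import dataclass

# Placeholder labels: the abstract says there are five clinical safety
# dimensions but does not name them, so these identifiers are hypothetical.
SAFETY_DIMENSIONS = ("dim_1", "dim_2", "dim_3", "dim_4", "dim_5")


@dataclass(frozen=True)
class EpisodeResult:
    task_completed: bool
    safety_violations: frozenset = frozenset()


def check_episode(final_state, expected_state, flagged_dimensions=()):
    """Deterministically score one episode from its final application state.

    The task counts as completed only if every expected field matches exactly.
    """
    unknown = set(flagged_dimensions) - set(SAFETY_DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown safety dimensions: {sorted(unknown)}")
    completed = all(final_state.get(key) == value
                    for key, value in expected_state.items())
    return EpisodeResult(completed, frozenset(flagged_dimensions))


def strict_success(result):
    # Assumption: strict success = completed AND zero safety violations.
    return result.task_completed and not result.safety_violations


def strict_success_rate(results):
    results = list(results)
    if not results:
        return 0.0
    return sum(strict_success(r) for r in results) / len(results)
```

Because the checker compares final state against fixed expectations rather than asking a model to judge the trajectory, repeated runs on the same episode give the same score, which is the property the abstract's word "deterministic" points to.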
AI Impact Assessments
Scientific Impact Assessment: MedCUA-Bench
1. Core Contribution
MedCUA-Bench is the first interactive benchmark specifically designed to evaluate computer-use agents (CUAs) on clinical graphical user interfaces. The benchmark covers 18 clinical scenarios across 10 medical domains (outpatient, inpatient, ICU, nursing, imaging, etc.), comprising 432 task instances. Three key design innovations distinguish it from prior work:
2. Methodological Rigor
The experimental design is thorough, evaluating 23 vision-capable agents under a consistent screenshot-only protocol with no DOM/accessibility tree access. The evaluation framework is well-structured:
Strengths in methodology:
Methodological concerns:
3. Potential Impact
Immediate impact: The benchmark fills a genuine gap at the intersection of CUA research and healthcare AI. The finding that even GPT-5.4 reaches only 54.2% overall and drops to 8.3% on real OpenEMR is a sobering reality check that should temper enthusiasm about near-term clinical CUA deployment. This quantification of the capability gap is inherently valuable for the community.
Broader implications:
Limitations on impact:
4. Timeliness & Relevance
This work is highly timely. CUAs are advancing rapidly (GPT-5, Claude Opus 4.7, and similar systems are all 2026 models), and healthcare is frequently cited as a high-value application domain. The paper directly addresses a critical validation gap: while general CUA benchmarks proliferate, none adequately stress-test agents in clinical environments. Clinician burnout from EHR documentation is a well-documented problem (Sinsky et al., 2016), making the automation of clinical screen-based work an active area of interest. Establishing that current agents are nowhere near reliable enough for clinical use, before premature deployment occurs, is a valuable contribution.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall Assessment
MedCUA-Bench makes a well-motivated and well-executed contribution to an important gap in AI evaluation. Its primary value lies in establishing a reproducible baseline for clinical CUA capability, introducing the paired goal and safety-aware evaluation methodology, and providing concrete evidence that current agents are far from clinically reliable. The benchmark design is thoughtful and the analysis is thorough, though the reliance on synthetic environments for 15 of 18 scenarios and the limited safety signal under current agent capabilities are notable limitations.
Generated Jun 3, 2026
Comparison History (21)
MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare, a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark with 18 clinical scenarios, evaluates 23 agents, includes safety dimensions specific to clinical use, and reveals stark performance gaps (best model at 54.2%, open-source agents averaging 2.5%). Its methodological rigor (deterministic checkers, intent vs. step-level evaluation, real EMR testing) and breadth across 10 medical domains give it broader impact potential. PersistBench identifies important memory safety risks but addresses a narrower problem with less immediate real-world consequence.
Paper 2 likely has higher scientific impact because it introduces a broadly useful, timely benchmark and evaluation methodology for a high-stakes real-world domain (clinical UI automation), enabling reproducible comparison across many agents and spurring measurable progress. Its deterministic checker plus safety dimensions directly address deployment-critical validation gaps, and the benchmark can influence multiple fields (agentic AI, HCI, medical informatics, safety). Paper 1 is novel and rigorous, but its framework is narrower in immediate community-wide adoption than a benchmark that can become a standard.
MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare—a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark with safety evaluation dimensions, tests 23 agents, and reveals significant performance gaps that will drive future research. The clinical safety focus and reproducibility make it highly impactful. Paper 2, while technically sound in extending ICL to relational databases without training, addresses a narrower technical problem with less broad societal impact. The healthcare AI safety angle of Paper 1 gives it greater timeliness and cross-disciplinary relevance.
Paper 2 likely has higher impact: it targets a high-stakes, highly regulated real-world domain (clinical UI automation) with clear safety requirements, offering immediate applicability to evaluation and model development. Its benchmark design addresses a major gap (medical GUIs absent from existing agent benchmarks), provides deterministic checking plus multi-dimensional clinical safety metrics, and includes a realistic external validation (OpenEMR) showing strong generalization gaps—useful for the broader agent community. Paper 1 is timely and rigorous, but negotiation benchmarking is less safety-critical and more crowded, with narrower direct deployment pathways.
Paper 2 likely has higher scientific impact because it introduces a timely, broadly relevant benchmark for evaluating computer-use agents in high-stakes clinical GUI settings, with safety-oriented metrics and realistic tasks. Benchmarks often catalyze rapid progress across many groups and methods, and this one targets an underrepresented but societally important domain (healthcare automation). Its methodological rigor (deterministic checker, step-/intent-level goals, safety dimensions, multi-agent evaluation, real-system testing) and cross-field relevance (HCI, agentic AI, medical informatics, safety) suggest wide adoption. Paper 1 is innovative but more niche to RTL synthesis.
Paper 2 likely has higher impact due to immediate real-world relevance and adoption potential: it introduces a practical, reproducible benchmark targeting a high-stakes domain (clinical GUIs) with safety-oriented evaluation, enabling standardized comparison across many agents and catalyzing progress. Its methodology (interactive tasks, deterministic checker, intent vs step goals, real-system validation on OpenEMR) supports rigorous empirical research. Paper 1 is more theoretically novel, but its impact may be narrower and slower to translate, especially given negative/impossibility results for common classification settings.
Paper 2 introduces a critical benchmark in a high-stakes, real-world domain (clinical healthcare), addressing safety and reliability in medical UI automation. Benchmarks often drive significant community effort and standard-setting. While Paper 1 presents a strong algorithmic innovation for geometry, Paper 2's focus on healthcare automation presents broader societal applications, addresses a clear gap in evaluating clinical AI agents, and provides a foundational testbed for future research.
Paper 2 is more novel and timely by introducing a domain-specific, interactive benchmark for clinical computer-use agents with screenshot-only access, paired intent/step goals, and deterministic evaluation including multiple safety dimensions. Its real-world applicability is high given clinical UI automation needs and stringent safety requirements, and it provides a reusable testbed likely to become a standard for evaluating medical GUI agents. Paper 1 is useful and accessible, but multi-metric LLM evaluation pipelines are crowded, and its contribution appears more incremental compared to a new benchmark revealing major capability gaps in a high-stakes domain.
Harness-1 introduces a novel architectural principle—externalizing state management from the RL policy into a structured harness—that is broadly applicable beyond retrieval to many agent settings. It demonstrates strong empirical results across 8 benchmarks with clear generalization, and the core idea of separating semantic decisions from bookkeeping could influence RL-based agent design broadly. MedCUA-Bench is a valuable benchmark contribution for an important niche (clinical computer-use agents), but benchmarks typically have narrower impact than new training paradigms, and the domain is relatively specialized compared to Harness-1's generalizable methodology.
Paper 2 likely has higher scientific impact: it proposes a general, deployable server-side defense for MCP-based LLM agents addressing timely, high-stakes agent safety (power-seeking/tool acquisition). The approach (environment-grounded look-ahead with proactive filtering + intervention) is broadly applicable across domains and aligns with current concerns in AI governance and tool-using agents, increasing real-world adoption potential. It also includes a training pipeline and evaluation on multiple established safety benchmarks. Paper 1 is valuable but narrower (clinical UI benchmarking) and primarily infrastructural, with impact concentrated in healthcare agent evaluation.
Paper 2 likely has higher impact: it introduces a concrete, reusable benchmark in a high-stakes domain (clinical UI automation) with clear real-world applications and safety-oriented evaluation. The dataset/benchmark can become a community standard, enabling broad, comparable progress across agents and model types, and is timely given rapid growth of computer-use agents. Paper 1 is conceptually novel but more speculative: the mean-field “entropy dynamics” framing and “reasoning trap” insight may be valuable, yet impact depends on adoption and validation beyond fitted trajectories, with less immediate applicability and standardization potential.
MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare—a high-stakes domain with enormous practical importance. It introduces a comprehensive benchmark (18 scenarios, 10 domains, 23 agents evaluated) with novel safety-aware evaluation dimensions specific to clinical use. The finding that even the best models achieve only 54.2% success (and <9% on real systems) highlights a significant and actionable research gap. Its breadth of impact spans AI safety, clinical informatics, and agent research. ToolGate, while technically sound, addresses a narrower efficiency optimization problem with more incremental contributions.
Paper 1 introduces a foundational algorithmic framework (COMAP) for co-evolving world models and agent policies, applicable across diverse domains like embodied planning, web navigation, and tool use. While Paper 2 provides a highly valuable domain-specific medical benchmark, Paper 1's methodology addresses a core challenge in general AI agent design, offering broader theoretical impact and wider applicability across the machine learning landscape.
Paper 2 has higher estimated impact due to broader applicability beyond a single domain: diagnosing and improving LLM tool/API use affects code generation, agents, software engineering, and LLM alignment/reliability. Its automated pipeline (discovering novel APIs, generating executable tasks, and fine-grained failure diagnostics) is methodologically strong and scalable across libraries and models, enabling continuous evaluation as APIs evolve. The findings on retrieval vs. parametric adaptation and non-interchangeable knowledge components are actionable and timely for current agentic coding systems. Paper 1 is valuable but more domain-specific and constrained by clinical UI availability.
Paper 1 provides a rigorous, immediately applicable benchmark for a critical real-world domain (healthcare), addressing a tangible gap in evaluating clinical AI safety. While Paper 2 offers a broad theoretical perspective on AI alignment, Paper 1 demonstrates superior methodological rigor through its interactive testbed, deterministic safety checking, and empirical baseline evaluation across 23 agents, guaranteeing immediate, measurable scientific impact and practical utility.
Paper 2 introduces a rigorous, domain-specific benchmark for a high-stakes, real-world application (clinical computer-use agents). Benchmarks historically drive significant community progress and citations. Furthermore, it spans multiple fields (AI, healthcare, HCI) and incorporates critical safety dimensions. While Paper 1 addresses an important emerging issue in RL, Paper 2 offers broader interdisciplinary impact and tackles a major bottleneck in medical AI deployment.
Paper 2 has higher potential impact due to timeliness and real-world applicability: validating computer-use agents in clinical GUIs directly affects healthcare safety, workflow automation, and regulatory evaluation. Its interactive, screenshot-only design plus deterministic checking and explicit safety dimensions improves methodological rigor and reproducibility for a high-stakes domain. The benchmark spans multiple medical domains and evaluates many agents, enabling broad adoption across ML, HCI, and clinical informatics. Paper 1 is valuable for AI-in-math evaluation, but its narrower domain (graph theory) and less direct deployment pathway likely limit cross-field and societal impact.
MedCUA-Bench addresses a critical gap in evaluating AI agents for clinical software automation—a high-stakes, rapidly growing area. It introduces a novel benchmark spanning 18 clinical scenarios with safety-aware evaluation, tested across 23 agents, revealing significant reliability gaps. Its breadth of impact spans AI safety, healthcare informatics, and human-computer interaction. Paper 2, while methodologically sound, addresses a narrower educational technology problem (automated CS1 grading) with incremental improvements over existing approaches, limiting its cross-disciplinary impact and novelty.
MedCUA-Bench addresses a critical gap at the intersection of AI agents and healthcare, a high-stakes domain with enormous real-world impact. It introduces a novel benchmark covering 18 clinical scenarios with safety evaluation dimensions, revealing a significant performance gap (best model at 54.2%, open-source agents averaging 2.5%) that will drive substantial future research. Its breadth across 10 medical domains, evaluation of 23 agents, and focus on clinical safety make it highly relevant and timely. StepFinder, while technically solid, addresses a narrower problem (failure attribution in multi-agent systems) with more incremental contributions.
Paper 2 offers a foundational analytical framework for AI-Driven Research Systems, broadly applicable across multiple scientific and engineering domains. Its extensive methodological rigor and ability to challenge assumptions about frontier models provide a high breadth of impact. Paper 1, while highly relevant and addressing an important clinical gap, is a domain-specific benchmark with narrower applicability compared to the generalized framework presented in Paper 2.