PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black
Abstract
We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.
AI Impact Assessments
Scientific Impact Assessment: PhysicianBench
1. Core Contribution
PhysicianBench introduces a benchmark of 100 long-horizon clinical tasks for evaluating LLM agents operating within realistic Electronic Health Record (EHR) environments. The benchmark addresses a clear gap at the intersection of three properties absent from prior work: (1) long-horizon, multi-step tasks reflecting real clinical workflows (averaging 27 tool calls per task), (2) execution-grounded evaluation against a FHIR-compliant EHR environment where agent actions must modify actual system state, and (3) physician-validated tasks sourced from real e-consult cases rather than synthetic scenarios or exam questions. The key conceptual advance is moving medical AI evaluation from knowledge recall or intent assessment to verifiable workflow execution—the agent must not just know the right answer but must retrieve the right data, reason correctly, place the correct orders in the EHR, and produce appropriate documentation.
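To make "execution-grounded" concrete, the following minimal sketch checks whether an agent actually created a medication order by comparing FHIR search results taken before and after the run. It is an illustration only, not the benchmark's grader: the function names, the patient reference, and the sample bundles are hypothetical, and the only thing borrowed from FHIR is the standard R4 JSON shape (Bundle.entry[].resource, MedicationRequest.status, subject.reference, medicationCodeableConcept.text).

```python
from typing import List


def resources(bundle: dict, resource_type: str) -> List[dict]:
    """Return all resources of one type from a FHIR search Bundle (plain JSON dict)."""
    return [
        entry["resource"]
        for entry in bundle.get("entry", [])
        if entry.get("resource", {}).get("resourceType") == resource_type
    ]


def created_since(before: dict, after: dict, resource_type: str) -> List[dict]:
    """Resources present after the agent ran whose ids were absent beforehand."""
    known_ids = {r["id"] for r in resources(before, resource_type)}
    return [r for r in resources(after, resource_type) if r["id"] not in known_ids]


def order_was_placed(before: dict, after: dict, patient_ref: str, drug_name: str) -> bool:
    """True only if a NEW active MedicationRequest for this patient names the drug."""
    for req in created_since(before, after, "MedicationRequest"):
        text = req.get("medicationCodeableConcept", {}).get("text", "")
        if (
            req.get("status") == "active"
            and req.get("subject", {}).get("reference") == patient_ref
            and drug_name.lower() in text.lower()
        ):
            return True
    return False


def _med_request(res_id: str, drug: str) -> dict:
    # Hypothetical sample data in FHIR R4 JSON shape.
    return {
        "resource": {
            "resourceType": "MedicationRequest",
            "id": res_id,
            "status": "active",
            "intent": "order",
            "subject": {"reference": "Patient/example-1"},
            "medicationCodeableConcept": {"text": drug},
        }
    }


before_run = {"resourceType": "Bundle", "entry": [_med_request("mr-1", "Metformin 500 mg")]}
after_run = {
    "resourceType": "Bundle",
    "entry": [_med_request("mr-1", "Metformin 500 mg"), _med_request("mr-2", "Lisinopril 10 mg")],
}

print(order_was_placed(before_run, after_run, "Patient/example-1", "lisinopril"))
```

The point of the before/after comparison is that a pre-existing order does not count as agent work: only state the agent itself changed satisfies the check, which is what separates execution-grounded grading from grading stated intent.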
2. Methodological Rigor
Task Design: The POMDP formulation is clean and appropriate. Tasks are sourced from real e-consult cases spanning 21 subspecialties, providing genuine clinical complexity including comorbidities, longitudinal data, and real-world messiness. The multi-round physician validation pipeline (11 physicians, iterative review-revision-approval cycles) is well-designed, though the paper could be more transparent about inter-rater reliability and the rate of task rejection/revision.
Evaluation Framework: The checkpoint-based evaluation (670 checkpoints across 100 tasks) is a significant methodological contribution. The three-tier grader system (code, hybrid, LLM-judge) is thoughtfully designed—code graders verify EHR state changes, hybrid graders combine deterministic computation with LLM extraction, and LLM-judge graders handle open-ended clinical reasoning. However, the reliance on LLM judges for clinical reasoning and documentation checkpoints introduces potential bias, and the paper does not report inter-annotator agreement or human-LLM judge concordance rates.
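The three grader tiers can be pictured with a small sketch. Everything here is an assumption for illustration: the checkpoint names, the state dictionary, and the rule that a task succeeds only when every checkpoint passes are not taken from the paper, and the LLM extraction and LLM judge steps are replaced by a regex and a keyword check so the example runs offline.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    name: str
    kind: str  # "code", "hybrid", or "llm_judge"
    grade: Callable[[dict], bool]


def code_grader(state: dict) -> bool:
    # Code tier: deterministic check on EHR state (was a potassium lab ordered?).
    return any(o.get("code") == "potassium" for o in state.get("orders", []))


def hybrid_grader(state: dict) -> bool:
    # Hybrid tier: extract a value from free text (regex stands in for an LLM
    # extractor), then compare it deterministically against the expected value.
    match = re.search(r"eGFR\s*[:=]?\s*(\d+(?:\.\d+)?)", state.get("note", ""))
    return match is not None and abs(float(match.group(1)) - state["expected_egfr"]) <= 1.0


def judge_grader(state: dict) -> bool:
    # LLM-judge tier: placeholder keyword check standing in for a rubric-based judge.
    return "follow-up" in state.get("note", "").lower()


def grade_task(checkpoints: List[Checkpoint], state: dict) -> Dict[str, object]:
    results = {cp.name: cp.grade(state) for cp in checkpoints}
    return {
        "checkpoints": results,
        "checkpoint_rate": sum(results.values()) / len(results),
        # Assumed rule: the task counts as solved only if all checkpoints pass.
        "success": all(results.values()),
    }


CHECKPOINTS = [
    Checkpoint("lab_ordered", "code", code_grader),
    Checkpoint("egfr_reported", "hybrid", hybrid_grader),
    Checkpoint("plan_documented", "llm_judge", judge_grader),
]

complete_state = {
    "orders": [{"code": "potassium"}],
    "note": "eGFR: 52. Recommend follow-up in 4 weeks.",
    "expected_egfr": 52.0,
}
partial_state = dict(complete_state, orders=[])

print(grade_task(CHECKPOINTS, complete_state)["success"])
print(grade_task(CHECKPOINTS, partial_state)["checkpoint_rate"])
```

Keeping the grader tier as an explicit field is what would let one report, per tier, how often an LLM judge agrees with a human reviewer, which is the concordance figure the assessment notes is missing.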
Experimental Design: The evaluation of 13 models across 3 independent runs with pass@1, pass@3, and pass^3 metrics is thorough. The use of isolated Docker containers per task ensures reproducibility. The lightweight agent framework avoids confounding model capability with scaffolding differences. One concern: all models run at their default temperature settings, which may advantage some over others.
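For readers unfamiliar with the three metrics, here is a minimal sketch of how they are conventionally computed from per-run outcomes. The definitions are the common ones and are assumed rather than quoted from the paper: pass@1 averages the per-run success rate, pass@3 counts a task as solved if any of the 3 runs succeeds, and pass^3 requires all 3 runs to succeed. The task ids and outcomes are made up.

```python
from typing import Dict, List


def pass_metrics(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """results maps a task id to its per-run success flags (one flag per independent run)."""
    if not results:
        raise ValueError("results must contain at least one task")
    n_tasks = len(results)
    # pass@1: expected success of a single run, averaged over tasks.
    pass_at_1 = sum(sum(runs) / len(runs) for runs in results.values()) / n_tasks
    # pass@k: at least one of the k runs succeeds.
    pass_at_k = sum(any(runs) for runs in results.values()) / n_tasks
    # pass^k: all k runs succeed (a consistency measure).
    pass_hat_k = sum(all(runs) for runs in results.values()) / n_tasks
    return {"pass@1": pass_at_1, "pass@3": pass_at_k, "pass^3": pass_hat_k}


# Hypothetical outcomes for four tasks, three runs each.
example_runs = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}

metrics = pass_metrics(example_runs)
print(metrics)
```

In this toy example pass@1 is 0.5, pass@3 is 0.75, and pass^3 is 0.25. The spread between pass@3 and pass^3 is a direct measure of run-to-run inconsistency, which matters more in a clinical setting than the best-of-three figure.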
Statistical Reporting: Error bars are reported for pass@1, but they rest on only 3 runs, so the intervals should be read with caution. Some key claims (e.g., specialty-level differences) may also lack statistical power given the small per-specialty sample sizes (as few as 6 tasks per group).
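The power concern can be made tangible with a Wilson score interval for a binomial proportion. This is a sketch under simplifying assumptions (each task treated as one independent pass/fail trial, 95% level with z = 1.96); it is not the paper's method, and the success counts are illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# A specialty group with 6 tasks, 3 of them solved (illustrative numbers).
small_lo, small_hi = wilson_interval(3, 6)
# The full benchmark: 100 tasks, 46 solved, matching the reported best pass@1.
full_lo, full_hi = wilson_interval(46, 100)

print(f"6 tasks:   [{small_lo:.2f}, {small_hi:.2f}]")
print(f"100 tasks: [{full_lo:.2f}, {full_hi:.2f}]")
```

With 6 tasks the interval covers more than 60 percentage points, so two specialties would have to differ enormously before the difference could be distinguished from noise. At 100 tasks the interval narrows to under 20 points.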
3. Potential Impact
Immediate Impact on Clinical AI Development: PhysicianBench directly addresses the growing need for rigorous evaluation as LLM agents move toward clinical deployment. The benchmark provides a concrete, reproducible testbed that can serve as a gatekeeper for clinical agent development—the 46% ceiling even for the best model clearly demonstrates that autonomous clinical agents are not ready for deployment.
Benchmarking Infrastructure: The fully open-source release (environments, agent framework, evaluation harness) lowers the barrier for the community to build upon this work. The FHIR-based design ensures alignment with real-world EHR standards (Epic, Cerner), making results more translatable than benchmarks built on proprietary or ad-hoc schemas.
Error Analysis Insights: The finding that clinical reasoning (50.4% of failures) rather than tool use or data retrieval is the primary bottleneck is actionable—it redirects research attention toward improving clinical reasoning capabilities rather than tool-calling mechanics. The fine-grained failure taxonomy (incomplete reasoning, output gap, cascade errors) provides diagnostic value for model developers.
Cross-domain Relevance: The benchmark design pattern—long-horizon tasks in domain-specific environments with checkpoint-based evaluation—could inspire similar benchmarks in other high-stakes professional domains (legal, financial, engineering).
4. Timeliness & Relevance
This work is exceptionally timely. The rush to deploy LLM agents in healthcare creates an urgent need for rigorous, realistic evaluation. Existing benchmarks (MedQA, HealthBench, AgentClinic, MedAgentBench) each capture only part of clinical work. The paper arrives as major EHR vendors are actively integrating LLM capabilities, and the FHIR-based design directly mirrors the API infrastructure these vendors use. Citing EHR documentation burden as motivation ties the benchmark to a real clinical pain point.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
PhysicianBench makes a meaningful contribution by establishing the first benchmark that combines long-horizon clinical tasks, real EHR environments with execution-grounded verification, and physician validation. It addresses a genuine and timely gap in clinical AI evaluation. The methodology is generally sound, though limitations in scale, institutional diversity, and LLM-judge validation temper the strength of conclusions. The benchmark is likely to become a standard reference point for measuring progress toward autonomous clinical agents, though future iterations should address the institutional monoculture and scale limitations.
Generated May 5, 2026
Comparison History (80)
Paper 2 introduces a rigorous, real-world benchmark for LLM agents in a critical, high-stakes domain (healthcare). Benchmarks in AI often drive immediate empirical progress and garner high citations. While Paper 1 proposes a potentially foundational theoretical framework for LLM reasoning, its impact relies on untested empirical adoption, whereas Paper 2 offers a highly structured, timely, and practical evaluation tool with clear, immediate utility for medical AI development.
PhysicianBench offers a concrete, execution-grounded benchmark with real EHR data, 100 long-horizon clinical tasks, and measurable results across 13 LLMs, filling a clear gap in medical AI evaluation. Its practical contribution—a reusable benchmark revealing a substantial capability gap (46% best performance)—will likely drive targeted research improvements. Paper 1 presents a thoughtful conceptual framework for contextual multi-objective optimization but lacks empirical validation or implementation results, making it more of a position paper. Benchmarks with concrete findings tend to generate more citations and downstream research impact than theoretical frameworks without empirical grounding.
Paper 1 is more likely to have higher scientific impact because it delivers a concrete, execution-grounded benchmark in a high-stakes, data-rich real-world environment (EHRs), enabling reproducible measurement and driving immediate empirical progress in clinical agent research. Its methodological rigor (physician-reviewed tasks, real records, API-executable actions, scripted checkpoint verification) and clear performance gaps make it a strong catalyst for follow-on work across ML, HCI, and health informatics. Paper 2 is timely and potentially broad but is primarily a conceptual framework with less direct empirical validation, reducing near-term measurable impact.
Paper 2 introduces a rigorously designed, real-world benchmark in a high-stakes domain (healthcare). By bridging the gap between static medical benchmarks and complex clinical workflows, it will drive critical advancements in medical AI. While Paper 1 offers a valuable efficiency improvement for LLMs, Paper 2's potential to shape the safe deployment of AI in clinical settings gives it broader societal and scientific impact.
Paper 2 likely has higher impact: it introduces an execution-grounded, real-EHR benchmark for long-horizon clinical agent workflows with physician-reviewed tasks and scripted verification, enabling rigorous, reproducible measurement that can steer the field. Its real-world applicability (healthcare/EHR automation), breadth (agent evaluation, tool-use, safety, clinical NLP), and timeliness (LLM agents) are strong. Paper 1 is novel but centers on offensive social-engineering in AR, which may limit adoption, ethical acceptability, and downstream reuse despite an IRB study and dataset.
PhysicianBench addresses a critical gap in evaluating LLM agents for real-world clinical workflows using actual EHR environments with real patient records. It provides a rigorous, execution-grounded benchmark spanning 21 specialties with 670 structured checkpoints, offering broad utility for the rapidly growing clinical AI community. Its methodological rigor (real APIs, physician-reviewed tasks, structured evaluation) and practical relevance to healthcare AI deployment give it wider and more constructive scientific impact. PhySE, while novel, focuses on enabling social engineering attacks—a narrower, ethically contentious domain with more limited positive applications.
Paper 2 likely has higher impact due to its novelty and rigor in creating an execution-grounded, real-EHR benchmark with physician-reviewed long-horizon tasks and scripted verification. It enables standardized, realistic evaluation of LLM agents in a high-stakes domain, with broad relevance to ML, agentic systems, healthcare informatics, safety, and policy. The demonstrated performance gap provides a clear research target and will likely drive follow-on work. Paper 1 is practically valuable (cost/accuracy) but conceptually closer to existing retrieval/distillation paradigms and is narrower in domain impact.
Paper 2 likely has higher impact: it introduces a realistic, execution-grounded benchmark in EHR environments with physician-reviewed long-horizon tasks across 21 specialties, enabling standardized evaluation of clinical LLM agents and revealing major capability gaps. Its applications and societal relevance are broad and timely (healthcare AI, agent evaluation, safety), and the methodology includes verifiable tool execution plus structured checkpoints. Paper 1 is innovative for geoscience agentic lithology classification, but its domain scope and downstream impact are narrower compared to a widely usable clinical benchmark.
PhysicianBench addresses a broader and more impactful problem—evaluating LLM agents in real-world clinical EHR environments—with significant implications for healthcare AI safety and deployment. Its benchmark spans 21 medical specialties, uses real patient records with execution-grounded verification, and reveals critical capability gaps in current LLM agents (best at 46% success). This has wider cross-disciplinary impact (AI, medicine, health informatics) and higher timeliness given the rapid deployment of clinical AI. GeoMind, while novel in applying agentic workflows to geoscience, targets a narrower domain with more limited broader impact.
Paper 2 likely has higher impact: it introduces an execution-grounded, long-horizon benchmark embedded in real EHR-like environments with physician-reviewed tasks across 21 specialties, addressing a timely and widely recognized bottleneck for deploying LLM agents in healthcare. Its methodological rigor (API-level verification, structured checkpoints) and broad relevance to clinical AI, agent evaluation, safety, and human-computer interaction increase cross-field influence. Paper 1 is innovative and useful for drug discovery, but its reported extraction F1 (0.32) suggests earlier-stage capability and a narrower immediate audience than an EHR benchmark likely to become a standard evaluation reference.
Paper 2 (BioMiner) has higher likely scientific impact due to broader cross-field applicability (drug discovery, cheminformatics, bioNLP/vision, database curation), clear real-world utility demonstrated via large-scale mining plus measurable downstream gains (QSAR improvements, hit discovery, annotation speed/accuracy), and a sizable new benchmark (BioVista). While Paper 1 is timely and rigorous for evaluating clinical LLM agents, it is primarily a benchmark with narrower deployment constraints (EHR access, privacy/regulation). BioMiner’s outputs directly enable new data resources and model improvements across many biomedical pipelines.
Paper 1 addresses a critical bottleneck in a high-stakes domain (healthcare AI) by moving beyond static benchmarks to real-world, execution-grounded EHR workflows. Its scale, clinical validation, and direct applicability to developing autonomous medical agents give it broader implications and higher potential impact across both AI research and clinical practice compared to the more niche educational application presented in Paper 2.
Paper 2 introduces a complex, execution-grounded benchmark in healthcare, a critical and high-stakes field for AI development. By addressing the significant gap between static knowledge recall and long-horizon, real-world clinical workflows within EHR environments, it provides a foundational tool for the AI community. Benchmarks of this scale and rigor typically drive substantial field-wide progress and receive high citations, offering broader scientific and societal impact compared to the specific educational application presented in Paper 1.
PhysicianBench addresses a critical gap in evaluating LLM agents for real clinical workflows using actual EHR systems and patient records, with rigorous physician-reviewed tasks across 21 specialties. Its execution-grounded benchmark methodology, revealing a substantial performance gap (best model at 46%), provides a concrete, reproducible measuring stick for the high-stakes domain of clinical AI. While ResearchEVO is innovative in automating scientific discovery, its claims of 'publication-ready' papers and novel discoveries need extensive validation. PhysicianBench's immediate practical relevance to healthcare AI safety and its methodological rigor give it broader and more grounded impact.
Paper 2 is likely to have higher scientific impact due to strong real-world relevance and immediate applicability: it introduces an execution-grounded, long-horizon benchmark embedded in realistic EHR workflows with physician review, multi-specialty coverage, and scripted verification—addressing a central bottleneck for deploying clinical LLM agents safely. Methodological rigor (real APIs, checkpoints, environment execution) and timeliness (healthcare AI evaluation) suggest broad adoption by both academia and industry. Paper 1 is novel, but end-to-end “automated discovery + paper writing” claims may face reproducibility and trust hurdles that can slow uptake.
Paper 1 likely has higher impact due to stronger novelty and real-world relevance: it evaluates long-horizon, execution-grounded LLM agent workflows in realistic EHR environments using standard APIs and real patient records, with physician-reviewed tasks and verifiable checkpoints. This directly targets a high-stakes domain (clinical operations) with clear applications and broad implications for agent evaluation, safety, and healthcare AI. Paper 2 is a useful, timely benchmark for MLLM visual-text reconstruction, but is narrower in application scope and appears less methodologically demanding than an interactive, tool-using EHR benchmark.
Paper 2 likely has higher impact due to broad, timely relevance: it introduces an execution-grounded, long-horizon benchmark closely reflecting real EHR workflows, enabling standardized evaluation across many agent designs and driving progress in clinical AI safety and autonomy. Its methodological rigor (real records, physician review, API-based environment, scripted verification checkpoints) supports reproducibility and meaningful comparisons. The benchmark can influence multiple fields (LLM agents, healthcare informatics, human-computer interaction, evaluation science). Paper 1 is innovative and efficient, but its impact may be narrower to EHR-QA modeling.
Paper 1 addresses a critical, high-stakes domain (healthcare) by introducing an execution-grounded benchmark for LLM agents in real-world EHR environments. Its focus on long-horizon, composite clinical workflows evaluated by medical professionals significantly advances beyond existing static medical benchmarks. While Paper 2 offers a valuable tool for AI governance and documentation, the potential real-world impact, methodological rigor involving real clinical systems, and the urgent need to address physician burnout and healthcare automation make Paper 1's contribution more transformative and likely to drive substantial future research.
Paper 1 likely has higher impact due to strong timeliness (LLM agents in healthcare), high novelty in providing execution-grounded, long-horizon EHR benchmarking with real records/APIs, and broad applicability across AI evaluation, clinical NLP/agents, human-AI interaction, and health informatics. Its methodological rigor (physician-reviewed tasks, scripted verification checkpoints) enables standardized progress measurement and could shape both research and industry practice. Paper 2 is technically solid and useful for engineering PDE workflows, but is more niche (PINNs/meta-learning) with narrower cross-field spillover and less immediate societal impact.
Paper 2 proposes a novel, tool-grounded, closed-loop multi-agent framework with continual self-improvement via an interpretable, reusable skill library, and demonstrates substantial PPA gains on 20 real-world RTL designs against a strong commercial baseline—suggesting immediate industrial applicability and methodological rigor. Its impact can extend across hardware design automation, agent learning, and program optimization. Paper 1 is timely and valuable as a realistic clinical benchmark, but is primarily an evaluation contribution (limited direct downstream utility without deployment pathways) and may face data/access constraints that reduce breadth of adoption.