PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments

Ruoqi Liu, Imran Q. Mohiuddin, Austin J. Schoeffler, Kavita Renduchintala, Ashwin Nayak, Prasantha L. Vemu, Shivam C. Vedak, Kameron C. Black

May 4, 2026

arXiv:2605.02240v1

cs.AI (primary)

#103 of 2292 · Artificial Intelligence

Tournament Score: 1541±36 (scale 1050 to 1800)

Win Rate: 81%

Rating: 7.5/10

Significance 8 · Rigor 7 · Novelty 7.5 · Clarity 8

Abstract

We introduce PhysicianBench, a benchmark for evaluating LLM agents on physician tasks grounded in real clinical settings within electronic health record (EHR) environments. Existing medical agent benchmarks primarily focus on static knowledge recall, single-step atomic actions, or action intent without verifiable execution against the environment. As a result, they fail to capture the long-horizon, composite workflows that characterize real clinical systems. PhysicianBench comprises 100 long-horizon tasks adapted from real consultation cases between primary care and subspecialty physicians, with each task independently reviewed by a separate panel of physicians. Tasks are instantiated in an EHR environment with real patient records and accessed through the same standard APIs used by commercial EHR vendors. Tasks span 21 specialties (e.g., cardiology, endocrinology, oncology, psychiatry) and diverse workflow types (e.g., diagnosis interpretation, medication prescribing, treatment planning), requiring an average of 27 tool calls per task. Solving each task requires retrieving data across encounters, reasoning over heterogeneous clinical information, executing consequential clinical actions, and producing clinical documentation. Each task is decomposed into structured checkpoints (670 in total across the benchmark) that capture distinct stages of completion and are graded by task-specific scripts with execution-grounded verification. Across 13 proprietary and open-source LLM agents, the best-performing model achieves only a 46% success rate (pass@1), while open-source models reach at most 19%, revealing a substantial gap between current agent capabilities and the demands of real-world clinical workflows. PhysicianBench provides a realistic and execution-grounded benchmark for measuring progress toward autonomous clinical agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PhysicianBench

1. Core Contribution

PhysicianBench introduces a benchmark of 100 long-horizon clinical tasks for evaluating LLM agents operating within realistic Electronic Health Record (EHR) environments. The benchmark addresses a clear gap at the intersection of three properties absent from prior work: (1) long-horizon, multi-step tasks reflecting real clinical workflows (averaging 27 tool calls per task), (2) execution-grounded evaluation against a FHIR-compliant EHR environment where agent actions must modify actual system state, and (3) physician-validated tasks sourced from real e-consult cases rather than synthetic scenarios or exam questions. The key conceptual advance is moving medical AI evaluation from knowledge recall or intent assessment to verifiable workflow execution—the agent must not just know the right answer but must retrieve the right data, reason correctly, place the correct orders in the EHR, and produce appropriate documentation.

2. Methodological Rigor

Task Design: The POMDP formulation is clean and appropriate. Tasks are sourced from real e-consult cases spanning 21 subspecialties, providing genuine clinical complexity including comorbidities, longitudinal data, and real-world messiness. The multi-round physician validation pipeline (11 physicians, iterative review-revision-approval cycles) is well-designed, though the paper could be more transparent about inter-rater reliability and the rate of task rejection/revision.

Evaluation Framework: The checkpoint-based evaluation (670 checkpoints across 100 tasks) is a significant methodological contribution. The three-tier grader system (code, hybrid, LLM-judge) is thoughtfully designed—code graders verify EHR state changes, hybrid graders combine deterministic computation with LLM extraction, and LLM-judge graders handle open-ended clinical reasoning. However, the reliance on LLM judges for clinical reasoning and documentation checkpoints introduces potential bias, and the paper does not report inter-annotator agreement or human-LLM judge concordance rates.
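
To make the idea of a code grader concrete, here is a minimal sketch of checkpoint grading against post-execution EHR state. It is not the benchmark's harness: the state layout, the `Checkpoint` class, the field names (`status`, `medication`, `text`), the helper functions, and the rule that a task passes only when every checkpoint passes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical, simplified view of the final EHR state: resource type -> resources.
EhrState = Dict[str, List[dict]]


@dataclass
class Checkpoint:
    name: str
    grader: Callable[[EhrState], bool]  # deterministic "code grader"


def has_active_order(state: EhrState, medication: str) -> bool:
    """True if an active medication order for `medication` exists in the state."""
    return any(
        r.get("status") == "active" and r.get("medication") == medication
        for r in state.get("MedicationRequest", [])
    )


def has_note(state: EhrState) -> bool:
    """True if at least one non-empty note was written back."""
    return any(r.get("text", "").strip() for r in state.get("DocumentReference", []))


def grade(checkpoints: List[Checkpoint], state: EhrState) -> Dict[str, bool]:
    """Run every checkpoint against the post-execution state."""
    return {cp.name: cp.grader(state) for cp in checkpoints}


checkpoints = [
    Checkpoint("order_placed", lambda s: has_active_order(s, "metformin")),
    Checkpoint("note_written", has_note),
]

# Toy final state: the order exists, but no note was written.
final_state = {
    "MedicationRequest": [{"status": "active", "medication": "metformin"}],
    "DocumentReference": [],
}

results = grade(checkpoints, final_state)
# Assumption: a task counts as solved only if all of its checkpoints pass.
task_passed = all(results.values())
```

The point of the design is that the grader inspects what the agent actually changed in the environment, so an agent that only states its intent to place the order would fail `order_placed`.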

Experimental Design: The evaluation of 13 models across 3 independent runs with pass@1, pass@3, and pass^3 metrics is thorough. The use of isolated Docker containers per task ensures reproducibility. The lightweight agent framework avoids confounding model capability with scaffolding differences. One concern: all models use default temperature, which may advantage some over others.
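
The three metrics can be sketched as follows, assuming the common definitions with exactly three runs per task: pass@1 is the mean per-run success rate, pass@3 counts a task as solved if any run succeeds, and pass^3 counts it only if all runs succeed. The paper's exact estimator is not given on this page, and the function name and toy data below are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Tuple


def pass_metrics(results: Dict[str, List[bool]]) -> Tuple[float, float, float]:
    """results maps a task id to one success flag per independent run."""
    per_task = list(results.values())
    # pass@1: average success rate of a single run, averaged over tasks.
    pass_at_1 = mean(mean(1.0 if ok else 0.0 for ok in runs) for runs in per_task)
    # pass@k: fraction of tasks solved in at least one of the k runs.
    pass_at_k = mean(1.0 if any(runs) else 0.0 for runs in per_task)
    # pass^k: fraction of tasks solved in every one of the k runs.
    pass_hat_k = mean(1.0 if all(runs) else 0.0 for runs in per_task)
    return pass_at_1, pass_at_k, pass_hat_k


# Toy data: four tasks, three runs each.
toy = {
    "task_a": [True, True, True],
    "task_b": [True, False, False],
    "task_c": [False, False, False],
    "task_d": [False, True, True],
}
p1, p3, p_all = pass_metrics(toy)
```

Under these definitions pass^3 can never exceed pass@1, which in turn can never exceed pass@3, so the gap between the three gives a rough view of run-to-run consistency.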

Statistical Reporting: Error bars are reported for pass@1, but they are estimated from only 3 runs, and some key claims (e.g., specialty-level differences) may lack statistical power given the small per-specialty sample sizes (as few as 6 tasks per group).
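
To illustrate why groups of 6 tasks carry little statistical power, the sketch below computes a 95% Wilson score interval for a success proportion. It treats tasks as independent Bernoulli trials, which is a simplifying assumption, and the counts are hypothetical rather than taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical specialty group: 3 of 6 tasks solved.
low, high = wilson_interval(3, 6)
width_small = high - low

# Same success rate on a hypothetical group of 100 tasks, for comparison.
low_100, high_100 = wilson_interval(50, 100)
width_large = high_100 - low_100
```

With 3 of 6 tasks solved the interval spans roughly 0.19 to 0.81, so two specialties would need very different observed rates before a difference could be distinguished from noise.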

3. Potential Impact

Immediate Impact on Clinical AI Development: PhysicianBench directly addresses the growing need for rigorous evaluation as LLM agents move toward clinical deployment. The benchmark provides a concrete, reproducible testbed that can serve as a gatekeeper for clinical agent development—the 46% ceiling even for the best model clearly demonstrates that autonomous clinical agents are not ready for deployment.

Benchmarking Infrastructure: The fully open-source release (environments, agent framework, evaluation harness) lowers the barrier for the community to build upon this work. The FHIR-based design ensures alignment with real-world EHR standards (Epic, Cerner), making results more translatable than benchmarks built on proprietary or ad-hoc schemas.

Error Analysis Insights: The finding that clinical reasoning (50.4% of failures) rather than tool use or data retrieval is the primary bottleneck is actionable—it redirects research attention toward improving clinical reasoning capabilities rather than tool-calling mechanics. The fine-grained failure taxonomy (incomplete reasoning, output gap, cascade errors) provides diagnostic value for model developers.

Cross-domain Relevance: The benchmark design pattern—long-horizon tasks in domain-specific environments with checkpoint-based evaluation—could inspire similar benchmarks in other high-stakes professional domains (legal, financial, engineering).

4. Timeliness & Relevance

This work is exceptionally timely. The rush to deploy LLM agents in healthcare creates urgent need for rigorous, realistic evaluation. Existing benchmarks (MedQA, HealthBench, AgentClinic, MedAgentBench) each capture only partial aspects of clinical work. The paper arrives as major EHR vendors are actively integrating LLM capabilities, and the FHIR-based design directly mirrors the API infrastructure these vendors use. The documentation of EHR burden as motivation connects to a real clinical pain point.

5. Strengths & Limitations

Key Strengths:

Ecological validity: Tasks derived from real e-consult cases with real (de-identified) patient records capture the complexity, heterogeneity, and messiness of clinical practice in ways synthetic benchmarks cannot.

Execution-grounded verification: Checking EHR state post-execution rather than just intent is a critical advance for safety-relevant evaluation.

Comprehensive error analysis: The multi-level failure taxonomy and head-to-head analysis (GPT-5.5 vs. Claude Opus 4.6) provide genuine insight rather than just leaderboard rankings.

Reproducibility commitment: Open-source release with containerized environments sets a high standard.

Notable Limitations:

Scale: 100 tasks is relatively small; per-specialty analyses (some groups have only 6 tasks) have limited statistical power. The benchmark may be sensitive to task selection.

Single institution: All tasks derive from Stanford's e-consult system, potentially introducing institutional biases in clinical practice patterns, terminology usage, and documentation style.

LLM-judge reliability: A substantial fraction of checkpoints (clinical reasoning and documentation) rely on LLM judges, but no validation of judge accuracy against human gold standard is provided.

Narrow clinical scope: The e-consult framing, while natural, excludes important clinical workflows (inpatient care, emergency settings, procedural planning, longitudinal chronic disease management).

Privacy considerations: While de-identification is described, the use of real patient data with perturbations warrants scrutiny—the paper describes the process but doesn't validate re-identification risk.

No human baseline: The paper does not compare agent performance against human physician completion rates on the same tasks, which would contextualize the 46% ceiling more meaningfully.

Agent architecture: All models use a minimal tool-calling loop; more sophisticated agent architectures (planning, memory, retrieval augmentation) are not explored, leaving it unclear how much headroom exists from better scaffolding vs. better models.

Overall Assessment

PhysicianBench makes a meaningful contribution by establishing the first benchmark that combines long-horizon clinical tasks, real EHR environments with execution-grounded verification, and physician validation. It addresses a genuine and timely gap in clinical AI evaluation. The methodology is generally sound, though limitations in scale, institutional diversity, and LLM-judge validation temper the strength of conclusions. The benchmark is likely to become a standard reference point for measuring progress toward autonomous clinical agents, though future iterations should address the institutional monoculture and scale limitations.

Rating: 7.5/10

Significance 8 · Rigor 7 · Novelty 7.5 · Clarity 8

Generated May 5, 2026

Comparison History (80)

vs. Enhanced and Efficient Reasoning in Large Learning Models

gemini-3.1 · 5/15/2026

Paper 2 introduces a rigorous, real-world benchmark for LLM agents in a critical, high-stakes domain (healthcare). Benchmarks in AI often drive immediate empirical progress and garner high citations. While Paper 1 proposes a potentially foundational theoretical framework for LLM reasoning, its impact relies on untested empirical adoption, whereas Paper 2 offers a highly structured, timely, and practical evaluation tool with clear, immediate utility for medical AI development.

vs. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

claude-opus-4.6 · 5/6/2026

PhysicianBench offers a concrete, execution-grounded benchmark with real EHR data, 100 long-horizon clinical tasks, and measurable results across 13 LLMs, filling a clear gap in medical AI evaluation. Its practical contribution—a reusable benchmark revealing a substantial capability gap (46% best performance)—will likely drive targeted research improvements. Paper 1 presents a thoughtful conceptual framework for contextual multi-objective optimization but lacks empirical validation or implementation results, making it more of a position paper. Benchmarks with concrete findings tend to generate more citations and downstream research impact than theoretical frameworks without empirical grounding.

vs. Contextual Multi-Objective Optimization: Rethinking Objectives in Frontier AI Systems

gpt-5.2 · 5/6/2026

Paper 1 is more likely to have higher scientific impact because it delivers a concrete, execution-grounded benchmark in a high-stakes, data-rich real-world environment (EHRs), enabling reproducible measurement and driving immediate empirical progress in clinical agent research. Its methodological rigor (physician-reviewed tasks, real records, API-executable actions, scripted checkpoint verification) and clear performance gaps make it a strong catalyst for follow-on work across ML, HCI, and health informatics. Paper 2 is timely and potentially broad but is primarily a conceptual framework with less direct empirical validation, reducing near-term measurable impact.

vs. Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

gemini-3 · 5/5/2026

Paper 2 introduces a rigorously designed, real-world benchmark in a high-stakes domain (healthcare). By bridging the gap between static medical benchmarks and complex clinical workflows, it will drive critical advancements in medical AI. While Paper 1 offers a valuable efficiency improvement for LLMs, Paper 2's potential to shape the safe deployment of AI in clinical settings gives it broader societal and scientific impact.

vs. PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces an execution-grounded, real-EHR benchmark for long-horizon clinical agent workflows with physician-reviewed tasks and scripted verification, enabling rigorous, reproducible measurement that can steer the field. Its real-world applicability (healthcare/EHR automation), breadth (agent evaluation, tool-use, safety, clinical NLP), and timeliness (LLM agents) are strong. Paper 1 is novel but centers on offensive social-engineering in AR, which may limit adoption, ethical acceptability, and downstream reuse despite an IRB study and dataset.

vs. PhySE: A Psychological Framework for Real-Time AR-LLM Social Engineering Attacks

claude-opus-4.6 · 5/5/2026

PhysicianBench addresses a critical gap in evaluating LLM agents for real-world clinical workflows using actual EHR environments with real patient records. It provides a rigorous, execution-grounded benchmark spanning 21 specialties with 670 structured checkpoints, offering broad utility for the rapidly growing clinical AI community. Its methodological rigor (real APIs, physician-reviewed tasks, structured evaluation) and practical relevance to healthcare AI deployment give it wider and more constructive scientific impact. PhySE, while novel, focuses on enabling social engineering attacks—a narrower, ethically contentious domain with more limited positive applications.

vs. Thinking with Reasoning Skills: Fewer Tokens, More Accuracy

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact due to its novelty and rigor in creating an execution-grounded, real-EHR benchmark with physician-reviewed long-horizon tasks and scripted verification. It enables standardized, realistic evaluation of LLM agents in a high-stakes domain, with broad relevance to ML, agentic systems, healthcare informatics, safety, and policy. The demonstrated performance gap provides a clear research target and will likely drive follow-on work. Paper 1 is practically valuable (cost/accuracy) but conceptually closer to existing retrieval/distillation paradigms and is narrower in domain impact.

vs. GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a realistic, execution-grounded benchmark in EHR environments with physician-reviewed long-horizon tasks across 21 specialties, enabling standardized evaluation of clinical LLM agents and revealing major capability gaps. Its applications and societal relevance are broad and timely (healthcare AI, agent evaluation, safety), and the methodology includes verifiable tool execution plus structured checkpoints. Paper 1 is innovative for geoscience agentic lithology classification, but its domain scope and downstream impact are narrower compared to a widely usable clinical benchmark.

vs. GeoMind: An Agentic Workflow for Lithology Classification with Reasoned Tool Invocation

claude-opus-4.6 · 5/5/2026

PhysicianBench addresses a broader and more impactful problem—evaluating LLM agents in real-world clinical EHR environments—with significant implications for healthcare AI safety and deployment. Its benchmark spans 21 medical specialties, uses real patient records with execution-grounded verification, and reveals critical capability gaps in current LLM agents (best at 46% success). This has wider cross-disciplinary impact (AI, medicine, health informatics) and higher timeliness given the rapid deployment of clinical AI. GeoMind, while novel in applying agentic workflows to geoscience, targets a narrower domain with more limited broader impact.

vs. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces an execution-grounded, long-horizon benchmark embedded in real EHR-like environments with physician-reviewed tasks across 21 specialties, addressing a timely and widely recognized bottleneck for deploying LLM agents in healthcare. Its methodological rigor (API-level verification, structured checkpoints) and broad relevance to clinical AI, agent evaluation, safety, and human-computer interaction increase cross-field influence. Paper 1 is innovative and useful for drug discovery, but its reported extraction F1 (0.32) suggests earlier-stage capability and a narrower immediate audience than an EHR benchmark likely to become a standard evaluation reference.

vs. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

gpt-5.2 · 5/5/2026

Paper 2 (BioMiner) has higher likely scientific impact due to broader cross-field applicability (drug discovery, cheminformatics, bioNLP/vision, database curation), clear real-world utility demonstrated via large-scale mining plus measurable downstream gains (QSAR improvements, hit discovery, annotation speed/accuracy), and a sizable new benchmark (BioVista). While Paper 1 is timely and rigorous for evaluating clinical LLM agents, it is primarily a benchmark with narrower deployment constraints (EHR access, privacy/regulation). BioMiner’s outputs directly enable new data resources and model improvements across many biomedical pipelines.

vs. ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms

gemini-3 · 5/5/2026

Paper 1 addresses a critical bottleneck in a high-stakes domain (healthcare AI) by moving beyond static benchmarks to real-world, execution-grounded EHR workflows. Its scale, clinical validation, and direct applicability to developing autonomous medical agents give it broader implications and higher potential impact across both AI research and clinical practice compared to the more niche educational application presented in Paper 2.

vs. ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms

gemini-3 · 5/5/2026

Paper 2 introduces a complex, execution-grounded benchmark in healthcare, a critical and high-stakes field for AI development. By addressing the significant gap between static knowledge recall and long-horizon, real-world clinical workflows within EHR environments, it provides a foundational tool for the AI community. Benchmarks of this scale and rigor typically drive substantial field-wide progress and receive high citations, offering broader scientific and societal impact compared to the specific educational application presented in Paper 1.

vs. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

claude-opus-4.6 · 5/5/2026

PhysicianBench addresses a critical gap in evaluating LLM agents for real clinical workflows using actual EHR systems and patient records, with rigorous physician-reviewed tasks across 21 specialties. Its execution-grounded benchmark methodology, revealing a substantial performance gap (best model at 46%), provides a concrete, reproducible measuring stick for the high-stakes domain of clinical AI. While ResearchEVO is innovative in automating scientific discovery, its claims of 'publication-ready' papers and novel discoveries need extensive validation. PhysicianBench's immediate practical relevance to healthcare AI safety and its methodological rigor give it broader and more grounded impact.

vs. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

gpt-5.2 · 5/5/2026

Paper 2 is likely to have higher scientific impact due to strong real-world relevance and immediate applicability: it introduces an execution-grounded, long-horizon benchmark embedded in realistic EHR workflows with physician review, multi-specialty coverage, and scripted verification—addressing a central bottleneck for deploying clinical LLM agents safely. Methodological rigor (real APIs, checkpoints, environment execution) and timeliness (healthcare AI evaluation) suggest broad adoption by both academia and industry. Paper 1 is novel, but end-to-end “automated discovery + paper writing” claims may face reproducibility and trust hurdles that can slow uptake.

vs. Can MLLMs "Read" What is Missing?

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact due to stronger novelty and real-world relevance: it evaluates long-horizon, execution-grounded LLM agent workflows in realistic EHR environments using standard APIs and real patient records, with physician-reviewed tasks and verifiable checkpoints. This directly targets a high-stakes domain (clinical operations) with clear applications and broad implications for agent evaluation, safety, and healthcare AI. Paper 2 is a useful, timely benchmark for MLLM visual-text reconstruction, but is narrower in application scope and appears less methodologically demanding than an interactive, tool-using EHR benchmark.

vs. HypEHR: Hyperbolic Modeling of Electronic Health Records for Efficient Question Answering

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact due to broad, timely relevance: it introduces an execution-grounded, long-horizon benchmark closely reflecting real EHR workflows, enabling standardized evaluation across many agent designs and driving progress in clinical AI safety and autonomy. Its methodological rigor (real records, physician review, API-based environment, scripted verification checkpoints) supports reproducibility and meaningful comparisons. The benchmark can influence multiple fields (LLM agents, healthcare informatics, human-computer interaction, evaluation science). Paper 1 is innovative and efficient, but its impact may be narrower to EHR-QA modeling.

vs. MetaGAI: A Large-Scale and High-Quality Benchmark for Generative AI Model and Data Card Generation

gemini-3 · 5/5/2026

Paper 1 addresses a critical, high-stakes domain (healthcare) by introducing an execution-grounded benchmark for LLM agents in real-world EHR environments. Its focus on long-horizon, composite clinical workflows evaluated by medical professionals significantly advances beyond existing static medical benchmarks. While Paper 2 offers a valuable tool for AI governance and documentation, the potential real-world impact, methodological rigor involving real clinical systems, and the urgent need to address physician burnout and healthcare automation make Paper 1's contribution more transformative and likely to drive substantial future research.

vs. Compositional Meta-Learning for Mitigating Task Heterogeneity in Physics-Informed Neural Networks

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact due to strong timeliness (LLM agents in healthcare), high novelty in providing execution-grounded, long-horizon EHR benchmarking with real records/APIs, and broad applicability across AI evaluation, clinical NLP/agents, human-AI interaction, and health informatics. Its methodological rigor (physician-reviewed tasks, scripted verification checkpoints) enables standardized progress measurement and could shape both research and industry practice. Paper 2 is technically solid and useful for engineering PDE workflows, but is more niche (PINNs/meta-learning) with narrower cross-field spillover and less immediate societal impact.

vs. Dr. RTL: Autonomous Agentic RTL Optimization through Tool-Grounded Self-Improvement

gpt-5.2 · 5/5/2026

Paper 2 proposes a novel, tool-grounded, closed-loop multi-agent framework with continual self-improvement via an interpretable, reusable skill library, and demonstrates substantial PPA gains on 20 real-world RTL designs against a strong commercial baseline—suggesting immediate industrial applicability and methodological rigor. Its impact can extend across hardware design automation, agent learning, and program optimization. Paper 1 is timely and valuable as a realistic clinical benchmark, but is primarily an evaluation contribution (limited direct downstream utility without deployment pathways) and may face data/access constraints that reduce breadth of adoption.