PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang

Jun 3, 2026 · arXiv:2606.05463v1

cs.AI

#1877 of 3572 · Artificial Intelligence

Tournament Score: 1393±42

Win Rate: 50%

Rating: 7.8/10

Significance 8 · Rigor 8 · Novelty 8 · Clarity 7.5

Abstract

Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PSEBench

1. Core Contribution

PSEBench introduces a policy-grounded benchmark construction methodology and a concrete 5,074-case benchmark for evaluating LLMs on patient safety event (PSE) triage—specifically, determining whether clinical events are reportable under Minnesota's 29 Reportable Adverse Health Events (MN29) framework. The central innovation is the clause card, a structured decision specification that decomposes regulatory text into auditable components: boundary conditions, event elements, legal basis, and intended verdicts. This factorization enables a controlled, closed-loop generation pipeline where synthetic narratives inherit ground truth by construction, eliminating the need for expensive post-hoc expert annotation.
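To make the factorization concrete, a clause card can be pictured as a small record whose boundary conditions determine the verdict. The sketch below is illustrative only: the class name `ClauseCard`, its field names, the `decide` helper, the verdict strings, and the example clause are all assumptions made for this example, not the paper's schema and not actual MN29 wording.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ClauseCard:
    """Hypothetical clause card; field names are assumptions, not the paper's schema."""
    clause_id: str
    legal_basis: str
    event_elements: List[str]
    boundary_conditions: List[str]
    intended_verdict: str


def decide(card: ClauseCard, facts: Dict[str, Optional[bool]]) -> str:
    """Map boundary-condition states to a verdict.

    True = satisfied, False = violated, None or absent = not established.
    """
    states = [facts.get(name) for name in card.boundary_conditions]
    if any(state is False for state in states):
        return "not_reportable"
    if any(state is None for state in states):
        return "needs_information"
    return card.intended_verdict


# Placeholder clause for illustration only; this is not actual MN29 wording.
example_card = ClauseCard(
    clause_id="EXAMPLE-1",
    legal_basis="(placeholder citation)",
    event_elements=["device", "patient outcome"],
    boundary_conditions=["occurred_in_facility", "serious_injury"],
    intended_verdict="reportable",
)

print(decide(example_card, {"occurred_in_facility": True, "serious_injury": True}))
print(decide(example_card, {"occurred_in_facility": True}))
print(decide(example_card, {"occurred_in_facility": False, "serious_injury": True}))
```

In this toy version the label is a pure function of the card and the instantiated facts, which is one way to read the claim that narratives inherit ground truth by construction.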

The benchmark addresses three distinct triage regimes: complete cases (all facts present), missing-information cases (requiring proactive information seeking), and uncertain/gray-zone cases (requiring principled abstention). An agentic two-role evaluation environment allows multi-turn interaction between the evaluated LLM and an information provider, going well beyond static classification benchmarks.
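The two-role interaction can be sketched as a short loop between a policy (the evaluated model) and a stateless information provider. This is a minimal sketch under stated assumptions: the function names, the action dictionary format, the slot name `injury_severity`, the turn limit, and the fallback to `abstain` when turns run out are all hypothetical and are not taken from the paper's environment.

```python
from typing import Callable, Dict, List, Tuple

Transcript = List[Tuple[str, str]]  # (slot asked, answer received)


def information_provider(hidden_facts: Dict[str, str], slot: str) -> str:
    """Stateless by design: the answer depends only on the hidden facts and the slot."""
    return hidden_facts.get(slot, "not documented")


def run_episode(
    policy: Callable[[str, Transcript], Dict[str, str]],
    narrative: str,
    hidden_facts: Dict[str, str],
    max_turns: int = 5,  # assumed turn budget
) -> Dict[str, object]:
    """Alternate between the policy and the provider until a verdict is given."""
    transcript: Transcript = []
    for _ in range(max_turns):
        action = policy(narrative, transcript)
        if action["type"] == "ask":
            slot = action["slot"]
            transcript.append((slot, information_provider(hidden_facts, slot)))
        else:
            return {"verdict": action["verdict"], "transcript": transcript}
    # Assumption: running out of turns is treated as abstention.
    return {"verdict": "abstain", "transcript": transcript}


def toy_policy(narrative: str, transcript: Transcript) -> Dict[str, str]:
    """Toy stand-in for an LLM: ask for one missing slot, then decide."""
    known = dict(transcript)
    if "injury_severity" not in known:
        return {"type": "ask", "slot": "injury_severity"}
    verdict = "reportable" if known["injury_severity"] == "serious" else "not_reportable"
    return {"type": "answer", "verdict": verdict}


result = run_episode(
    toy_policy, "Patient fell during transfer.", {"injury_severity": "serious"}
)
print(result)
```

Keeping the provider stateless means each answer can be checked against the hidden facts in isolation, at the cost of abstracting away how a real reviewer would locate that information.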

2. Methodological Rigor

The methodology is exceptionally well-designed for a benchmark construction paper. Several aspects stand out:

Closed-loop verification. The two-phase pipeline (factual instantiation → narrative verbalization) includes dedicated verifiers at each stage. The instantiation verifier checks boundary condition satisfaction and constraint compliance; the narrative verifier ensures no fact distortion, omission, or verdict leakage. Missing-variant verification adds a third axis: ensuring masked boundary conditions remain genuinely indeterminate. The retry mechanism (up to 3 attempts with verifier feedback) provides a principled quality control loop, with generation success rates of 89.5–94.7%.
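The retry mechanism described above can be sketched as a generate-then-verify loop in which verifier feedback is passed to the next attempt. The function names, the `(passed, feedback)` return convention, and the toy generator and verifier are assumptions for illustration; only the limit of three attempts comes from the review.

```python
from typing import Callable, Optional, Tuple


def generate_with_retries(
    generate: Callable[[Optional[str]], str],
    verify: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Return (accepted candidate or None, number of attempts used)."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)        # feedback is None on the first try
        passed, feedback = verify(candidate)  # verifier explains any failure
        if passed:
            return candidate, attempt
    return None, max_attempts                 # the case is discarded


# Toy stand-ins for the LLM generator and verifier (hypothetical).
def toy_generate(feedback: Optional[str]) -> str:
    if feedback is None:
        return "Narrative text. Verdict: reportable."
    return "Narrative text."


def toy_verify(text: str) -> Tuple[bool, str]:
    if "Verdict" in text:
        return False, "remove verdict leakage"
    return True, ""


accepted, attempts_used = generate_with_retries(toy_generate, toy_verify)
print(accepted, attempts_used)
```

Candidates that never pass are dropped rather than repaired by hand, which is consistent with a generation success rate below 100%.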

Expert validation. Two patient safety experts reviewed 90 stratified cases, finding high clinical realism (mean 4.50–4.63/5) and strong agreement with by-construction ground truth (28/30 complete, 29/30 missing-information, 30/30 uncertain; 87/90 overall). This validation, while limited in sample size, provides meaningful external calibration.

Multi-dimensional evaluation. The 8-metric evaluation suite (M1–M8) is well-motivated, capturing verdict accuracy, clause identification, evidence citation F1, boundary condition hit rate, missing case detection, missing slot identification, uncertainty detection, and reportable detection. The use of both hard-coded metrics and LLM-as-judge scorers is appropriate given the mix of structured and free-text outputs.
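As an illustration of one hard-coded metric, evidence citation F1 can be computed as a set overlap between cited and gold evidence identifiers. The review does not give the paper's exact definition, so this is a sketch under assumptions: evidence is represented by string IDs, matching is exact, and two empty sets score 1.0.

```python
from typing import Iterable, Tuple


def citation_prf(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 over evidence identifiers."""
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set and not gold_set:
        return 1.0, 1.0, 1.0  # assumption: nothing to cite and nothing cited
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


# Two of three cited sentences are correct, and two of three gold sentences are found.
precision, recall, f1 = citation_prf(["s1", "s2", "s3"], ["s2", "s3", "s4"])
print(round(precision, 3), round(recall, 3), round(f1, 3))
```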

Potential concerns: The entire pipeline relies heavily on GPT-5.2 for generation, verification, information provision, and judging. While holding these fixed ensures consistency, it introduces systematic biases. The anchor materials from Japan (JQ database), translated by LLM, may introduce cultural and stylistic artifacts not representative of U.S. clinical reporting. The expert validation sample (n=90) is small relative to the 5,074-case benchmark.

3. Potential Impact

Direct applications. The benchmark fills a genuine operational need: hospitals manually triage safety events daily, and LLM-assisted triage could significantly reduce cognitive burden. PSEBench provides the first systematic way to evaluate whether LLMs are ready for this task.

Methodological contribution. The clause card abstraction and closed-loop verification pipeline are generalizable beyond MN29 to other regulatory frameworks (OSHA, FDA adverse event reporting, financial compliance). This is arguably the paper's most transferable contribution—a principled methodology for converting opaque regulatory text into verifiable benchmark specifications.

Findings with broad implications. The evaluation reveals critical insights: (a) verdict accuracy is an unreliable proxy for triage readiness, as models achieving >90% accuracy still fail on evidence grounding and uncertainty detection; (b) medical-specialty models (HuatuoGPT, MedGemma) perform worse than general-purpose models on policy-grounded tasks, suggesting clinical fine-tuning may actually harm regulatory reasoning; (c) proactive information-seeking and principled abstention degrade far faster than classification accuracy as models scale down, revealing a hierarchy of capability acquisition.

4. Timeliness & Relevance

This work is highly timely. The deployment of LLMs in clinical workflows is accelerating, yet evaluation frameworks lag behind, particularly for tasks requiring regulatory reasoning rather than clinical knowledge. The paper correctly identifies that existing PSE-related NLP work focuses on descriptive classification (event type, severity) rather than the actual triage workflow. The emphasis on information seeking and uncertainty-aware abstention aligns with emerging best practices in interactive clinical AI evaluation (AgentClinic, MediQ, MedAbstain).

5. Strengths & Limitations

Key Strengths:

The clause card formalism is elegant and well-specified, providing both a control surface (for generation) and a verification surface (for evaluation) from the same representation.

The missing-information and uncertain variants are generated through principled extensions of the same framework rather than ad-hoc modifications.

The paper's evaluation is comprehensive: 15 models across 4 families, 8 metrics, stratified analyses by case type, per-clause difficulty analysis, and behavioral profiling.

The end-to-end generation example (Exhibits A.1–A.7) demonstrates the pipeline's rigor, showing how the verifier catches and corrects a specific constraint violation.

Notable Limitations:

Single jurisdiction. MN29 is one state's framework; generalizability to other regulatory systems is claimed but not demonstrated.

Synthetic narratives. Despite anchor materials, the narratives lack the "extreme structural disorder, severe grammatical noise, and idiosyncratic abbreviations" of real incident reports, as the authors acknowledge.

Circular LLM dependency. Using GPT-5.2 for generation, verification, information provision, and M4 judging creates a closed ecosystem that may systematically favor or disadvantage certain model families.

Scale of expert validation. Only 90 of 5,074 cases (1.8%) were reviewed by experts, and the two reviewers gave joint rather than independent assessments, so no inter-rater reliability is reported.

Information Provider idealization. The stateless oracle abstracts away the messy reality of EHR navigation, which is arguably the hardest part of real-world information seeking.

Overall Assessment

PSEBench represents a substantial contribution to the intersection of NLP, clinical safety, and regulatory compliance. The clause card methodology is the paper's most lasting contribution—a principled, generalizable approach to converting regulatory text into verifiable benchmark specifications. The evaluation findings are actionable and reveal important gaps in current LLM capabilities for high-stakes regulatory reasoning. The paper's main limitation is its reliance on synthetic generation within a single regulatory framework, but the methodology is clearly designed for broader instantiation.

Rating: 7.8/10

Significance 8 · Rigor 8 · Novelty 8 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (18)

Won vs. Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

Paper 2 likely has higher impact due to its broadly useful, verifiable evaluation framework for high-stakes clinical policy reasoning. It introduces a generalizable benchmark construction methodology (clause cards + closed-loop verification) with by-construction ground truth, supports agentic behaviors (information seeking, abstention), and provides a sizable benchmark spanning real regulatory requirements—making it timely for LLM safety/healthcare deployment and useful across NLP evaluation, clinical informatics, and AI governance. Paper 1 is innovative and practical for embedding reliability/serving, but is narrower in scope and closer to an applied optimization study.

gpt-5.2 · Jun 9, 2026

Won vs. Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Paper 1 addresses a critical, high-stakes real-world application (healthcare triage) and introduces a novel, rigorous methodology (clause cards) for generating verifiable, policy-grounded benchmarks. Its scalable approach to handling missing information and ambiguity provides a highly impactful framework for deploying LLMs in regulated domains, offering broader practical utility than the agent evaluation insights in Paper 2.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

IMUG-Bench addresses a broader and more active research area (unified multimodal models) with wider applicability across AI/ML. It evaluates both understanding and generation in multi-turn settings, identifies exposure bias as a key failure mode, and explores mitigation strategies—contributions relevant to the rapidly growing UMM community. While PSEBench is rigorous and valuable for healthcare AI, its scope is narrower (Minnesota-specific patient safety regulations), limiting its broader scientific influence. IMUG-Bench's findings on exposure bias and test-time scaling strategies have wider methodological implications across multimodal AI research.

claude-opus-4-6 · Jun 9, 2026

Won vs. Act As a Real Researcher: A Suite of Benchmarks Evaluating Frontier LLMs and Agentic Harnesses in Research Lifecycle

Paper 2 has higher likely impact due to its direct, high-stakes clinical application, strong methodological contributions (clause cards, anchor-driven instantiation, closed-loop verification) enabling controllable, auditable, and by-construction ground truth, and a sizable benchmark with an agentic environment. It targets a timely need—reliable evaluation of LLMs for safety-critical healthcare workflows—and its ideas (structured policy factorization, verifiable synthetic data) generalize to other regulated domains. Paper 1 is valuable but more meta/benchmark-focused with less immediate real-world deployment leverage.

gpt-5.2 · Jun 8, 2026

Lost vs. Towards Healthy Evolution: Exploring the Role and Mechanisms of Human-Agent Interaction in Self-Evolving Systems

Paper 1 addresses a foundational challenge in AI safety and alignment: mitigating safety drift in self-evolving agents. Its framework offers broad theoretical and practical implications across all domains of autonomous AI development, a highly timely and critical research area. While Paper 2 provides a rigorous, high-stakes benchmark for healthcare, its impact is more domain-specific compared to the generalized, cross-field relevance of AI alignment explored in Paper 1.

gemini-3.1-pro-preview · Jun 6, 2026

Lost vs. MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation

Paper 2 demonstrates massive real-world impact and scalability, having already been deployed in an industrial setting (Baidu Maps) across over 360 cities to achieve 95% automation in lane-level map generation. While Paper 1 introduces a valuable benchmark for clinical LLM evaluation, Paper 2's proven integration into critical autonomous driving infrastructure and its solution to a major scalability bottleneck give it a much higher and more immediate impact.

gemini-3.1-pro-preview · Jun 6, 2026

Lost vs. A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice

Paper 1 addresses a fundamental gap in the growing field of human-AI collaboration by providing the first formal framework for measuring appropriate reliance on set-valued AI advice. This has broad applicability across any domain where AI communicates uncertainty (classification and regression), making it foundational for future research. Paper 2, while rigorous and addressing an important clinical need, is more domain-specific (patient safety triage) and benchmark-focused. Paper 1's theoretical contribution to measuring human-AI interaction with uncertainty communication has wider cross-field impact and longer-term influence on how researchers design and evaluate AI advisory systems.

claude-opus-4-6 · Jun 5, 2026

Won vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

Paper 1 is likely higher impact due to stronger methodological innovation (clause cards, anchor-driven instantiation, closed-loop verification) yielding auditable, by-construction ground truth and an agentic evaluation setting, addressing key evaluation gaps (evidence-grounded reasoning, information seeking, abstention). It targets a high-stakes clinical workflow with clear real-world applicability and regulatory relevance. While Paper 2 is timely and useful, it relies on scraped conversations and LLM-as-judge evaluation with more potential confounds and narrower generalizability; its primary contribution is a labeled dataset rather than a broadly reusable benchmark construction methodology.

gpt-5.2 · Jun 5, 2026

Lost vs. Universal Quantum Transformer

Paper 2 proposes a fundamental paradigm shift by introducing a quantum-native transformer architecture. It tackles core limitations of classical neural networks in exact mathematical reasoning and theoretical scaling (quadratic attention bottlenecks). Its successful deployment on real quantum hardware and implications for both quantum computing and artificial general intelligence give it substantially broader and more disruptive scientific impact compared to Paper 1, which, while rigorous and practically useful, is a domain-specific evaluation benchmark for existing LLMs.

gemini-3.1-pro-preview · Jun 5, 2026

Won vs. SubtleMemory: A Benchmark for Fine-Grained Relational Memory Discrimination in Long-Horizon AI Agents

Paper 2 has higher impact potential due to a clearer, high-stakes real-world application (clinical patient-safety triage) and a methodology (clause cards + verification) that yields auditable, by-construction ground truth—supporting rigorous, reproducible evaluation and deployment-relevant behaviors (info seeking, abstention). Its agentic environment and larger scale further strengthen rigor and usefulness. Paper 1 is timely and novel for long-horizon memory relations, but its benchmark is less directly tied to an immediately regulated, high-impact domain and may be narrower in near-term practical adoption.

gpt-5.2 · Jun 5, 2026
