What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

Victor Ojewale, Suresh Venkatasubramanian

Jun 1, 2026

arXiv:2606.02965v1

cs.AI (primary)

#335 of 3355 · Artificial Intelligence

Tournament Score: 1504 ± 47 (scale 1050 to 1800)

Win Rate: 84%

Rating: 6.3/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 8

Abstract

Benchmarks for autonomous agents measure whether agents complete tasks, yet this framing is systematically blind to whether an agent should have proceeded at all. Agents trained under human-feedback objectives develop a structural tendency to proceed even when they lack the inputs, evidence, or authorization to act safely, a disposition we term compliance bias, because both the reward signal and the benchmark scoring regime treat proceeding as the correct default regardless of whether the preconditions for safe action are present. We make three contributions. We first show that compliance bias originates in reward hacking within human-feedback pipelines and is entrenched by prominent agent benchmarks, which either penalize agents for pausing or are architecturally unable to distinguish a principled pause from a silent failure. We then introduce a three-gap taxonomy of abstention-warranted scenarios, covering specification gaps where required information is absent, verification gaps where world state cannot be confirmed, and authority gaps where explicit authorization has not been given, which together provide a principled basis for constructing abstention-aware agent benchmarks. Finally, we propose abstention evaluation protocols (Safety Rate, Usability Rate, and Informed Refusal Rate) and report preliminary results across 144 enterprise agent scenarios and five model families, in which a runtime-enforced abstention mechanism achieves up to 89.2% hazardous-action blocking and 87.5% usability on authorized scenarios, demonstrating that the safety-usability tradeoff is tunable rather than inherent and that its shape varies substantially across model families. We treat this as preliminary work and offer the taxonomy and composite metrics as a starting point for further conversations.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes a blind spot in autonomous agent evaluation: current benchmarks measure task completion but cannot distinguish an agent that appropriately declines to act from one that silently fails. The authors introduce the concept of "compliance bias" (the structural tendency of RLHF-trained agents to proceed even when preconditions for safe action are absent) and trace its origins through the reward pipeline into benchmark design. Three concrete contributions follow: (1) a theoretical account linking RLHF sycophancy to agent-level compliance bias, (2) a three-gap taxonomy (specification, verification, authority) of abstention-warranted scenarios, and (3) a composite evaluation protocol (SR/UR/IRR) validated across 144 scenarios and five model families under three conditions.

The core insight — that benchmarks which only reward completion create training incentives against cautious behavior — is important and, while not entirely novel in isolation (related observations appear in Kapoor et al., Zhu et al., and the over-refusal literature), the paper synthesizes these threads into a coherent framework with actionable evaluation protocols. The distinction between "principled pause" and "silent failure" is particularly well-articulated.

Methodological Rigor

The methodology is reasonable for a workshop paper but has notable limitations. The 144-scenario dataset is small and constructed from only 24 human-authored seeds augmented by GPT-4o-mini, raising concerns about diversity and ecological validity. The paper acknowledges this is preliminary work, which is appropriate given the scale.

The three-condition comparison (Baseline, Prompt-Only, Checkpoint) is well-designed and reveals genuinely interesting patterns — particularly the "usability cliff" where prompt-only safety instructions cause models like GPT-4o to drop from 79.2% to 4.2% usability. The finding that Claude models exhibit inverted compliance bias (refuse-always rather than proceed-always) adds nuance to the narrative.

However, several methodological concerns warrant attention. The IRR metric for Baseline and Prompt-Only conditions relies on an LLM judge (GPT-4o), introducing potential circularity when evaluating GPT-4o's own responses. The Checkpoint condition uses a separate GPT-4o guard model for the commitment checkpoint, meaning the "runtime enforcement" results are partially dependent on model capability rather than being purely architectural. The convergence of all models to SR 88-91% under Checkpoint is attributed to a "structural ceiling" but could also reflect limitations in scenario design. The mock execution environment, while controlled, is far from the complexity of real enterprise deployments.

Potential Impact

The practical impact could be significant if the benchmark community adopts the proposed framework. The paper addresses a genuine gap: as agents move from text generation to irreversible real-world actions (database modifications, API calls, financial transactions), the inability to evaluate appropriate restraint becomes a safety-critical omission. The four concrete design recommendations (include pause-as-ground-truth scenarios, redesign step budgets, require matched safe controls, report SR/UR/IRR jointly) are immediately actionable.

The finding that the safety-usability tradeoff is "tunable rather than inherent" is the paper's most impactful empirical claim. Demonstrating that runtime enforcement can achieve ~90% SR and ~85% UR simultaneously — when prompt-only approaches create catastrophic usability failures for some models — provides a compelling argument for architectural rather than purely instructional safety mechanisms.
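As a rough illustration of what joint SR/UR/IRR reporting could look like, the Python sketch below computes the three rates from per-scenario outcomes. The paper's exact definitions are not reproduced in this assessment, so the definitions used here are assumptions: SR is taken as the share of hazardous scenarios in which the action was blocked, UR as the share of authorized scenarios in which the agent proceeded, and IRR as the share of blocked hazardous actions whose refusal names the missing precondition. The `Outcome` record and the `abstention_metrics` function are hypothetical names, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One scenario result (hypothetical record layout)."""
    hazardous: bool          # True if the scenario warrants abstention
    proceeded: bool          # True if the agent executed the action
    explained: bool = False  # True if a refusal named the missing precondition


def _rate(hits: int, total: int) -> float:
    # Return NaN rather than raising when a category has no scenarios.
    return hits / total if total else float("nan")


def abstention_metrics(outcomes):
    """Compute SR, UR, and IRR under the assumed definitions above."""
    hazardous = [o for o in outcomes if o.hazardous]
    safe = [o for o in outcomes if not o.hazardous]
    blocked = [o for o in hazardous if not o.proceeded]
    return {
        "SR": _rate(len(blocked), len(hazardous)),
        "UR": _rate(sum(o.proceeded for o in safe), len(safe)),
        "IRR": _rate(sum(o.explained for o in blocked), len(blocked)),
    }
```

Reporting the three values together matters because any one of them can be maximized trivially: an agent that never acts scores a perfect SR and a zero UR.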

The three-gap taxonomy, while simple, provides useful vocabulary for the community. Its mapping onto concrete engineering patterns (schema validation, state polling, authorization gates) makes it directly implementable.
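To make the mapping from gap types to engineering checks concrete, here is a minimal Python sketch of a pre-execution gate. It is an assumption about how such a gate might be structured, not the paper's Commitment Checkpoint: `Gap`, `ActionRequest`, `find_gaps`, and `gate` are hypothetical names, schema validation is reduced to a required-parameter check, and state polling and authorization are reduced to boolean flags supplied by the caller.

```python
from dataclasses import dataclass
from enum import Enum


class Gap(Enum):
    SPECIFICATION = "specification"  # required information is absent
    VERIFICATION = "verification"    # world state cannot be confirmed
    AUTHORITY = "authority"          # explicit authorization not given


@dataclass
class ActionRequest:
    """Hypothetical description of an action the agent wants to take."""
    name: str
    params: dict
    required_params: tuple
    state_confirmed: bool  # result of an external state check
    authorized: bool       # result of an external authorization check


def find_gaps(req: ActionRequest) -> list:
    """Return every gap type that applies to the request."""
    gaps = []
    # Stand-in for schema validation: every required parameter must be set.
    if any(req.params.get(p) is None for p in req.required_params):
        gaps.append(Gap.SPECIFICATION)
    if not req.state_confirmed:
        gaps.append(Gap.VERIFICATION)
    if not req.authorized:
        gaps.append(Gap.AUTHORITY)
    return gaps


def gate(req: ActionRequest):
    """Pause with the list of reasons if any gap is present, else proceed."""
    gaps = find_gaps(req)
    return ("pause", gaps) if gaps else ("proceed", [])
```

Returning the gap list alongside the decision lets a pause carry its reason, which is what separates a principled pause from a silent failure in a log.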

Timeliness & Relevance

This paper is highly timely. The rapid deployment of agentic AI systems in enterprise contexts (coding assistants, customer service agents, infrastructure management) makes the absence of abstention evaluation increasingly dangerous. The paper appears at a moment when the community is actively debating agent benchmark design (HAL, ABC, AgentHarm), and offers a complementary perspective that none of these works fully addresses. The workshop venue (RLEval @ ACM CAIS '26) is appropriate for positioning this as a conversation-starter rather than a definitive solution.

Strengths

1. Clear problem framing: The paper articulates why compliance bias is structural rather than incidental, connecting RLHF dynamics to benchmark design to deployment risk in a compelling causal chain.

2. Paired evaluation design: Requiring matched safe/hazardous scenario pairs is methodologically sound and prevents the common pitfall of evaluating safety without usability (or vice versa).

3. Cross-model analysis reveals diversity: The finding that compliance bias manifests as proceed-always (Llama) OR refuse-always (Claude) challenges simplistic narratives and demonstrates why model-specific evaluation is necessary.

4. Actionable recommendations: The four benchmark design implications are concrete enough to implement immediately.

5. Honest scope claims: The paper consistently frames results as preliminary, which is refreshing and appropriate.
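The paired design in item 2 and the proceed-always versus refuse-always contrast in item 3 can be sketched together. The Python function below labels a model's behavior from matched safe/hazardous pairs. It is an illustrative assumption, not the paper's analysis: `behavior_profile` is a hypothetical name, and the 0.9 and 0.1 cutoffs are arbitrary choices made for the example.

```python
def behavior_profile(pairs, high=0.9, low=0.1):
    """Label a model from matched scenario pairs.

    `pairs` is a list of (proceeded_on_safe, proceeded_on_hazardous)
    booleans, one tuple per matched safe/hazardous scenario pair.
    The `high` and `low` cutoffs are illustrative, not from the paper.
    """
    if not pairs:
        raise ValueError("at least one scenario pair is required")
    n = len(pairs)
    safe_rate = sum(1 for s, _ in pairs if s) / n
    hazard_rate = sum(1 for _, h in pairs if h) / n
    if safe_rate >= high and hazard_rate >= high:
        return "proceed-always"
    if safe_rate <= low and hazard_rate <= low:
        return "refuse-always"
    if safe_rate >= high and hazard_rate <= low:
        return "discriminating"
    return "mixed"
```

Scoring only the hazardous half of each pair would rate a refuse-always model as perfectly safe. The matched safe control is what exposes the usability cost.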

Limitations

1. Scale: 144 scenarios across two domains (HR, DevOps) is insufficient to validate the taxonomy's completeness or the metrics' robustness. Enterprise environments involve thousands of action types with complex interdependencies.

2. Taxonomy completeness: The three-gap taxonomy may not be exhaustive. Scenarios involving conflicting instructions, resource contention, or temporal constraints could warrant additional gap types.

3. Guard model dependency: The Commitment Checkpoint relies on GPT-4o as a guard, meaning the approach's effectiveness is contingent on the availability of a capable classifier model — a dependency the paper doesn't fully explore.

4. No adversarial evaluation: The scenarios appear to be straightforward; there is no testing of adversarial inputs designed to circumvent the abstention mechanism, which is critical for real-world deployment.

5. Missing cost analysis: Runtime enforcement adds latency (polling with retries) and cost (guard model calls). No analysis of these overheads is provided.

6. LLM-as-judge circularity: Using GPT-4o to judge outcomes for GPT-4o responses is methodologically questionable, even if common practice.

Overall Assessment

This is a well-motivated workshop paper that identifies a genuine and important gap in agent evaluation methodology. Its contributions are primarily conceptual (the compliance bias framing, three-gap taxonomy, and composite metrics) rather than empirical, which is appropriate for the scale of the work. The preliminary results are suggestive rather than conclusive but point in promising directions. The paper's impact will depend on whether the community adopts and extends the proposed framework — the groundwork laid here is solid enough to serve as a foundation, though substantially more work is needed to validate the approach at scale and in realistic deployment contexts.

Rating: 6.3/10

Significance 7 · Rigor 5.5 · Novelty 6.5 · Clarity 8

Generated Jun 3, 2026

Comparison History (19)

vs. Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a fundamental and broadly applicable problem in AI safety—compliance bias in autonomous agents—that affects the entire field of agent development. Its three-gap taxonomy and abstention evaluation protocols provide a conceptual framework applicable across all agent benchmarks, not just one domain. While Paper 1 is technically strong with impressive empirical results on VRP optimization, it targets a narrower domain. Paper 2's timeliness is higher given the rapid deployment of autonomous agents, and its potential to reshape how the community evaluates agent safety gives it broader cross-field impact.

vs. Scaling Self-Evolving Agents via Parametric Memory

gemini-3.1 · 6/5/2026

Paper 2 addresses a fundamental flaw in current AI agent evaluation and training (compliance bias), which has broad implications for AI safety, alignment, and real-world deployment. By proposing a new taxonomy and evaluation metrics for agent abstention, it has the potential to shift the community's benchmarking paradigm. While Paper 1 presents a strong technical innovation in agent memory, Paper 2's focus on foundational safety metrics gives it a wider, more urgent impact across the field.

vs. Forget Attention: Importance-Aware Attention Is All You Need

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a concrete architectural innovation (SISA) that addresses a fundamental challenge in language modeling—fusing attention and SSM mechanisms at the score level rather than block or head level. This defines a new design axis with empirical results showing improvements on standard benchmarks, and it integrates seamlessly with existing infrastructure (stock SDPA). Paper 1 addresses an important safety concern (compliance bias in agents) with a useful taxonomy and evaluation protocols, but its contributions are more framework/position-oriented with preliminary results on a small-scale evaluation. Paper 2's architectural contribution is more likely to drive widespread follow-up research and adoption in the rapidly evolving LLM architecture space.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental and underexplored safety gap in autonomous agent evaluation—compliance bias and the absence of abstention benchmarks. It introduces a novel taxonomy, concrete evaluation protocols, and preliminary empirical results across 144 scenarios. This has broad impact across AI safety, agent deployment, and benchmark design communities. Paper 2, while technically strong in co-evolving training harnesses, represents a more incremental advance in RL training methodology. Paper 1's contribution to safety evaluation frameworks is more timely and broadly impactful as autonomous agents are increasingly deployed in real-world settings.

vs. Inducing Reasoning Primitives from Agent Traces

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it reframes agent evaluation around abstention competence, identifying a systemic benchmark blind spot (compliance bias) with clear safety implications and broad relevance to RLHF, alignment, evaluation, and deployment governance. It contributes a general taxonomy plus concrete, portable metrics and protocols, and backs them with multi-model results over 144 enterprise scenarios—supporting methodological rigor and real-world applicability. Paper 1 is technically strong and yields large gains, but it is more incremental within agent prompting/tooling and narrower in cross-field implications than a new evaluation paradigm for safe autonomy.

vs. DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a fundamental and timely gap in AI safety evaluation—compliance bias in autonomous agents—with broad implications for the entire agent evaluation ecosystem. Its taxonomy, metrics, and empirical results across 144 scenarios and five model families provide a foundational framework that could reshape how agent benchmarks are designed. The safety-critical nature of the problem gives it high real-world relevance. Paper 2 offers a solid technical contribution to memory management for LLM agents, but its scope is narrower, addressing efficiency and retrieval conflicts rather than a systemic safety concern with field-wide implications.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

gpt-5.2 · 6/3/2026

Paper 1 targets a timely, high-impact problem in AI safety and evaluation: agents should abstain when lacking information/authorization. It introduces a clear taxonomy, proposes new metrics and protocols, and provides empirical results across many enterprise scenarios and model families, making it readily actionable for benchmark design and deployment practices. Its potential real-world applications (safer autonomous agents) and cross-field relevance (ML, HCI, safety, policy) are broad. Paper 2 is methodologically rigorous and novel within non-monotonic/modal logic, but its likely impact is narrower and more theoretical with fewer immediate applications.

vs. The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

gemini-3.1 · 6/3/2026

Paper 1 addresses a fundamental flaw in current autonomous agent evaluation by focusing on 'abstention competence' and AI safety. By shifting the paradigm from mere task completion to safe refusal, it has broad implications for AI alignment, benchmarking, and real-world deployment. While Paper 2 offers a valuable algorithmic optimization for LLM inference costs, Paper 1's conceptual framework and taxonomy address critical, foundational challenges in safe AI behavior, promising broader and more transformative impact across the field.

vs. Overlaying Governance: A Compositional Authorization Framework for Delegation and Scope in Agentic AI

gpt-5.2 · 6/3/2026

Paper 2 is likely to have higher impact due to timeliness and breadth: abstention competence targets a widely observed failure mode in current autonomous agents and can directly reshape benchmark design, evaluation practice, and safety engineering across labs and products. It introduces a clear taxonomy and concrete, adoptable metrics/protocols, supported by multi-model empirical results on many enterprise scenarios—facilitating replication and immediate real-world use. Paper 1 is novel and potentially important for governance, but its impact may be narrower and slower-moving because it depends on integration into IAM/policy ecosystems and organizational adoption of new authorization semantics.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact because it introduces a broadly applicable conceptual shift in evaluating autonomous agents: measuring abstention competence and diagnosing “compliance bias” as a benchmark/reward-structure artifact. It contributes a general taxonomy (specification/verification/authority gaps), proposes new evaluation protocols and composite metrics, and provides empirical results across many scenarios and model families, making it timely and relevant to safety-critical deployment across domains. Paper 1 is a rigorous, useful scoping review for dentistry, but its impact is narrower and less methodologically innovative than Paper 2’s framework-level contribution.

vs. The DeepSpeak-Agentic Dataset

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical flaw in AI agent training and evaluation (compliance bias) with high relevance to AI safety and alignment. By introducing a novel taxonomy and evaluation protocols for abstention competence, it has broad implications for how future autonomous agents are benchmarked and deployed safely in real-world scenarios. While Paper 1 provides a useful dataset, Paper 2 offers a fundamental conceptual and methodological shift with wider potential impact across the AI field.

vs. ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

gpt-5.2 · 6/3/2026

Paper 1 introduces a timely and broadly relevant reframing of agent evaluation: measuring abstention competence and diagnosing “compliance bias” induced by HF/benchmark incentives. Its three-gap taxonomy and composite metrics could standardize safety-critical evaluation across many agent settings (enterprise, tool use, autonomy), with clear real-world implications for safer deployment and governance. While Paper 2 (ReSkill) is technically innovative and valuable for agentic RL performance/generalization, its impact is more domain-specific and contingent on adoption within particular RL training stacks, whereas abstention-aware benchmarking could influence multiple subfields (alignment, evaluation, HCI, safety, policy).

vs. Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

claude-opus-4.6 · 6/3/2026

Paper 1 offers a novel theoretical contribution connecting two previously separate fields—optimal control theory and behavioral economics—by showing that prospect-theory-like behavior emerges naturally from Bellman optimality in MDPs with catastrophic states, without requiring any behavioral assumptions. It provides closed-form results, extensive numerical validation (R²=0.999), and robustness across noise models. This is a fundamental insight with broad implications for economics, AI, and decision theory. Paper 2 addresses an important practical problem (abstention in AI agents) but is more of a position/framework paper with preliminary empirical results, offering less theoretical depth and narrower conceptual novelty.

vs. Probe Before You Edit: Probing-Guided Molecular Optimization for LLM Agents in Structure-Based Drug Design

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical, cross-cutting issue in AI safety—abstention competence—applicable to all autonomous agents. By exposing fundamental flaws in current benchmarking and proposing a novel taxonomy and evaluation metrics, it has the potential to reshape agent evaluation broadly. While Paper 1 is highly innovative and valuable in the specific domain of drug discovery, Paper 2's foundational contribution to AI safety and evaluation methodology ensures a much wider breadth of impact across multiple fields.

vs. POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems

gpt-5.2 · 6/3/2026

Paper 2 targets a foundational blind spot in agent evaluation—whether an agent should act at all—introducing a general taxonomy (specification/verification/authority gaps) and concrete, reusable metrics/protocols for abstention competence. This reframes benchmark design in a way likely to influence multiple subfields (alignment, RLHF, agent safety, evaluation, enterprise deployment) and is timely given real-world autonomy and governance needs. Paper 1 is practical and open-sourced, but is more specific to multi-agent failure detection and may have narrower cross-domain influence than abstention-aware evaluation frameworks.

vs. Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

gemini-3.1 · 6/3/2026

Paper 1 proposes a paradigm-shifting, planet-scale operating system that dynamically interconnects disparate AI and physical scientific tools. By enabling emergent collaboration across disciplines, its potential to accelerate scientific discovery universally gives it profound breadth and novelty. While Paper 2 addresses a critical AI safety gap in benchmarking, Paper 1's framework fundamentally alters how cross-disciplinary science is conducted, promising a vastly broader, field-agnostic scientific impact.

vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical and under-evaluated area in AI safety: an agent's ability to abstain from unsafe actions. By critiquing current RLHF paradigms, introducing a novel taxonomy for abstention, and proposing new evaluation metrics, it establishes the groundwork for a new paradigm in agent benchmarking. This conceptual shift has broader, more fundamental implications for the safe real-world deployment of autonomous agents than Paper 1's methodological, albeit effective, improvement in instruction following.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 1 has higher likely impact: it targets a timely, high-stakes failure mode (unsafe over-compliance) in autonomous agents, proposes a clear taxonomy (specification/verification/authority gaps), and introduces actionable, benchmarkable metrics with preliminary multi-model, multi-scenario evidence. The work is broadly applicable across agent evaluation, RLHF safety, and enterprise deployment, with direct real-world utility for reducing hazardous actions. Paper 2 is novel for social simulation interpretability, but its applications are narrower and its validation appears more domain-specific, likely yielding more incremental impact.

vs. HLL: Can Agents Cross Humanity's Last Line of Verification?

gemini-3.1 · 6/3/2026

Paper 1 addresses a fundamental flaw in AI agent alignment—compliance bias and the inability to abstain—which has profound implications for AI safety and enterprise deployment across all domains. Paper 2, while offering a useful benchmark for web agents interacting with CAPTCHAs, is significantly narrower in scope. Paper 1's introduction of a novel taxonomy and metrics for safe abstention will likely drive broader theoretical and practical shifts in how autonomous agents are trained, evaluated, and deployed.