Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

Tanmay Asthana, Aman Saksena, Divyansh Sahu

May 17, 2026

arXiv:2605.17554v1

cs.AI (primary), cs.LG

#1400 of 2292 · Artificial Intelligence

Tournament Score: 1387 ± 42 (scale 1050 to 1800)

Win Rate: 46%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Abstract

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper introduces a benchmark for evaluating frontier deep research agents (DRAs) — specifically Claude Opus 4.6, OpenAI o3-deep-research, and Gemini 3.1 Pro deep-research — on structured analytical deliverables typical of management consulting work. The benchmark comprises 42 SME-authored prompts yielding 126 responses, scored via a dual-layer system: deterministic binary verifiers (mean 13.8 per task) and a five-criterion 0–3 ordinal rubric graded by human subject-matter experts. These are composed into a Verifier-Rubric Score (VRS) on 0–100. The key innovation is the combination of (a) deterministic SME-graded verifiers, (b) multi-criterion ordinal rubrics, (c) cognitive traps embedded in source materials, and (d) a prompt taxonomy organized by cognitive capability rather than topic domain.

The paper addresses a genuine gap: DRAs are being deployed for decision-grade enterprise work, yet existing benchmarks evaluate factual recall, single-hop QA, or generic tool use — not the production of structured multi-document deliverables. The framing is well-motivated with a concrete example (a €4.5B CapEx decision depending on a single revenue calculation).

2. Methodological Rigor

Strengths in design: The dual-layer scoring system is well-conceived. The strict VRS variant (zeroing out when any criterion scores 0) and the conjunctive ACCEPT rule (rubric mean ≥2.5 AND verifier rate ≥80%) are defensible for production-readiness assessment. The sensitivity analysis (Section 4.11) showing robustness to VRS weight choice is thorough. The rubric validation (Section 4.10) with Spearman correlations and sole-cause failure analysis is methodologically sophisticated.
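The conjunctive ACCEPT rule and the strict zero-out described above can be sketched in a few lines of Python. The thresholds (rubric mean ≥ 2.5, verifier rate ≥ 80%) and the 0–3 rubric scale come from the paper. The 50/50 weighting inside `vrs` is an assumption made purely for illustration, since this assessment does not state the weights the paper uses, and all function names are hypothetical.

```python
from statistics import mean

RUBRIC_MAX = 3  # each rubric criterion is scored on a 0-3 scale


def verifier_rate(checks: list[bool]) -> float:
    """Fraction of deterministic verifiers that passed."""
    return sum(checks) / len(checks)


def vrs(rubric: list[int], checks: list[bool],
        w_verifier: float = 0.5, strict: bool = False) -> float:
    """Verifier-Rubric Score on 0-100.

    w_verifier=0.5 is an ASSUMED weight, not the paper's documented value.
    With strict=True the score is zeroed when any rubric criterion is 0.
    """
    if strict and min(rubric) == 0:
        return 0.0
    rubric_part = mean(rubric) / RUBRIC_MAX
    return 100 * (w_verifier * verifier_rate(checks)
                  + (1 - w_verifier) * rubric_part)


def accept(rubric: list[int], checks: list[bool]) -> bool:
    """Conjunctive rule: BOTH layers must clear their threshold."""
    return mean(rubric) >= 2.5 and verifier_rate(checks) >= 0.80


# A strong rubric score does not rescue a weak verifier rate.
print(accept([3, 3, 3, 2, 2], [True] * 12 + [False] * 2))  # both layers pass
print(accept([3, 3, 3, 2, 2], [True] * 10 + [False] * 4))  # verifier rate too low
```

Because the rule is conjunctive, a response can score well on the averaged VRS and still be rejected, which is one reason acceptance rates sit far below mean scores.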

Statistical limitations are significant and candidly acknowledged. With only 42 prompts and per-class samples of n=6–11, the paired McNemar tests fail to reach significance (Table 17). The Cohen's d effect sizes describe magnitude but carry wide confidence intervals at these sample sizes. The paper caveats these limitations appropriately, but they still fundamentally constrain the inferential claims.
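To illustrate why a paired test struggles at this sample size, here is a minimal sketch of an exact (binomial) McNemar test. The discordant counts below are hypothetical: they were chosen only to be consistent with 9 and 4 acceptances out of 42 prompts (21.4% and 9.5%), and the paper's actual paired counts are not reproduced in this assessment. The function name is also an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: prompts accepted for agent A only
    c: prompts accepted for agent B only
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# HYPOTHETICAL split: 2 prompts accepted for both agents,
# 7 for agent A only, 2 for agent B only.
p_value = mcnemar_exact(7, 2)
print(f"p = {p_value:.3f}")
```

Only the discordant pairs carry information in this test, so with a handful of accepted responses per agent even a twofold gap in acceptance rate can fail to reach conventional significance.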

Inter-rater reliability is absent. Each cell was graded by one SME with a QC pass by a second — an asymmetric verification, not parallel annotation. No Cohen's κ is reported. For a benchmark paper advocating human SME grading over LLM-as-judge, the absence of IRR statistics is a notable gap, though the authors plan to address this in v2.
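For reference, the missing statistic is cheap to compute once two SMEs grade the same cells independently. The sketch below implements unweighted Cohen's κ for 0–3 rubric scores; the score vectors are invented for illustration and are not data from the paper, and the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must score the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the identical score.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal score distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# INVENTED example scores on the 0-3 rubric scale.
sme_1 = [3, 2, 2, 1, 0, 3, 2, 1]
sme_2 = [3, 2, 1, 1, 0, 3, 2, 2]
print(round(cohens_kappa(sme_1, sme_2), 3))
```

Unweighted κ treats a 2-versus-3 disagreement the same as a 0-versus-3 disagreement; for ordinal rubrics a weighted variant is often preferred, so this sketch is a starting point rather than a recommendation.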

The cognitive trap design is novel and well-motivated (inconsistent units, footnote-body contradictions, placeholder values requiring live verification), but the paper provides limited quantitative analysis of trap-specific failure rates. Reporting how often each trap type was triggered, and by which agent, would strengthen the contribution.

Single-domain limitation: All 42 prompts are management consulting. The paper acknowledges this and plans investment banking tasks for v2, but generalizability claims are limited.

3. Potential Impact

The benchmark fills a practical gap in enterprise AI evaluation. As organizations adopt DRAs for knowledge work, the finding that acceptance rates are uniformly low (9.5–21.4%) under production-quality thresholds is immediately decision-relevant. The agent-distinct failure signatures are particularly valuable:

Claude: Reliable deliverable production (90% file output) but highest fabrication signature — a dangerous combination for enterprise deployment where polished formatting masks invented content.

o3: Cleanest reasoning average but drops sections and propagates arithmetic errors — a pattern invisible to rubric-only evaluation.

Gemini: Bimodal quality with no graceful degradation — either excellent or catastrophically failed.

These profiles are actionable for both model developers and enterprise buyers. The concrete failure examples (wrong python-docx API calls, hallucinated citations from real but irrelevant URLs) add reproducible evidence.

The open release of the prompt corpus, verifier specifications, and evaluation infrastructure enhances reproducibility and enables extension by others.

4. Timeliness & Relevance

Highly timely. DRAs from all three major providers are being marketed for exactly the kind of work this benchmark evaluates. The paper correctly identifies that evaluation methodology has lagged deployment speed. The management consulting framing targets high-stakes decision work where errors have quantifiable financial consequences, making the benchmark more practically grounded than academic QA benchmarks.

The comparison with APEX-v1 (64.2), ProfBench (65.9), and ResearchRubrics (<68%) on mean scores, and with APEX-Agents' Pass@1 band on acceptance rates, provides useful calibration against the emerging landscape of professional-task benchmarks.

5. Strengths & Limitations

Key Strengths:

Novel combination of deterministic verifiers + ordinal rubric + cognitive traps on the same response

Five-class capability taxonomy (CRP, RCP, SCP, LDP, FSP) isolating distinct reasoning failures

Transparent reporting of multiple metrics without collapsing to a single leaderboard number

Agent-distinct failure mode analysis providing differentiated rather than ranked assessment

Public release of all materials including the 42-prompt corpus and evaluation infrastructure

Worked examples in appendices showing substantial depth in task design

Notable Weaknesses:

Small sample size (n=42) fundamentally limits statistical power; many conclusions are point estimates without significance support

No inter-rater reliability statistics despite advocating human SME grading as superior to LLM-as-judge

Single domain (management consulting) limits generalizability claims

Rubric dimensionality concerns: mean off-diagonal correlation of 0.61 suggests only 2–3 latent factors, not 5 independent criteria — the paper acknowledges this but doesn't resolve it

The cognitive trap analysis lacks granularity: trap-by-trap detection rates would strengthen the contribution

Black-box evaluation constraints: corpus discipline is measured rather than enforced, limiting control over the evaluation condition

Additional observations: The paper's comparison framework against APEX, ProfBench, and other benchmarks (Table 1) is useful but somewhat self-serving — the comparison emphasizes features this benchmark has that others lack without equally scrutinizing what others have (larger sample sizes, established IRR, cross-domain coverage). The v2 promises are appropriate but the current contribution must be evaluated on v1 evidence alone.

Overall, this is a well-executed benchmark paper addressing a real and timely gap. The methodological framework is sound and thoughtfully designed, but the empirical evidence base is thin relative to the claims' scope. The qualitative insights about agent-specific failure modes may prove more influential than the quantitative rankings.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (24)

vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

claude-opus-4.6 · 5/20/2026

Paper 2 introduces a novel theoretical formalization connecting trust calibration for AI agents to preference learning via Preferential Bayesian Optimization, providing a principled mathematical framework with broad applicability across human-AI interaction, autonomous systems, and AI safety. This bridges multiple fields (Bayesian optimization, human-robot interaction, AI alignment) and offers reusable theoretical machinery. Paper 1, while practically useful, is primarily a benchmark/evaluation study of existing systems—inherently more ephemeral as models rapidly improve. Paper 2's conceptual contribution has longer-lasting impact potential and broader methodological influence.

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gemini-3.1 · 5/20/2026

Paper 2 offers a timely and broadly impactful benchmark for emerging Deep Research Agents (DRAs). While Paper 1 demonstrates impressive real-world economic impact in digital advertising, its scientific scope is relatively narrow. Paper 2 addresses a critical gap in evaluating frontier LLMs on complex, multi-step knowledge work. By introducing rigorous SME-authored rubrics and cognitive traps to evaluate state-of-the-art models (o3, Gemini, Claude), it sets a foundational standard for future DRA development. Benchmarks like this typically garner high citations and drive widespread methodological advancements across the broader AI community.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical and highly timely gap in the field: evaluating frontier Deep Research agents on complex, real-world enterprise tasks. Its novel methodological approach, including conjunctive grading and cognitive traps, combined with benchmarking state-of-the-art models, gives it broad relevance and high citation potential across the rapidly growing domain of autonomous AI agents.

vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

gpt-5.2 · 5/20/2026

Paper 1 is more novel methodologically: it introduces a new optimization framework (LLM-elicited adaptive embeddings + GP Bayesian optimization) for aggregate-only prompt tuning, a hard and practically important setting. This can generalize beyond prompt optimization to black-box optimization over discrete artifacts, influencing optimization, HCI, and LLM alignment/tooling. Paper 2 is timely and useful as an evaluation benchmark, but benchmarks tend to have narrower scientific reach and shorter half-life as models/harnesses change, unless they become a dominant standard. Paper 1’s algorithmic contribution is likelier to inspire follow-on research.

vs. Neurosymbolic Learning for Inference-Time Argumentation

claude-opus-4.6 · 5/20/2026

Paper 1 introduces a novel neurosymbolic framework (ITA) that combines formal argumentation semantics with LLM training for claim verification, offering a principled approach to faithful explanations and uncertainty handling. This methodological contribution has broader theoretical significance and potential for impact across AI, NLP, and formal reasoning communities. Paper 2, while timely and practically useful, is primarily a benchmarking study evaluating existing commercial systems—its contributions are more ephemeral as the specific agents tested will quickly be superseded, whereas Paper 1's framework and methodology have longer-lasting scientific value.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

gemini-3.1 · 5/20/2026

Paper 2 introduces a fundamental methodological improvement for Reinforcement Learning with Verifiable Rewards (RLVR), a critical area in LLM post-training. By dynamically adapting reward weights, it broadly advances model alignment and training efficiency. In contrast, Paper 1 is an evaluation benchmark for specific, current proprietary models in a niche domain (consulting). While valuable, benchmarks of proprietary models tend to have shorter-lived relevance compared to fundamental algorithmic advancements that can be applied to train future models across multiple domains.

vs. A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 has higher potential impact: it introduces a reusable architectural primitive (stochastic-deterministic boundary) plus a pattern catalog, selection methodology, diagnostics, and a reference implementation—artifacts likely to generalize across many production agent systems and influence both research and engineering practice. Its framing connects to distributed-systems theory and targets reliability/scalability, making it broadly applicable and timely as agents move into long-horizon deployments. Paper 1 is valuable and rigorous as an evaluation benchmark, but its impact is narrower (benchmark-specific) and more rapidly obsoleted by model and task drift.

vs. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a rigorous, decision-grade benchmark for deep research agents used in enterprise settings, with multi-layer evaluation (deterministic verifiers + SME rubrics) and adversarial “cognitive traps,” enabling more reliable measurement across models and deployments. Its methodology and scoring framework are generalizable across domains and can shape evaluation standards for agentic systems. Paper 1 is novel and valuable for mental-health audio modeling, but its impact is narrower due to dataset size, domain specificity, and data availability constraints.

vs. POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact: it introduces a timely, broadly relevant evaluation benchmark for deep research agents, addressing an urgent gap as DRAs are rapidly deployed. The benchmark’s design (SME-authored tasks, deterministic verifiers, structured rubrics, and cognitive traps) is methodologically rigorous and reusable, enabling standardized measurement and progress across academia and industry. Its applicability spans AI evaluation, HCI, enterprise automation, and safety. Paper 1 is a strong, domain-specific algorithmic contribution to MTS anomaly detection with a useful localization benchmark, but its reach is narrower across fields.

vs. Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamental methodological concern—psychometric reliability of AI-inferred user states—that has broad implications across all fields using LLMs for measurement and adaptation. Its replicable evaluation framework and finding that only 31/213 metrics meet reliability criteria provides a foundational contribution applicable to HCI, psychology, education, healthcare, and any domain using LLM-based assessment. Paper 1, while valuable, is a narrower benchmark evaluation of specific deep research agents that will quickly become outdated as models improve. Paper 2's framework has lasting methodological impact across disciplines.

vs. Human-Inspired Memory Architecture for LLM Agents

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a novel, biologically-grounded memory architecture for LLM agents that addresses a fundamental limitation (persistent memory management) with broad applicability across all LLM agent applications. Its principled approach combining cognitive science with engineering (six distinct mechanisms, synthetic calibration to prevent evaluation leakage) offers lasting methodological contributions. Paper 1, while practically valuable, is primarily an evaluation benchmark for current deep research agents—its findings are time-sensitive and will deprecate as models improve. Paper 2's architectural innovations have broader cross-field impact and longer-term relevance.

vs. From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and critical gap in evaluating frontier AI agents on complex, real-world enterprise workflows. Its novel benchmark methodology offers broad applicability and significant impact across AI research and industry, whereas Paper 2 focuses on a narrow application of shallow RL to a specific card game with limited broader relevance.

vs. GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

gemini-3.1 · 5/19/2026

Paper 2 introduces a timely and highly relevant benchmark for evaluating state-of-the-art deep research agents on complex, real-world tasks. Benchmarks for frontier AI models typically have a broad and immediate impact across the AI community, guiding future research and development. While Paper 1 presents a solid methodological contribution for the cybersecurity domain, Paper 2's focus on evaluating general-purpose autonomous research agents offers wider applicability and addresses an urgent gap in understanding current AI capabilities and limitations.

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

gpt-5.2 · 5/19/2026

Paper 2 is likely to have higher scientific impact due to its broad, timely relevance: robust evaluation of deployed deep research agents affects academia and industry across NLP, agent benchmarking, safety/reliability, and human-AI workflows. It contributes a new benchmark with SME-authored tasks, deterministic verifiers plus rubrics, and cognitive-trap design—methodology that can generalize and become a standard. Paper 1 is a solid, novel RL method for diffusion-based multimodal models with clear image-generation gains, but its impact is narrower (specific to dMLLM RL training) and may compete in a crowded methods space.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

gemini-3.1 · 5/19/2026

Paper 2 presents a novel, training-free, algorithmic solution to hallucination reduction, a fundamental problem in LLMs. Its broad applicability across 15 models with significant performance gains without external retrieval or finetuning suggests high potential for widespread adoption. Paper 1 introduces a valuable benchmark for evaluating deep research agents, but its impact is inherently more niche and evaluative compared to the fundamental, cross-architectural improvements offered by Paper 2.

vs. Reasoning Compression with Mixed-Policy Distillation

gemini-3.1 · 5/19/2026

Paper 1 proposes a novel methodological framework (Mixed-Policy Distillation) for compressing LLM reasoning, directly addressing critical bottlenecks in AI deployment like latency and computational cost. Its approach to improving small model efficiency has broad applicability across the field of AI training. Paper 2, while valuable for evaluation, introduces a domain-specific benchmark which typically offers narrower, more specialized impact compared to foundational algorithmic improvements that can be widely adopted.

vs. X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact: it introduces a timely, broadly useful evaluation benchmark for deep research agents with SME-designed tasks, deterministic verifiers, explicit rubrics, and cognitive traps—elements that can standardize measurement and drive progress across models, labs, and enterprises. Its methodology is comparatively easier to reproduce and generalize than Paper 1’s organization-specific attention/behavioral “digital twin” pipeline, which, while novel and application-relevant, may face privacy, deployment, and external-validity constraints that limit cross-field uptake.

vs. Learning Developmental Scaffoldings to Guide Self-Organisation

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel computational framework connecting biological development, self-organization, and information theory through jointly learned pre-patterns and NCA rules. It addresses fundamental questions about how information is distributed in developmental systems, with broad implications across developmental biology, artificial life, and self-organizing systems. Paper 2, while practically useful, is primarily an empirical benchmark evaluation of existing AI agents with a narrower scope and shorter relevance horizon, as specific agent capabilities evolve rapidly. Paper 1's theoretical contributions and interdisciplinary nature give it greater lasting scientific impact.

vs. Responsible Agentic AI Requires Explicit Provenance

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamental structural problem in agentic AI—the lack of explicit provenance for responsibility attribution—proposing a formal framework (causal attribution function, responsibility tensor) with broad applicability across all agentic AI systems. Its impact extends beyond benchmarking into governance, policy, and system design, touching multiple stakeholder communities. Paper 1, while methodologically rigorous and timely, is primarily an empirical benchmark evaluation of specific current models that will quickly become dated. Paper 2's conceptual contributions have longer-lasting and broader cross-disciplinary impact spanning AI safety, law, and sociotechnical systems.

vs. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

gemini-3.1 · 5/19/2026

Paper 2 introduces a highly timely and relevant benchmark for evaluating frontier Deep Research Agents (DRAs) on complex, real-world expert consulting tasks. Given the rapid deployment of agentic workflows in enterprise settings, rigorous evaluation frameworks targeting multi-document, decision-grade work are urgently needed. Its use of deterministic verifiers, SME rubrics, and cognitive traps provides a comprehensive and rigorous methodology that will likely drive significant future research in agent evaluation and safety, leading to a broader scientific and industry impact than the methodological query clustering approach in Paper 1.