
Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

cs.AI · cs.CL · cs.LG
#1774 of 3489 · Artificial Intelligence
Tournament Score: 1397 ± 45
Win Rate: 53% (9 wins, 8 losses, 17 matches)
Rating: 6.5 / 10 (Significance 7, Rigor 6.5, Novelty 6, Clarity 8)

Abstract

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a genuine gap in how Deep Research Agents (DRAs) are evaluated: the lack of multi-turn assessment. While existing benchmarks (DRACO, DeepResearch Bench, etc.) evaluate single-shot outputs, this work investigates whether DRAs can iteratively improve when given feedback. The paper introduces Research Gap Inference (RGI), a method that analyzes patterns of satisfied/unsatisfied rubric criteria to generate process-level feedback targeting research strategy gaps rather than individual content deficiencies.

The key distinction from prior work (particularly Chen et al., 2026) is the shift from criterion-level feedback (pointing out specific missing content) to process-level feedback (diagnosing how the agent's research approach fell short). This is a meaningful conceptual contribution, as it more closely mirrors how a human mentor would guide a researcher—pointing toward methodological or strategic shortcomings rather than listing missing facts.

Methodological Rigor

Strengths in design:

  • The experimental framework is well-structured with clear baselines (Turn 1 → self-reflection vs. Turn 1 → RGI Turn 2 → RGI Turn 3), enabling controlled comparison across feedback types.
  • The use of three models (GPT-4.1-mini, GPT-4.1, DeepSeek-V4-Flash) under identical scaffolding (LC-ODR) isolates model capability from architectural differences.
  • Metrics are well-chosen: normalized score, pass rate, incorporation rate, regression rate, and net criterion gain provide complementary perspectives. The paper astutely notes that per-axis incorporation/regression rates can be misleading due to differing criterion counts, and addresses this with net criterion gain.
  • Trace-level diagnostics (web searches, URLs, citation retention, n-gram overlap) provide mechanistic explanations for observed patterns.
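
The turn-over-turn metrics named above can be illustrated with a minimal sketch. The definitions here are assumptions, not the paper's confirmed formulas: incorporation rate is taken as the share of previously unsatisfied criteria that become satisfied, regression rate as the share of previously satisfied criteria that become unsatisfied, and net criterion gain as incorporations minus regressions. The names `TurnDelta` and `compare_turns` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnDelta:
    """Criterion-level changes between two consecutive turns (hypothetical helper)."""
    incorporated: int      # unsatisfied before, satisfied now
    regressed: int         # satisfied before, unsatisfied now
    prev_unsatisfied: int  # assumed denominator for the incorporation rate
    prev_satisfied: int    # assumed denominator for the regression rate

    @property
    def incorporation_rate(self) -> float:
        return self.incorporated / self.prev_unsatisfied if self.prev_unsatisfied else 0.0

    @property
    def regression_rate(self) -> float:
        return self.regressed / self.prev_satisfied if self.prev_satisfied else 0.0

    @property
    def net_criterion_gain(self) -> int:
        # A raw count, so it is not distorted by axes with few criteria.
        return self.incorporated - self.regressed


def compare_turns(prev: dict[str, bool], curr: dict[str, bool]) -> TurnDelta:
    """Compare rubric verdicts (criterion id -> satisfied?) across two turns."""
    shared = prev.keys() & curr.keys()
    incorporated = sum(1 for c in shared if not prev[c] and curr[c])
    regressed = sum(1 for c in shared if prev[c] and not curr[c])
    prev_satisfied = sum(1 for c in shared if prev[c])
    return TurnDelta(incorporated, regressed, len(shared) - prev_satisfied, prev_satisfied)


# Toy verdicts for illustration only; these are not data from the paper.
turn1 = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": False}
turn2 = {"c1": True, "c2": False, "c3": True, "c4": True, "c5": False}
delta = compare_turns(turn1, turn2)
print(delta.incorporation_rate, delta.regression_rate, delta.net_criterion_gain)
```

Under these assumed definitions, a rate computed on an axis with few criteria can swing widely from a single flipped verdict, which is why a count-based net gain is the more stable summary. With the counts the review reports for self-reflection (199 incorporations vs. 198 regressions), the net gain would be +1.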

Limitations in rigor:

  • The sample size of 50 tasks from DRACO is modest, though the authors acknowledge this. Statistical tests are provided for the headroom analysis but not systematically for all claims.
  • The RGI feedback generator is a single fixed model (GPT-4.1), introducing a confound: the quality of feedback is itself dependent on an LLM's diagnostic capability, which is not independently validated.
  • The rubric judge uses GPT-5.2, but there's no discussion of inter-rater reliability or human validation of the evaluation itself. The entire evaluation pipeline is LLM-dependent.
  • LC-ODR is the only agent framework tested. While the authors argue that all current DRAs share the full-rewrite paradigm, the specifics of LC-ODR's multi-agent decomposition (Planner → Supervisor → Researcher → Reporter) may introduce framework-specific artifacts.
  • The paper does not compare RGI process-level feedback against criterion-level feedback (as in Chen et al.), explicitly deferring this to future work. This makes it difficult to isolate the specific value of process-level over criterion-level feedback.

Key Findings and Their Significance

    The three main findings are well-supported and practically important:

    1. Self-reflection is ineffective: Incorporation and regression rates nearly cancel (e.g., DeepSeek-V4-Flash: 199 incorporations vs. 198 regressions). This aligns with and extends findings from Huang et al. (2023) and Tyen et al. (2024) into the DRA domain.

    2. Process-level feedback yields substantial one-shot gains: +8-15 normalized score points with ~35-40% incorporation rate. This demonstrates that current DRAs have latent capability that can be unlocked by targeted guidance.

    3. Gains don't compound: The most novel and impactful finding. The regression at Turn 3 (up to 24% of previously satisfied criteria) reveals a fundamental architectural limitation—the full-rewrite paradigm. The paper's analysis tracing this to citation retention and textual overlap differences between models is particularly insightful: DeepSeek-V4-Flash regresses less (8.96% vs. 18-24%) because it implicitly preserves more content (53.96% citation retention vs. 27-37%), but at 2-4× the computational cost.
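
The rewrite diagnostics mentioned in the third finding (citation retention and textual overlap) can be approximated with a short sketch. The exact definitions used in the paper are not given here, so the following are assumptions: citation retention is taken as the fraction of URLs cited in the previous report that still appear in the revised one, and textual overlap as the fraction of the previous report's word trigrams that survive the rewrite. All function names are hypothetical.

```python
import re

# Simple URL matcher; an assumption for this sketch, not the paper's extraction logic.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def cited_urls(report: str) -> set[str]:
    """Collect the distinct URLs in a report, trimming trailing punctuation."""
    return {url.rstrip(".,;") for url in URL_PATTERN.findall(report)}


def citation_retention(prev_report: str, new_report: str) -> float:
    """Share of the previous report's URLs that also appear in the new report."""
    prev = cited_urls(prev_report)
    if not prev:
        return 0.0
    return len(prev & cited_urls(new_report)) / len(prev)


def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Distinct lowercase word n-grams of a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(prev_report: str, new_report: str, n: int = 3) -> float:
    """Share of the previous report's n-grams that survive in the new report."""
    prev = ngrams(prev_report, n)
    if not prev:
        return 0.0
    return len(prev & ngrams(new_report, n)) / len(prev)


# Toy reports for illustration only; these are not data from the paper.
turn2 = "Solar capacity grew quickly https://a.example/x and wind lagged https://b.example/y"
turn3 = "Solar capacity grew quickly https://a.example/x while storage costs fell https://c.example/z"
print(citation_retention(turn2, turn3))  # one of the two earlier URLs is kept
print(ngram_overlap(turn2, turn3))
```

Both measures are asymmetric on purpose: they ask how much of the earlier report survives, which is the quantity the review links to lower regression.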

    Potential Impact

    Direct impact on DRA development: The paper makes a compelling case that current DRA architectures need explicit content-preservation mechanisms for multi-turn operation. This is an actionable insight for system builders.

    Benchmarking methodology: The shift from single-shot to multi-turn evaluation, and from criterion-level to process-level feedback, could influence how future DRA benchmarks are designed.

    Practical implications: The finding that a single round of process-level feedback yields substantial gains but subsequent rounds don't suggests a practical ceiling for iterative refinement with current architectures. This has implications for product design of DRA systems.

    Timeliness & Relevance

    The paper is highly timely. DRAs (Gemini Deep Research, OpenAI Deep Research, Perplexity) are being rapidly deployed commercially, yet evaluation methodology has lagged behind deployment. The multi-turn setting is particularly relevant as these systems are increasingly used iteratively in practice.

    Strengths & Limitations

    Key strengths:

  • Clear, well-defined problem with practical relevance
  • The RGI method is well-motivated and the two-step procedure (gap analysis → feedback) is principled
  • Rich analysis beyond aggregate scores: axis-wise, domain-wise, task-level headroom analysis, trace diagnostics, and case studies provide multiple angles of understanding
  • The rewrite behavior analysis (Table 4) connecting citation retention to regression rates is a genuinely useful diagnostic insight
  • Code and results publicly available

Notable weaknesses:

  • Single framework (LC-ODR) limits generalizability claims
  • No direct comparison with criterion-level feedback
  • The RGI method itself is essentially a carefully prompted LLM, not a novel algorithmic contribution—its effectiveness depends on prompt engineering quality
  • 50 tasks may be insufficient for robust domain-level conclusions (some domains have only 3 tasks)
  • The paper doesn't explore whether architectural modifications (e.g., incremental editing rather than full rewrite) could address the identified limitations, which would strengthen the prescriptive value

Overall Assessment

    This is a solid empirical study that identifies an important limitation of current DRA architectures and provides useful diagnostic tools and analysis. The findings are clearly presented and practically relevant. The main limitation is the relatively narrow experimental scope (one framework, 50 tasks, no comparison with criterion-level feedback). The RGI method, while useful, is more of a well-crafted prompting strategy than a deeply novel algorithmic contribution. The paper's greatest value lies in its diagnostic insights about regression patterns and the full-rewrite bottleneck.

    Rating: 6.5 / 10
    Significance 7 · Rigor 6.5 · Novelty 6 · Clarity 8

    Generated Jun 9, 2026

    Comparison History (17)

    Lost vs. ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

    Paper 1 likely has higher scientific impact: it introduces a practical, training-free decoding-time KV-cache budget allocation framework (layer-wise + online head-wise) that directly addresses a major, timely bottleneck in long-CoT LLM inference, with clear deployment value across many reasoning-capable models and tasks. The method appears broadly applicable (plug-and-play with eviction policies) and targets real-world efficiency constraints. Paper 2 provides a useful evaluation protocol and insights about feedback limitations, but its contributions are more benchmark/diagnostic and may have narrower downstream adoption than an inference optimization technique.

    gpt-5.2 · Jun 10, 2026

    Won vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

    Paper 2 addresses a highly timely and critical challenge in artificial intelligence: evaluating and improving Deep Research Agents. Its findings on the limitations of self-reflection and the impact of process-level feedback have broad implications for the rapidly growing field of AI agents, potentially affecting how autonomous research systems are designed across all scientific disciplines. In contrast, Paper 1 is highly specialized within sports analytics. Thus, Paper 2 offers significantly broader impact and relevance to a larger scientific community.

    gemini-3.1-pro-preview · Jun 10, 2026

    Lost vs. CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

    Paper 2 likely has higher impact: it targets AI safety/control—an urgent, high-stakes area—with a broadly applicable benchmark (CIAware-Bench) spanning multiple domains and frontier models. Measuring control-intervention awareness is novel and directly informs real-world deployment of monitoring/control protocols, affecting governance, security, and alignment research across labs. Paper 1 is valuable for evaluating deep research agents and introduces RGI, but its scope is narrower (DRA multi-turn rubric improvement) and primarily advances evaluation methodology rather than addressing a cross-cutting safety concern. Both are timely, but Paper 2’s broader relevance and applications dominate.

    gpt-5.2 · Jun 10, 2026

    Won vs. Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

    Paper 2 offers higher scientific impact because it identifies a fundamental limitation of deep research agents—their inability to reliably improve through iterative feedback—which has broad implications for all agentic AI systems. The novel Research Gap Inference method and the finding that self-reflection yields negligible improvement while process-level feedback gains don't compound are actionable insights that directly inform agent architecture design. While Paper 1 is a solid benchmark contribution, it primarily confirms expected difficulties (low success rates on hard tasks). Paper 2's multi-turn evaluation paradigm addresses a more fundamental and timely question about agent learning dynamics.

    claude-opus-4-6 · Jun 10, 2026

    Won vs. (Auto)formalization is supposed to be easy: Trellis process semantics for spelling out rigorous proofs

    Paper 1 addresses a critical and timely gap in evaluating deep research agents—multi-turn improvement under feedback—revealing fundamental limitations (regression during revision) that have broad implications for AI agent design. Its systematic methodology, quantitative findings, and publicly available code make it highly actionable for the rapidly growing field of AI agents. Paper 2, while innovative in combining LLMs with formal verification via a novel workflow, targets a narrower audience (autoformalization/theorem proving) and its impact depends on scalability beyond the demonstrated case study. Paper 1's findings are more broadly applicable across AI research.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

    Paper 2 addresses a fundamental and highly timely problem in AI: the ability of autonomous agents to improve through multi-turn feedback. While Paper 1 offers a strong, domain-specific contribution to autonomous driving, Paper 2's focus on deep research agents and process-level feedback provides insights and evaluation methodologies that are broadly applicable across the rapidly expanding field of LLM agents, likely leading to wider cross-disciplinary adoption and higher citation impact.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

    Paper 2 addresses a more fundamental and timely question about deep research agents' ability to improve through feedback, revealing important limitations (regression during revision, non-compounding gains) that have broad implications for the rapidly growing field of AI agents. Its findings about self-reflection's ineffectiveness and the ceiling on multi-turn improvement challenge prevailing assumptions and will influence agent architecture design. Paper 1, while technically sound in error attribution, addresses a narrower problem. Paper 2's benchmark contribution, open-source release, and broadly applicable insights give it wider cross-field impact.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

    Paper 1 addresses a critical, high-stakes real-world application (healthcare triage) and introduces a novel, rigorous methodology (clause cards) for generating verifiable, policy-grounded benchmarks. Its scalable approach to handling missing information and ambiguity provides a highly impactful framework for deploying LLMs in regulated domains, offering broader practical utility than the agent evaluation insights in Paper 2.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. Unsupervised Skill Discovery for Agentic Data Analysis

    Paper 1 (DataCOPE) presents a novel unsupervised framework for skill discovery that demonstrates substantial performance improvements (9.71% and 32.30%) across multiple settings. It introduces a generalizable methodology with concrete algorithmic contributions (Adaptive Checklist Verifier, Answer Agreement Verifier, contrastive skill distillation) applicable beyond its specific evaluation domains. Paper 2 provides valuable empirical insights about multi-turn evaluation limitations of deep research agents, but is primarily an evaluation/analysis study with more limited methodological novelty. Paper 1's constructive framework for improving agents has broader applicability and stronger potential to influence future agent development.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

    Paper 2 has higher impact potential: it proposes a training methodology (harness-guided delegation trajectories for SFT) that directly improves long-horizon agent performance under finite context, with strong benchmark gains and planned releases of model/weights/data enabling adoption. Its applications (scalable research agents, tool-using systems, enterprise workflows) are broad and timely as multi-agent LLM systems proliferate. Paper 1 is methodologically useful and insightful for evaluation/feedback limits, but is primarily diagnostic/benchmarking and may have narrower downstream impact than a broadly applicable capability-training approach.

    gpt-5.2 · Jun 9, 2026