Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan
Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.
This paper addresses a genuine gap in how Deep Research Agents (DRAs) are evaluated: the lack of multi-turn assessment. While existing benchmarks (DRACO, DeepResearch Bench, etc.) evaluate single-shot outputs, this work investigates whether DRAs can iteratively improve when given feedback. The paper introduces Research Gap Inference (RGI), a method that analyzes patterns of satisfied/unsatisfied rubric criteria to generate process-level feedback targeting research strategy gaps rather than individual content deficiencies.
The key distinction from prior work (particularly Chen et al., 2026) is the shift from criterion-level feedback (pointing out specific missing content) to process-level feedback (diagnosing how the agent's research approach fell short). This is a meaningful conceptual contribution, as it more closely mirrors how a human mentor would guide a researcher—pointing toward methodological or strategic shortcomings rather than listing missing facts.
The three main findings are well-supported and practically important:
1. Self-reflection is ineffective: Incorporation and regression rates nearly cancel (e.g., DeepSeek-V4-Flash: 199 incorporations vs. 198 regressions). This aligns with and extends findings from Huang et al. (2023) and Tyen et al. (2024) into the DRA domain.
2. Process-level feedback yields substantial one-shot gains: +8-15 normalized score points with ~35-40% incorporation rate. This demonstrates that current DRAs have latent capability that can be unlocked by targeted guidance.
3. Gains don't compound: The most novel and impactful finding. The regression at Turn 3 (up to 24% of previously satisfied criteria) reveals a fundamental architectural limitation—the full-rewrite paradigm. The paper's analysis tracing this to citation retention and textual overlap differences between models is particularly insightful: DeepSeek-V4-Flash regresses less (8.96% vs. 18-24%) because it implicitly preserves more content (53.96% citation retention vs. 27-37%), but at 2-4× the computational cost.
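The incorporation, regression, and citation-retention figures quoted above can be made concrete with a small sketch. It is not the paper's implementation: the function names, the per-criterion boolean representation, and the choice of denominators (previously unsatisfied criteria for incorporation, previously satisfied criteria for regression, previous-turn citations for retention) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnTransition:
    """Rubric changes between two consecutive report versions (hypothetical type)."""
    incorporated: int          # unsatisfied before, satisfied after
    regressed: int             # satisfied before, unsatisfied after
    incorporation_rate: float  # incorporated / previously unsatisfied (assumed definition)
    regression_rate: float     # regressed / previously satisfied (assumed definition)


def compare_turns(before: dict[str, bool], after: dict[str, bool]) -> TurnTransition:
    """Compare per-criterion rubric verdicts for the same task across two turns."""
    if before.keys() != after.keys():
        raise ValueError("both turns must be graded against the same rubric")
    incorporated = sum(1 for c in before if not before[c] and after[c])
    regressed = sum(1 for c in before if before[c] and not after[c])
    n_unsatisfied = sum(1 for ok in before.values() if not ok)
    n_satisfied = len(before) - n_unsatisfied
    return TurnTransition(
        incorporated=incorporated,
        regressed=regressed,
        incorporation_rate=incorporated / n_unsatisfied if n_unsatisfied else 0.0,
        regression_rate=regressed / n_satisfied if n_satisfied else 0.0,
    )


def citation_retention(prev_citations: set[str], new_citations: set[str]) -> float:
    """Fraction of the previous turn's citations that survive the rewrite (assumed definition)."""
    if not prev_citations:
        return 0.0
    return len(prev_citations & new_citations) / len(prev_citations)


# Toy example: one criterion gained, one lost, so the net change is zero,
# mirroring the "rates nearly cancel" pattern described for self-reflection.
turn1 = {"c1": True, "c2": True, "c3": False, "c4": False, "c5": True}
turn2 = {"c1": True, "c2": False, "c3": True, "c4": False, "c5": True}
transition = compare_turns(turn1, turn2)
print(transition.incorporated - transition.regressed)  # net change in satisfied criteria
```

Tracking the two counts separately, rather than only the net score, is what exposes the pattern the review highlights: a revision can look flat or mildly positive overall while discarding a meaningful share of previously satisfied criteria.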
Direct impact on DRA development: The paper makes a compelling case that current DRA architectures need explicit content-preservation mechanisms for multi-turn operation. This is an actionable insight for system builders.
Benchmarking methodology: The shift from single-shot to multi-turn evaluation, and from criterion-level to process-level feedback, could influence how future DRA benchmarks are designed.
Practical implications: A single round of process-level feedback yields substantial gains, but subsequent rounds do not, which suggests a practical ceiling for iterative refinement with current architectures. This has implications for the product design of DRA systems.
The paper is highly timely. DRAs (Gemini Deep Research, OpenAI Deep Research, Perplexity) are being rapidly deployed commercially, yet evaluation methodology has lagged behind deployment. The multi-turn setting is particularly relevant as these systems are increasingly used iteratively in practice.
This is a solid empirical study that identifies an important limitation of current DRA architectures and provides useful diagnostic tools and analysis. The findings are clearly presented and practically relevant. The main limitation is the relatively narrow experimental scope (one framework, 50 tasks, no comparison with criterion-level feedback). The RGI method, while useful, is more of a well-crafted prompting strategy than a deeply novel algorithmic contribution. The paper's greatest value lies in its diagnostic insights about regression patterns and the full-rewrite bottleneck.
Generated Jun 9, 2026
Paper 1 likely has higher scientific impact: it introduces a practical, training-free decoding-time KV-cache budget allocation framework (layer-wise + online head-wise) that directly addresses a major, timely bottleneck in long-CoT LLM inference, with clear deployment value across many reasoning-capable models and tasks. The method appears broadly applicable (plug-and-play with eviction policies) and targets real-world efficiency constraints. Paper 2 provides a useful evaluation protocol and insights about feedback limitations, but its contributions are more benchmark/diagnostic and may have narrower downstream adoption than an inference optimization technique.
Paper 2 addresses a highly timely and critical challenge in artificial intelligence: evaluating and improving Deep Research Agents. Its findings on the limitations of self-reflection and the impact of process-level feedback have broad implications for the rapidly growing field of AI agents, potentially affecting how autonomous research systems are designed across all scientific disciplines. In contrast, Paper 1 is highly specialized within sports analytics. Thus, Paper 2 offers significantly broader impact and relevance to a larger scientific community.
Paper 2 likely has higher impact: it targets AI safety/control—an urgent, high-stakes area—with a broadly applicable benchmark (CIAware-Bench) spanning multiple domains and frontier models. Measuring control-intervention awareness is novel and directly informs real-world deployment of monitoring/control protocols, affecting governance, security, and alignment research across labs. Paper 1 is valuable for evaluating deep research agents and introduces RGI, but its scope is narrower (DRA multi-turn rubric improvement) and primarily advances evaluation methodology rather than addressing a cross-cutting safety concern. Both are timely, but Paper 2’s broader relevance and applications dominate.
Paper 2 offers higher scientific impact because it identifies a fundamental limitation of deep research agents—their inability to reliably improve through iterative feedback—which has broad implications for all agentic AI systems. The novel Research Gap Inference method and the finding that self-reflection yields negligible improvement while process-level feedback gains don't compound are actionable insights that directly inform agent architecture design. While Paper 1 is a solid benchmark contribution, it primarily confirms expected difficulties (low success rates on hard tasks). Paper 2's multi-turn evaluation paradigm addresses a more fundamental and timely question about agent learning dynamics.
Paper 1 addresses a critical and timely gap in evaluating deep research agents—multi-turn improvement under feedback—revealing fundamental limitations (regression during revision) that have broad implications for AI agent design. Its systematic methodology, quantitative findings, and publicly available code make it highly actionable for the rapidly growing field of AI agents. Paper 2, while innovative in combining LLMs with formal verification via a novel workflow, targets a narrower audience (autoformalization/theorem proving) and its impact depends on scalability beyond the demonstrated case study. Paper 1's findings are more broadly applicable across AI research.
Paper 2 addresses a fundamental and highly timely problem in AI: the ability of autonomous agents to improve through multi-turn feedback. While Paper 1 offers a strong, domain-specific contribution to autonomous driving, Paper 2's focus on deep research agents and process-level feedback provides insights and evaluation methodologies that are broadly applicable across the rapidly expanding field of LLM agents, likely leading to wider cross-disciplinary adoption and higher citation impact.
Paper 2 addresses a more fundamental and timely question about deep research agents' ability to improve through feedback, revealing important limitations (regression during revision, non-compounding gains) that have broad implications for the rapidly growing field of AI agents. Its findings about self-reflection's ineffectiveness and the ceiling on multi-turn improvement challenge prevailing assumptions and will influence agent architecture design. Paper 1, while technically sound in error attribution, addresses a narrower problem. Paper 2's benchmark contribution, open-source release, and broadly applicable insights give it wider cross-field impact.
Paper 1 addresses a critical, high-stakes real-world application (healthcare triage) and introduces a novel, rigorous methodology (clause cards) for generating verifiable, policy-grounded benchmarks. Its scalable approach to handling missing information and ambiguity provides a highly impactful framework for deploying LLMs in regulated domains, offering broader practical utility than the agent evaluation insights in Paper 2.
Paper 1 (DataCOPE) presents a novel unsupervised framework for skill discovery that demonstrates substantial performance improvements (9.71% and 32.30%) across multiple settings. It introduces a generalizable methodology with concrete algorithmic contributions (Adaptive Checklist Verifier, Answer Agreement Verifier, contrastive skill distillation) applicable beyond its specific evaluation domains. Paper 2 provides valuable empirical insights about multi-turn evaluation limitations of deep research agents, but is primarily an evaluation/analysis study with more limited methodological novelty. Paper 1's constructive framework for improving agents has broader applicability and stronger potential to influence future agent development.
Paper 2 has higher impact potential: it proposes a training methodology (harness-guided delegation trajectories for SFT) that directly improves long-horizon agent performance under finite context, with strong benchmark gains and planned releases of model/weights/data enabling adoption. Its applications (scalable research agents, tool-using systems, enterprise workflows) are broad and timely as multi-agent LLM systems proliferate. Paper 1 is methodologically useful and insightful for evaluation/feedback limits, but is primarily diagnostic/benchmarking and may have narrower downstream impact than a broadly applicable capability-training approach.