DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning

Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li

Jun 5, 2026 · arXiv:2606.07299v1

cs.AI

#2503 of 3489 · Artificial Intelligence

Tournament Score: 1341±43

Win Rate: 37%

Rating: 5.2/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Abstract

Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: DuMate-DeepResearch

1. Core Contribution

DuMate-DeepResearch presents a multi-agent framework for automated deep research—the task of producing comprehensive, evidence-grounded, long-form reports from complex open-ended queries. The system is built on the Qianfan Agent Foundry (Baidu AI Cloud) and introduces three interlocking mechanisms: (i) graph-based dynamic planning that maintains a DAG-structured research roadmap with coarse-to-fine expansion, reflection, backtracking, and parallel branching; (ii) recursive two-level execution where an outer Research Agent delegates complex retrieval sub-tasks to inner Search Agents, each running their own planning-execution loop; and (iii) rubric-based test-time optimization that dynamically generates persistent and ephemeral quality criteria serving as live reasoning scaffolds during both planning and synthesis. The paper claims state-of-the-art results on DeepResearch Bench (58.03%) and DeepResearch Bench II (61.95%).
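The recursive two-level execution summarized above can be pictured with a minimal sketch. Everything in it is a hypothetical illustration rather than the paper's API: the names `run_agent`, `search`, `Trace`, and `MAX_DEPTH`, and the `(name, subtasks)` task encoding, are assumptions. It shows only the general pattern of an outer agent delegating a complex sub-task to an inner agent that runs its own loop, under an explicit depth bound, while every decision is logged for auditing.

```python
from dataclasses import dataclass, field

# Assumed bound: outer Research Agent at depth 0, inner Search Agent at depth 1.
MAX_DEPTH = 2


@dataclass
class Trace:
    """Flat audit log of (depth, kind, detail) events."""
    events: list = field(default_factory=list)

    def log(self, depth, kind, detail):
        self.events.append((depth, kind, detail))


def search(query):
    # Stub standing in for a real retrieval tool (hypothetical).
    return f"evidence for '{query}'"


def run_agent(task, trace, depth=0):
    """Run one planning loop over task = (name, subtasks).

    Sub-tasks that have children of their own count as "complex" and are
    delegated one level down, unless the depth bound would be exceeded.
    """
    name, subtasks = task
    trace.log(depth, "plan", name)
    if not subtasks:
        trace.log(depth, "tool", name)
        return [search(name)]

    results = []
    for sub in subtasks:
        sub_name, sub_children = sub
        if sub_children and depth + 1 < MAX_DEPTH:
            # Delegate to an inner agent with its own planning loop.
            results.extend(run_agent(sub, trace, depth + 1))
        else:
            # Simple sub-task, or depth bound reached: call the tool directly.
            trace.log(depth, "tool", sub_name)
            results.append(search(sub_name))
    return results


task = ("EV market", [
    ("battery costs", [("lithium prices", []), ("cell yields", [])]),
    ("policy", []),
])
trace = Trace()
evidence = run_agent(task, trace)
```

In this toy run the outer agent handles "policy" itself and hands "battery costs" to an inner agent, so the trace records which level made each tool call. That per-level log is the kind of artifact an auditability claim would need to expose.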

The core problem addressed—orchestrating long-horizon, multi-step research workflows with auditability—is genuinely important. The proposed solution synthesizes several individually known ideas (DAG-based planning, hierarchical agents, rubric-guided generation) into a cohesive system architecture.

2. Methodological Rigor

Strengths in formalization: The paper provides a clean state-transition formalization (Equations 1-2), algorithms for the main loop and planning, and a clear ready-frontier scheduling mechanism (Equation 3). The recursive agent nesting is well-defined with explicit depth bounding.
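For readers unfamiliar with ready-frontier scheduling, the sketch below shows the usual form of the idea. The paper's Equation 3 is not reproduced in this review, so the sketch assumes the common definition: a node is ready when it is still pending and all of its prerequisites are done. The function names and the example roadmap are hypothetical.

```python
def ready_frontier(deps, status):
    """Return pending nodes whose prerequisites are all done.

    deps maps each node to the list of nodes it depends on.
    The result is sorted only to make the output deterministic.
    """
    return sorted(
        node for node, prereqs in deps.items()
        if status[node] == "pending"
        and all(status[p] == "done" for p in prereqs)
    )


def run_in_waves(deps):
    """Execute a DAG wave by wave; each wave could run in parallel."""
    status = {node: "pending" for node in deps}
    waves = []
    while any(s == "pending" for s in status.values()):
        frontier = ready_frontier(deps, status)
        if not frontier:
            # Pending nodes remain but none is ready: the graph has a cycle.
            raise ValueError("dependency cycle detected")
        waves.append(frontier)
        for node in frontier:
            status[node] = "done"  # stand-in for actually executing the node
    return waves


roadmap = {
    "scope": [],
    "market": ["scope"],
    "tech": ["scope"],
    "report": ["market", "tech"],
}
waves = run_in_waves(roadmap)
```

Here "market" and "tech" land in the same wave because neither depends on the other, which is where parallel branching comes from. A dynamic planner would additionally add or remove nodes between waves; that part is omitted.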

Weaknesses in experimental rigor: Several concerns limit confidence in the empirical claims:

Baseline fairness is unclear. Baseline scores are "taken from official benchmark sources and leaderboards." This means different systems likely use different backbone LLMs, different search APIs, different compute budgets, and different numbers of retrieval rounds. The paper never discloses which backbone LLM powers DuMate-DeepResearch's Agent Core, making it impossible to attribute gains to architectural innovation versus model capability. The ablation replacing the report-stage model with alternatives (DeepSeek V4 Pro, GLM 5.1, etc.) actually reveals that the writing model is the single most impactful component tested, with a larger effect than the rubric mechanism, the only architectural component the ablations isolate.

Limited ablation scope. The ablations only test rubric removal and report-stage model swaps. There is no ablation of the graph-based planning (vs. linear/ReAct-style), no ablation of recursive two-level execution (vs. flat single-agent), and no ablation of the coarse-to-fine expansion strategy. These are the paper's three headline contributions, yet none are isolated experimentally.

Variance reporting is incomplete. Results are averaged over 3 runs, but no standard deviations or confidence intervals are reported. With margins of only 0.76-1.34 percentage points over second-best systems, statistical significance is unestablished.
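To make the concern concrete, the sketch below computes a 95% confidence interval for the margin between a three-run mean and a single baseline number. It rests on stated assumptions: the baseline is treated as a fixed value (its own run-to-run variance is ignored), the runs are treated as independent and roughly normal, and the scores are toy values chosen for illustration, not figures from the paper. The function name `margin_interval` is hypothetical; 4.303 is the standard two-sided 95% critical value of Student's t with 2 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 2 degrees of freedom.
T_CRIT_95_DF2 = 4.303


def margin_interval(runs, baseline):
    """Return (margin, (low, high)) for mean(runs) - baseline with 3 runs."""
    if len(runs) != 3:
        raise ValueError("critical value above is only valid for exactly 3 runs")
    mean = statistics.mean(runs)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(runs) / math.sqrt(len(runs))
    margin = mean - baseline
    half_width = T_CRIT_95_DF2 * sem
    return margin, (margin - half_width, margin + half_width)


# Toy scores, deliberately unrelated to any reported result.
toy_runs = [10.0, 11.0, 12.0]
toy_baseline = 10.0
margin, (low, high) = margin_interval(toy_runs, toy_baseline)
inconclusive = low <= 0.0 <= high
```

With only three runs the interval half-width is about 2.5 times the run-to-run standard deviation, so a margin of roughly one point is distinguishable from zero only if the runs are very tightly clustered. Reporting the standard deviation would let readers do this check themselves.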

Evaluation concerns. Both benchmarks rely on LLM-as-judge evaluation, which has known biases (verbosity bias, position bias). The system produces extremely long reports (68K-261K characters/words), which may inflate comprehensiveness scores without corresponding quality improvements.
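One simple check the authors could have run is a rank correlation between report length and judge score across tasks. The sketch below implements Spearman correlation with the standard library only. The function names and the toy inputs are hypothetical, and a high correlation alone would not prove that length inflates scores, since longer reports may also be genuinely better.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy data: report lengths in characters and judge scores per task.
toy_lengths = [100, 200, 300, 400]
toy_scores = [1.0, 2.0, 3.0, 4.0]
rho = spearman(toy_lengths, toy_scores)
```

A more convincing analysis would also truncate or summarize reports to a fixed length and re-score them, which separates length from content. The correlation is only the cheapest first step.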

3. Potential Impact

Practical relevance: Deep research systems represent a commercially important application area, with products from OpenAI, Google, Perplexity, and others. The architectural patterns described—particularly hierarchical agent decomposition and rubric-guided generation—are practically useful design principles that practitioners can adopt.

Limited scientific novelty: The individual components draw heavily from established ideas: DAG-structured task planning has precedents in hierarchical task networks; multi-agent delegation is well-studied; and using evaluation criteria as generation guidance is an increasingly common pattern (DR-Tulu preceded this with rubric-based RL). The novelty lies primarily in their combination and engineering within a production system.

Auditability claim: The paper emphasizes auditability as a contribution, but the treatment is descriptive rather than rigorous. There is no formal definition of what constitutes an auditable trace, no user study on whether the traces are actually useful for debugging or trust-building, and no comparison of auditability against other systems.

4. Timeliness & Relevance

The paper is highly timely. Deep research agents are an active frontier in 2025-2026, with numerous concurrent systems and benchmarks. The paper engages thoroughly with the rapidly growing literature. The benchmarks used are recent and relevant. The multi-agent, tool-augmented paradigm addresses real deployment needs.

However, the fast-moving nature of this field also means leaderboard positions are ephemeral. Several competing systems (iFlow-Researcher, ZTE Nebula, Xiaoyi, etc.) achieve scores within 1-2% on the same benchmarks, suggesting the field is converging rather than being transformed by any single system.

5. Strengths & Limitations

Key Strengths:

Well-articulated problem decomposition identifying four specific challenges

Clean formal framework with state-transition notation and algorithms

Comprehensive qualitative case studies (Section 3.3) that effectively illustrate system behavior

Practical system architecture with clear separation of concerns

Competitive benchmark performance across two complementary evaluation protocols

Key Limitations:

Undisclosed backbone model makes it impossible to disentangle system design from model capability

Missing critical ablations for the three headline architectural contributions

No statistical significance testing despite narrow margins

Auditability is asserted, not evaluated—no user study or formal verification

Reproducibility concerns: despite a GitHub link, the prompts are "desensitized" and key details are omitted; the system relies on proprietary Baidu Search infrastructure

Report length correlation with scores is not analyzed—the extremely long outputs (up to 261K characters) may game comprehensiveness metrics

The rubric ablation reveals surprisingly small effect sizes (0.42-0.50% overall), raising questions about the practical significance of the rubric mechanism despite its prominent positioning

6. Additional Observations

The paper reads more as a technical report/system description than a research paper with controlled experiments isolating causal contributions. The qualitative case studies, while informative, are cherry-picked demonstrations rather than systematic evaluations. The related work section is thorough but predominantly positions this work relative to open-source/academic systems while the primary competition appears to be commercial products.

The contribution of making prompts partially available (Appendix A) is appreciated but limited by the extensive redactions. The framework's tight coupling with Baidu's proprietary infrastructure limits external adoption.

Rating: 5.2/10

Significance 5.5 · Rigor 4 · Novelty 4.5 · Clarity 7

Generated Jun 8, 2026

Comparison History (19)

Lost vs. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

Paper 1 offers a highly novel intersection of LLMs and industrial control theory, rigorously defining the exact utility of LLMs as sample-efficient structural priors rather than mere optimizers. While Paper 2 presents a strong engineering framework for AI agents achieving SOTA results, Paper 1 provides deeper methodological insights and solves a notoriously difficult problem in physical systems, giving it broader cross-disciplinary scientific impact.

gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Paper 2 has higher potential impact because it proposes a new, domain-substantive mathematical conjecture (Neural Jacobian Conjecture) bridging classical algebraic geometry (Jacobian conjecture) and neural network theory, with partial proofs and multiple independent proof routes—creating a new research direction with cross-field relevance. Paper 1 is a strong engineering advance in multi-agent deep-research systems with benchmark SOTA, but its novelty is more incremental within a fast-moving area and likely to be quickly superseded. Paper 2’s conjecture-driven framing could catalyze broader, longer-lived follow-on work.

gpt-5.2·Jun 10, 2026

Won vs. AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Paper 1 presents a generalizable, state-of-the-art multi-agent framework for automated deep research. By introducing recursive execution and rubric-based test-time optimization, it directly addresses critical bottlenecks in AI agent planning and synthesis, achieving top performance on key benchmarks. While Paper 2 provides valuable empirical insights regarding data dependency in a specific domain (drug-asset valuation), Paper 1 offers a foundational architectural advancement that is broadly applicable across multiple scientific and analytical domains, ensuring a wider and more transformative methodological impact.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

Paper 2 presents a concrete, implemented multi-agent system (DuMate-DeepResearch) with novel architectural contributions (graph-based dynamic planning, recursive two-level execution, rubric-based optimization) and demonstrates state-of-the-art results on established benchmarks. Paper 1 is primarily a conceptual framework and literature survey for 'MetaAI recursive self-design' that acknowledges its protocol lacks experimental results. Paper 2 offers stronger methodological rigor, verified empirical contributions, and addresses practical limitations in deep research systems with broader near-term applicability.

claude-opus-4-6·Jun 9, 2026

Lost vs. Leveraging Structural Constraints for Diffusion-based Neural TSP Solvers

Paper 1 offers a clearer, more rigorous algorithmic contribution: a retraining-free, plug-and-play inference-time method (PCI) that replaces costly gradient refinement with structure-aware projections and achieves measurable gains (quality, speed, memory) on large-scale TSP. This is timely for diffusion/consistency-model combinatorial optimization and has practical impact for operations research and ML, with transferable ideas (projection-based decoding, hybrid local search) to other discrete problems. Paper 2 is an engineering-heavy multi-agent framework with benchmark gains, but its novelty is more system integration and may be less generalizable and harder to reproduce rigorously.

gpt-5.2·Jun 9, 2026

Won vs. Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

DuMate-DeepResearch addresses a more broadly impactful problem—autonomous deep research with multi-agent systems—achieving state-of-the-art results on established benchmarks. Its contributions (graph-based dynamic planning, recursive search agents, rubric-grounded reasoning) are more immediately applicable across diverse research and industry settings. Paper 1 tackles a narrower problem (skill construction from traces) with a more incremental contribution and evaluation on only 70 skills. Paper 2's auditability focus and benchmark-leading results position it for wider adoption and citation impact in the rapidly growing agentic AI field.

claude-opus-4-6·Jun 8, 2026

Won vs. Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation

Paper 2 presents a concrete, implemented multi-agent system with empirical state-of-the-art results on established benchmarks, demonstrating immediate practical impact in the rapidly growing Deep Research paradigm. Paper 1, while intellectually compelling in arguing for ante-hoc probabilistic mediation via Bayesian networks, remains a conceptual/position paper outlining a framework without implementation or empirical validation. Paper 2's specific architectural innovations (recursive search agents, rubric-grounded reasoning, graph-based planning) are immediately actionable and reproducible, giving it broader near-term scientific influence despite Paper 1's important theoretical contribution to AI accountability.

claude-opus-4-6·Jun 8, 2026

Lost vs. Exploring Agentic Tool-Calling Decisions via Uncertainty-Aligned Reinforcement Learning

Paper 1 introduces a more novel, generalizable methodological contribution: integrating uncertainty quantification directly into RL reward shaping to preserve uncertainty separation and reduce overconfident tool-use errors. This targets a foundational learning dynamics issue likely applicable beyond tool-calling (e.g., decision-making, exploration, calibration), offering broad cross-field impact and stronger scientific novelty. Paper 2 is a capable systems/engineering advance with benchmark gains and auditability, but its contributions are more architectural and platform-specific, with less clear methodological generality and rigor compared to a principled learning objective.

gpt-5.2·Jun 8, 2026

Lost vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation

Paper 1 is more likely to yield higher scientific impact: it introduces a new UAV navigation benchmark targeting partial observability in dense urban “canyons” and proposes a world-model-based VLA architecture that tightly couples future video imagination with action via flow matching—advancing embodied autonomy under real physical constraints. This has clear downstream applications in robotics (inspection, delivery, search-and-rescue) and broader relevance to model-based RL and embodied AI. Paper 2 is timely and useful for LLM systems engineering, but appears more incremental (agent orchestration, auditability, planning heuristics) and may be less methodologically universal.

gpt-5.2·Jun 8, 2026

Lost vs. A Geometric Account of Activation Steering through Angle-Norm Decomposition

Paper 1 offers higher scientific impact due to clearer conceptual novelty and broader foundational relevance: it provides a principled geometric decomposition (angle vs. norm) that explains and unifies additive and spherical activation steering across seven language models, yielding interpretable parameters and actionable guidance for model control and mechanistic understanding. This advances core methodology in interpretability/control with likely downstream influence on safety, alignment, and editing. Paper 2 is timely and application-driven, but appears as an engineering/benchmark-driven technical report whose contributions (multi-agent planning, recursion, rubric-guided TTO) are more incremental and may depend on specific platform details, limiting generalizability.

gpt-5.2·Jun 8, 2026
