Lingyong Yan, Can Xu, Yukun Zhao, Wenxuan Li, Qingyang Chen, Jiulong Wu, Wenli Song, Xiangnan Li
Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such tasks within a single agent, hallucination risk in long-form synthesis, and limited process auditability. This technical report presents DuMate-DeepResearch, a multi-agent DR framework built on the Qianfan Agent Foundry. The framework decouples the Agent Core, which handles task understanding, planning, and scheduling, from an extensible Tool Ecosystem for retrieval, evidence acquisition, and report rendering, making every intermediate decision and tool invocation explicitly traceable. Building on this infrastructure, DuMate-DeepResearch further introduces three mechanisms: (i) a graph-based dynamic planning strategy expands the research roadmap coarse-to-fine and continuously revises it through reflection, re-planning, backtracking, and parallel branching; (ii) a recursive two-level execution design delegates each complex search sub-task to an inner Search Agent that runs its own planning loop, isolating noisy retrieval and stabilizing long-horizon execution; (iii) a rubric-based test-time optimization mechanism dynamically generates task-specific quality criteria and uses them as live reasoning scaffolds for evidence-grounded synthesis and adaptive stopping. Across two deep research benchmarks, DuMate-DeepResearch establishes new state-of-the-art results: the best overall score (58.03%) on DeepResearch Bench, and the best overall score (61.95%) on DeepResearch Bench II while ranking first in information recall and analysis.
DuMate-DeepResearch presents a multi-agent framework for automated deep research, the task of producing comprehensive, evidence-grounded, long-form reports from complex open-ended queries. The system is built on the Qianfan Agent Foundry (Baidu AI Cloud) and introduces three interlocking mechanisms: (i) graph-based dynamic planning that maintains a DAG-structured research roadmap with coarse-to-fine expansion, reflection, backtracking, and parallel branching; (ii) recursive two-level execution in which an outer Research Agent delegates complex retrieval sub-tasks to inner Search Agents, each running its own planning-execution loop; and (iii) rubric-based test-time optimization that dynamically generates persistent and ephemeral quality criteria, which serve as live reasoning scaffolds during both planning and synthesis. The paper claims state-of-the-art results on DeepResearch Bench (58.03%) and DeepResearch Bench II (61.95%).
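To make the rubric-driven adaptive stopping idea concrete, the following is a minimal sketch and not the paper's implementation. The names `Rubric`, `unmet`, and `should_stop` are hypothetical. The sketch assumes that each rubric is a plain-text criterion, that some external judge has already decided which criteria are satisfied, and that a fixed step budget exists. The paper's distinction between persistent and ephemeral criteria is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """One task-specific quality criterion (hypothetical structure)."""
    criterion: str


def unmet(rubrics, satisfied):
    """Return the rubrics whose criterion has not yet been marked satisfied."""
    return [r for r in rubrics if r.criterion not in satisfied]


def should_stop(rubrics, satisfied, step, max_steps):
    """Stop when the step budget is exhausted or every rubric is satisfied.

    Assumption: `satisfied` is a set of criterion strings produced by an
    external judge (for example an LLM grader); that judge is out of scope.
    """
    if step >= max_steps:
        return True
    return not unmet(rubrics, satisfied)


rubrics = [
    Rubric("cites primary sources"),
    Rubric("covers counterarguments"),
]

# One criterion is still open, so research continues.
keep_going = should_stop(rubrics, {"cites primary sources"}, step=1, max_steps=5)

# Both criteria are met, so the loop can stop early.
all_met = should_stop(
    rubrics,
    {"cites primary sources", "covers counterarguments"},
    step=1,
    max_steps=5,
)
```

In a sketch like this, the list returned by `unmet` could also be fed back into planning as the set of gaps still to be researched, which is one plausible reading of "live reasoning scaffolds".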
The core problem addressed—orchestrating long-horizon, multi-step research workflows with auditability—is genuinely important. The proposed solution synthesizes several individually known ideas (DAG-based planning, hierarchical agents, rubric-guided generation) into a cohesive system architecture.
Strengths in formalization: The paper provides a clean state-transition formalization (Equations 1-2), algorithms for the main loop and planning, and a clear ready-frontier scheduling mechanism (Equation 3). The recursive agent nesting is well-defined with explicit depth bounding.
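As an illustration of what ready-frontier scheduling over a DAG-structured roadmap generally looks like, here is a minimal sketch. It is not a reproduction of the paper's Equation 3, which is not quoted in this review. The `Node` class, the helper names, and the example roadmap are all hypothetical. The assumption is that a node becomes ready once every node it depends on has finished, and that all ready nodes can be dispatched in parallel.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A research sub-task in the roadmap (hypothetical structure)."""
    name: str
    deps: set = field(default_factory=set)  # names of prerequisite nodes
    done: bool = False


def ready_frontier(nodes):
    """Return unfinished nodes whose prerequisites are all finished."""
    return sorted(
        name
        for name, node in nodes.items()
        if not node.done and all(nodes[d].done for d in node.deps)
    )


def run_in_waves(nodes):
    """Repeatedly take the ready frontier and mark it finished.

    Each wave is a set of sub-tasks that could run as parallel branches.
    Real execution, reflection, and re-planning are omitted from this sketch.
    """
    waves = []
    while True:
        frontier = ready_frontier(nodes)
        if not frontier:
            break  # everything finished, or the rest is blocked
        waves.append(frontier)
        for name in frontier:
            nodes[name].done = True
    return waves


def build_example():
    """A made-up four-node roadmap used only for illustration."""
    return {
        "scope": Node("scope"),
        "market": Node("market", {"scope"}),
        "tech": Node("tech", {"scope"}),
        "report": Node("report", {"market", "tech"}),
    }


waves = run_in_waves(build_example())
# waves == [["scope"], ["market", "tech"], ["report"]]
```

The second wave contains two independent nodes, which corresponds to the parallel branching the paper describes. A dynamic planner would additionally insert or remove nodes between waves, which this static sketch does not attempt.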
Weaknesses in experimental rigor: Several concerns limit confidence in the empirical claims:
Practical relevance: Deep research systems represent a commercially important application area, with products from OpenAI, Google, Perplexity, and others. The architectural patterns described—particularly hierarchical agent decomposition and rubric-guided generation—are practically useful design principles that practitioners can adopt.
Limited scientific novelty: The individual components draw heavily from established ideas: DAG-structured task planning has precedents in hierarchical task networks; multi-agent delegation is well-studied; and using evaluation criteria as generation guidance is an increasingly common pattern (DR-Tulu preceded this with rubric-based RL). The novelty lies primarily in their combination and engineering within a production system.
Auditability claim: The paper emphasizes auditability as a contribution, but the treatment is descriptive rather than rigorous. There is no formal definition of what constitutes an auditable trace, no user study on whether the traces are actually useful for debugging or trust-building, and no comparison of auditability against other systems.
The paper is highly timely. Deep research agents are an active frontier in 2025-2026, with numerous concurrent systems and benchmarks. The paper engages thoroughly with the rapidly growing literature. The benchmarks used are recent and relevant. The multi-agent, tool-augmented paradigm addresses real deployment needs.
However, the fast-moving nature of this field also means leaderboard positions are ephemeral. Several competing systems (iFlow-Researcher, ZTE Nebula, Xiaoyi, etc.) achieve scores within 1-2% on the same benchmarks, suggesting the field is converging rather than being transformed by any single system.
The paper reads more as a technical report/system description than a research paper with controlled experiments isolating causal contributions. The qualitative case studies, while informative, are cherry-picked demonstrations rather than systematic evaluations. The related work section is thorough but predominantly positions this work relative to open-source/academic systems while the primary competition appears to be commercial products.
The contribution of making prompts partially available (Appendix A) is appreciated but limited by the extensive redactions. The framework's tight coupling with Baidu's proprietary infrastructure limits external adoption.
Generated Jun 8, 2026
Paper 1 offers a highly novel intersection of LLMs and industrial control theory, rigorously defining the exact utility of LLMs as sample-efficient structural priors rather than mere optimizers. While Paper 2 presents a strong engineering framework for AI agents achieving SOTA results, Paper 1 provides deeper methodological insights and solves a notoriously difficult problem in physical systems, giving it broader cross-disciplinary scientific impact.
Paper 2 has higher potential impact because it proposes a new, domain-substantive mathematical conjecture (Neural Jacobian Conjecture) bridging classical algebraic geometry (Jacobian conjecture) and neural network theory, with partial proofs and multiple independent proof routes—creating a new research direction with cross-field relevance. Paper 1 is a strong engineering advance in multi-agent deep-research systems with benchmark SOTA, but its novelty is more incremental within a fast-moving area and likely to be quickly superseded. Paper 2’s conjecture-driven framing could catalyze broader, longer-lived follow-on work.
Paper 1 presents a generalizable, state-of-the-art multi-agent framework for automated deep research. By introducing recursive execution and rubric-based test-time optimization, it directly addresses critical bottlenecks in AI agent planning and synthesis, achieving top performance on key benchmarks. While Paper 2 provides valuable empirical insights regarding data dependency in a specific domain (drug-asset valuation), Paper 1 offers a foundational architectural advancement that is broadly applicable across multiple scientific and analytical domains, ensuring a wider and more transformative methodological impact.
Paper 2 presents a concrete, implemented multi-agent system (DuMate-DeepResearch) with novel architectural contributions (graph-based dynamic planning, recursive two-level execution, rubric-based optimization) and demonstrates state-of-the-art results on established benchmarks. Paper 1 is primarily a conceptual framework and literature survey for 'MetaAI recursive self-design' that acknowledges its protocol lacks experimental results. Paper 2 offers stronger methodological rigor, verified empirical contributions, and addresses practical limitations in deep research systems with broader near-term applicability.
Paper 1 offers a clearer, more rigorous algorithmic contribution: a retraining-free, plug-and-play inference-time method (PCI) that replaces costly gradient refinement with structure-aware projections and achieves measurable gains (quality, speed, memory) on large-scale TSP. This is timely for diffusion/consistency-model combinatorial optimization and has practical impact for operations research and ML, with transferable ideas (projection-based decoding, hybrid local search) to other discrete problems. Paper 2 is an engineering-heavy multi-agent framework with benchmark gains, but its novelty is more system integration and may be less generalizable and harder to reproduce rigorously.
DuMate-DeepResearch addresses a more broadly impactful problem—autonomous deep research with multi-agent systems—achieving state-of-the-art results on established benchmarks. Its contributions (graph-based dynamic planning, recursive search agents, rubric-grounded reasoning) are more immediately applicable across diverse research and industry settings. Paper 1 tackles a narrower problem (skill construction from traces) with a more incremental contribution and evaluation on only 70 skills. Paper 2's auditability focus and benchmark-leading results position it for wider adoption and citation impact in the rapidly growing agentic AI field.
Paper 2 presents a concrete, implemented multi-agent system with empirical state-of-the-art results on established benchmarks, demonstrating immediate practical impact in the rapidly growing Deep Research paradigm. Paper 1, while intellectually compelling in arguing for ante-hoc probabilistic mediation via Bayesian networks, remains a conceptual/position paper outlining a framework without implementation or empirical validation. Paper 2's specific architectural innovations (recursive search agents, rubric-grounded reasoning, graph-based planning) are immediately actionable and reproducible, giving it broader near-term scientific influence despite Paper 1's important theoretical contribution to AI accountability.
Paper 1 introduces a more novel, generalizable methodological contribution: integrating uncertainty quantification directly into RL reward shaping to preserve uncertainty separation and reduce overconfident tool-use errors. This targets a foundational learning dynamics issue likely applicable beyond tool-calling (e.g., decision-making, exploration, calibration), offering broad cross-field impact and stronger scientific novelty. Paper 2 is a capable systems/engineering advance with benchmark gains and auditability, but its contributions are more architectural and platform-specific, with less clear methodological generality and rigor compared to a principled learning objective.
Paper 1 is more likely to yield higher scientific impact: it introduces a new UAV navigation benchmark targeting partial observability in dense urban “canyons” and proposes a world-model-based VLA architecture that tightly couples future video imagination with action via flow matching—advancing embodied autonomy under real physical constraints. This has clear downstream applications in robotics (inspection, delivery, search-and-rescue) and broader relevance to model-based RL and embodied AI. Paper 2 is timely and useful for LLM systems engineering, but appears more incremental (agent orchestration, auditability, planning heuristics) and may be less methodologically universal.
Paper 1 offers higher scientific impact due to clearer conceptual novelty and broader foundational relevance: it provides a principled geometric decomposition (angle vs. norm) that explains and unifies additive and spherical activation steering across seven language models, yielding interpretable parameters and actionable guidance for model control and mechanistic understanding. This advances core methodology in interpretability/control with likely downstream influence on safety, alignment, and editing. Paper 2 is timely and application-driven, but appears as an engineering/benchmark-driven technical report whose contributions (multi-agent planning, recursion, rubric-guided TTO) are more incremental and may depend on specific platform details, limiting generalizability.