
SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen

cs.AI
#1697 of 3489 · Artificial Intelligence
Tournament Score: 1402 ± 45 (range shown: 1050 to 1800)
Win Rate: 50% (8 wins, 8 losses, 16 matches)
Rating: 7/10
Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SearchSwarm

1. Core Contribution

SearchSwarm introduces the concept of delegation intelligence — the ability of an LLM agent to decompose complex tasks, delegate subtasks to independent subagents, provide comprehensive briefs, and integrate returned results. The paper's main contributions are threefold: (1) a harness design that elicits high-quality delegation behavior at inference time through four principles (encouraging delegation, comprehensive briefing, retaining core judgment, citation-grounded reporting); (2) a method for synthesizing supervised fine-tuning (SFT) data from harness-guided trajectories; and (3) a resulting model (SearchSwarm-30B-A3B) that achieves state-of-the-art results among comparable-scale models on multiple deep research benchmarks.

The framing of delegation as active context management — where the model intelligently compresses information by dispatching work to fresh-context subagents and receiving only condensed reports — is conceptually clean and well-articulated. The paper correctly identifies that this is functionally a single-model system where the same weights serve both main agent and subagent roles, distinguishing it from true multi-agent systems.
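The context-management framing can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's harness: `Agent`, `toy_model`, and the message strings are hypothetical, and only the tool name `call_sub_agent` is taken from the paper. The same callable plays both roles, the subagent starts from an empty context, and only its final report is written back to the main agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical stand-in for an LLM call: full context in, one reply out.
Model = Callable[[List[str]], str]


@dataclass
class Agent:
    model: Model
    context: List[str] = field(default_factory=list)

    def step(self, message: str) -> str:
        """Append a message, query the model on the full context, store the reply."""
        self.context.append(message)
        reply = self.model(self.context)
        self.context.append(reply)
        return reply


def call_sub_agent(model: Model, brief: str, work_steps: int = 3) -> Tuple[str, int]:
    """Run a brief in a fresh context; return (report, subagent context size)."""
    sub = Agent(model)  # same weights, empty context, no delegation tool of its own
    sub.step(brief)
    for i in range(work_steps):  # simulated tool results accumulating in the subagent
        sub.step(f"tool result {i}")
    report = sub.step("Summarize your findings in one short report.")
    return report, len(sub.context)


def toy_model(context: List[str]) -> str:
    # Placeholder model: replies with the size of the context it was shown.
    return f"reply after {len(context)} messages"


main = Agent(toy_model)
main.step("Question: a multi-part research task")
report, sub_size = call_sub_agent(toy_model, "Brief: investigate part A; cite sources.")
main.context.append(f"subagent report: {report}")  # only the report reaches the main agent

print(len(main.context), sub_size)
```

In this toy run the subagent accumulates ten context entries while the main agent's context grows by a single entry, which is the compression effect the review describes.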

2. Methodological Rigor

The experimental evaluation is relatively thorough, spanning four short-answer benchmarks (BrowseComp, BrowseComp-ZH, GAIA, xbench-DeepSearch) and four open-ended benchmarks (ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench). The comparison set is comprehensive, including closed-source frontier models, large open-source models, and lightweight models at the same 30B-A3B scale.

Strengths in rigor:

  • The ablation study (Section 3.3) with DeepSeek V3.2 demonstrates that the full harness (+10.0 over base) substantially outperforms simply providing the delegation tool (+2.3), validating the harness design principles.
  • The "Tongyi DR Swarm" experiment elegantly shows that the base model never invokes `call_sub_agent` without fine-tuning, confirming that delegation behavior must be explicitly trained.
  • Training on a different base model (Qwen3-30B-A3B-Thinking-2507) demonstrates data transferability.
  • Single-agent generalization experiments (Section 3.5) show improvements even without the delegation tool, suggesting the training teaches broader investigative skills.

Weaknesses in rigor:

  • The ablation study uses only a 200-question subset, and some comparisons are acknowledged as approximate (e.g., RedSearcher's subset vs. full benchmark).
  • There is no detailed ablation of individual harness principles (e.g., removing comprehensive briefing alone, or removing citation requirements alone), making it hard to attribute gains to specific design choices.
  • The paper lacks analysis of failure modes — when does delegation hurt? Are there task types where the overhead of delegation outweighs benefits?
  • The filtering criterion for training data (correct answers only) introduces survivorship bias; there is no discussion of how much data was discarded or whether negative examples could be beneficial.
  • Cost analysis is absent — how many total tokens/API calls does SearchSwarm consume compared to single-agent baselines? The delegation paradigm likely uses significantly more compute.
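The survivorship-bias concern is easier to assess when the discard rate is reported alongside the filter. The sketch below is hypothetical: the `Trajectory` fields and the normalized exact-match check are assumptions made for illustration, and the paper's actual correctness check may differ (for example, a model-based judge).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Trajectory:
    question_id: str
    final_answer: str
    gold_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def filter_correct(trajectories: List[Trajectory]) -> Tuple[List[Trajectory], float]:
    """Keep trajectories whose final answer matches the gold answer.

    Returns the kept trajectories and the fraction that was discarded.
    Exact match after normalization is an assumption of this sketch.
    """
    kept = [
        t for t in trajectories
        if normalize(t.final_answer) == normalize(t.gold_answer)
    ]
    discard_rate = 1 - len(kept) / len(trajectories) if trajectories else 0.0
    return kept, discard_rate


sample = [
    Trajectory("q1", "Paris", "paris"),
    Trajectory("q2", "1987", "1987"),
    Trajectory("q3", "Lake  Geneva", "lake geneva"),
    Trajectory("q4", "Berlin", "Vienna"),
]
kept, discard_rate = filter_correct(sample)
print(len(kept), discard_rate)
```

Reporting `discard_rate` per benchmark or per difficulty bucket would show whether the retained data skews toward easier questions.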

3. Potential Impact

The paper addresses a genuine and growing need: as LLM agents tackle increasingly complex, long-horizon tasks, context management becomes a critical bottleneck. The delegation paradigm offers a principled solution that scales naturally: adding more subagent calls extends effective context without engineering longer windows.

Practical impact: The open-source release of harness, model weights, and training data lowers the barrier for the community to build on this work. The approach is model-agnostic (demonstrated on two base models) and the harness design principles are transferable.

Broader implications: The delegation intelligence concept could extend beyond deep research to software engineering (SWE-bench style tasks), scientific discovery, and enterprise workflows. The paper's insight that training on delegation trajectories improves even single-agent performance suggests that the investigative structure itself (decomposition, hypothesis management, evidence verification) is a valuable inductive bias.

4. Timeliness & Relevance

This paper is highly timely. The field is rapidly moving from single-turn chat to multi-step agentic workflows, and context management is an active bottleneck. Several concurrent efforts (Kimi K2.5's Agent Swarm, Anthropic's multi-agent research system, Step 3.5 Flash) explore similar territory but lack open-source recipes. SearchSwarm fills a clear gap by providing a complete, reproducible pipeline.

The benchmark results are current (comparing against GPT-5.2, Claude 4.5 Opus, Gemini 3.0 Pro), placing this firmly in the 2025-2026 frontier.

5. Strengths & Limitations

Key Strengths:

  • Complete recipe: Unlike prior work that describes architecture or training algorithms in isolation, this paper provides harness design, data synthesis, filtering, and training — a full pipeline.
  • Strong empirical results: 68.1 on BrowseComp with a 3B-active-parameter model competing with 37B+ active parameter models is impressive.
  • Elegant conceptual framing: Reframing delegation as context management unifies the approach with prior work and makes fair comparisons possible.
  • Generalization evidence: The model's transfer to single-agent settings and open-ended tasks (without open-ended training data) is noteworthy.
  • Detailed case study (Appendix C): The walkthrough convincingly demonstrates all four harness principles in action.

Notable Limitations:

  • SFT only, no RL: The paper uses only supervised fine-tuning. Reinforcement learning (as in Kimi's Agent Swarm) could potentially yield stronger delegation policies, and the paper doesn't explore this.
  • Single-level delegation: Subagents cannot invoke `call_sub_agent`, limiting depth to one level. For truly complex tasks, hierarchical delegation might be necessary.
  • No cost-performance analysis: Without reporting token usage or latency, it's impossible to judge the efficiency tradeoff.
  • Limited novelty in individual components: The harness principles (comprehensive briefing, citation grounding) are well-known best practices in prompt engineering; the novelty lies in their systematization and use for data synthesis.
  • Benchmark limitations: BrowseComp-family benchmarks test factual retrieval rather than deeper reasoning; performance on more reasoning-intensive tasks would strengthen the contribution.

6. Additional Observations

The behavioral analysis (Figure 3) revealing that the main agent primarily uses `visit` (for verification) while subagents primarily use `search` (for exploration) is an interesting emergent specialization that validates the design. The observation that incorrectly answered questions show flatter, higher subagent-call distributions suggests the model has learned when to persist vs. when to stop, though this could also simply reflect harder questions requiring more exploration regardless of strategy quality.
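A role-level tool-usage breakdown of this kind can be computed directly from logged tool calls. The sketch below assumes a hypothetical log format of `(role, tool)` pairs and uses made-up sample data; it does not reproduce the figures behind Figure 3.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def tool_usage_by_role(calls: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, float]]:
    """Return, for each role, the share of its tool calls that went to each tool."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for role, tool in calls:
        counts[role][tool] += 1
    return {
        role: {tool: n / sum(c.values()) for tool, n in c.items()}
        for role, c in counts.items()
    }


# Illustrative log only: roles and tool names follow the review, counts are invented.
log = [
    ("main", "visit"), ("main", "visit"), ("main", "search"), ("main", "call_sub_agent"),
    ("sub", "search"), ("sub", "search"), ("sub", "search"), ("sub", "visit"),
]
shares = tool_usage_by_role(log)
print(shares["main"]["visit"], shares["sub"]["search"])
```

Splitting the same tally by question correctness would help separate the two explanations offered above (learned stopping behavior versus question difficulty).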

The paper is well-written and clearly structured, though the "preliminary exploration" framing somewhat undersells the contribution given the strong empirical results.

Rating: 7/10
Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Generated Jun 9, 2026

Comparison History (16)

Won vs. Superficial Beliefs in LLM Decision-Making

    Paper 1 introduces a practical, open-source approach to delegation intelligence in agentic LLMs, addressing a critical bottleneck in long-horizon tasks like deep research. Its actionable methodology, SOTA results, and release of models and data will likely drive immediate follow-up research and real-world applications. While Paper 2 provides valuable theoretical insights into LLM interpretability, Paper 1's direct contribution to scalable, autonomous AI systems offers broader and more immediate technological impact.

    gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. ComplexConstraints and Beyond: Expert Rubrics for RLVR

    Paper 1 is likely higher impact due to a more broadly applicable and methodologically grounded contribution: a principled framework for expert rubric construction, a new dataset with fine-grained atomic criteria, and evidence that rubrics improve both evaluation fidelity and RL training across domains with measurable transfer to multiple OOD benchmarks. This advances a core bottleneck (reliable evaluation/training signals) relevant to most LLM development. Paper 2 is timely and useful for agentic delegation, but is framed as preliminary, more domain-specific (deep research/browsing), and its gains may depend on a particular harness design.

    gpt-5.2·Jun 9, 2026

Lost vs. Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

    Paper 2 has higher potential impact due to a clearer, broadly applicable architectural insight (vision-token saturation and depth-asymmetric processing) and a simple, parameter-efficient method (late-layer fusion routing) that can reduce compute while preserving performance across many MLLM variants and deployment settings. This targets a timely bottleneck—multimodal inference/training efficiency—relevant to both academia and industry, and the analysis-to-design linkage suggests strong methodological rigor. Paper 1 is valuable for agent training/data synthesis, but its impact may be narrower and more sensitive to task setup and evaluation benchmarks.

    gpt-5.2·Jun 9, 2026

Lost vs. WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

    Paper 1 introduces a comprehensive benchmark for hybrid-interface computer-use agents, a rapidly expanding frontier in AI. By providing a rigorous evaluation framework and exposing a significant performance gap (41.2% pass rate), it is likely to become a standard testing ground, driving broad methodological advancements and accumulating high citations across the agentic AI community.

    gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG

    FIDES addresses a critical and ubiquitous problem in LLMs (retrieval-memory conflict in RAG) with a novel, training-free token-level contrastive decoding method. Its ability to significantly improve context fidelity across multiple model backbones up to 70B without requiring fine-tuning offers broader immediate applicability, real-world utility, and methodological rigor compared to Paper 1's more exploratory focus on agentic task delegation.

    gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. PRISM: Recovering Instruction Sets from Language Model Activations

    Paper 1 addresses a critical and fundamental problem in AI safety and interpretability: extracting the actual instructions driving an LLM's behavior directly from its hidden states. This approach provides a novel solution to urgent security challenges like prompt injection and hidden objectives. While Paper 2 offers a valuable engineering contribution for long-horizon tasks via agent delegation, Paper 1's deep dive into model internals and its broad implications for trustworthy AI deployment give it a higher potential for foundational scientific impact.

    gemini-3.1-pro-preview·Jun 9, 2026

Won vs. IMUG-Bench: Benchmarking Unified Multimodal Models on Interleaved Understanding and Generation

    Paper 2 addresses a highly critical bottleneck in current AI: long-horizon agentic workflows and finite context windows. By introducing a methodology to synthesize training data for 'delegation intelligence' in multi-agent systems, it directly impacts the rapidly growing field of AI agents and 'deep research.' While Paper 1 provides a valuable benchmark for multimodal models, Paper 2's focus on internalizing multi-agent delegation into model weights represents a more profound methodological innovation with broader applications across all complex, real-world LLM tasks.

    gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

    Paper 1 explores recursive self-design, a foundational step toward artificial general intelligence. By providing a clear evaluation framework and reproducible protocol for AI systems that can modify their own design space, it addresses a highly profound, paradigm-shifting concept. While Paper 2 offers strong empirical results in agentic delegation, Paper 1's focus on self-improving AI has broader, more transformative long-term implications across the entire field of AI research and safety.

    gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. SIGA: Self-Evolving Coding-Agent Adapters for Scientific Simulation

    Paper 1 directly accelerates complex scientific workflows by adapting AI agents to operate domain-specific simulators (e.g., GEOS, OpenFOAM, LAMMPS). This provides immediate, measurable value (e.g., 36x speedups) to computational scientists across physics and chemistry. While Paper 2 offers a valuable foundational AI framework for long-horizon tasks, Paper 1 has a more direct, applied, and transformative impact specifically on the execution of scientific research.

    gemini-3.1-pro-preview·Jun 9, 2026

Won vs. Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

    Paper 2 has higher impact potential: it proposes a training methodology (harness-guided delegation trajectories for SFT) that directly improves long-horizon agent performance under finite context, with strong benchmark gains and planned releases of model/weights/data enabling adoption. Its applications (scalable research agents, tool-using systems, enterprise workflows) are broad and timely as multi-agent LLM systems proliferate. Paper 1 is methodologically useful and insightful for evaluation/feedback limits, but is primarily diagnostic/benchmarking and may have narrower downstream impact than a broadly applicable capability-training approach.

    gpt-5.2·Jun 9, 2026