SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen

Jun 8, 2026 · arXiv:2606.09730v1

cs.AI

#1697 of 3489 · Artificial Intelligence

Tournament Score: 1402 ± 45

Win Rate: 50%

Rating: 7/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Abstract

Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SearchSwarm

1. Core Contribution

SearchSwarm introduces the concept of delegation intelligence — the ability of an LLM agent to decompose complex tasks, delegate subtasks to independent subagents, provide comprehensive briefs, and integrate returned results. The paper's main contributions are threefold: (1) a harness design that elicits high-quality delegation behavior at inference time through four principles (encouraging delegation, comprehensive briefing, retaining core judgment, citation-grounded reporting); (2) a method for synthesizing supervised fine-tuning (SFT) data from harness-guided trajectories; and (3) a resulting model (SearchSwarm-30B-A3B) that achieves state-of-the-art results among comparable-scale models on multiple deep research benchmarks.

The framing of delegation as active context management — where the model intelligently compresses information by dispatching work to fresh-context subagents and receiving only condensed reports — is conceptually clean and well-articulated. The paper correctly identifies that this is functionally a single-model system where the same weights serve both main agent and subagent roles, distinguishing it from true multi-agent systems.
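The context-management framing can be illustrated with a toy sketch. This is not the paper's harness: the `Agent` class, `fake_explore`, the word-count proxy for tokens, and the truncation used in place of a model-written summary are all assumptions made for illustration. Only the tool name `call_sub_agent` comes from the paper. The sketch shows that raw evidence accumulates in the subagent's fresh context while the main agent's context grows only by the length of each returned report.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Agent:
    """Toy agent whose context is a list of text chunks."""
    context: List[str] = field(default_factory=list)

    def context_size(self) -> int:
        # Whitespace word count stands in for a real token count.
        return sum(len(chunk.split()) for chunk in self.context)


def fake_explore(brief: str) -> List[str]:
    """Hypothetical stand-in for search/visit calls: three 500-word pages."""
    return ["word " * 500 for _ in range(3)]


def call_sub_agent(brief: str,
                   explore: Callable[[str], List[str]] = fake_explore,
                   report_words: int = 20) -> Tuple[str, int]:
    """Run a fresh-context subagent; return (condensed report, words consumed)."""
    sub = Agent(context=[brief])        # fresh context: only the brief
    sub.context.extend(explore(brief))  # raw evidence stays with the subagent
    evidence = " ".join(sub.context[1:]).split()
    # Truncation is a placeholder for a model-written summary.
    report = " ".join(evidence[:report_words])
    return report, sub.context_size()


def delegate_all(main: Agent, briefs: List[str]) -> int:
    """Dispatch each brief; the main context grows only by the reports."""
    subagent_words = 0
    for brief in briefs:
        report, used = call_sub_agent(brief)
        main.context.append(report)
        subagent_words += used
    return subagent_words


main = Agent()
spent = delegate_all(main, ["find the founding year", "verify the founder name"])
print(main.context_size(), spent)  # prints: 40 3008
```

In this toy run the main agent holds 40 words while the two subagents together processed 3008, which is the compression effect the framing describes. It also makes visible that total work does not shrink; it is moved out of the main context.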

2. Methodological Rigor

The experimental evaluation is relatively thorough, spanning four short-answer benchmarks (BrowseComp, BrowseComp-ZH, GAIA, xbench-DeepSearch) and four open-ended benchmarks (ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench). The comparison set is comprehensive, including closed-source frontier models, large open-source models, and lightweight models at the same 30B-A3B scale.

Strengths in rigor:

The ablation study (Section 3.3) with DeepSeek V3.2 demonstrates that the full harness (+10.0 over base) substantially outperforms simply providing the delegation tool (+2.3), validating the harness design principles.

The "Tongyi DR Swarm" experiment elegantly shows that the base model never invokes `call_sub_agent` without fine-tuning, confirming that delegation behavior must be explicitly trained.

Training on a different base model (Qwen3-30B-A3B-Thinking-2507) demonstrates data transferability.

Single-agent generalization experiments (Section 3.5) show improvements even without the delegation tool, suggesting the training teaches broader investigative skills.

Weaknesses in rigor:

The ablation study uses only a 200-question subset, and some comparisons are acknowledged as approximate (e.g., RedSearcher's subset vs. full benchmark).

There is no detailed ablation of individual harness principles (e.g., removing comprehensive briefing alone, or removing citation requirements alone), making it hard to attribute gains to specific design choices.

The paper lacks analysis of failure modes — when does delegation hurt? Are there task types where the overhead of delegation outweighs benefits?

The filtering criterion for training data (correct answers only) introduces survivorship bias; there is no discussion of how much data was discarded or whether negative examples could be beneficial.

Cost analysis is absent — how many total tokens/API calls does SearchSwarm consume compared to single-agent baselines? The delegation paradigm likely uses significantly more compute.
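The last two weaknesses point at numbers that would be cheap to report. As a rough sketch of what that reporting could look like, the snippet below filters trajectories by answer correctness, reports the discard rate, and sums token usage across the main agent and its subagents. The `Trajectory` fields, the exact-match check in `normalize`, and the sample values are hypothetical; the paper's actual correctness judge and logging format are not described in this review.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Trajectory:
    question_id: str
    predicted: str
    gold: str
    main_tokens: int      # tokens consumed by the main agent
    subagent_tokens: int  # tokens consumed across all subagent calls


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed, simplistic matcher)."""
    return " ".join(answer.lower().split())


def filter_correct(trajs: Iterable[Trajectory]) -> Tuple[List[Trajectory], float]:
    """Keep only correctly answered trajectories; also return the discard rate."""
    trajs = list(trajs)
    kept = [t for t in trajs if normalize(t.predicted) == normalize(t.gold)]
    discard_rate = 1 - len(kept) / len(trajs) if trajs else 0.0
    return kept, discard_rate


def total_tokens(t: Trajectory) -> int:
    """Full cost of one trajectory, including delegated work."""
    return t.main_tokens + t.subagent_tokens


# Made-up sample data, for illustration only.
samples = [
    Trajectory("q1", "Paris", "paris ", 4000, 30000),
    Trajectory("q2", "1999", "2001", 6000, 52000),
    Trajectory("q3", "Ada Lovelace", "ada  lovelace", 3500, 21000),
]
kept, discard_rate = filter_correct(samples)
mean_cost = sum(total_tokens(t) for t in kept) / len(kept)
print(len(kept), round(discard_rate, 2), mean_cost)
```

Reporting the discard rate alongside the mean per-trajectory cost would let readers judge both the extent of the survivorship bias and the compute overhead relative to single-agent baselines.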

3. Potential Impact

The paper addresses a genuine and growing need: as LLM agents tackle increasingly complex, long-horizon tasks, context management becomes a critical bottleneck. The delegation paradigm offers a principled solution that scales naturally — adding more subagent calls extends effective context without engineering longer windows.

Practical impact: The open-source release of harness, model weights, and training data lowers the barrier for the community to build on this work. The approach is model-agnostic (demonstrated on two base models) and the harness design principles are transferable.

Broader implications: The delegation intelligence concept could extend beyond deep research to software engineering (SWE-bench style tasks), scientific discovery, and enterprise workflows. The paper's insight that training on delegation trajectories improves even single-agent performance suggests the investigative structure itself — decomposition, hypothesis management, evidence verification — is a valuable inductive bias.

4. Timeliness & Relevance

This paper is highly timely. The field is rapidly moving from single-turn chat to multi-step agentic workflows, and context management is an active bottleneck. Several concurrent efforts (Kimi K2.5's Agent Swarm, Anthropic's multi-agent research system, Step 3.5 Flash) explore similar territory but lack open-source recipes. SearchSwarm fills a clear gap by providing a complete, reproducible pipeline.

The benchmark results are current (comparing against GPT-5.2, Claude 4.5 Opus, Gemini 3.0 Pro), placing this firmly in the 2025-2026 frontier.

5. Strengths & Limitations

Key Strengths:

Complete recipe: Unlike prior work that describes architecture or training algorithms in isolation, this paper provides harness design, data synthesis, filtering, and training — a full pipeline.

Strong empirical results: 68.1 on BrowseComp with a 3B-active-parameter model competing with 37B+ active parameter models is impressive.

Elegant conceptual framing: Reframing delegation as context management unifies the approach with prior work and makes fair comparisons possible.

Generalization evidence: The model's transfer to single-agent settings and open-ended tasks (without open-ended training data) is noteworthy.

Detailed case study (Appendix C): The walkthrough convincingly demonstrates all four harness principles in action.

Notable Limitations:

SFT only, no RL: The paper uses only supervised fine-tuning. Reinforcement learning (as in Kimi's Agent Swarm) could potentially yield stronger delegation policies, and the paper doesn't explore this.

Single-level delegation: Subagents cannot invoke `call_sub_agent`, limiting depth to one level. For truly complex tasks, hierarchical delegation might be necessary.

No cost-performance analysis: Without reporting token usage or latency, it's impossible to judge the efficiency tradeoff.

Limited novelty in individual components: The harness principles (comprehensive briefing, citation grounding) are well-known best practices in prompt engineering; the novelty lies in their systematization and use for data synthesis.

Benchmark limitations: BrowseComp-family benchmarks test factual retrieval rather than deeper reasoning; performance on more reasoning-intensive tasks would strengthen the contribution.
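The single-level delegation constraint noted above can be pictured as a role-dependent tool registry. The sketch below is an assumption about how such a constraint might be enforced, not the paper's implementation: `build_tools` and the stub tool bodies are hypothetical, and only the tool names `search`, `visit`, and `call_sub_agent` appear in the paper.

```python
from typing import Callable, Dict


def search(query: str) -> str:
    """Stub search tool (hypothetical body)."""
    return f"results for {query}"


def visit(url: str) -> str:
    """Stub page-visit tool (hypothetical body)."""
    return f"content of {url}"


def build_tools(role: str) -> Dict[str, Callable[[str], str]]:
    """Return the tools available to a role; only the main agent may delegate."""
    if role not in ("main", "subagent"):
        raise ValueError(f"unknown role: {role}")
    tools: Dict[str, Callable[[str], str]] = {"search": search, "visit": visit}
    if role == "main":
        tools["call_sub_agent"] = call_sub_agent
    return tools


def call_sub_agent(brief: str) -> str:
    """Run a subagent with a tool set that excludes call_sub_agent."""
    sub_tools = build_tools("subagent")
    # A real subagent would loop over model-chosen tool calls; here it
    # performs one search and returns a short report.
    evidence = sub_tools["search"](brief)
    return f"report: {evidence}"


main_tools = build_tools("main")
print(sorted(main_tools))               # ['call_sub_agent', 'search', 'visit']
print(sorted(build_tools("subagent")))  # ['search', 'visit']
```

Hierarchical delegation would amount to relaxing the role check, for example by passing a remaining-depth counter instead of a fixed role, at the cost of harder-to-bound compute.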

6. Additional Observations

The behavioral analysis (Figure 3) revealing that the main agent primarily uses `visit` (for verification) while subagents primarily use `search` (for exploration) is an interesting emergent specialization that validates the design. The observation that incorrectly answered questions show flatter, higher subagent-call distributions suggests the model has learned when to persist vs. when to stop — though this could also simply reflect harder questions requiring more exploration regardless of strategy quality.

The paper is well-written and clearly structured, though the "preliminary exploration" framing somewhat undersells the contribution given the strong empirical results.

Rating: 7/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 8

Generated Jun 9, 2026

Comparison History (16)

Won vs. Superficial Beliefs in LLM Decision-Making

Paper 1 introduces a practical, open-source approach to delegation intelligence in agentic LLMs, addressing a critical bottleneck in long-horizon tasks like deep research. Its actionable methodology, SOTA results, and release of models and data will likely drive immediate follow-up research and real-world applications. While Paper 2 provides valuable theoretical insights into LLM interpretability, Paper 1's direct contribution to scalable, autonomous AI systems offers broader and more immediate technological impact.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. ComplexConstraints and Beyond: Expert Rubrics for RLVR

Paper 1 is likely higher impact due to a more broadly applicable and methodologically grounded contribution: a principled framework for expert rubric construction, a new dataset with fine-grained atomic criteria, and evidence that rubrics improve both evaluation fidelity and RL training across domains with measurable transfer to multiple OOD benchmarks. This advances a core bottleneck (reliable evaluation/training signals) relevant to most LLM development. Paper 2 is timely and useful for agentic delegation, but is framed as preliminary, more domain-specific (deep research/browsing), and its gains may depend on a particular harness design.

gpt-5.2 · Jun 9, 2026

Lost vs. Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

Paper 2 has higher potential impact due to a clearer, broadly applicable architectural insight (vision-token saturation and depth-asymmetric processing) and a simple, parameter-efficient method (late-layer fusion routing) that can reduce compute while preserving performance across many MLLM variants and deployment settings. This targets a timely bottleneck—multimodal inference/training efficiency—relevant to both academia and industry, and the analysis-to-design linkage suggests strong methodological rigor. Paper 1 is valuable for agent training/data synthesis, but its impact may be narrower and more sensitive to task setup and evaluation benchmarks.

gpt-5.2 · Jun 9, 2026

Lost vs. WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Paper 1 introduces a comprehensive benchmark for hybrid-interface computer-use agents, a rapidly expanding frontier in AI. By providing a rigorous evaluation framework and exposing a significant performance gap (41.2% pass rate), it is likely to become a standard testing ground, driving broad methodological advancements and accumulating high citations across the agentic AI community.