Pu Ning, Quan Chen, Kun Tao, Xinyu Tang, Tianshu Wang, Qianggang Cao, Xinyu Kong, Zujie Wen
Large language models are increasingly expected to handle complex, long-horizon real-world tasks whose context demands can grow without bound, yet model context windows remain inherently finite. Recent work explores a paradigm where a main agent decomposes tasks and dispatches subtasks to subagents, which execute and return only summarized results, conserving the main agent's context budget. However, performing this well requires delegation intelligence: the ability to decompose complex tasks, determine when and what to delegate, and integrate returned results into the ongoing workflow. Training data for this capability is scarce in naturally occurring text, and to our knowledge, how to synthesize such data and train models to acquire this capability remains largely unexplored in the open-source community. To bridge this gap, we present a preliminary exploration targeting deep research, a representative long-horizon agent task. Specifically, we design a harness that guides the model toward high-quality task decomposition and delegation, while constraining subagents to return results properly to support the main agent's workflow. The harness-guided trajectories naturally encode correct delegation decisions, which we use as supervised fine-tuning data to internalize delegation intelligence into model weights. Our resulting model, SearchSwarm-30B-A3B, achieves 68.1 on BrowseComp and 73.3 on BrowseComp-ZH, the best results among all models of comparable scale. We will release our harness, model weights, and training data to facilitate future research.
SearchSwarm introduces the concept of delegation intelligence — the ability of an LLM agent to decompose complex tasks, delegate subtasks to independent subagents, provide comprehensive briefs, and integrate returned results. The paper's main contributions are threefold: (1) a harness design that elicits high-quality delegation behavior at inference time through four principles (encouraging delegation, comprehensive briefing, retaining core judgment, citation-grounded reporting); (2) a method for synthesizing supervised fine-tuning (SFT) data from harness-guided trajectories; and (3) a resulting model (SearchSwarm-30B-A3B) that achieves state-of-the-art results among comparable-scale models on multiple deep research benchmarks.
The framing of delegation as active context management — where the model intelligently compresses information by dispatching work to fresh-context subagents and receiving only condensed reports — is conceptually clean and well-articulated. The paper correctly identifies that this is functionally a single-model system where the same weights serve both main agent and subagent roles, distinguishing it from true multi-agent systems.
The experimental evaluation is relatively thorough, spanning four short-answer benchmarks (BrowseComp, BrowseComp-ZH, GAIA, xbench-DeepSearch) and four open-ended benchmarks (ScholarQA-v2, HealthBench, ResearchQA, DeepResearchBench). The comparison set is comprehensive, including closed-source frontier models, large open-source models, and lightweight models at the same 30B-A3B scale.
The paper addresses a genuine and growing need: as LLM agents tackle increasingly complex, long-horizon tasks, context management becomes a critical bottleneck. The delegation paradigm offers a principled solution that scales naturally — adding more subagent calls extends effective context without engineering longer windows.
Practical impact: The open-source release of harness, model weights, and training data lowers the barrier for the community to build on this work. The approach is model-agnostic (demonstrated on two base models) and the harness design principles are transferable.
Broader implications: The delegation intelligence concept could extend beyond deep research to software engineering (SWE-bench style tasks), scientific discovery, and enterprise workflows. The paper's insight that training on delegation trajectories improves even single-agent performance suggests the investigative structure itself — decomposition, hypothesis management, evidence verification — is a valuable inductive bias.
This paper is highly timely. The field is rapidly moving from single-turn chat to multi-step agentic workflows, and context management is an active bottleneck. Several concurrent efforts (Kimi K2.5's Agent Swarm, Anthropic's multi-agent research system, Step 3.5 Flash) explore similar territory but lack open-source recipes. SearchSwarm fills a clear gap by providing a complete, reproducible pipeline.
The benchmark results are current (comparing against GPT-5.2, Claude 4.5 Opus, Gemini 3.0 Pro), placing this firmly in the 2025-2026 frontier.
The behavioral analysis (Figure 3) revealing that the main agent primarily uses `visit` (for verification) while subagents primarily use `search` (for exploration) is an interesting emergent specialization that validates the design. The observation that incorrectly answered questions show flatter, higher subagent-call distributions suggests the model has learned when to persist vs. when to stop — though this could also simply reflect harder questions requiring more exploration regardless of strategy quality.
The paper is well-written and clearly structured, though the "preliminary exploration" framing somewhat undersells the contribution given the strong empirical results.
Generated Jun 9, 2026
Paper 1 introduces a practical, open-source approach to delegation intelligence in agentic LLMs, addressing a critical bottleneck in long-horizon tasks like deep research. Its actionable methodology, SOTA results, and release of models and data will likely drive immediate follow-up research and real-world applications. While Paper 2 provides valuable theoretical insights into LLM interpretability, Paper 1's direct contribution to scalable, autonomous AI systems offers broader and more immediate technological impact.
Paper 1 is likely higher impact due to a more broadly applicable and methodologically grounded contribution: a principled framework for expert rubric construction, a new dataset with fine-grained atomic criteria, and evidence that rubrics improve both evaluation fidelity and RL training across domains with measurable transfer to multiple OOD benchmarks. This advances a core bottleneck (reliable evaluation/training signals) relevant to most LLM development. Paper 2 is timely and useful for agentic delegation, but is framed as preliminary, more domain-specific (deep research/browsing), and its gains may depend on a particular harness design.
Paper 2 has higher potential impact due to a clearer, broadly applicable architectural insight (vision-token saturation and depth-asymmetric processing) and a simple, parameter-efficient method (late-layer fusion routing) that can reduce compute while preserving performance across many MLLM variants and deployment settings. This targets a timely bottleneck—multimodal inference/training efficiency—relevant to both academia and industry, and the analysis-to-design linkage suggests strong methodological rigor. Paper 1 is valuable for agent training/data synthesis, but its impact may be narrower and more sensitive to task setup and evaluation benchmarks.
Paper 1 introduces a comprehensive benchmark for hybrid-interface computer-use agents, a rapidly expanding frontier in AI. By providing a rigorous evaluation framework and exposing a significant performance gap (41.2% pass rate), it is likely to become a standard testing ground, driving broad methodological advancements and accumulating high citations across the agentic AI community.
FIDES addresses a critical and ubiquitous problem in LLMs (retrieval-memory conflict in RAG) with a novel, training-free token-level contrastive decoding method. Its ability to significantly improve context fidelity across multiple model backbones up to 70B without requiring fine-tuning offers broader immediate applicability, real-world utility, and methodological rigor compared to Paper 1's more exploratory focus on agentic task delegation.
Paper 1 addresses a critical and fundamental problem in AI safety and interpretability: extracting the actual instructions driving an LLM's behavior directly from its hidden states. This approach provides a novel solution to urgent security challenges like prompt injection and hidden objectives. While Paper 2 offers a valuable engineering contribution for long-horizon tasks via agent delegation, Paper 1's deep dive into model internals and its broad implications for trustworthy AI deployment give it a higher potential for foundational scientific impact.
Paper 2 addresses a highly critical bottleneck in current AI: long-horizon agentic workflows and finite context windows. By introducing a methodology to synthesize training data for 'delegation intelligence' in multi-agent systems, it directly impacts the rapidly growing field of AI agents and 'deep research.' While Paper 1 provides a valuable benchmark for multimodal models, Paper 2's focus on internalizing multi-agent delegation into model weights represents a more profound methodological innovation with broader applications across all complex, real-world LLM tasks.
Paper 1 explores recursive self-design, a foundational step toward artificial general intelligence. By providing a clear evaluation framework and reproducible protocol for AI systems that can modify their own design space, it addresses a highly profound, paradigm-shifting concept. While Paper 2 offers strong empirical results in agentic delegation, Paper 1's focus on self-improving AI has broader, more transformative long-term implications across the entire field of AI research and safety.
Paper 1 directly accelerates complex scientific workflows by adapting AI agents to operate domain-specific simulators (e.g., GEOS, OpenFOAM, LAMMPS). This provides immediate, measurable value (e.g., 36x speedups) to computational scientists across physics and chemistry. While Paper 2 offers a valuable foundational AI framework for long-horizon tasks, Paper 1 has a more direct, applied, and transformative impact specifically on the execution of scientific research.
Paper 2 has higher impact potential: it proposes a training methodology (harness-guided delegation trajectories for SFT) that directly improves long-horizon agent performance under finite context, with strong benchmark gains and planned releases of model/weights/data enabling adoption. Its applications (scalable research agents, tool-using systems, enterprise workflows) are broad and timely as multi-agent LLM systems proliferate. Paper 1 is methodologically useful and insightful for evaluation/feedback limits, but is primarily diagnostic/benchmarking and may have narrower downstream impact than a broadly applicable capability-training approach.