AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Tong Zhao, Xiaoxi Li, Zheng Liu, Zhicheng Dou
Abstract
Recent progress on long-horizon agentic tasks has been driven largely by scaling up individual agents through stronger models, better tools, and more effective scaffolding. In contrast, much less is understood about scaling out: whether multiple peer agents, all targeting the same task, can become an additional source of capability without relying on explicit role specialization or workflow orchestration. We study this question and propose AgentFugue, a collective reasoning framework built around a shared reasoning hub. As peer agents explore the same task in parallel, the hub records concise notes on what each agent has established, attempted, or ruled out, and enables each agent to selectively access what other agents have discovered in a form useful for its current search. This design turns otherwise isolated trajectories into a connected ecology of reusable intermediate reasoning without requiring centralized planning. We instantiate the hub as a plug-in communication layer, trained with supervised fine-tuning and end-to-end reinforcement learning. Across the challenging long-horizon settings we study, AgentFugue improves over strong baselines. Our results suggest that collective reasoning can turn scaling out peer agent systems into a distinct source of capability gains, rather than merely a way of spending more compute.
AI Impact Assessments
Scientific Impact Assessment: AgentFugue
1. Core Contribution
AgentFugue introduces a shared reasoning hub that enables multiple peer LLM agents working on the same task to selectively exchange intermediate reasoning progress without centralized planning or role specialization. The key insight is framing "scaling out" (adding more peer agents) as a distinct capability axis from "scaling up" (making individual agents stronger). The hub operates through two mechanisms: episode writing (compressing completed trajectory segments into reusable notes) and intent-driven reading (allowing agents to selectively query and synthesize relevant teammate episodes). This is conceptually positioned between independent parallel sampling (best-of-N) and tightly orchestrated multi-agent systems with predefined roles.
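The two hub mechanisms described above can be illustrated with a minimal sketch. This is not the authors' implementation: the class and method names (`SharedHub`, `write_episode`, `read`) are hypothetical, the "compression" simply keeps the last step of a segment, and word overlap stands in for the trained hub model that the paper uses for writing and reading.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """A concise note summarising one completed trajectory segment."""
    agent_id: str
    note: str


@dataclass
class SharedHub:
    """Toy stand-in for the shared reasoning hub (not the trained model)."""
    episodes: list = field(default_factory=list)

    def write_episode(self, agent_id: str, segment: list) -> Episode:
        # Episode writing: compress a trajectory segment into a note.
        # Assumption: keep only the last step; the paper uses a trained model.
        episode = Episode(agent_id, segment[-1])
        self.episodes.append(episode)
        return episode

    def read(self, agent_id: str, intent: str, top_k: int = 2) -> list:
        # Intent-driven reading: rank teammates' notes by word overlap
        # with the reader's stated intent, skipping the reader's own notes.
        intent_words = set(intent.lower().split())
        scored = []
        for ep in self.episodes:
            if ep.agent_id == agent_id:
                continue
            overlap = len(intent_words & set(ep.note.lower().split()))
            if overlap > 0:
                scored.append((overlap, ep.note))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [note for _, note in scored[:top_k]]
```

In this sketch no agent plans for another: each agent decides when to write and what intent to read with, which mirrors the absence of centralized planning and predefined roles described in the review.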
The formalization through "target knowledge space" K*(x) and discovered subspaces K(τ_i) provides a clean conceptual framework, though it remains informal rather than yielding provable guarantees. The hub is trained via supervised fine-tuning followed by GRPO reinforcement learning, with task agents frozen—an interesting design choice that isolates the communication layer's contribution.
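One hedged way to write down the knowledge-space framing is given below. It is a reading of the review's description, not the paper's own statement: the number of agents N, the symbol K_team, and the assumption that each discovered subspace lies inside the target space are all introduced here for illustration.

```latex
% Assumption: each trajectory \tau_i discovers a subset of the target space.
K(\tau_i) \subseteq K^*(x), \qquad i = 1, \dots, N
% Team coverage is the union of what the individual trajectories discover.
K_{\text{team}} = \bigcup_{i=1}^{N} K(\tau_i) \subseteq K^*(x)
```

Under this reading, isolated agents each see only their own K(τ_i), while a shared hub aims to give every agent access to the relevant parts of K_team during search.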
2. Methodological Rigor
Strengths: The experimental design is reasonably thorough. Three diverse benchmarks (BrowseComp for retrieval-heavy multi-hop QA, WideSearch for breadth-oriented evidence collection, HLE for reasoning-centric problems) test different facets of long-horizon capability. The comparison against both single-agent baselines (ReAct, DeepResearch systems) and multi-agent baselines (Naive-Multi-Agent, Swarm-Multi-Agent) with matched tool stacks and interaction budgets is fair. The homogeneous/heterogeneous team distinction is a meaningful experimental axis.
Weaknesses: Several methodological concerns limit confidence:
3. Potential Impact
The paper addresses an important and timely question: whether peer-agent parallelism can be more than just independent sampling. The shared reasoning hub concept is practically useful—it's model-agnostic, operates as a plug-in layer, and doesn't require modifying task agents. This modularity could make it adoptable across different agentic frameworks.
The heterogeneous team results (§3.5) are particularly interesting: weaker models benefit substantially from stronger teammates' discoveries, suggesting a practical deployment pattern where expensive frontier models help cheaper models perform better through shared intermediate reasoning.
However, the impact may be bounded by several factors: (1) the hub requires its own training pipeline with task-specific GRPO, limiting zero-shot transferability; (2) the compute overhead of running N agents plus a hub model may not always be justified relative to simply running the best single agent N times and aggregating; (3) the comparison against best-of-N with proper aggregation is not directly presented, making it hard to quantify the marginal value of mid-trajectory communication over post-hoc aggregation.
4. Timeliness & Relevance
This work is highly timely. The field is actively exploring test-time compute scaling, and the distinction between depth scaling (longer reasoning chains) and breadth scaling (more parallel trajectories) is a live research question. Recent work on repeated sampling, self-consistency, and agentic aggregation has shown the power of parallel trajectories, but the question of whether mid-trajectory communication adds value beyond post-hoc merging is underexplored. AgentFugue provides a concrete, trainable answer to this question.
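For contrast with mid-trajectory communication, post-hoc merging in its simplest form is a majority vote over final answers, as in self-consistency. The sketch below is illustrative only: the function name `majority_vote` and the light normalisation step are assumptions, and it does not reproduce any aggregation baseline from the paper.

```python
from collections import Counter


def majority_vote(answers: list) -> str:
    """Post-hoc aggregation: return the most common final answer."""
    # Normalise lightly so trivial formatting differences do not split votes.
    normalised = [a.strip().lower() for a in answers]
    # most_common(1) returns a list with the single (answer, count) pair.
    winner, _ = Counter(normalised).most_common(1)[0]
    return winner
```

Aggregation of this kind only sees finished trajectories, so intermediate findings from one agent cannot redirect another agent's search. That gap is what the hub is meant to fill.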
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations:
The paper would benefit from a direct compute-matched comparison: given a fixed total token budget, how does AgentFugue (N agents with hub) compare to N independent agents with post-hoc aggregation? The current setup matches per-agent budgets but not total compute, since the hub model adds overhead. The workload analysis (Fig. 2b) partially addresses this by showing per-agent costs decrease, but a clean compute-normalized comparison would strengthen the central claim considerably.
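The accounting behind such a compute-matched comparison can be sketched as follows. The function names and the token-based cost model are assumptions made for illustration; the paper's actual budgets and hub overhead are not given here.

```python
def total_tokens(n_agents: int, tokens_per_agent: int, hub_tokens: int = 0) -> int:
    """Total tokens spent by a team, including any hub overhead."""
    return n_agents * tokens_per_agent + hub_tokens


def matched_per_agent_budget(total_budget: int, n_agents: int, hub_tokens: int) -> int:
    """Per-agent budget that keeps a hub-equipped team within total_budget."""
    remaining = total_budget - hub_tokens
    if remaining < 0:
        raise ValueError("hub overhead exceeds the total budget")
    return remaining // n_agents
```

With matched per-agent budgets the hub-equipped team spends strictly more in total; a compute-normalized setup would instead shrink each agent's budget by its share of the hub overhead.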
Generated May 26, 2026
Comparison History (24)
Paper 1 introduces a concrete new framework (AgentFugue) for scaling out peer agents via a shared reasoning hub, with an implemented communication layer trained by SFT and end-to-end RL and evaluated on challenging long-horizon tasks with reported gains over strong baselines. This combination of novel system design, methodological contribution, and empirical evidence suggests clearer near-term real-world applicability (multi-agent assistants, complex workflows) and timely relevance to agent scaling. Paper 2 is valuable as a unifying conceptual/taxonomic synthesis, but is less likely to drive immediate capability advances without new algorithms or results.
Paper 2 explores 'scaling out' multi-agent systems via collective reasoning for long-horizon tasks, a highly active and critical frontier in AI. Its approach to decentralized, shared reasoning without explicit roles offers significant innovation over traditional orchestration, likely inspiring numerous downstream applications. While Paper 1 provides crucial methodological improvements for RAG evaluation, Paper 2's potential to fundamentally advance autonomous agent capabilities and scale gives it a broader and more transformative scientific impact.
Paper 2 has higher likely impact due to broader applicability and timeliness: scaling-out collective reasoning for long-horizon agent tasks generalizes across domains (software engineering, robotics, scientific discovery, operations), not just medicine. Its framework-level contribution (shared reasoning hub + SFT/RL training) can influence multi-agent architectures and evaluation paradigms widely. Paper 1 is innovative and high-value for clinical AI, but its impact is narrower to guideline-rich healthcare settings and depends on guideline availability/maintainability and clinical validation pathways. Overall, AgentFugue’s cross-field breadth and relevance to current agentic research give it higher estimated impact.
Paper 1 addresses a fundamental and timely question about the safety and controllability of large reasoning models (LRMs), revealing that chain-of-thought creates a dual encoding of refusal that complicates existing alignment techniques. This has immediate implications for AI safety research and the rapidly growing deployment of reasoning models like DeepSeek-R1 and OpenAI o1. The finding that CoT both strengthens robustness against activation steering but opens new attack surfaces is novel and consequential. Paper 2 presents a useful engineering contribution for multi-agent scaling, but its impact is more incremental within the agent framework literature.
Paper 1 likely has higher scientific impact due to its cross-disciplinary novelty (linking child cognition, Bayesian/program induction, and LLM agents as experimental model organisms) and broader relevance to psychology, cognitive science, AI alignment/interpretability, and education. It offers a principled task formalization with complementary computational interpretations and tests mechanistic hypotheses about evidence reliability and information seeking, supporting methodological rigor. Paper 2 is timely and practically useful for long-horizon agent scaling, but its impact may be narrower (systems/engineering) and more contingent on benchmark generalization and competitive baselines.
Paper 1 addresses a critical bottleneck in AI—multi-agent collective reasoning for long-horizon tasks. Its novel decentralized hub approach offers broad, immediate applicability in AI development and scaling. In contrast, while Paper 2 presents an interesting interdisciplinary neuroimaging study on AI hallucinations, its small sample size (27 participants) and niche focus limit its transformative potential compared to foundational AI scaling methodologies.
While Paper 1 offers significant clinical value by improving interpretable medical AI, Paper 2 has a higher potential for broad scientific impact. AgentFugue addresses a fundamental challenge in general AI: scaling out multi-agent systems for long-horizon tasks without centralized orchestration. By introducing a shared reasoning hub that turns isolated agent trajectories into reusable collective reasoning, its methodology can be generalized across countless domains. The breadth of impact, timeliness in the fast-moving field of LLM agents, and architectural innovation give Paper 2 the edge in overall scientific influence.
AgentFugue addresses a fundamental scaling question in AI agents—whether multiple peer agents can collectively improve performance on long-horizon tasks through shared reasoning without centralized orchestration. This has broad applicability across agentic AI systems, introduces a novel architectural paradigm (shared reasoning hub), and combines SFT with RL training. Paper 2 makes a valuable conceptual contribution about process vs. output alignment in pluralistic contexts, but its scope is narrower (two specific legal/credit domains) and its impact is more limited to the alignment evaluation community. AgentFugue's framework is more likely to inspire follow-up work across multiple AI subfields.
Paper 1 addresses a foundational bottleneck in AI deployment—accountability and trust—by proposing a formal framework for explicit provenance. Its focus on sociotechnical safety, causal attribution, and regulatory alignment gives it broader cross-disciplinary and societal impact compared to Paper 2, which offers a narrower, albeit valuable, algorithmic improvement for multi-agent reasoning capabilities.
AgentFugue addresses a fundamental and timely question in AI agent scaling—whether collective reasoning among peer agents can serve as a distinct capability source beyond individual agent improvements. The shared reasoning hub concept is novel, broadly applicable across long-horizon tasks, and introduces a new paradigm (scaling out vs. scaling up) with practical implications for multi-agent systems. Paper 2 provides useful empirical analysis of MoE routing under safety-relevant conditions, but is narrower in scope (single model, primarily observational) and offers more incremental insights into existing architecture behavior rather than introducing a new framework with broad applicability.
AgentFugue addresses a fundamental and broadly applicable question—whether scaling out peer agents via collective reasoning can yield capability gains—introducing a general-purpose framework applicable across diverse long-horizon tasks. Its contributions (shared reasoning hub, RL-trained communication layer, ecology of reusable reasoning) are domain-agnostic and relevant to the rapidly growing multi-agent systems field. Trace2Skill, while rigorous and valuable, targets a narrower domain (EDA/Verilog design) with more specialized applicability. AgentFugue's breadth of potential impact across AI agent research gives it higher estimated scientific impact.
AgentFugue addresses a fundamental and timely question in AI—how to scale multi-agent systems for long-horizon tasks through collective reasoning rather than just scaling individual agents. This has broad implications across AI research, multi-agent systems, and numerous application domains. The concept of a shared reasoning hub enabling emergent collective intelligence without centralized planning is novel and generalizable. Paper 1, while rigorous and practically useful, addresses a narrower domain (financial NLP with noisy labels) with more incremental contributions. Paper 2's broader applicability and alignment with the rapidly growing agentic AI paradigm give it higher potential impact.
Paper 1 introduces a novel, decentralized multi-agent framework that addresses a critical bottleneck in scaling agentic systems. By enabling collective reasoning through a shared hub without explicit orchestration, it offers a scalable architectural innovation. Paper 2 addresses context learning, a well-explored area, whereas Paper 1 pioneers new methods in the rapidly growing field of multi-agent scaling, likely yielding broader downstream applications and methodological impact.
Paper 1 is likely higher impact due to greater novelty and broader cross-domain relevance: a general collective-reasoning architecture for scaling multi-agent long-horizon problem solving, with a trainable shared “reasoning hub,” can affect agent design across software engineering, robotics, and scientific discovery. It also aligns with a timely frontier (scaling out LLM agents beyond single-agent scaffolding). Paper 2 is methodologically valuable and practical for learning analytics, but its scope is narrower and primarily improves evaluation protocol rather than introducing a broadly transferable modeling paradigm.
AgentFugue addresses the fundamental question of scaling out multi-agent systems through collective reasoning, which is a timely and broadly impactful research direction. Its novel shared reasoning hub concept with RL training opens new paradigms for agent coordination without centralized planning, with applications across diverse long-horizon tasks. While PALoRA makes a solid contribution to the important but more narrowly scoped problem of knowledge injection without reasoning degradation (an incremental advance in the PEFT literature), AgentFugue's framework has broader implications for how we think about agent scaling and could influence multiple research communities working on multi-agent systems, reasoning, and planning.
Paper 1 likely has higher impact due to broader applicability and timeliness: collective reasoning for scaling multi-agent systems targets a central, rapidly growing area (LLM agents and long-horizon tasks) with potential downstream use across software engineering, robotics, science assistants, and tool-using agents. The shared “reasoning hub” is a relatively novel scaling-out mechanism beyond role orchestration, and the plug-in layer with SFT+RL suggests methodological maturity. Paper 2 is strong and rigorous within imperfect-information game AI, but its impact is narrower and more domain-specific.
Paper 2 has a significantly higher potential impact due to its massive scale, rigorous external validation across nine independent cohorts (1.5 million ECGs), and immediate life-saving clinical applications. While Paper 1 presents a novel AI multi-agent architecture, Paper 2 demonstrates a highly mature medical foundation model capable of detecting both common and rare cardiovascular diseases. Its ability to serve as an opportunistic screening tool using routine, low-cost ECGs represents a transformative leap in accessible global healthcare and medical AI.
AgentFugue introduces a novel and timely framework for scaling multi-agent systems through collective reasoning, addressing a fundamental open question in AI agent research. Its combination of a shared reasoning hub with reinforcement learning training offers broad applicability across long-horizon tasks and represents a distinct methodological contribution. Paper 1, while valuable for the axiomatic design community, is primarily a pedagogical clarification of existing theory (Suh's work) rather than introducing new methodology, limiting its broader scientific impact. Paper 2's relevance to the rapidly growing field of AI agents gives it significantly higher potential for citations and follow-on work.
AgentFugue addresses a fundamental and broadly applicable question about scaling AI agent systems through collective reasoning, proposing a novel framework with a shared reasoning hub that enables parallel agents to collaboratively solve long-horizon tasks. This has wide applicability across many agentic AI domains. CausaLab, while valuable as a benchmark for causal discovery evaluation, is more niche—it provides an evaluation environment rather than a new capability. AgentFugue's contribution of demonstrating that 'scaling out' is a distinct source of capability gains introduces a new paradigm for multi-agent systems with broader downstream impact.
Paper 2 introduces a fundamental algorithmic innovation by addressing how to effectively 'scale out' multi-agent systems without centralized orchestration. This collective reasoning framework offers a novel paradigm for utilizing test-time compute in long-horizon tasks, giving it broader methodological impact across various AI domains compared to Paper 1, which primarily introduces a domain-specific benchmark, albeit an ambitious one.