Recursive Multi-Agent Systems
Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong
Abstract
Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thought generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with a 1.2×-2.4× end-to-end inference speedup and a 34.6%-75.6% reduction in token usage. Code and data are available at https://recursivemas.github.io.
AI Impact Assessments (3 models)
Scientific Impact Assessment: Recursive Multi-Agent Systems
1. Core Contribution
RecursiveMAS introduces a framework that extends the recursive computation paradigm—previously applied to single language models—to multi-agent systems. The key insight is treating each agent as a "layer" in a recursive loop, where agents communicate through continuous latent representations rather than generated text. The framework has two main architectural innovations: (1) RecursiveLink, a lightweight two-layer residual projection module with inner (within-agent) and outer (cross-agent) variants that enable latent-space information transfer between heterogeneous models; and (2) an inner-outer loop training algorithm that first warm-starts each agent's latent generation capability and then jointly optimizes the full recursive system through gradient backpropagation across recursion rounds.
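To make the "two-layer residual projection" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the tanh activation, the zero-initialised second layer, the linear skip map used for the cross-agent case (`W_skip`), and all function names are assumptions chosen for illustration. The sketch only shows how a same-width inner link and a width-changing outer link could share one residual form.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def random_matrix(rows, cols, scale, rng):
    """Small Gaussian-initialised matrix (hypothetical initialisation)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def residual_link(h, W1, W2, W_skip=None):
    """Two-layer residual projection: skip(h) + W2 @ tanh(W1 @ h).

    W_skip is None for the inner (same-width) variant, where the skip path
    is the identity. For the outer (cross-agent) variant it is a linear map
    from the source width to the target width. Both choices are assumptions.
    """
    hidden = [math.tanh(v) for v in matvec(W1, h)]
    update = matvec(W2, hidden)
    skip = h if W_skip is None else matvec(W_skip, h)
    return [s + u for s, u in zip(skip, update)]


rng = random.Random(0)
d_src, d_tgt = 8, 12  # illustrative hidden widths of two different agents
h = [rng.gauss(0.0, 1.0) for _ in range(d_src)]

# Inner link: a zero second layer makes the module start as the identity map.
W1_in = random_matrix(d_src, d_src, 0.1, rng)
W2_in = [[0.0] * d_src for _ in range(d_src)]
inner_out = residual_link(h, W1_in, W2_in)

# Outer link: carries a latent state from width d_src to width d_tgt.
W1_out = random_matrix(d_src, d_src, 0.1, rng)
W2_out = random_matrix(d_tgt, d_src, 0.1, rng)
W_skip = random_matrix(d_tgt, d_src, 0.1, rng)
outer_out = residual_link(h, W1_out, W2_out, W_skip)

print(len(inner_out), len(outer_out))
```

Because the whole path from `h` to the output is a differentiable function, gradients can flow through it across agents, which is the property the assessment contrasts with text-based communication.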
The problem addressed is genuine: text-based multi-agent communication is expensive (repeated encoding/decoding), introduces information bottlenecks (lossy discretization), and breaks gradient flow for end-to-end optimization. RecursiveMAS sidesteps all three issues by keeping inter-agent communication in latent space.
2. Methodological Rigor
Theoretical analysis is provided for both runtime complexity (Proposition 3.1) and gradient stability (Theorem 4.1). The runtime analysis shows that RecursiveMAS replaces the vocabulary-space decoding cost (m·|V|·d_h) with a latent transformation cost (m·d_h²), which is meaningful since d_h ≪ |V|. The gradient stability theorem demonstrates that text-based SFT recursion suffers from vanishing gradients (bounded by O(ε), where ε is the token entropy), while RecursiveLink maintains near-unit gradients. Both proofs are relatively straightforward but provide useful formal justification for the design choices.
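The size of the runtime gap can be checked with simple arithmetic. The sketch below compares the two cost terms quoted above; the specific values of m, |V|, and d_h are illustrative assumptions and are not taken from the paper, and m is read here as the number of generated steps.

```python
def decode_cost(m, vocab_size, d_h):
    """Vocabulary-space decoding term: m * |V| * d_h."""
    return m * vocab_size * d_h


def latent_cost(m, d_h):
    """Latent transformation term: m * d_h^2."""
    return m * d_h * d_h


# Illustrative values only (assumptions, not figures from the paper).
m, vocab_size, d_h = 128, 150_000, 4096

# m and one factor of d_h cancel, so the ratio reduces to |V| / d_h.
ratio = decode_cost(m, vocab_size, d_h) / latent_cost(m, d_h)
print(f"decoding term is about {ratio:.1f}x the latent term")  # about 36.6x
```

The ratio depends only on |V| / d_h, so the advantage of this one term shrinks for models with wide hidden states and small vocabularies. It also does not by itself predict the end-to-end speedup, which includes costs outside these two terms.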
Experimental evaluation is comprehensive: 9 benchmarks across 5 domains, 4 collaboration patterns, multiple model families (Qwen, Llama, Gemma, Mistral), and comparisons against single-agent fine-tuning, MoA, TextGrad, LoopLM, and text-based recursive MAS. The reported improvements are substantial: 8.3% average accuracy gain, 1.2×-2.4× speedup, and 34.6%-75.6% token reduction. Standard deviations are reported (±0.0041 accuracy across 5 runs), suggesting reliable results.
However, several methodological concerns arise:
3. Potential Impact
The paper opens a genuinely novel research direction: system-level recursive scaling for multi-agent systems. While individual components (latent communication, recursive LMs, MAS) exist, their synthesis is non-trivial and well-executed.
Practical implications are significant:
Broader influence could extend to agentic AI systems, where latent-space coordination could replace verbose text exchanges; to knowledge distillation, where the expert-learner pattern showed an 8% improvement; and potentially to any pipeline involving sequential LLM calls.
4. Timeliness & Relevance
The paper is highly timely. Recursive/looped LMs (LoopLM, recursive self-calling) represent an active 2025 research frontier, and multi-agent systems are seeing rapid adoption. The convergence of these two trends is natural and the paper arrives at an opportune moment. The focus on efficiency (token reduction, inference speedup) addresses a critical bottleneck as MAS deployments scale.
The paper also addresses the growing concern about the cost of multi-agent systems—text-based MAS generates enormous token volumes that scale with recursion depth, whereas RecursiveMAS shows sublinear token growth.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Reproducibility: Code and data are promised via the project page. The training procedure is well-documented, and model configurations are clearly specified.
Summary
RecursiveMAS presents a well-executed synthesis of recursive computation and multi-agent systems, with strong empirical results and reasonable theoretical support. The efficiency gains are the most compelling contribution—demonstrating that latent-space recursion can simultaneously improve accuracy and reduce cost. While the supervised training requirement and lack of interpretability are limitations, the framework establishes a promising new direction for scalable multi-agent collaboration.
Generated Apr 29, 2026
Comparison History (39)
Paper 2 has higher likely scientific impact due to a clearer, high-stakes real-world application (personalized medicine/ICU decision support), strong timeliness, and broader downstream implications for causal inference and clinical ML. It addresses a fundamental methodological tension (bias-precision paradox) with a novel stochastic alignment (sMMD), validated on large cohorts with distribution-shift testing plus human-AI evaluation and interpretability—evidence closer to translation. Paper 1 is innovative for LLM multi-agent efficiency, but impact depends on adoption in a fast-moving area with less direct societal deployment evidence.
Paper 2 offers a highly novel, mechanistic, causally validated explanation of LLM persuasion via a compact circuit (specific heads + rank-one routing feature), with interventions that both induce and block the failure mode across models and realistic attack settings. This is timely for AI safety, broadly relevant to interpretability, robustness, and security, and yields actionable monitoring/mitigation handles. Paper 1 is innovative and potentially useful for efficiency and performance in multi-agent LLM systems, but its impact may be more incremental/engineering-oriented and sensitive to rapidly evolving agent frameworks, whereas Paper 2’s mechanistic insight is likely to generalize and influence multiple subfields.
Paper 2 likely has higher scientific impact due to greater breadth and timeliness: it proposes a general recursion-based scaling paradigm for multi-agent LLM systems, with shared optimization across recursion rounds and demonstrated gains across diverse benchmarks (math, science, medicine, search, code). Its approach is broadly reusable across AI subfields and applications, aligning with a fast-moving research frontier (agentic LLMs and recursive computation) and offering efficiency improvements (latency/token reduction). Paper 1 is methodologically strong and practically relevant but is more domain-specific (EV ride-hailing) and thus narrower in cross-field impact.
Paper 1 likely has higher scientific impact due to broader, more general contributions: a novel recursive scaling axis for multi-agent LLM systems, unified latent-space recursion, and a gradient-based co-optimization algorithm with theoretical analysis and wide benchmark coverage (math, science, medicine, search, code). These ideas can transfer across many LLM/MAS applications and may influence future architectures. Paper 2 is methodologically strong and highly practical for EV fleet control with feasibility guarantees, but is more domain-specific and builds on established RL + MILP projection and robust RL techniques, limiting breadth.
Paper 2 likely has higher impact due to a broader, more general framework (recursive scaling for multi-agent collaboration) with strong, measurable practical gains (accuracy +8.3%, 1.2–2.4× speedup, large token reduction) across diverse benchmarks and domains. The inner–outer loop co-optimization with shared credit assignment and analysis of efficiency/stable gradients suggests solid methodological rigor and timeliness given current interest in recursive computation and agentic systems. Paper 1 is novel for multi-preference alignment, but its scope is narrower (alignment trade-offs) and may affect fewer downstream applications than a general MAS efficiency/reasoning advance.
Paper 2 likely has higher impact due to broader applicability and timeliness: recursive computation and multi-agent systems are fast-moving areas, and unifying MAS as latent-space recursion with an inner–outer loop co-optimization algorithm targets both capability and efficiency (speed/token reductions) across many benchmarks/domains. The methodological contribution (credit assignment across recursion rounds, runtime/gradient stability analysis) and clear practical gains suggest strong real-world adoption potential. Paper 1 is novel for multi-objective LLM alignment, but its scope is narrower (preference trade-offs) and impact may depend on adoption in alignment pipelines.
Paper 1 introduces a fundamental methodological advancement by scaling agent collaboration through latent-space recursion, backed by theoretical analysis and strong empirical gains across diverse domains (speed, token reduction, and accuracy). In contrast, Paper 2 provides a valuable but narrower case study on agentic failures within a specific domain (astrophysics). Paper 1's foundational framework for optimizing multi-agent systems is likely to see broader adoption and stimulate more follow-up research across the broader AI community.
Paper 1 introduces a novel, scalable recursive multi-agent framework with broad applicability across diverse domains. Its strong theoretical foundations, combined with empirical results demonstrating improved accuracy, speedup, and token efficiency, suggest a high foundational impact on AI research. In contrast, Paper 2 provides a valuable but narrower domain-specific evaluation of agentic failures in astrophysics, making its overall scientific impact likely more localized.
Paper 2 introduces a concrete, novel framework (RecursiveMAS) with strong empirical results across 9 benchmarks, demonstrating significant accuracy improvements, speedups, and token reductions. It offers a new scaling axis for multi-agent systems with theoretical grounding and practical applicability. While Paper 1 raises important philosophical/epistemological arguments about AI evaluation, it is primarily a position paper without empirical contributions. Paper 2's technical novelty, reproducibility (code provided), and broad benchmark coverage give it higher near-term scientific impact and likelihood of influencing follow-up research in the rapidly growing multi-agent AI field.
Paper 1 introduces a novel framework (RecursiveMAS) that extends recursive computation to multi-agent systems, demonstrating significant empirical improvements across 9 diverse benchmarks with strong practical benefits (accuracy gains, speedup, token reduction). It addresses a timely topic at the intersection of LLM scaling and multi-agent collaboration, with broad real-world applications. Paper 2 makes a solid theoretical contribution to knowledge compilation by generalizing OBDDs, but its impact is narrower, primarily within the computational complexity and knowledge representation communities. Paper 1's breadth of applications and timeliness give it higher potential impact.
RecursiveMAS introduces a novel framework extending recursive computation to multi-agent systems with strong empirical results (8.3% accuracy improvement, significant speedups and token reduction across 9 benchmarks). It addresses the timely and high-impact area of LLM-based multi-agent collaboration, with broad real-world applications spanning mathematics, science, medicine, and code generation. Paper 2 makes a solid theoretical contribution to knowledge compilation by generalizing OBDDs, but its impact is confined to a narrower community in computational complexity and Boolean function representation. The breadth, timeliness, and practical applicability of Paper 1 give it higher potential impact.
Paper 1 introduces a highly novel architectural paradigm by transitioning multi-agent collaboration from text to recursive latent-space computation. This fundamentally addresses efficiency bottlenecks in current MAS, supported by theoretical analysis and strong empirical gains across diverse domains. Paper 2 offers valuable insights into unstructured pruning for test-time scaling, but Paper 1 represents a broader, more innovative shift in how AI systems can be structured and optimized, yielding a wider potential impact on the field of multi-agent systems and model scaling.
Paper 1 introduces a foundational scaling principle for multi-agent systems, shifting from text-based to latent-space recursive collaboration. This approach offers broad, cross-disciplinary applications in reasoning, science, and coding, yielding significant efficiency and accuracy gains. In contrast, Paper 2 provides a valuable but more narrowly focused contribution to AI safety and adversarial jailbreaking methodologies. Due to its broader applicability and potential to reshape agentic architectures, Paper 1 exhibits higher potential scientific impact.
Paper 1 introduces a highly novel, broadly applicable recursive multi-agent framework operating in latent space. It demonstrates significant empirical gains across diverse domains (math, science, medicine, coding) along with efficiency improvements and theoretical grounding. In contrast, Paper 2 offers a valuable but niche evaluation protocol specifically tailored to AI-finance. Paper 1's generalizability, methodological breadth, and potential to shift paradigms in multi-agent system scaling give it a much higher potential for widespread scientific impact.
RecursiveMAS introduces a broadly applicable framework for scaling multi-agent systems through recursion in latent space, with strong empirical results across 9 diverse benchmarks, theoretical grounding, and clear practical benefits (accuracy gains, speedups, token reduction). Its breadth of impact spans multiple fields (math, science, medicine, code generation) and addresses a fundamental question about scaling agent collaboration. ValueAlpha, while methodologically rigorous, addresses a narrow niche (LLM-judged investment rationale validation) with contributions primarily relevant to AI-finance evaluation governance, limiting its broader scientific impact.
Paper 1 introduces a fundamental algorithmic paradigm shift for multi-agent LLM systems via latent-space recursion, impacting diverse AI domains like math, science, and coding. It includes theoretical analyses of learning dynamics. While Paper 2 demonstrates impressive real-world deployment at a massive commercial scale in recommender systems, Paper 1 offers broader foundational scientific implications for AI scaling and autonomous agent architectures, giving it higher potential to influence future core AI research.
Paper 2 likely has higher impact due to a broader, more general contribution: a scalable recursive computation principle for multi-agent systems with a unified latent-space formulation, new training algorithm (inner-outer loop co-optimization with credit assignment), theoretical analysis, and consistent gains across 9 diverse benchmarks plus speed/token-efficiency improvements—strong real-world deployment relevance. Paper 1 is novel in tying interpretability to data selection and shows strong data efficiency, but is more task/fine-tuning specific and depends on interpretability tooling assumptions, likely narrowing breadth and adoption.
Paper 1 introduces a highly novel recursive latent-space framework for multi-agent systems, backed by theoretical analyses of learning dynamics and runtime complexity. Its comprehensive empirical evaluation across 9 diverse benchmarks demonstrates significant and specific improvements in accuracy, inference speed, and token reduction. Paper 2 is also innovative but has a narrower evaluation scope and less concrete performance metrics in its abstract, making Paper 1's potential breadth of impact and methodological rigor stand out.
Paper 2 introduces a novel technical framework (RecursiveMAS) that extends recursive computation to multi-agent systems with concrete empirical gains (8.3% accuracy improvement, significant speedups and token reduction) across 9 benchmarks. It offers a new scaling axis for multi-agent AI with theoretical grounding and practical demonstrations. While Paper 1 addresses the important topic of auditability for LLM agents with a well-structured framework, it is more of a position/systematization paper proposing dimensions and cards rather than introducing a fundamentally new technical method. Paper 2's broader technical contributions and quantitative results suggest higher near-term scientific impact and citation potential.
Paper 1 offers groundbreaking insights into mechanistic interpretability and AI safety by identifying internal emotion representations that causally drive alignment-relevant behaviors like sycophancy and reward hacking. This fundamentally advances our understanding of LLM internals and safety risks. While Paper 2 presents a strong architectural improvement for multi-agent systems with practical efficiency gains, Paper 1's profound implications for understanding artificial cognition and ensuring AI alignment give it higher potential for broad, foundational scientific impact.