Recursive Multi-Agent Systems

Xiyuan Yang, Jiaru Zou, Rui Pan, Ruizhong Qiu, Pan Lu, Shizhe Diao, Jindong Jiang, Hanghang Tong

Apr 28, 2026

arXiv:2604.25917v1

cs.AI (primary), cs.CL, cs.LG

#71 of 2292 · Artificial Intelligence

Tournament Score: 1555 ± 36 (axis range 1050 to 1800)

Win Rate: 62%

Rating: 7.4 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.8

Abstract

Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion? To this end, we introduce RecursiveMAS, a recursive multi-agent framework that casts the entire system as a unified latent-space recursive computation. RecursiveMAS connects heterogeneous agents as a collaboration loop through the lightweight RecursiveLink module, enabling in-distribution latent thoughts generation and cross-agent latent state transfer. To optimize our framework, we develop an inner-outer loop learning algorithm for iterative whole-system co-optimization through shared gradient-based credit assignment across recursion rounds. Theoretical analyses of runtime complexity and learning dynamics establish that RecursiveMAS is more efficient than standard text-based MAS and maintains stable gradients during recursive training. Empirically, we instantiate RecursiveMAS under 4 representative agent collaboration patterns and evaluate across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. In comparison with advanced single/multi-agent and recursive computation baselines, RecursiveMAS consistently delivers an average accuracy improvement of 8.3%, together with 1.2$\times$-2.4$\times$ end-to-end inference speedup, and 34.6%-75.6% token usage reduction. Code and data are provided at https://recursivemas.github.io.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: Recursive Multi-Agent Systems

1. Core Contribution

RecursiveMAS introduces a framework that extends the recursive computation paradigm—previously applied to single language models—to multi-agent systems. The key insight is treating each agent as a "layer" in a recursive loop, where agents communicate through continuous latent representations rather than generated text. The framework has two main architectural innovations: (1) RecursiveLink, a lightweight two-layer residual projection module with inner (within-agent) and outer (cross-agent) variants that enable latent-space information transfer between heterogeneous models; and (2) an inner-outer loop training algorithm that first warm-starts each agent's latent generation capability and then jointly optimizes the full recursive system through gradient backpropagation across recursion rounds.
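To make the "two-layer residual projection" description concrete, here is a minimal NumPy sketch of what such a link could look like. It is not the authors' implementation: the class name `ResidualLink`, the GELU activation, the hidden width, the zero initialization of the second layer, and the linear skip projection used when the two agents have different hidden sizes are all assumptions made for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU; the activation choice is an assumption
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class ResidualLink:
    """Hypothetical two-layer residual projection: out = skip(h) + act(h W1) W2."""

    def __init__(self, d_in, d_out, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, size=(d_in, d_hidden))
        # Zero-initialised second layer, so the link starts as a pure skip path (assumption)
        self.w2 = np.zeros((d_hidden, d_out))
        # Identity skip when sizes match (inner, within-agent case);
        # otherwise a linear map between hidden sizes (outer, cross-agent case)
        if d_in == d_out:
            self.skip = np.eye(d_in)
        else:
            self.skip = rng.normal(0.0, 0.02, size=(d_in, d_out))

    def __call__(self, h):
        return h @ self.skip + gelu(h @ self.w1) @ self.w2


inner = ResidualLink(d_in=64, d_out=64, d_hidden=128)   # within one agent
outer = ResidualLink(d_in=64, d_out=96, d_hidden=128)   # between two agents
h = np.random.default_rng(1).normal(size=(5, 64))       # 5 latent positions
same_agent = inner(h)
next_agent = outer(h)
```

With the zero-initialised second layer, the inner link returns its input unchanged at the start of training, which is one common way to keep a newly added module from disturbing a pretrained model.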

The problem addressed is genuine: text-based multi-agent communication is expensive (repeated encoding/decoding), introduces information bottlenecks (lossy discretization), and breaks gradient flow for end-to-end optimization. RecursiveMAS sidesteps all three issues by keeping inter-agent communication in latent space.

2. Methodological Rigor

Theoretical analysis is provided for both runtime complexity (Proposition 3.1) and gradient stability (Theorem 4.1). The runtime analysis shows RecursiveMAS replaces vocabulary-space decoding cost (m·|V|·d_h) with latent transformation cost (m·d_h²), which is meaningful since d_h ≪ |V|. The gradient stability theorem demonstrates that text-based SFT recursion suffers gradient vanishing (bounded by O(ε) where ε is token entropy), while RecursiveLink maintains near-unit gradients. Both proofs are relatively straightforward but provide useful formal justification for the design choices.
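As a rough illustration of the two cost terms in the runtime analysis, the snippet below evaluates both for one set of assumed sizes. The values of `m`, `vocab_size`, and `d_h` are placeholders chosen for the example, not figures from the paper; the point is only that the ratio of the two terms reduces to |V| / d_h.

```python
def decode_cost(m, vocab_size, d_h):
    # Vocabulary-space decoding term from the review: m * |V| * d_h
    return m * vocab_size * d_h


def latent_cost(m, d_h):
    # Latent transformation term from the review: m * d_h^2
    return m * d_h ** 2


# Assumed sizes for illustration only
m, vocab_size, d_h = 32, 150_000, 4_096

ratio = decode_cost(m, vocab_size, d_h) / latent_cost(m, d_h)
print(f"{ratio:.1f}")  # prints 36.6, i.e. vocab_size / d_h
```

Under these assumed sizes the decoding term is about 37 times larger than the latent term, and `m` cancels out of the ratio.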

Experimental evaluation is comprehensive: 9 benchmarks across 5 domains, 4 collaboration patterns, multiple model families (Qwen, Llama, Gemma, Mistral), and comparisons against single-agent fine-tuning, MoA, TextGrad, LoopLM, and text-based recursive MAS. The reported improvements are substantial: 8.3% average accuracy gain, 1.2×-2.4× speedup, and 34.6%-75.6% token reduction. Standard deviations are reported (±0.0041 accuracy across 5 runs), suggesting reliable results.

However, several methodological concerns arise:

The comparison with Recursive-TextMAS is somewhat unfair since the text-based variant lacks the gradient-flow advantages during training—it's unclear how much of the improvement comes from better optimization vs. the latent communication itself.

The inner-loop training uses cosine similarity regression to ground-truth embeddings, which requires access to ground-truth answers for each agent role—this limits applicability to supervised settings.

The training data curation process uses Qwen3.5-397B to generate role-specific supervision, introducing a strong teacher model dependency that isn't fully acknowledged.
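The cosine similarity regression mentioned in the concerns above can be sketched as a loss of the form 1 − cos(pred, target), averaged over positions. This is a generic formulation written for illustration; the exact loss, reduction, and which embeddings serve as targets in the paper are not specified in this review, and the function name `cosine_regression_loss` is hypothetical.

```python
import numpy as np


def cosine_regression_loss(pred, target, eps=1e-8):
    """Mean of (1 - cosine similarity) between predicted and target vectors."""
    pred_n = pred / (np.linalg.norm(pred, axis=-1, keepdims=True) + eps)
    target_n = target / (np.linalg.norm(target, axis=-1, keepdims=True) + eps)
    cos = np.sum(pred_n * target_n, axis=-1)
    return float(np.mean(1.0 - cos))


# Toy target embeddings (two positions, two dimensions)
target = np.array([[1.0, 0.0], [0.0, 2.0]])

# Same direction, different scale: loss is close to 0
aligned = cosine_regression_loss(3.0 * target, target)

# Swapped coordinates give orthogonal vectors: loss is close to 1
orthogonal = cosine_regression_loss(target[:, ::-1], target)
```

Because the loss depends only on direction, it needs a target vector for every supervised position, which is why the review notes the dependence on ground-truth answers per agent role.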

3. Potential Impact

The paper opens a genuinely novel research direction: system-level recursive scaling for multi-agent systems. While individual components (latent communication, recursive LMs, MAS) exist, their synthesis is non-trivial and well-executed.

Practical implications are significant:

The efficiency gains (token reduction, speedup) directly translate to reduced inference costs for production MAS deployments.

The framework's compatibility with heterogeneous model families and diverse collaboration patterns enhances its practical utility.

The lightweight RecursiveLink (13.12M parameters, 0.31% of total) makes adoption feasible without retraining base models.
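The two figures quoted for RecursiveLink imply a total parameter count, which can be checked with simple arithmetic. The calculation assumes that "total" means the full system's parameter count and that both reported numbers are rounded, so the result is only approximate.

```python
link_params = 13.12e6   # RecursiveLink parameters, as reported in the review
link_share = 0.0031     # 0.31% of total, as reported in the review

# Implied total parameter count (approximate, since both inputs are rounded)
implied_total = link_params / link_share
print(f"{implied_total / 1e9:.2f}B")  # prints 4.23B
```

An implied total of roughly 4.2B parameters is consistent with the review's later remark that the evaluated models are in the sub-10B range.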

Broader influence could extend to: agentic AI systems where latent-space coordination could replace verbose text exchanges; knowledge distillation where the expert-learner pattern showed 8% improvement; and potentially to any pipeline involving sequential LLM calls.

4. Timeliness & Relevance

The paper is highly timely. Recursive/looped LMs (LoopLM, recursive self-calling) represent an active 2025 research frontier, and multi-agent systems are seeing rapid adoption. The convergence of these two trends is natural and the paper arrives at an opportune moment. The focus on efficiency (token reduction, inference speedup) addresses a critical bottleneck as MAS deployments scale.

The paper also addresses the growing concern about the cost of multi-agent systems—text-based MAS generates enormous token volumes that scale with recursion depth, whereas RecursiveMAS shows sublinear token growth.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framework that naturally extends recursive computation from single models to systems

Impressive efficiency gains that increase with recursion depth (a desirable property)

Structure-agnostic design demonstrated across 4 distinct collaboration patterns

Theoretical grounding for both architectural and training decisions

Comprehensive evaluation across diverse domains and model families

Very low training overhead (13.12M parameters, $4.27 estimated cost)

Notable Limitations:

The framework requires supervised training data with ground-truth answers for each agent role, limiting applicability to RL-based or self-play MAS scenarios

Latent thoughts are not interpretable—unlike text-based MAS, there's no way to inspect intermediate reasoning, which may limit debugging and trust

The "latent thoughts" generated via auto-regressive forward passes in continuous space lack theoretical grounding for why this produces meaningful representations (beyond empirical validation)

Scaling analysis is limited to small-to-medium models (sub-10B); behavior with frontier-scale models is unknown

The PCA visualization (Figure 7) showing distribution alignment is suggestive but doesn't rigorously establish that latent representations carry task-relevant semantics

The paper doesn't compare against concurrent latent communication methods (Du et al., 2025; Zheng et al., 2025) under the same recursive setting

Reproducibility: Code and data are promised via the project page. The training procedure is well-documented, and model configurations are clearly specified.

Summary

RecursiveMAS presents a well-executed synthesis of recursive computation and multi-agent systems, with strong empirical results and reasonable theoretical support. The efficiency gains are the most compelling contribution—demonstrating that latent-space recursion can simultaneously improve accuracy and reduce cost. While the supervised training requirement and lack of interpretability are limitations, the framework establishes a promising new direction for scalable multi-agent collaboration.

Rating: 7.4 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.8

Generated Apr 29, 2026

Comparison History (39)

vs. Resolving the bias-precision paradox with stochastic causal representation learning for personalized medicine

gpt-5.2 · 5/16/2026

Paper 2 has higher likely scientific impact due to a clearer, high-stakes real-world application (personalized medicine/ICU decision support), strong timeliness, and broader downstream implications for causal inference and clinical ML. It addresses a fundamental methodological tension (bias-precision paradox) with a novel stochastic alignment (sMMD), validated on large cohorts with distribution-shift testing plus human-AI evaluation and interpretability—evidence closer to translation. Paper 1 is innovative for LLM multi-agent efficiency, but impact depends on adoption in a fast-moving area with less direct societal deployment evidence.

vs. How LLMs Are Persuaded: A Few Attention Heads, Rerouted

gpt-5.2 · 5/16/2026

Paper 2 offers a highly novel, mechanistic, causally validated explanation of LLM persuasion via a compact circuit (specific heads + rank-one routing feature), with interventions that both induce and block the failure mode across models and realistic attack settings. This is timely for AI safety, broadly relevant to interpretability, robustness, and security, and yields actionable monitoring/mitigation handles. Paper 1 is innovative and potentially useful for efficiency and performance in multi-agent LLM systems, but its impact may be more incremental/engineering-oriented and sensitive to rapidly evolving agent frameworks, whereas Paper 2’s mechanistic insight is likely to generalize and influence multiple subfields.

vs. Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

gpt-5.2 · 4/29/2026

Paper 2 likely has higher scientific impact due to greater breadth and timeliness: it proposes a general recursion-based scaling paradigm for multi-agent LLM systems, with shared optimization across recursion rounds and demonstrated gains across diverse benchmarks (math, science, medicine, search, code). Its approach is broadly reusable across AI subfields and applications, aligning with a fast-moving research frontier (agentic LLMs and recursive computation) and offering efficiency improvements (latency/token reduction). Paper 1 is methodologically strong and practically relevant but is more domain-specific (EV ride-hailing) and thus narrower in cross-field impact.

vs. Semi-Markov Reinforcement Learning for City-Scale EV Ride-Hailing with Feasibility-Guaranteed Actions

gpt-5.2 · 4/29/2026

Paper 1 likely has higher scientific impact due to broader, more general contributions: a novel recursive scaling axis for multi-agent LLM systems, unified latent-space recursion, and a gradient-based co-optimization algorithm with theoretical analysis and wide benchmark coverage (math, science, medicine, search, code). These ideas can transfer across many LLM/MAS applications and may influence future architectures. Paper 2 is methodologically strong and highly practical for EV fleet control with feasibility guarantees, but is more domain-specific and builds on established RL + MILP projection and robust RL techniques, limiting breadth.

vs. Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

gpt-5.2 · 4/29/2026

Paper 2 likely has higher impact due to a broader, more general framework (recursive scaling for multi-agent collaboration) with strong, measurable practical gains (accuracy +8.3%, 1.2–2.4× speedup, large token reduction) across diverse benchmarks and domains. The inner–outer loop co-optimization with shared credit assignment and analysis of efficiency/stable gradients suggests solid methodological rigor and timeliness given current interest in recursive computation and agentic systems. Paper 1 is novel for multi-preference alignment, but its scope is narrower (alignment trade-offs) and may affect fewer downstream applications than a general MAS efficiency/reasoning advance.

vs. Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

gpt-5.2 · 4/29/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: recursive computation and multi-agent systems are fast-moving areas, and unifying MAS as latent-space recursion with an inner–outer loop co-optimization algorithm targets both capability and efficiency (speed/token reductions) across many benchmarks/domains. The methodological contribution (credit assignment across recursion rounds, runtime/gradient stability analysis) and clear practical gains suggest strong real-world adoption potential. Paper 1 is novel for multi-objective LLM alignment, but its scope is narrower (preference trade-offs) and impact may depend on adoption in alignment pipelines.

vs. Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

gemini-3 · 4/29/2026

Paper 1 introduces a fundamental methodological advancement by scaling agent collaboration through latent-space recursion, backed by theoretical analysis and strong empirical gains across diverse domains (speed, token reduction, and accuracy). In contrast, Paper 2 provides a valuable but narrower case study on agentic failures within a specific domain (astrophysics). Paper 1's foundational framework for optimizing multi-agent systems is likely to see broader adoption and stimulate more follow-up research across the broader AI community.

vs. Plausible but Wrong: A case study on Agentic Failures in Astrophysical Workflows

gemini-3 · 4/29/2026

Paper 1 introduces a novel, scalable recursive multi-agent framework with broad applicability across diverse domains. Its strong theoretical foundations, combined with empirical results demonstrating improved accuracy, speedup, and token efficiency, suggest a high foundational impact on AI research. In contrast, Paper 2 provides a valuable but narrower domain-specific evaluation of agentic failures in astrophysics, making its overall scientific impact likely more localized.

vs. Beyond Behavior: Why AI Evaluation Needs a Cognitive Revolution

claude-opus-4.6 · 4/29/2026

Paper 2 introduces a concrete, novel framework (RecursiveMAS) with strong empirical results across 9 benchmarks, demonstrating significant accuracy improvements, speedups, and token reductions. It offers a new scaling axis for multi-agent systems with theoretical grounding and practical applicability. While Paper 1 raises important philosophical/epistemological arguments about AI evaluation, it is primarily a position paper without empirical contributions. Paper 2's technical novelty, reproducibility (code provided), and broad benchmark coverage give it higher near-term scientific impact and likelihood of influencing follow-up research in the rapidly growing multi-agent AI field.

vs. A canonical generalization of OBDD

claude-opus-4.6 · 4/29/2026

Paper 1 introduces a novel framework (RecursiveMAS) that extends recursive computation to multi-agent systems, demonstrating significant empirical improvements across 9 diverse benchmarks with strong practical benefits (accuracy gains, speedup, token reduction). It addresses a timely topic at the intersection of LLM scaling and multi-agent collaboration, with broad real-world applications. Paper 2 makes a solid theoretical contribution to knowledge compilation by generalizing OBDDs, but its impact is narrower, primarily within the computational complexity and knowledge representation communities. Paper 1's breadth of applications and timeliness give it higher potential impact.

vs. A canonical generalization of OBDD

claude-opus-4.6 · 4/29/2026

RecursiveMAS introduces a novel framework extending recursive computation to multi-agent systems with strong empirical results (8.3% accuracy improvement, significant speedups and token reduction across 9 benchmarks). It addresses the timely and high-impact area of LLM-based multi-agent collaboration, with broad real-world applications spanning mathematics, science, medicine, and code generation. Paper 2 makes a solid theoretical contribution to knowledge compilation by generalizing OBDDs, but its impact is confined to a narrower community in computational complexity and Boolean function representation. The breadth, timeliness, and practical applicability of Paper 1 give it higher potential impact.

vs. Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

gemini-3 · 4/29/2026

Paper 1 introduces a highly novel architectural paradigm by transitioning multi-agent collaboration from text to recursive latent-space computation. This fundamentally addresses efficiency bottlenecks in current MAS, supported by theoretical analysis and strong empirical gains across diverse domains. Paper 2 offers valuable insights into unstructured pruning for test-time scaling, but Paper 1 represents a broader, more innovative shift in how AI systems can be structured and optimized, yielding a wider potential impact on the field of multi-agent systems and model scaling.

vs. Adaptive Prompt Embedding Optimization for LLM Jailbreaking

gemini-3 · 4/29/2026

Paper 1 introduces a foundational scaling principle for multi-agent systems, shifting from text-based to latent-space recursive collaboration. This approach offers broad, cross-disciplinary applications in reasoning, science, and coding, yielding significant efficiency and accuracy gains. In contrast, Paper 2 provides a valuable but more narrowly focused contribution to AI safety and adversarial jailbreaking methodologies. Due to its broader applicability and potential to reshape agentic architectures, Paper 1 exhibits higher potential scientific impact.

vs. ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

gemini-3 · 4/29/2026

Paper 1 introduces a highly novel, broadly applicable recursive multi-agent framework operating in latent space. It demonstrates significant empirical gains across diverse domains (math, science, medicine, coding) along with efficiency improvements and theoretical grounding. In contrast, Paper 2 offers a valuable but niche evaluation protocol specifically tailored to AI-finance. Paper 1's generalizability, methodological breadth, and potential to shift paradigms in multi-agent system scaling give it a much higher potential for widespread scientific impact.

vs. ValueAlpha: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable

claude-opus-4.6 · 4/29/2026

RecursiveMAS introduces a broadly applicable framework for scaling multi-agent systems through recursion in latent space, with strong empirical results across 9 diverse benchmarks, theoretical grounding, and clear practical benefits (accuracy gains, speedups, token reduction). Its breadth of impact spans multiple fields (math, science, medicine, code generation) and addresses a fundamental question about scaling agent collaboration. ValueAlpha, while methodologically rigorous, addresses a narrow niche (LLM-judged investment rationale validation) with contributions primarily relevant to AI-finance evaluation governance, limiting its broader scientific impact.

vs. Action-Aware Generative Sequence Modeling for Short Video Recommendation

gemini-3 · 4/29/2026

Paper 1 introduces a fundamental algorithmic paradigm shift for multi-agent LLM systems via latent-space recursion, impacting diverse AI domains like math, science, and coding. It includes theoretical analyses of learning dynamics. While Paper 2 demonstrates impressive real-world deployment at a massive commercial scale in recommender systems, Paper 1 offers broader foundational scientific implications for AI scaling and autonomous agent architectures, giving it higher potential to influence future core AI research.

vs. From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models

gpt-5.2 · 4/29/2026

Paper 2 likely has higher impact due to a broader, more general contribution: a scalable recursive computation principle for multi-agent systems with a unified latent-space formulation, new training algorithm (inner-outer loop co-optimization with credit assignment), theoretical analysis, and consistent gains across 9 diverse benchmarks plus speed/token-efficiency improvements—strong real-world deployment relevance. Paper 1 is novel in tying interpretability to data selection and shows strong data efficiency, but is more task/fine-tuning specific and depends on interpretability tooling assumptions, likely narrowing breadth and adoption.

vs. SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

gemini-3 · 4/29/2026

Paper 1 introduces a highly novel recursive latent-space framework for multi-agent systems, backed by theoretical analyses of learning dynamics and runtime complexity. Its comprehensive empirical evaluation across 9 diverse benchmarks demonstrates significant and specific improvements in accuracy, inference speed, and token reduction. Paper 2 is also innovative but has a narrower evaluation scope and less concrete performance metrics in its abstract, making Paper 1's potential breadth of impact and methodological rigor stand out.

vs. Auditable Agents

claude-opus-4.6 · 4/29/2026

Paper 2 introduces a novel technical framework (RecursiveMAS) that extends recursive computation to multi-agent systems with concrete empirical gains (8.3% accuracy improvement, significant speedups and token reduction) across 9 benchmarks. It offers a new scaling axis for multi-agent AI with theoretical grounding and practical demonstrations. While Paper 1 addresses the important topic of auditability for LLM agents with a well-structured framework, it is more of a position/systematization paper proposing dimensions and cards rather than introducing a fundamentally new technical method. Paper 2's broader technical contributions and quantitative results suggest higher near-term scientific impact and citation potential.

vs. Emotion Concepts and their Function in a Large Language Model

gemini-3 · 4/29/2026

Paper 1 offers groundbreaking insights into mechanistic interpretability and AI safety by identifying internal emotion representations that causally drive alignment-relevant behaviors like sycophancy and reward hacking. This fundamentally advances our understanding of LLM internals and safety risks. While Paper 2 presents a strong architectural improvement for multi-agent systems with practical efficiency gains, Paper 1's profound implications for understanding artificial cognition and ensuring AI alignment give it higher potential for broad, foundational scientific impact.