Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems

Ye Yu, Heming Liu, Haibo Jin, Xiaopeng Yuan, Peng Kuang, Haohan Wang

Apr 23, 2026

arXiv:2604.21794v1

cs.AI (primary), cs.CL, cs.MA

#159 of 2292 · Artificial Intelligence

Tournament Score: 1528±33 (scale 1050–1800)

Win Rate: 65%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7

Abstract

Multi-agent systems built on large language models have shown strong performance on complex reasoning tasks, yet most work focuses on agent roles and orchestration while treating inter-agent communication as a fixed interface. Latent communication through internal representations such as key-value caches offers a promising alternative to text-based protocols, but existing approaches do not jointly optimize communication with multi-agent reasoning. Therefore we propose DiffMAS, a training framework that treats latent communication as a learnable component of multi-agent systems. DiffMAS performs parameter-efficient supervised training over multi-agent latent trajectories, enabling agents to jointly learn how information should be encoded and interpreted across interactions. Experiments on mathematical reasoning, scientific QA, code generation, and commonsense benchmarks show that DiffMAS consistently improves reasoning accuracy and decoding stability over single-agent inference, text-based multi-agent systems, and prior latent communication methods, achieving 26.7% on AIME24, 20.2% on GPQA-Diamond, and consistent gains across reasoning benchmarks.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: DiffMAS – Learning to Communicate in Multi-Agent Language Systems

1. Core Contribution

DiffMAS proposes treating inter-agent communication in LLM-based multi-agent systems (MAS) as a learnable, differentiable component rather than a fixed interface. The key idea is to use KV (key-value) caches as a continuous latent communication medium between sequential agents, then apply parameter-efficient supervised fine-tuning (LoRA) over multi-agent latent trajectories so that the system jointly learns how to encode and interpret information across agent boundaries. The framework operates in two stages: upstream agents construct a shared KV trace without gradient updates, and then the final agent performs autoregressive decoding with SFT, backpropagating through the accumulated latent trace.
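The non-overwriting trace construction described above can be illustrated with a toy sketch. This is not the authors' implementation: `attend`, `agent_step`, and `build_trace` are hypothetical names, the two-dimensional vectors and the update rule are invented for illustration, and the LoRA training of the final agent is omitted entirely. The sketch only shows that each upstream agent reads the shared trace and appends to it, so earlier entries remain visible to the final agent.

```python
import math

def attend(query, keys, values):
    """Single-head softmax attention of one query over lists of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]

def agent_step(trace, agent_id, n_latent_steps):
    """Hypothetical upstream agent: reads the shared trace and appends its own entries."""
    keys, values = trace
    new_keys, new_values = list(keys), list(values)  # copy; earlier entries are never overwritten
    for step in range(n_latent_steps):
        query = [float(agent_id + 1), float(step + 1)]  # toy stand-in for a hidden state
        read = attend(query, new_keys, new_values)
        new_keys.append(query)
        new_values.append([x + 0.1 * (agent_id + 1) for x in read])  # toy update rule
    return new_keys, new_values

def build_trace(prompt_kv, n_upstream_agents, n_latent_steps):
    """Stage I analogue: run upstream agents in sequence, growing one shared trace."""
    trace = prompt_kv
    for agent_id in range(n_upstream_agents):
        trace = agent_step(trace, agent_id, n_latent_steps)
    return trace

prompt_kv = ([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]])
keys, values = build_trace(prompt_kv, n_upstream_agents=3, n_latent_steps=2)

# Stage II analogue: the final agent attends over the whole accumulated trace.
final_read = attend([1.0, 1.0], keys, values)
```

In this toy, the trace grows from 2 prompt entries to 2 + 3 × 2 = 8 entries, and the prompt entries are unchanged at the end, which is the property the review refers to as non-overwriting concatenation.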

The core novelty lies at the intersection of latent reasoning and multi-agent optimization. While prior work has explored either training-free latent communication (e.g., sharing KV caches directly) or learned communication modules (e.g., Cache-to-Cache), DiffMAS specifically optimizes the communication interface end-to-end with the downstream reasoning task. The formalization of multi-agent communication as a composition of differentiable stage operators with non-overwriting trace concatenation is a clean abstraction.

2. Methodological Rigor

Theoretical grounding: Proposition 3.1 provides an interface-level analysis showing that concatenative (non-overwriting) communication avoids depth-dependent gradient attenuation compared to overwriting communication. This is a relatively straightforward observation—concatenation preserves direct gradient paths by construction—but it is correctly scoped as an interface-level guarantee rather than an end-to-end claim. The authors appropriately acknowledge that attention weights in the decoder can still introduce attenuation.
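A rough way to see the interface-level argument, in notation that is assumed here and is not taken from the paper: let m_k denote the message produced by stage k, K the number of stages, and \mathcal{L} the training loss. This is a sketch of the intuition, not a restatement of Proposition 3.1.

```latex
% Overwriting interface (assumed): m_k = f_k(m_{k-1}), and the loss sees only m_K.
\frac{\partial \mathcal{L}}{\partial m_1}
  = \frac{\partial \mathcal{L}}{\partial m_K}
    \prod_{k=2}^{K} \frac{\partial m_k}{\partial m_{k-1}},
\qquad
\left\lVert \frac{\partial m_k}{\partial m_{k-1}} \right\rVert \le \rho < 1
\;\Longrightarrow\;
\left\lVert \frac{\partial \mathcal{L}}{\partial m_1} \right\rVert
  \le \rho^{K-1} \left\lVert \frac{\partial \mathcal{L}}{\partial m_K} \right\rVert .

% Concatenative interface (assumed): m_k = f_k(m_1, \dots, m_{k-1}),
% and the loss sees the whole trace, \mathcal{L} = \ell(m_1, \dots, m_K).
\frac{d \mathcal{L}}{d m_1}
  = \underbrace{\frac{\partial \ell}{\partial m_1}}_{\text{direct path, independent of } K}
  + \sum_{k=2}^{K} \frac{d \mathcal{L}}{d m_k}\, \frac{\partial m_k}{\partial m_1} .
```

Under the overwriting interface every gradient path to m_1 passes through K-1 Jacobians, so contractive stages shrink it geometrically with depth. Under concatenation there is one term that involves no intermediate Jacobians at all. That direct term can still be small if the decoder's attention assigns little weight to m_1, which matches the caveat about attention weights noted above.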

Experimental design: The evaluation spans five model scales (4B to 32B), six benchmarks across math, science, code, and commonsense reasoning, and four baselines (single-agent, TextMAS, LatentMAS, C2C). This breadth is commendable. However, several methodological concerns arise:

Training data leakage concerns: Training on 210 samples from Hendrycks Math and then evaluating on AIME is reasonable (different distribution), but the 50 samples used for code training are drawn from HumanEval, the same benchmark used for evaluation (the authors mention "excluding the 50 training samples"). Even with those samples excluded from the test set, training and evaluation problems come from the same source, which warrants scrutiny.

Small evaluation sets: AIME 2024 and AIME 2025 contain only 30 problems each. A 20-percentage-point improvement translates to 6 additional correct answers, which makes the results susceptible to high variance. The self-consistency analysis (4 samples per problem) partially addresses this but does not resolve the statistical significance concerns.
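The variance concern can be made concrete with a short calculation. The sketch below assumes a single pass over 30 problems, so that the reported 26.7% on AIME24 corresponds to 8 of 30 correct. It uses the standard Wilson score interval for a binomial proportion; the function names are illustrative and nothing here comes from the paper's own analysis.

```python
import math

def extra_correct(gain_points, n_problems):
    """Number of additional problems solved for a gain given in percentage points."""
    return round(gain_points / 100 * n_problems)

def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95% coverage)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Assumption: 26.7% on a 30-problem set means 8 correct answers in one pass.
low, high = wilson_interval(8, 30)
print(f"20 points on 30 problems = {extra_correct(20, 30)} problems")
print(f"95% Wilson interval for 8/30: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 14% to 44%, wide enough that differences of a few problems between methods on a single 30-problem set are hard to separate from noise.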

C2C baseline fairness: The authors acknowledge C2C was trained on OpenHermes-2.5 (instruction-following data), creating a distribution mismatch. This makes the comparison somewhat unfair—C2C's catastrophic failures (0% on AIME) likely reflect this mismatch rather than fundamental limitations of the approach.

Gradient flow claim: While DiffMAS is described as enabling end-to-end gradient flow, Stage I operates without gradients. Only the final agent's LoRA parameters are updated, with gradients flowing through the KV trace. This is a more limited form of end-to-end optimization than initially suggested.

3. Potential Impact

The paper addresses a genuine gap: most MAS research treats communication as a fixed protocol, and making it learnable is a natural and important direction. The practical implications include:

Efficiency: Remarkably small training sets (50-700 samples) yield meaningful improvements, suggesting the approach is practical for resource-constrained settings.

Generality: Consistent gains across diverse tasks and model scales (4B-32B) suggest broad applicability.

Decoding stability: The perplexity and entropy analyses demonstrate that DiffMAS doesn't just improve accuracy but makes multi-agent reasoning more reliable—critical for deployment.

However, the impact may be limited by several factors: the approach requires all agents to share the same model architecture (for KV cache compatibility), it's currently restricted to sequential pipelines, and the training only updates the final agent's parameters. Extending to heterogeneous agents, non-sequential topologies, or full multi-agent gradient propagation would significantly increase impact.

4. Timeliness & Relevance

This work is highly timely. Multi-agent LLM systems are rapidly gaining adoption (AutoGen, MetaGPT, ChatDev), and the field is transitioning from prompt engineering to systematic optimization. The observation that communication itself should be optimized—not just agent capabilities—fills an important conceptual gap. The concurrent emergence of latent reasoning research (Quiet-STaR, COCONUT) makes this a natural convergence point.

5. Strengths & Limitations

Strengths:

Clean formalization of multi-agent communication as differentiable stage operators

Impressive data efficiency: meaningful gains from 50-210 training samples

Comprehensive analysis suite (perplexity, self-consistency, entropy dynamics, ablations)

Strong ablation separating "learning to solve" from "learning to communicate" (Table 4)

The StitchMAS ablation (Table 5) effectively isolates the contribution of continuous vs. stitched KV traces

Communication step ablation revealing that 10 steps suffice is practically useful

Limitations:

Only the final agent is trained; upstream agents generate KV states without adaptation, limiting the "joint optimization" claim

Restricted to homogeneous agent architectures sharing the same backbone

Sequential pipeline only; no exploration of parallel or graph-structured topologies

Statistical significance is unclear on small benchmarks (30-problem AIME sets)

The framework's dependence on KV cache compatibility limits applicability to heterogeneous model ecosystems

No comparison with reinforcement learning-based multi-agent optimization (e.g., MALT)

The case study, while illustrative, cherry-picks a single example

Overall Assessment

DiffMAS presents a well-motivated and cleanly executed contribution to an important emerging problem. The core idea of making latent communication learnable is sound, and the empirical results are promising across a good range of benchmarks. The paper's main limitation is that its "end-to-end" claims somewhat overstate the actual optimization scope (only final-agent LoRA is trained). The work opens interesting directions for fully differentiable multi-agent systems but represents an incremental step rather than a paradigm shift. The small evaluation sets and limited statistical analysis somewhat weaken confidence in the reported gains.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.8 · Novelty 6.5 · Clarity 7

Generated Apr 24, 2026

Comparison History (43)

vs. Bias by Necessity: Impossibility Theorems for Sequential Processing with Convergent AI and Human Validation

gemini-3.1 · 5/16/2026

Paper 1 establishes fundamental mathematical theorems linking AI architectural constraints to human cognitive biases, bridging computer science, psychology, and cognitive science. Its theoretical depth, impossibility theorems, and cross-disciplinary validation offer broader foundational impact than Paper 2, which presents a valuable but narrower methodological improvement in multi-agent LLM systems.

vs. On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

claude-opus-4.6 · 5/16/2026

Paper 1 (FATE) addresses a critical and timely problem—safety alignment for tool-using LLM agents—with a novel framework that tackles trajectory-level failures rather than just response-level issues. Its Pareto-front optimization for balancing safety and utility is methodologically innovative. The problem has immediate real-world implications as agents are deployed in high-stakes settings. Paper 2 (DiffMAS) proposes interesting latent communication optimization for multi-agent systems, but addresses a less urgent problem with more incremental gains. FATE's broader safety implications, novel self-evolution approach without expert demonstrations, and strong empirical results across multiple safety benchmarks give it higher potential impact.

vs. What You Think is What You See: Driving Exploration in VLM Agents via Visual-Linguistic Curiosity

gpt-5.2 · 5/6/2026

Paper 1 likely has higher impact due to a broadly applicable, novel training framework (DiffMAS) that makes inter-agent communication itself learnable via latent trajectories, potentially affecting many multi-agent LLM settings beyond any single environment. It reports concrete, strong benchmark gains on widely watched reasoning tasks (e.g., AIME24, GPQA), suggesting timely relevance and easier downstream adoption. Paper 2 (GLANCE) is innovative for VLM exploration and intrinsic motivation, but its impact is more domain-specific to partially observable RL/embodied tasks and may depend more on environment design and RL stability.

vs. When Agents Evolve, Institutions Follow

gpt-5.2 · 5/1/2026

Paper 2 is likely higher impact due to a more novel, generalizable methodological contribution: end-to-end learnable latent inter-agent communication (DiffMAS) with parameter-efficient training over multi-agent trajectories. This directly advances the core bottleneck of multi-agent LLM systems—communication bandwidth and coordination—and is broadly applicable across tasks and model families, with strong benchmark evidence on diverse, timely evaluations (math, QA, code, commonsense). Paper 1 offers an insightful governance-inspired design space and useful empirical comparisons, but it is largely an architecture selection/benchmarking framework rather than a new trainable mechanism, potentially limiting transformative impact.

vs. Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control

gemini-3 · 4/30/2026

Paper 1 introduces a novel, fundamental methodological advancement by treating multi-agent latent communication as a learnable component, shifting away from fixed text-based protocols. This has broad implications for improving complex reasoning capabilities across diverse AI domains. In contrast, Paper 2 presents an important but narrower benchmarking study on safety for a specific application (robotic health attendants). The foundational architectural innovation and strong empirical results on difficult benchmarks (AIME, GPQA) in Paper 1 give it a higher potential for widespread scientific impact across the broader AI research community.

vs. Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

claude-opus-4.6 · 4/30/2026

Paper 2 introduces DiffMAS, a novel training framework that addresses a fundamental gap in multi-agent LLM systems by making inter-agent communication a learnable, jointly optimized component via latent representations. This is a broadly applicable methodological contribution with clear benchmarks across multiple domains. Paper 1, while impressive in its real-world deployment scale and engineering rigor, is more of a systems/engineering case study specific to onchain trading agents. Paper 2's contribution to differentiable multi-agent communication has broader theoretical and practical implications across the AI research community.

vs. Operating-Layer Controls for Onchain Language-Model Agents Under Real Capital

gemini-3 · 4/30/2026

Paper 1 offers an unprecedented, large-scale empirical study of autonomous agents deploying real capital. Its focus on system-level reliability and the 'operating layer' exposes critical real-world failure modes absent in standard text benchmarks. This real-world grounding and massive scale ($20M volume) provide highly actionable insights for safe agent deployment, giving it greater potential impact than Paper 2's methodological improvements on standard reasoning benchmarks.

vs. Trustworthy Clinical Decision Support Using Meta-Predicates and Domain-Specific Languages

claude-opus-4.6 · 4/26/2026

Paper 1 addresses a fundamental challenge in multi-agent LLM systems—jointly optimizing inter-agent communication—with broad applicability across reasoning tasks. The DiffMAS framework introduces a novel paradigm (learnable latent communication) with strong empirical results across multiple benchmarks. Given the explosive growth of LLM-based multi-agent systems, this work is highly timely and could influence a large research community. Paper 2, while valuable for regulatory compliance in clinical AI, addresses a narrower niche (epistemological validation in clinical DSLs) with a single demonstration domain, limiting its breadth of impact.

vs. Mind the Prompt: Self-adaptive Generation of Task Plan Explanations via LLMs

gpt-5.2 · 4/26/2026

Paper 2 has higher likely impact due to a more broadly applicable, end-to-end learning framework (DiffMAS) that upgrades a core bottleneck in multi-agent LLM systems—communication—via jointly optimized latent protocols. It reports strong, benchmarked gains across multiple high-salience tasks (math, scientific QA, code, commonsense), suggesting immediate relevance and adoption potential across AI subfields. Paper 1 is novel in modeling prompt adaptation with a POMDP and cognitive states, but appears more domain- and system-specific (task planning explanations) and proof-of-concept, with narrower breadth despite clear real-world utility in human-centered AI.

vs. The CriticalSet problem: Identifying Critical Contributors in Bipartite Dependency Networks

claude-opus-4.6 · 4/26/2026

Paper 2 addresses a timely and broadly impactful problem—optimizing communication in LLM-based multi-agent systems—which sits at the intersection of rapidly growing fields (LLMs, multi-agent AI, emergent communication). Its end-to-end differentiable framework for latent communication is novel and has wide applicability across reasoning, QA, and code generation. Paper 1 makes solid contributions to graph mining with strong theoretical grounding, but addresses a more niche problem. Paper 2's relevance to the current AI landscape and potential to influence how multi-agent LLM systems are designed gives it higher expected impact.

vs. Time, Causality, and Observability Failures in Distributed AI Inference Systems

claude-opus-4.6 · 4/26/2026

Paper 1 (DiffMAS) introduces a novel framework for jointly optimizing latent communication in multi-agent LLM systems, addressing a fundamental gap in how agents share information. It demonstrates consistent improvements across diverse benchmarks and has broad applicability to the rapidly growing field of multi-agent AI systems. Paper 2 makes a valid but relatively narrow observation about clock skew in distributed inference observability—an important systems engineering concern but with limited novelty and narrower impact scope compared to the paradigm-shifting potential of learnable inter-agent communication.

vs. SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

claude-opus-4.6 · 4/26/2026

Paper 2 (DiffMAS) introduces a more fundamentally novel concept—jointly optimizing latent communication in multi-agent LLM systems—which opens a new research direction at the intersection of multi-agent systems and representation learning. While Paper 1 (SAT) addresses the important but incremental problem of reasoning efficiency with a well-engineered FSM-based pruning framework, Paper 2 challenges a core assumption (fixed text-based communication) and proposes end-to-end differentiable multi-agent training. This has broader implications for how multi-agent AI systems are designed and trained, potentially impacting diverse fields beyond reasoning efficiency.

vs. Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

claude-opus-4.6 · 4/26/2026

Paper 1 introduces DiffMAS, a novel framework for jointly optimizing latent communication in multi-agent LLM systems—a largely unexplored direction with broad applicability across reasoning tasks. It demonstrates consistent empirical improvements across multiple benchmarks, suggesting strong methodological contribution. Paper 2 provides valuable critique of LLM-as-judge methodology for disinformation evaluation, but is more narrowly scoped as an audit study. Paper 1's technical innovation in differentiable multi-agent communication has greater potential to influence future research directions in AI systems design and multi-agent reasoning.

vs. Beyond Surface Judgments: Human-Grounded Risk Evaluation of LLM-Generated Disinformation

gemini-3 · 4/26/2026

Paper 2 introduces a fundamental methodological innovation by shifting multi-agent communication from fixed text interfaces to learnable latent representations. End-to-end optimization of these systems addresses a major bottleneck in current multi-agent architectures, and the strong empirical gains on rigorous benchmarks (AIME24, GPQA) suggest broad applicability. While Paper 1 provides valuable critical insights into LLM evaluation, Paper 2 offers a foundational advancement in AI capability and system design with wider transformative potential across the field.

vs. Tool Attention Is All You Need: Dynamic Tool Gating and Lazy Schema Loading for Eliminating the MCP/Tools Tax in Scalable Agentic Workflows

gpt-5.2 · 4/26/2026

Paper 1 is more scientifically novel and broadly impactful: it introduces an end-to-end trainable latent communication framework for multi-agent LLMs and validates across diverse, standard reasoning benchmarks with measurable accuracy gains. This advances core methodology (learning communication protocols) likely relevant to multi-agent learning, representation learning, and efficient inference. Paper 2 targets an important engineering bottleneck (tool/schema token overhead) with a practical middleware design, but its evaluation is largely simulated and relies on projected downstream metrics rather than demonstrated end-to-end agent improvements, reducing methodological rigor and general scientific contribution.

vs. Propensity Inference: Environmental Contributors to LLM Behaviour

claude-opus-4.6 · 4/26/2026

Paper 2 (DiffMAS) introduces a novel, concrete technical framework for optimizing latent communication in multi-agent LLM systems, addressing a clear gap in the field. It demonstrates strong empirical results across multiple benchmarks and opens a new research direction (differentiable multi-agent communication). Paper 1 provides valuable methodological contributions for AI safety evaluation but is more incremental and observational in nature, with findings that are somewhat inconclusive (approximately equal contributions, no clear trends). Paper 2's approach has broader applicability across NLP tasks and is more likely to inspire follow-up work in multi-agent systems.

vs. Propensity Inference: Environmental Contributors to LLM Behaviour

claude-opus-4.6 · 4/26/2026

Paper 1 introduces DiffMAS, a novel framework for jointly optimizing latent communication in multi-agent LLM systems—a largely unexplored direction with broad applicability across reasoning tasks. It demonstrates consistent empirical gains on multiple benchmarks, addressing a fundamental architectural limitation. Paper 2 makes valuable methodological contributions to AI safety evaluation but is more incremental and narrower in scope, primarily offering observational analysis rather than a new capability. Paper 1's potential to reshape how multi-agent LLM systems are designed and trained gives it broader and more immediate scientific impact.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

gemini-3 · 4/26/2026

Paper 1 introduces a paradigm-shifting approach by replacing fixed text-based communication in multi-agent LLMs with learnable latent representations (DiffMAS). This foundational algorithmic breakthrough has broad theoretical and empirical implications for AI architecture, demonstrating significant performance gains. While Paper 2 offers a crucial methodological tool for mitigating bias in applied LLM analysis, Paper 1 represents a more fundamental advancement in AI capabilities and system design, likely leading to broader technological impact.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

gemini-3 · 4/26/2026

Paper 1 addresses a fundamental epistemological bottleneck in AI-assisted research: distinguishing between data-driven inference and parametric memory bias. By introducing 'epistemic blinding,' it provides a broadly applicable, methodology-improving protocol that enhances auditability and trust across fields like biology and finance. While Paper 2 offers a strong architectural advancement for multi-agent systems, Paper 1 has higher cross-disciplinary impact by directly safeguarding the scientific method and reproducibility in the rapidly growing domain of LLM-assisted analysis.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

gemini-3 · 4/26/2026

Paper 1 introduces a conceptual paradigm shift in AI safety through unsupervised monitoring, addressing the critical limitation of relying on predefined rules for detecting novel misbehaviors. Its practical demonstration of discovering a previously unknown benchmark vulnerability highlights its immediate real-world utility and broad implications for AI evaluation and safety, giving it a broader potential impact than the architectural optimizations in Paper 2.