MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

Yao Guan, Lin Wang, Zhihu Lu, Ziyi Wang, Wenzhu Yan, Qiang Duan

Jun 1, 2026

arXiv:2606.02359v1

cs.AI (primary)

#2018 of 3355 · Artificial Intelligence

Tournament Score: 1380 ± 44 (axis range 1050 to 1800)

Win Rate: 42%

Rating: 5.5 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

Despite the remarkable progress of Large Language Model (LLM) based Multi-Agent Systems, most research focuses on optimizing coordination topology while largely underexploring the equally critical problem: how to transmit and optimize messages among agents effectively? Current communication schemes typically rely on the direct concatenation of first-order neighbor responses, which induces a restricted evidence receptive field and leads to the dilution of crucial insights over multi-hop paths. To address these limitations, we propose the Multi-Order Communication (MOC) scheme, which reconstructs the inter-agent communication to capture multi-hop dependencies and incorporates a structural message consolidation strategy to ensure efficiency. Specifically, we formalize the communication mechanism to construct a structured multi-order evidence stream, and subsequently design a Semantic-Topological Merging algorithm to optimize semantic fidelity within token constraints. Extensive experiments across six diverse datasets and LLM backbones of varying parameter scales demonstrate that MOC consistently improves task performance and reduces communication costs. The code is available at https://github.com/yao-guan/MOC.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MOC: Multi-Order Communication in LLM-based Multi-Agent Systems

1. Core Contribution

MOC addresses a specific gap in LLM-based multi-agent systems (MAS): while prior work extensively optimizes *which* agents connect to each other (topology), the *how* of message transmission along those connections has remained primitive—typically just concatenating first-order neighbor outputs. MOC proposes two key innovations: (1) a multi-order evidence stream that exposes target agents to raw upstream responses from multiple hop distances (not just immediate neighbors), organized by topological order; and (2) a Semantic-Topological Merging algorithm that consolidates redundant messages to manage context length growth.

The analogy to MixHop in GNNs is well-drawn: just as MixHop concatenates multi-hop aggregated representations, MOC surfaces multi-hop raw messages. The key insight is that intermediate agents' paraphrasing and summarization inevitably attenuate information ("semantic attenuation"), and providing direct access to upstream sources can mitigate this.

2. Methodological Rigor

Formalization. The paper provides a clean graph-theoretic formalization of intra-round MAS communication via DAGs, topological execution ordering, and k-hop reachability via adjacency matrix powers. The framework is well-structured and the notation is consistent.
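To make the k-hop reachability idea concrete, here is a minimal sketch in plain Python. It assumes an adjacency matrix where `adj[i][j] = 1` means agent `i` sends to agent `j`, and it assigns each upstream agent to the smallest hop distance at which it reaches the target. The function names and the smallest-distance grouping rule are illustrative assumptions, not the paper's exact construction.

```python
from typing import Dict, List

Matrix = List[List[int]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square integer matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def k_hop_sources(adj: Matrix, max_order: int) -> List[Dict[int, List[int]]]:
    """For each target agent j, map hop distance k to the upstream agents
    first reached at exactly k hops, for k = 1..max_order.

    (A^k)[i][j] > 0 means at least one length-k path from i to j exists.
    """
    n = len(adj)
    power = [row[:] for row in adj]          # A^1
    seen = [set() for _ in range(n)]         # agents already assigned per target
    stream: List[Dict[int, List[int]]] = [dict() for _ in range(n)]
    for k in range(1, max_order + 1):
        for j in range(n):
            new = [i for i in range(n) if power[i][j] > 0 and i not in seen[j]]
            if new:
                stream[j][k] = new
                seen[j].update(new)
        power = mat_mul(power, adj)          # advance to A^(k+1)
    return stream


# Toy DAG (hypothetical): 0 -> 1 -> 2 -> 3, plus a shortcut 0 -> 2.
adj = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(k_hop_sources(adj, max_order=2)[3])  # {1: [2], 2: [0, 1]}
```

With first-order communication only, agent 3 would see just agent 2's output. With K=2 it also receives the raw responses of agents 0 and 1, which is the wider evidence receptive field the paper argues for.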

Consolidation Strategy. The Semantic-Topological Merging algorithm uses embedding-based similarity to identify redundant message pairs, applies forward-merging (anchoring at the topologically later message), and distills merged content via a lightweight model with candidate selection optimizing cosine similarity to originals. The topology-aware budget constraint (Eq. 16) and batch-wise ε-approximate merging are practical design choices.
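The merging steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: messages are assumed to be listed in topological order with precomputed embeddings, the similarity threshold is a hypothetical parameter, pairing is greedy, and the budget constraint (Eq. 16) and the distillation model itself are omitted. All function names are invented for this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def redundant_pairs(embs: List[Vector], threshold: float) -> List[Tuple[int, int, float]]:
    """Pairs (earlier, later, similarity) at or above the threshold,
    most similar first. Indices follow topological order."""
    pairs = [(i, j, cosine(embs[i], embs[j]))
             for i in range(len(embs)) for j in range(i + 1, len(embs))]
    return sorted((p for p in pairs if p[2] >= threshold), key=lambda p: -p[2])


def forward_merge_plan(embs: List[Vector], threshold: float) -> Dict[int, int]:
    """Greedy plan mapping each absorbed (earlier) message to its anchor
    (the topologically later message). Each message is used at most once."""
    used, plan = set(), {}
    for i, j, _ in redundant_pairs(embs, threshold):
        if i not in used and j not in used:
            plan[i] = j
            used.update((i, j))
    return plan


def pick_candidate(candidates: List[Vector], originals: List[Vector]) -> int:
    """Index of the distilled candidate whose embedding has the highest
    mean cosine similarity to the original messages."""
    scores = [sum(cosine(c, o) for o in originals) / len(originals)
              for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy 2-d embeddings (hypothetical); real ones would come from an encoder.
embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(forward_merge_plan(embs, threshold=0.95))  # {0: 1}
```

In this toy case message 0 is folded into message 1, the later of the two near-duplicates, while message 2 is left untouched. `pick_candidate` would then choose among several distilled rewrites of the merged pair.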

Experimental Design. The evaluation covers six datasets across three task categories (general reasoning, math, code), three LLM backbones of different scales (27B, 32B, 685B), and multiple edge densities. This breadth is commendable. However, several concerns arise:

The improvements, while consistent, are often modest (e.g., 0.38-0.82% on DeepSeek-V3.2, <1% on several Gemma-2-27B configurations at higher densities). Statistical significance is not reported, and with relatively small test samples (e.g., 254 for AQuA, 285 for MMLU), some gains may fall within noise margins.

The number of agents is fixed at 7 for most experiments, limiting understanding of scalability behavior.

The random DAG backbone is a reasonable starting point but somewhat artificial; integration with G-Designer is shown but only on three datasets.

The distillation model (gemma2-9B) introduces non-trivial computational overhead (Table 3 shows distillation dominates runtime at ~80-83s), which somewhat undermines the "efficiency" narrative.
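On the statistical-significance concern raised above, a back-of-the-envelope check shows why sub-1% gains on a few hundred test items are hard to interpret. The sketch below uses the normal approximation to the binomial for a single accuracy estimate. The accuracy levels are hypothetical placeholders, not values from the paper; only the sample sizes (254 and 285) come from the review.

```python
import math


def ci_half_width(accuracy: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation confidence interval
    (95% for z = 1.96) for an accuracy measured on n items."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


for name, n in [("AQuA", 254), ("MMLU", 285)]:
    for acc in (0.5, 0.8, 0.9):  # hypothetical accuracy levels
        width = 100 * ci_half_width(acc, n)
        print(f"{name} n={n} acc={acc:.2f}: +/- {width:.1f} points")
```

Even at an assumed 90% accuracy on 254 items, the half-width is roughly 3.7 percentage points, well above the sub-1% gains cited. A paired test on per-item outcomes (for example, McNemar's test) could detect smaller differences, but it would require per-item results that this review does not have.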

3. Potential Impact

Practical Relevance. As LLM-based MAS grow in complexity with more agents and deeper coordination chains, the communication bottleneck becomes real. MOC's approach of treating communication as a first-class design dimension—separate from topology—is a useful conceptual contribution that could influence how MAS frameworks are architected.

Generalizability. MOC is designed as a plug-in module compatible with any DAG-based MAS topology, demonstrated across random DAGs and task-adaptive topologies. This modularity enhances its practical adoption potential.

Limitations of Impact. The reliance on a separate distillation model for merging adds infrastructure complexity. The fixed K=2 as the "robust default" raises questions about adaptivity. The paper acknowledges that adaptive K selection is future work, but this seems critical for real deployment. Furthermore, the improvements on state-of-the-art models (DeepSeek-V3.2) are quite small, suggesting diminishing returns as base models become stronger.

4. Timeliness & Relevance

The paper addresses a timely topic. LLM-based MAS have proliferated rapidly (CAMEL, AutoGen, MetaGPT, etc.), and the community has indeed focused primarily on topology optimization while treating communication as a solved problem. The observation that naive concatenation of neighbor outputs loses multi-hop information is valid and practically relevant. The growing interest in agentic AI systems makes communication scheme optimization an increasingly important research direction.

5. Strengths & Limitations

Key Strengths:

Clean problem identification: distinguishing topology design from communication scheme design is a useful framing.

Principled formalization grounded in graph theory with clear connections to GNN message passing.

Comprehensive evaluation across datasets, models, and densities.

The consolidation operator demonstrating token reduction below Vanilla MAS (Figure 3b at 20 agents) is a compelling practical result.

Code availability enhances reproducibility.

Notable Weaknesses:

Improvements are often marginal, especially on stronger models and denser topologies, without statistical significance testing.

The overhead from the distillation model (separate LLM calls for merging) partially offsets the communication efficiency gains; total wall-clock time analysis is incomplete.

K is not adaptive—fixed at 2—limiting the framework's claimed generality.

The paper lacks comparison with other communication optimization approaches (e.g., "Cut the Crap" (Zhang et al., 2025a) is cited but not compared against experimentally).

The "theoretical formalization" contribution is somewhat overstated—it is primarily notation and definitions rather than theorems with formal guarantees.

No analysis of when/why MOC fails (e.g., the SVAMP degradation at ρ=1.0) or what properties of tasks benefit most from multi-order communication.

The evaluation uses relatively small subsets of standard benchmarks, which may limit the reliability of reported improvements.

Summary

MOC makes a reasonable contribution by formalizing and improving the communication scheme in LLM-based MAS, an underexplored dimension alongside topology design. The multi-order evidence exposure idea is intuitive and well-motivated, and the consolidation strategy is practical. However, the empirical improvements are often modest and statistical rigor is lacking, the computational overhead of the merging process is non-trivial, and the fixed communication order limits adaptivity. The paper is a solid incremental contribution to the MAS communication literature but falls short of being transformative.

Rating: 5.5 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated Jun 2, 2026

Comparison History (19)

vs. A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems

claude-opus-4.6 · 6/5/2026

AbaqusAgent addresses a concrete, high-impact application by automating finite element analysis through multi-agent LLMs, directly lowering barriers in computational mechanics education and engineering practice. It validates on 50 real problems with 86% success rate, demonstrating practical utility. While Paper 1 (MOC) makes a solid contribution to multi-agent communication theory with broad applicability, Paper 2 opens a new paradigm for human-simulation interaction in engineering, has clearer real-world applications, and bridges AI with an established domain (FEA/solid mechanics), giving it broader interdisciplinary impact and immediate practical relevance.

vs. BiNSGPS: Geometry Problem Solving via Bidirectional Neuro-Symbolic Interaction

gemini-3.1 · 6/5/2026

Paper 1 addresses a fundamental bottleneck in LLM-based Multi-Agent Systems (communication efficiency and multi-hop dependencies). Its proposed scheme has broad applicability across numerous domains utilizing multi-agent setups. While Paper 2 offers an innovative neuro-symbolic approach, its direct impact is currently confined to geometry problem solving. Thus, Paper 1 demonstrates greater breadth of impact, generalizability, and potential for widespread adoption in a rapidly expanding field.

vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

gemini-3.1 · 6/3/2026

Paper 1 addresses a fundamental and pervasive challenge in modern LLMs: following complex, multi-constraint instructions. By formalizing this as a Constraint Adherence Problem and utilizing a novel knowledge graph-based bridging method, it offers a highly innovative solution to a widespread limitation. While Paper 2 presents strong methodological advancements for multi-agent systems, Paper 1's focus on core reasoning and instruction-following capabilities promises broader immediate applicability and impact across almost all domains utilizing large language models.

vs. Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

gemini-3.1 · 6/3/2026

Paper 2 addresses a fundamental bottleneck in LLM-based multi-agent systems (communication efficiency and multi-hop dependencies), offering a generalizable solution applicable across numerous domains. Paper 1, while providing a rigorous and valuable benchmark, is highly specialized to financial reasoning. The broad applicability and foundational nature of Paper 2's methodological advancements give it a higher potential for widespread scientific impact across the AI research community.

vs. Reasoning Structure of Large Language Models

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a novel framework for analyzing reasoning structure in LLMs by converting unstructured traces into verifiable reasoning graphs, defining new efficiency metrics that reveal insights hidden by standard accuracy/token metrics. This addresses a fundamental gap in LLM evaluation methodology with broad applicability across the field. Paper 1, while solid, is more incremental—improving multi-agent communication via multi-hop message passing is a narrower contribution. Paper 2's structural analysis tools have potential to become widely adopted diagnostic instruments for the rapidly growing reasoning model ecosystem.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

gpt-5.2 · 6/3/2026

Paper 1 introduces a novel, modality-aligned supervision mechanism (Imaginative Perception Tokens) targeting a well-known weakness of VLMs: spatial reasoning under partial observability. It contributes new task formulations plus ~20K examples with explicit intermediate “imagination” ground truth, enabling reproducible study and likely follow-on benchmarks. The approach is broadly applicable to robotics, embodied AI, AR/VR, and navigation, and its finding that text CoT can harm spatial tasks is timely and influential. Paper 2 is useful for multi-agent LLM systems, but is closer to an engineering refinement of messaging schemes with narrower cross-field impact.

vs. EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors

claude-opus-4.6 · 6/2/2026

EVA-Net introduces a novel cross-modal framework using video as semantic priors for EEG decoding, addressing a critical challenge in BCI systems (subject-independent generalization). This tackles a fundamental bottleneck in neural engineering with clear real-world applications (assistive devices, neurorehabilitation). The approach is methodologically innovative—using video instead of text for dynamic motor process alignment—and demonstrates significant performance gains (8.66% LOSO accuracy improvement). Paper 2 proposes an incremental communication optimization for LLM multi-agent systems, which is useful but more narrow and builds on a rapidly-shifting landscape where architectural innovations quickly become obsolete.

vs. Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

gemini-3.1 · 6/2/2026

Paper 1 addresses a critical and widespread bottleneck in Retrieval-Augmented Generation (RAG)—handling semi-structured data—which has massive real-world applications in enterprise and e-commerce. By introducing a novel dual-view framework combining symbolic and semantic retrieval alongside a new benchmark dataset, it offers higher immediate utility and broader applicability across industries compared to the communication optimization in multi-agent systems presented in Paper 2.

vs. An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

gemini-3.1 · 6/2/2026

Paper 1 addresses a fundamental limitation in the core reasoning capabilities of frontier large language models, revealing a critical 'production-evaluation gap' and confirmation bias. This insight has profound implications for AI safety, alignment, and future training paradigms (like RLHF). Paper 2, while offering a solid methodological improvement for multi-agent communication efficiency, has a narrower scope and application. Paper 1's findings challenge dominant assumptions about LLM reasoning, giving it broader relevance and higher potential for widespread scientific impact across the AI community.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gpt-5.2 · 6/2/2026

Paper 2 likely has higher impact: it identifies a general, underappreciated failure mode (misalignment between confidence-based decoding/training and reasoning trajectories) in masked diffusion language models, with clear evidence across multiple reasoning tasks and an actionable takeaway (random masking is more robust for the challenging tail). This has broad implications for inference policies, training objectives, and evaluation of non-autoregressive LMs, and is timely as diffusion/MDM approaches gain adoption. Paper 1 is useful and engineering-relevant, but more incremental within multi-agent LLM communication design.

vs. LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

gpt-5.2 · 6/2/2026

Paper 2 (MOC) likely has higher impact: it targets a broadly relevant, fast-growing area (LLM multi-agent systems) and introduces a general communication paradigm (multi-order evidence + semantic-topological merging) applicable across tasks, datasets, and backbones, with direct implications for scalability, efficiency, and robustness of agentic systems. Paper 1 (LFQ) is valuable but more specialized—an incremental PTQ objective/coverage improvement focused on final-block/logit alignment for generation quality—primarily impacting deployment efficiency rather than expanding system capabilities across fields.

vs. Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

gpt-5.2 · 6/2/2026

Paper 1 introduces a generally applicable communication/aggregation scheme for LLM multi-agent systems (multi-hop evidence + constrained merging), likely reusable across many agent architectures and tasks, with broad downstream impact in AI systems and coordination. It reports extensive experiments across multiple datasets and model scales and offers code, strengthening rigor and adoption. Paper 2 is timely and valuable for a specific biocuration bottleneck and provides a strong benchmark-based evaluation, but its core contribution is primarily an application/assessment of existing frontier LLMs in one domain, with impact more localized and potentially sensitive to rapid model updates.

vs. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

claude-opus-4.6 · 6/2/2026

SAAS addresses a highly practical and timely problem—over-search in agentic LLM systems—that directly impacts inference costs and latency at scale. Its RL-based framework for cultivating self-awareness in agents is more novel and broadly applicable, as agentic search is a rapidly growing paradigm. The three-component design (boundary modeling, boundary-aware rewards, stage-wise optimization) offers a principled methodology. While MOC makes solid contributions to multi-agent communication, it addresses a more incremental improvement in message passing topology. SAAS's focus on computational efficiency and self-regulation aligns with urgent industry needs, giving it broader real-world impact potential.

vs. S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

gpt-5.2 · 6/2/2026

Paper 1 likely has higher scientific impact due to its novelty in addressing an underexplored bottleneck (message transmission/optimization) in LLM multi-agent systems, a rapidly growing and broadly applicable area. Its method (multi-hop structured evidence + semantic-topological merging under token constraints) is general-purpose, evaluated across multiple datasets and LLM scales, and accompanied by code—supporting rigor and adoption. Paper 2 is methodologically solid and impactful for energy scheduling under uncertainty, but its evaluation appears more domain-specific and narrower in cross-field reach compared to advances in LLM-agent communication.

vs. TravelEval: A Comprehensive Benchmarking Framework for Evaluating LLM-Powered Travel Planning Agents

claude-opus-4.6 · 6/2/2026

Paper 2 (MOC) addresses a fundamental and generalizable problem in LLM-based multi-agent systems—how agents communicate effectively—which has broad applicability across many domains. Its multi-order communication scheme with semantic-topological merging is a novel contribution that can improve any multi-agent system. Paper 1 (TravelEval), while thorough, is a domain-specific benchmark for travel planning with narrower impact. MOC's cross-domain applicability, demonstrated improvements across six datasets and multiple LLM scales, and its foundational contribution to multi-agent communication give it higher potential for broad scientific impact.

vs. PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

gemini-3.1 · 6/2/2026

While Paper 2 presents a highly innovative neuro-symbolic approach for physics diagrams, Paper 1 addresses a fundamental bottleneck in LLM-based multi-agent systems. By optimizing multi-hop message transmission and merging, Paper 1 offers a domain-agnostic, scalable improvement. Because multi-agent LLM architectures are being adopted across virtually all AI applications, overcoming their communication inefficiencies yields a much broader scientific and practical impact compared to the niche, domain-specific generation tasks tackled in Paper 2.

vs. Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems

gpt-5.2 · 6/2/2026

Paper 2 likely has higher impact: it targets a core blocker for real-world LLM deployment—reliability of tool-augmented agents—introducing a self-healing, budgeted control framework with verification and observability. Its controlled fault-injection benchmark and strong gains over common baselines suggest methodological rigor and actionable engineering relevance across many domains using tool-calling agents (enterprise workflows, retrieval, automation). Paper 1 is novel for multi-agent communication efficiency, but its impact may be narrower to multi-agent coordination settings and more sensitive to benchmark/task design. Paper 2 is more timely and broadly applicable.

vs. Don't Ask the LLM to Track Freshness: A Deterministic Recipe for Memory Conflict Resolution

claude-opus-4.6 · 6/2/2026

Paper 2 (MOC) addresses a broader, more generalizable problem in multi-agent LLM systems—inter-agent communication optimization—with a novel framework applicable across diverse tasks and model scales. It demonstrates consistent improvements across six datasets with available code. Paper 1, while methodologically sound and showing strong results, addresses a narrower benchmark-specific problem (memory conflict resolution via deterministic serial comparison) with findings that are somewhat intuitive (deterministic rules beat LLM judgment for simple versioning). Paper 2's multi-order communication scheme has wider applicability and relevance to the growing multi-agent systems field.

vs. TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories

gpt-5.2 · 6/2/2026

Paper 1 has higher impact potential due to a more novel, broadly applicable evaluation/diagnostic framework (shared decision landscapes) that can reinterpret existing trajectory datasets across benchmarks and yields actionable failure-region interventions. It contributes methodological tools (graph construction, trap/core overlay, event vocabulary) plus a demonstrated improvement pipeline on a high-profile, timely benchmark (SWE-bench) with measurable gains. Paper 2 is a solid systems/algorithm contribution for multi-agent LLM communication, but multi-hop/message-merging schemes are closer to incremental advances and its impact is narrower to MAS communication settings.