What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

Chen Huang, Yuhao Wu, Wenxuan Zhang

#2059 of 3404 · Artificial Intelligence
Tournament Score: 1377±44
Win Rate: 41% (7 wins, 10 losses, 17 matches)
Rating: 5.8/10

Abstract

Multi-agent systems (MAS) built on large language models are typically organized around roles, pipelines, and turn schedules, while the content that agents pass to one another is often left as unconstrained natural language. However, this free-form communication can rapidly inflate token usage, consume the shared context window, and ultimately affect both system performance and inference cost. We analyze five common inter-agent communication strategies across two MAS topologies, finding that no fixed strategy is universally optimal. Instead, effective inter-agent messages consistently preserve action-centered information needed by downstream agents. Building on this, we propose PACT (Protocolized Action-state Communication and Transmission), which treats inter-agent communication as a public state-update problem and projects each raw agent output into a compact action-state record before it enters shared history. Across different MAS topologies, PACT consistently improves the performance-cost trade-off, achieving comparable or stronger task performance with substantially fewer tokens. The gains extend to production coding harnesses: PACT lifts OpenHands' resolve rate at -10% tokens-per-resolved, and is resolve-neutral on SWE-agent while halving input tokens. Our code is publicly available at https://github.com/iNLP-Lab/PACT.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper tackles a practical but underexplored problem in LLM-based multi-agent systems (MAS): what content should agents communicate to each other? While prior MAS research has focused on agent roles, topologies, and turn-taking, the actual message content has been left as unconstrained natural language. The paper makes two contributions: (1) a diagnostic analysis of five common inter-agent communication strategies across two MAS topologies, revealing that no single strategy is universally optimal but that action-centered information is consistently valuable; and (2) PACT, a training-free communication protocol that projects raw agent outputs into compact three-field records (ACTION, STATE, RESULT) before they enter shared history.
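
To make the three-field record concrete, here is a minimal Python sketch of projecting a verbose agent output into an ACTION/STATE/RESULT record before it is appended to shared history. Only the three field names come from the review; the `ActionStateRecord` class, the `project` function, and the rule-based parsing of labeled lines are hypothetical illustrations. The paper's actual projection step may work quite differently (the review suggests it is done by prompting an LLM).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionStateRecord:
    """Compact public record with the three fields named in the review."""
    action: str
    state: str
    result: str

    def render(self) -> str:
        return f"ACTION: {self.action}\nSTATE: {self.state}\nRESULT: {self.result}"


def project(raw_output: str) -> ActionStateRecord:
    """Hypothetical projection: keep only lines labeled with a field name.

    Assumes the agent ends its output with labeled lines; this is an
    illustration, not the paper's implementation.
    """
    fields = {"ACTION": "", "STATE": "", "RESULT": ""}
    for line in raw_output.splitlines():
        key, sep, value = line.partition(":")
        name = key.strip().upper()
        if sep and name in fields:
            fields[name] = value.strip()
    return ActionStateRecord(fields["ACTION"], fields["STATE"], fields["RESULT"])


# Shared history only ever sees the compact record, never the raw output.
shared_history: list[str] = []

raw = """Let me think about this step by step. The test fails because ...
(long private reasoning omitted)
ACTION: patched parse_date() to accept ISO week dates
STATE: 3 of 5 failing tests now pass
RESULT: remaining failures are in test_timezone.py"""

record = project(raw)
shared_history.append(record.render())
print(shared_history[0])
```

The private reasoning stays with the producing agent, and downstream agents read only the rendered record, which is the private/public boundary the review describes.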

The conceptual framing — treating inter-agent communication as a "public state-update problem" and drawing a boundary between private computation and public communication — is clean and intuitive. The idea itself is not deeply novel (structured message passing is well-established in distributed systems), but its application to LLM-based MAS is timely and well-motivated.

Methodological Rigor

Diagnostic analysis. The five-strategy comparison across two topologies and three model scales (Qwen3-8B/14B/32B) is systematic and informative. The analysis credibly establishes that full-content forwarding is wasteful, generic shortening is unreliable, and artifact-only messages identify the right content type but lack protocol structure. The findings are intuitive but empirically grounded.

Experimental evaluation. The comparison against CoA, TextMAS, and Multi-Agent Debate baselines is reasonable, though the baseline selection could be broader. The results consistently show PACT achieving comparable or better performance at substantially lower token cost (38.7% average reduction). The ablation study in Table 3 demonstrates that all three PACT fields contribute, with the largest degradation when both ACTION and STATE are removed.

Limitations in rigor:

  • All experiments use only Qwen3 models. Generalization to other model families (GPT, Claude, Llama) is untested.
  • The PACT projection itself appears to be done by prompting the same LLM, but this is never explicitly discussed — the cost of the projection step itself and whether it introduces errors is not analyzed.
  • Statistical significance is not reported for any results. The avg@8 protocol for some benchmarks partially addresses variance, but confidence intervals are absent.
  • The comparison with Multi-Agent Debate is somewhat unfair since debate uses 4 agents × 3 rounds, creating inherently higher token costs by design.
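
The missing confidence intervals could be obtained cheaply from repeated runs. The sketch below computes a percentile bootstrap interval for a mean score over eight runs; the `bootstrap_ci` helper and the eight accuracy values are hypothetical and are not taken from the paper, and the mapping from the paper's avg@8 protocol to "eight per-run scores" is an assumption.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-run scores."""
    rng = random.Random(seed)
    # Resample the runs with replacement and record each resample's mean.
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical accuracies from 8 runs of one configuration (not from the paper).
runs = [0.62, 0.58, 0.65, 0.60, 0.61, 0.57, 0.64, 0.59]
low, high = bootstrap_ci(runs)
print(f"mean={mean(runs):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

If the intervals of two methods overlap heavily, a reported difference between them is plausibly within noise.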

    Potential Impact

    The practical relevance is notable. Token cost is a genuine bottleneck in production MAS deployments, and the paper demonstrates PACT on real-world coding harnesses (OpenHands and SWE-agent on SWE-bench Verified). The OpenHands result (+3.6 pp resolve rate at -10.3% tokens-per-resolved) is compelling, though the SWE-agent result (−1.4 pp resolve rate with −50.4% input tokens) is more of an efficiency gain than a performance improvement.
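
The tokens-per-resolved metric can move far more than total token usage when the resolve rate also changes. The sketch below assumes the metric is total tokens divided by the number of resolved tasks (the review does not define it), and every number in it is hypothetical, chosen only to show the arithmetic.

```python
def tokens_per_resolved(total_tokens: int, resolved: int) -> float:
    """Assumed definition: total tokens spent divided by tasks resolved."""
    if resolved <= 0:
        raise ValueError("at least one task must be resolved")
    return total_tokens / resolved


def relative_change(new: float, old: float) -> float:
    """Signed relative change of `new` with respect to `old`."""
    return (new - old) / old


# Hypothetical numbers for 100 tasks (not taken from the paper).
baseline = tokens_per_resolved(total_tokens=10_000_000, resolved=50)
variant = tokens_per_resolved(total_tokens=9_900_000, resolved=55)

change = relative_change(variant, baseline)
print(f"baseline={baseline:.0f}, variant={variant:.0f}, change={change:+.1%}")
```

In this made-up case total tokens fall by only 1%, yet tokens-per-resolved falls by 10% because five more tasks are resolved. A cost-per-success metric therefore has to be read together with the raw resolve rate and raw token counts.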

    The protocol is lightweight (implemented as a proxy hook requiring no model training or architecture changes), which lowers the adoption barrier. However, the impact may be bounded by several factors:

  • Production MAS are rapidly evolving, and many frameworks may independently adopt structured communication.
  • The gains are primarily in token efficiency rather than capability — the performance improvements are modest and sometimes within noise.
  • The three-field schema (ACTION/STATE/RESULT) is relatively rigid and may not suit all agent interaction patterns (acknowledged in limitations).

    Timeliness & Relevance

    The paper is highly timely. With the explosion of agentic AI systems (Claude Code, Codex, etc.) and reasoning models that produce verbose outputs, token cost management is a pressing concern. The observation that reasoning traces compound across multi-turn histories is particularly relevant given the trend toward extended thinking models. The paper addresses a real engineering bottleneck that practitioners face daily.

    Strengths

    1. Clear problem framing: The paper precisely identifies the gap — inter-agent message content is an underexplored design dimension — and builds a coherent argument from analysis to solution.

    2. Practical applicability: PACT requires no training, no new agents, and no changes to existing agent architectures. The proxy hook implementation for production harnesses is elegant.

    3. Comprehensive diagnostic: The five-strategy analysis provides genuine insight (not just a baseline comparison) and the findings about topology-dependent effectiveness are useful for the community.

    4. Real-world validation: Testing on SWE-bench Verified with OpenHands and SWE-agent goes beyond toy benchmarks and demonstrates practical utility.

    Limitations & Weaknesses

    1. Limited model diversity: All experiments use Qwen3 only. The protocol's effectiveness might vary significantly with models that have different verbosity patterns or instruction-following capabilities.

    2. Projection cost opacity: The computational cost of the PACT projection step (presumably an additional LLM call or prompt addition) is not isolated or discussed.

    3. Narrow topology coverage: Only two MAS topologies are tested. More complex settings (dynamic routing, hierarchical teams, open-ended debate) are acknowledged as unexplored.

    4. Modest performance gains: While token reductions are substantial, performance improvements are often marginal, making it unclear whether PACT helps agents reason better or simply maintains performance while being cheaper.

    5. No comparison with compression baselines: The paper doesn't compare against prompt compression methods (e.g., LLMLingua) or memory-based approaches that could achieve similar token reductions.

    6. Scalability analysis missing: How does PACT perform as the number of agents or interaction turns increases significantly?

    Overall Assessment

    This is a solid engineering contribution that addresses a real and timely problem with a clean, practical solution. The diagnostic analysis provides useful insights, and the PACT protocol is well-designed for adoption. However, the novelty is incremental — structured message protocols are a natural engineering step — and the evaluation, while competent, lacks the breadth and statistical rigor to fully establish generalizability. The paper would benefit from cross-model evaluation, projection cost analysis, and comparison with compression baselines.

    Rating: 5.8/10
    Significance 6 · Rigor 5.5 · Novelty 5 · Clarity 7.5

    Generated Jun 5, 2026

    Comparison History (17)

    vs. Emergent Collaborative Deliberation in Multi-Model AI Systems: A BFT-Derived Protocol for Epistemic Synthesis
    gemini-3.1 · 6/6/2026

    Paper 1 explores fundamental aspects of AI reasoning, alignment blind spots, and epistemic behavior, offering profound insights into model cognition and safety. While Paper 2 provides a valuable engineering solution for cost and efficiency in multi-agent systems, Paper 1's findings on RLHF biases and cognitive personas have a broader, more transformative potential impact across AI safety, alignment, and cognitive modeling fields.

    vs. SentinelBench: A Benchmark for Long-Running Monitoring Agents
    gpt-5.2 · 6/6/2026

    Paper 2 likely has higher impact: it introduces a new benchmark targeting an under-measured, timely capability (long-running monitoring with cost–responsiveness tradeoffs), enabling broad, comparable evaluation across agent designs, models, and web-agent harnesses. Benchmarks tend to catalyze follow-on work across academia and industry, with clear real-world applicability (notifications, ops, finance, scheduling). Paper 1 is a solid, practical communication/protocol contribution with demonstrated token/performance gains, but it is more specialized to MAS message design and may have narrower cross-field adoption than a widely usable benchmark.

    vs. CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model
    gemini-3.1 · 6/6/2026

    Paper 2 addresses a critical, high-stakes problem in AI safety by benchmarking covert psychological manipulation in LLMs. This fills a significant gap in current safety evaluations and has profound implications for AI alignment, policy, and human-AI interaction, giving it a broader and more fundamental scientific impact compared to the practical efficiency optimizations for multi-agent systems in Paper 1.

    vs. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo
    claude-opus-4.6 · 6/6/2026

    Paper 1 addresses a highly practical and timely problem in multi-agent LLM systems—communication efficiency—with concrete, measurable improvements on production benchmarks (OpenHands, SWE-agent). Its PACT framework offers immediately actionable design principles for the rapidly growing MAS community, with public code and clear cost-performance trade-offs. Paper 2 presents a useful conceptual framework for knowledge infusion in generative models, but its contribution is more taxonomic/organizational, with narrower empirical validation (safety alignment in diffusion models). Paper 1's broader applicability across MAS topologies and direct relevance to production systems gives it higher potential impact.

    vs. Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory
    claude-opus-4.6 · 6/6/2026

    Paper 2 addresses a broadly relevant problem in multi-agent LLM systems—efficient inter-agent communication—which is timely given the rapid growth of LLM-based multi-agent frameworks. It proposes a general, reusable protocol (PACT) with clear practical benefits (reduced token costs, maintained performance) demonstrated across multiple topologies and production systems (OpenHands, SWE-agent). Paper 1, while technically rigorous, targets a narrow industrial application (circular factory angle grinder reuse) combining known techniques (CNN-LSTM, S-N curves, Paris law) with limited generalizability beyond its specific domain.

    vs. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns
    gemini-3.1 · 6/6/2026

    Paper 2 addresses a critical and immediate bottleneck in multi-agent LLM systems: context window exhaustion and high token inference costs. By proposing a protocolized communication strategy (PACT) and demonstrating empirical cost-performance improvements on industry-standard production harnesses like SWE-agent and OpenHands, it offers highly practical and scalable value. Paper 1 presents a useful but more abstract evaluation methodology based on existing information theory concepts. Paper 2's direct optimization of system efficiency and proven results on state-of-the-art benchmarks give it broader applicability and higher potential for immediate scientific and industry impact.

    vs. Interfaze: The Future of AI is built on Task-Specific Small Models
    gemini-3.1 · 6/5/2026

    Paper 1 proposes a foundational architectural shift by natively fusing task-specific perceptual models into a transformer decoder, addressing major inefficiencies in monolithic multimodal LLMs. Its broad applicability across vision, audio, and structured data, combined with state-of-the-art benchmark performance against next-generation models, suggests a wider scientific and industrial impact. While Paper 2 offers a valuable optimization protocol for multi-agent systems, Paper 1 represents a more significant leap in fundamental AI model design.

    vs. A Multi-AI-agent Framework Enabling End-to-end Finite Element Analysis for Solid Mechanics Problems
    claude-opus-4.6 · 6/5/2026

    Paper 1 (AbaqusAgent) has higher potential scientific impact because it addresses a concrete, high-value problem in computational mechanics—automating FEA workflows via LLM agents—bridging AI and engineering simulation in a novel way. It demonstrates practical end-to-end capability across 50 validated problems with 86% success, directly enabling real-world applications in engineering design, education, and optimization. Paper 2 (PACT) addresses important but more incremental concerns about token efficiency in multi-agent communication. While useful, it optimizes existing MAS infrastructure rather than opening a new application domain, giving it narrower cross-disciplinary impact.

    vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a fundamental and broadly applicable problem in multi-agent LLM systems—efficient inter-agent communication—proposing a principled framework (PACT) that demonstrates concrete improvements across multiple topologies and production systems. Its impact spans the rapidly growing field of LLM-based multi-agent systems with immediate practical applications (reduced cost, improved performance). Paper 1, while timely and valuable for AI safety benchmarking in companion systems, is more niche in scope, serving primarily as a dataset contribution for a specific safety evaluation domain. Paper 2's methodological contribution has broader applicability and addresses a more fundamental architectural challenge.

    vs. StepPRM-RTL: Stepwise Process-Reward Guided LLM Fine-Tuning for Enhanced RTL Synthesis
    gemini-3.1 · 6/5/2026

    Paper 2 addresses a critical and universal bottleneck in LLM-based multi-agent systems: communication efficiency and context window inflation. By proposing a generalizable protocol (PACT), its methodology can be widely adopted across diverse AI domains. In contrast, Paper 1, while methodologically strong, is highly domain-specific (hardware RTL synthesis), limiting its breadth of impact compared to foundational improvements in multi-agent architectures.

    vs. ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation
    claude-opus-4.6 · 6/5/2026

    ToolSelf introduces a more fundamentally novel paradigm—unifying task execution and self-reconfiguration within a single agent's action space via tool abstraction—addressing a core limitation of LLM agents (static configurations). Its contribution spans architecture design, training methodology (CAT), and demonstrates substantial gains (28.8 points average). Paper 2 addresses inter-agent communication efficiency, which is practically valuable but more incremental, optimizing an existing aspect (token usage) rather than introducing a new capability. ToolSelf's broader conceptual innovation and potential to reshape how agents adapt at runtime gives it higher long-term impact.

    vs. A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks
    gpt-5.2 · 6/5/2026

    Paper 2 is likely to have higher scientific impact because it addresses a broadly shared bottleneck—benchmark saturation and the high cost of creating new agent evaluations—via an automated, scalable task-generation pipeline. Its outputs (TASTE and τ^c-Bench) can become community infrastructure, influencing model development, evaluation practice, and comparisons across labs and domains, and it is timely as agent benchmarks rapidly saturate. Paper 1 is innovative and practically valuable for MAS efficiency, but its impact is more scoped to communication/compression protocols within specific multi-agent architectures.

    vs. DELTAMEM: Incremental Experience Memory for LLM Agents via Residual Trees
    claude-opus-4.6 · 6/5/2026

    DeltaMem addresses a fundamental problem in LLM agent memory—redundancy and retrieval conflicts—with a novel residual tree structure that elegantly handles incremental experience variations. The concept of residual experience memory with autonomous consolidation is more architecturally innovative and broadly applicable across agent systems. While PACT offers practical token-efficiency improvements for multi-agent communication, its contribution is more incremental (structured message formatting). DeltaMem's hierarchical memory organization with self-reorganization has deeper implications for continual learning in LLM agents, a rapidly growing research area.

    vs. MapAgent: An Industrial-Grade Agentic Framework for City-scale Lane-level Map Generation
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a fundamental bottleneck in multi-agent systems (token inflation and context limits) by proposing a novel, broadly applicable communication protocol (PACT). Its foundational nature and ability to improve efficiency across general AI agent systems give it a higher potential for widespread scientific impact and citations compared to Paper 2, which, despite its impressive real-world deployment, is highly domain-specific to mapping.

    vs. An interpretable and trustworthy AI framework for large-scale longitudinal structure-pain association studies using data from the Osteoarthritis Initiative (OAI)
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a critical and universal bottleneck in the rapidly expanding field of LLM-based multi-agent systems: token consumption and context window limits. By introducing a novel communication protocol that significantly improves the performance-cost trade-off, its findings have immediate, widespread applicability across numerous AI domains. While Paper 2 presents a rigorous and valuable medical AI application, Paper 1 offers a foundational methodology with broader, cross-disciplinary impact in AI development.

    vs. AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning
    gemini-3.1 · 6/5/2026

    Paper 2 presents a foundational infrastructure for distributed, heterogeneous multi-agent reinforcement learning at scale. While Paper 1 offers a highly practical solution for token efficiency in multi-agent communication, Paper 2's AgentJet framework fundamentally expands the capabilities of agentic RL research, enabling multi-model training, fault tolerance, and autonomous long-horizon research workflows. This breadth of application and methodological advancement gives Paper 2 a significantly higher potential impact on how future LLM agent research is conducted.

    vs. SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
    claude-opus-4.6 · 6/5/2026

    SkillPyramid addresses a more fundamental challenge in AI agent development—systematic skill construction, accumulation, and transfer for self-evolving agents. Its hierarchical skill consolidation framework with self-evolution mechanisms represents a more novel architectural contribution. The substantial improvements (38.0% reward increase, 27.7% fewer steps) across multiple benchmarks and four backbone models demonstrate broad applicability. While PACT makes a solid engineering contribution to communication efficiency in multi-agent systems, SkillPyramid's focus on enabling agents to continuously learn and generalize skills has broader implications for the long-term development of autonomous AI systems.