ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan

May 21, 2026

arXiv:2605.22102v1

cs.AI (primary)

#666 of 2292 · Artificial Intelligence

Tournament Score: 1455 ± 43 (range 1050–1800)

Win Rate: 65%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 8

Abstract

A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. To prevent communication from collapsing trajectory diversity, ExComm also introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ExComm

1. Core Contribution

ExComm introduces an exploration-stage communication protocol for parallel agentic test-time scaling that addresses error propagation—a well-known failure mode in long-horizon reasoning. The key insight is that most intermediate errors (67–71%) in parallel agent trajectories produce detectable cross-agent factual conflicts, enabling error detection without relying on unreliable self-correction. The framework consists of two modules: (1) an Online Belief Consistency Module that extracts factual conflicts across agents, resolves them via a dedicated tool-augmented verification loop, and delivers targeted soft belief updates; and (2) a Trajectory Diversification Module that monitors plan-level redundancy and redirects converging agents toward orthogonal strategies. The soft update mechanism—appending corrections rather than overwriting beliefs—is a thoughtful design choice that provides robustness against verifier errors, as demonstrated in the GAIA case study where an incorrect verification was gracefully handled.
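The two mechanisms summarized above (cross-agent conflict detection and soft belief updates) can be illustrated with a small sketch. It is not the paper's implementation: the `Agent` class, the flat key-to-value belief dictionary, `find_conflicts`, `soft_update`, and the `verify` callback are hypothetical names introduced here, and an exact string mismatch stands in for the LLM-based conflict extraction the paper describes.

```python
import math
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Agent:
    name: str
    # Hypothetical belief state: claim key -> claimed value.
    beliefs: dict[str, str] = field(default_factory=dict)
    # Verified feedback is appended here; beliefs are never overwritten.
    feedback: list[str] = field(default_factory=list)


def find_conflicts(agents):
    """Return (key, agent_a, agent_b) for every pair that disagrees on a shared key."""
    conflicts = []
    for a, b in combinations(agents, 2):
        for key in sorted(a.beliefs.keys() & b.beliefs.keys()):
            if a.beliefs[key] != b.beliefs[key]:
                conflicts.append((key, a.name, b.name))
    return conflicts


def soft_update(agents, conflicts, verify):
    """Append the verified value to every involved agent without editing beliefs."""
    by_name = {agent.name: agent for agent in agents}
    for key in sorted({c[0] for c in conflicts}):
        verdict = verify(key)  # stand-in for the tool-based verification loop
        involved = {n for k, a, b in conflicts if k == key for n in (a, b)}
        for name in sorted(involved):
            by_name[name].feedback.append(f"Verified: {key} = {verdict}")


agents = [
    Agent("a1", {"gcd(84, 36)": "12", "parity of n": "even"}),
    Agent("a2", {"gcd(84, 36)": "6"}),
    Agent("a3", {"gcd(84, 36)": "12", "parity of n": "even"}),
]
conflicts = find_conflicts(agents)
# Assumption: the "tool" here is just Python's math.gcd.
soft_update(agents, conflicts, verify=lambda key: str(math.gcd(84, 36)))
```

After the update, agent `a2` still holds its original belief alongside the appended correction, so a wrong verdict from the verifier would sit next to the agent's own reasoning rather than replace it. That is the robustness property the assessment attributes to soft updates.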

2. Methodological Rigor

The experimental design is generally sound. The paper evaluates on three benchmarks (AIME 2024, AIME 2025, GAIA L1-L3) across two primary models (Gemini-2.5-Flash-Lite and Qwen3.5-4B), with results averaged over three runs. The baseline selection is comprehensive, covering single-agent, sequential revision, independent parallel scaling, independent + revision, and tree search methods. The comparison against adapted communication baselines (Multi-Agent Debate, Mixture-of-Agents) is particularly valuable for isolating the contribution of targeted vs. broad communication.

However, several methodological concerns arise:

Error type analysis (Table 1) uses an LLM-as-a-judge approach averaged over three critic models, but the reliability of these critics for detecting and classifying errors is not independently validated against human annotations. The circular nature—using LLMs to evaluate LLM errors—introduces potential systematic biases.

Error recovery rate measurement (Table 3) similarly depends on critic model accuracy, and while using three critics adds robustness, the agreement rates between critics are not reported.

The conflict resolution module itself is an LLM agent with tool access, meaning its accuracy depends on the same class of models whose errors it's trying to fix. The paper acknowledges this limitation but doesn't quantify the verifier's accuracy.

Statistical significance is not formally tested; only standard errors are shown in one figure.
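One way the missing critic-agreement figures could be reported is pairwise Cohen's kappa over the critics' per-step labels. The sketch below is only an illustration of that statistic: the function name and the toy labels are assumptions, not data or code from the paper.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from three critic models on the same four steps.
critics = {
    "critic_1": ["error", "error", "ok", "ok"],
    "critic_2": ["error", "ok", "ok", "ok"],
    "critic_3": ["error", "error", "ok", "ok"],
}
pairwise = {
    (a, b): cohen_kappa(critics[a], critics[b])
    for a, b in combinations(sorted(critics), 2)
}
```

Reporting such pairwise values (or a multi-rater statistic) alongside Tables 1 and 3 would show whether averaging over three critics reflects real consensus or masks disagreement.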

3. Potential Impact

The paper addresses a genuinely important problem. Error propagation in agentic workflows is a practical bottleneck for deploying LLM agents on complex, multi-step tasks. The framework's modular design—operating as a communication layer on top of existing agent loops—makes it relatively easy to integrate into existing systems.

Immediate applications include improving reliability of tool-augmented reasoning systems in mathematics, research assistance, and multi-step information retrieval. The approach could generalize to any domain where parallel agents solve problems with verifiable intermediate facts.

Broader influence: The paper opens a promising design space around "exploration-stage communication"—intervening during reasoning rather than only aggregating final outputs. This is a conceptually distinct contribution from existing multi-agent debate or mixture-of-agents paradigms. The diversity-preservation aspect (avoiding trajectory collapse while sharing corrections) addresses a real tension in multi-agent systems.

However, the practical impact may be bounded by the requirement for tool-verifiable conflicts. In domains where intermediate claims are harder to verify programmatically (e.g., open-ended creative tasks, ethical reasoning), the consistency module's effectiveness would likely diminish.

4. Timeliness & Relevance

This work is highly timely. Test-time scaling and agentic AI are among the most active areas in LLM research. The paper directly addresses a current bottleneck: how to make agentic test-time scaling more reliable without simply increasing the number of trajectories. The focus on error propagation during exploration rather than post-hoc refinement fills a clear gap in the existing literature. The use of recent models (Gemini-2.5-Flash-Lite, Qwen3.5-4B) and benchmarks (AIME 2025) further demonstrates currency.

5. Strengths & Limitations

Key Strengths:

Well-motivated design: The empirical observation that 67–71% of errors produce detectable conflicts provides strong justification for the approach.

Principled soft update mechanism: Appending rather than overwriting beliefs is elegant and demonstrably important, as shown by the GAIA case study in which an incorrect verification is handled gracefully.

Comprehensive evaluation: Multiple models, benchmarks, scaling regimes, ablations, cost analysis, and diversity measurements provide a thorough picture.

Favorable efficiency: ExComm at N=4 outperforms baselines at N=8 with lower API cost, demonstrating that targeted communication is more efficient than brute-force scaling.

The ablation study clearly separates the contributions of each component, showing that the diversification module is particularly important for open-ended tasks (GAIA L3).

Notable Limitations:

Verifier reliability: The framework's effectiveness is fundamentally bounded by the quality of the conflict resolution module. No analysis quantifies how often the verifier itself makes errors or how error rates change across problem difficulty.

Scalability concerns: The centralized consistency module must jointly analyze all agents' beliefs at every step, creating a potential bottleneck as agent count grows. The N=8 experiments are encouraging but modest.

Limited benchmark diversity: Only mathematical reasoning and GAIA are tested. Performance on coding, scientific reasoning, or other agentic tasks remains unknown.

Reproducibility: The paper uses proprietary models (Gemini) and relies on the Co-Sight framework, which may limit reproducibility. Prompt templates are provided, which helps.

Diversity metrics: Self-BLEU, used here to measure trajectory diversity, is a lexical measure that may not capture semantic diversity well, especially for reasoning trajectories that can express the same mathematical approach in very different surface forms.

Common errors (5.6–13.6%): Errors shared across agents produce no cross-agent conflict, so they cap what conflict detection can address. The paper does not propose mechanisms for handling errors shared across all agents.
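The Self-BLEU concern raised above can be made concrete with a simplified score. The sketch below is an assumption-laden stand-in, not the metric as computed in the paper: it uses whitespace tokens, clipped n-gram precision up to bigrams, a geometric mean, and no brevity penalty or smoothing. The example sentences are invented for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, refs, n):
    """Share of hypothesis n-grams found in any reference, with clipped counts."""
    hyp_counts = ngrams(hyp, n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in refs:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
    return matched / total


def self_bleu(texts, max_n=2):
    """Simplified Self-BLEU: each text is scored against all the others.

    Lower values mean less lexical overlap, which is read as more diversity.
    """
    if len(texts) < 2:
        raise ValueError("need at least two texts")
    token_lists = [t.lower().split() for t in texts]
    scores = []
    for i, hyp in enumerate(token_lists):
        refs = token_lists[:i] + token_lists[i + 1:]
        precisions = [clipped_precision(hyp, refs, n) for n in range(1, max_n + 1)]
        if min(precisions) == 0.0:
            scores.append(0.0)
        else:
            scores.append(math.exp(sum(math.log(p) for p in precisions) / max_n))
    return sum(scores) / len(scores)


# Two phrasings of the same pigeonhole argument (hypothetical examples).
same_idea = [
    "apply the pigeonhole principle to the residues modulo 7",
    "since there are more integers than remainder classes mod 7 two must collide",
]
identical = ["apply the pigeonhole principle", "apply the pigeonhole principle"]

paraphrase_score = self_bleu(same_idea)
identical_score = self_bleu(identical)
```

The two paraphrases share no bigram, so they score as maximally diverse even though they follow the same strategy, while verbatim copies score 1.0. This is the gap between lexical and semantic diversity that the limitation points to.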

Additional Observations

The paper's framing of "exploration-stage" vs. "solution-stage" communication is a useful conceptual distinction that could influence how the community designs multi-agent systems. The trajectory diversification module, while seemingly simple, addresses a real and often overlooked problem in parallel sampling approaches. The cost analysis (Figure 2) is particularly compelling for practitioners evaluating deployment trade-offs.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 8

Generated May 22, 2026

Comparison History (20)

vs. CLORE: Content-Level Optimization for Reasoning Efficiency

claude-opus-4.6 · 5/22/2026

ExComm addresses a fundamental problem in agentic test-time scaling—error propagation in long-horizon reasoning—with a novel multi-agent communication protocol that detects cross-agent factual conflicts and resolves them through tool-based verification. This has broader impact across agentic AI systems beyond math reasoning. While CLORE offers a solid contribution to reasoning efficiency via content-level editing, it represents a more incremental improvement within the established efficient reasoning paradigm. ExComm's approach to inter-agent communication, belief updates, and trajectory diversification introduces more novel architectural concepts with wider applicability to emerging agentic AI workflows.

vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

gemini-3.1 · 5/22/2026

Paper 1 addresses error propagation in test-time scaling, a highly critical bottleneck in advancing reasoning capabilities of LLMs. Its approach to real-time error correction and trajectory diversification tackles immediate, high-impact challenges in building autonomous agents. Paper 2, while methodologically rigorous in addressing privacy in latent KV communication, focuses on a more niche and less universally adopted communication paradigm, giving Paper 1 broader applicability and a higher potential impact in the rapidly growing field of agentic reasoning.

vs. Unlocking Proactivity in Task-Oriented Dialogue

gpt-5.2 · 5/22/2026

Paper 1 offers a broadly applicable, timely contribution to agentic test-time scaling: a concrete protocol to detect and correct intermediate reasoning errors via cross-agent conflict auditing plus verification, while preserving diversity. This targets a central bottleneck (error propagation) across many LLM agent settings (math, web tasks, general long-horizon workflows) and is immediately usable as a modular inference-time method, with quantified gains on prominent benchmarks. Paper 2 is promising for proactive TOD, but relies on a specialized simulator/latent-state setup that may limit generality and real-world transfer, and the abstract provides less evidence of empirical breadth/rigor.

vs. FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

claude-opus-4.6 · 5/22/2026

ExComm addresses the fundamental and broadly relevant problem of error propagation in agentic test-time scaling, proposing a novel communication protocol with cross-agent conflict detection and soft belief updates. This has broad applicability across many reasoning tasks and LLM agent architectures, with strong empirical gains on multiple benchmarks. While FLUID solves an important industrial problem (cold-start in livestreaming recommendation) and demonstrates real-world deployment at scale, its impact is more narrowly scoped to a specific recommendation domain. ExComm's contributions to multi-agent reasoning and test-time scaling are more likely to influence a wider range of future research directions.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

gemini-3.1 · 5/22/2026

SMDD-Bench introduces a much-needed, standardized benchmark for a critical scientific domain (drug design). By addressing the lack of rigorous evaluation in LLM-driven chemistry and biology, it directly facilitates breakthroughs in real-world medicine. This foundational contribution to AI for Science has a higher potential for transformative real-world impact than the incremental algorithmic improvements in general reasoning offered by Paper 2.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

claude-opus-4.6 · 5/22/2026

ChemVA addresses a fundamental and persistent bottleneck in AI-driven chemistry—interpreting chemical reaction diagrams—with a novel framework yielding ~20 percentage point gains across 9 LLMs. It introduces a new benchmark (OCRD-Bench), bridges vision and language for chemistry, and has broad real-world applications in drug discovery, synthesis planning, and chemical education. Paper 1, while solid, represents an incremental improvement in test-time scaling for agentic reasoning with more modest gains (~5%). ChemVA's cross-disciplinary impact (AI + chemistry) and enabling of open-weight models to match proprietary systems gives it higher long-term scientific impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

claude-opus-4.6 · 5/22/2026

ExComm addresses a fundamental problem in test-time scaling—error propagation in multi-agent reasoning—with a principled, well-evaluated solution. Its communication protocol for detecting cross-agent conflicts, soft belief updates, and trajectory diversification are novel contributions with broad applicability across agentic AI systems. The method shows consistent gains across multiple benchmarks and models, demonstrating generalizability. While Paper 1 (Insights Generator) tackles an important practical problem of diagnosing LLM agent failures, it is more of a tooling/workflow contribution. Paper 2's methodological innovations in multi-agent coordination and error correction have broader theoretical and practical impact potential.

vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact: it introduces a novel architectural refinement (decoupled channel-wise erase/write gates) in linear attention/fast-weight memory with clear algorithmic and training-system contributions (chunkwise WY, gate-aware backward), and demonstrates strong results at large scale (1.3B, 100B tokens) on broad benchmarks including long-context retrieval. This advances core sequence modeling infrastructure with wide applicability across LLMs and efficient attention alternatives. Paper 2 is timely and useful for agentic test-time scaling, but is more workflow/protocol-level and may be more sensitive to tooling/model specifics, limiting foundational breadth.

vs. The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

gemini-3.1 · 5/22/2026

While Paper 1 presents a strong technical advancement in AI agent architecture, Paper 2 addresses a critical and highly timely societal issue: the impact of AI on human skill development. Its findings have far-reaching implications across multiple disciplines, including education, cognitive psychology, human-computer interaction, and AI policy, giving it a broader potential scientific and real-world impact compared to the specialized algorithmic improvements in Paper 1.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

gpt-5.2 · 5/22/2026

Paper 2 has higher likely scientific impact: it proposes a concrete, novel protocol (ExComm) addressing a timely and widely relevant limitation in agentic test-time scaling (error propagation), with clear methodological contributions (conflict auditing, verification loop, soft belief updates, diversification) and quantitative evaluation on major benchmarks showing consistent gains and cost–performance advantages. Its applicability spans LLM agents, reasoning systems, and tool-use workflows across many domains. Paper 1 offers valuable qualitative insights into organizational change, but its scope (24 interviews at one firm) and actionable generalizability are more limited, reducing breadth and near-term technical impact.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gemini-3.1 · 5/22/2026

Paper 2 identifies a critical inverse scaling phenomenon in high-stakes LLM forecasting (e.g., finance, epidemiology), challenging the assumption that greater capability yields better results. By exposing how standard metrics mask tail-risk failures, it offers profound implications for AI safety, evaluation methodology, and cross-disciplinary applications, granting it broader scientific impact than Paper 1's algorithmic improvement to agentic workflows.

vs. Claw AI Lab: An Autonomous Multi-Agent Research Team

gpt-5.2 · 5/22/2026

Paper 1 has higher likely scientific impact due to a clearer algorithmic novelty (conflict-audited inter-agent communication with verification and soft belief updates plus explicit diversity control), stronger methodological rigor (quantitative benchmarks on AIME/GAIA with multiple models and cost–performance analysis), and more immediate relevance to test-time scaling reliability—a widely applicable core capability for many agentic systems. Paper 2 is valuable infrastructure, but evidence is limited to internal case studies and preference judgments, making its generalizable scientific contribution and rigor less clear despite strong practical potential.

vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals

claude-opus-4.6 · 5/22/2026

ExComm addresses a fundamental challenge in agentic test-time scaling—error propagation in multi-step reasoning—which is highly timely given the rapid growth of LLM-based agent systems. Its novel communication protocol for cross-agent conflict detection and resolution introduces a broadly applicable framework with demonstrated improvements across multiple benchmarks and models. Paper 1, while solid, applies existing DRL techniques (PPO, MLPs) to a well-studied scheduling problem with incremental improvements. Paper 2's broader relevance to the rapidly expanding field of LLM agents and test-time compute scaling gives it significantly higher potential impact.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact due to broader real-world applicability (end-to-end autonomous/assisted research), stronger cross-field relevance (any empirical science/engineering workflow), and high timeliness as labs seek reliable AI research copilots. Its contributions (self-healing execution, verifiable reporting, human intervention modes, cross-run evolution) generalize beyond a specific benchmark suite. Paper 1 is methodologically focused and valuable for test-time scaling reliability, but its scope is narrower (LLM reasoning workflows) and gains are incremental relative to Paper 2’s system-level advance and deployment potential.

vs. Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

gpt-5.2 · 5/22/2026

Paper 1 offers a more novel, generalizable algorithmic contribution (a conflict-auditing communication protocol with verification, soft belief updates, and diversity preservation) that directly targets a key failure mode in agentic test-time scaling. It is likely to be reusable across tasks, models, and agentic systems, with clear practical applications and measurable gains on established benchmarks, supporting broader impact and timeliness. Paper 2 is valuable as an evaluation study highlighting operational gaps and provider differences, but its scientific impact is narrower and more contingent on a specific game setting and rapidly changing model/provider landscape.

vs. Evaluating Large Language Models as Live Strategic Agents: Provider Performance, Hybrid Decomposition, and Operational Gaps in Timed Risk Play

gemini-3.1 · 5/22/2026

Paper 2 addresses a fundamental limitation in agentic reasoning—error propagation during test-time scaling—by introducing a novel communication protocol (ExComm). Test-time scaling is a highly critical and active area of AI research. By providing a method that improves error recovery across established benchmarks (AIME, GAIA), Paper 2 offers broad methodological utility. In contrast, Paper 1 is primarily an empirical benchmarking study of existing proprietary models in a specific game environment (Risk), which, while valuable, has a shorter shelf-life and narrower methodological impact.

vs. A Camera-Cooperative ISAC Framework for Multimodal Non-Cooperative UAVs Sensing

gemini-3.1 · 5/22/2026

Paper 2 has higher potential scientific impact due to its focus on agentic test-time scaling, currently a frontier topic in artificial intelligence. While Paper 1 offers a strong framework for UAV tracking in 6G networks, ExComm (Paper 2) addresses a critical bottleneck—error propagation—in LLM reasoning and multi-agent systems. Its methodological innovations in cross-agent conflict resolution yield strong results on rigorous, highly relevant benchmarks (AIME, GAIA). The broad applicability of improving AI reasoning across diverse domains gives Paper 2 a significantly wider and more immediate scientific impact than the domain-specific telecommunications focus of Paper 1.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gpt-5.2 · 5/22/2026

Paper 2 has higher potential impact because it reframes how frontier AI capability is measured, proposing a broadly applicable evaluation paradigm (open-world evals) that can influence research, policy, and deployment practices across many subfields. Its real-world, long-horizon tasks address a timely gap in benchmark-centric evaluation and can become a standard complement to existing methods (via CRUX). Paper 1 is a solid, method-level contribution to agentic test-time scaling with measurable gains, but its impact is narrower to LLM-agent workflow optimization and may be more incremental.

vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

claude-opus-4.6 · 5/22/2026

Paper 1 (ExComm) addresses a critical and timely problem in agentic test-time scaling—error propagation in multi-step reasoning—with a novel communication protocol that detects cross-agent conflicts, resolves them via verification, and maintains trajectory diversity. It demonstrates consistent improvements across multiple benchmarks and models. Paper 2 proposes useful evaluation metrics for uncertainty-augmented systems, which is valuable but more incremental. ExComm's broader applicability to the rapidly growing field of LLM agents, its methodological novelty combining inter-agent communication with belief updates and diversity preservation, and its strong empirical results give it higher potential impact.

vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

gpt-5.2 · 5/22/2026

Paper 2 (ExComm) likely has higher impact due to broader applicability and timeliness: it addresses a central bottleneck in agentic test-time scaling (error propagation) across math, reasoning, and tool-using settings, with a general communication-and-verification protocol that can be layered onto many multi-agent pipelines. Its methodology includes explicit conflict detection, tool-based verification, belief-update design, and diversity control, evaluated on widely watched benchmarks (AIME/GAIA) with clear gains and cost trade-offs. Paper 1 is practically valuable for GUI RPA, but its impact is narrower and more domain-specific.