ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
Woomin Song, Beomjun Kim, Daewon Choi, Sai Muralidhar Jayanthi, Saket Dingliwal, Jinwoo Shin, Aram Galstyan
Abstract
A common failure mode in long-horizon agentic test-time scaling is error propagation, where factual errors or invalid deductions introduced at intermediate steps persist in the agent's belief state and contaminate later reasoning. Existing test-time scaling methods provide limited control over this process, as they often rely on agents to detect their own mistakes, select among flawed trajectories, or refine solutions only after errors have already shaped the reasoning path. We propose ExComm, a communication protocol for exploration-stage agentic test-time scaling. ExComm is motivated by the empirical observation that the majority of intermediate errors in parallel agentic reasoning produce detectable cross-agent factual conflicts. Leveraging the iterative structure of agentic workflows, ExComm periodically audits agent belief states to detect such conflicts, resolves them through a dedicated tool-based verification loop, and returns concise, targeted feedback to the involved agents. Corrections are incorporated through soft belief updates, which append verified feedback rather than overwriting existing beliefs. To prevent communication from collapsing trajectory diversity, ExComm also introduces a trajectory diversification module that redirects redundant trajectories toward orthogonal strategies. Experiments on AIME 2024, AIME 2025, and GAIA with Gemini-2.5-Flash-Lite and Qwen3.5-4B show that ExComm consistently outperforms strong test-time scaling baselines, achieving average performance gains of 5.7% and 5.0% over the best-performing baselines, respectively. Further analyses demonstrate improved error recovery, favorable scaling behavior, stronger diversity than adapted communication baselines, and the best performance-cost trade-off among the evaluated methods.
AI Impact Assessments
Scientific Impact Assessment: ExComm
1. Core Contribution
ExComm introduces an exploration-stage communication protocol for parallel agentic test-time scaling that addresses error propagation—a well-known failure mode in long-horizon reasoning. The key insight is that most intermediate errors (67–71%) in parallel agent trajectories produce detectable cross-agent factual conflicts, enabling error detection without relying on unreliable self-correction. The framework consists of two modules: (1) an Online Belief Consistency Module that extracts factual conflicts across agents, resolves them via a dedicated tool-augmented verification loop, and delivers targeted soft belief updates; and (2) a Trajectory Diversification Module that monitors plan-level redundancy and redirects converging agents toward orthogonal strategies. The soft update mechanism—appending corrections rather than overwriting beliefs—is a thoughtful design choice that provides robustness against verifier errors, as demonstrated in the GAIA case study where an incorrect verification was gracefully handled.
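To make the audit-then-soft-update idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Agent` class, the key-value representation of beliefs, and the `verify` callback (a stand-in for the tool-based verification loop) are all hypothetical. It only shows that a correction is appended as feedback while the original belief is left in place.

```python
from dataclasses import dataclass, field
from itertools import combinations
from typing import Callable


@dataclass
class Agent:
    # Hypothetical belief state: claim key -> asserted value.
    name: str
    beliefs: dict[str, str]
    # Verified feedback is appended here; beliefs are never overwritten.
    feedback: list[str] = field(default_factory=list)


def find_conflicts(agents: list[Agent]) -> list[tuple[str, Agent, Agent]]:
    """Return (key, agent_a, agent_b) for each pair that disagrees on a shared key."""
    conflicts = []
    for a, b in combinations(agents, 2):
        for key in sorted(a.beliefs.keys() & b.beliefs.keys()):
            if a.beliefs[key] != b.beliefs[key]:
                conflicts.append((key, a, b))
    return conflicts


def soft_update(agent: Agent, key: str, verified_value: str) -> None:
    """Append verified feedback once; leave the agent's own belief untouched."""
    note = f"Verified: {key} = {verified_value}"
    if note not in agent.feedback:
        agent.feedback.append(note)


def audit(agents: list[Agent], verify: Callable[[str], str]) -> None:
    """Detect cross-agent conflicts and send targeted feedback to the agents involved."""
    for key, a, b in find_conflicts(agents):
        verified = verify(key)  # assumed stand-in for tool-based verification
        for agent in (a, b):
            soft_update(agent, key, verified)


agents = [
    Agent("a1", {"x": "4"}),
    Agent("a2", {"x": "5"}),
    Agent("a3", {"y": "1"}),
]
audit(agents, verify=lambda key: "4")
```

Only `a1` and `a2` receive feedback, since they are the only pair in conflict, and `a2` still holds its original belief alongside the appended correction. Keeping both is what would let an agent weigh a possibly wrong verification against its own reasoning.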
2. Methodological Rigor
The experimental design is generally sound. The paper evaluates on three benchmarks (AIME 2024, AIME 2025, GAIA L1-L3) across two primary models (Gemini-2.5-Flash-Lite and Qwen3.5-4B), with results averaged over three runs. The baseline selection is comprehensive, covering single-agent, sequential revision, independent parallel scaling, independent + revision, and tree search methods. The comparison against adapted communication baselines (Multi-Agent Debate, Mixture-of-Agents) is particularly valuable for isolating the contribution of targeted vs. broad communication.
However, several methodological concerns arise:
3. Potential Impact
The paper addresses a genuinely important problem. Error propagation in agentic workflows is a practical bottleneck for deploying LLM agents on complex, multi-step tasks. The framework's modular design—operating as a communication layer on top of existing agent loops—makes it relatively easy to integrate into existing systems.
Immediate applications include improving reliability of tool-augmented reasoning systems in mathematics, research assistance, and multi-step information retrieval. The approach could generalize to any domain where parallel agents solve problems with verifiable intermediate facts.
Broader influence: The paper opens a promising design space around "exploration-stage communication"—intervening during reasoning rather than only aggregating final outputs. This is a conceptually distinct contribution from existing multi-agent debate or mixture-of-agents paradigms. The diversity-preservation aspect (avoiding trajectory collapse while sharing corrections) addresses a real tension in multi-agent systems.
However, the practical impact may be bounded by the requirement for tool-verifiable conflicts. In domains where intermediate claims are harder to verify programmatically (e.g., open-ended creative tasks, ethical reasoning), the consistency module's effectiveness would likely diminish.
4. Timeliness & Relevance
This work is highly timely. Test-time scaling and agentic AI are among the most active areas in LLM research. The paper directly addresses a current bottleneck: how to make agentic test-time scaling more reliable without simply increasing the number of trajectories. The focus on error propagation during exploration rather than post-hoc refinement fills a clear gap in the existing literature. The use of recent models (Gemini-2.5-Flash-Lite, Qwen3.5-4B) and benchmarks (AIME 2025) further demonstrates currency.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing of "exploration-stage" vs. "solution-stage" communication is a useful conceptual distinction that could influence how the community designs multi-agent systems. The trajectory diversification module, while seemingly simple, addresses a real and often overlooked problem in parallel sampling approaches. The cost analysis (Figure 2) is particularly compelling for practitioners evaluating deployment trade-offs.
Generated May 22, 2026
Comparison History (20)
ExComm addresses a fundamental problem in agentic test-time scaling—error propagation in long-horizon reasoning—with a novel multi-agent communication protocol that detects cross-agent factual conflicts and resolves them through tool-based verification. This has broader impact across agentic AI systems beyond math reasoning. While CLORE offers a solid contribution to reasoning efficiency via content-level editing, it represents a more incremental improvement within the established efficient reasoning paradigm. ExComm's approach to inter-agent communication, belief updates, and trajectory diversification introduces more novel architectural concepts with wider applicability to emerging agentic AI workflows.
Paper 1 addresses error propagation in test-time scaling, a highly critical bottleneck in advancing reasoning capabilities of LLMs. Its approach to real-time error correction and trajectory diversification tackles immediate, high-impact challenges in building autonomous agents. Paper 2, while methodologically rigorous in addressing privacy in latent KV communication, focuses on a more niche and less universally adopted communication paradigm, giving Paper 1 broader applicability and a higher potential impact in the rapidly growing field of agentic reasoning.
Paper 1 offers a broadly applicable, timely contribution to agentic test-time scaling: a concrete protocol to detect and correct intermediate reasoning errors via cross-agent conflict auditing plus verification, while preserving diversity. This targets a central bottleneck (error propagation) across many LLM agent settings (math, web tasks, general long-horizon workflows) and is immediately usable as a modular inference-time method, with quantified gains on prominent benchmarks. Paper 2 is promising for proactive TOD, but relies on a specialized simulator/latent-state setup that may limit generality and real-world transfer, and the abstract provides less evidence of empirical breadth/rigor.
ExComm addresses the fundamental and broadly relevant problem of error propagation in agentic test-time scaling, proposing a novel communication protocol with cross-agent conflict detection and soft belief updates. This has broad applicability across many reasoning tasks and LLM agent architectures, with strong empirical gains on multiple benchmarks. While FLUID solves an important industrial problem (cold-start in livestreaming recommendation) and demonstrates real-world deployment at scale, its impact is more narrowly scoped to a specific recommendation domain. ExComm's contributions to multi-agent reasoning and test-time scaling are more likely to influence a wider range of future research directions.
SMDD-Bench introduces a much-needed, standardized benchmark for a critical scientific domain (drug design). By addressing the lack of rigorous evaluation in LLM-driven chemistry and biology, it directly facilitates breakthroughs in real-world medicine. This foundational contribution to AI for Science has a higher potential for transformative real-world impact than the incremental algorithmic improvements in general reasoning offered by Paper 2.
ChemVA addresses a fundamental and persistent bottleneck in AI-driven chemistry—interpreting chemical reaction diagrams—with a novel framework yielding ~20 percentage point gains across 9 LLMs. It introduces a new benchmark (OCRD-Bench), bridges vision and language for chemistry, and has broad real-world applications in drug discovery, synthesis planning, and chemical education. Paper 1, while solid, represents an incremental improvement in test-time scaling for agentic reasoning with more modest gains (~5%). ChemVA's cross-disciplinary impact (AI + chemistry) and enabling of open-weight models to match proprietary systems gives it higher long-term scientific impact.
ExComm addresses a fundamental problem in test-time scaling—error propagation in multi-agent reasoning—with a principled, well-evaluated solution. Its communication protocol for detecting cross-agent conflicts, soft belief updates, and trajectory diversification are novel contributions with broad applicability across agentic AI systems. The method shows consistent gains across multiple benchmarks and models, demonstrating generalizability. While Paper 1 (Insights Generator) tackles an important practical problem of diagnosing LLM agent failures, it is more of a tooling/workflow contribution. Paper 2's methodological innovations in multi-agent coordination and error correction have broader theoretical and practical impact potential.
Paper 1 likely has higher scientific impact: it introduces a novel architectural refinement (decoupled channel-wise erase/write gates) in linear attention/fast-weight memory with clear algorithmic and training-system contributions (chunkwise WY, gate-aware backward), and demonstrates strong results at large scale (1.3B, 100B tokens) on broad benchmarks including long-context retrieval. This advances core sequence modeling infrastructure with wide applicability across LLMs and efficient attention alternatives. Paper 2 is timely and useful for agentic test-time scaling, but is more workflow/protocol-level and may be more sensitive to tooling/model specifics, limiting foundational breadth.
While Paper 1 presents a strong technical advancement in AI agent architecture, Paper 2 addresses a critical and highly timely societal issue: the impact of AI on human skill development. Its findings have far-reaching implications across multiple disciplines, including education, cognitive psychology, human-computer interaction, and AI policy, giving it a broader potential scientific and real-world impact compared to the specialized algorithmic improvements in Paper 1.
Paper 2 has higher likely scientific impact: it proposes a concrete, novel protocol (ExComm) addressing a timely and widely relevant limitation in agentic test-time scaling (error propagation), with clear methodological contributions (conflict auditing, verification loop, soft belief updates, diversification) and quantitative evaluation on major benchmarks showing consistent gains and cost–performance advantages. Its applicability spans LLM agents, reasoning systems, and tool-use workflows across many domains. Paper 1 offers valuable qualitative insights into organizational change, but its scope (24 interviews at one firm) and actionable generalizability are more limited, reducing breadth and near-term technical impact.
Paper 2 identifies a critical inverse scaling phenomenon in high-stakes LLM forecasting (e.g., finance, epidemiology), challenging the assumption that greater capability yields better results. By exposing how standard metrics mask tail-risk failures, it offers profound implications for AI safety, evaluation methodology, and cross-disciplinary applications, granting it broader scientific impact than Paper 1's algorithmic improvement to agentic workflows.
Paper 1 has higher likely scientific impact due to a clearer algorithmic novelty (conflict-audited inter-agent communication with verification and soft belief updates plus explicit diversity control), stronger methodological rigor (quantitative benchmarks on AIME/GAIA with multiple models and cost–performance analysis), and more immediate relevance to test-time scaling reliability—a widely applicable core capability for many agentic systems. Paper 2 is valuable infrastructure, but evidence is limited to internal case studies and preference judgments, making its generalizable scientific contribution and rigor less clear despite strong practical potential.
ExComm addresses a fundamental challenge in agentic test-time scaling—error propagation in multi-step reasoning—which is highly timely given the rapid growth of LLM-based agent systems. Its novel communication protocol for cross-agent conflict detection and resolution introduces a broadly applicable framework with demonstrated improvements across multiple benchmarks and models. Paper 1, while solid, applies existing DRL techniques (PPO, MLPs) to a well-studied scheduling problem with incremental improvements. Paper 2's broader relevance to the rapidly expanding field of LLM agents and test-time compute scaling gives it significantly higher potential impact.
Paper 2 has higher potential impact due to broader real-world applicability (end-to-end autonomous/assisted research), stronger cross-field relevance (any empirical science/engineering workflow), and high timeliness as labs seek reliable AI research copilots. Its contributions (self-healing execution, verifiable reporting, human intervention modes, cross-run evolution) generalize beyond a specific benchmark suite. Paper 1 is methodologically focused and valuable for test-time scaling reliability, but its scope is narrower (LLM reasoning workflows) and gains are incremental relative to Paper 2’s system-level advance and deployment potential.
Paper 1 offers a more novel, generalizable algorithmic contribution (a conflict-auditing communication protocol with verification, soft belief updates, and diversity preservation) that directly targets a key failure mode in agentic test-time scaling. It is likely to be reusable across tasks, models, and agentic systems, with clear practical applications and measurable gains on established benchmarks, supporting broader impact and timeliness. Paper 2 is valuable as an evaluation study highlighting operational gaps and provider differences, but its scientific impact is narrower and more contingent on a specific game setting and rapidly changing model/provider landscape.
Paper 2 addresses a fundamental limitation in agentic reasoning—error propagation during test-time scaling—by introducing a novel communication protocol (ExComm). Test-time scaling is a highly critical and active area of AI research. By providing a method that improves error recovery across established benchmarks (AIME, GAIA), Paper 2 offers broad methodological utility. In contrast, Paper 1 is primarily an empirical benchmarking study of existing proprietary models in a specific game environment (Risk), which, while valuable, has a shorter shelf-life and narrower methodological impact.
Paper 2 has higher potential scientific impact due to its focus on agentic test-time scaling, currently a frontier topic in artificial intelligence. While Paper 1 offers a strong framework for UAV tracking in 6G networks, ExComm (Paper 2) addresses a critical bottleneck—error propagation—in LLM reasoning and multi-agent systems. Its methodological innovations in cross-agent conflict resolution yield strong results on rigorous, highly relevant benchmarks (AIME, GAIA). The broad applicability of improving AI reasoning across diverse domains gives Paper 2 a significantly wider and more immediate scientific impact than the domain-specific telecommunications focus of Paper 1.
Paper 2 has higher potential impact because it reframes how frontier AI capability is measured, proposing a broadly applicable evaluation paradigm (open-world evals) that can influence research, policy, and deployment practices across many subfields. Its real-world, long-horizon tasks address a timely gap in benchmark-centric evaluation and can become a standard complement to existing methods (via CRUX). Paper 1 is a solid, method-level contribution to agentic test-time scaling with measurable gains, but its impact is narrower to LLM-agent workflow optimization and may be more incremental.
Paper 1 (ExComm) addresses a critical and timely problem in agentic test-time scaling—error propagation in multi-step reasoning—with a novel communication protocol that detects cross-agent conflicts, resolves them via verification, and maintains trajectory diversity. It demonstrates consistent improvements across multiple benchmarks and models. Paper 2 proposes useful evaluation metrics for uncertainty-augmented systems, which is valuable but more incremental. ExComm's broader applicability to the rapidly growing field of LLM agents, its methodological novelty combining inter-agent communication with belief updates and diversity preservation, and its strong empirical results give it higher potential impact.
Paper 2 (ExComm) likely has higher impact due to broader applicability and timeliness: it addresses a central bottleneck in agentic test-time scaling (error propagation) across math, reasoning, and tool-using settings, with a general communication-and-verification protocol that can be layered onto many multi-agent pipelines. Its methodology includes explicit conflict detection, tool-based verification, belief-update design, and diversity control, evaluated on widely watched benchmarks (AIME/GAIA) with clear gains and cost trade-offs. Paper 1 is practically valuable for GUI RPA, but its impact is narrower and more domain-specific.