DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs
Yi Li, Songtao Wei, Dongming Jiang, Zhichun Guo, Qiannan Li, Bingzhe Li
Abstract
Multi-agent LLM systems improve reasoning by combining outputs from multiple agents, but interaction-heavy methods can introduce error propagation and high communication overhead. When agents exchange raw responses or reasoning traces, incorrect intermediate reasoning may be adopted and amplified, leading to confident but wrong consensus; multi-round communication also increases token consumption, latency, and inference cost. In this paper, we propose a controlled-communication coordination framework named DarkForest. DarkForest first keeps agents independent, so each agent produces an answer without seeing the others' outputs. It then parses the raw responses into structured candidate records, groups semantically equivalent candidates into clusters, and estimates a calibrated belief distribution over these clusters using agent reliability, confidence, parse quality, support-pattern reliability, and independence corrections. Under controlled communication, a coordinator receives only policy-permitted evidence from this belief state. Experiments on six reasoning benchmarks show that DarkForest achieves leading overall quality, improves on the strongest baseline by up to 30.7% on benchmark metrics, and reduces token consumption by up to 6.5× compared with communication-heavy baselines.
AI Impact Assessments
Scientific Impact Assessment: DarkForest
1. Core Contribution
DarkForest proposes a controlled-communication framework for multi-agent LLM coordination. The central insight is that unrestricted inter-agent communication can actually *harm* performance by propagating errors and creating correlated outputs, while also increasing token costs. Instead of letting agents exchange reasoning traces or debate, DarkForest keeps agents independent at generation time, then constructs a calibrated belief distribution over clustered candidate answers. Only compact, policy-permitted evidence summaries are disclosed to a final coordinator. A deterministic guardrail overrides the coordinator when the belief state strongly supports a conflicting candidate.
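The pipeline described above (independent answers, clustering, a belief distribution over clusters, and a deterministic guardrail) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the string-normalization stand-in for semantic clustering, the confidence-times-reliability weighting, and the 0.7 override threshold are all assumptions chosen for illustration; the paper's calibration also uses parse quality, support-pattern reliability, and independence corrections, which are omitted here.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Hypothetical stand-in for semantic clustering: answers that match
    after trimming, lowercasing, and dropping a trailing period share a cluster."""
    return answer.strip().lower().rstrip(".")


def belief_over_clusters(candidates):
    """Build a normalized belief distribution over answer clusters.

    candidates: list of (agent_id, answer, confidence, reliability) tuples,
    one per agent, produced without agents seeing each other's outputs.
    The weight confidence * reliability is an assumed, simplified scoring rule.
    """
    mass = defaultdict(float)
    for _agent_id, answer, confidence, reliability in candidates:
        mass[normalize(answer)] += confidence * reliability
    total = sum(mass.values())
    if total == 0:
        return {}
    return {cluster: m / total for cluster, m in mass.items()}


def apply_guardrail(coordinator_answer: str, belief: dict, threshold: float = 0.7) -> str:
    """Override the coordinator only when a conflicting cluster holds at least
    `threshold` of the belief mass (the threshold value is an assumption)."""
    if not belief:
        return coordinator_answer
    top_cluster, top_mass = max(belief.items(), key=lambda kv: kv[1])
    if normalize(coordinator_answer) != top_cluster and top_mass >= threshold:
        return top_cluster
    return coordinator_answer


# Three independent agents; two agree on "42" up to formatting.
candidates = [
    ("a1", "42", 0.9, 0.8),
    ("a2", " 42.", 0.8, 0.9),
    ("a3", "41", 0.6, 0.5),
]
belief = belief_over_clusters(candidates)
final_answer = apply_guardrail("41", belief)
```

In this toy run the "42" cluster holds most of the belief mass, so a coordinator that picked "41" is overridden. Raising the threshold makes the guardrail more conservative and leaves more decisions to the coordinator.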
The problem addressed—error propagation and communication overhead in multi-agent LLM systems—is real and well-motivated. The paper's Figure 1 effectively demonstrates the "evidence destruction" phenomenon: coordination methods often fail to select correct answers that were already present among independent agent outputs. This framing is the paper's strongest conceptual contribution.
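One way to make the "evidence destruction" idea concrete is to count how often a coordination method returns a wrong answer even though the correct answer appeared among the independent agent outputs. The sketch below is an assumed operationalization, not necessarily the definition behind the paper's Figure 1; the record fields `gold`, `agent_answers`, and `final` are hypothetical names.

```python
def evidence_destruction_rate(records):
    """Fraction of 'covered' questions whose final answer is still wrong.

    A question is covered when at least one independent agent produced the
    gold answer. records: list of dicts with the hypothetical keys
    'gold' (str), 'agent_answers' (list of str), and 'final' (str).
    Exact string match is assumed for simplicity.
    """
    covered = [r for r in records if r["gold"] in r["agent_answers"]]
    if not covered:
        return 0.0
    lost = sum(1 for r in covered if r["final"] != r["gold"])
    return lost / len(covered)


records = [
    {"gold": "A", "agent_answers": ["A", "B", "B"], "final": "B"},  # covered, lost
    {"gold": "C", "agent_answers": ["C", "C", "D"], "final": "C"},  # covered, kept
    {"gold": "E", "agent_answers": ["E", "F", "E"], "final": "E"},  # covered, kept
    {"gold": "G", "agent_answers": ["H", "H", "I"], "final": "H"},  # not covered
]
rate = evidence_destruction_rate(records)
```

Questions where no agent found the gold answer are excluded from the denominator, since the coordinator had no correct evidence to destroy in those cases.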
2. Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper addresses a practical concern in deploying multi-agent LLM systems: balancing accuracy against inference cost. The 6.5× token reduction over communication-heavy baselines is operationally significant. The framework is modular—agents can be heterogeneous, calibration is offline, and the guardrail is deterministic—making it relatively easy to integrate into existing systems.
However, the impact may be bounded by several factors:
4. Timeliness & Relevance
The paper is timely. Multi-agent LLM systems are increasingly deployed, and the cost/quality tradeoff is a genuine bottleneck. The "less communication is sometimes better" message is contrarian and useful. The connection to incomplete-information game theory provides a principled framework for thinking about disclosure in multi-agent AI systems, even if the formal treatment is lightweight.
The paper connects to ongoing concerns about LLM hallucination propagation and the scalability of multi-agent approaches, both active research areas.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
DarkForest presents a reasonable and well-executed engineering contribution to multi-agent LLM coordination. Its core insight—that controlled, calibrated disclosure outperforms unrestricted communication—is valuable and well-supported by experiments. However, the novelty is incremental rather than transformative: the individual components (majority voting, reliability calibration, confidence weighting) are well-known; the contribution lies in their thoughtful integration. The experimental evidence, while broadly supportive, lacks statistical rigor and scale diversity. The paper would benefit from evaluation with stronger models, larger test sets, and more challenging open-ended tasks.
Generated May 26, 2026
Comparison History (23)
Paper 2 offers broader scientific impact. While Paper 1 provides a valuable domain-specific benchmark for healthcare equity, Paper 2 addresses a fundamental algorithmic challenge in multi-agent LLM architectures. By mitigating error propagation and drastically reducing communication overhead (6.5x lower token consumption) while improving accuracy (up to 30.7%), DarkForest provides a highly scalable methodological breakthrough. Its domain-agnostic framework impacts reasoning efficiency and cost-effectiveness across all fields deploying agentic AI, giving it wider transformative potential compared to a specialized medical audit benchmark.
DarkForest addresses a fundamental and widely-studied problem in multi-agent LLM coordination—error propagation and communication overhead—with a principled framework showing strong empirical results (up to 30.7% improvement, 6.5x token reduction) across six benchmarks. This has broad applicability to any multi-agent system. Paper 1, while practically useful, is more of an engineering contribution (lightweight JS libraries for querying Parquet/Iceberg with LLM UDFs) targeting a narrower niche of client-side AI data applications. Paper 2's methodological contributions (calibrated belief estimation, controlled communication) are more likely to influence future research directions.
Paper 2 (POLAR) targets long-term personalization for embodied multimodal agents, a core capability for real-world assistants. Its multimodal memory + knowledge-graph design has clear applications in robotics, AR/VR, smart-home agents, and human–AI interaction, with broad cross-field relevance (MLLMs, memory systems, embodied AI). The problem is timely and likely to persist as agents move into continuous deployment. Paper 1 is strong and practical for multi-agent LLM coordination and efficiency, but is more narrowly scoped to inference-time aggregation; its impact may be constrained by rapid shifts in prompting/agent frameworks. Overall, POLAR appears more enabling and generalizable.
Paper 2 addresses a fundamental gap in AI safety by arguing that controllability should be a first-class objective beyond alignment. This reframing has broader impact across the entire AI safety field, influencing policy, architecture design, and deployment practices. It introduces a benchmark (ControlBench) and architectural framework applicable to all agentic AI systems. Paper 1, while technically solid with strong empirical results on multi-agent coordination, addresses a narrower optimization problem. Paper 2's timeliness—given rapid deployment of agentic AI—and its potential to reshape safety paradigms give it higher impact potential.
Paper 2 likely has higher scientific impact: it challenges a prominent and timely claim (LLM introspection/metacognition) with tighter controls, reframing evaluation methodology and setting stronger evidentiary standards. This kind of corrective, conceptual-plus-empirical critique can influence many subsequent papers across interpretability, alignment, cognitive science, and evaluation. Paper 1 is practically valuable and novel for multi-agent coordination efficiency, but its impact is more engineering-focused within a narrower subarea and may be superseded by rapid system-level iteration.
DarkForest addresses a fundamental and widely-studied problem in multi-agent LLM coordination with a clean, principled framework that demonstrates strong empirical gains (up to 30.7% improvement and 6.5x token reduction) across six benchmarks. Its contributions—structured aggregation, calibrated belief distributions, and controlled communication—are broadly applicable to any multi-agent LLM system. ProActor tackles the more niche problem of proactive task scheduling with timing-aware RL, introducing useful but domain-specific contributions. DarkForest's broader applicability, clearer novelty, and stronger empirical results suggest higher scientific impact.
Paper 2 likely has higher scientific impact due to stronger conceptual novelty and broader cross-field relevance: it offers a clear methodological framework separating behavioral sensitivity, representation, and causal dependence, with rigorous negative/causal results (steering nulls, controls, generalization). This directly informs how to interpret mechanistic evidence in LLMs and affects alignment, interpretability, and computational social science. Paper 1 is practically valuable for multi-agent LLM coordination and efficiency, but is more incremental within an active engineering space and may age with rapidly changing agentic baselines.
Paper 2 addresses a critical bottleneck in the widespread adoption of multi-agent LLM systems: error propagation and high token costs. By introducing a framework that improves accuracy by up to 30.7% while reducing communication overhead by 6.5x, it offers immediate and broad impact across general reasoning applications. While Paper 1 provides a valuable benchmark for the emerging niche of computer-use agents, Paper 2's fundamental improvements to multi-agent reasoning systems suggest broader utility and higher overall scientific impact.
Paper 2 addresses critical challenges in the rapidly growing field of multi-agent LLM systems, specifically error propagation and high computational costs. By introducing a novel framework that significantly improves accuracy while drastically reducing token consumption, it offers highly relevant, empirical contributions to AI research. In contrast, Paper 1 is more pedagogical, focusing on clarifying an existing classical engineering design framework, which typically garners less immediate, widespread scientific impact compared to major advancements in generative AI.
Paper 1 addresses a critical bottleneck in the rapidly expanding field of multi-agent LLMs (communication overhead and error propagation) with a novel framework, demonstrating substantial gains in accuracy and efficiency. Paper 2, while clinically relevant, relies on relatively standard DL techniques and explicitly notes significant performance degradation in cross-domain settings, limiting its immediate real-world impact compared to Paper 1.
Paper 1 addresses foundational challenges in multi-agent LLM systems—error propagation and token cost. Its broad applicability across various AI domains, combined with highly significant quantitative improvements (up to 30.7% better accuracy and 6.5x token reduction), gives it a wider potential scientific and practical impact compared to the domain-specific, albeit important, healthcare focus of Paper 2.
Paper 1 presents a concrete, novel coordination framework for multi-agent LLMs with clear algorithmic components (independence-first, clustering, calibrated belief estimation, controlled evidence sharing) and strong quantitative results across six reasoning benchmarks, including large quality gains and major token-cost reductions—highly timely for LLM deployment. Its methodological rigor and broadly applicable goal (reliable, efficient multi-agent reasoning) suggest impact across many NLP/agentic systems settings. Paper 2 is valuable as an evaluation/harness proposal, but evidence is mainly a single case study with less demonstrated generalizable performance impact.
Paper 2 likely has higher impact: it introduces a general coordination framework for multi-agent LLMs that improves accuracy while cutting communication cost—directly addressing scalability, reliability, and deployment constraints. The method is broadly applicable across tasks and systems, with strong empirical gains on multiple benchmarks and clear real-world utility (latency/cost reduction). Paper 1 is novel and timely as an evaluation benchmark grounded in developmental psychology, but its impact is primarily diagnostic/measurement-focused and narrower in immediate downstream utility than a coordination method that can be integrated into many production and research pipelines.
Paper 1 addresses a fundamental architectural question about LLM inference efficiency with broad implications across model design, training, and systems. Its comprehensive empirical study across 20 models, five families, and multiple task types, combined with practical hardware demonstrations (10x speedup on H100s), positions it to influence the entire LLM inference stack. The theoretical argument about inherent information bottlenecks in attention adds principled depth. Paper 2, while useful, addresses the narrower problem of multi-agent coordination and represents more of an engineering contribution with incremental improvements over existing approaches.
NeurIPS (the paper) introduces novel neuro-anatomical inductive priors for brain decoding, bridging neuroscience and deep learning with principled geometric and anatomical modeling. Its contributions—sphere-based tokenization, structure-guided MoE, dramatic training efficiency gains (60x faster convergence), and strong generalization—represent meaningful methodological innovation with broad implications for neuroimaging and clinical applications. Paper 2, while solid, proposes an incremental coordination framework for multi-agent LLMs that primarily combines known techniques (clustering, belief calibration, independence corrections) in a relatively crowded space with limited novelty beyond engineering improvements.
Paper 2 has higher estimated impact due to a more broadly applicable and timely contribution to multi-agent LLM coordination: it addresses key deployment pain points (error propagation, latency, and token cost) with a general controlled-communication framework and demonstrates large gains across six benchmarks. The method appears more complete end-to-end (coordination, calibration, and efficiency) and likely transfers across tasks and domains. Paper 1 is novel in personalization and provides an important benchmark, but its main algorithmic takeaway is lightweight and the core challenge (accurate gating) remains unresolved, limiting near-term impact.
Paper 2 likely has higher scientific impact: it proposes a self-improving loop tightly integrating classical best-first search (WA*) with learned relational GNN heuristics updated via Q-learning, enabling strong zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks). This bridges planning and RL, offering broad relevance across DRL, automated planning, and programmatic reasoning, and targets a long-standing core challenge (generalization under sparse rewards). Paper 1 is timely and useful for multi-agent LLM coordination and efficiency, but its impact may be narrower and more incremental relative to the larger cross-field advance and generalization result in Paper 2.
DarkForest addresses a fundamental and broadly applicable problem in multi-agent LLM coordination—error propagation and communication overhead—with a principled framework applicable across diverse reasoning tasks. Its evaluation spans six benchmarks with strong improvements (up to 30.7% accuracy gain and 6.5× token reduction). Paper 2, while technically interesting in applying Shapley values to portfolio management, targets a narrower domain (crypto trading) with limited generalizability. DarkForest's broader applicability to the rapidly growing multi-agent LLM ecosystem gives it higher potential impact across multiple fields.
Paper 1 presents a massive-scale foundation model for cardiovascular care, validated on over 1.5 million external ECGs across 89 clinical tasks. Its potential to transform real-world healthcare and enable opportunistic screening of rare diseases demonstrates exceptional real-world utility and methodological rigor. While Paper 2 offers valuable algorithmic improvements for LLM efficiency, Paper 1 represents a highly impactful, transformative advance in medical AI with immediate life-saving potential.
Paper 2 introduces a highly novel paradigm by treating textual agent skills as optimizable states, akin to weight-space optimization in deep learning. Its rigorous methodology, extensive evaluation across multiple models and harnesses, and demonstration of strong transferability suggest a broader foundational impact on agent development than Paper 1's communication-reduction strategy, which is impactful but more narrowly focused on multi-agent efficiency.