Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Anany Kotawala

May 28, 2026

arXiv:2605.30335v1

cs.AI (primary) · cs.CL

#919 of 2821 · Artificial Intelligence

Tournament Score: 1445±46 (scale 1050 to 1800)

Win Rate: 62%

Rating: 7.4 / 10

Significance 7.5 · Rigor 8.5 · Novelty 6.5 · Clarity 7.5

Abstract

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper formalizes a previously underappreciated failure mode in multi-component LLM systems: compositional incoherence, where individually calibrated/coherent components produce jointly incoherent probability estimates when assembled by an aggregator. The key formalization is the compositional residual ε*, defined as the L2 distance from the composed quote to the joint coherent polytope. This is a runtime-computable, distribution-free certificate that requires only the system's output and declared cross-component coupling constraints.
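To make the residual concrete, here is a minimal sketch for the smallest non-trivial case. It rests on assumptions that are not taken from the paper: two components each quote one probability, and the only declared coupling is an implication A ⇒ B, so the joint coherent set is the triangle 0 ≤ p_A ≤ p_B ≤ 1. The function name `implication_residual` is hypothetical.

```python
import math

def implication_residual(p_a: float, p_b: float):
    """Return (eps, repaired) for quotes on A and B under the coupling A => B.

    Assumes each quote is locally coherent, i.e. already lies in [0, 1].
    The jointly coherent set is the triangle 0 <= p_a <= p_b <= 1.
    """
    if not (0.0 <= p_a <= 1.0 and 0.0 <= p_b <= 1.0):
        raise ValueError("each quote must lie in [0, 1]")
    if p_a <= p_b:
        # Already inside the triangle: the residual is zero.
        return 0.0, (p_a, p_b)
    # Violation: the nearest coherent point lies on the diagonal p_a == p_b.
    mid = (p_a + p_b) / 2.0
    eps = math.hypot(p_a - mid, p_b - mid)
    return eps, (mid, mid)

# Each quote is a valid probability on its own, yet P(A) > P(B) contradicts A => B.
eps, repaired = implication_residual(0.7, 0.4)
print(eps, repaired)
```

Both inputs pass any per-component check, and only the declared coupling exposes the violation; this is the locally coherent, globally incoherent pattern in miniature.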

The theoretical backbone consists of: (1) a product-structure dichotomy (Theorem 3.3) characterizing exactly when local coherence suffices for global coherence—it does iff the joint polytope factorizes as a Cartesian product of local polytopes; (2) a Rayleigh-quotient magnitude prediction (Corollary 3.9) that predicts the expected squared residual from panel covariance alone; (3) a hierarchical Boyle-Dykstra projection as a deterministic repair; and (4) an anytime-valid e-process for sequential coherence monitoring.
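Item (4) can be illustrated with a generic betting-style test supermartingale. This is a sketch under stated assumptions, not the paper's construction: each clique yields a 0/1 flag (for example, ε* above some threshold), the null hypothesis is that the conditional flag rate never exceeds `null_rate`, and `bet` is a fixed stake in [0, 1/null_rate]. All names and default values are hypothetical.

```python
def e_process_alarm(flags, null_rate=0.1, bet=2.0, alpha=0.05):
    """Return the 1-based step at which wealth first reaches 1/alpha, else None.

    flags: iterable of 0/1 indicators (1 = clique flagged as incoherent).
    Under the null each factor is nonnegative with conditional mean <= 1,
    so wealth is a nonnegative supermartingale starting at 1. Ville's
    inequality then bounds the chance of ever reaching 1/alpha by alpha,
    which is what makes the alarm valid at any stopping time.
    """
    if not (0.0 < null_rate < 1.0 and 0.0 <= bet <= 1.0 / null_rate):
        raise ValueError("need 0 < null_rate < 1 and 0 <= bet <= 1/null_rate")
    wealth = 1.0
    for step, flag in enumerate(flags, start=1):
        wealth *= 1.0 + bet * (flag - null_rate)
        if wealth >= 1.0 / alpha:
            return step
    return None

print(e_process_alarm([1, 1, 0, 1, 1]))  # frequent flags: alarm is raised
print(e_process_alarm([0] * 50))         # no flags: wealth only shrinks
```

A fixed stake keeps the example short; adaptive stakes are common in practice but are outside this sketch.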

Methodological Rigor

The theoretical framework is mathematically clean. The dichotomy theorem leverages classical convex analysis (Hilbert projection, Boyle-Dykstra convergence), and the contribution is properly framed as an operational reframing rather than novel convex geometry. The paper is honest about this: "The convex-analytic machinery is classical; the contribution is the operational reframing."
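For readers unfamiliar with the classical machinery, the following is a minimal sketch of plain two-set Dykstra alternating projections. It is not the paper's hierarchical variant. It assumes a single partition clique whose coherent set is the intersection of the box [0, 1]^n with the hyperplane where the quotes sum to 1; the function names are hypothetical.

```python
def project_box(x):
    """Euclidean projection onto the box [0, 1]^n."""
    return [min(1.0, max(0.0, v)) for v in x]

def project_sum_one(x):
    """Euclidean projection onto the hyperplane sum(x) == 1."""
    shift = (sum(x) - 1.0) / len(x)
    return [v - shift for v in x]

def dykstra(q, proj_a, proj_b, iters=200):
    """Approximate the projection of q onto the intersection of two convex sets."""
    x = list(q)
    corr_a = [0.0] * len(x)  # correction term carried for set A
    corr_b = [0.0] * len(x)  # correction term carried for set B
    for _ in range(iters):
        y = proj_a([xi + ci for xi, ci in zip(x, corr_a)])
        corr_a = [xi + ci - yi for xi, ci, yi in zip(x, corr_a, y)]
        x = proj_b([yi + ci for yi, ci in zip(y, corr_b)])
        corr_b = [yi + ci - xi for yi, ci, xi in zip(y, corr_b, x)]
    return x

# Quotes on three mutually exclusive, exhaustive events that sum to 1.5.
repaired = dykstra([0.9, 0.5, 0.1], project_box, project_sum_one)
print(repaired)
```

The correction terms are what distinguish Dykstra's method from naive alternating projection: they make the iterates converge to the nearest point of the intersection rather than to an arbitrary point in it.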

The experimental design is thorough with appropriate controls:

Same-model decoupling control isolates cross-model heterogeneity from coordinate isolation

Greedy-decoding control rules out sampling noise as the source

Leakage filtering ensures temporal separation between model snapshots and event resolutions

K-sweep confirms structural rather than finite-sample origins

Frontier-panel rerun tests whether capability scaling resolves the issue (it doesn't)

Coupling-visibility experiment directly probes the causal mechanism

The Rayleigh-quotient prediction matching observed residuals within 7% on three of four relation classes is a strong falsifiable prediction. The conjunction under-shoot (0.83×) is itself predicted by the theory's interior-Π̄ regime, which adds credibility.
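The flavour of a covariance-only prediction can be sketched for one linear coupling a·p = b. This is an illustration under assumptions, not the statement of Corollary 3.9: it measures distance to the hyperplane rather than to the full polytope (the two agree only when the nearest point is not clipped by the [0, 1] bounds), and the helper names and sample quotes are hypothetical. The mean squared distance splits into a Rayleigh quotient aᵀΣa / aᵀa plus a squared-bias term.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def predicted_sq_residual(quotes, a, b):
    """Mean squared distance to the hyperplane a.p == b from two moments only."""
    n, d = len(quotes), len(a)
    mean = [sum(q[j] for q in quotes) / n for j in range(d)]
    # Population covariance of the panel quotes.
    cov = [[sum((q[i] - mean[i]) * (q[j] - mean[j]) for q in quotes) / n
            for j in range(d)] for i in range(d)]
    rayleigh = dot(a, [dot(row, a) for row in cov]) / dot(a, a)
    bias_sq = (dot(a, mean) - b) ** 2 / dot(a, a)
    return rayleigh + bias_sq

def observed_sq_residual(quotes, a, b):
    """Direct average of squared point-to-hyperplane distances."""
    return sum((dot(a, q) - b) ** 2 for q in quotes) / (len(quotes) * dot(a, a))

# Hypothetical panel: four quotes on a three-event partition (coupling: sum == 1).
panel = [[0.5, 0.4, 0.3], [0.6, 0.3, 0.2], [0.4, 0.4, 0.4], [0.5, 0.5, 0.2]]
normal, target = [1.0, 1.0, 1.0], 1.0
print(predicted_sq_residual(panel, normal, target))
print(observed_sq_residual(panel, normal, target))
```

For empirical moments the two quantities agree exactly, which is why the example needs no sampling.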

However, some methodological limitations deserve attention. The evaluation uses a routing simulation rather than end-to-end deployed agents, and the planner-discretion harness (n=20) and routing-protocol ladder (n=100) are small. The paper acknowledges this, but the gap between simulated routing and real multi-component agent deployments remains substantial.

Potential Impact

Immediate applications: The ε* certificate and hierarchical repair could be integrated into any multi-component LLM pipeline that routes probabilistic questions to specialist sub-agents. The three deployment modes (monitor, repair, abstain) with calibrated thresholds (τ≈0.15 for high-recall, τ≈0.22 for high-precision) are immediately actionable.
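A runtime gate built on these thresholds might look like the sketch below. The mode semantics are assumptions rather than the paper's specification: "monitor" only reports, "repair" swaps in the projected quote when flagged, and "abstain" withholds the quote when flagged. The clique is assumed to be a partition, so the coherent set is the probability simplex, and the names `project_simplex` and `gate` are hypothetical.

```python
import math

def project_simplex(q):
    """Euclidean projection onto the probability simplex (sort-based method)."""
    cumulative, theta = 0.0, 0.0
    for j, value in enumerate(sorted(q, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [max(v - theta, 0.0) for v in q]

def gate(q, tau=0.15, mode="monitor"):
    """Flag a composed quote when its residual exceeds tau, then act by mode."""
    repaired = project_simplex(q)
    eps = math.dist(q, repaired)
    flagged = eps > tau
    if mode == "monitor":
        quote = list(q)
    elif mode == "repair":
        quote = repaired if flagged else list(q)
    elif mode == "abstain":
        quote = None if flagged else list(q)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {"eps": eps, "flagged": flagged, "quote": quote}

print(gate([0.5, 0.4, 0.3]))                             # small residual
print(gate([0.6, 0.5, 0.3], tau=0.22, mode="abstain"))   # large residual
```

Keeping the residual in the returned record lets a caller log it in every mode, including the ones that change or withhold the quote.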

Broader influence: This work bridges formal probability theory (de Finetti coherence, Dutch books, FTAP) with practical LLM system design. It demonstrates that per-component evaluation metrics (calibration, self-consistency, conformal prediction) are fundamentally insufficient for system-level guarantees under composition—a message with implications across AI safety, forecasting, and decision support.
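The Dutch-book connection can be shown with a toy example, assuming the quotes are treated as prices of tickets that pay 1 if the corresponding event occurs, and that the events form a partition (exactly one occurs). The function name and the quotes are hypothetical; this is an illustration of the classical argument, not code from the paper.

```python
def dutch_book_profits(quotes):
    """Profit per possible outcome for a bettor trading one ticket on every event.

    If the prices sum to more than 1 the bettor sells every ticket,
    otherwise the bettor buys every ticket. Exactly one ticket pays out.
    """
    side = -1.0 if sum(quotes) > 1.0 else 1.0  # -1 = sell, +1 = buy
    profits = []
    for outcome in range(len(quotes)):
        payoffs = [1.0 if i == outcome else 0.0 for i in range(len(quotes))]
        profits.append(sum(side * (pay - price)
                           for pay, price in zip(payoffs, quotes)))
    return profits

print(dutch_book_profits([0.5, 0.4, 0.3]))  # prices sum to 1.2
print(dutch_book_profits([0.5, 0.3, 0.2]))  # prices sum to 1.0
```

When the prices do not sum to 1, the profit is the same positive amount whichever event occurs, so the incoherent quote can be exploited without any forecasting skill; coherent prices leave nothing to extract.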

The finding that three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress is practically important: it demonstrates that the failure is structural rather than addressable by prompt engineering alone.

Adjacent fields: The framework could extend to ensemble methods in general, multi-agent decision systems, prediction markets with segmented information, and any system assembling probabilistic claims from distributed components.

Timeliness & Relevance

This paper addresses a timely bottleneck. As LLM agents become increasingly modular—with tool-calling, function-calling, and specialist routing—the composition of probabilistic outputs from independent components is a growing practical concern. The paper correctly identifies that existing evaluation paradigms are per-component and miss system-level failures. The connection to the FTAP and Dutch-book exposure provides a principled risk metric.

The frontier-panel result (ε*>0 on 97.8% of cliques even with top-tier models, though magnitude drops 39%) suggests this problem will not simply disappear with model scaling, making the geometric repair a necessary component rather than a temporary patch.

Strengths

1. Tight theory-experiment coupling: The Rayleigh-quotient prediction, the hardness ordering across relation classes, and the dichotomy's falsifiable prediction (ε*≡0 when M*=M⊠) are all empirically validated.

2. Comprehensive controls: Same-model, greedy-decoding, K-sweep, and frontier-panel controls systematically rule out alternative explanations.

3. Practical deployment framing: Runtime gating thresholds with cross-validated operating characteristics, three deployment modes, and cost comparisons make this immediately usable.

4. Regret quantification: The +0.115 nats/bet regret under proportional allocation (collapsing to +0.006 under self-coherentising bettors) quantifies when the failure matters and when downstream systems absorb it.

5. Reproducibility: Full code, prompts, sample dumps, and per-clique residuals are released.

Limitations

1. Explicit coupling set assumption: The entire framework requires the coupling set C to be specified. The paper acknowledges this, but the extension to implicit C inferred from unstructured transcripts, arguably the most common deployment regime, remains open.

2. Routing simulation vs. deployment: Most experiments are controlled routing simulations, not end-to-end agent evaluations.

3. L2/Brier specificity: The framework is tied to L2 projection and Brier scoring; extension to other scoring rules is mentioned but not developed.

4. Resolution noise: The disjunction reversal (Cor. 3.7) highlights sensitivity to label noise in resolutions, partially undermining the predictive Brier guarantee.

5. Scale of evaluation: 1,876 cliques across four relation classes is moderate; the partition cliques (67 base events) are relatively few.

Overall Assessment

This is a well-executed paper that identifies a genuine structural failure mode, provides a clean mathematical characterization, and validates it empirically with appropriate controls. The theory is not algorithmically novel, but the operational reframing is valuable and the empirical validation is thorough. The practical impact depends on adoption in multi-component agent pipelines, which in turn requires addressing the coupling-set specification barrier. The paper represents a solid contribution at the intersection of formal probability theory and LLM system design.

Rating: 7.4 / 10

Significance 7.5 · Rigor 8.5 · Novelty 6.5 · Clarity 7.5

Generated May 29, 2026

Comparison History (21)

vs. Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

gemini-3.1 · 5/29/2026

While Paper 1 offers rigorous theoretical contributions to multi-agent LLM coherence, Paper 2 tackles the foundational and widespread issue of reproducibility in machine learning. By proposing a scalable, machine-actionable metadata format that enables automated reproduction, Paper 2 has the potential to fundamentally improve scientific methodology and evaluation standards across the entire ML community.

vs. Provably Secure Agent Guardrail

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical and timely bottleneck in AI agent deployment: security and safety. By introducing a provably secure, formal verification-based guardrail, it shifts the paradigm from empirical to deterministic security. This offers massive potential for real-world applications in autonomous systems. While Paper 1 is methodologically rigorous, its focus on compositional probabilistic incoherence is more niche, whereas Paper 2 tackles a fundamental crisis in AI safety with broad implications across the field.

vs. Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel theoretical framework for understanding compositional incoherence in multi-component LLM agents, formalizing a fundamental failure mode with rigorous mathematical tools (coherent polytope, Rayleigh-quotient bounds, e-processes). It addresses a critical and growing problem as LLM agent systems scale, with broad implications for AI safety and reliability. Paper 2 makes useful empirical contributions to literature search evaluation but is more incremental—improving a retrieval pipeline and critiquing evaluation methodology. Paper 1's theoretical depth, methodological rigor, and applicability across the rapidly expanding multi-agent LLM ecosystem give it substantially higher potential impact.

vs. Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental theoretical flaw in multi-agent LLM systems (compositional incoherence), providing rigorous mathematical formalization, bounds, and repair mechanisms. While Paper 1 offers a strong applied framework for optimization modeling, Paper 2's insights into probabilistic coherence have broader implications for the foundational design and reliability of multi-component AI systems across all domains, yielding higher potential for widespread scientific impact.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel theoretical framework for a previously unformalised but fundamental problem in multi-component LLM systems—compositional incoherence—with rigorous mathematical characterization (coherent polytope, Rayleigh-quotient bounds, e-processes). It demonstrates that intuitive mitigations fail, providing deep insight. Paper 2 presents a solid engineering contribution combining regret matching with RL for multi-agent reasoning, but addresses a more incremental problem. Paper 1's theoretical foundations, actionable diagnostics, and surprising negative results about mitigation strategies give it broader and more lasting impact across AI safety, decision theory, and system design.

vs. RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

claude-opus-4.6 · 5/29/2026

Paper 2 (RankQ) addresses a practical and timely problem in offline-to-online RL with a novel self-supervised ranking loss, demonstrating strong empirical results across diverse benchmarks including real-world robotics with sim-to-real transfer (43.1% to 84.7% improvement). Its broad applicability to VLA fine-tuning and practical robot learning gives it wider immediate impact. Paper 1 is intellectually interesting in formalizing compositional incoherence in multi-LLM agents, but its scope is narrower, the practical mitigations fail, and the contribution is more diagnostic than constructive, limiting its downstream impact.

vs. MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

gpt-5.2 · 5/29/2026

Paper 1 offers a more novel and rigorous contribution: it formalizes a fundamental failure mode in multi-component probabilistic LLM agent compositions (local coherence not implying global coherence), introduces a computable diagnostic (compositional residual), provides theoretical characterization (product-structure dichotomy), and proposes principled repairs and monitoring (projection method, anytime-valid e-process). This targets a core reliability/safety issue with broad relevance to agentic AI, ensembling, decision-making, and probabilistic reasoning. Paper 2 is practically useful and timely, but is closer to an engineering framework around web interaction and memory with narrower methodological novelty.

vs. PRO-CUA: Process-Reward Optimization for Computer Use Agents

claude-opus-4.6 · 5/29/2026

PRO-CUA addresses a high-demand practical problem—training computer use agents via step-level reinforcement learning—with a clear, scalable framework that reduces distribution shift and improves credit assignment. Its real-world applicability to GUI automation gives it broad impact potential. Paper 2 offers rigorous theoretical analysis of compositional incoherence in multi-LLM systems, which is intellectually interesting but more niche. While Paper 2's formalization is novel, PRO-CUA's combination of methodological innovation, practical relevance, and timeliness in the rapidly growing CUA space gives it higher expected impact.

vs. Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

claude-opus-4.6 · 5/29/2026

Paper 2 addresses the highly active and timely area of RLVR for LLM reasoning, providing mechanistic interpretations using novel tools (T-SAE) and actionable difficulty-adaptive training strategies. Its findings on sample difficulty's non-monotonic effects have broad practical implications for LLM training pipelines. Paper 1, while mathematically rigorous and novel in formalizing compositional incoherence in multi-agent LLM systems, addresses a more niche problem with narrower immediate applicability. Paper 2's combination of mechanistic insight, practical relevance to mainstream LLM training, and proposed solutions gives it broader potential impact across the rapidly growing LLM research community.

vs. CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a fundamental and broadly applicable problem—compositional incoherence in multi-component LLM agent systems—with rigorous mathematical formalization, theoretical guarantees (dichotomy theorem, Rayleigh-quotient bounds), and practical tools (deterministic repair via projection, sequential monitoring via e-processes). This has wide impact across AI safety, multi-agent systems, and probabilistic reasoning. Paper 1 introduces a useful but narrow benchmark for a specific crystallographic task targeting VLMs, with limited generalizability beyond that domain. Paper 2's theoretical depth and relevance to the rapidly growing LLM agent ecosystem give it substantially higher impact potential.

vs. Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a fundamental and underexplored theoretical problem—compositional incoherence in multi-agent LLM systems—with rigorous mathematical formalization (coherent polytopes, Rayleigh-quotient bounds, e-processes). It introduces novel concepts (compositional residual ε*), provides both theoretical characterization and practical remedies, and has broad implications for the rapidly growing field of multi-agent AI systems. Paper 1, while practically useful, offers an incremental improvement to KV cache compression with modest gains (2.3-3.2%), addressing a well-studied engineering bottleneck rather than opening a new research direction.

vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a fundamental and broadly applicable problem in multi-component LLM agent systems—compositional incoherence—with rigorous mathematical formalization, theoretical guarantees, and empirical validation. It introduces novel concepts (compositional residual, product-structure dichotomy) with practical tools (deterministic projection repair, sequential monitoring). Its impact spans AI safety, multi-agent systems, and decision theory. Paper 1, while useful, is a domain-specific application framework for tourist mobility with incremental contributions combining existing techniques (GPS priors, LLMs) for a narrower audience.

vs. Formalizing Mathematics at Scale

claude-opus-4.6 · 5/29/2026

Paper 2 demonstrates a transformative capability—automated formalization of 26 textbooks into 45,000+ verified Lean 4 declarations—that has broad impact across mathematics, formal verification, and AI. It produces lasting open-source artifacts (AutoformBot + Atlas) that the community can build upon, and establishes feasibility of large-scale autoformalization, a long-sought goal. Paper 1, while rigorous and novel in formalizing compositional incoherence in multi-agent LLM systems, addresses a narrower technical problem with more limited downstream applications. Paper 2's breadth of impact across mathematical fields and its infrastructure contribution give it higher potential.

vs. Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

gemini-3.1 · 5/29/2026

Paper 1 provides a rigorous theoretical foundation for understanding and bounding compositional incoherence in multi-component LLM agents, introducing novel metrics and deterministic repair mechanisms. While Paper 2 offers a practical neuro-symbolic approach for optimization via MaxSAT, Paper 1's fundamental mathematical formalization addresses a deep, emerging problem in multi-agent AI systems, offering broader theoretical impact and stronger methodological rigor.

vs. MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

claude-opus-4.6 · 5/29/2026

MUSE addresses a critical gap in Text-to-CAD evaluation by introducing a benchmark that moves beyond geometric similarity to assess manufacturability, functionality, and assemblability—directly relevant to industrial applications. Its practical impact on CAD/manufacturing is broad and timely given the surge in LLM-driven design tools. Paper 2, while theoretically rigorous in formalizing compositional incoherence in multi-agent LLM systems, addresses a more niche problem with narrower immediate applicability. MUSE's benchmark, leaderboard, and evaluation framework are likely to drive more community adoption and downstream research.

vs. Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

claude-opus-4.6 · 5/29/2026

Paper 2 provides a rigorous mathematical formalization of a fundamental problem in multi-component LLM agents—compositional incoherence—with theoretical guarantees, runtime-computable diagnostics, and deterministic repair mechanisms. Its contributions (polytope-based residual metric, dichotomy theorem, sequential monitoring via e-processes) are broadly applicable across any multi-agent LLM system, not just education. The methodological depth, formal proofs, and extensive empirical validation (1,876 cliques, 1,770 bets) give it stronger scientific foundations. Paper 1, while addressing an important applied problem, is primarily a conceptual/architectural proposal without comparable theoretical or empirical rigor.

vs. Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

gemini-3.1 · 5/29/2026

While Paper 1 offers a practical engineering solution for LLM service discovery, Paper 2 provides a profound theoretical contribution by formalizing probabilistic incoherence in multi-agent systems. Its mathematical rigor—utilizing compositional residuals, Rayleigh-quotient predictions, and Boyle-Dykstra projections—establishes foundational limits and deterministic repairs for agent ensembles. This rigorous methodological framework for bounding logical inconsistencies gives Paper 2 a deeper, longer-lasting scientific impact compared to the architectural pipeline proposed in Paper 1.

vs. HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

gemini-3.1 · 5/29/2026

Paper 1 offers a highly novel, mathematically rigorous framework for understanding and mitigating compositional incoherence in multi-agent LLM systems. While Paper 2 provides a useful benchmark for a timely topic, Paper 1 makes a fundamental theoretical contribution to the rapidly growing field of agentic systems, offering deeper methodological innovation and generalizable insights that could shape the architectural design of future multi-component AI systems.

vs. Anchorless Diversification for Parallel LLM Ideation

gpt-5.2 · 5/29/2026

Paper 1 has higher potential impact: it introduces a formal framework and computable metric for a fundamental failure mode in multi-component LLM agents (local vs global probabilistic coherence), provides theoretical conditions (product-structure dichotomy), prediction accuracy, deterministic repair via projection, and sequential monitoring via e-processes, plus empirical prevalence and decision-theoretic regret. This is methodologically rigorous, broadly relevant to agentic systems, probabilistic reasoning, and AI safety/alignment, and timely as multi-agent/ensemble LLM deployments grow. Paper 2 is practical and useful, but more incremental and narrower in scope (ideation diversification heuristics).

vs. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

claude-opus-4.6 · 5/29/2026

Paper 1 (MMPO) addresses a fundamental and widely encountered challenge in LLM agents—memory degradation over long horizons—with a practical, well-motivated solution (Belief Entropy as a self-supervised proxy). Its strong empirical results (97.1% at 1.75M tokens) demonstrate clear practical value for the rapidly growing field of LLM agents. Paper 2, while theoretically rigorous in formalizing compositional incoherence, addresses a more niche problem with less immediate practical applicability, and its finding that intuitive mitigations fail limits near-term impact. Paper 1's broader applicability across diverse long-horizon tasks gives it higher potential impact.