Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

May 27, 2026

arXiv:2605.29116v1

cs.AI (primary)

#847 of 2821 · Artificial Intelligence

Tournament Score: 1451 ± 50 (scale 1050 to 1800)

Win Rate: 67%

Rating: 6.8 / 10 (Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 6)

Abstract

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the "aggregation paradox". Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Beyond Consensus: Trace-Level Synthesis in Mixture of Agents"

1. Core Contribution

The paper identifies and formalizes the "aggregation paradox": an LLM aggregator reading full reasoning traces can recover correct answers even when all agents unanimously agree on a wrong answer, with beneficial corrections consistently outweighing harmful ones. This motivates SC-MoA, a three-phase pipeline: (1) semantic-preserving input perturbations for trace diversity, (2) anchored refinement that freezes the majority answer while revising minorities, and (3) universal synthesis over all traces without consensus gating. The key conceptual shift—treating the reasoning trace rather than the extracted answer as the unit of aggregation—is cleanly articulated and well-supported.
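The anchoring idea in phase (2) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Trace` tuple, the `revise` callback, and the function names are hypothetical, and the sketch only shows why freezing majority traces means the majority answer's vote count cannot fall during refinement.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical representation: (reasoning_text, final_answer).
Trace = Tuple[str, str]


def majority_answer(traces: List[Trace]) -> str:
    """Return the most frequent final answer among the traces."""
    counts = Counter(answer for _, answer in traces)
    return counts.most_common(1)[0][0]


def anchored_refine(
    traces: List[Trace],
    revise: Callable[[Trace, List[Trace]], Trace],
) -> List[Trace]:
    """Freeze traces that hold the majority answer; revise only the rest.

    `revise` stands in for an LLM call (an assumption of this sketch).
    Because anchored traces are copied unchanged, the majority answer
    keeps at least as many votes after refinement as before.
    """
    anchor = majority_answer(traces)
    refined = []
    for trace in traces:
        if trace[1] == anchor:
            refined.append(trace)                  # frozen: never rewritten
        else:
            refined.append(revise(trace, traces))  # minority: may change
    return refined


traces = [("steps A", "42"), ("steps B", "42"), ("steps C", "41")]
refined = anchored_refine(traces, lambda t, all_t: (t[0] + " (rechecked)", "42"))
```

Whatever the `revise` callback returns, the two anchored traces pass through untouched, which is the property the review describes as the anchoring invariant.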

The paper also makes a surprising empirical finding: perturbation *content* is not load-bearing within meaning-preserving families (personas, paraphrases, and GPT-generated strategies are statistically indistinguishable), yet the *surface variation* suffices to induce structurally different reasoning strategies. This connects to recent mode-concentration hypotheses about RL-trained LLMs.

2. Methodological Rigor

Strengths in experimental design: The controlled setting isolating aggregation mechanism from confounds (single model, greedy decoding, same proposals for voting vs. synthesis) is well-constructed. The paper uses five diverse benchmarks spanning QA, mathematics, science, and code, with careful justification for each. The faithful/unfaithful clustering distinction (string-match for QA vs. test-pass signatures for code) is a genuinely useful conceptual framework that explains observed QA/code asymmetries.

Formal analysis: The synthesis advantage decomposition (Proposition 1) is elementary but diagnostically valuable—it cleanly separates recovery rate from corruption rate. The anchoring invariant (Proposition 3) and its submartingale property provide formal safety guarantees, verified empirically with zero degradations in the consensus transition matrix across 867 problems. The consensus fidelity bound (Proposition 6) connects theory to the calibration results.
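The separation of recovery from corruption can be made concrete with a short sketch. The exact statement of Proposition 1 is not reproduced in this assessment, so the form below is an assumption: it uses the elementary identity that the accuracy gap between synthesis and voting equals P(vote wrong) × recovery rate minus P(vote correct) × corruption rate. Function and key names are illustrative.

```python
from typing import Dict, Sequence


def synthesis_advantage(
    vote_correct: Sequence[bool],
    synth_correct: Sequence[bool],
) -> Dict[str, float]:
    """Decompose (synthesis accuracy - vote accuracy) on paired problems."""
    n = len(vote_correct)
    if n == 0 or n != len(synth_correct):
        raise ValueError("inputs must be non-empty and of equal length")

    pairs = list(zip(vote_correct, synth_correct))
    vote_wrong = sum(1 for v, _ in pairs if not v)
    vote_right = n - vote_wrong
    recovered = sum(1 for v, s in pairs if not v and s)   # wrong -> right
    corrupted = sum(1 for v, s in pairs if v and not s)   # right -> wrong

    recovery_rate = recovered / vote_wrong if vote_wrong else 0.0
    corruption_rate = corrupted / vote_right if vote_right else 0.0
    return {
        "recovery_rate": recovery_rate,
        "corruption_rate": corruption_rate,
        "advantage": (recovered - corrupted) / n,
    }


# Toy data (not from the paper): 6 problems, vote right on the first 3.
vote = [True, True, True, False, False, False]
synth = [True, True, False, True, True, False]
result = synthesis_advantage(vote, synth)
```

On the toy data the aggregator recovers two of three vote failures and corrupts one of three vote successes, so the net advantage is positive even though synthesis is not uniformly safe.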

Concerns: The theoretical results are largely design heuristics rather than performance guarantees, as the authors acknowledge. The key claim—that trace diversity predicts beneficial flips—relies on TF-IDF cosine distance and embedding-based validation, but the causal mechanism remains somewhat opaque. The error correlation analysis (ρ̄ identical for diverse vs. i.i.d.) is compelling but based on relatively small samples. Several McNemar tests don't reach significance (5 of 12), concentrated on smaller benchmarks (GPQA n=198, LCB-Hard n=171), though all bootstrap CIs include positive effects.
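To see why small benchmarks can leave McNemar tests non-significant, here is a minimal sketch of the exact (binomial) two-sided McNemar test using only the standard library. The paper's test variant is not stated in this assessment, so the exact form is an assumption. The counts 12 and 5 are borrowed from the GPQA flip counts quoted later in this assessment purely as an illustration; they are not necessarily the discordant pairs the authors tested.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: problems method A gets right and method B gets wrong.
    c: problems method B gets right and method A gets wrong.
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_value = mcnemar_exact(12, 5)  # about 0.143, above the usual 0.05 threshold
```

With only 17 discordant pairs, even a 12-to-5 split does not reach significance at 0.05, which is consistent with the non-significant results clustering on the smaller benchmarks.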

The use of a single proprietary model (gpt-oss-120b) as the primary evaluation platform limits reproducibility. While cross-model results are shown (Figure 6a), the main claims rest on this model.

3. Potential Impact

Practical applications: The finding that a single model with cheap perturbations outperforms heterogeneous model pools is immediately actionable for practitioners—it simplifies deployment while improving accuracy. The calibration-as-byproduct result (ECE 0.064–0.154 on QA) enables selective prediction without additional calibration infrastructure. The SC-MoA pipeline at k=1 (∼5 calls) exceeding SC at k=10 (10 calls) on GPQA is a strong efficiency result.
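For readers unfamiliar with the ECE figures quoted above, the following sketch computes expected calibration error with equal-width confidence bins, the standard definition. How SC-MoA derives its confidence scores and how many bins the paper uses are not stated in this assessment; `n_bins=10` and the toy inputs are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    hit_sum = [0] * n_bins
    count = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        conf_sum[idx] += conf
        hit_sum[idx] += int(ok)
        count[idx] += 1

    ece = 0.0
    for i in range(n_bins):
        if count[i]:
            accuracy = hit_sum[i] / count[i]
            mean_conf = conf_sum[i] / count[i]
            ece += (count[i] / n) * abs(accuracy - mean_conf)
    return ece


# Toy data: 90% stated confidence but only half the answers are right.
overconfident = expected_calibration_error([0.9, 0.9], [True, False])
```

A lower value means stated confidence tracks observed accuracy more closely, which is what makes selective prediction (abstaining below a confidence threshold) usable without a separate calibration step.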

Broader influence: The paper challenges a widespread assumption in the multi-agent LLM literature—that consensus should trigger early exit. If the aggregation paradox holds generally, it would reshape how inference-time compute is allocated across voting, debate, and synthesis systems. The trace-level complementarity concept could influence test-time compute scaling research, verification-guided methods, and multi-agent debate frameworks.

Limitations on impact: The gains, while consistent, are modest in absolute terms (+2.5 to +6.1 pp over SC). The method requires an additional synthesis call per problem, and the overhead may not justify marginal gains in cost-sensitive deployments. The code domain shows particular fragility—SC-MoA's advantage over SC disappears under temperature sampling (67.3% vs. 62.6%).

4. Timeliness & Relevance

This paper arrives at a critical moment: test-time compute scaling is a major research focus, and the community is actively debating optimal strategies for inference-time reasoning. The paper directly addresses the question of what information should be preserved versus discarded during aggregation—a fundamental bottleneck as multi-agent systems proliferate. The connection to mode concentration in RL-trained models (Wu et al. 2026, Kruszewski et al. 2026) is timely and well-contextualized.

5. Strengths & Limitations

Key strengths:

Clean experimental design isolating the aggregation mechanism from diversity source

The "perturbation content is not load-bearing" finding is surprising and practically useful

Anchored refinement elegantly addresses the debate martingale problem with provable guarantees

Comprehensive ablation suite (trace ablation, information ladder, selection baselines) thoroughly validates the mechanism

The faithful/unfaithful clustering framework explains QA/code performance asymmetries

Notable weaknesses:

Reliance on a single proprietary model limits reproducibility and generalizability claims

Absolute accuracy gains are modest; the aggregation paradox, while conceptually interesting, affects a small fraction of problems (e.g., 12 beneficial vs. 5 harmful flips on GPQA pre-refinement)

The paper is extremely long with extensive appendices, making the core contribution harder to extract

Temperature sampling comparison (Appendix Q) reveals that SC-MoA's code advantage is protocol-specific

The aggregator itself is a black box—why synthesis works better than selection at matched accuracy (pairwise judge also reaches 75.3%) is not fully explained

Some claims about "highest accuracy on all five benchmarks" depend on protocol choices (greedy+tag) that may not generalize

6. Additional Observations

The paper's ablation depth is exceptional—the information ladder (Appendix X.2), embedding-based validation (Appendix V), and consensus-stratified analyses provide unusually thorough mechanistic evidence. The SC-GoA composition experiment demonstrates the method's modularity. However, the sheer volume of experiments across different configurations (N=4/k=5 vs. N=5/k=2) creates complexity that occasionally obscures the core narrative.

Rating: 6.8 / 10

Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 6

Generated May 29, 2026

Comparison History (18)

vs. Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

claude-opus-4.6 · 5/29/2026

Paper 2 introduces a broadly applicable paradigm shift in LLM aggregation—moving from answer-level to trace-level synthesis—with theoretical guarantees and demonstrated improvements across diverse challenging benchmarks. The 'aggregation paradox' finding that synthesis helps even under unanimous agreement is counterintuitive and likely to influence future multi-agent system design across many domains. Paper 1 addresses the narrower niche of personal memory QA with a diagnostic benchmark, which is useful but has more limited scope. Paper 2's insights about reasoning trace complementarity have wider methodological implications for the rapidly growing multi-agent LLM field.

vs. The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

claude-opus-4.6 · 5/29/2026

Paper 1 identifies a critical and counterintuitive inverse scaling law—larger LLMs become less robust to distractor instructions—which has broad implications for RAG systems and agentic AI deployment. It introduces a novel benchmark (DistractionIF), provides mechanistic analysis via perplexity, and demonstrates a practical mitigation via GRPO. This addresses a fundamental safety/reliability concern as LLMs scale, affecting nearly all production deployments. Paper 2 offers a useful improvement to multi-agent aggregation but operates in a narrower domain. Paper 1's finding challenges core assumptions about scaling and has wider practical and theoretical impact.

vs. DenseSteer: Steering Small Language Models towards Dense Math Reasoning

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a fundamentally new paradigm for LLM agent aggregation—synthesizing at the reasoning trace level rather than answer level—with broad applicability across multiple domains (science, math, programming). The 'aggregation paradox' is a novel theoretical insight with provable guarantees, and the method outperforms heterogeneous model pools using a single model. Paper 2 offers a useful but narrower contribution (training-free steering for small models on math), limited to models ≤3B parameters and a single domain. Paper 1's breadth, theoretical depth, and paradigm-shifting potential give it significantly higher impact.

vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a comprehensive multilingual benchmark (MentalMap) with a novel capability hierarchy that reveals a universal 'L3 reasoning cliff' in spatial reasoning, validated across 13 LLMs and human subjects. This finding has broad implications for understanding LLM cognition, world modeling, and multimodal AI. Paper 2 presents a useful engineering contribution (trace-level aggregation over majority voting) but is more incremental in scope. Paper 1's systematic diagnostic framework, cross-linguistic analysis, and fundamental insight about text-only working memory constraints provide deeper scientific understanding with broader impact across cognitive science and AI.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gemini-3.1 · 5/29/2026

Paper 1 provides foundational insights into the mechanisms of supervised fine-tuning and reinforcement learning for LLM reasoning, particularly how RL decomposes compressed steps. This fundamental understanding of data compression in chain-of-thought training will broadly influence how reasoning datasets and post-training pipelines are designed, offering deeper systemic impact than Paper 2's inference-time aggregation technique.

vs. Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

gpt-5.2 · 5/29/2026

Paper 1 introduces a broadly applicable and conceptually novel result (the “aggregation paradox”) and a general method (trace-level synthesis with anchored refinement and non-degradation guarantees) that can improve performance across many reasoning-heavy domains without new labeled data. Its claims, if validated, affect ensemble methods, self-consistency, and agentic systems widely, with clear cross-field impact and timeliness for LLM reliability. Paper 2 is valuable as a benchmark/engineering study, but its scope is narrower (screen-conditioned action prediction), relies on dataset specifics, and the key insight (architecture sensitivity to SFT) is less general.

vs. Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

claude-opus-4.6 · 5/29/2026

Paper 2 introduces a fundamental insight about LLM aggregation—the 'aggregation paradox'—that challenges the widespread practice of majority voting. Its finding that trace-level synthesis outperforms consensus-based methods across diverse reasoning benchmarks (PhD-level science, competition math, competitive programming) has broader applicability across essentially all LLM deployment scenarios. The provable non-degradation guarantees and the surprising result that a single model with perturbations can outperform heterogeneous model pools represents a paradigm shift in multi-agent reasoning. Paper 1, while solid, addresses a more niche problem (domain-specific data synthesis) with more incremental improvements.

vs. Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

gpt-5.2 · 5/29/2026

Paper 2 has higher impact potential: it proposes a broadly applicable, conceptually novel aggregation principle (trace-level synthesis) with an identified “aggregation paradox,” plus a concrete method (Self-Consistent Mixture of Agents) and claimed provable non-degradation guarantees. If validated, it could improve reliability across many high-stakes LLM reasoning domains and influence both research and deployed systems. Paper 1 is timely and valuable for auditing commercial RAG chat and bias/measurement protocols, but its impact is more domain-specific (brand recommendation behavior) and primarily observational rather than method-defining.

vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact because it introduces a reusable benchmark and experimental framework that enables the broader community to measure and diagnose “harness effects” in realistic tool-using agent workflows—an increasingly central deployment setting. Its contributions are broadly applicable across models, systems, evaluation, reliability, and safety, and the dataset/trace protocol can become a standard for reproducible comparison. Paper 1 is novel and potentially important for reasoning aggregation, but is more method-specific and narrower in scope than an execution-layer benchmark that can shape reporting norms and engineering practice.

vs. OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental paradigm shift in LLM reasoning and agent aggregation, moving from answer-level consensus (majority voting) to trace-level synthesis. By demonstrating that an aggregator can recover correct solutions even when agents unanimously fail (the aggregation paradox), it unlocks higher performance ceilings for test-time compute. This approach has broad, immediate applications across math, coding, and scientific reasoning. While Paper 2 provides a valuable benchmark for agent safety and reliability, Paper 1's algorithmic innovation addresses a core bottleneck in scaling reasoning capabilities, making its potential scientific and practical impact significantly higher.

vs. Mind Your Tone: Does Tone Alter LLM Performance?

gpt-5.2 · 5/29/2026

Paper 2 is more methodologically and conceptually novel: it reframes multi-agent ensembling by exploiting trace-level complementarity, identifies the “aggregation paradox,” and proposes a concrete algorithm (Self-Consistent Mixture of Agents) with theoretical non-degradation guarantees and broad benchmark coverage (math, science, programming). Its approach is directly applicable to improving LLM reliability and performance across many domains, likely influencing both research and deployments. Paper 1 is useful and timely for prompt-robustness and evaluation practice, but is more incremental and narrower in impact compared to a new aggregation paradigm.

vs. MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation

gemini-3.1 · 5/29/2026

Paper 2 has higher potential scientific impact due to its broad applicability across the entire field of AI and LLM reasoning. While Paper 1 introduces a highly valuable benchmark for the specific domain of CAD and manufacturing, Paper 2 proposes a fundamental improvement to Mixture of Agents by aggregating reasoning traces rather than final answers. This tackles a core bottleneck in current LLM reasoning (majority voting limitations) and demonstrates performance gains across diverse, rigorous domains like math, science, and coding, ensuring broader methodological relevance and cross-disciplinary impact.

vs. Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental limitation in multi-agent LLM systems by shifting from answer-level consensus to trace-level synthesis. Given the explosive growth and broad applicability of LLM agents across domains like science, math, and coding, this paradigm shift offers immediate, widespread practical impact. While Paper 1 provides rigorous theoretical contributions to safe reinforcement learning and causal bandits, Paper 2's approach has higher potential for rapid adoption and broader influence across the currently dominant AI landscape.

vs. Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

claude-opus-4.6 · 5/29/2026

Paper 1 presents a more fundamental and broadly applicable insight—that reasoning traces, not just answers, should be the unit of aggregation in multi-agent LLM systems. The 'aggregation paradox' is a novel theoretical finding with implications across all LLM reasoning tasks. It introduces a principled framework (Self-Consistent MoA) with provable guarantees and demonstrates improvements across diverse domains. Paper 2 addresses a meaningful but more narrowly scoped problem (interactive ASR correction) with an engineering-oriented framework. While valuable, its impact is more domain-specific compared to Paper 1's foundational contribution to LLM aggregation methodology.

vs. PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel paradigm shift in LLM agent aggregation—moving from answer-level to trace-level synthesis—with theoretical guarantees and strong empirical results across multiple domains. The 'aggregation paradox' is a surprising and counterintuitive finding. The method (Self-Consistent Mixture of Agents) is broadly applicable and demonstrates that a single model with perturbation diversity can outperform heterogeneous model pools. Paper 2, while valuable as a diagnostic benchmark for LLM-assisted peer review, addresses a narrower application domain and is primarily observational/evaluative rather than introducing a fundamentally new methodology with broad applicability.

vs. ParaTool: Shifting Tool Representations from Context to Parameters

claude-opus-4.6 · 5/29/2026

ParaTool introduces a novel paradigm for tool calling by encoding tool knowledge into loadable parameter modules rather than relying on in-context documentation. This addresses fundamental scalability and efficiency limitations of current LLM tool-use approaches with a three-stage framework that is both practical and technically rigorous. While Paper 2 offers interesting insights about trace-level aggregation in multi-agent systems, ParaTool has broader practical impact—tool calling is a critical capability for deployed LLM systems, and reducing context overhead while improving accuracy addresses real engineering bottlenecks. The parametric tool representation concept is more architecturally novel and could influence how tool integration is designed across the field.

vs. Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental shift in LLM ensembling by aggregating reasoning traces rather than final answers, demonstrating broad applicability across mathematics, science, and coding. This trace-level synthesis addresses a major bottleneck in agentic reasoning. While Paper 2 offers a solid methodological contribution for step-level credit assignment in agentic search, Paper 1's approach has wider, more generalizable implications for the broader LLM reasoning and agent communities, suggesting a higher potential for widespread scientific and practical impact.

vs. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

claude-opus-4.6 · 5/29/2026

Paper 1 addresses a fundamental and increasingly critical problem—multi-model self-consuming training loops—with formal theoretical contributions (dynamical systems analysis, convergence characterization). It extends prior work on model collapse to the realistic multi-model regime, revealing counterintuitive results about human curation backfiring. This has broad implications for the entire foundation model ecosystem. Paper 2 presents a useful engineering contribution for LLM aggregation but is more incremental, building on existing Mixture of Agents ideas. Paper 1's theoretical framework is likely to have more lasting and cross-disciplinary impact as synthetic data training becomes ubiquitous.