Beyond Consensus: Trace-Level Synthesis in Mixture of Agents
Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss
Abstract
When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones (the aggregation paradox). Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents (SC-MoA), which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes rather than gating on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.
AI Impact Assessments
Scientific Impact Assessment: "Beyond Consensus: Trace-Level Synthesis in Mixture of Agents"
1. Core Contribution
The paper identifies and formalizes the "aggregation paradox": an LLM aggregator reading full reasoning traces can recover correct answers even when all agents unanimously agree on a wrong answer, with beneficial corrections consistently outweighing harmful ones. This motivates SC-MoA, a three-phase pipeline: (1) semantic-preserving input perturbations for trace diversity, (2) anchored refinement that freezes the majority answer while revising minorities, and (3) universal synthesis over all traces without consensus gating. The key conceptual shift—treating the reasoning trace rather than the extracted answer as the unit of aggregation—is cleanly articulated and well-supported.
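The three phases described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the `extract_answer` helper, the `ANSWER:` convention, and the prompt strings are all hypothetical names introduced here, and the paper's actual prompts and answer clustering are not given in this assessment.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical interface: an LLM is any callable mapping a prompt to text.
LLM = Callable[[str], str]
Trace = Tuple[str, str]  # (full reasoning trace, extracted final answer)


def extract_answer(trace: str) -> str:
    """Toy extractor (assumption): the answer follows the last 'ANSWER:'."""
    return trace.rsplit("ANSWER:", 1)[-1].strip()


def sc_moa(question: str, perturbs: List[Callable[[str], str]], llm: LLM) -> str:
    # Phase 1: one proposal per meaning-preserving perturbation of the input.
    traces: List[Trace] = []
    for perturb in perturbs:
        text = llm(perturb(question))
        traces.append((text, extract_answer(text)))

    # Phase 2: anchored refinement. Traces that hold the majority answer are
    # kept frozen; only minority traces are sent back for revision.
    majority, _ = Counter(ans for _, ans in traces).most_common(1)[0]
    refined: List[Trace] = []
    for text, ans in traces:
        if ans == majority:
            refined.append((text, ans))
        else:
            revised = llm(f"Revise this reasoning for: {question}\n{text}")
            refined.append((revised, extract_answer(revised)))

    # Phase 3: universal synthesis. The aggregator always reads every full
    # trace, even when all answers agree (no consensus gating).
    joined = "\n---\n".join(text for text, _ in refined)
    final = llm(f"Synthesize one solution for: {question}\n{joined}")
    return extract_answer(final)
```

Note that the synthesis call in phase 3 is unconditional: a unanimous pool skips phase 2 revisions but still goes through the aggregator, which is the design choice the aggregation paradox motivates.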
The paper also makes a surprising empirical finding: perturbation *content* is not load-bearing within meaning-preserving families (personas, paraphrases, and GPT-generated strategies are statistically indistinguishable), yet the *surface variation* suffices to induce structurally different reasoning strategies. This connects to recent mode-concentration hypotheses about RL-trained LLMs.
2. Methodological Rigor
Strengths in experimental design: The controlled setting isolating aggregation mechanism from confounds (single model, greedy decoding, same proposals for voting vs. synthesis) is well-constructed. The paper uses five diverse benchmarks spanning QA, mathematics, science, and code, with careful justification for each. The faithful/unfaithful clustering distinction (string-match for QA vs. test-pass signatures for code) is a genuinely useful conceptual framework that explains observed QA/code asymmetries.
Formal analysis: The synthesis advantage decomposition (Proposition 1) is elementary but diagnostically valuable—it cleanly separates recovery rate from corruption rate. The anchoring invariant (Proposition 3) and its submartingale property provide formal safety guarantees, verified empirically with zero degradations in the consensus transition matrix across 867 problems. The consensus fidelity bound (Proposition 6) connects theory to the calibration results.
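For readers who want the shape of such a decomposition, one elementary form is sketched below. The exact statement of Proposition 1 is not reproduced in this assessment, so the symbols are assumptions introduced here: $V$ is the event that the vote is correct, $S$ the event that the synthesized answer is correct, $r$ the recovery rate, and $c$ the corruption rate.

```latex
\begin{align*}
\mathrm{Acc}_{S} - \mathrm{Acc}_{V}
  &= \bigl[P(S \wedge V) + P(S \wedge \neg V)\bigr]
   - \bigl[P(S \wedge V) + P(\neg S \wedge V)\bigr] \\
  &= P(S \wedge \neg V) - P(\neg S \wedge V) \\
  &= \underbrace{P(S \mid \neg V)}_{r}\,\bigl(1 - \mathrm{Acc}_{V}\bigr)
   - \underbrace{P(\neg S \mid V)}_{c}\,\mathrm{Acc}_{V}
\end{align*}
```

Under this reading, synthesis helps exactly when $r\,(1 - \mathrm{Acc}_{V}) > c\,\mathrm{Acc}_{V}$, so a small corruption rate matters more as the voting baseline gets stronger.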
Concerns: The theoretical results are largely design heuristics rather than performance guarantees, as the authors acknowledge. The key claim—that trace diversity predicts beneficial flips—relies on TF-IDF cosine distance and embedding-based validation, but the causal mechanism remains somewhat opaque. The error correlation analysis (ρ̄ identical for diverse vs. i.i.d.) is compelling but based on relatively small samples. Several McNemar tests don't reach significance (5 of 12), concentrated on smaller benchmarks (GPQA n=198, LCB-Hard n=171), though all bootstrap CIs include positive effects.
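To make the significance concern concrete, here is a minimal sketch of an exact two-sided McNemar test on paired outcomes. It assumes the standard binomial form of the test; the function name and the example counts are hypothetical and are not taken from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Exact two-sided McNemar p-value from the discordant pair counts.

    b: problems method A solves and method B misses.
    c: problems method B solves and method A misses.
    Under the null hypothesis, each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence either way
    k = min(b, c)
    # Probability of a split at least as lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, so on a benchmark with a couple of hundred problems a real gain can easily fail to reach significance: a hypothetical 12-versus-5 split, for example, gives a p-value above 0.05.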
The use of a single model (gpt-oss-120b) as the primary evaluation platform limits the generality of the main claims. While cross-model results are shown (Figure 6a), those claims rest on this model.
3. Potential Impact
Practical applications: The finding that a single model with cheap perturbations outperforms heterogeneous model pools is immediately actionable for practitioners—it simplifies deployment while improving accuracy. The calibration-as-byproduct result (ECE 0.064–0.154 on QA) enables selective prediction without additional calibration infrastructure. The SC-MoA pipeline at k=1 (∼5 calls) exceeding SC at k=10 (10 calls) on GPQA is a strong efficiency result.
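The ECE figures quoted above refer to expected calibration error. A minimal sketch of the usual equal-width binned estimator follows, assuming each problem comes with a confidence score in [0, 1] (for instance an agreement fraction across traces). How the paper derives its confidence scores and how many bins it uses are not stated in this assessment, so the function name and defaults here are assumptions.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hit_sums = [0] * n_bins
    for p, y in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        counts[idx] += 1
        conf_sums[idx] += p
        hit_sums[idx] += int(y)

    ece = 0.0
    for count, conf_sum, hits in zip(counts, conf_sums, hit_sums):
        if count:
            ece += (count / total) * abs(hits / count - conf_sum / count)
    return ece
```

A low ECE means the confidence can be thresholded directly for selective prediction: answers below the threshold are abstained on or escalated, with no separate calibration step.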
Broader influence: The paper challenges a widespread assumption in the multi-agent LLM literature—that consensus should trigger early exit. If the aggregation paradox holds generally, it would reshape how inference-time compute is allocated across voting, debate, and synthesis systems. The trace-level complementarity concept could influence test-time compute scaling research, verification-guided methods, and multi-agent debate frameworks.
Limitations on impact: The gains, while consistent, are modest in absolute terms (+2.5 to +6.1 pp over SC). The method requires an additional synthesis call per problem, and the overhead may not justify marginal gains in cost-sensitive deployments. The code domain shows particular fragility—SC-MoA's advantage over SC disappears under temperature sampling (67.3% vs. 62.6%).
4. Timeliness & Relevance
This paper arrives at a critical moment: test-time compute scaling is a major research focus, and the community is actively debating optimal strategies for inference-time reasoning. The paper directly addresses the question of what information should be preserved versus discarded during aggregation—a fundamental bottleneck as multi-agent systems proliferate. The connection to mode concentration in RL-trained models (Wu et al. 2026, Kruszewski et al. 2026) is timely and well-contextualized.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
6. Additional Observations
The paper's ablation depth is exceptional—the information ladder (Appendix X.2), embedding-based validation (Appendix V), and consensus-stratified analyses provide unusually thorough mechanistic evidence. The SC-GoA composition experiment demonstrates the method's modularity. However, the sheer volume of experiments across different configurations (N=4/k=5 vs. N=5/k=2) creates complexity that occasionally obscures the core narrative.
Generated May 29, 2026
Comparison History (18)
Paper 2 introduces a broadly applicable paradigm shift in LLM aggregation—moving from answer-level to trace-level synthesis—with theoretical guarantees and demonstrated improvements across diverse challenging benchmarks. The 'aggregation paradox' finding that synthesis helps even under unanimous agreement is counterintuitive and likely to influence future multi-agent system design across many domains. Paper 1 addresses the narrower niche of personal memory QA with a diagnostic benchmark, which is useful but has more limited scope. Paper 2's insights about reasoning trace complementarity have wider methodological implications for the rapidly growing multi-agent LLM field.
Paper 1 identifies a critical and counterintuitive inverse scaling law—larger LLMs become less robust to distractor instructions—which has broad implications for RAG systems and agentic AI deployment. It introduces a novel benchmark (DistractionIF), provides mechanistic analysis via perplexity, and demonstrates a practical mitigation via GRPO. This addresses a fundamental safety/reliability concern as LLMs scale, affecting nearly all production deployments. Paper 2 offers a useful improvement to multi-agent aggregation but operates in a narrower domain. Paper 1's finding challenges core assumptions about scaling and has wider practical and theoretical impact.
Paper 1 introduces a fundamentally new paradigm for LLM agent aggregation—synthesizing at the reasoning trace level rather than answer level—with broad applicability across multiple domains (science, math, programming). The 'aggregation paradox' is a novel theoretical insight with provable guarantees, and the method outperforms heterogeneous model pools using a single model. Paper 2 offers a useful but narrower contribution (training-free steering for small models on math), limited to models ≤3B parameters and a single domain. Paper 1's breadth, theoretical depth, and paradigm-shifting potential give it significantly higher impact.
Paper 1 introduces a comprehensive multilingual benchmark (MentalMap) with a novel capability hierarchy that reveals a universal 'L3 reasoning cliff' in spatial reasoning, validated across 13 LLMs and human subjects. This finding has broad implications for understanding LLM cognition, world modeling, and multimodal AI. Paper 2 presents a useful engineering contribution (trace-level aggregation over majority voting) but is more incremental in scope. Paper 1's systematic diagnostic framework, cross-linguistic analysis, and fundamental insight about text-only working memory constraints provide deeper scientific understanding with broader impact across cognitive science and AI.
Paper 1 provides foundational insights into the mechanisms of supervised fine-tuning and reinforcement learning for LLM reasoning, particularly how RL decomposes compressed steps. This fundamental understanding of data compression in chain-of-thought training will broadly influence how reasoning datasets and post-training pipelines are designed, offering deeper systemic impact than Paper 2's inference-time aggregation technique.
Paper 1 introduces a broadly applicable and conceptually novel result (the “aggregation paradox”) and a general method (trace-level synthesis with anchored refinement and non-degradation guarantees) that can improve performance across many reasoning-heavy domains without new labeled data. Its claims, if validated, affect ensemble methods, self-consistency, and agentic systems widely, with clear cross-field impact and timeliness for LLM reliability. Paper 2 is valuable as a benchmark/engineering study, but its scope is narrower (screen-conditioned action prediction), relies on dataset specifics, and the key insight (architecture sensitivity to SFT) is less general.
Paper 2 introduces a fundamental insight about LLM aggregation—the 'aggregation paradox'—that challenges the widespread practice of majority voting. Its finding that trace-level synthesis outperforms consensus-based methods across diverse reasoning benchmarks (PhD-level science, competition math, competitive programming) has broader applicability across essentially all LLM deployment scenarios. The provable non-degradation guarantees and the surprising result that a single model with perturbations can outperform heterogeneous model pools represents a paradigm shift in multi-agent reasoning. Paper 1, while solid, addresses a more niche problem (domain-specific data synthesis) with more incremental improvements.
Paper 2 has higher impact potential: it proposes a broadly applicable, conceptually novel aggregation principle (trace-level synthesis) with an identified “aggregation paradox,” plus a concrete method (Self-Consistent Mixture of Agents) and claimed provable non-degradation guarantees. If validated, it could improve reliability across many high-stakes LLM reasoning domains and influence both research and deployed systems. Paper 1 is timely and valuable for auditing commercial RAG chat and bias/measurement protocols, but its impact is more domain-specific (brand recommendation behavior) and primarily observational rather than method-defining.
Paper 2 likely has higher scientific impact because it introduces a reusable benchmark and experimental framework that enables the broader community to measure and diagnose “harness effects” in realistic tool-using agent workflows—an increasingly central deployment setting. Its contributions are broadly applicable across models, systems, evaluation, reliability, and safety, and the dataset/trace protocol can become a standard for reproducible comparison. Paper 1 is novel and potentially important for reasoning aggregation, but is more method-specific and narrower in scope than an execution-layer benchmark that can shape reporting norms and engineering practice.
Paper 1 proposes a fundamental paradigm shift in LLM reasoning and agent aggregation, moving from answer-level consensus (majority voting) to trace-level synthesis. By demonstrating that an aggregator can recover correct solutions even when agents unanimously fail (the aggregation paradox), it unlocks higher performance ceilings for test-time compute. This approach has broad, immediate applications across math, coding, and scientific reasoning. While Paper 2 provides a valuable benchmark for agent safety and reliability, Paper 1's algorithmic innovation addresses a core bottleneck in scaling reasoning capabilities, making its potential scientific and practical impact significantly higher.
Paper 2 is more methodologically and conceptually novel: it reframes multi-agent ensembling by exploiting trace-level complementarity, identifies the “aggregation paradox,” and proposes a concrete algorithm (Self-Consistent Mixture of Agents) with theoretical non-degradation guarantees and broad benchmark coverage (math, science, programming). Its approach is directly applicable to improving LLM reliability and performance across many domains, likely influencing both research and deployments. Paper 1 is useful and timely for prompt-robustness and evaluation practice, but is more incremental and narrower in impact compared to a new aggregation paradigm.
Paper 2 has higher potential scientific impact due to its broad applicability across the entire field of AI and LLM reasoning. While Paper 1 introduces a highly valuable benchmark for the specific domain of CAD and manufacturing, Paper 2 proposes a fundamental improvement to Mixture of Agents by aggregating reasoning traces rather than final answers. This tackles a core bottleneck in current LLM reasoning (majority voting limitations) and demonstrates performance gains across diverse, rigorous domains like math, science, and coding, ensuring broader methodological relevance and cross-disciplinary impact.
Paper 2 addresses a fundamental limitation in multi-agent LLM systems by shifting from answer-level consensus to trace-level synthesis. Given the explosive growth and broad applicability of LLM agents across domains like science, math, and coding, this paradigm shift offers immediate, widespread practical impact. While Paper 1 provides rigorous theoretical contributions to safe reinforcement learning and causal bandits, Paper 2's approach has higher potential for rapid adoption and broader influence across the currently dominant AI landscape.
Paper 1 presents a more fundamental and broadly applicable insight—that reasoning traces, not just answers, should be the unit of aggregation in multi-agent LLM systems. The 'aggregation paradox' is a novel theoretical finding with implications across all LLM reasoning tasks. It introduces a principled framework (Self-Consistent MoA) with provable guarantees and demonstrates improvements across diverse domains. Paper 2 addresses a meaningful but more narrowly scoped problem (interactive ASR correction) with an engineering-oriented framework. While valuable, its impact is more domain-specific compared to Paper 1's foundational contribution to LLM aggregation methodology.
Paper 1 introduces a novel paradigm shift in LLM agent aggregation—moving from answer-level to trace-level synthesis—with theoretical guarantees and strong empirical results across multiple domains. The 'aggregation paradox' is a surprising and counterintuitive finding. The method (Self-Consistent Mixture of Agents) is broadly applicable and demonstrates that a single model with perturbation diversity can outperform heterogeneous model pools. Paper 2, while valuable as a diagnostic benchmark for LLM-assisted peer review, addresses a narrower application domain and is primarily observational/evaluative rather than introducing a fundamentally new methodology with broad applicability.
ParaTool introduces a novel paradigm for tool calling by encoding tool knowledge into loadable parameter modules rather than relying on in-context documentation. This addresses fundamental scalability and efficiency limitations of current LLM tool-use approaches with a three-stage framework that is both practical and technically rigorous. While Paper 2 offers interesting insights about trace-level aggregation in multi-agent systems, ParaTool has broader practical impact—tool calling is a critical capability for deployed LLM systems, and reducing context overhead while improving accuracy addresses real engineering bottlenecks. The parametric tool representation concept is more architecturally novel and could influence how tool integration is designed across the field.
Paper 1 proposes a fundamental shift in LLM ensembling by aggregating reasoning traces rather than final answers, demonstrating broad applicability across mathematics, science, and coding. This trace-level synthesis addresses a major bottleneck in agentic reasoning. While Paper 2 offers a solid methodological contribution for step-level credit assignment in agentic search, Paper 1's approach has wider, more generalizable implications for the broader LLM reasoning and agent communities, suggesting a higher potential for widespread scientific and practical impact.
Paper 1 addresses a fundamental and increasingly critical problem—multi-model self-consuming training loops—with formal theoretical contributions (dynamical systems analysis, convergence characterization). It extends prior work on model collapse to the realistic multi-model regime, revealing counterintuitive results about human curation backfiring. This has broad implications for the entire foundation model ecosystem. Paper 2 presents a useful engineering contribution for LLM aggregation but is more incremental, building on existing Mixture of Agents ideas. Paper 1's theoretical framework is likely to have more lasting and cross-disciplinary impact as synthetic data training becomes ubiquitous.