TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

Chusen Li, Zhou Liu, Shuigeng Zhou, Wentao Zhang

May 27, 2026

arXiv:2605.28699v1

cs.AI (primary)

#843 of 2682 · Artificial Intelligence

Tournament Score: 1450 ± 49 (scale 1050 to 1800)

Win Rate: 73%

Rating: 4.8 / 10

Significance 5 · Rigor 5 · Novelty 5.5 · Clarity 5.5

Abstract

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) sparse rewards, role-level free-riding, and excessive training overhead; ii) agents that only imitate collaboration; and iii) fixed collaboration protocols that fall into oscillating local optima. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn through regret matching whether the agents should speak or skip the current round, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design i) assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards. We only expand the choices made by the controllers, thus greatly reducing the computational cost of training. Moreover, ii) agents acquire collaborative capability as they learn when to speak and what to say. Finally, iii) by designing binary actions ingeniously, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: TRACER

1. Core Contribution

TRACER proposes a two-layer reinforcement learning framework for cooperative multi-LLM reasoning that separates *when* agents should speak from *what* they should say. The Controller-Regret Layer uses regret matching over binary actions (speak/skip) to learn turn-taking policies, while the Generation-Credit Layer uses GSPO with role-specific rewards to optimize proposer and reviewer utterances independently. The key insight is that by reducing the controller's action space to a binary decision, the authors can apply classical regret matching from game theory, obtaining convergence guarantees while avoiding the combinatorial explosion of training full multi-agent RL rollouts.
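To make the controller idea concrete, here is a minimal sketch of regret matching over the two actions named above. It is an illustration under assumptions, not the TRACER implementation: the class name BinaryRegretMatcher, the toy payoffs, and the update loop are all hypothetical, and the paper's actual utility definition is not given in this review.

```python
import random

ACTIONS = ("speak", "skip")


class BinaryRegretMatcher:
    """Regret matching over two actions (hypothetical helper, not from the TRACER code)."""

    def __init__(self):
        self.cum_regret = {a: 0.0 for a in ACTIONS}

    def policy(self):
        # Play each action in proportion to its positive cumulative regret;
        # fall back to the uniform policy when no action has positive regret.
        positive = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        total = sum(positive.values())
        if total <= 0.0:
            return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
        return {a: p / total for a, p in positive.items()}

    def sample(self, rng=random):
        # Draw one action from the current mixed policy.
        probs = self.policy()
        return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

    def update(self, utilities):
        # `utilities` maps each action to the payoff it would have earned this round.
        probs = self.policy()
        expected = sum(probs[a] * utilities[a] for a in ACTIONS)
        for a in ACTIONS:
            self.cum_regret[a] += utilities[a] - expected


matcher = BinaryRegretMatcher()
for _ in range(100):
    # Toy payoffs (assumed): speaking always helps, skipping never does.
    matcher.update({"speak": 1.0, "skip": 0.0})
final_policy = matcher.policy()
print(final_policy)
```

With these toy payoffs the policy moves all of its probability onto "speak" after the first update. With only two actions, each update touches two regret counters, which is the tractability the review attributes to the binary design.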

The paper targets three specific problems: (i) sparse rewards and free-riding in multi-agent RL, (ii) agents that merely imitate collaboration without learning genuine cooperative skills, and (iii) training instability from fixed collaboration protocols. The proposed decomposition addresses these by assigning credit at both the action-mode and utterance levels, learning adaptive turn-taking, and leveraging regret matching's convergence properties.

2. Methodological Rigor

Strengths in design: The separation of concerns between the controller layer and generation layer is architecturally clean. The binary action space for controllers (speak/skip) is a clever design choice that makes regret matching tractable and theoretically grounded. The role-specific reward design—proposer rewarded for answer correctness, reviewer rewarded for judgment correctness—is intuitive and avoids the credit assignment confusion that plagues naive multi-agent RL.
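The role-specific reward idea described above can be sketched in a few lines. This is a hedged illustration only: the function names, the 0/1 reward values, and exact string matching of answers are assumptions made here, since the review does not give the paper's reward formulas.

```python
def proposer_reward(proposed_answer: str, gold_answer: str) -> float:
    # Assumed rule: the proposer is paid only for a correct final answer.
    return 1.0 if proposed_answer.strip() == gold_answer.strip() else 0.0


def reviewer_reward(verdict_accept: bool, proposed_answer: str, gold_answer: str) -> float:
    # Assumed rule: the reviewer is paid for a correct judgment, that is,
    # accepting a correct proposal or rejecting an incorrect one.
    proposal_is_correct = proposed_answer.strip() == gold_answer.strip()
    return 1.0 if verdict_accept == proposal_is_correct else 0.0


# A wrong proposal that the reviewer correctly rejects:
print(proposer_reward("41", "42"))          # proposer earns nothing
print(reviewer_reward(False, "41", "42"))   # reviewer is still rewarded
```

Because each role is scored on its own output, a reviewer cannot collect reward simply by agreeing with a proposer who happens to be right, and a correct rejection is credited even when the episode's answer is wrong.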

Theoretical claims: The convergence proof in Appendix D is essentially a restatement of classical CFR convergence results applied to TRACER's setting. The paper correctly identifies that perfect information simplifies the information set structure (singleton sets), making the proof straightforward. However, the theoretical contribution is modest—the convergence guarantee applies only to the discrete controller layer, not to the neural generation policy. The claim of "extending classical game theory to deep learning" is somewhat overstated; the deep learning component (GSPO) operates independently without convergence guarantees from the game-theoretic framework.
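For reference, the kind of guarantee the classical literature gives for regret matching can be written as below. This is the standard textbook bound, stated in notation chosen here for illustration (K iterations, payoff range Δ); it is not copied from the paper's Appendix D, whose exact statement the review does not reproduce.

```latex
% Sketch under stated assumptions: A = {speak, skip}, u^k(a) is the payoff of
% action a at iteration k, \sigma^k is the controller's mixed policy at
% iteration k, and \Delta is the range of payoffs. Illustrative notation only.
\[
R^K \;=\; \max_{a \in A} \sum_{k=1}^{K} \bigl( u^k(a) - u^k(\sigma^k) \bigr),
\qquad
\frac{R^K}{K} \;\le\; \frac{\Delta \sqrt{|A|}}{\sqrt{K}}
\;=\; \Delta \sqrt{\frac{2}{K}}
\quad \text{for } |A| = 2 .
\]
```

The average regret therefore vanishes at rate O(1/√K) for the discrete controller. Nothing in this bound constrains the neural generation policy, which is the gap the review points out.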

Experimental concerns:

Training is performed exclusively on GSM8K, a relatively simple arithmetic reasoning benchmark. The cross-benchmark evaluation (MATH500, GPQA-D) tests generalization but from a narrow training distribution.

The accuracy improvements are mixed. On GSM8K with Qwen2.5-7B, TRACER (0.8901) underperforms both simple CoT (0.9160) and Self-Consistency (0.9240). On MATH500 with Qwen, TRACER (0.6120) underperforms CoT (0.7550). The "balanced profile" argument in Figure 1 somewhat obscures that TRACER doesn't actually lead on most accuracy metrics.

The comparison with single-agent RL baselines (GRPO, GSPO) shows these methods catastrophically fail on GPQA-D (0.03-0.09 accuracy), suggesting severe overfitting. TRACER avoids this collapse, which is notable, but the baselines appear pathologically overtrained.

MAGRPO shows surprisingly poor performance (0.7252 and 0.6960 on GSM8K), raising questions about whether baseline implementations are fully optimized.

3. Potential Impact

The framework addresses a genuine gap: how to train multi-LLM systems to collaborate rather than relying on fixed prompting protocols. The decomposition into turn-taking control and utterance generation is a useful conceptual framework that could influence future multi-agent LLM work.

However, practical impact is limited by several factors:

The accuracy gains over simple baselines like CoT are often negative or marginal

The 2-agent proposer-reviewer setup is specific; extending to more agents "only by adding controllers" is claimed but not demonstrated

The benchmarks used (GSM8K, MATH500, GPQA-D) are standard but the training regime is narrow

The inference cost analysis (Table 2) is genuinely useful—TRACER achieves competitive accuracy with ~960-1014 tokens/task versus MAD's ~5300-5900, demonstrating meaningful efficiency gains over multi-agent prompting approaches.
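As a quick check on the size of that efficiency gap, the token ranges quoted above can be turned into a ratio. The pairing of endpoints is an assumption, since the review does not say which TRACER figure goes with which MAD figure, so the sketch reports the most conservative and most favorable pairings.

```python
# Token counts per task as quoted in the review (approximate values).
tracer_tokens = (960, 1014)
mad_tokens = (5300, 5900)

# Smallest ratio: cheapest MAD run against the most expensive TRACER run.
lowest_ratio = mad_tokens[0] / tracer_tokens[1]
# Largest ratio: most expensive MAD run against the cheapest TRACER run.
highest_ratio = mad_tokens[1] / tracer_tokens[0]

print(f"MAD uses roughly {lowest_ratio:.1f}x to {highest_ratio:.1f}x the tokens of TRACER")
# prints "MAD uses roughly 5.2x to 6.1x the tokens of TRACER"
```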

4. Timeliness & Relevance

The paper addresses a timely topic at the intersection of multi-agent LLM systems and reinforcement learning for reasoning. The recent success of reasoning-focused RL (DeepSeek-R1, etc.) and multi-agent frameworks makes this a relevant research direction. However, the paper's positioning against "fixed collaboration protocols" may understate recent advances in adaptive prompting strategies.

5. Strengths & Limitations

Key Strengths:

Clean architectural decomposition of when-to-speak and what-to-say

Binary controller actions enabling tractable regret matching with convergence guarantees

Role-specific credit assignment avoiding free-riding

Comprehensive ablation study (Table 3) demonstrating each component's contribution

Significant inference cost reduction compared to multi-agent prompting baselines

Code availability enhances reproducibility

Key Limitations:

TRACER often underperforms simple CoT on absolute accuracy, undermining the practical value proposition

Theoretical convergence applies only to the controller layer, not the full system

Training on GSM8K only is narrow; claims about "learned collaboration" would be stronger with diverse training domains

The phase state space grows quadratically with the horizon T (T(T+1)/2 phases), which could become unwieldy for longer horizons

The paper doesn't compare against recent strong single-agent reasoning methods (e.g., best-of-N with verification)

Training dynamics (Figure 3) show only 1000 samples, making it difficult to assess long-term stability

The "correlated equilibrium" convergence result for the controller is interesting but its practical significance for improving reasoning accuracy is unclear
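The quadratic growth of the phase state space noted in the limitations can be made tangible with a short count. The enumeration below, over pairs (i, j) with 1 ≤ j ≤ i ≤ T, is only one hypothetical indexing that yields the triangular number; the paper's actual definition of a phase is not given in this review.

```python
def phase_count(horizon: int) -> int:
    # Triangular number T(T+1)/2 for a horizon of T turns.
    return horizon * (horizon + 1) // 2


def enumerate_phases(horizon: int):
    # Hypothetical indexing: one phase per pair (i, j) with 1 <= j <= i <= T.
    return [(i, j) for i in range(1, horizon + 1) for j in range(1, i + 1)]


for T in (2, 4, 8, 16):
    print(T, phase_count(T))
# prints:
# 2 3
# 4 10
# 8 36
# 16 136
```

Doubling the horizon roughly quadruples the number of phases, so a controller that keeps separate regret counters per phase needs correspondingly more samples to cover them.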

Missing comparisons: No comparison with tree-search methods (ToT, MCTS-based approaches), verification-guided generation, or process reward models—methods that also address multi-step reasoning credit assignment.

Overall Assessment

TRACER presents an architecturally interesting framework that makes a reasonable attempt at bridging game theory and multi-agent LLM training. The binary action space design for controllers is the paper's most creative contribution. However, the empirical results do not convincingly demonstrate that learned collaboration outperforms simpler approaches on absolute accuracy. The paper's main value lies in the conceptual framework and efficiency gains rather than state-of-the-art reasoning performance. The theoretical contribution, while correct, is relatively incremental—applying known CFR results to a simplified setting.

Rating: 4.8 / 10

Significance 5 · Rigor 5 · Novelty 5.5 · Clarity 5.5

Generated May 28, 2026

Comparison History (22)

vs. A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

gemini-3.1 · 5/28/2026

Paper 1 introduces a fundamental algorithmic advancement in cooperative multi-LLM reasoning by integrating reinforcement learning with game-theoretic regret matching. This addresses critical bottlenecks like sparse rewards and fixed protocols, directly advancing AI reasoning capabilities. While Paper 2 offers a valuable automated benchmarking tool to address evaluation saturation, Paper 1's theoretical innovations and methodology have broader applicability and longer-term impact on building capable multi-agent AI systems.

vs. CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

claude-opus-4.6 · 5/28/2026

CORE introduces a fundamentally novel and elegant approach—contrastive reflection using natural language insights—that is more sample-efficient, interpretable, and broadly applicable across reasoning tasks. Its non-parametric nature makes it accessible and practical, addressing a key bottleneck (sample/rollout efficiency) that limits current RLVR approaches. TRACER, while technically sophisticated in combining game theory with multi-agent RL, addresses a narrower problem (multi-LLM collaboration) with higher complexity and more limited generalizability. CORE's simplicity, interpretability, and strong empirical results with minimal data suggest broader adoption potential and wider cross-field impact.

vs. PetroBench: A Benchmark for Large Language Models in Petroleum Engineering

gemini-3.1 · 5/28/2026

Paper 2 introduces a novel algorithmic framework (TRACER) that solves fundamental challenges in multi-agent LLM systems by combining reinforcement learning and game theory. This methodological innovation has broad applicability across AI and machine learning fields. In contrast, Paper 1 presents a domain-specific benchmark for petroleum engineering, which, while practically useful, has a much narrower scope and lower theoretical innovation.

vs. Online Allocation with Unknown Shared Supply

claude-opus-4.6 · 5/28/2026

TRACER addresses the highly active and impactful area of improving LLM reasoning through multi-agent cooperation combined with reinforcement learning. It tackles fundamental challenges (sparse rewards, free-riding, convergence) with a novel framework bridging game theory and deep learning. Given the enormous current interest in LLM reasoning improvements and multi-agent systems, this paper has broader potential impact across AI/ML communities. Paper 2 makes solid theoretical contributions to online allocation with tight bounds, but addresses a more niche operations research problem with narrower audience and application scope.

vs. A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

gemini-3.1 · 5/28/2026

Paper 2 addresses a highly critical and timely challenge in AI: integrating reinforcement learning with multi-agent LLM reasoning. Its novel approach of combining game theory (regret matching) with deep learning to solve sparse rewards and fixed collaboration protocols offers significant methodological innovation. This has a massive breadth of impact across complex AI reasoning tasks, whereas Paper 1 focuses on a more specific problem within multimodal sentiment analysis. The potential real-world applications and generalizability of multi-LLM cooperative frameworks give Paper 2 a higher estimated scientific impact.

vs. OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

gemini-3.1 · 5/28/2026

Paper 2 presents a fundamental methodological advancement by combining multi-agent reinforcement learning with LLM reasoning, addressing key issues like sparse rewards and convergence using game-theoretic concepts. Its framework (TRACER) is generalizable across various complex reasoning tasks, offering broader impact across AI subfields. In contrast, Paper 1 introduces a valuable but domain-specific benchmark for operations research, which, while practical for industrial applications, has a narrower scientific scope and theoretical contribution compared to Paper 2's algorithmic innovations.

vs. Continual Model Routing in Evolving Model Hubs

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: continual routing over rapidly expanding model hubs is a pressing, general problem affecting many tasks and deployment settings. It contributes a formal problem definition (CMR), a large-scale benchmark (CMRBench with 2,000+ models) that can anchor future work, and an efficient method (CARvE) with strong empirical comparisons. Paper 1 is innovative and rigorous for multi-LLM cooperation, but its impact is narrower (multi-agent reasoning/RL on specific reasoning benchmarks) and may be harder to generalize across modalities and hub-scale ecosystems.

vs. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

claude-opus-4.6 · 5/28/2026

SAM addresses the critical and broadly applicable problem of long-horizon agentic reasoning with a novel state-adaptive memory framework. It demonstrates results across four diverse benchmarks and multiple agent backbones, suggesting strong generalizability. The problem of managing long interaction histories is fundamental to the rapidly growing field of LLM agents. TRACER tackles multi-LLM cooperative reasoning with interesting game-theoretic foundations, but its evaluation is limited to math/QA benchmarks and its scope is narrower. SAM's framework-agnostic design and broader applicability give it higher potential impact.

vs. C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

gemini-3.1 · 5/28/2026

Paper 2 presents a foundational advancement in multi-agent reinforcement learning for LLMs. By bridging classical game theory (regret matching) with deep learning, it solves critical bottlenecks in multi-LLM cooperation like free-riding, sparse rewards, and high training overhead. While Paper 1 offers a valuable application-specific improvement for clinical RAG, Paper 2's methodological innovation has broader implications across the AI field, fundamentally improving how multiple LLMs can learn to collaborate dynamically for complex, general reasoning tasks.

vs. The Ethics of LLM Sandbox and Persona Dynamics

gpt-5.2 · 5/28/2026

Paper 2 has higher likely scientific impact: it proposes a concrete, technically novel framework (turn-level regret matching + role-specific credit assignment) with claims of convergence and reduced training cost, plus empirical evaluation on multiple benchmarks and open-source code—supporting rigor, reproducibility, and adoption. It is timely for multi-LLM coordination and reinforcement learning, with clear downstream applications in scalable reasoning systems and agentic workflows. Paper 1 offers valuable conceptual critique on LLM ethics and “reality laundering,” but is less operationalized and harder to validate experimentally, limiting near-term cross-field uptake.

vs. Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

claude-opus-4.6 · 5/28/2026

TRACER addresses a more timely and broadly impactful problem—combining reinforcement learning with multi-agent LLM reasoning—which is at the frontier of current AI research. It introduces a novel framework bridging game theory (regret matching) with multi-turn LLM collaboration, offering practical solutions to free-riding, sparse rewards, and fixed protocols. Its applicability to the rapidly growing LLM reasoning ecosystem gives it broader potential impact. Paper 2, while rigorous in advancing PSRO for zero-sum games, addresses a more specialized niche in computational game theory with narrower audience and application scope.

vs. RULER: Representation-Level Verification of Machine Unlearning

gpt-5.2 · 5/28/2026

Paper 1 likely has higher scientific impact due to its novel shift from output-level to representation-level verification for machine unlearning, addressing a critical gap with immediate implications for privacy, compliance, and safety auditing. It proposes both oracle-comparative and oracle-free metrics, demonstrates failures of multiple existing unlearning methods across diverse modalities, and uses statistically grounded analysis (mixed-effects models, effect sizes). This evaluation framework can influence many unlearning algorithms and regulatory practices. Paper 2 is timely and useful for multi-LLM coordination, but its impact is more benchmark- and paradigm-specific and may be outpaced quickly by fast-moving RL/agent methods.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

claude-opus-4.6 · 5/28/2026

TRACER addresses a fundamental challenge in combining reinforcement learning with multi-agent LLM reasoning, introducing a novel framework grounded in game theory (regret matching) with mathematical convergence guarantees. It tackles multiple important problems (sparse rewards, free-riding, fixed protocols) and demonstrates generalization across benchmarks. Paper 2 presents interesting mechanistic interpretability findings about refusal signals, but its scope is narrower—optimizing an existing attack method (AutoDAN) with efficiency gains. TRACER's broader applicability to cooperative multi-agent systems, novel theoretical contributions, and potential to reshape how LLMs collaborate give it higher impact potential.

vs. The Illusion of Opting in AI-Mediated Consequential Decisions

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical technical bottleneck in the rapidly growing field of multi-agent LLM reasoning. By combining RL with game theory for mathematically rigorous convergence and providing open-source code with empirical benchmark results, it offers a concrete, reproducible advancement. While Paper 2 presents an important ethical framework, Paper 1's technical innovation is more likely to see immediate, widespread adoption and citation across the active AI research community.

vs. FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

claude-opus-4.6 · 5/28/2026

TRACER addresses a fundamental challenge in multi-agent LLM reasoning by combining reinforcement learning with game-theoretic regret matching, providing mathematical convergence guarantees and empirical validation across multiple benchmarks. Its contributions—turn-level credit assignment, learned collaboration policies, and bridging game theory with deep learning—are broadly applicable beyond any single domain. FundaPod, while addressing an important niche in fundamental investment research, is more of a domain-specific architectural design with a case study demonstration rather than a generalizable methodological advance. TRACER's reproducible framework and code availability further enhance its potential impact.

vs. Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents

gemini-3.1 · 5/28/2026

Hera addresses the highly timely and practical device-cloud dilemma for LLM agents. By optimizing the performance-cost Pareto frontier, it presents immediate real-world applications for deploying efficient, autonomous agents on edge devices. While Paper 2 offers strong theoretical contributions to multi-agent reasoning, Paper 1's approach significantly lowers the barrier for practical, wide-scale LLM agent deployment, promising broader immediate impact across industry and applied research.

vs. Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

gemini-3.1 · 5/28/2026

Paper 2 identifies a fundamental and pervasive flaw (reward bias substitution) in current RLHF and preference-learning mitigation methods. Its theoretical formalization and critique of existing evaluation practices have broad implications for the entire field of AI alignment and safety. While Paper 1 offers a novel multi-agent reasoning framework, Paper 2's potential to shift foundational methodologies across a wider range of LLM optimization research gives it a higher estimated scientific impact.

vs. DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

gpt-5.2 · 5/28/2026

Paper 1 (TRACER) has higher estimated scientific impact due to greater novelty and broader implications: it proposes a principled turn-level RL framework for multi-LLM cooperation with explicit credit assignment and a regret-matching controller, addressing sparse rewards/free-riding and enabling learned collaboration protocols beyond fixed debate/voting. Its claimed convergence grounding via game-theoretic regret matching plus a reusable testbed suggests methodological rigor and cross-field relevance (multi-agent RL, LLM alignment, cooperative reasoning). Paper 2 (DREAM-R) is timely and useful for efficiency, but is more incremental within speculative decoding/verification.

vs. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental bottleneck in AI—combining reinforcement learning with multi-agent LLM systems—and introduces mathematically rigorous convergence through game-theoretic regret matching. While Paper 1 offers a highly practical approach to GPU kernel optimization, Paper 2's focus on foundational reasoning and collaborative policies promises a broader impact across various domains and applications in agentic AI.

vs. An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

gemini-3.1 · 5/28/2026

Paper 2 addresses a highly timely and critical challenge in AI: combining reinforcement learning with multi-agent LLM reasoning. By integrating game-theoretic regret matching with RL, it offers a novel, generalizable framework with rigorous convergence properties. Its impact spans NLP, AI, and RL communities. In contrast, Paper 1 presents an algorithmic improvement (a metaheuristic LNS) for a specific variant of a classical operations research problem, which, while practically useful, offers a narrower scope and represents more incremental scientific innovation compared to the broader, fast-moving field of LLM reasoning.