Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

#208 of 2292 · Artificial Intelligence
Tournament Score: 1517 ± 46
Win Rate: 85% (17 wins, 3 losses, 20 matches)
Rating: 7.4 / 10

Abstract

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the winner-takes-all nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GCPO — Breaking Winner-Takes-All via Cooperative Policy Optimization

1. Core Contribution

GCPO addresses a well-documented problem in reinforcement learning with verifiers (RLVR): exploration collapse, where models trained with algorithms like GRPO converge to narrow, homogeneous reasoning patterns. The key insight is that existing diversity-promoting methods (entropy regularization, diversity bonuses) maintain a "winner-takes-all" competitive structure among rollouts. GCPO fundamentally restructures this by treating rollouts as a cooperative team.

The method has three components: (1) a quality-diversity coupled team value based on a determinantal point process (DPP) log-determinant over reward-weighted semantic embeddings, (2) Shapley value-based credit assignment that distributes the team's value fairly to individual rollouts, and (3) reward redistribution that preserves the total reward budget while reallocating credit toward semantically non-redundant correct rollouts.
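To make the first component concrete, here is a minimal sketch of a quality-diversity coupled team value. It assumes the value takes the form log det(I + L) with L_ij = r_i · S_ij · r_j, where S is the cosine similarity of rollout embeddings and r_i is the verifier reward; the paper's exact kernel and normalization are not stated in this review, and the function name `team_value` is hypothetical.

```python
import numpy as np

def team_value(embeddings, rewards):
    """Assumed team value: log det(I + L), with L_ij = r_i * S_ij * r_j."""
    E = np.asarray(embeddings, dtype=float)
    r = np.asarray(rewards, dtype=float)
    # Normalize rows so that E @ E.T is the cosine-similarity matrix S.
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    S = E @ E.T
    # Reward weighting: a rollout with reward 0 gets an all-zero row and column.
    L = r[:, None] * S * r[None, :]
    _, logdet = np.linalg.slogdet(np.eye(len(r)) + L)
    return logdet

# Toy 2-D embeddings (illustrative only, not from the paper).
distinct = team_value([[1, 0], [0, 1]], [1, 1])   # two different correct paths
duplicate = team_value([[1, 0], [1, 0]], [1, 1])  # the same correct path twice
one_wrong = team_value([[1, 0], [0, 1]], [1, 0])  # second rollout is incorrect
print(distinct, duplicate, one_wrong)
```

Under this assumed form, two distinct correct rollouts score log 4, two duplicates score only log 3, and an incorrect rollout adds nothing beyond the log 2 of the single correct one, which matches the review's description that only correct and non-redundant rollouts enlarge the volume.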

The central novelty is the combination of DPP-based team coverage with game-theoretic credit assignment (Shapley values), applied as a pure reward-shaping mechanism within existing GRPO infrastructure. This is a meaningful conceptual shift from additive diversity bonuses to cooperative credit assignment.

2. Methodological Rigor

The theoretical foundation is solid. The paper provides complete proofs for five propositions/theorems characterizing: the spectral decomposition of team value (Proposition 1), the marginal contribution formula via Schur complements (Theorem 2), Shapley values for duplicate rollouts (Theorem 3), reward conservation (Proposition 4), and gradient decomposition (Proposition 5).
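For readers unfamiliar with the Schur-complement route to a marginal contribution, the identity can be sketched as follows. This assumes a team value of the form $v(S) = \log\det(I + L_S)$, which may differ from the paper's exact definition; the statement of Theorem 2 itself is not reproduced here.

```latex
% Assumption: v(S) = \log\det(I + L_S), with L_S the kernel restricted to team S.
% For a rollout i \notin S, write the enlarged matrix in block form:
\[
I + L_{S \cup \{i\}} =
\begin{pmatrix}
I + L_S & L_{S,i} \\
L_{i,S} & 1 + L_{ii}
\end{pmatrix}.
\]
% The block-determinant (Schur complement) formula then gives
\[
v(S \cup \{i\}) - v(S)
= \log\!\Bigl( 1 + L_{ii} - L_{i,S}\,(I + L_S)^{-1} L_{S,i} \Bigr).
\]
```

The subtracted term measures how much of rollout $i$ is already explained by the team $S$, so a rollout that duplicates an existing member earns a smaller marginal gain than a novel one.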

Theorem 3's comparison with leave-one-out (LOO) credit is particularly illuminating—it formally shows that LOO underestimates shared contributions of duplicate reasoning paths, while Shapley distributes value fairly. The reward conservation property (Proposition 4) is important for practical stability, distinguishing GCPO from additive bonus approaches that inflate the reward mass.
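The contrast between LOO and Shapley credit, and the conservation property, can be illustrated on a toy team. This is a sketch under stated assumptions: the team value is taken as log det(I + L_S), and the redistribution rule (rescaling credit so it sums to the original reward total) is a hypothetical stand-in for the paper's rule, which this review does not spell out.

```python
import itertools
import math
import numpy as np

def team_value(L, members):
    """Assumed team value: log det(I + L_S) over the selected rollouts."""
    idx = list(members)
    if not idx:
        return 0.0
    sub = L[np.ix_(idx, idx)]
    return np.linalg.slogdet(np.eye(len(idx)) + sub)[1]

def shapley_values(L):
    """Exact Shapley values: average marginal gain over all orderings."""
    n = L.shape[0]
    phi = np.zeros(n)
    for order in itertools.permutations(range(n)):
        prefix, prev = [], 0.0
        for i in order:
            prefix.append(i)
            cur = team_value(L, prefix)
            phi[i] += cur - prev
            prev = cur
    return phi / math.factorial(n)

def leave_one_out(L):
    """LOO credit: value lost when a single rollout is removed."""
    n = L.shape[0]
    full = team_value(L, range(n))
    return np.array([
        full - team_value(L, [j for j in range(n) if j != i])
        for i in range(n)
    ])

def redistribute(rewards, credit):
    """Hypothetical rule: rescale credit to sum to the original reward total."""
    rewards = np.asarray(rewards, dtype=float)
    total = credit.sum()
    if total <= 0:
        return rewards.copy()
    return rewards.sum() * credit / total

# Three correct rollouts: 0 and 1 are exact duplicates, 2 is distinct.
L = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
phi = shapley_values(L)
loo = leave_one_out(L)
new_rewards = redistribute([1.0, 1.0, 1.0], phi)
print(phi, loo, new_rewards)
```

In this toy case each duplicate receives log(3)/2 under Shapley but only log(1.5) under LOO, so the LOO credits sum to less than the team value while the Shapley credits sum to it exactly. The rescaled rewards still total 3, with the distinct rollout receiving the largest share.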

The experimental design is comprehensive: two model scales (Qwen3-1.7B, 4B), five in-domain and three out-of-domain benchmarks, and comparisons against five baselines (GRPO, DAPO, Div-R1, DIVER, DQO). The ablation study (Table 2) systematically validates each component. The paper also includes useful robustness checks on encoder choice and kernel function (Tables 8-9), which strengthen confidence that gains stem from the cooperative mechanism rather than specific implementation choices.

One concern: the paper reports point estimates without confidence intervals for most results. For smaller test sets (AIME24/25), Avg@16 provides some variance reduction, but single-run results on larger benchmarks could benefit from multiple seeds. The computational overhead analysis (Table 3) is appreciated but covers only wall-clock time per step, not total training cost.

3. Potential Impact

Immediate impact: GCPO offers a drop-in replacement for reward computation in GRPO-based training pipelines, requiring minimal code changes. The modest computational overhead (+3.7s/step) makes adoption practical.

Broader implications: The cooperative optimization paradigm could extend beyond math reasoning to code generation, scientific discovery, and any domain where solution diversity matters. The framework's game-theoretic foundation (Shapley values for credit assignment in RL) could inspire related work in multi-agent RL and ensemble methods.

Practical significance: The Pass@k improvements are particularly notable—GCPO shows increasing gaps at higher k values (Figure 3), indicating genuine expansion of the solution space rather than marginal improvements on already-covered paths. This directly benefits best-of-N sampling strategies used in production LLM systems.
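For reference, Pass@k is commonly computed with the unbiased estimator 1 − C(n−c, k) / C(n, k), where n is the number of sampled rollouts and c the number of correct ones. Whether the paper uses this exact estimator is an assumption; the sketch below only shows how the metric responds to k.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers (not from the paper): 2 correct rollouts out of 16.
p1 = pass_at_k(16, 2, 1)
p8 = pass_at_k(16, 2, 8)
print(p1, p8)
```

With these toy numbers a problem solved in only 2 of 16 rollouts contributes 0.125 to Pass@1 but about 0.77 to Pass@8, which is why rarely sampled but correct solution paths show up mainly at higher k.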

4. Timeliness & Relevance

This paper addresses one of the most active research areas in LLM training. The exploration collapse problem in RLVR is widely recognized (cited works from early-mid 2025), and GCPO provides a theoretically principled solution at a time when the field is saturated with ad hoc diversity heuristics. The framing of "competition vs. cooperation" among rollouts is timely and provides conceptual clarity.

The choice to build on Qwen3 models and standard benchmarks (AIME, MATH500, GPQA) ensures relevance to current practice. The concurrent development of methods like DIVER, DQO, and Div-R1 confirms this is a high-demand research direction.

5. Strengths & Limitations

Strengths:

  • Principled formulation: The DPP + Shapley combination elegantly couples quality and diversity without introducing auxiliary objectives. The reward conservation property is a theoretically satisfying distinction from additive approaches.
  • Comprehensive evaluation: Multi-dimensional diversity analysis (eigenvalue spectra, Pareto frontiers, five diversity metrics, qualitative examples) goes well beyond standard accuracy reporting.
  • Scalability analysis: The Monte Carlo Shapley estimator with incremental Cholesky updates (Appendix A.3) addresses the exponential cost concern, with formal concentration bounds.
  • Strong ablations: Each component (quality-diversity coupling, DPP volume, Shapley values, redistribution) is validated in isolation.

Limitations:

  • Reliance on frozen encoder: The semantic embedding space (all-MiniLM-L6-v2) is a proxy for reasoning diversity. While robustness is shown across encoders, embedding-level similarity may not capture structural reasoning differences (e.g., two algebraically different approaches that produce similar sentence embeddings).
  • Scalability: Exact Shapley computation is feasible only for small group sizes (G≤8). The Monte Carlo estimator helps but adds approximation error for larger groups. The paper does not experiment with G>16.
  • Limited model scale: Experiments use 1.7B and 4B models. Whether the cooperative mechanism remains beneficial for frontier-scale models (70B+) where RLVR dynamics may differ is unclear.
  • Training data homogeneity: All experiments use a single math training dataset (DAPO-Math-17K). Generalization to other training distributions is untested.
  • Missing statistical rigor: No error bars or significance tests accompany the main results table.
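The scalability trade-off noted above can be sketched with a permutation-sampling Shapley estimator. This is a generic Monte Carlo estimator under the assumed team value log det(I + L_S); it recomputes each determinant from scratch and does not reproduce the incremental Cholesky updates or the concentration bounds of the paper's Appendix A.3. The function names and the default of 2000 permutations are hypothetical.

```python
import numpy as np

def team_value(L, members):
    """Assumed team value: log det(I + L_S) over the selected rollouts."""
    idx = list(members)
    if not idx:
        return 0.0
    sub = L[np.ix_(idx, idx)]
    return np.linalg.slogdet(np.eye(len(idx)) + sub)[1]

def monte_carlo_shapley(L, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = np.random.default_rng(seed)
    n = L.shape[0]
    phi = np.zeros(n)
    for _ in range(num_permutations):
        order = rng.permutation(n)
        prefix, prev = [], 0.0
        for i in order:
            prefix.append(int(i))
            cur = team_value(L, prefix)
            phi[i] += cur - prev
            prev = cur
    return phi / num_permutations

# Same toy team as a sanity check: rollouts 0 and 1 are duplicates, 2 is distinct.
L = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
phi_mc = monte_carlo_shapley(L)
print(phi_mc, phi_mc.sum())
```

Because every sampled ordering's marginal gains telescope to the full team value, the estimates sum to that value exactly; the approximation error only affects how it is split among rollouts. The cost is one determinant per rollout per sampled ordering, rather than one per rollout for each of the G! orderings.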

Additional Observations

    The qualitative analysis (Figure 6, Appendix E.3) is compelling—GCPO discovers genuinely distinct solution strategies (case analysis vs. DP-based column sweep), while GRPO collapses to superficial variations. The DQO case study showing "superficial reformulations" nicely illustrates the limitation of additive diversity bonuses.

    The connection to cooperative game theory (Shapley values) in the context of RL for LLMs is novel and opens interesting theoretical directions for future work.

    Rating: 7.4 / 10
    Significance 7.5 · Rigor 7.5 · Novelty 7.5 · Clarity 8

    Generated May 13, 2026

    Comparison History (20)

    vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
    claude-opus-4.6 · 5/21/2026

    Paper 2 addresses a fundamental limitation (exploration collapse) in RLVR for LLM reasoning, proposing a cooperative optimization paradigm (GCPO) that shifts from competitive to team-based credit assignment. Given the massive current interest in LLM reasoning improvement (post-DeepSeek R1/GRPO), this work is exceptionally timely and broadly applicable. Paper 1 addresses compositional guidance for flow models, which is valuable but more niche. Paper 2's cooperative framework with determinantal volume-based diversity has broader theoretical novelty and wider potential adoption across the LLM reasoning community.

    vs. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
    gemini-3.1 · 5/18/2026

    Paper 1 introduces a foundational algorithmic innovation (GCPO) that addresses a critical bottleneck (exploration collapse) in reinforcement learning for LLMs. By shifting from competitive to cooperative policy optimization, it provides a generalizable solution to improve reasoning diversity and accuracy across all domains. While Paper 2 offers significant value for medical AI auditability, Paper 1's methodological breakthrough in RL optimization has a broader potential impact across the entire machine learning and AI community.

    vs. ASMR-Bench: Auditing for Sabotage in ML Research
    claude-opus-4.6 · 5/16/2026

    Paper 1 addresses a fundamental limitation in RLVR-based LLM reasoning (exploration collapse) with a novel cooperative optimization framework (GCPO) that shifts from competitive to cooperative rollout training. This has broad applicability across all LLM reasoning tasks and directly improves both accuracy and diversity. Paper 2 introduces an important but narrower benchmark for detecting sabotage in ML codebases—a timely AI safety contribution, but with more limited immediate impact. Paper 1's methodological innovation in credit assignment and diversity optimization is likely to influence a larger research community working on LLM training.

    vs. From History to State: Constant-Context Skill Learning for LLM Agents
    gpt-5.2 · 5/16/2026

    Paper 1 is likely to have higher impact: it introduces a practical context-to-weights framework for LLM agents that directly addresses a major deployment bottleneck (privacy/cost from long histories and cloud calls) while improving efficiency (2–7× fewer tokens) and maintaining strong performance across multiple agent benchmarks and model sizes, suggesting broad applicability to real-world assistants and tool-use systems. Its method combines a deterministic state tracker with SFT+online RL, indicating solid rigor and an actionable systems contribution. Paper 2 is novel for RLVR diversity, but appears more niche to verifier-based reasoning training.

    vs. Process Matters more than Output for Distinguishing Humans from Machines
    gpt-5.2 · 5/16/2026

    Paper 1 has higher likely scientific impact due to stronger novelty (process-level human–machine discrimination via a cognitively grounded task battery), broader cross-field relevance (AI evaluation, cognitive science, security/anti-bot, human-computer interaction), and clear real-world application need as LLM agents proliferate. The introduction of CogCAPTCHA30 and empirical evidence that process features outperform outputs (AUC 0.88) provide a durable evaluation framework and new research direction (process specification bottleneck). Paper 2 is timely and useful for RLVR training, but is a more incremental algorithmic improvement with narrower scope.

    vs. KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a fundamental bottleneck in foundation model training: exploration collapse in Reinforcement Learning for reasoning (RLVR). By shifting from individual rollout competition to team-level credit assignment, GCPO offers a highly novel algorithmic improvement applicable across all LLM reasoning tasks. While Paper 1 introduces a valuable benchmark for mobile agents, Paper 2 provides a foundational algorithmic advancement that directly impacts the core training paradigms of state-of-the-art reasoning models, giving it broader cross-domain applicability and significantly higher potential scientific impact.

    vs. EMBER: Autonomous Cognitive Behaviour from Learned Spiking Neural Network Dynamics in a Hybrid LLM Architecture
    gpt-5.2 · 5/16/2026

    Paper 2 (GCPO) likely has higher scientific impact due to strong timeliness and direct applicability to mainstream LLM post-training: it addresses a widely observed RLVR failure mode (exploration/diversity collapse) with a clear, generally usable objective (team-level credit via determinant/coverage) and benchmarked improvements in both accuracy and diversity, plus promised code release aiding adoption and replication. Paper 1 is novel and ambitious, but its claims hinge on complex hybrid SNN dynamics and limited demonstrated evaluation; real-world scalability, reproducibility, and broad adoption are less certain at this stage.

    vs. Riemann-Bench: A Benchmark for Moonshot Mathematics
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher impact: it introduces a concrete, general-purpose RL training algorithm (GCPO) addressing exploration collapse in RLVR, with broad applicability to many LLM reasoning tasks and immediate real-world use in model training. The methodological contribution (team-level credit assignment via determinant-based coverage and marginal contribution advantages) is novel and can influence future RLHF/RLVR research. Paper 1 is valuable and rigorous as an evaluation benchmark, but its private nature may limit reproducibility, adoption, and downstream scientific reuse, narrowing impact despite high novelty and timeliness.

    vs. CircuitFormer: A Circuit Language Model for Analog Topology Design from Natural Language Prompt
    gpt-5.2 · 5/16/2026

    Paper 1 likely has higher scientific impact due to its substantial new resource (31k paired NL–netlist dataset) plus a domain-specific tokenizer with theoretical scalability benefits (O(1) vocab growth) and strong empirical gains, enabling practical analog EDA automation—an area with high industrial relevance and long-standing difficulty. It combines methodological innovation, reproducibility (public dataset), and cross-field impact (LLMs + hardware/EDA). Paper 2 proposes a promising RLVR variant for diversity, but resembles an algorithmic refinement in a crowded space and may face faster commoditization.

    vs. Human-Inspired Memory Architecture for LLM Agents
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher scientific impact: it introduces a new RLVR training objective (cooperative, coverage-based credit assignment) that directly targets a known failure mode (exploration collapse) in a widely used paradigm for LLM reasoning. If results generalize, GCPO can be broadly adopted across many reasoning tasks and model families, influencing both ML methodology and downstream applications. Paper 1 is well-motivated and useful for agent memory, but its impact may be narrower (systems/memory management) and more benchmark- and implementation-dependent, with less fundamental change to model training.

    vs. Orchard: An Open-Source Agentic Modeling Framework
    gpt-5.2 · 5/16/2026

    Paper 2 introduces a broadly applicable algorithmic change to RL-with-verifiers—shifting from competitive to cooperative, coverage-driven credit assignment—to address exploration collapse and improve both accuracy and diversity. This conceptual contribution is novel, method-centric, and likely transferable across many LLM reasoning and RL settings, giving it wider cross-field impact and timeliness. Paper 1 is impactful as open infrastructure plus strong empirical results, but its main advances are engineering/system integration and recipe tuning around specific agent domains; algorithmic novelty and generality appear lower than GCPO’s.

    vs. AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents
    claude-opus-4.6 · 5/13/2026

    Paper 2 addresses a fundamental limitation (exploration collapse) in reinforcement learning for LLM reasoning, a highly active research area. Its cooperative policy optimization framework (GCPO) introduces a novel paradigm shift from individual rollout competition to team-level credit assignment, with broad applicability across reasoning tasks. The theoretical contribution (determinantal volume-based credit assignment) is elegant and the problem is timely given the centrality of RLVR. Paper 1, while practical, addresses a more niche systems-level routing problem with incremental contributions using existing techniques on compact models.

    vs. Lightweight LLM Agent Memory with Small Language Models
    gemini-3.1 · 5/13/2026

    Paper 1 addresses a fundamental flaw in reinforcement learning for LLM reasoning (exploration collapse in group-based optimization). Its novel cooperative policy optimization approach using determinant volume for team-level credit assignment tackles a core training challenge. This foundational algorithmic improvement has a broader and deeper potential impact on how foundation models are trained to reason, whereas Paper 2 offers a practical, yet more engineering-focused, systems architecture for agent memory.

    vs. A Prompt-Aware Structuring Framework for Reliable Reuse of AI-Generated Content in the Agentic Web
    gpt-5.2 · 5/13/2026

    Paper 1 introduces a novel RL training objective (GCPO) that changes the underlying credit assignment from competitive to cooperative, addressing a known failure mode (exploration collapse) with a principled determinant-volume diversity measure and marginal contribution advantages. It is directly applicable to improving LLM reasoning performance and diversity, likely influencing RLHF/RLVR research broadly. Paper 2 targets important provenance/compliance needs, but reads more like a systems/framework proposal with less clear algorithmic novelty and empirical rigor from the abstract. Paper 1 is more methodologically impactful and timely for core LLM training.

    vs. Reason Analogically via Cross-domain Prior Knowledge: An Empirical Study of Cross-domain Knowledge Transfer for In-Context Learning
    claude-opus-4.6 · 5/13/2026

    Paper 2 (GCPO) addresses a fundamental limitation in RLVR-based LLM reasoning—exploration collapse in algorithms like GRPO—with a novel cooperative optimization framework that shifts from individual rollout competition to team-level credit assignment. This is a more innovative methodological contribution with broader applicability across reasoning tasks. Paper 1 is primarily an empirical study validating cross-domain ICL transfer, which, while useful, offers less novelty in terms of new methods. GCPO's cooperative paradigm has stronger potential to influence the rapidly growing RLVR research community and improve diverse reasoning capabilities.

    vs. Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
    gpt-5.2 · 5/13/2026

    Paper 1 proposes a broadly applicable, conceptually novel shift in RL-with-verifiers from competitive rollout selection to cooperative, team-level credit assignment via a determinant-volume coverage objective and marginal-contribution advantages. This directly targets a known failure mode (exploration/diversity collapse) in a widely used LLM post-training paradigm, making it timely and likely impactful across many reasoning tasks and model families. Paper 2 is valuable but more domain-specific (murder-mystery game scripts) and relies on a combination of existing techniques (multi-agent data generation + GRPO reward shaping), with narrower cross-field impact.

    vs. Standing on the Shoulders of Giants: Stabilized Knowledge Distillation for Cross--Language Code Clone Detection
    gpt-5.2 · 5/13/2026

    Paper 1 likely has higher scientific impact due to a more novel RL training objective (cooperative, team-level credit assignment via determinant-volume coverage) that addresses a broadly relevant failure mode (exploration collapse) in RL-with-verifiers for LLM reasoning. If robust, it could generalize across many reasoning tasks and influence future RLHF/RLVR algorithm design, making it timely and wide-reaching. Paper 2 is practical and methodologically solid, but it is more incremental (distillation + stabilization) and narrower in scope (cross-language code clone detection), with impact concentrated in software engineering applications.

    vs. Optimal LTLf Synthesis
    gpt-5.2 · 5/13/2026

    Paper 2 likely has higher impact: it targets a timely, fast-moving area (LLM reasoning and RL with verifiers) with immediate real-world applicability to improving deployed models. Its cooperative, coverage-based credit assignment is a novel twist on group optimization that could influence broader RL/optimization work and be adopted widely if results replicate. Paper 1 is methodologically solid and valuable for formal methods, but its impact is more specialized and incremental within LTLf synthesis compared to the cross-field relevance and adoption potential of Paper 2.

    vs. Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
    gpt-5.2 · 5/13/2026

    Paper 1 is more likely to have higher scientific impact: it proposes a clearly novel RL training objective (cooperative, team-level credit assignment via determinant-volume coverage and marginal contributions) addressing a known failure mode (exploration collapse) with broad relevance to RLHF/RLVR and reasoning diversity across many domains. It appears methodologically concrete and benchmark-driven, making it readily testable and extensible. Paper 2 is highly application-relevant but reads more like a systems/industry framework; its novelty and rigor depend on implementation details and controlled evaluation that are not evident from the abstract, and its impact may be narrower to enterprise supply-chain operations.

    vs. Do Enterprise Systems Need Learned World Models? The Importance of Context to Infer Dynamics
    gpt-5.2 · 5/13/2026

    Paper 1 introduces a novel RLVR training paradigm (cooperative, team-level credit assignment using determinant-volume coverage) that directly targets a known failure mode (exploration collapse) in a highly active area (LLM reasoning optimization). It offers broad applicability across RL-based post-training methods and tasks, with clear potential to influence future algorithm design and tooling. Paper 2 is timely and valuable for enterprise-agent robustness, but its impact is narrower to configurable systems and relies more on reframing plus benchmarking than a generally reusable learning algorithm. Overall, Paper 1 is likely to have wider cross-field uptake.