JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao

Apr 28, 2026

arXiv:2604.25419v1 PDF

cs.AI(primary)

#87of 2292·Artificial Intelligence

#87 of 2292 · Artificial Intelligence

Tournament Score

1548±34

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

6.8/ 10

Significance6.5

Rigor7.5

Novelty6.5

Clarity7.5

Tournament Score

1548±34

10501800

59%

Win Rate

Wins

Losses

Matches

Rating

6.8/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: JURY-RL

1. Core Contribution

JURY-RL addresses a meaningful gap in reinforcement learning with verifiable rewards (RLVR): the dependency on human-annotated ground-truth answers. The key insight is a clean decoupling of answer proposal (via majority voting over rollouts) from reward disposal (via formal Lean verification of only the top-voted candidate). This "Votes Propose, Proofs Dispose" paradigm elegantly reduces the verification cost to a single Lean call per group while maintaining reward fidelity.

The second contribution, ResZero (Residual-Zero), is a carefully designed fallback reward for when verification is inconclusive. Rather than rewarding unverified consensus (which risks collapse) or assigning zero reward (which stalls learning), ResZero penalizes the unverified majority, redistributes a zero-mean variance-preserving signal over residual answers, and adaptively scales with majority confidence α. This is a principled design choice with formal guarantees (zero-mean property proven in Appendix A.2).

2. Methodological Rigor

Theoretical grounding. The paper provides a thorough theoretical analysis of why naive majority voting is brittle under GRPO (advantages for supporters vanish while dissenter penalties diverge as consensus strengthens), and formally proves the zero-mean property of ResZero. The three-way comparison of fallback strategies (MV/Zero/ResZero) in Appendix A.3 is informative and clearly motivates the design.

Experimental design. The evaluation is comprehensive: three backbone models (Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, Qwen2.5-7B), nine benchmarks spanning math, code, instruction-following, and multi-task, and seven baselines including both label-free and supervised methods. Confidence intervals are reported. The ablation study (Table 3) is particularly well-executed, systematically testing four fallback designs under the same proof-gated framework.

Weaknesses in rigor. The Lean verifier pipeline is not perfectly precise—Table 4 shows 84.5% precision, meaning ~15.5% of positively rewarded answers may be incorrect (due to autoformalization/consistency-check errors, not Lean itself). While the paper acknowledges this, the gap between theoretical soundness ("formal verification has zero false positives") and practical reality deserves more scrutiny. The verification recall of 88% means ~12% of correct answers receive no positive reward, which could systematically bias learning against certain problem types. Additionally, the paper trains on only 7,500 MATH problems—it remains unclear how well the approach scales to larger, more diverse datasets.

3. Potential Impact

Practical significance. Removing the need for ground-truth labels in RLVR is a genuine practical advance. The framework achieves performance comparable to supervised ground-truth training (Table 1) while being label-free, which is the headline result. The superior pass@k performance and maintained diversity (Figure 2, Table 2) suggest the method produces more robust reasoning models.

Scope limitations. The approach is fundamentally restricted to machine-checkable domains—currently mathematics via Lean. Extension to other domains requires analogous formal verifiers, which may not exist or may be prohibitively expensive. The paper's code generation results are transfer performance (trained on math), not direct verification of code. This narrows the immediate applicability.

Broader influence. The "propose-then-verify" paradigm could influence how the community thinks about reward design in RLVR more generally. The ResZero fallback mechanism is a reusable contribution that applies beyond Lean verification to any setting with binary but incomplete verification signals.

4. Timeliness & Relevance

This work is highly timely. RLVR has become a central technique following DeepSeek-R1 and related work, and the community is actively seeking ways to reduce label dependency. The concurrent explosion of label-free methods (majority voting, self-certainty, entropy minimization, CoReward) creates a well-defined competitive landscape that this paper improves upon. The use of Lean for verification aligns with growing interest in formal methods for AI alignment.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framework with a memorable slogan ("Votes Propose, Proofs Dispose")

ResZero is a well-motivated, formally grounded mechanism with a single hyperparameter

Comprehensive experiments across models, benchmarks, and baselines with proper confidence intervals

Strong ablation studies that isolate the contribution of each component

Test-time training (TTRL) extension demonstrates generality

Verifier signal quality analysis (Table 4) provides honest assessment of the pipeline

Notable Limitations:

Domain restriction: Requires formal verification infrastructure (Lean), limiting applicability to mathematics and potentially some program verification tasks

Verification pipeline imperfections: 84.5% precision means the system is not truly "proof-gated" in the formal sense—it's "pipeline-gated," and the pipeline includes fallible neural components (autoformalizer, consistency checker)

Computational overhead: ~200s additional per step initially (vs. ~80s for LLM-as-a-Judge), though amortized via caching

Modest absolute gains on some benchmarks: On easier benchmarks (GSM8K, MATH500), differences between methods are often within confidence intervals

Missing comparison with process-reward models or other verification-augmented approaches

Single hyperparameter c sensitivity: While the paper shows c=0.01 works well, Figure 5 suggests non-trivial sensitivity, and the optimal value may be dataset-dependent

Minority-correct missed rate (Table 14) peaks at 9.3%, representing a systematic blind spot

Overall Assessment

JURY-RL presents a well-engineered framework that makes a clear advance in label-free RLVR for mathematical reasoning. The combination of majority voting for proposal efficiency and formal verification for reward fidelity is sensible and well-executed. ResZero is a genuinely useful contribution. However, the domain restriction to formally verifiable settings, the gap between theoretical soundness and practical pipeline accuracy, and the modest margins on some benchmarks temper the overall impact. The work is solid engineering and systems research rather than a fundamental conceptual breakthrough.

Rating:6.8/ 10

Significance 6.5Rigor 7.5Novelty 6.5Clarity 7.5

Generated Apr 29, 2026

Comparison History (41)

vs. TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints

gemini-3.15/16/2026

Paper 2 introduces a novel, label-free reinforcement learning framework that directly improves LLM reasoning capabilities using formal verifiers. Advancements in RL-based training methodologies for reasoning currently drive significant progress in the field, giving this work broader applicability and higher potential impact than the evaluation framework presented in Paper 1.

vs. SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

claude-opus-4.64/29/2026

JURY-RL addresses a fundamental and timely challenge in RLVR—eliminating the need for human-annotated labels while maintaining training stability—with a concrete, well-validated framework combining majority voting with formal verification. It demonstrates strong empirical results across multiple benchmarks and models, with clear practical applicability to LLM training pipelines. SEA-Eval introduces a useful benchmark for self-evolving agents, but benchmarks typically have narrower impact than novel training methodologies. JURY-RL's contribution to label-free reinforcement learning for reasoning has broader immediate applicability and addresses a bottleneck affecting the entire LLM training community.

vs. SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

claude-opus-4.64/29/2026

JURY-RL addresses a fundamental and practical bottleneck in RLVR—the dependence on human-annotated labels—with a principled framework combining majority voting with formal verification. Its label-free approach has immediate, broad applicability to LLM training across math reasoning and code generation, with strong empirical results matching supervised baselines. The ResZero fallback mechanism is a novel technical contribution with theoretical grounding. SEA-Eval introduces a useful benchmark for self-evolving agents but is primarily an evaluation framework rather than a methodological advance, and the self-evolving agent paradigm it targets is still nascent with limited practical adoption.

vs. RADD: Retrieval-Augmented Discrete Diffusion for Multi-Modal Knowledge Graph Completion

claude-opus-4.64/29/2026

JURY-RL addresses a fundamental challenge in RLVR for LLMs—removing the need for human-annotated labels while maintaining training stability. Its novel decoupling of vote-based proposal and formal verification, combined with the ResZero fallback mechanism, offers a principled label-free framework with broad applicability across math reasoning, code generation, and general benchmarks. The work is highly timely given the explosive interest in reasoning LLMs and RLHF/RLVR methods. Paper 2, while solid, addresses a narrower problem (multi-modal KG completion) with more incremental contributions combining existing ideas (retrieval-augmented generation, discrete diffusion). Paper 1's broader impact potential and timeliness give it the edge.

vs. RADD: Retrieval-Augmented Discrete Diffusion for Multi-Modal Knowledge Graph Completion

claude-opus-4.64/29/2026

JURY-RL addresses a fundamental challenge in RLVR for LLMs—removing the need for human-annotated labels while maintaining training stability. Its novel decoupling of vote-based proposal from formal verification, combined with the ResZero fallback mechanism, represents a significant methodological innovation with broad applicability across reasoning, code generation, and general tasks. The work is highly timely given the rapid scaling of RLHF/RLVR methods. Paper 2, while solid, addresses the narrower domain of multi-modal knowledge graph completion with a retrieve-and-rerank framework that, though effective, represents a more incremental architectural contribution with more limited breadth of impact.

vs. Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

gpt-5.24/29/2026

Paper 1 likely has higher scientific impact: it introduces a novel, training-time RLVR framework that materially improves LLM reasoning without labels by combining plurality voting with formal verification and a variance-preserving fallback reward, and it demonstrates strong benchmark gains plus transfer. This is methodologically substantive and broadly useful for scalable alignment/training in verifiable domains (math, code, theorem proving), aligning with a timely push toward automated, reliable RL signals. Paper 2 provides valuable diagnostics for LLM-judge evaluation reliability, but is more incremental and primarily impacts evaluation practice rather than core capability training.

vs. Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

gpt-5.24/29/2026

Paper 2 likely has higher impact: it proposes a novel, practically useful RLVR training framework that reduces reliance on labels while preventing false-positive reward signals via formal verification and a variance-preserving fallback (ResZero). This is timely for scalable reasoning improvements and could be adopted broadly in math/proof and code domains, influencing both RL methodology and formal-verification-integrated training. Paper 1 offers valuable diagnostics for LLM-as-judge reliability, but is more evaluative/diagnostic with narrower immediate application and less direct effect on model capability scaling.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-34/29/2026

Paper 1 introduces a novel, scalable methodology for label-free reinforcement learning using formal verifiers, addressing a critical bottleneck in training advanced reasoning models. Its ability to stabilize optimization without human annotations aligns perfectly with the highly impactful trend of self-improving LLMs. While Paper 2 offers a valuable benchmark for agentic behavior, Paper 1 provides a fundamental algorithmic advancement with immediate, broad implications for model training paradigms in mathematics, code generation, and beyond.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

claude-opus-4.64/29/2026

Hodoscope introduces a genuinely novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a fundamental and growing problem in AI safety. It formulates a new problem (unsupervised monitoring), discovers real previously-unknown vulnerabilities in established benchmarks, and demonstrates practical utility with 6-23× review effort reduction. Its breadth of impact spans AI safety, evaluation, and agent deployment. While JURY-RL is a solid contribution to label-free RLVR, it is more incremental within an established research direction. Hodoscope's conceptual novelty and timeliness for AI safety give it higher potential impact.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

claude-opus-4.64/29/2026

Paper 1 addresses a fundamental and broadly applicable problem—prior contamination in LLM-assisted analysis—that affects virtually every domain using LLMs for reasoning. The epistemic blinding protocol is simple, generalizable (demonstrated in both biology and finance), and immediately actionable with open-source tooling. It establishes a new auditability principle relevant across scientific disciplines. Paper 2, while technically solid, represents an incremental advance in RLVR training methodology within a narrower scope (math/code reasoning). Paper 1's cross-disciplinary relevance and its potential to become standard practice in LLM-assisted research give it higher impact potential.

vs. AI scientists produce results without reasoning scientifically

gemini-34/29/2026

Paper 2 addresses a critical, broadly relevant issue by exposing fundamental flaws in the reasoning of highly hyped 'AI scientists'. Its comprehensive evaluation across multiple domains provides a crucial reality check that will steer future research agendas in AI, epistemology, and automated discovery. While Paper 1 offers a solid technical improvement for RLVR, Paper 2's broad critique and challenge to the current paradigm of autonomous scientific agents give it a higher potential for widespread scientific impact and paradigm-shifting influence.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.64/29/2026

IatroBench addresses a critical and timely issue—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across frontier models. It reveals a systematic, identity-contingent failure mode with direct life-or-death implications, challenging fundamental assumptions in AI safety alignment. The finding that safety filters paradoxically increase harm for vulnerable populations who have exhausted standard referrals has broad implications for AI policy, healthcare, and alignment research. Paper 1, while technically solid, represents an incremental improvement within the RLVR training paradigm. Paper 2's cross-disciplinary impact (AI safety, medicine, policy, ethics) and provocative empirical findings give it substantially greater potential impact.

vs. Emotion Concepts and their Function in a Large Language Model

gpt-5.24/29/2026

Paper 2 likely has higher impact: it introduces a broadly applicable, label-free RLVR training framework that addresses a key bottleneck (reward reliability without human labels) with a concrete algorithmic design (vote–verify decoupling, Lean verification, ResZero for stable gradients) and demonstrates strong empirical gains and transfer. This is timely for scalable reasoning training and can be adopted across verifiable domains (math, code, formal methods), influencing both academia and industry. Paper 1 is novel and alignment-relevant, but is more interpretive/model-specific and may have narrower immediate methodological reuse.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

gpt-5.24/29/2026

Paper 2 likely has higher scientific impact: it reports a first-of-its-kind, real-world deployment of AI-assisted peer review at full flagship-conference scale (22,977 papers), with empirical evidence from surveys and a new benchmark. The work is highly timely and directly affects the scientific ecosystem, with broad cross-field implications for research evaluation, conference operations, and AI governance. While Paper 1 is methodologically innovative for label-free RLVR and valuable for LLM reasoning, its immediate applications are narrower (formal-verifier domains) and its impact is more incremental within a specialized ML training paradigm.

vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules

gpt-5.24/29/2026

Paper 1 likely has higher scientific impact: it introduces a broadly applicable multimodal generative foundation model plus a new aligned dataset spanning sequence/structure/regulation/context, enabling unified prediction and constrained design across RNA and proteins—high novelty, wide real-world bioengineering/biomedicine applications, and cross-field reach. Paper 2 is a solid methodological advance for label-free RLVR with formal verification, timely for LLM training, but its impact is narrower (machine-checkable domains) and more incremental relative to existing RLVR/voting/judge frameworks. Both appear rigorous, but Paper 1’s breadth and translational potential are larger.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-34/29/2026

Paper 1 addresses a critical bottleneck in scaling LLM reasoning: the reliance on human-annotated data for RL. By proposing a label-free RLVR framework that combines majority voting with formal verification, it offers a scalable pathway for self-improving models. While Paper 2 introduces an excellent and highly practical benchmark for agent reliability, Paper 1's methodological advancement in training paradigms has the potential to fundamentally influence how future frontier models are post-trained.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

claude-opus-4.64/29/2026

Hodoscope introduces a genuinely novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a fundamental and growing concern in AI safety. It discovers a previously unknown benchmark vulnerability (Commit0), demonstrating immediate real-world impact. The conceptual contribution of formulating unsupervised monitoring by analogy to unsupervised learning is broadly applicable across AI safety, evaluation, and alignment. Paper 1, while technically solid, is a more incremental improvement within the RLVR framework, combining existing techniques (majority voting, formal verification, zero-mean rewards) in a specific domain. Paper 2's broader applicability and timeliness give it higher impact potential.

vs. AI scientists produce results without reasoning scientifically

gemini-34/29/2026

Paper 1 evaluates a foundational and highly timely issue: the epistemic validity of autonomous 'AI scientists'. Its rigorous, large-scale analysis challenges prevailing assumptions and has broad implications across AI, philosophy of science, and all scientific domains adopting AI tools. Paper 2 offers a strong algorithmic contribution to reinforcement learning, but its impact is narrower in scope compared to the field-wide wake-up call provided by Paper 1.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

claude-opus-4.64/29/2026

Paper 1 introduces a broadly applicable methodological contribution—epistemic blinding—that addresses a fundamental and underappreciated problem with LLM-assisted analysis: the invisible contamination of outputs by memorized priors. This affects virtually every domain where LLMs are used for data analysis, from biology to finance. The concept is simple, generalizable, and immediately actionable (released as open-source). Paper 2 is a solid but more incremental contribution within the specific RLVR training paradigm, combining existing ideas (majority voting, formal verification, variance-preserving rewards). Paper 1's cross-disciplinary relevance and novel framing of a pervasive trustworthiness problem give it broader potential impact.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.64/29/2026

IatroBench addresses a fundamental and timely problem—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across frontier models. Its findings (identity-contingent withholding, quantified omission harm, evaluation blind spots) have broad implications for AI policy, healthcare, model alignment, and safety research. The result that safety measures systematically harm vulnerable users who have exhausted standard referrals is provocative and actionable. JURY-RL is a solid technical contribution to RLVR training but is more incremental within the LLM reasoning optimization space. IatroBench's cross-disciplinary relevance (medicine, AI safety, policy, ethics) gives it wider impact potential.