JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR
Xinjie Chen, Biao Fu, Jing Wu, Guoxin Chen, Xinggao Liu, Dayiheng Liu, Minpeng Liao
Abstract
Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
AI Impact Assessments
(3 models)Scientific Impact Assessment: JURY-RL
1. Core Contribution
JURY-RL addresses a meaningful gap in reinforcement learning with verifiable rewards (RLVR): the dependency on human-annotated ground-truth answers. The key insight is a clean decoupling of answer proposal (via majority voting over rollouts) from reward disposal (via formal Lean verification of only the top-voted candidate). This "Votes Propose, Proofs Dispose" paradigm elegantly reduces the verification cost to a single Lean call per group while maintaining reward fidelity.
The second contribution, ResZero (Residual-Zero), is a carefully designed fallback reward for when verification is inconclusive. Rather than rewarding unverified consensus (which risks collapse) or assigning zero reward (which stalls learning), ResZero penalizes the unverified majority, redistributes a zero-mean variance-preserving signal over residual answers, and adaptively scales with majority confidence α. This is a principled design choice with formal guarantees (zero-mean property proven in Appendix A.2).
2. Methodological Rigor
Theoretical grounding. The paper provides a thorough theoretical analysis of why naive majority voting is brittle under GRPO (advantages for supporters vanish while dissenter penalties diverge as consensus strengthens), and formally proves the zero-mean property of ResZero. The three-way comparison of fallback strategies (MV/Zero/ResZero) in Appendix A.3 is informative and clearly motivates the design.
Experimental design. The evaluation is comprehensive: three backbone models (Qwen3-1.7B-Base, Llama-3.2-3B-Instruct, Qwen2.5-7B), nine benchmarks spanning math, code, instruction-following, and multi-task, and seven baselines including both label-free and supervised methods. Confidence intervals are reported. The ablation study (Table 3) is particularly well-executed, systematically testing four fallback designs under the same proof-gated framework.
Weaknesses in rigor. The Lean verifier pipeline is not perfectly precise—Table 4 shows 84.5% precision, meaning ~15.5% of positively rewarded answers may be incorrect (due to autoformalization/consistency-check errors, not Lean itself). While the paper acknowledges this, the gap between theoretical soundness ("formal verification has zero false positives") and practical reality deserves more scrutiny. The verification recall of 88% means ~12% of correct answers receive no positive reward, which could systematically bias learning against certain problem types. Additionally, the paper trains on only 7,500 MATH problems—it remains unclear how well the approach scales to larger, more diverse datasets.
3. Potential Impact
Practical significance. Removing the need for ground-truth labels in RLVR is a genuine practical advance. The framework achieves performance comparable to supervised ground-truth training (Table 1) while being label-free, which is the headline result. The superior pass@k performance and maintained diversity (Figure 2, Table 2) suggest the method produces more robust reasoning models.
Scope limitations. The approach is fundamentally restricted to machine-checkable domains—currently mathematics via Lean. Extension to other domains requires analogous formal verifiers, which may not exist or may be prohibitively expensive. The paper's code generation results are transfer performance (trained on math), not direct verification of code. This narrows the immediate applicability.
Broader influence. The "propose-then-verify" paradigm could influence how the community thinks about reward design in RLVR more generally. The ResZero fallback mechanism is a reusable contribution that applies beyond Lean verification to any setting with binary but incomplete verification signals.
4. Timeliness & Relevance
This work is highly timely. RLVR has become a central technique following DeepSeek-R1 and related work, and the community is actively seeking ways to reduce label dependency. The concurrent explosion of label-free methods (majority voting, self-certainty, entropy minimization, CoReward) creates a well-defined competitive landscape that this paper improves upon. The use of Lean for verification aligns with growing interest in formal methods for AI alignment.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
JURY-RL presents a well-engineered framework that makes a clear advance in label-free RLVR for mathematical reasoning. The combination of majority voting for proposal efficiency and formal verification for reward fidelity is sensible and well-executed. ResZero is a genuinely useful contribution. However, the domain restriction to formally verifiable settings, the gap between theoretical soundness and practical pipeline accuracy, and the modest margins on some benchmarks temper the overall impact. The work is solid engineering and systems research rather than a fundamental conceptual breakthrough.
Generated Apr 29, 2026
Comparison History (41)
Paper 2 introduces a novel, label-free reinforcement learning framework that directly improves LLM reasoning capabilities using formal verifiers. Advancements in RL-based training methodologies for reasoning currently drive significant progress in the field, giving this work broader applicability and higher potential impact than the evaluation framework presented in Paper 1.
JURY-RL addresses a fundamental and timely challenge in RLVR—eliminating the need for human-annotated labels while maintaining training stability—with a concrete, well-validated framework combining majority voting with formal verification. It demonstrates strong empirical results across multiple benchmarks and models, with clear practical applicability to LLM training pipelines. SEA-Eval introduces a useful benchmark for self-evolving agents, but benchmarks typically have narrower impact than novel training methodologies. JURY-RL's contribution to label-free reinforcement learning for reasoning has broader immediate applicability and addresses a bottleneck affecting the entire LLM training community.
JURY-RL addresses a fundamental and practical bottleneck in RLVR—the dependence on human-annotated labels—with a principled framework combining majority voting with formal verification. Its label-free approach has immediate, broad applicability to LLM training across math reasoning and code generation, with strong empirical results matching supervised baselines. The ResZero fallback mechanism is a novel technical contribution with theoretical grounding. SEA-Eval introduces a useful benchmark for self-evolving agents but is primarily an evaluation framework rather than a methodological advance, and the self-evolving agent paradigm it targets is still nascent with limited practical adoption.
JURY-RL addresses a fundamental challenge in RLVR for LLMs—removing the need for human-annotated labels while maintaining training stability. Its novel decoupling of vote-based proposal and formal verification, combined with the ResZero fallback mechanism, offers a principled label-free framework with broad applicability across math reasoning, code generation, and general benchmarks. The work is highly timely given the explosive interest in reasoning LLMs and RLHF/RLVR methods. Paper 2, while solid, addresses a narrower problem (multi-modal KG completion) with more incremental contributions combining existing ideas (retrieval-augmented generation, discrete diffusion). Paper 1's broader impact potential and timeliness give it the edge.
JURY-RL addresses a fundamental challenge in RLVR for LLMs—removing the need for human-annotated labels while maintaining training stability. Its novel decoupling of vote-based proposal from formal verification, combined with the ResZero fallback mechanism, represents a significant methodological innovation with broad applicability across reasoning, code generation, and general tasks. The work is highly timely given the rapid scaling of RLHF/RLVR methods. Paper 2, while solid, addresses the narrower domain of multi-modal knowledge graph completion with a retrieve-and-rerank framework that, though effective, represents a more incremental architectural contribution with more limited breadth of impact.
Paper 1 likely has higher scientific impact: it introduces a novel, training-time RLVR framework that materially improves LLM reasoning without labels by combining plurality voting with formal verification and a variance-preserving fallback reward, and it demonstrates strong benchmark gains plus transfer. This is methodologically substantive and broadly useful for scalable alignment/training in verifiable domains (math, code, theorem proving), aligning with a timely push toward automated, reliable RL signals. Paper 2 provides valuable diagnostics for LLM-judge evaluation reliability, but is more incremental and primarily impacts evaluation practice rather than core capability training.
Paper 2 likely has higher impact: it proposes a novel, practically useful RLVR training framework that reduces reliance on labels while preventing false-positive reward signals via formal verification and a variance-preserving fallback (ResZero). This is timely for scalable reasoning improvements and could be adopted broadly in math/proof and code domains, influencing both RL methodology and formal-verification-integrated training. Paper 1 offers valuable diagnostics for LLM-as-judge reliability, but is more evaluative/diagnostic with narrower immediate application and less direct effect on model capability scaling.
Paper 1 introduces a novel, scalable methodology for label-free reinforcement learning using formal verifiers, addressing a critical bottleneck in training advanced reasoning models. Its ability to stabilize optimization without human annotations aligns perfectly with the highly impactful trend of self-improving LLMs. While Paper 2 offers a valuable benchmark for agentic behavior, Paper 1 provides a fundamental algorithmic advancement with immediate, broad implications for model training paradigms in mathematics, code generation, and beyond.
Hodoscope introduces a genuinely novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a fundamental and growing problem in AI safety. It formulates a new problem (unsupervised monitoring), discovers real previously-unknown vulnerabilities in established benchmarks, and demonstrates practical utility with 6-23× review effort reduction. Its breadth of impact spans AI safety, evaluation, and agent deployment. While JURY-RL is a solid contribution to label-free RLVR, it is more incremental within an established research direction. Hodoscope's conceptual novelty and timeliness for AI safety give it higher potential impact.
Paper 1 addresses a fundamental and broadly applicable problem—prior contamination in LLM-assisted analysis—that affects virtually every domain using LLMs for reasoning. The epistemic blinding protocol is simple, generalizable (demonstrated in both biology and finance), and immediately actionable with open-source tooling. It establishes a new auditability principle relevant across scientific disciplines. Paper 2, while technically solid, represents an incremental advance in RLVR training methodology within a narrower scope (math/code reasoning). Paper 1's cross-disciplinary relevance and its potential to become standard practice in LLM-assisted research give it higher impact potential.
Paper 2 addresses a critical, broadly relevant issue by exposing fundamental flaws in the reasoning of highly hyped 'AI scientists'. Its comprehensive evaluation across multiple domains provides a crucial reality check that will steer future research agendas in AI, epistemology, and automated discovery. While Paper 1 offers a solid technical improvement for RLVR, Paper 2's broad critique and challenge to the current paradigm of autonomous scientific agents give it a higher potential for widespread scientific impact and paradigm-shifting influence.
IatroBench addresses a critical and timely issue—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across frontier models. It reveals a systematic, identity-contingent failure mode with direct life-or-death implications, challenging fundamental assumptions in AI safety alignment. The finding that safety filters paradoxically increase harm for vulnerable populations who have exhausted standard referrals has broad implications for AI policy, healthcare, and alignment research. Paper 1, while technically solid, represents an incremental improvement within the RLVR training paradigm. Paper 2's cross-disciplinary impact (AI safety, medicine, policy, ethics) and provocative empirical findings give it substantially greater potential impact.
Paper 2 likely has higher impact: it introduces a broadly applicable, label-free RLVR training framework that addresses a key bottleneck (reward reliability without human labels) with a concrete algorithmic design (vote–verify decoupling, Lean verification, ResZero for stable gradients) and demonstrates strong empirical gains and transfer. This is timely for scalable reasoning training and can be adopted across verifiable domains (math, code, formal methods), influencing both academia and industry. Paper 1 is novel and alignment-relevant, but is more interpretive/model-specific and may have narrower immediate methodological reuse.
Paper 2 likely has higher scientific impact: it reports a first-of-its-kind, real-world deployment of AI-assisted peer review at full flagship-conference scale (22,977 papers), with empirical evidence from surveys and a new benchmark. The work is highly timely and directly affects the scientific ecosystem, with broad cross-field implications for research evaluation, conference operations, and AI governance. While Paper 1 is methodologically innovative for label-free RLVR and valuable for LLM reasoning, its immediate applications are narrower (formal-verifier domains) and its impact is more incremental within a specialized ML training paradigm.
Paper 1 likely has higher scientific impact: it introduces a broadly applicable multimodal generative foundation model plus a new aligned dataset spanning sequence/structure/regulation/context, enabling unified prediction and constrained design across RNA and proteins—high novelty, wide real-world bioengineering/biomedicine applications, and cross-field reach. Paper 2 is a solid methodological advance for label-free RLVR with formal verification, timely for LLM training, but its impact is narrower (machine-checkable domains) and more incremental relative to existing RLVR/voting/judge frameworks. Both appear rigorous, but Paper 1’s breadth and translational potential are larger.
Paper 1 addresses a critical bottleneck in scaling LLM reasoning: the reliance on human-annotated data for RL. By proposing a label-free RLVR framework that combines majority voting with formal verification, it offers a scalable pathway for self-improving models. While Paper 2 introduces an excellent and highly practical benchmark for agent reliability, Paper 1's methodological advancement in training paradigms has the potential to fundamentally influence how future frontier models are post-trained.
Hodoscope introduces a genuinely novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a fundamental and growing concern in AI safety. It discovers a previously unknown benchmark vulnerability (Commit0), demonstrating immediate real-world impact. The conceptual contribution of formulating unsupervised monitoring by analogy to unsupervised learning is broadly applicable across AI safety, evaluation, and alignment. Paper 1, while technically solid, is a more incremental improvement within the RLVR framework, combining existing techniques (majority voting, formal verification, zero-mean rewards) in a specific domain. Paper 2's broader applicability and timeliness give it higher impact potential.
Paper 1 evaluates a foundational and highly timely issue: the epistemic validity of autonomous 'AI scientists'. Its rigorous, large-scale analysis challenges prevailing assumptions and has broad implications across AI, philosophy of science, and all scientific domains adopting AI tools. Paper 2 offers a strong algorithmic contribution to reinforcement learning, but its impact is narrower in scope compared to the field-wide wake-up call provided by Paper 1.
Paper 1 introduces a broadly applicable methodological contribution—epistemic blinding—that addresses a fundamental and underappreciated problem with LLM-assisted analysis: the invisible contamination of outputs by memorized priors. This affects virtually every domain where LLMs are used for data analysis, from biology to finance. The concept is simple, generalizable, and immediately actionable (released as open-source). Paper 2 is a solid but more incremental contribution within the specific RLVR training paradigm, combining existing ideas (majority voting, formal verification, variance-preserving rewards). Paper 1's cross-disciplinary relevance and novel framing of a pervasive trustworthiness problem give it broader potential impact.
IatroBench addresses a fundamental and timely problem—AI safety measures causing iatrogenic harm—with rigorous pre-registered methodology across frontier models. Its findings (identity-contingent withholding, quantified omission harm, evaluation blind spots) have broad implications for AI policy, healthcare, model alignment, and safety research. The result that safety measures systematically harm vulnerable users who have exhausted standard referrals is provocative and actionable. JURY-RL is a solid technical contribution to RLVR training but is more incremental within the LLM reasoning optimization space. IatroBench's cross-disciplinary relevance (medicine, AI safety, policy, ethics) gives it wider impact potential.