Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang
Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.
The paper introduces DRPO (Divergence Regularized Policy Optimization), which addresses a specific limitation in trust-region methods for LLM reinforcement learning. The lineage is clear: PPO uses ratio-based clipping → DPPO replaces ratio-based clipping with divergence-based binary masking → DRPO replaces the binary mask with a smooth quadratic regularizer weighted by the behavior probability of the sampled token.
The key insight is elegant: by multiplying SPO's quadratic penalty on the importance ratio r_t = π(y_t|s_t)/µ(y_t|s_t) by the behavior probability µ(y_t|s_t), the implicit regularization shifts from a χ²-type penalty to an ℓ₂²-type penalty on absolute probability shifts. This single modification changes the trust-region geometry from ratio-based to Binary-TV-based while keeping gradients continuous. The resulting gradient weight w_t = 1 - sign(Â_t(r_t-1))|π(y_t|s_t) - µ(y_t|s_t)|/δ smoothly attenuates diverging updates, provides corrective signals beyond the trust-region boundary (w_t turns negative for tokens that have already shifted more than δ in the direction the advantage favors), and is bounded in [1-1/δ, 1+1/δ] because |π(y_t|s_t) - µ(y_t|s_t)| ≤ 1.
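The gradient weight quoted above can be illustrated with a small sketch. The function name drpo_gradient_weight and the example values are hypothetical. The hard-mask function is only an assumed reading of the "discard the gradient once the token crosses the boundary in a harmful direction" behavior described for DPPO, not the authors' code.

```python
def sign(x: float) -> int:
    """Return -1, 0, or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def drpo_gradient_weight(pi: float, mu: float, advantage: float, delta: float) -> float:
    """w_t = 1 - sign(A_t (r_t - 1)) * |pi - mu| / delta, as stated in the review.

    pi: current-policy probability of the sampled token
    mu: behavior-policy probability of the sampled token (must be > 0)
    delta: trust-region radius on the absolute probability shift
    """
    ratio = pi / mu
    direction = sign(advantage * (ratio - 1.0))  # +1 when the shift follows the advantage
    return 1.0 - direction * abs(pi - mu) / delta


def hard_mask_weight(pi: float, mu: float, advantage: float, delta: float) -> float:
    """Assumed hard-mask counterpart: zero weight once the token has moved
    more than delta in the advantage-favored direction, otherwise full weight."""
    diverging = advantage * (pi / mu - 1.0) > 0
    return 0.0 if diverging and abs(pi - mu) > delta else 1.0


if __name__ == "__main__":
    # Hypothetical values: mu = 0.5, delta = 0.2, positive advantage.
    for pi in (0.4, 0.5, 0.6, 0.7, 0.8):
        print(pi, drpo_gradient_weight(pi, 0.5, 1.0, 0.2), hard_mask_weight(pi, 0.5, 1.0, 0.2))
```

Under these assumptions the smooth weight falls linearly through zero at the boundary and turns negative beyond it, whereas the mask jumps from one to zero and supplies no signal to pull the token back.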
The theoretical analysis is sound and well-presented. The paper provides:
The experimental evaluation covers six settings spanning three Qwen base models (Qwen3-4B-Base, Qwen3-30B-A3B-Base, Qwen3.5-35B-A3B-Base) plus R1D, two precision regimes (BF16 and FP8), and both dense and MoE architectures. Evaluations use the AIME 2024 and AIME 2025 benchmarks with 16-sample averaging. The ablation studies are comprehensive, examining the role of advantage weighting |Â_t|, alternative divergence penalties, hyperparameter sensitivity, and where the corrective signal matters (inside vs. outside the trust region).
However, there are some limitations in rigor:
Practical utility: DRPO is a drop-in replacement for PPO/GRPO/DPPO clipping mechanisms. The implementation is minimal—essentially changing one line in the objective function. This low adoption barrier is significant for the LLM training community.
FP8 training stability: The demonstrated improvements in FP8 precision settings are particularly relevant as the field moves toward lower-precision training for cost efficiency. DRPO's bounded gradient weights plausibly contribute to its robustness against the increased numerical noise of low-precision regimes.
Conceptual contribution: The paper's gradient-centered view of regularizer design—arguing that the induced gradient form matters more than the nominal divergence—is a valuable perspective. The three practical criteria identified (stable boundary aligned with distributional shift, bounded gradient weights, smooth corrective signals) provide a useful design framework.
Limitations of impact: The contribution is incremental in nature. DRPO builds directly on DPPO and SPO with a single modification (the µ(y_t|s_t) factor). While well-motivated, this represents a refinement rather than a paradigm shift.
This paper is highly timely. LLM RL training (especially for reasoning models) is a dominant research direction in 2025-2026. The specific problems addressed—training-inference mismatch, policy staleness, FP8 precision challenges—are active pain points in production LLM training systems. The paper directly builds on very recent work (DPPO from 2026, DAPO, GRPO) and addresses known failure modes in current practice.
The paper's Appendix C analysis of why KL and TV penalties fail from a gradient perspective is genuinely insightful and could influence future regularizer design beyond this specific method. The contrast between ℓ₂² and χ² implicit regularization provides a clean theoretical distinction. The code availability through Tencent's UniRL framework enhances reproducibility.
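The ℓ₂² versus χ² distinction can be checked numerically on a toy vocabulary. The sketch below assumes the ratio penalty is (r - 1)² with constants dropped and takes the expectation over tokens sampled from µ. The function name and the three-token distributions are hypothetical and chosen only to mimic a long-tailed vocabulary.

```python
import math


def expected_penalties(pi: list[float], mu: list[float]) -> tuple[float, float]:
    """Exact expectations under y ~ mu of two per-token penalties.

    Ratio penalty (r - 1)^2       -> sum_y (pi_y - mu_y)^2 / mu_y  (chi-square type)
    mu-weighted mu * (r - 1)^2    -> sum_y (pi_y - mu_y)^2         (squared l2 type)
    """
    ratio_penalty = sum(m * (p / m - 1.0) ** 2 for p, m in zip(pi, mu))
    weighted_penalty = sum(m * m * (p / m - 1.0) ** 2 for p, m in zip(pi, mu))
    return ratio_penalty, weighted_penalty


if __name__ == "__main__":
    # Hypothetical long-tailed example: the last token is rare under mu.
    mu = [0.9, 0.099, 0.001]
    pi = [0.9, 0.090, 0.010]
    chi2_like, l2_like = expected_penalties(pi, mu)
    print("ratio penalty (chi-square type):", chi2_like)
    print("mu-weighted penalty (squared l2 type):", l2_like)
    # Sanity check against the closed forms.
    assert math.isclose(chi2_like, sum((p - m) ** 2 / m for p, m in zip(pi, mu)))
    assert math.isclose(l2_like, sum((p - m) ** 2 for p, m in zip(pi, mu)))
```

In this toy case both tail tokens move by the same absolute amount, yet the ratio penalty is dominated by the rare token, while the µ-weighted penalty treats the two shifts equally. This is the sense in which the importance ratio can be a poor proxy for distributional shift.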
Generated Jun 9, 2026
Paper 2 has higher potential impact because it targets a broadly used, timely bottleneck—stable and efficient RL post-training of LLMs—affecting many models, tasks, and labs. The proposed DRPO is a principled modification (smooth divergence regularization vs hard masking) that can plug into existing RL pipelines and may generalize across architectures and training regimes, increasing breadth of impact. Paper 1 is a neat, lightweight inference-speedup fix with clear practical value, but reported gains are modest and narrower in scope (specific to MTP decoding setups), limiting cross-field reach.
Paper 2 (DRPO) addresses a more fundamental and broadly applicable problem in LLM RL—trust-region control and divergence regularization—that affects all RL-based post-training methods (PPO, GRPO, etc.). Its smooth regularizer replacing hard masks is a principled theoretical contribution with wide applicability beyond math reasoning. Paper 1 (N-GRPO) offers a useful but narrower contribution focused on exploration diversity during rollouts in math reasoning. DRPO's impact spans more use cases, model scales, and architectures, making it more likely to influence the broader LLM training community.
Paper 1 provides a foundational critique of current interpretability and pruning methodologies in MoE models, challenging widespread assumptions about observational metrics and causal importance. This paradigm-shifting insight has broader implications for how model interventions are evaluated. Paper 2, while offering a practical algorithmic improvement for LLM RL, represents a more incremental optimization over existing techniques.
Paper 1 addresses the critical bottleneck of rollout efficiency and reward contrast in agentic RL with verifiable rewards. Given the recent industry shift towards scaling inference-time compute and tree-search methods for reasoning LLMs, this unified rollout allocation framework offers high immediate relevance and potential to influence how future agentic models are trained. While Paper 2 offers solid algorithmic improvements to RL stabilization, Paper 1's focus on multi-turn ReAct-style tree structures aligns perfectly with the bleeding edge of AI research, promising a broader real-world impact.
Paper 1 introduces a broadly applicable tool (Express) that addresses multiple fundamental bottlenecks in language model inference and training—long-context prefill, KV cache compression, and memory/compute-constrained decoding. It combines theoretical guarantees with practical implementation (Triton kernels, speedups over FlashAttention 2), making it impactful across both theory and systems. Paper 2 offers an incremental improvement to RL fine-tuning (replacing hard masks with smooth regularization), which, while useful, addresses a narrower problem with more limited potential for cross-field impact.
Paper 2 introduces a concrete algorithmic improvement (DRPO) to a timely, high-impact problem in LLM post-training RL, addressing a specific limitation of divergence-masked approaches with a principled smooth regularizer and reporting empirical gains across settings. This combination of methodological novelty, direct applicability to widely used RLHF-style pipelines, and potential to influence future RL optimization practice suggests higher impact. Paper 1 is a useful unifying survey, but surveys typically have less scientific impact than a new, validated method unless they set a dominant taxonomy or agenda.
Paper 2 likely has higher impact: it targets RL post-training for LLMs, a highly timely and widely used component of modern AI systems, so improvements can propagate across many models and applications. The proposed shift from hard masking (DPPO) to a smooth quadratic divergence regularizer is a clear methodological refinement with broad applicability to off-policy instability and trust-region control. Its evaluation across scales/architectures/precision suggests stronger generality. Paper 1 is novel and useful for multimodal federated graph settings, but the niche “client-level modality deficiency” scenario is narrower in reach.
PBSD addresses a more fundamental and broadly impactful problem—credit assignment in long-horizon RL for LLM agents—which is a critical bottleneck as agents tackle increasingly complex multi-step tasks. Its Bayesian framework for converting sparse outcome rewards into turn-level credit signals is more novel and theoretically elegant than DRPO's incremental improvement over existing trust-region methods (replacing a hard mask with a smooth regularizer). PBSD's applicability to multi-turn agentic settings and demonstrated generalization from short to long contexts suggest broader impact across the rapidly growing field of LLM agents.
Paper 2 is a broad, problem-organizing Review that introduces unifying frameworks (discoverability phase diagram; REO abstraction) for data-driven differential equation discovery across many physical systems. Its potential impact is wide across physics, engineering, and scientific ML, shaping how researchers frame problems, compare methods, and identify open challenges—often leading to high citation and cross-field uptake. Paper 1 is a solid, timely algorithmic improvement for LLM RL stability, but it is narrower in scope and likely incremental within a fast-moving subarea where methods can be quickly superseded.
Paper 2 likely has higher scientific impact due to its timeliness and broad relevance: improving RL post-training for LLMs is a central, fast-moving area with immediate applicability across many models and downstream tasks. The proposed DRPO addresses a widely used trust-region mechanism limitation (hard masking) with a smoother, theoretically motivated regularizer and reports benefits across scales/architectures/precision—suggesting robustness and wide adoption potential. Paper 1 is solid and novel within sports analytics, but its domain is narrower and data-specific, limiting cross-field impact compared to LLM RL methodology advances.