Bowen Ping, Xiangxin Zhou, Penghui Qi, Minnan Luo, Liefeng Bo, Tianyu Pang
Recent work has demonstrated that online reinforcement learning (RL) can substantially improve the quality and alignment of flow matching models for image and video generation. Methods such as Flow-GRPO and CPS cast the denoising process as a Markov Decision Process and apply PPO-style ratio clipping to enforce a trust region. However, we argue that ratio clipping is structurally ill-suited for flow models: the probability ratio between new and old policies is a noisy, single-sample estimate of the true policy divergence, leading to over-constraining in some regions of the trajectory and under-constraining in others. We propose Flow-DPPO (Flow Divergence Proximal Policy Optimization), which replaces ratio clipping with a divergence proximal constraint. A key observation is that the per-step policy in flow models is Gaussian, enabling exact and cheap computation of the KL divergence between old and new policies. Flow-DPPO employs an asymmetric divergence mask that blocks gradient updates only when they simultaneously move away from the trusted region and violate the divergence threshold. Experiments show that Flow-DPPO achieves higher rewards with better KL-proximal efficiency, alleviates catastrophic forgetting, promotes balanced multi-objective optimization, and enables stable multi-epoch training where ratio clipping degrades. Code and models are available at https://github.com/Tencent-Hunyuan/UniRL/tree/main/FlowDPPO.
Flow-DPPO addresses a specific and well-identified problem: PPO-style ratio clipping is a noisy proxy for true policy divergence in flow matching models used for image/video generation. The paper's key insight is that per-step policies in flow models are Gaussian with fixed variance and differing means, enabling exact, closed-form KL divergence computation at zero additional cost (since both forward passes are already required for ratio computation). This structural observation is elegant and well-motivated.
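As a minimal sketch of the closed-form computation described above, assume the old and new per-step policies are isotropic Gaussians N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I) that share the same sigma. Under that assumption the KL divergence reduces to the squared mean shift divided by 2 sigma^2. The function name, argument names, and array layout are illustrative choices, not taken from the paper's code.

```python
import numpy as np

def gaussian_step_kl(mu_old, mu_new, sigma):
    """KL between N(mu_old, sigma^2 I) and N(mu_new, sigma^2 I), per batch item.

    Assumes both policies share the same scalar sigma (hypothetical setup);
    with equal covariances the KL is symmetric in the two means.
    mu_old, mu_new: arrays of shape (batch, ...); sigma: positive float.
    """
    d = np.asarray(mu_new, dtype=np.float64) - np.asarray(mu_old, dtype=np.float64)
    # Flatten every non-batch dimension and take the squared L2 norm.
    sq_norm = (d.reshape(d.shape[0], -1) ** 2).sum(axis=1)
    return sq_norm / (2.0 * sigma ** 2)
```

Both means come from forward passes that a ratio-based method would already run, so this adds only an elementwise subtraction and a reduction.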
The proposed solution replaces ratio clipping with an asymmetric divergence mask that blocks gradient updates only when: (a) the update direction pushes the policy away from the old policy, AND (b) the exact KL divergence already exceeds a threshold δ. This preserves PPO's beneficial asymmetry—corrective updates toward the old policy are never blocked—while eliminating the noise inherent in single-sample ratio estimates.
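The two conditions above can be sketched as a boolean mask. This is a hedged illustration, not the paper's implementation: it assumes that "pushing away from the old policy" is judged as in PPO, by the sign of the advantage together with whether the sampled ratio is above or below 1. The function and argument names are hypothetical.

```python
import numpy as np

def asymmetric_divergence_mask(ratio, advantage, kl, delta):
    """Return True where the gradient is kept, False where it is blocked.

    Assumption (not confirmed by the source): an update moves away from the
    old policy when it would raise a ratio already above 1 (advantage > 0)
    or lower a ratio already below 1 (advantage < 0).
    """
    ratio = np.asarray(ratio, dtype=np.float64)
    advantage = np.asarray(advantage, dtype=np.float64)
    kl = np.asarray(kl, dtype=np.float64)
    # Condition (a): the update direction increases the deviation.
    moving_away = ((advantage > 0) & (ratio > 1.0)) | ((advantage < 0) & (ratio < 1.0))
    # Condition (b): the exact KL already exceeds the threshold.
    over_budget = kl > delta
    # Block only when both hold; corrective updates always pass.
    return ~(moving_away & over_budget)
```

In a training loop the mask would multiply the per-sample surrogate loss, so masked samples contribute zero gradient while samples that pull the policy back toward the old one are always kept.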
Theoretical foundations are solid. The paper provides a formal policy improvement bound (Theorem 2) adapted to the finite-horizon, undiscounted, terminal-reward MDP structure of flow models, with a tighter linear-in-K variant in the appendix. The connection between KL and TV divergence in the Gaussian setting (Remark 3) cleanly justifies using KL as a proxy for the TV divergence appearing in the bound.
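To illustrate why KL can stand in for TV in this setting, the sketch below uses two standard facts rather than anything reproduced from Remark 3. For two Gaussians with identical isotropic covariance, the total variation distance (taken here as the supremum of |P(A) - Q(A)| over events A) is erf(sqrt(D_KL) / 2), a monotone function of the KL. Pinsker's inequality gives the general bound TV <= sqrt(D_KL / 2). The function names are illustrative.

```python
import math

def tv_from_kl_equal_cov(kl):
    """TV distance between two equal-covariance Gaussians with the given KL.

    Uses TV = 2*Phi(||d|| / (2*sigma)) - 1 with ||d|| / sigma = sqrt(2*KL),
    which simplifies to erf(sqrt(KL) / 2).
    """
    return math.erf(math.sqrt(kl) / 2.0)

def pinsker_bound(kl):
    """Generic Pinsker upper bound on TV, valid for any pair of distributions."""
    return math.sqrt(kl / 2.0)
```

Because the mapping from KL to TV is monotone in this case, thresholding the KL at delta is equivalent to thresholding the TV at erf(sqrt(delta) / 2).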
The variance analysis of ratio clipping (Section 3.2) is particularly convincing. The decomposition log r_t = ε⊤d/σ − ||d||²/(2σ²), where d is the shift between the new and old policy means and ε is the sampled standard Gaussian noise, makes it transparent that the log-ratio has mean −D_KL and standard deviation √(2D_KL). The standard deviation exceeds the magnitude of the mean whenever D_KL < 2, so in the small-divergence regime where trust regions operate, individual ratio samples are dominated by noise. The observation that the PPO clip range [0.8, 1.2] corresponds to [−0.22, 0.18] in log-space, which is narrow relative to the typical log-ratio standard deviation, quantitatively demonstrates why spurious clipping is pervasive.
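The decomposition above implies that, for samples drawn from the old policy, the log-ratio is Gaussian with mean −D_KL and variance 2·D_KL. A small sketch under that assumption estimates how often a single sample lands outside the clip range even when the true divergence is tiny. The function names and default clip bounds are illustrative.

```python
import math

def normal_cdf(z):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_outside_clip(kl, low=0.8, high=1.2):
    """Probability that a sampled ratio falls outside [low, high].

    Assumes log r ~ N(-kl, 2*kl), as in the equal-variance Gaussian
    decomposition; requires kl > 0.
    """
    mean = -kl
    std = math.sqrt(2.0 * kl)
    below = normal_cdf((math.log(low) - mean) / std)
    above = 1.0 - normal_cdf((math.log(high) - mean) / std)
    return below + above
```

Under these assumptions, a true KL of 0.01 already puts roughly 15% of samples outside [0.8, 1.2], and the fraction grows quickly as the KL increases.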
Experimental design is comprehensive. The paper evaluates across three base models (SD3.5, FLUX2-9B, FLUX.1-dev), two sampling schedules (Flow-SDE, CPS), single- and multi-reward settings, and includes ablations on divergence threshold, asymmetric masking, multi-epoch training, CFG, and KL regularization strength. The inclusion of out-of-domain evaluation (PickScore prompts during GenEval2 training) to measure catastrophic forgetting adds credibility.
One concern is that the comparison could be more complete. Although four baselines are included, the paper does not compare against methods that directly constrain divergence in other ways (e.g., adaptive clipping schedules or explicit KL penalty terms without masking). The ablation showing training collapse without asymmetric masking (Figure 3) is important, but the mechanism behind the collapse could be better explained.
Immediate practical impact is significant. Flow matching models (FLUX, SD3) are the dominant architecture for production image generation, and RL fine-tuning is becoming standard practice for alignment. A drop-in replacement for ratio clipping that improves reward optimization, reduces catastrophic forgetting, and enables multi-epoch training addresses real engineering pain points.
Multi-epoch training stability (Figure 5) is particularly impactful for video generation, where rollouts are extremely expensive. The ability to reuse samples across multiple gradient steps without degradation could substantially reduce computational costs for video RL.
The broader principle—that Gaussian policy structure in continuous generative models enables exact divergence computation, making divergence-based trust regions strictly preferable to ratio-based ones—could influence RL fine-tuning for other continuous-action generative models (audio, 3D generation, molecular design).
This paper is extremely timely. RL fine-tuning for flow matching models is an active frontier (2025-2026), with Flow-GRPO, CPS, GRPO-Guard, and DanceGRPO all appearing within the past year. The paper correctly identifies that these methods were ported from the LLM setting without adequately adapting to the structure of the continuous action space. The observation that flow models admit exact divergence computation (unlike LLMs, where DPPO must approximate it) is a structural insight that the community needs.
Flow-DPPO makes a well-motivated, theoretically grounded contribution to RL fine-tuning of flow matching models. The core insight—exploiting Gaussian policy structure for exact divergence computation—is simple, elegant, and practically impactful. The experimental validation is thorough and the improvements are consistent. While the contribution is somewhat incremental (replacing one trust-region mechanism with another), it addresses a genuine structural mismatch and delivers measurable gains across a comprehensive evaluation matrix. The work is likely to influence standard practice in the rapidly growing area of RL for generative models.
Generated Jun 10, 2026
Paper 2 likely has higher scientific impact due to a broadly relevant methodological contribution to online RL for flow-based generative models: replacing PPO ratio clipping with an exactly computable KL-based proximal constraint leveraging Gaussian per-step policies. This addresses a known structural mismatch, improves stability (multi-epoch training), mitigates forgetting, and supports multi-objective optimization—advances applicable across image/video generation and potentially other diffusion/flow frameworks. Paper 1 is practically valuable for coding-agent UX, but its impact is more application-layer and narrower in scope compared to a generally reusable optimization/training improvement for generative modeling.
Flow-DPPO addresses a timely and high-impact problem in generative AI alignment—improving RL fine-tuning of flow matching models for image/video generation. It introduces a principled divergence-based alternative to ratio clipping that exploits the Gaussian structure of flow models, with strong empirical results showing improved reward, KL efficiency, and training stability. The direct applicability to state-of-the-art generative models (backed by Tencent's Hunyuan) gives it broad practical impact. Paper 2 provides solid theoretical contributions for asynchronous SGD with clipping, but addresses a more incremental, narrower optimization theory question with less immediate broad impact.
While Paper 1 offers a valuable technical optimization for training flow matching models in generative AI, Paper 2 addresses a fundamental challenge across all of science and engineering: discovering governing physical laws (ODEs/PDEs) when data is scarce and expensive. By enabling accurate model discovery in the ultra-low-data limit, Paper 2 has a significantly broader potential impact across multiple scientific disciplines, accelerating empirical research and physical system modeling far beyond a single subfield of machine learning.
Paper 2 likely has higher scientific impact due to timeliness and broad applicability: improving RL fine-tuning for flow/diffusion-style generative models directly affects a rapidly moving, high-impact area (image/video generation and alignment). Methodologically, it identifies a structural mismatch in PPO ratio clipping for flow models and replaces it with an analytically computable KL constraint leveraging Gaussian per-step policies—an elegant, generalizable fix with practical stability benefits and released code. Paper 1 is novel and interpretable for neural operators, but its demonstrated scope (two PDE settings) suggests narrower near-term adoption and cross-field impact.
Paper 2 likely has higher scientific impact: it addresses a central, timely interpretability concern (reproducibility/seed dependence of SAE features) with a broadly applicable measurement (per-feature stability), extensive empirical study across conditions, and a unifying geometric explanation (reproducible subspaces vs basis ambiguity) supported by a synthetic model. The insights and methodology can influence how SAEs are evaluated and used across many labs and domains. Paper 1 is a solid algorithmic refinement for RL-fine-tuning flow models with clear applications, but its impact is narrower and more incremental within a fast-moving generative-RL niche.
Paper 1 addresses a fundamental challenge in semiconductor manufacturing—a critical industry—with a novel event-driven RL framework demonstrating real-world applicability at industrial scale. Its contributions span RL theory (event-driven temporal-difference formulation), methodology, and practical validation on industry-realistic scenarios. Paper 2 offers a technically sound but incremental improvement (replacing ratio clipping with divergence constraints) for flow matching models in generative AI. While timely, it is a narrower algorithmic refinement. Paper 1's broader cross-disciplinary impact (RL + manufacturing), novelty of formulation, and potential for real-world deployment give it higher estimated scientific impact.
Paper 1 addresses the critical challenge of credit assignment in RL for reasoning models. By providing an efficient alternative to expensive Monte Carlo sampling for step-level rewards, it directly impacts the highly active development of advanced reasoning LLMs. This solves a major bottleneck in a rapidly growing, high-impact field, offering broader transformative potential compared to the more specialized algorithmic tweak for flow matching models presented in Paper 2.
Paper 1 introduces a unifying theoretical framework (Q-target) that reinterprets supervised fine-tuning as target distribution design, unifying many existing SFT variants under one lens. This has broader impact across LLM training, reasoning, and alignment—touching a much larger research community. Paper 2 offers a solid but more incremental contribution (replacing ratio clipping with divergence constraints in flow models for RL-based image/video generation). While technically sound, its scope is narrower. Paper 1's conceptual contribution opens a new design space for SFT objectives with wider applicability across domains.
Flow-DPPO addresses a practical and timely problem in RL-based fine-tuning of flow matching models for image/video generation, proposing a principled replacement for ratio clipping with exact KL divergence constraints. It has immediate real-world applications in generative AI, demonstrates concrete improvements across multiple metrics, and provides open-source code. Paper 2 makes a valuable methodological point about the gap between observational and interventional evidence in MoE pruning, but its scope is narrower (a negative/cautionary result on existing metrics) with less direct applicability to improving systems, limiting its broader impact.
Paper 1 addresses a ubiquitous bottleneck in large language model deployment by providing a mathematically rigorous algorithm for optimal post-training quantization scaling. Given the widespread demand for efficient LLM inference across nearly all AI applications, its potential for broad, immediate real-world impact and methodological improvement over standard heuristics outweighs the more specialized, though innovative, advancements in flow matching alignment presented in Paper 2.