Boshu Lei, Kostas Daniilidis, Antonio Loquercio
We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow-matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name RLDT, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). It then fine-tunes a pretrained flow-matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.
This paper introduces RLDT, an online RL algorithm for fine-tuning flow-matching policies that recasts policy improvement as a probability transport problem on the action manifold. The key conceptual insight is that flow-matching models already learn transport fields from noise to action distributions, making it natural to define RL updates as additional transport toward high-reward regions. The method uses Stein Variational Gradient Descent (SVGD) to construct a transport field ϕ*(a) that drives action samples toward the optimal policy density defined by a maximum-entropy RL objective. A critical technical contribution is the use of expected-target estimation (Eq. 8) to map intermediate denoising steps onto the action manifold, enabling gradient propagation without backpropagation through the full ODE chain.
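To make the SVGD transport field concrete, here is a minimal NumPy sketch of the standard SVGD update applied to action particles under a target density proportional to exp(Q(a)/α). It is an illustration under stated assumptions, not the paper's implementation: the toy quadratic critic, the fixed RBF bandwidth `h`, the temperature `alpha`, and all function names are hypothetical.

```python
import numpy as np

def svgd_transport_field(actions, grad_q, alpha=1.0, h=1.0):
    """Standard SVGD field for K particles, target density ∝ exp(Q(a) / alpha).

    actions: (K, d) array of action particles.
    grad_q:  callable returning dQ/da with shape (K, d) (toy critic, assumed).
    """
    K = actions.shape[0]
    diff = actions[:, None, :] - actions[None, :, :]      # diff[j, i] = a_j - a_i
    k = np.exp(-(diff ** 2).sum(-1) / (2.0 * h ** 2))     # RBF kernel k(a_j, a_i)
    score = grad_q(actions) / alpha                       # grad of log target density
    drive = k.T @ score / K                               # kernel-smoothed ascent on Q
    # Sum over j of grad_{a_j} k(a_j, a_i): pushes particle i away from its neighbours.
    repulse = -(k[:, :, None] * diff).sum(axis=0) / (h ** 2 * K)
    return drive + repulse

# Toy critic Q(a) = -||a - center||^2, so dQ/da = -2 (a - center).
center = np.array([1.0, -0.5])

def grad_q(a):
    return -2.0 * (a - center)

rng = np.random.default_rng(0)
actions = rng.normal(size=(8, 2))                         # K = 8 particles, 2-D actions
for _ in range(200):
    actions = actions + 0.05 * svgd_transport_field(actions, grad_q)
```

In this toy setting the particles drift toward the high-Q region around `center` while the repulsive term keeps them from coinciding, which is the behaviour the review attributes to the transport field.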
The paper addresses two concrete problems: (1) the intractability of log-likelihood computation for flow-matching policies, which plagues on-policy methods like DPPO and FPO++; and (2) gradient instability from backpropagation through multi-step denoising, which affects off-policy methods. RLDT sidesteps both by operating entirely in sample space via SVGD and using expected-target predictions.
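The expected-target idea can be sketched as follows. Eq. 8 is not reproduced in this review, so the snippet assumes the common linear interpolation path x_t = (1 - t)·noise + t·action with t = 0 at noise and t = 1 at the action; under that assumption a velocity prediction at an intermediate step maps to an action estimate in one shot, with no need to differentiate through the remaining ODE steps. The function names and the toy velocity field are hypothetical.

```python
import numpy as np

def expected_target(x_t, t, velocity_fn):
    """One-shot action estimate from an intermediate sample.

    Assumes the linear path x_t = (1 - t) * noise + t * action, for which
    action = x_t + (1 - t) * (action - noise).
    """
    return x_t + (1.0 - t) * velocity_fn(x_t, t)

# Toy check: if every sample flows to a single action a_star, the
# conditional-mean velocity at (x_t, t) is (a_star - x_t) / (1 - t).
a_star = np.array([0.3, -0.7])

def toy_velocity(x_t, t):
    return (a_star - x_t) / (1.0 - t)

rng = np.random.default_rng(0)
noise = rng.normal(size=(4, 2))
t = 0.4
x_t = (1.0 - t) * noise + t * a_star      # intermediate denoising states
a_hat = expected_target(x_t, t, toy_velocity)
```

With a learned velocity network in place of `toy_velocity`, `a_hat` depends on the parameters through a single network evaluation, which is why gradients of an action-space loss stay short and well-behaved.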
The theoretical derivation is mostly sound, progressing logically from the maximum-entropy RL objective through SVGD to the final loss function. The use of SVGD to avoid density estimation is well-motivated, and the connection between the transport field formulation and flow-matching is elegant. The derivation of the parameter update ξ* (Eq. 11-12) via chain rule arguments is clean, though the approximation from exact matrix inversion to a single gradient step (Eq. 12) introduces an unquantified approximation error.
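The kind of approximation at issue can be illustrated generically. Eqs. 11-12 are not reproduced in this review, so the sketch below does not implement them; it assumes a toy linear generator with a random Jacobian `J` (hypothetical) and compares the least-squares parameter change that best reproduces a desired action-space displacement `phi` with a single gradient-style step along `J.T @ phi`.

```python
import numpy as np

rng = np.random.default_rng(0)
J = rng.normal(size=(16, 5))     # hypothetical Jacobian: 16 action coords, 5 params
phi = rng.normal(size=16)        # desired action-space displacement (toy values)

# "Exact" parameter change: least-squares solution of J xi = phi,
# which involves (J^T J)^{-1} implicitly.
xi_exact, *_ = np.linalg.lstsq(J, phi, rcond=None)

# Single gradient step direction: no matrix inversion.
xi_grad = J.T @ phi

# Give the gradient direction its best possible scalar step size.
Jg = J @ xi_grad
eta = (Jg @ phi) / (Jg @ Jg)

res_exact = np.linalg.norm(J @ xi_exact - phi)
res_grad = np.linalg.norm(eta * Jg - phi)
cosine = (xi_exact @ xi_grad) / (np.linalg.norm(xi_exact) * np.linalg.norm(xi_grad))
```

The two directions always have positive cosine similarity, but the gradient step leaves a residual at least as large as the least-squares one even with an optimal step size. How large that gap is for a real policy network is the quantity the paper leaves uncharacterized.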
The experimental evaluation covers three benchmark settings of increasing complexity: OpenAI Gym (dense rewards), FurnitureBench (sparse rewards, state-based, long-horizon), and Robomimic (sparse rewards, vision-based). This breadth is a strength. The comparison against DPPO, ReinFlow, FPO++, and QAM provides reasonable coverage of the landscape, though some baselines (QAM) could not be tuned for sparse-reward tasks, limiting comparisons there.
However, there are methodological concerns. The base policies for RLDT are pretrained with flow-matching objectives while those for DPPO use DDPM. Although the architectures are matched, the pretrained policies may differ in quality, creating a confound. The paper acknowledges this but does not control for it with ablations starting from the same pretrained model. The gradient norm analysis (Fig. 4) is informative but is shown for only one environment (Hopper). The ablation on kernel functions (Fig. 5) reveals that the RBF kernel's advantage appears primarily in sparse-reward settings, suggesting that the benefit of the SVGD repulsive term is situation-dependent rather than universal.
The work is highly relevant to the robotics and VLA community, where flow-matching policies are becoming the dominant paradigm (π0, π0.5, SmolVLA). A principled RL fine-tuning method for these policies addresses a genuine practical need: adapting pretrained foundation models to specific downstream tasks with reward feedback.
The transport formulation could inspire follow-up work in related areas such as text-to-image generation alignment (RLHF for flow models), protein design, or any domain using flow-matching generative models. The SVGD-based approach that bypasses density estimation could also transfer to diffusion policy fine-tuning more broadly.
However, the computational overhead of requiring K=8 parallel particles for SVGD, plus large numbers of parallel environments (100-1000), limits near-term applicability to real-robot RL. The authors acknowledge this and suggest simulation-to-real transfer as the practical pathway.
The paper is extremely timely. Flow-matching policies are rapidly becoming standard in robotics (π0, π0.5 from Physical Intelligence), and the question of how to fine-tune them with RL is an active research frontier. The paper directly competes with several concurrent/recent works (DPPO, FPO++, QAM, ReinFlow, SACFlow) published in 2025-2026 venues, indicating a highly competitive area where incremental improvements matter.
The sparse-reward, long-horizon evaluation on FurnitureBench and vision-based Robomimic is particularly relevant, as most prior work only evaluates on dense-reward Gym tasks. The Lamp-Med result (70% vs DPPO's 30%) is notable.
Notable observation: The finding that RLDT-Delta (without repulsive forces) does not suffer mode collapse yet still underperforms on sparse tasks is interesting and somewhat surprising. The explanation, that Q-function evolution prevents particle collapse, deserves more rigorous investigation.
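For contrast, the sketch below shows what happens with a fixed target, where nothing but the kernel can keep particles apart. It assumes a static toy Gaussian target (score = -a), a fixed RBF bandwidth, and a "delta" kernel that only couples each particle to itself; these choices and all names are hypothetical and do not reproduce the paper's RLDT-Delta variant exactly.

```python
import numpy as np

def svgd_step(a, score_fn, kernel, step=0.1, h=1.0):
    """One SVGD update with either an RBF kernel or a self-only 'delta' kernel."""
    K = a.shape[0]
    score = score_fn(a)
    if kernel == "delta":
        # k(a_j, a_i) is nonzero only for j == i: no cross-particle terms,
        # so the repulsive force vanishes and each particle just climbs the score.
        return a + step * score / K
    diff = a[:, None, :] - a[None, :, :]                  # diff[j, i] = a_j - a_i
    k = np.exp(-(diff ** 2).sum(-1) / (2.0 * h ** 2))
    drive = k.T @ score / K
    repulse = -(k[:, :, None] * diff).sum(axis=0) / (h ** 2 * K)
    return a + step * (drive + repulse)

def gaussian_score(a):
    return -a                                             # fixed target: N(0, I)

def final_spread(kernel, steps=500):
    a = np.random.default_rng(0).normal(size=(8, 2))      # K = 8 particles
    for _ in range(steps):
        a = svgd_step(a, gaussian_score, kernel)
    return a.std(axis=0).mean()

spread_rbf = final_spread("rbf")
spread_delta = final_spread("delta")
```

With a static target the delta-kernel particles contract onto the mode while the RBF particles stay spread out. That RLDT-Delta reportedly does not collapse therefore points to something outside the kernel, such as the changing Q-function, doing the work, which is why a controlled test of that explanation would be valuable.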
RLDT presents a conceptually clean and practically effective method for an important and timely problem. The transport perspective is natural for flow-matching policies and yields concrete advantages over density-based and adjoint-based alternatives. While the experimental evaluation could be strengthened with better-controlled ablations and more seeds, the breadth of tasks and consistent improvements over baselines make a compelling case. This is a solid contribution to an active research area, likely to influence subsequent work on RL fine-tuning of flow-matching foundation models.
Generated Jun 9, 2026
Paper 2 addresses a critical bottleneck in the current AI landscape: the high inference cost and latency of Large Reasoning Models. By proposing hardware-aware optimizations for the emerging NVFP4 standard, it offers immediate, widespread practical applications for deploying large language models. While Paper 1 introduces a mathematically elegant approach to continuous control, Paper 2's focus on LLM efficiency, hardware-software co-design, and latency-critical decoding ensures broader and more immediate impact across both academia and industry.
Paper 2 likely has higher scientific impact due to greater conceptual novelty and broader applicability: it proposes a principled link between max-entropy RL policy improvement and density transport for flow-matching policies, using SVGD and a stabilization technique for multi-step generators. This can influence RL, generative modeling, and robotics, with timely relevance as diffusion/flow policies become common in control. Paper 1 is highly valuable engineering (a fused kernel) with strong speedups and ANN benefits, but its impact is more specialized to GPU systems and clustering/IVF pipelines.
Paper 2 (ERBench) likely has higher scientific impact because it provides a broadly useful benchmark and test suite for equation discovery/symbolic regression, enabling standardized, rigorous, and reproducible evaluation across algorithms and settings (noise, dimensionality, sampling regimes). Benchmarks often become community infrastructure with long-lasting, cross-field influence (ML, physics, chemistry, systems biology). Paper 1 is novel and timely within RL + generative policy optimization, with strong robotics relevance, but its impact is narrower (specialized to flow-matching policies and continuous control) and may be superseded faster by evolving RL methods.
Paper 2 addresses the timely and high-impact problem of fine-tuning flow-matching generative models with RL for continuous control, bridging two rapidly growing fields (flow matching and RL). It offers a novel density transport perspective using SVGD, provides practical solutions for backpropagation challenges, and demonstrates results across diverse robotics tasks including vision-based manipulation. Paper 1 is a technically rigorous but incremental improvement to local hypergraph diffusion methods, serving a narrower community. Paper 2's broader applicability to robotics, generative modeling, and RL gives it significantly wider potential impact.
Paper 1 addresses a fundamental challenge in generative AI and RL by enabling efficient fine-tuning of multi-step flow-matching policies for continuous control. Its algorithmic innovations, using density transport and SVGD, have broad applicability across robotics and autonomous systems. Paper 2 offers a valuable but more specialized contribution to latent-space Bayesian optimization for molecular design, making Paper 1's foundational methodological advancements likely to have a wider and more significant impact across the AI community.
Paper 1 presents a novel and technically rigorous approach to fine-tuning flow-matching policies with RL, addressing a timely problem at the intersection of generative models and reinforcement learning. The density transport perspective is innovative, and the method applies broadly across continuous control and robotics tasks. Paper 2 addresses an important clinical problem but relies on relatively incremental methodological contributions (combining transition-based and sequence-based models) applied to a single dataset (ADNI). Paper 1 has broader methodological impact potential across robotics, generative modeling, and RL communities.
Paper 2 has higher estimated impact due to broader cross-field relevance and conceptual unification. It clarifies an underspecified area (concept alignment) by defining axes/properties, mapping existing methods to guarantees, and providing a benchmark (InterVenchA) plus a method (CoSAE) with practical guidance (tiny paired data). This combination of theory, measurement infrastructure, and actionable findings can influence interpretability, multimodal learning, neuroscience-style RSA, and evaluation practices. Paper 1 is technically strong and timely for diffusion/flow-based RL, but its impact is likely narrower to continuous-control RL and generative policy optimization.
Paper 2 likely has higher scientific impact due to its timeliness and broad relevance to modern ML deployment: it exposes a new, practical vulnerability in offline bandit-based evaluation pipelines widely used for ranking generative models/LLMs, especially via public reward models. It combines theory (high-dimensional scaling laws for attack norm) with empirical validation on real Hugging Face evaluators, which can influence security practices, evaluation standards, and policy across multiple fields. Paper 1 is novel and useful within continuous-control RL, but its impact is narrower and more method-specific.
Paper 1 introduces a paradigm shift in longitudinal causal inference by leveraging prior-fitted networks for zero-shot counterfactual prediction. This has profound implications for high-stakes fields like healthcare, addressing critical challenges such as limited data and time-varying confounding without requiring domain-specific retraining. While Paper 2 offers a strong algorithmic advance in RL for robotics, Paper 1's foundational approach to causal AI promises broader interdisciplinary impact and tackles a more fundamental, ubiquitous problem in observational data analysis.
Paper 2 likely has higher scientific impact. It extends a broadly applicable alignment technique (consistency training) with new internal targets (MLP and attention) and evaluates across multiple timely, high-stakes safety threats, with evidence of cross-threat generalization and mechanistic insights. This is methodologically and societally relevant to a wide swath of transformer-based AI, affecting safety, robustness, and interpretability across fields and applications. Paper 1 is novel and useful for continuous-control RL with diffusion/flow policies, but its impact is narrower to robotics/control and specialized generative-policy training.