Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine

Jun 9, 2026 · arXiv:2606.11087v1

cs.LG · cs.AI

#426 of 5669 · cs.LG

Tournament Score: 1511 ± 45 (scale: 1050 to 1750)

Win Rate: 85%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Abstract

Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning"

1. Core Contribution

QGF proposes a simple yet effective mechanism for policy improvement at test time in offline RL with flow-based policies. The key insight is to avoid both (a) taking Q-function gradients at intermediate noisy actions (out-of-distribution for the critic) and (b) expensive backpropagation through the entire denoising chain. Instead, QGF uses a single-step Euler integration to approximate the denoised action from any intermediate step, then takes the critic gradient at this approximated clean action. Crucially, the Jacobian of the mapping is replaced with the identity matrix—a seemingly crude approximation that empirically proves beneficial due to lower variance.

The method cleanly decouples policy training (standard behavioral cloning via flow matching) from value learning (IQL or other TD methods), performing all reward-seeking optimization at inference time. This sidesteps the notorious instability of actor-critic training with iterative generative models.
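The guidance step described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the reference policy is a closed-form Gaussian flow rather than a trained network, the critic is a hand-written quadratic, the state is omitted, and the names `ref_velocity`, `critic_grad`, `sample_actions`, and `guidance_weight` are hypothetical. The additive form and scaling of the guidance term are also assumptions, since the review does not give the exact update.

```python
import numpy as np

MU = np.array([0.0, 0.0])      # mean of the toy reference (BC) action distribution
SIGMA = 0.5                    # its standard deviation
A_STAR = np.array([1.0, 1.0])  # action preferred by the toy critic


def ref_velocity(x, t):
    """Closed-form flow-matching velocity carrying N(0, I) at t=0 to N(MU, SIGMA^2 I) at t=1."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    gain = (t * SIGMA**2 - (1.0 - t)) / var_t
    return MU + gain * (x - t * MU)


def critic(a):
    """Toy critic Q(a) = -||a - A_STAR||^2 (state omitted for brevity)."""
    return -np.sum((a - A_STAR) ** 2, axis=-1)


def critic_grad(a):
    """Analytic gradient of the toy critic with respect to the action."""
    return -2.0 * (a - A_STAR)


def sample_actions(noise, guidance_weight, num_steps=10):
    """Euler-integrate the reference flow, adding critic guidance at every step."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = ref_velocity(x, t)
        # One-step Euler estimate of the clean action from the intermediate point.
        a_hat = x + (1.0 - t) * v
        # Critic gradient at the estimated clean action; the Jacobian
        # d(a_hat)/dx is treated as the identity (assumed form of the update).
        g = critic_grad(a_hat)
        x = x + dt * (v + guidance_weight * g)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((512, 2))
base = sample_actions(noise, guidance_weight=0.0)    # plain reference policy
guided = sample_actions(noise, guidance_weight=0.5)  # test-time guided policy
print(critic(base).mean(), critic(guided).mean())
```

Setting `guidance_weight` to zero recovers the reference policy, so in this sketch the strength of policy improvement is a sampling-time knob and no policy parameters are updated.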

2. Methodological Rigor

The paper provides a solid theoretical motivation grounded in KL-regularized RL and the connection between flow matching and score functions. The derivation from Eq. (3) through Eq. (9) is clearly presented and well-motivated.
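For background, the standard KL-regularized objective and its well-known closed-form solution are sketched below. The notation is an assumption (temperature \beta, reference policy \pi_ref, noise at t=0 and data at t=1 with a standard normal source) and may not match the paper's Eqs. (3) through (9).

```latex
% KL-regularized policy improvement against a reference (BC) policy
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[Q(s,a)\big]
          - \beta\, D_{\mathrm{KL}}\big(\pi(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)

% Closed-form optimum and its score (gradient of the log-density in a)
\pi^{*}(a \mid s) \propto \pi_{\mathrm{ref}}(a \mid s)\,\exp\!\big(Q(s,a)/\beta\big)
\quad\Longrightarrow\quad
\nabla_a \log \pi^{*}(a \mid s)
  = \nabla_a \log \pi_{\mathrm{ref}}(a \mid s) + \tfrac{1}{\beta}\,\nabla_a Q(s,a)

% Velocity-score relation for the linear path x_t = (1-t)\,x_0 + t\,x_1,\; x_0 \sim \mathcal{N}(0, I)
v(x_t, t) = \frac{x_t + (1-t)\,\nabla_{x_t} \log p_t(x_t)}{t}
```

The score decomposition in the second block is exact only for clean actions. Carrying it over to intermediate noise levels is an approximation, which is why the point at which the critic gradient is evaluated matters.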

The experimental evaluation is thorough:

Benchmarks: 7 OGBench environments × 5 tasks each for single-task, plus 5 challenging goal-conditioned environments

Seeds: 10 seeds for main results, 4+ for ablations, with 95% confidence intervals

Baselines: Comprehensive comparison against 5 training-time and 6 test-time methods, all using the same critic for fairness

Ablations: Extensive analysis of gradient estimator variants (QGF-Jacobian, QGF-chain, QGF-Distill, QGF-Regularized, QGF-Ortho), sensitivity to guidance weight, scaling behavior, and different critic types
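As a rough illustration of how a mean with a 95% confidence interval over seeds can be reported, the sketch below uses a percentile bootstrap. The review does not say which interval estimator the paper uses, so the bootstrap is an assumption, `bootstrap_ci` is a hypothetical helper, and the scores are synthetic placeholders rather than results from the paper.

```python
import numpy as np


def bootstrap_ci(per_seed_scores, level=0.95, num_resamples=10_000, seed=0):
    """Percentile-bootstrap confidence interval for the mean over seeds."""
    scores = np.asarray(per_seed_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample seeds with replacement and record the mean of each resample.
    idx = rng.integers(0, len(scores), size=(num_resamples, len(scores)))
    resampled_means = scores[idx].mean(axis=1)
    alpha = 1.0 - level
    lower, upper = np.quantile(resampled_means, [alpha / 2, 1.0 - alpha / 2])
    return scores.mean(), lower, upper


# Synthetic per-seed success rates for 10 seeds (placeholders, not paper results).
synthetic_scores = np.random.default_rng(1).uniform(0.6, 0.8, size=10)
mean, lower, upper = bootstrap_ci(synthetic_scores)
print(f"mean={mean:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
```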

The noise sensitivity analysis (Fig. 3, cosine similarity metric) and the Q-value optimization analysis (Fig. 4) provide good mechanistic understanding. The 1D illustrative example (Fig. 2) effectively demonstrates the OOD gradient bias problem.
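For reference, cosine similarity measures how well two vectors agree in direction, independent of their magnitudes. The review does not state which vectors Fig. 3 compares, so the sketch below assumes a gradient estimate is scored against some reference gradient. The function name and the example vectors are hypothetical.

```python
import numpy as np


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two vectors; eps guards against zero norms."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


# Hypothetical vectors: a reference gradient and a perturbed estimate of it.
reference_grad = np.array([1.0, 0.0, 0.0])
estimated_grad = np.array([0.8, 0.6, 0.0])
print(cosine_similarity(reference_grad, estimated_grad))
```

A value near 1 means the estimate points in nearly the same direction as the reference, 0 means the two are orthogonal, and negative values mean the estimate points away from it.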

However, there are methodological concerns. The claim that dropping the Jacobian is "better" rather than just "simpler" deserves more theoretical scrutiny. The paper acknowledges this is an approximation but frames the empirical advantage as somewhat surprising without fully explaining why. The connection to prior work on approximate gradients (random feedback alignment, etc.) is mentioned but not deeply developed. Additionally, the offline RL setting is the sole evaluation domain—no online RL or real-robot experiments are included.

3. Potential Impact

Practical significance: QGF offers a compelling practical workflow: train a flow policy with stable BC, train a critic separately, then compose them at test time. This modularity is attractive for real-world robotics where:

Policy architectures can be scaled without worrying about actor-critic coupling

Different critics can be swapped without retraining the policy (demonstrated in Section 6.5)

Guidance strength can be tuned at deployment time

Computational efficiency: QGF requires only one additional forward pass through the critic per denoising step, making it orders of magnitude cheaper than BFN (N=16) while achieving comparable performance (Fig. 6-7).

Scaling properties: The favorable scaling with model size (Fig. 9) addresses a genuine bottleneck—actor-critic methods often degrade with larger networks due to optimization instability, while QGF's supervised training loss scales predictably.

Broader implications: This work contributes to the growing paradigm of "test-time compute" in decision-making, analogous to developments in LLMs. The idea of separating capability (policy) from optimization (test-time guidance) could influence how robotics foundation models are deployed.

4. Timeliness & Relevance

This paper is highly timely. Flow and diffusion models are rapidly becoming the dominant policy class for robotic manipulation, yet incorporating them into RL remains challenging. The tension between the stability of supervised pretraining and the instability of RL fine-tuning is a current bottleneck that multiple groups are attacking simultaneously (FQL, QAM, EDP, DAC—all 2024-2025 papers). QGF offers an orthogonal and arguably simpler solution.

The test-time compute paradigm is also trending across ML, from language models to image generation. Positioning RL policy improvement as a test-time problem rather than a training-time problem is conceptually aligned with these broader trends.

5. Strengths & Limitations

Key Strengths:

Simplicity: The method is easy to implement (Algorithm 1 is ~5 lines) and requires no special training objectives

Modularity: Complete decoupling of policy and value training enables independent scaling and swapping

Strong empirical results: Competitive with or better than SOTA training-time methods while being cheaper

Comprehensive ablations: The paper is unusually thorough in analyzing variants and understanding why the method works

Code availability

Notable Limitations:

Evaluation is limited to simulated offline RL on OGBench; no real-robot experiments or online RL settings

The first-order Euler approximation quality degrades in early denoising steps (acknowledged but not deeply analyzed)

Theoretical justification for dropping the Jacobian is incomplete—empirical evidence is strong but the mechanism is unclear

The method still requires a well-trained critic, and critic quality is the primary bottleneck (acknowledged via Fig. 10)

Hyperparameter tuning of guidance weight per domain is still needed, though this is at least tunable without retraining

Limited to continuous action spaces with flow/diffusion policies

The paper does not explore sparse reward settings extensively (only briefly mentioned for p45/p46)

Comparison to closest prior work: The advantage over QFQL (OOD gradient) and BPTT is clearly demonstrated. The relationship to EDP is nuanced—both use Euler approximation but QGF operates at test time, making it strictly more flexible.

Summary

QGF makes a convincing case that test-time gradient guidance is a practical alternative to training-time policy optimization for flow-based RL policies. The contribution is primarily empirical and algorithmic rather than theoretical, but the extensive experiments and ablations substantiate the claims well. The work addresses a genuine current need and offers a simple, modular solution that could see adoption in robotics pipelines using generative policy models.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Generated Jun 10, 2026

Comparison History (20)

Won vs. nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

Paper 2 (QGF) addresses a fundamental challenge in RL—stable policy improvement with expressive generative models—by proposing test-time gradient guidance that decouples supervised training from policy optimization. This has broad real-world robotics applications, offers practical scalability advantages, and bridges imitation learning with RL in a novel way. Paper 1 (nD-RoPE) is a solid theoretical contribution generalizing position embeddings, but it is more incremental in scope, extending existing RoPE methodology. Paper 2's potential to influence both the RL and robot learning communities, combined with its practical benefits and timeliness amid the scaling of diffusion/flow policies, gives it higher impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Paper 1 offers higher potential scientific impact due to its immediate applicability to a critical bottleneck in modern AI: LLM inference latency. By elegantly identifying and solving the head-backbone competition in multi-token prediction with a drastically simplified, parameter-efficient layer (CLP), it provides a zero-loss speedup for widely used models. While Paper 2 presents a strong test-time guidance method for RL, Paper 1's solution addresses a universal economic and computational problem in AI deployment with a highly practical architectural fix that will likely see rapid, broad adoption.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Paper 2 has higher estimated impact due to a more broadly applicable and timely idea: shifting policy improvement to test time for diffusion/flow policies, avoiding unstable RL training while leveraging scalable supervised pretraining. This has clear real-world robotics and offline RL applications, potentially lowering compute and engineering barriers and influencing both RL and generative policy modeling communities. Methodologically it introduces a clean algorithmic paradigm (critic + value-gradient guidance) that can transfer across tasks and model families. Paper 1 is a solid, rigorous PPO refinement for LLM RL, but is narrower in scope and likely incremental within a crowded area.

gpt-5.2 · Jun 10, 2026

Won vs. Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling

Paper 2 is likely higher impact due to broader cross-field relevance and real-world applicability: it targets continuous-control robotics/decision-making, proposing a practical paradigm (test-time policy improvement via value gradients) that preserves stable supervised training while avoiding actor-critic instability—an important bottleneck for diffusion/flow policies in RL. This approach could generalize across many offline/goal-conditioned settings and influence deployment-time optimization. Paper 1 is novel and solid for code LLMs, but its impact is narrower (program synthesis benchmarks) and more incremental within an already-active test-time scaling/self-training line.

gpt-5.2 · Jun 10, 2026

Won vs. Algorithmic and Minimax Complexities in Kernel Bandits

Paper 2 addresses a critical bottleneck in modern reinforcement learning: stable policy improvement with expressive generative models like diffusion and flow. By shifting policy optimization to test time, it bypasses the instabilities of actor-critic training, offering a scalable solution with immediate real-world applications in robotics and continuous control. While Paper 1 provides a strong theoretical unification for kernel bandits, its impact is likely confined to a specialized theoretical community. Paper 2's practical approach aligns with highly active, high-impact trends in generative AI and RL.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Efficiently Learning Drifting Halfspaces with Massart Noise

Paper 1 addresses a highly timely and widely applicable problem in reinforcement learning and robotics—scaling expressive policies like diffusion/flow models without the instability of training-time RL. Its practical approach of test-time guidance offers immediate real-world utility in robotics control and sidesteps major bottlenecks. Paper 2, while methodologically rigorous and theoretically significant for learning theory, focuses on specific bounds for linear classifiers under noise and drift, which has a narrower scope and less immediate practical impact across varied fields.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

Paper 2 (QGF) addresses a broader and more impactful problem: enabling stable RL with expressive generative policies by shifting optimization to test time. This has wide applications across robotics, imitation learning, and RL, touching multiple active research communities. Its insight that test-time guidance can replace unstable actor-critic training is novel and practically significant, especially given favorable scaling properties. Paper 1 (GRAFT), while technically strong with state-of-the-art NLB results, addresses a narrower niche (cross-day BCI recalibration) with more incremental contributions combining existing techniques (Transformers, adapters, gain modulation).

claude-opus-4-6 · Jun 10, 2026

Won vs. Flexible Kernels for Protein Property Prediction

Paper 2 addresses a fundamental challenge in reinforcement learning—how to perform policy improvement with expressive generative models without destabilizing training. The proposed test-time-only optimization paradigm (QGF) is a novel conceptual shift that sidesteps actor-critic instability, offering favorable scaling properties. This has broad implications across robotics, imitation learning, and RL at scale. Paper 1 makes solid contributions to protein property prediction with clever kernel design, but operates in a narrower niche. Paper 2's potential to influence how RL is done with diffusion/flow policies gives it broader and more timely impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer

Paper 1 (QGF) presents a novel, practical RL algorithm that decouples policy training from policy improvement by performing optimization at test time, addressing fundamental scalability and stability issues in RL with expressive policies. It demonstrates strong empirical results competitive with state-of-the-art methods while being computationally cheaper. Paper 2 offers valuable insights into emergent misalignment but is more analytical/observational. Paper 1's broader applicability to robotics and RL, its methodological innovation combining flow models with test-time guidance, and favorable scaling properties give it higher potential for widespread adoption and cross-field impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Paper 1 (QGF) addresses a fundamental challenge in scaling RL with expressive policies by proposing test-time policy improvement that avoids actor-critic training instability. It has broad practical applications in robotics and control, demonstrates strong empirical results across multiple benchmarks, and offers favorable scaling properties. Paper 2 makes a valuable methodological point about the gap between observational and interventional evidence in MoE interpretability, but its scope is narrower—primarily a cautionary finding about existing pruning heuristics rather than enabling new capabilities. Paper 1's contribution is more actionable and broadly impactful across robotics and RL communities.

claude-opus-4-6 · Jun 10, 2026
