Zhiyuan Zhou, Andy Peng, Charles Xu, Qiyang Li, Tobias Springenberg, Kevin Frans, Sergey Levine
Expressive continuous control policies, such as diffusion and flow models, form the backbone of recent advances in scaling imitation learning for simulated and real robot control. While they are known to scale stably in the supervised imitation learning setting, incorporating them into reinforcement learning (RL) pipelines for policy improvement has proven more difficult. It often requires specialized training objectives or backpropagating through denoising processes, which cause well-known issues with stability and affect scalability. In this paper we study the question of whether simple policy improvement schemes at test time alone, leaving stable supervised policy training intact, can be a competitive alternative which sidesteps these issues. To this end, we propose QGF (Q-Guided Flow), an RL algorithm that performs policy optimization entirely at test time. QGF works by pre-training both a reference flow policy (via a standard behavioral cloning objective) and a value function critic and, at test time, using the value gradient to guide the reference policy to generate higher-value actions without any additional policy learning. Empirically, QGF outperforms prior test-time RL methods on single-task and goal-conditioned offline RL benchmarks with high-dimensional action spaces, and is competitive with state-of-the-art training-time algorithms while being much cheaper to run. Moreover, it exhibits favorable scaling with model size by avoiding the instability of actor-critic training, offering a practical and effective alternative RL algorithm with expressive policies.
QGF proposes a simple yet effective mechanism for policy improvement at test time in offline RL with flow-based policies. The key insight is to avoid both (a) taking Q-function gradients at intermediate noisy actions, which are out-of-distribution for the critic, and (b) expensive backpropagation through the entire denoising chain. Instead, QGF uses a single Euler integration step to approximate the denoised action from any intermediate step, then takes the critic gradient at this approximated clean action. Crucially, the Jacobian of this one-step estimate with respect to the intermediate noisy action is replaced with the identity matrix, a seemingly crude approximation that empirically proves beneficial, apparently because of its lower variance.
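The mechanism described above can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: the path convention x_t = (1 - t) * z + t * a, the Gaussian reference policy with its closed-form velocity, the quadratic critic, the `guidance_scale` parameter, and the choice to add the critic gradient directly to the velocity are all illustrative choices that do not come from the paper.

```python
import numpy as np

# Toy setup (all assumptions): reference policy a ~ N(MU, SIGMA^2) per dimension,
# and a quadratic critic Q(a) = -0.5 * ||a - A_STAR||^2 that prefers a = A_STAR.
MU, SIGMA = 0.0, 0.5
A_STAR = 1.0


def velocity(x, t):
    """Closed-form marginal velocity for the path x_t = (1 - t) * z + t * a,
    with z ~ N(0, 1) and a ~ N(MU, SIGMA^2) drawn independently."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2   # Var[x_t]
    cov = t * SIGMA ** 2 - (1.0 - t)            # Cov[a - z, x_t]
    return MU + cov / var_t * (x - t * MU)      # E[a - z | x_t = x]


def q_value(a):
    return -0.5 * np.sum((a - A_STAR) ** 2, axis=-1)


def q_grad(a):
    # Analytic gradient of the toy critic; a learned critic would use autodiff.
    return A_STAR - a


def sample(noise, guidance_scale, num_steps=20):
    """Euler sampler with critic guidance evaluated at a one-step clean estimate."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)
        # Single Euler step from time t to time 1: estimate of the clean action.
        a_hat = x + (1.0 - t) * v
        # Critic gradient at the estimated clean action, not at the noisy x.
        g = q_grad(a_hat)
        # The exact chain rule would multiply g by the Jacobian of a_hat with
        # respect to x; the identity approximation uses g as is.
        x = x + dt * (v + guidance_scale * g)
    return x


rng = np.random.default_rng(0)
noise = rng.standard_normal((512, 2))
unguided_q = q_value(sample(noise, guidance_scale=0.0)).mean()
guided_q = q_value(sample(noise, guidance_scale=1.0)).mean()
print(f"mean Q unguided: {unguided_q:.3f}, guided: {guided_q:.3f}")
```

Setting `guidance_scale=0.0` recovers plain sampling from the reference policy, so the same routine serves both as the behavioral-cloning baseline and as the guided sampler. In this toy the true Jacobian happens to be a scalar multiple of the identity, so the sketch shows the shape of the update but says nothing about the variance argument.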
The method cleanly decouples policy training (standard behavioral cloning via flow matching) from value learning (IQL or other TD methods), performing all reward-seeking optimization at inference time. This sidesteps the notorious instability of actor-critic training with iterative generative models.
The paper provides a solid theoretical motivation grounded in KL-regularized RL and the connection between flow matching and score functions. The derivation from Eq. (3) through Eq. (9) is clearly presented and well-motivated.
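For orientation, the textbook form of the KL-regularized objective and its consequence for the score can be written as below. This is the standard result, not a transcription of the paper's Eq. (3) through Eq. (9); the symbols \beta (regularization strength) and \pi_{\mathrm{ref}} (the behavior-cloned reference policy) are assumed notation and may differ from the paper's.

```latex
% KL-regularized policy improvement against a reference policy
\max_{\pi}\; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\big[ Q(s, a) \big]
  - \beta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s) \big)

% Closed-form optimum: the reference policy reweighted by exponentiated values
\pi^{*}(a \mid s) \;\propto\; \pi_{\mathrm{ref}}(a \mid s)\, \exp\!\big( Q(s, a) / \beta \big)

% Taking \nabla_a \log of both sides: reference score plus a scaled value gradient
\nabla_a \log \pi^{*}(a \mid s)
  = \nabla_a \log \pi_{\mathrm{ref}}(a \mid s) + \tfrac{1}{\beta}\, \nabla_a Q(s, a)
```

The last line is what motivates guiding a pre-trained sampler with the critic gradient: the optimal policy's score differs from the reference score only by the scaled value gradient.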
The experimental evaluation is thorough.
The noise sensitivity analysis (Fig. 3, cosine similarity metric) and the Q-value optimization analysis (Fig. 4) provide good mechanistic understanding. The 1D illustrative example (Fig. 2) effectively demonstrates the OOD gradient bias problem.
However, there are methodological concerns. The claim that dropping the Jacobian is "better" rather than just "simpler" deserves more theoretical scrutiny. The paper acknowledges this is an approximation but frames the empirical advantage as somewhat surprising without fully explaining why. The connection to prior work on approximate gradients (random feedback alignment, etc.) is mentioned but not deeply developed. Additionally, the offline RL setting is the sole evaluation domain—no online RL or real-robot experiments are included.
Practical significance: QGF offers a compelling workflow: train a flow policy with stable BC, train a critic separately, then compose them at test time. This modularity is attractive for real-world robotics.
Computational efficiency: QGF requires only one additional critic gradient evaluation (a forward and backward pass through the critic) per denoising step, making it orders of magnitude cheaper than BFN (N=16) while achieving comparable performance (Figs. 6-7).
Scaling properties: The favorable scaling with model size (Fig. 9) addresses a genuine bottleneck—actor-critic methods often degrade with larger networks due to optimization instability, while QGF's supervised training loss scales predictably.
Broader implications: This work contributes to the growing paradigm of "test-time compute" in decision-making, analogous to developments in LLMs. The idea of separating capability (policy) from optimization (test-time guidance) could influence how robotics foundation models are deployed.
This paper is highly timely. Flow and diffusion models are rapidly becoming the dominant policy class for robotic manipulation, yet incorporating them into RL remains challenging. The tension between the stability of supervised pretraining and the instability of RL fine-tuning is a current bottleneck that multiple groups are attacking simultaneously (FQL, QAM, EDP, DAC—all 2024-2025 papers). QGF offers an orthogonal and arguably simpler solution.
The test-time compute paradigm is also trending across ML, from language models to image generation. Positioning RL policy improvement as a test-time problem rather than a training-time problem is conceptually aligned with these broader trends.
Comparison to closest prior work: The advantage over QFQL (OOD gradient) and BPTT is clearly demonstrated. The relationship to EDP is nuanced—both use Euler approximation but QGF operates at test time, making it strictly more flexible.
QGF makes a convincing case that test-time gradient guidance is a practical alternative to training-time policy optimization for flow-based RL policies. The contribution is primarily empirical and algorithmic rather than theoretical, but the extensive experiments and ablations substantiate the claims well. The work addresses a genuine current need and offers a simple, modular solution that could see adoption in robotics pipelines using generative policy models.
Generated Jun 10, 2026
Paper 2 (QGF) addresses a fundamental challenge in RL—stable policy improvement with expressive generative models—by proposing test-time gradient guidance that decouples supervised training from policy optimization. This has broad real-world robotics applications, offers practical scalability advantages, and bridges imitation learning with RL in a novel way. Paper 1 (nD-RoPE) is a solid theoretical contribution generalizing position embeddings, but it is more incremental in scope, extending existing RoPE methodology. Paper 2's potential to influence both the RL and robot learning communities, combined with its practical benefits and timeliness amid the scaling of diffusion/flow policies, gives it higher impact.
Paper 1 offers higher potential scientific impact due to its immediate applicability to a critical bottleneck in modern AI: LLM inference latency. By elegantly identifying and solving the head-backbone competition in multi-token prediction with a drastically simplified, parameter-efficient layer (CLP), it provides a zero-loss speedup for widely used models. While Paper 2 presents a strong test-time guidance method for RL, Paper 1's solution addresses a universal economic and computational problem in AI deployment with a highly practical architectural fix that will likely see rapid, broad adoption.
Paper 2 has higher estimated impact due to a more broadly applicable and timely idea: shifting policy improvement to test time for diffusion/flow policies, avoiding unstable RL training while leveraging scalable supervised pretraining. This has clear real-world robotics and offline RL applications, potentially lowering compute and engineering barriers and influencing both RL and generative policy modeling communities. Methodologically it introduces a clean algorithmic paradigm (critic + value-gradient guidance) that can transfer across tasks and model families. Paper 1 is a solid, rigorous PPO refinement for LLM RL, but is narrower in scope and likely incremental within a crowded area.
Paper 2 is likely higher impact due to broader cross-field relevance and real-world applicability: it targets continuous-control robotics/decision-making, proposing a practical paradigm (test-time policy improvement via value gradients) that preserves stable supervised training while avoiding actor-critic instability—an important bottleneck for diffusion/flow policies in RL. This approach could generalize across many offline/goal-conditioned settings and influence deployment-time optimization. Paper 1 is novel and solid for code LLMs, but its impact is narrower (program synthesis benchmarks) and more incremental within an already-active test-time scaling/self-training line.
Paper 2 addresses a critical bottleneck in modern reinforcement learning: stable policy improvement with expressive generative models like diffusion and flow. By shifting policy optimization to test time, it bypasses the instabilities of actor-critic training, offering a scalable solution with immediate real-world applications in robotics and continuous control. While Paper 1 provides a strong theoretical unification for kernel bandits, its impact is likely confined to a specialized theoretical community. Paper 2's practical approach aligns with highly active, high-impact trends in generative AI and RL.
Paper 1 addresses a highly timely and widely applicable problem in reinforcement learning and robotics—scaling expressive policies like diffusion/flow models without the instability of training-time RL. Its practical approach of test-time guidance offers immediate real-world utility in robotics control and sidesteps major bottlenecks. Paper 2, while methodologically rigorous and theoretically significant for learning theory, focuses on specific bounds for linear classifiers under noise and drift, which has a narrower scope and less immediate practical impact across varied fields.
Paper 2 (QGF) addresses a broader and more impactful problem: enabling stable RL with expressive generative policies by shifting optimization to test time. This has wide applications across robotics, imitation learning, and RL, touching multiple active research communities. Its insight that test-time guidance can replace unstable actor-critic training is novel and practically significant, especially given favorable scaling properties. Paper 1 (GRAFT), while technically strong with state-of-the-art NLB results, addresses a narrower niche (cross-day BCI recalibration) with more incremental contributions combining existing techniques (Transformers, adapters, gain modulation).
Paper 2 addresses a fundamental challenge in reinforcement learning—how to perform policy improvement with expressive generative models without destabilizing training. The proposed test-time-only optimization paradigm (QGF) is a novel conceptual shift that sidesteps actor-critic instability, offering favorable scaling properties. This has broad implications across robotics, imitation learning, and RL at scale. Paper 1 makes solid contributions to protein property prediction with clever kernel design, but operates in a narrower niche. Paper 2's potential to influence how RL is done with diffusion/flow policies gives it broader and more timely impact.
Paper 1 (QGF) presents a novel, practical RL algorithm that decouples policy training from policy improvement by performing optimization at test time, addressing fundamental scalability and stability issues in RL with expressive policies. It demonstrates strong empirical results competitive with state-of-the-art methods while being computationally cheaper. Paper 2 offers valuable insights into emergent misalignment but is more analytical/observational. Paper 1's broader applicability to robotics and RL, its methodological innovation combining flow models with test-time guidance, and favorable scaling properties give it higher potential for widespread adoption and cross-field impact.
Paper 1 (QGF) addresses a fundamental challenge in scaling RL with expressive policies by proposing test-time policy improvement that avoids actor-critic training instability. It has broad practical applications in robotics and control, demonstrates strong empirical results across multiple benchmarks, and offers favorable scaling properties. Paper 2 makes a valuable methodological point about the gap between observational and interventional evidence in MoE interpretability, but its scope is narrower—primarily a cautionary finding about existing pruning heuristics rather than enabling new capabilities. Paper 1's contribution is more actionable and broadly impactful across robotics and RL communities.