Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud
Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
MODIP proposes an indirect approach to fine-tuning diffusion policies (DPs) beyond behavioral cloning by leveraging model-based planning rather than direct RL optimization. The key insight is that instead of wrestling with the computational and stability challenges of backpropagating RL gradients through a multi-step denoising process, one can use MPC to generate improved trajectories and then distill them back into the DP via a standard supervised denoising loss. This sidesteps a genuine pain point: the iterative generation process of diffusion policies makes standard policy-gradient or Q-learning-based fine-tuning computationally expensive and often unstable.
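To make the distillation step concrete, the following is a minimal sketch of a DDPM-style noise-prediction loss applied to planner-selected action sequences. It is an illustration under stated assumptions, not the authors' implementation: the function name `denoising_bc_loss`, the `eps_model` callable, the tensor shapes, and the noise schedule are all hypothetical, and the paper may use a different parameterization of the denoising objective.

```python
import numpy as np

def denoising_bc_loss(eps_model, actions, obs, alphas_cumprod, rng):
    """DDPM-style noise-prediction loss on planner-selected action sequences.

    All names and shapes here are assumptions made for illustration:
      eps_model(noisy, k, obs) -> predicted noise, same shape as `noisy`
      actions: (B, H, A) action sequences chosen by the planner (supervised targets)
      obs: (B, O) observations the sequences are conditioned on
      alphas_cumprod: (K,) cumulative products of the noise schedule, each in (0, 1)
    """
    batch = actions.shape[0]
    # Draw one random denoising step per sample.
    k = rng.integers(0, len(alphas_cumprod), size=batch)
    a_bar = alphas_cumprod[k][:, None, None]
    # Forward-noise the target action sequences.
    noise = rng.standard_normal(actions.shape)
    noisy = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * noise
    # Plain regression on the injected noise: no RL gradient passes through the sampler.
    pred = eps_model(noisy, k, obs)
    return float(np.mean((pred - noise) ** 2))
```

The point of the sketch is that the objective has the same form as in behavioral cloning; only the source of the target action sequences changes, from demonstrations to planner output.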
Three specific technical contributions support this framework: (1) using DPs as expressive action-sequence priors within MPPI-based trajectory optimization, (2) replacing the policy-dependent terminal estimate Q(s, π(s)) with a state-value V(s) to avoid costly denoising at terminal states during planning, and (3) policy-independent critic learning that forms TD targets from state values rather than policy-sampled actions.
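As a rough illustration of contributions (1) and (2), the sketch below scores candidate action sequences inside a model and bootstraps the return with a terminal state value V(s_H). It is a simplified, single-pass MPPI-style weighting rather than the paper's planner: `mppi_plan`, `step_fn`, `reward_fn`, `value_fn`, and the default `gamma` and `temperature` are hypothetical names and values, and the candidates are simply passed in where MODIP would draw them from the diffusion policy.

```python
import numpy as np

def mppi_plan(state, candidates, step_fn, reward_fn, value_fn,
              gamma=0.99, temperature=1.0):
    """Weight candidate action sequences by their model-predicted return.

    Assumed (hypothetical) interfaces, all batched over N candidates:
      candidates: (N, H, A) action sequences, e.g. samples from a policy prior
      step_fn(s, a) -> next states (N, S)
      reward_fn(s, a) -> rewards (N,)
      value_fn(s) -> state values V(s), shape (N,)
    """
    n, horizon, _ = candidates.shape
    s = np.repeat(state[None, :], n, axis=0)
    returns = np.zeros(n)
    for t in range(horizon):
        a = candidates[:, t]
        returns += (gamma ** t) * reward_fn(s, a)
        s = step_fn(s, a)
    # Terminal bootstrap from V(s_H): no action has to be sampled from the
    # policy here, which is where a Q(s, pi(s)) estimate would need denoising.
    returns += (gamma ** horizon) * value_fn(s)
    # Softmax weights over returns (max subtracted for numerical stability).
    w = np.exp((returns - returns.max()) / temperature)
    w /= w.sum()
    # Return-weighted average of the candidate sequences.
    plan = np.einsum("n,nha->ha", w, candidates)
    return plan, w
```

With a terminal estimate of the form Q(s_H, π(s_H)), the last step would instead require one policy sample per candidate, that is, a full denoising pass for a diffusion policy.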
The paper demonstrates reasonable methodological rigor, though with some gaps:
Strengths in experimental design: The evaluation spans three benchmark families (D4RL MuJoCo, D4RL Kitchen, RoboMimic) covering dense-reward locomotion, sparse-reward long-horizon manipulation, and multi-stage robotic manipulation. The comparison set includes relevant baselines spanning direct DP fine-tuning methods (DQL, DPPO, DSRL, PA-RL) and model-based planning (TD-MPC2). The ablation study in Table 2 systematically isolates contributions of each component.
Quantitative efficiency analysis: Tables 3 and 4 provide concrete measurements showing the V(s) terminal value yields a 2.9× inference speedup and policy-independent critics yield ~1.6× training speedup. These are meaningful practical improvements.
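One way to see where both speedups could come from is to write out the quantities being compared. The notation below (horizon H, discount γ, model-predicted rewards r_t, next state s') is assumed for illustration, and the paper's exact losses may differ. The policy-dependent variants on the left require a sample from the policy, which means a full denoising pass for a DP; the variants on the right do not.

```latex
% Planning: terminal bootstrap of an H-step model rollout
\hat{R}_{Q} = \sum_{t=0}^{H-1} \gamma^{t} r_t + \gamma^{H} Q\bigl(s_H, a_H\bigr),
  \quad a_H \sim \pi(\cdot \mid s_H)
\qquad \text{vs.} \qquad
\hat{R}_{V} = \sum_{t=0}^{H-1} \gamma^{t} r_t + \gamma^{H} V(s_H)

% Critic learning: one-step TD target
y_{Q} = r + \gamma \, Q\bigl(s', a'\bigr),
  \quad a' \sim \pi(\cdot \mid s')
\qquad \text{vs.} \qquad
y_{V} = r + \gamma \, V(s')
```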
Concerns: Standard deviations are sometimes large (e.g., Kitchen-Complete: 0.94±0.14), suggesting high variance across seeds. The number of seeds is not explicitly stated. Some baselines show 0.0 performance (TD-MPC2 on RoboMimic), which raises questions about whether the comparison is truly fair despite claims of matched computational budgets. The paper uses italics to indicate results taken from other papers (DSRL), which may involve different experimental conditions. The Kitchen results for DSRL are marked with ✗, indicating the method wasn't evaluated there.
Practical relevance: The offline-to-online fine-tuning paradigm addressed here is highly relevant for robotics, where demonstration data is increasingly available but pure imitation learning plateaus. MODIP's approach of keeping the supervised training objective while improving through better data distributions is elegant and practically useful.
Broader influence: The idea of using planning as a policy improvement operator (rather than RL gradients) for expressive generative policy classes could extend beyond diffusion policies to other iterative generative models (flow matching policies, autoregressive action models). The efficiency insights about V vs. Q terminal values and policy-independent critics are transferable to any hybrid planning system using expensive policy representations.
Limitations on impact scope: All experiments use state-based observations, not images. Given that diffusion policies' primary appeal in robotics is visuomotor control (as highlighted in the original DP paper), this is a significant limitation. The authors acknowledge this but do not address it. The tasks, while standard benchmarks, are relatively simple compared to real-world robotic manipulation challenges.
This work is well-timed. Diffusion policies have gained substantial traction in the robotics community since Chi et al. (2023), and the question of how to improve them beyond BC is an active research frontier. Multiple concurrent works (DPPO, DSRL, PA-RL) tackle similar problems, indicating strong community interest. MODIP offers a distinct angle by combining model-based planning with DPs, filling a gap in the landscape of DP improvement methods.
The offline-to-online paradigm is increasingly important as large demonstration datasets become available. The computational efficiency considerations are particularly timely as the field moves toward deploying diffusion policies on real robots where inference time matters.
Missing comparisons: The paper does not compare against DAgger-style methods or other model-based distillation approaches, and the relationship to Dyna-style methods could be discussed more thoroughly.
MODIP presents a well-motivated and technically sound approach to improving diffusion policies through model-based planning and distillation. The efficiency improvements from V-function terminal values and policy-independent critics are practical and well-demonstrated. However, the restriction to state-based observations, the system complexity, and the incremental nature of individual contributions moderate its overall impact. The work is a solid contribution to the growing literature on diffusion policy optimization but would benefit from vision-based validation to realize its full potential.
Generated Jun 10, 2026
MODIP addresses a fundamental challenge in robot learning—efficiently fine-tuning diffusion policies beyond behavioral cloning—with a novel framework combining world models, MPC, and supervised learning. This has broad implications for robotics and RL, offering methodological innovation (policy-independent TD targets, terminal value-based MPC) with demonstrated improvements over strong baselines. Paper 1, while useful, is primarily an engineering benchmark/adapter contribution for coding agents, with narrower scope and less methodological novelty. MODIP's approach could influence multiple research directions in robot learning and diffusion models.
Paper 2 (HAMNO/PI-HAMNO) likely has higher scientific impact due to broader cross-field relevance (scientific ML, PDE solvers, physics-informed learning across engineering/physics), and timely advances in neural operators for multi-scale, long-horizon dynamical systems. The hierarchical local–global operator with adaptive gating plus a concrete strong/weak-form physics-informed extension is a generally reusable methodological contribution with wide real-world application potential. Paper 1 is strong and timely for robotics RL fine-tuning of diffusion policies, but its impact is narrower to robot learning and relies on established world-model+MPC paradigms.
Paper 2 likely has higher scientific impact: it tackles a timely, widely relevant bottleneck—efficient RL fine-tuning for diffusion policies—bridging diffusion-based imitation and model-based RL in robotics. The approach (WM+MPC-generated supervised targets, terminal value, policy-independent TD) is broadly applicable and can influence both diffusion policy training and model-based control, with clear real-world robotics implications. Paper 1 is rigorous and practically important for safety monitoring, but its contributions are more specialized to deployed safety classifiers and reveal limitations (importance-weight collapse), potentially narrowing immediate uptake.
Paper 2 (MODIP) likely has higher scientific impact: it targets a fast-moving area (diffusion policies for robotics) and tackles a key bottleneck—making offline-to-online RL fine-tuning practical despite multi-step denoising. The integration of world models + MPC with supervised fine-tuning offers a broadly applicable framework across robot learning and model-based RL, with clear real-world relevance. RCAP is useful and rigorous for efficient training and robustness under pruning, but it is a more incremental algorithmic advance in a crowded dataset-pruning space with narrower cross-field reach.
Paper 1 addresses a critical bottleneck in a rapidly growing field (diffusion policies in robot learning) by introducing a novel offline-to-online reinforcement learning framework using world models. This methodological innovation is likely to influence a broad range of AI and robotics research. In contrast, Paper 2, while highly valuable for clinical applications and societal impact, represents an application of existing machine learning techniques (multimodal attention, ordinal regression) to Alzheimer's staging, making it less methodologically ground-breaking for the broader scientific community.
MODIP addresses a highly practical and timely problem in robot learning—fine-tuning diffusion policies beyond behavioral cloning—with a novel framework combining world models and MPC. Diffusion policies are a rapidly growing area with broad robotics applications, and demonstrating competitive or superior performance against strong baselines like TD-MPC2 on standard benchmarks gives it significant practical relevance. Paper 2 makes solid theoretical contributions to online convex optimization with capacity constraints, but addresses a more niche setting with narrower immediate impact. Paper 1's combination of novelty, timeliness, and real-world applicability gives it higher potential impact.
Paper 2 likely has higher scientific impact due to releasing a large, harmonized, leakage-audited public clinical-genomic benchmark with locked tasks and an evaluation harness. Such resources can catalyze broad, long-term work across ML, oncology, bioinformatics, and regulatory/clinical translation, and its explicit identification of a modality ceiling informs future data collection (serial ctDNA) and study design. Paper 1 is a solid, timely algorithmic contribution for diffusion-policy fine-tuning in robotics, but its impact is narrower and more incremental relative to fast-moving RL/model-based control literature compared with a new public benchmark in precision oncology.
Paper 1 addresses a critical bottleneck in large language model training—efficient knowledge distillation for reasoning tasks. By moving from output-space to hidden-state space, it provides significant theoretical and empirical improvements in efficiency and performance on highly relevant benchmarks (AIME/AIMO). Given the massive adoption and broad applications of LLMs compared to the more specialized domain of robot learning and diffusion policies in Paper 2, Paper 1 has higher potential for widespread cross-disciplinary impact and immediate real-world utility.
Paper 1 likely has higher impact due to stronger novelty and broader real-world applicability: it tackles a key bottleneck in diffusion policies for robotics (offline-to-online fine-tuning) via a pragmatic model-based MPC-to-supervised adaptation scheme, with efficiency improvements (terminal value, policy-independent TD targets) and validation across standard robot learning benchmarks (D4RL, RoboMimic) against strong baselines. This can influence both diffusion-policy RL and model-based control in robotics. Paper 2 is timely for LLM reasoning but appears more incremental (a sampling/exploration tweak within GRPO) with narrower methodological scope and less standardized evaluation breadth.
MODIP addresses a timely and significant challenge in robot learning—fine-tuning diffusion policies beyond behavioral cloning—combining world models with MPC in a novel framework. Diffusion policies are a rapidly growing area, and enabling efficient RL fine-tuning has broad implications for robotics. Paper 1 (DtR) proposes a useful engineering contribution for constructing hybrid attention models but is more incremental, combining known techniques (distillation + greedy replacement). MODIP's cross-cutting impact across RL, diffusion models, and robotics, plus its strong empirical results against multiple baselines, gives it higher potential impact.