Alexander Soen, Hisham Husain, Valentin De Bortoli, Arnaud Doucet
Speculative decoding speeds up LLM inference by using a draft model to generate tokens, with an acceptance-rejection scheme that ensures that the output matches the target distribution. Adapting this to continuous diffusions is difficult because speculative sampling requires drawing from a residual distribution. While straightforward in discrete spaces, efficiently sampling this residual in continuous space is non-trivial. Consequently, existing diffusion adaptations either use computationally inefficient sampling techniques or rely on an alternative scheme. In this work, we introduce a novel scheme that efficiently implements the original speculative sampling mechanism for diffusion models. Our approach offers a critical advantage over current methods: it enables us to adapt block verification from LLMs to diffusions -- which provably improves the acceptance rate of drafts. Furthermore, we formalize and analyze the Free Drafter, a heuristic self-speculative drafter for diffusions that requires no training. By enabling block verification, our Free Drafter yields up to a 6.3% speedup over existing speculative methods with no additional training and negligible overhead beyond the existing parallel verification pass.
This paper tackles a specific but important technical bottleneck in adapting speculative decoding from LLMs to continuous diffusion models. The fundamental challenge is that speculative sampling requires drawing from a residual distribution r_Γ(y) ∝ max{0, q(y) - p(y)}, which is trivial in discrete (token) spaces but non-trivial in continuous spaces. Prior approaches either used computationally expensive rejection sampling (with random execution times and multiple model evaluations) or replaced the Γ-maximal coupling with a reflection coupling that, while efficient, precludes block verification.
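To make concrete why the residual is easy to handle in a discrete (token) space, here is a minimal sketch of one speculative sampling step over a finite vocabulary. It follows the convention used above, with p as the draft and q as the target. The function names are hypothetical and chosen for illustration; this is not code from the paper.

```python
import random

def residual_distribution(p_draft, q_target):
    """Normalize max(0, q - p) over a finite support (hypothetical helper)."""
    raw = [max(0.0, q - p) for p, q in zip(p_draft, q_target)]
    total = sum(raw)
    if total == 0.0:
        raise ValueError("draft and target coincide; the residual is undefined")
    return [r / total for r in raw]

def speculative_sample(p_draft, q_target, rng):
    """One draft-then-verify step; returns (token, accepted)."""
    support = range(len(p_draft))
    # Draw a proposal from the draft distribution p.
    x = rng.choices(support, weights=p_draft)[0]
    # Accept with probability min(1, q(x) / p(x)); p(x) > 0 because x was drawn from p.
    if rng.random() < min(1.0, q_target[x] / p_draft[x]):
        return x, True
    # On rejection, resample from the residual, which here is just a
    # renormalized vector; this is the step that is hard in continuous space.
    r = residual_distribution(p_draft, q_target)
    return rng.choices(support, weights=r)[0], False
```

In the discrete case the residual is a finite vector that can be normalized in one pass. In a continuous space the same normalization becomes an integral, which is the difficulty the paper addresses.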
The paper's key insight (Proposition 3.1) is an orthogonal decomposition of the Gaussian residual: the high-dimensional sampling problem reduces to a 1D sampling task along the direction of the mean difference between draft and target, plus independent Gaussian sampling in the orthogonal complement. The 1D distribution has a closed-form CDF, so it can be sampled by inverse-transform sampling, with the CDF inverted numerically via bisection. This elegantly solves the residual sampling problem in deterministic time with a single target model evaluation.
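The decomposition can be illustrated with a short sketch. It assumes the simplest shared-covariance case, where draft and target are N(m_p, σ²I) and N(m_q, σ²I), and it uses a parameterization of my own (coordinates centered at the midpoint of the two means, scaled by σ). In those coordinates the component t along the mean difference has density proportional to φ(t - d/2) - φ(t + d/2) on t > 0, with d = ||m_q - m_p|| / σ. The paper's exact statement and notation may differ, and all function names are hypothetical.

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def residual_cdf_1d(t, d):
    """CDF of the density proportional to phi(t - d/2) - phi(t + d/2) on t > 0."""
    if t <= 0.0:
        return 0.0
    z = 2.0 * std_normal_cdf(d / 2.0) - 1.0  # normalizing constant
    return (std_normal_cdf(t - d / 2.0) - std_normal_cdf(t + d / 2.0) + z) / z

def sample_residual_1d(u, d, iters=60):
    """Invert the 1D CDF at level u in [0, 1) by bisection (fixed iteration count)."""
    lo, hi = 0.0, d / 2.0 + 10.0  # the CDF is numerically 1 at the upper bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if residual_cdf_1d(mid, d) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_gaussian_residual(mean_draft, mean_target, sigma, rng):
    """Sample from the density proportional to max(0, q - p) for isotropic Gaussians."""
    delta = [q - p for p, q in zip(mean_draft, mean_target)]
    norm = math.sqrt(sum(x * x for x in delta))
    if norm == 0.0:
        raise ValueError("means coincide; the residual is undefined")
    e = [x / norm for x in delta]  # unit vector from draft mean to target mean
    d = norm / sigma               # standardized distance between the means
    # 1D part along e. Note: for very small d the normalizing constant suffers
    # floating-point cancellation; this sketch does not handle that regime.
    t = sample_residual_1d(rng.random(), d)
    # Orthogonal part: a standard Gaussian with its component along e removed.
    g = [rng.gauss(0.0, 1.0) for _ in delta]
    along = sum(gi * ei for gi, ei in zip(g, e))
    w = [gi - along * ei for gi, ei in zip(g, e)]
    midpoint = [0.5 * (p + q) for p, q in zip(mean_draft, mean_target)]
    return [m + sigma * (t * ei + wi) for m, ei, wi in zip(midpoint, e, w)]
```

For example, `sample_gaussian_residual([0.0, 0.0, 0.0], [1.0, 2.0, 2.0], 1.0, random.Random(0))` returns a point on the target side of the midpoint hyperplane, which is the region where q exceeds p. The bisection runs a fixed number of iterations, so the cost is deterministic, in line with the deterministic-time property described above.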
This decomposition unlocks block verification for diffusions — a technique from LLM speculative decoding where the entire draft block is jointly verified rather than token-by-token. The paper proves (Proposition 3.2) that reflection-style deterministic corrections cannot support block verification, establishing the necessity of their stochastic approach.
The theoretical foundations are solid.
The proofs in the appendix are detailed and appear correct. The orthogonal decomposition leverages standard properties of Gaussian distributions but applies them in a novel context.
Experimentally, the paper tests across 6 dataset configurations (CIFAR10, CelebA, ImageNet, LSUN in pixel and latent space), multiple churn parameters ε, denoising steps K, and draft sizes γ. FID scores confirm no quality degradation (as theoretically guaranteed for ω=1). The experimental design is thorough, with error propagation for speedup measurements.
Magnitude of speedup: The headline result — up to 6.3% wall-clock improvement over existing speculative diffusion methods — is modest. This is an incremental improvement on top of already significant 1.9-3.6× speedups that speculative diffusion provides over vanilla DDPM. The block verification improvement is most pronounced for pixel-space models and lower churn values.
Practical relevance: The Free Drafter analysis is practically valuable. Demonstrating that the zero-overhead Free Drafter consistently outperforms the theoretically better-aligned Frozen Drafter (despite lower block efficiency) provides clear practical guidance. Table 4 shows the Frozen Drafter is 8-31% slower in wall-clock time despite 9-45% higher block efficiency.
Broader applicability: The authors note connections to Langevin dynamics and molecular dynamics (citing Kosmala et al. [2026]), suggesting the block verification technique could extend beyond image generation. The theoretical framework is general enough for any setting where draft and target are Gaussian with shared covariance.
Limitations on impact: The approach requires stochastic samplers (ε > 0) and matching denoising schedules between draft and target. It cannot accelerate deterministic (DDIM-style) samplers. The speedup diminishes when few denoising steps are used (precisely the regime where other acceleration methods like distillation operate), somewhat limiting composability.
Speculative decoding for diffusion models is an active area (6+ concurrent/recent papers cited from 2024-2026). The paper addresses a known gap: the inability to efficiently implement the original Γ-maximal coupling for continuous spaces. Block verification for LLMs (Sun et al. [2025]) is state-of-the-art, and extending it to diffusions is a natural and timely step.
The work is well-positioned relative to concurrent efforts: it directly improves upon De Bortoli et al. [2025] and Hu et al. [2025] while providing a cleaner alternative to the complex parallel rejection scheme of Anari et al. [2026].
This is a technically sound paper that provides an elegant solution to a known problem (residual sampling for continuous speculative decoding) and uses it to unlock block verification for diffusions. The theoretical contributions are clean and the experiments are thorough. However, the practical impact is modest — the speedups are incremental improvements on existing methods. The work represents solid incremental progress in diffusion model acceleration rather than a paradigm shift.
Generated Jun 12, 2026
Paper 2 is likely to have higher scientific impact: it targets broadly important spatio-temporal forecasting domains (transportation, climate, energy), proposes a general plug-and-play pretraining framework that integrates with multiple STGNN backbones, and demonstrates consistent gains across five baselines and five real-world datasets, suggesting robustness and wide adoptability. Paper 1 is technically novel but impacts a narrower slice of diffusion inference and reports modest speedups (up to 6.3%), making downstream real-world influence potentially more limited despite strong methodological contributions.
Paper 1 likely has higher scientific impact: it advances a timely, widely used generative-modeling paradigm (diffusion) with a novel, principled adaptation of speculative decoding and block verification, offering provable acceptance-rate benefits and measurable speedups without extra training. This targets a major real-world bottleneck (inference cost) and is broadly relevant across ML systems, generative modeling, and deployment. Paper 2 is mathematically elegant and rigorous, but appears more specialized (gauge-invariant readouts for cochain cup products) with narrower immediate applicability and impact outside niche physics/geometry-ML intersections.
Paper 2 addresses a fundamental limitation in classical learning theory by providing generalization guarantees for dependent data via simulatable processes, significantly broadening the PAC model. In contrast, Paper 1 offers a highly specialized, incremental engineering improvement (6.3% speedup) for speculative diffusion models. The theoretical foundations and broader conceptual impact of Paper 2 give it a significantly higher potential for long-term scientific influence.
Paper 1 offers a novel and fundamental insight into weight-space geometry in transformer optimization, demonstrating that different modules benefit from different manifold constraints. This finding has broad implications for optimizer design across all transformer-based models, potentially influencing how future optimizers are built. Paper 2 presents a useful engineering contribution for speeding up diffusion model inference, but the 6.3% speedup is incremental. Paper 1's conceptual contribution—module-specific geometric optimization—opens a new research direction with wider theoretical and practical impact across deep learning.
Paper 2 introduces a fundamentally new theoretical framework connecting equivariance, Lyapunov spectra, and certified prediction horizons for world models, with broad implications across dynamical systems, robotics, and AI safety. Its provable guarantees (orbit-constant error, two-sided horizon bounds) and training-free auditing of pretrained models represent deeper conceptual contributions. Paper 1, while practically useful, offers incremental improvements (6.3% speedup) to speculative decoding for diffusion models. Paper 2's cross-disciplinary relevance (control theory, symmetry, trustworthy AI) and novel certification methodology suggest broader and more lasting scientific impact.
Paper 2 addresses a fundamental question about how reinforcement learning post-training improves reasoning in LLMs—a topic of immense current interest. Its mechanistic insights (strategy selection and strategy improvement) provide actionable understanding that could influence how the entire field approaches training reasoning models. Paper 1, while technically solid, offers incremental improvements (6.3% speedup) to speculative decoding for diffusion models, a narrower contribution. Paper 2's broader applicability, timeliness given the RL-for-reasoning boom, and potential to guide future training methodologies give it higher impact potential.
Paper 1 targets a hard, high-variance real-world robotics problem (aerial pickup/transport of diverse payloads) and proposes an end-to-end meta-RL + contrastive context approach with sim-to-real deployment, which is both novel and application-rich. If validated experimentally, it could impact aerial robotics, manipulation, adaptive control, and meta-learning broadly. Paper 2 is timely for generative model acceleration and has solid methodological framing, but the reported gains (e.g., 6.3%) are relatively modest and the impact may be more incremental within diffusion inference. Overall, Paper 1 has higher cross-domain and real-world impact potential.
Paper 1 addresses a fundamental theoretical limitation in ensemble learning and provides a model-agnostic framework yielding massive improvements (up to 96% compression). In contrast, Paper 2 adapts an existing LLM technique to diffusion models, offering a relatively marginal 6.3% speedup. The broad applicability of Paper 1 to ubiquitous ensemble methods, combined with its rigorous mathematical novelty and significant empirical gains, gives it higher potential for widespread scientific and practical impact.
Paper 1 is more novel and broadly impactful: it introduces a simple but powerful interface (boundary tokens) that simultaneously resolves an optimization barrier (on-policy RL ratios for latent recurrence) and enables mechanistic/causal analysis of latent reasoning. This bridges RLHF-style training, interpretability, and reasoning efficiency—areas with wide cross-field relevance and strong timeliness. Paper 2 is methodologically solid and useful for diffusion inference, but the reported gains are modest and the contribution is more incremental/engineering-focused, with narrower impact compared to a general framework for RL-trainable latent reasoning.
Paper 1 introduces a fundamental theoretical framework ('loss shift' and Bayes quotients) that addresses a novel failure mode in transfer learning independent of distribution shift. This foundational insight has broad implications across representation learning and generalization. In contrast, Paper 2 offers a valuable but highly specific algorithmic speedup for diffusion models. Theoretical advances like those in Paper 1 typically yield broader, longer-lasting scientific impact across multiple subfields of machine learning.