Mark Kozdoba, Shie Mannor
Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths, the prior degenerates in the limit, converging to the set of constant functions, which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for bandwidths below the threshold the prior converges to a limit distribution. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions, and demonstrate a complex multimodal behaviour of the limit distributions, a regime that becomes increasingly narrow with dimension and would be hard to identify without knowing the threshold.
This paper addresses a fundamental question about deep Gaussian processes (DGPs): what happens to the compositional GP prior as depth grows to infinity? Prior work by Dunlop et al. (2018) had shown that for the RBF kernel with sufficiently large bandwidth, the prior degenerates to constant functions (synchronization). This paper makes three main contributions: (1) it identifies the *sharp* critical bandwidth, improving on the previous non-tight bound; (2) it proves that below this threshold, the chain converges in total variation to a unique, non-degenerate stationary distribution; and (3) it establishes that this limit is non-Gaussian with non-vanishing inter-coordinate dependence, despite being built entirely from Gaussian ingredients.
The result that non-trivial, non-Gaussian limits exist is the most significant conceptual contribution. It fundamentally changes the picture from "deep GPs degenerate" to "deep GPs admit a rich phase diagram with a sharp transition."
The theoretical development is rigorous and well-structured. The authors decompose the problem intelligently: rather than attacking the full position chain directly, they work with the pairwise distance chain, which is scalar and Markov. The key insight is that the log-chain behaves like a random walk with drift near the origin, with the sign of the drift determining the regime.
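The pairwise distance chain can be illustrated with a short simulation. This is a minimal sketch under an assumed setup that this review does not spell out: each layer has `dim` independent zero-mean GP coordinates with the unit-variance RBF kernel k(x, y) = exp(-|x - y|^2 / (2 * bandwidth^2)). Under that assumption, the squared distance between the images of two points is 2 * (1 - k) times a chi-square variable with `dim` degrees of freedom. The function names and the bandwidth convention are illustrative and may differ from the paper's recursion (2).

```python
import math
import random


def step_sq_dist(sq_dist, bandwidth, dim, rng):
    """Push the squared distance between two inputs through one random layer."""
    u = sq_dist / (2.0 * bandwidth ** 2)
    one_minus_rho = -math.expm1(-u)          # 1 - exp(-u), accurate for tiny u
    chi2 = rng.gammavariate(dim / 2.0, 2.0)  # chi-square with `dim` degrees of freedom
    return 2.0 * one_minus_rho * chi2


def simulate_log_sq_dist(bandwidth, dim, depth, sq_dist0=1.0, seed=0):
    """Return the log squared distance after each of `depth` layers."""
    rng = random.Random(seed)
    sq_dist = sq_dist0
    path = []
    for _ in range(depth):
        sq_dist = step_sq_dist(sq_dist, bandwidth, dim, rng)
        # Guard against floating-point underflow to exactly zero.
        path.append(math.log(sq_dist) if sq_dist > 0.0 else float("-inf"))
    return path


if __name__ == "__main__":
    # Assumed example bandwidths: one small, one large.
    for bw in (0.1, 3.0):
        path = simulate_log_sq_dist(bandwidth=bw, dim=2, depth=100)
        print(f"bandwidth={bw}: log squared distance after 100 layers = {path[-1]:.2f}")
```

With a large bandwidth the log squared distance drifts steadily downward, which is the synchronization regime; with a small bandwidth it keeps fluctuating around order-one values instead of collapsing.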
The supercritical proof uses a clean global bound combined with the SLLN. The subcritical proof is more involved, employing a Foster-Lyapunov drift argument with a carefully constructed "tent" Lyapunov function that captures drift toward a compact set from both tails. The extension to the general case (Theorem 4.3) uses a sum-of-squared-logs Lyapunov function across all pairs.
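To make the "global bound plus SLLN" pattern concrete, here is a worked sketch under an assumed setup, not the paper's own statement: layers with $m$ independent zero-mean GP coordinates, unit-variance RBF kernel $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$, and $D_\ell$ denoting the squared distance between two points after $\ell$ layers. The symbols $D_\ell$, $X_\ell$, and $m$ are introduced here for illustration only.

```latex
% Assumed setup (not taken from the paper): i.i.d. GP(0, k) coordinates,
% k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2)), layer width m.
\begin{align*}
D_{\ell+1} &= 2\bigl(1 - e^{-D_\ell/(2\sigma^2)}\bigr)\, X_\ell,
  \qquad X_\ell \overset{\text{i.i.d.}}{\sim} \chi^2_m, \\
D_{\ell+1} &\le \frac{D_\ell}{\sigma^2}\, X_\ell
  \qquad \text{since } 1 - e^{-u} \le u \text{ for all } u \ge 0, \\
\limsup_{L\to\infty} \frac{1}{L}\log D_L
  &\le \mathbb{E}\bigl[\log X_0\bigr] - \log \sigma^2
  = \psi\!\left(\tfrac{m}{2}\right) + \log 2 - \log \sigma^2
  \qquad \text{almost surely.}
\end{align*}
```

Here $\psi$ is the digamma function and the last step applies the SLLN to the i.i.d. terms $\log(X_\ell/\sigma^2)$. When the right-hand side is negative, the distance decays exponentially in depth under this assumed convention. Whether the resulting condition coincides with the paper's sharp threshold depends on its bandwidth convention, which this review does not state.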
The non-Gaussianity proof (Theorem 4.4) is elegant: it proceeds by contradiction, showing that if the joint limit were Gaussian, isotropy would force it into a particular scaled form, but a characteristic-function analysis proves that no distribution of that scaled form can be stationary for the recursion (2). This is a clean structural argument.
The proofs are complete and detailed (occupying a substantial appendix), and the logical structure is clear throughout. The paper correctly identifies and addresses technical subtleties, such as the need to avoid conditioning before invoking the SLLN in the non-convergence argument.
Theoretical impact: This work establishes the first non-trivial depth-infinite limit for compositional GPs, complementing the well-understood infinite-width limits (NNGP kernels) that reduce to Gaussian processes. The sharp threshold provides a precise characterization of when depth "matters" versus when it destroys structure. This could influence how practitioners parameterize deep GP models.
Connections to adjacent fields: The paper connects to the theory of iterated random functions (Diaconis & Freedman, 1999), but in a genuinely infinite-dimensional setting that goes beyond classical parametric families. The phase transition result may inspire analogous investigations for other kernel families (Matérn, polynomial) or other compositional architectures.
Practical implications: The finding that the bandwidth window in which the non-trivial structure is visible becomes increasingly narrow with dimension matters directly in practice. It explains why practitioners working in high dimensions may never observe non-degenerate deep GP behavior without knowing the precise threshold: the "interesting" bandwidth window is too narrow to find by accident. The paper provides explicit formulas for selecting bandwidth parameters to achieve desired dependence strength.
Limitations for practice: The results are currently specific to the RBF kernel and the composition class of DGPs. Real-world DGP implementations often use variational approximations and different kernel families, so the direct practical applicability is limited. The paper also does not address convergence rates, which would be essential for finite-depth applications.
Understanding the properties of deep probabilistic models is a central concern in modern machine learning. While much attention has focused on infinite-width neural network limits (NTK, NNGP), deep GPs represent an important alternative that maintains non-Gaussianity at finite depth. The gap between what was known (degeneration for large bandwidth) and the full picture (existence of non-trivial limits) was a genuine open problem, and closing it is timely.
Minor observations: The paper is well-written with clear notation and good use of proof sketches in the main text. The experimental section, while simple, effectively validates the theory. The comparison with the neural network literature (synchronization in weight space vs. function space) is instructive and well-placed.
Generated Jun 9, 2026
While Paper 1 provides rigorous and foundational theoretical insights into Deep Gaussian Processes, Paper 2 is highly timely and relevant to the current boom in large language models. By analyzing the sparsity and geometry of on-policy distillation, Paper 2 offers actionable empirical insights that directly impact modern LLM and VLM post-training recipes, suggesting broader and more immediate real-world applications in optimizing large-scale model deployment.
Paper 1 makes fundamental theoretical contributions to understanding deep Gaussian processes, establishing sharp phase transitions and proving the existence of non-trivial, non-Gaussian limiting distributions. This advances core mathematical understanding of deep probabilistic models with broad implications across Bayesian deep learning and probability theory. Paper 2 addresses a practical engineering problem (safe on-device LLM deployment) with incremental contributions—combining existing techniques (soft prompts, distillation, parameter-efficient methods). While useful, it is more applied and narrower in scope, with findings likely to be superseded as LLM architectures evolve.
Paper 1 provides tight VC dimension bounds for Transformers and chain-of-thought learning, directly addressing the theoretical foundations of the most impactful architecture in modern AI. Its results are broadly relevant to understanding generalization in large language models and chain-of-thought prompting, both topics of immense current interest. Paper 2, while mathematically rigorous and novel in characterizing deep GP limits, addresses a narrower topic (compositional GPs) with less immediate practical relevance and a smaller research community. The timeliness and breadth of impact of Transformer theory give Paper 1 the edge.
Paper 1 focuses on diffusion models, a highly popular and widely applied area in generative AI. By introducing a novel metric (ICR) to detect early memorization without external datasets, it offers immediate, practical utility for training large models. While Paper 2 provides rigorous and important theoretical advancements for Deep Gaussian Processes, its scope is narrower and less likely to drive broad, cross-disciplinary applications compared to the insights provided by Paper 1.
Paper 2 likely has higher impact: it addresses a timely, practical failure mode in diffusion/score-based generative modeling (size extrapolation), provides a clear diagnostic theory grounded in Tweedie’s formula and reverse diffusion, and contributes an actionable benchmark (FDLF) with exact controllable ground truth—facilitating reproducibility and broad adoption. Its implications span scientific ML domains needing size transfer (physics, chemistry, materials). Paper 1 is mathematically strong and novel for deep GP theory, but is more specialized with narrower near-term application and audience.
Paper 2 addresses an urgent and highly relevant problem in AI: privacy and safety in Large Language Models via machine unlearning. Its proposed few-shot approach has immediate, widespread real-world applications across various domains deploying LLMs. While Paper 1 offers rigorous theoretical contributions to deep Gaussian Processes, its impact is largely confined to a specific theoretical machine learning niche. Paper 2's timeliness, practical utility, and broader applicability give it a much higher potential for significant scientific and societal impact.
Paper 2 addresses the highly timely and critical area of LLM decision-making and AI alignment. By localizing and steering temporal preferences, it offers broad real-world applications in safe AI planning. While Paper 1 provides rigorous foundational theory for Gaussian Processes, Paper 2's focus on mechanistic interpretability of state-of-the-art models gives it significantly higher potential for immediate, cross-disciplinary impact in AI safety and cognitive science.
Paper 1 addresses a critical computational bottleneck in modern LLM training—efficient reinforcement learning for long-context reasoning. By achieving over 2x speedups for state-of-the-art models, its methods have immediate potential for widespread adoption in both industry and academia. While Paper 2 offers strong theoretical contributions to Deep Gaussian Processes, Paper 1's timely relevance to the rapidly expanding field of LLM reasoning gives it significantly higher potential for broad scientific and real-world impact.
Paper 1 makes fundamental theoretical contributions to understanding deep Gaussian processes, establishing sharp phase transition thresholds and proving the existence of non-trivial, non-Gaussian limiting distributions. These results advance foundational understanding of deep Bayesian models with lasting theoretical significance. Paper 2 proposes an incremental engineering contribution (CVAformer) combining LLMs with time series forecasting using causal disentanglement, but operates in a crowded applied space where methods are quickly superseded. Paper 1's mathematical rigor and novel theoretical insights have broader and more durable impact across multiple research communities.
Paper 2 likely has higher impact due to fundamental theoretical contributions to deep Gaussian processes: it proves a sharp depth–bandwidth threshold and identifies a novel non-degenerate, non-Gaussian limiting regime with dependence structure. These results clarify when deep GP priors are meaningful, informing kernel choice, model design, and theory across Bayesian deep learning and probabilistic numerics. The claims are broadly relevant and timely for understanding depth limits in compositional models. Paper 1 is innovative and application-relevant for adaptive recruitment, but its impact is more domain-specific and relies on simulation-based validation.