Andries Rosseau, Robert Müller, Ann Nowé
Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.
This paper establishes a principled theoretical connection between plasticity loss in continual learning and the anisotropy of the empirical Neural Tangent Kernel (NTK), then identifies dynamical isometry—the condition that layer-wise Jacobian singular values remain near one—as a tractable surrogate for preserving task-agnostic NTK isotropy. The key insight is that task-agnostic plasticity requires gradient descent to make comparable progress in all output-space directions, which is governed by the condition number of the empirical NTK rather than its scale.
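To make the scale-versus-conditioning distinction concrete, here is a minimal NumPy sketch; it is not code from the paper. It assumes a linearized squared-loss setting in which the residual along each NTK eigendirection shrinks by a factor (1 - step_size * eigenvalue) per gradient step. The function names and the random Jacobian are purely illustrative.

```python
import numpy as np

def empirical_ntk(jacobian):
    """Empirical NTK from a Jacobian of shape (n_outputs, n_params)."""
    return jacobian @ jacobian.T

def condition_number(kernel):
    """Ratio of largest to smallest eigenvalue of a symmetric PD kernel."""
    eigvals = np.linalg.eigvalsh(kernel)  # returned in ascending order
    return eigvals[-1] / eigvals[0]

def slowest_residual(kernel, steps):
    """Fraction of error left in the slowest eigendirection after `steps`
    gradient steps with step size 1 / lambda_max (assumed linearized
    squared-loss dynamics, not the paper's plasticity definition)."""
    return (1.0 - 1.0 / condition_number(kernel)) ** steps

rng = np.random.default_rng(0)
J = rng.standard_normal((4, 32))    # hypothetical Jacobian
K = empirical_ntk(J)
K_scaled = empirical_ntk(10.0 * J)  # 100x larger kernel, same conditioning

isotropic = np.eye(3)
anisotropic = np.diag([1.0, 0.01])
```

Rescaling the Jacobian multiplies every eigenvalue by the same factor and leaves the condition number unchanged, whereas the anisotropic kernel still retains more than a third of its error in the slow direction after 100 steps.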
The paper makes several concrete contributions: (1) a formal definition of plasticity as the expected fraction of excess achievable loss eliminated under a resource budget, connecting it to NTK geometry; (2) revisiting GroupSort/MaxMin networks as existence proofs that almost-everywhere isometry is compatible with universal Lipschitz approximation; (3) an efficient Gram-deviation regularizer for general architectures with a novel dead-ReLU revival mechanism; and (4) AdamO, an optimizer that decouples isometry regularization from adaptive gradient moments, analogous to AdamW's treatment of weight decay.
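The review does not reproduce the paper's regularizer or the AdamO update rule, so the sketch below is an illustration under stated assumptions rather than the authors' method. It assumes the penalty R(W) = 0.25 * ||W^T W - I||_F^2, whose gradient is W (W^T W - I), and applies it outside the Adam moment estimates in the way AdamW applies weight decay. The names `gram_deviation`, `decoupled_isometry_step`, and the hyperparameter `lam` are hypothetical.

```python
import numpy as np

def gram_deviation(W):
    """Assumed penalty: 0.25 * ||W^T W - I||_F^2 (zero iff columns are orthonormal)."""
    G = W.T @ W - np.eye(W.shape[1])
    return 0.25 * np.sum(G ** 2)

def gram_deviation_grad(W):
    """Gradient of the assumed penalty with respect to W."""
    return W @ (W.T @ W - np.eye(W.shape[1]))

def decoupled_isometry_step(W, grad, m, v, t, lr=1e-3, lam=0.1,
                            beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step on the task gradient plus a decoupled isometry pull.

    Only `grad` (the task-loss gradient) enters the moments m and v; the
    isometry term is applied afterwards, by analogy with AdamW's weight decay.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    W = W - lr * m_hat / (np.sqrt(v_hat) + eps)
    W = W - lr * lam * gram_deviation_grad(W)
    return W, m, v

# With a zero task gradient, only the isometry pull acts on the weights.
W0 = np.diag([2.0, 0.5])
zeros = np.zeros_like(W0)
W1, _, _ = decoupled_isometry_step(W0, zeros, zeros, zeros, t=1, lr=0.5, lam=0.1)
```

In this toy case both singular values move toward one (2.0 to 1.7 and 0.5 to 0.51875) even though the adaptive moments never see the regularizer, which is the decoupling the review describes.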
The theoretical development is generally sound and well-structured. The chain from plasticity definition → NTK anisotropy → dynamical isometry is logically clear. The derivation showing that layer-wise isometry (not just end-to-end) is required for NTK isotropy—because the NTK decomposes as a sum of per-layer terms coupling forward Gram matrices with backward sensitivity matrices (Eq. 7-8)—is a valuable clarification.
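For readers who want the structure spelled out, the per-layer decomposition takes the following standard form for a bias-free multilayer perceptron. The notation (h_l for the features entering layer l+1, z_l = W_l h_{l-1} for pre-activations, B_l for the Jacobian of the output with respect to z_l) is assumed here and may differ from the paper's Eq. 7-8.

```latex
\Theta(x, x') \;=\; \sum_{\ell=1}^{L}
  \underbrace{\big\langle h_{\ell-1}(x),\, h_{\ell-1}(x') \big\rangle}_{\text{forward Gram term}}
  \;\underbrace{B_\ell(x)\, B_\ell(x')^{\top}}_{\text{backward sensitivity term}},
\qquad
B_\ell(x) \;=\; \frac{\partial f(x)}{\partial z_\ell(x)} .
```

Each layer contributes its own product of a forward and a backward factor, which is consistent with the point that a well-conditioned end-to-end map alone does not keep every summand well conditioned.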
However, several caveats apply. The connection between dynamical isometry and NTK isotropy is demonstrated rigorously only for the deep linear case (Eq. 9). For nonlinear networks, the argument is qualitative: the data-dependent diagonal matrices Dℓ(x) from nonlinear activations mean that weight isometry is necessary but not sufficient. The paper acknowledges this but could be more precise about how large the gap can be in practice. The claim that the regularizer revives dead ReLUs is argued mechanistically (Section 4.3) but lacks formal guarantees—it provides a "plausible reactivation mechanism" rather than a proof.
The experimental evaluation covers multiple supervised benchmarks (Random-Label Memorization, Permuted MNIST/CIFAR-10, Label-Shuffled CIFAR-100) and RL environments (MinAtar, Octax), using 8 seeds. The breadth is commendable. The extensive diagnostic appendix (weight spectra, NTK statistics, Jacobian conditioning, dormant neurons) provides strong empirical support for the theoretical claims. The sensitivity analysis (Figure 3) shows reasonable robustness to the regularization strength hyperparameter.
One weakness is that some comparisons are not decisive. The paper positions itself against NaP, ReDo, L2 Init, and spectral norm regularization, but the margins over NaP on some benchmarks (e.g., Label-Shuffled CIFAR-100) are small because performance saturates. The RL experiments, while extensive in diagnostics, show only modest absolute improvements in some games.
The paper has significant potential impact in several directions:
Continual learning: By providing a unifying geometric lens through which multiple existing plasticity-preserving methods can be understood (Section 5 is particularly insightful), the paper elevates the discourse from symptom-based fixes to principled design. The observation that NaP controls mean squared singular values, spectral norm regularization controls only the maximum, while full isometry controls the entire spectrum, is an elegant unification.
Deep RL: The suggestion that dynamical isometry could enable genuinely deep RL networks is compelling. RL architectures are typically kept shallow precisely because depth destabilizes training—if isometry-preserving methods can change this, the impact on RL practice could be substantial.
Large language models: The paper speculates about applications to LLMs, noting that residual connections maintain near-isometry at initialization but drift away during training. This is a timely observation given current interest in continual pre-training and fine-tuning.
Practical adoption: AdamO is designed for easy integration (drop-in replacement for Adam, ~4-5% memory overhead), which lowers the barrier to adoption.
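The contrast drawn above between mean-squared, maximum, and full-spectrum control of singular values can be checked numerically. The sketch below is illustrative only: the helper `spectrum_summary` and the example matrices are assumptions, not taken from the paper.

```python
import numpy as np

def spectrum_summary(W):
    """Three views of a weight matrix's singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return {
        "mean_sq": float(np.mean(s ** 2)),             # quantity the review attributes to NaP
        "spectral_norm": float(np.max(s)),             # quantity bounded by spectral norm regularization
        "isometry_gap": float(np.max(np.abs(s - 1))),  # zero only if every singular value equals 1
    }

# Mean squared singular value is 1, yet the matrix is far from isometric.
unit_mean_sq = spectrum_summary(np.diag([np.sqrt(1.9), np.sqrt(0.1)]))

# Spectral norm is 1, yet one direction is almost collapsed.
unit_spec_norm = spectrum_summary(np.diag([1.0, 0.01]))

# A rotation is exactly isometric: every singular value equals 1.
theta = 0.3
rotation = spectrum_summary(np.array([[np.cos(theta), -np.sin(theta)],
                                      [np.sin(theta),  np.cos(theta)]]))
```

The first two matrices satisfy a partial criterion while leaving a large isometry gap; only the rotation satisfies all three at once.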
Plasticity loss is a recognized bottleneck in continual learning, as recent high-profile publications attest (Dohare et al., 2024 in Nature; Lyle et al., 2024). The paper enters a crowded space but distinguishes itself by providing theoretical grounding rather than another heuristic fix. The unification of prior methods through the dynamical isometry lens is timely, as the field is moving from empirical observations to principled understanding.
The relevance to LLM continual training and deep RL extends the work's reach beyond the immediate continual learning community.
The paper's recommendation of "AdamO + ReLU" for practical use is well-supported by the experiments. The dead ReLU revival mechanism, while not formally guaranteed, adds a novel angle to understanding how orthogonality regularization interacts with piecewise-linear activations. The extensive appendix (38 pages of supplementary material) demonstrates thoroughness but also suggests the core story could be communicated more concisely.
Generated Jun 9, 2026
Paper 2 likely has higher impact due to direct applicability to protein engineering and biotechnology, where data efficiency from sparse assays is a core bottleneck. Its kernel/GP approach is methodologically clear, interpretable, and can integrate evolutionary priors plus structure signals from foundation models, enabling broad use across binding, stability, and multi-task settings. This bridges modern protein ML with principled uncertainty-aware Bayesian modeling, relevant across computational biology, drug discovery, and design. Paper 1 is novel for continual learning optimization, but its impact is more specialized within deep learning research and may see slower real-world translation.
Paper 2 likely has higher impact: it targets a timely, high-growth area (agentic RL for LLMs with verifiable rewards) and proposes a broadly applicable rollout-budget allocation framework that can directly reduce compute and sample costs, an immediate real-world constraint. The prefix-level/tree allocation idea generalizes beyond specific tasks and may influence RL training pipelines for many agentic systems. Paper 1 is methodologically strong and conceptually interesting, but its practical adoption may be limited to continual-learning settings and depend on architectural and regularization details. Overall, Paper 2’s timeliness and applicability suggest larger near-term impact.
Paper 2 likely has higher impact: it addresses a central, widely relevant bottleneck in modern ML (continual learning plasticity loss), offers a unifying theoretical lens (NTK/plasticity tied to dynamical isometry), and proposes broadly usable, low-friction interventions (regularizer + AdamO) that can transfer across tasks and architectures. The applications span supervised and RL settings and may influence optimization, initialization, and continual learning research. Paper 1 is novel and rigorous for stochastic dynamics surrogates, but its immediate audience and cross-field adoption are likely narrower than a general-purpose optimizer/regularization framework in deep learning.
Paper 1 provides a deeper theoretical framework connecting plasticity loss to dynamical isometry via the Neural Tangent Kernel, proposes a novel optimizer (AdamO) with principled regularization, and reinterprets prior methods through a unifying lens. It addresses the fundamental and broadly relevant problem of continual learning plasticity with both theoretical grounding and practical solutions across supervised and RL domains. Paper 2 offers valuable empirical and theoretical insights into Muon's advantages but mainly describes and analyzes an existing optimizer rather than introducing a fundamentally new framework or method.
Paper 1 addresses a fundamental challenge in AI (continual learning) by introducing a novel theoretical framework and a concrete solution (AdamO optimizer). Its impact spans broad fields like supervised and reinforcement learning. In contrast, Paper 2 is narrower in scope, focusing on a specific application (energy forecasting), and acts primarily as a critique rather than proposing a rigorous new methodology. Paper 1 demonstrates significantly greater novelty, methodological rigor, and breadth of potential impact across the scientific community.
Paper 2 is likely higher impact due to timeliness and broad real-world relevance: it introduces a new threat model for multi-stage LLM post-training and exposes a compounding vulnerability (the “single-attacker illusion”) that can invalidate current safety/security evaluations. The results directly affect deployment, governance, and auditing across many LLM pipelines (SFT→DPO, SFT→PPO), with clear implications for defenses and benchmarking. Paper 1 is technically strong and useful for continual learning, but its immediate cross-domain and societal impact is narrower than LLM security.
Paper 1 addresses the fundamental problem of plasticity loss in continual learning, connecting it to dynamical isometry and the Neural Tangent Kernel and providing both theoretical insight and practical solutions (AdamO optimizer, isometry regularization). It has broad applicability across deep learning (supervised and RL), introduces a unifying theoretical lens for prior methods, and tackles a timely challenge as continual learning grows in importance. Paper 2, while methodologically sound, addresses a narrower sports analytics application with a limited dataset (7 matches) and more constrained cross-field impact.
Paper 1 addresses a fundamental and broad challenge in deep learning—loss of plasticity in continual learning. By leveraging theoretical insights like the Neural Tangent Kernel and dynamical isometry, it introduces a widely applicable optimizer (AdamO) that benefits both supervised and reinforcement learning. In contrast, Paper 2 tackles a highly specific, niche problem (modality deficiency in federated graph learning). Paper 1's general-purpose approach and theoretical depth give it significantly higher potential for broad scientific impact across multiple AI subfields.
Paper 1 offers a deeper theoretical contribution by connecting plasticity loss to dynamical isometry and the Neural Tangent Kernel, providing both theoretical insights and practical tools (AdamO optimizer, isometry regularization) applicable broadly across supervised and reinforcement learning. It unifies prior plasticity-preserving methods under a single framework. Paper 2 addresses an important but narrower problem (turn-level credit assignment for LLM agents without verifiers) with a clever but more domain-specific solution. Paper 1's foundational nature, broader applicability across continual learning settings, and novel theoretical lens give it higher long-term impact potential.
Paper 1 tackles the critical issue of plasticity loss in continual learning, a major bottleneck in modern AI and RL. By connecting plasticity to dynamical isometry and introducing a practical Adam-style optimizer (AdamO), it offers a theoretically grounded solution with immediate, widespread applicability. While Paper 2 presents an innovative approach to disentanglement using HRRs, the broader relevance and potential for wide adoption of a new optimizer in continual training scenarios give Paper 1 a higher estimated scientific impact.