Florian Hübler, Thomas Pethick, Suvrit Sra
Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded -th central moments, . We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for matrices, Muon finds an -stationary point in nuclear norm within samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.
This paper addresses a timely theoretical gap: why do non-Euclidean optimizers like Muon, which perform steepest descent in the spectral norm geometry, empirically outperform Euclidean methods for training Transformers? The authors provide an answer through the lens of heavy-tailed stochastic gradient noise.
The central insight is that when convergence is measured in the nuclear norm (the dual of Muon's spectral norm), Muon achieves sample complexity O(min{m,n} · Δ₁L/ε² · (σ/ε)^{p/(p-1)}) for p-th moment bounded noise (p ∈ (1,2]), whereas Euclidean methods require O(min{m,n}^{(3p-2)/(2(p-1))} · Δ₁L/ε² · (σ/ε)^{p/(p-1)}). The dimension-dependent gap is substantial — for p=1.5 and modern transformer dimensions, this is ~6 orders of magnitude. Crucially, they prove this rate is first-order optimal via matching lower bounds.
The theoretical framework is mathematically sophisticated and well-constructed:
1. Upper bounds: The key technical innovation is using Banach space martingale inequalities (Theorem 2.4, building on p-uniform smoothness) to control gradient noise directly in the native geometry, avoiding the lossy reduction to Euclidean norms via norm equivalence constants. The proof strategy for Muon (Theorem 3.5) cleverly uses norm equivalence only between the nuclear norm and the Schatten-p norm (rather than all the way to Frobenius), minimizing dimensional blowup.
2. Lower bounds: Two complementary lower bounds are provided. Theorem 3.7 gives a clean construction matching the dimension and σ dependence through a coupon-collector argument. Theorem 3.8 provides exact ε-dependence matching via embedding κ copies of the Euclidean hard instance into matrices, though it requires a high-dimensional construction and is restricted to zero-respecting algorithms.
3. Generality: Results extend to (L₀,L₁)-smoothness, arbitrary q-uniformly convex norms, constrained optimization (Frank-Wolfe setting), and multi-layer networks.
One careful aspect: the noise assumption is made in the native dual norm geometry (E[‖∇f(x,ξ) - ∇F(x)‖*^p] ≤ σ^p), which is stronger than assuming Frobenius-norm bounded moments. The authors acknowledge this and empirically validate that the ratio σ_{S₁}/σ_F stays well below the worst-case √min{m,n}, which strengthens the practical relevance.
Theoretical: This work establishes a clean separation between Euclidean and non-Euclidean methods under heavy-tailed noise — the first such result for matrix-valued optimization. The Banach space martingale tools (Theorem 2.4 applied via uniform smoothness) provide a reusable framework for analyzing non-Euclidean algorithms beyond this specific setting. The first-order optimality proof for Muon is particularly impactful, as it closes the complexity question for nuclear-norm stationarity.
Practical: The surprising finding that intermediate Schatten geometries (r ≈ p/(p-1) ≈ 3 for p ≈ 1.5) can match Muon's performance has practical implications, though the authors correctly note that Muon remains preferable due to efficient Newton-Schulz computation. The empirical observation connecting noise geometry ratios to which layers benefit from Muon vs. Adam provides actionable insights for practitioners.
Broader influence: As Muon is being adopted in production systems (Kimi K2, DeepSeek V4), rigorous theoretical understanding is valuable for the community. The framework naturally extends to other LMO-based optimizers.
This paper is extremely timely. Muon has seen rapid adoption in 2024-2025 but theoretical understanding has lagged behind. The heavy-tailed noise perspective is well-motivated by empirical evidence from language modeling, and the paper connects to the concurrent work of Zhang & Lin (2026), clearly articulating why native-geometry assumptions yield strictly better guarantees.
This is a strong theoretical contribution that provides the first rigorous explanation for why Muon outperforms Euclidean methods, grounded in the natural heavy-tailed noise structure of deep learning. The combination of tight upper/lower bounds, novel functional-analytic tools, and well-designed experiments makes this a complete and convincing paper that will likely influence both the theory and practice of non-Euclidean optimization.
Generated Jun 15, 2026
Paper 2 offers foundational theoretical justifications for non-Euclidean optimizers like Muon in training LLMs. By proving optimal sample complexity bounds in heavy-tailed regimes, it addresses a fundamental gap in deep learning optimization theory. While Paper 1 tackles a highly relevant real-world issue (AI energy consumption), Paper 2's theoretical breakthroughs have broader implications for the core methodology of training state-of-the-art AI models, likely leading to widespread algorithmic adoption and extensive follow-up theoretical research across the broader machine learning community.
Paper 2 has higher potential impact due to broader applicability and clearer real-world relevance: joint state/topology inference in nonlinear networked dynamical systems spans biology, transportation, neuroscience, and engineering. It offers an integrated method (Koopman lifting + Kalman filtering + convex group-sparse ADMM) with certified convergence and explicit, decomposed MSE error bounds, strengthening methodological rigor. Its demonstrated performance on multiple synthetic and real datasets suggests near-term deployability. Paper 1 is theoretically strong and timely for deep learning optimization, but its impact is narrower (mainly LLM/optimization theory) and depends on adoption of specific non-Euclidean optimizers.
Paper 1 has higher likely scientific impact due to a stronger novelty-to-breadth ratio: it provides optimal (and provably dimension-improving) sample-complexity theory for non-Euclidean matrix-valued optimization under heavy-tailed noise, directly explaining empirically successful methods (Muon/Scion) and suggesting broader Schatten-geometry design. This advances core optimization theory with implications across deep learning training and stochastic nonconvex optimization. Paper 2 is timely and practically important for LLM serving, but its impact is more systems-specific; its dynamical-model insights may influence scheduling, yet the theoretical framework is narrower in cross-field reach.
Paper 2 offers higher potential impact due to a novel theoretical explanation for recently successful non-Euclidean optimizers in modern deep learning, with tight (optimal) sample-complexity results in heavy-tailed regimes and experiments on large language models. Its implications span optimization theory, matrix geometry, and practical training of Transformers, making it timely and broadly relevant. Paper 1 is a strong software/engineering contribution with wide adoption potential, but it is primarily incremental advances to an established modeling framework rather than a new theoretical breakthrough.
Paper 1 addresses the optimization of Transformer models, an area with massive current interest and broad applications in large language models. Providing theoretical justification for empirically successful methods (like Muon) under heavy-tailed noise directly impacts cutting-edge deep learning research. Paper 2's acceleration of SAV methods for PINNs is methodologically rigorous and useful, but its scope and potential community impact are more niche compared to the foundational advancements in LLM training offered by Paper 1.
Paper 1 has higher potential scientific impact due to its novel theoretical contribution: optimal sample complexity guarantees for non-Euclidean, matrix-valued optimizers under heavy-tailed non-convex noise, including tight dimension dependence and lower bounds for all first-order methods. This advances core optimization/ML theory with broad relevance to training large Transformers and other matrix-structured problems, and is timely given interest in Muon/Scion. Paper 2 is a solid, applied operations study with localized scope (single airport, scenario assumptions) and narrower cross-field generality.
Paper 1 addresses a timely and high-impact problem: providing theoretical justification for the empirical success of non-Euclidean optimizers (Muon, Scion) used in training Transformers. It bridges theory and practice in deep learning optimization, proves optimal sample complexity bounds in the heavy-tailed regime, and suggests new practical directions (alternative Schatten geometries). Its breadth of impact spans optimization theory, deep learning, and LLM training. Paper 2, while mathematically rigorous, addresses a more niche topic in stochastic game theory with narrower immediate applicability and audience.
Paper 1 addresses the theoretical foundations of optimization methods specifically used for training Transformer models and Large Language Models. Its focus on heavy-tailed, non-convex regimes perfectly aligns with modern deep learning challenges. Providing theoretical justification for empirically successful algorithms like Muon ensures high relevance and timeliness. In contrast, Paper 2 focuses on non-separable objectives with local convergence under convexity, which is less applicable to cutting-edge AI problems, giving Paper 1 a broader and more significant impact in the current research landscape.
Paper 2 is more likely to have higher impact: it provides a timely theoretical explanation for empirically successful optimizers in modern deep learning (Transformers), addressing a widely felt gap. The results give dimension-improved, heavy-tailed nonconvex sample-complexity guarantees under stronger stationarity notions, plus matching lower bounds (methodological rigor and novelty). Its implications span optimization, learning theory, and practical large-scale ML training. Paper 1 is rigorous and useful for network interdiction, but its impact is narrower to operations research/security and less broadly cross-field than optimization theory for LLM training.
Paper 1 addresses a highly timely topic—providing theoretical foundations for Muon and related non-Euclidean optimizers that are gaining rapid adoption in large-scale Transformer training. It establishes optimal sample complexity bounds in the heavy-tailed setting and reveals that matrix-valued updates avoid dimension-dependent costs, directly explaining empirical observations. The breadth of impact is large, spanning optimization theory and practical deep learning. Paper 2 makes a solid contribution to Koopman-based control theory but addresses a more niche problem with narrower immediate impact. Paper 1's relevance to the current wave of LLM training research gives it substantially higher potential impact.