Back to Rankings

Free Heavy-Tailed Lunch for Muon: A Theoretical Justification of Empirical Success

Florian Hübler, Thomas Pethick, Suvrit Sra

Jun 12, 2026arXiv:2606.14560v1
math.OCcs.LGstat.ML
Share
#4 of 151 · math.OC
Tournament Score
1578±37
11001650
93%
Win Rate
42
Wins
3
Losses
45
Matches
Rating
8/ 10
Significance8
Rigor8.5
Novelty7.5
Clarity8.5

Abstract

Non-Euclidean optimisation methods with matrix-valued updates, such as Muon and Scion, have recently shown strong empirical performance for training Transformer models, yet their theoretical advantages over Euclidean methods remain poorly understood. We address this gap in the heavy-tailed non-convex regime, where stochastic gradients have bounded pp-th central moments, p(1,2]p \in (1,2]. We show that certain non-Euclidean methods achieve optimal sample complexity under stronger stationarity measures, while Euclidean methods incur additional dimension-dependent costs. As a consequence, for m×nm \times n matrices, Muon finds an ε\varepsilon-stationary point in nuclear norm within O(min{m,n}Δ1Lε2(σε)pp1)\mathcal{O}\left(\min\{m, n\} \frac{Δ_1 L}{\varepsilon^2} \left(\frac σ\varepsilon \right)^{\frac p {p-1}}\right) samples, absorbing heavy-tailed noise without extra dimension dependence, unlike Euclidean methods. We further prove this sample complexity, including its dimension dependence, is optimal for all first-order methods under nuclear-norm stationarity. Experiments on large language models support our theory. Surprisingly, our results suggest that other Schatten geometries beyond the spectral geometry of Muon can perform competitively in certain settings.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper addresses a timely theoretical gap: why do non-Euclidean optimizers like Muon, which perform steepest descent in the spectral norm geometry, empirically outperform Euclidean methods for training Transformers? The authors provide an answer through the lens of heavy-tailed stochastic gradient noise.

The central insight is that when convergence is measured in the nuclear norm (the dual of Muon's spectral norm), Muon achieves sample complexity O(min{m,n} · Δ₁L/ε² · (σ/ε)^{p/(p-1)}) for p-th moment bounded noise (p ∈ (1,2]), whereas Euclidean methods require O(min{m,n}^{(3p-2)/(2(p-1))} · Δ₁L/ε² · (σ/ε)^{p/(p-1)}). The dimension-dependent gap is substantial — for p=1.5 and modern transformer dimensions, this is ~6 orders of magnitude. Crucially, they prove this rate is first-order optimal via matching lower bounds.

Methodological Rigor

The theoretical framework is mathematically sophisticated and well-constructed:

1. Upper bounds: The key technical innovation is using Banach space martingale inequalities (Theorem 2.4, building on p-uniform smoothness) to control gradient noise directly in the native geometry, avoiding the lossy reduction to Euclidean norms via norm equivalence constants. The proof strategy for Muon (Theorem 3.5) cleverly uses norm equivalence only between the nuclear norm and the Schatten-p norm (rather than all the way to Frobenius), minimizing dimensional blowup.

2. Lower bounds: Two complementary lower bounds are provided. Theorem 3.7 gives a clean construction matching the dimension and σ dependence through a coupon-collector argument. Theorem 3.8 provides exact ε-dependence matching via embedding κ copies of the Euclidean hard instance into matrices, though it requires a high-dimensional construction and is restricted to zero-respecting algorithms.

3. Generality: Results extend to (L₀,L₁)-smoothness, arbitrary q-uniformly convex norms, constrained optimization (Frank-Wolfe setting), and multi-layer networks.

One careful aspect: the noise assumption is made in the native dual norm geometry (E[‖∇f(x,ξ) - ∇F(x)‖*^p] ≤ σ^p), which is stronger than assuming Frobenius-norm bounded moments. The authors acknowledge this and empirically validate that the ratio σ_{S₁}/σ_F stays well below the worst-case √min{m,n}, which strengthens the practical relevance.

Potential Impact

Theoretical: This work establishes a clean separation between Euclidean and non-Euclidean methods under heavy-tailed noise — the first such result for matrix-valued optimization. The Banach space martingale tools (Theorem 2.4 applied via uniform smoothness) provide a reusable framework for analyzing non-Euclidean algorithms beyond this specific setting. The first-order optimality proof for Muon is particularly impactful, as it closes the complexity question for nuclear-norm stationarity.

Practical: The surprising finding that intermediate Schatten geometries (r ≈ p/(p-1) ≈ 3 for p ≈ 1.5) can match Muon's performance has practical implications, though the authors correctly note that Muon remains preferable due to efficient Newton-Schulz computation. The empirical observation connecting noise geometry ratios to which layers benefit from Muon vs. Adam provides actionable insights for practitioners.

Broader influence: As Muon is being adopted in production systems (Kimi K2, DeepSeek V4), rigorous theoretical understanding is valuable for the community. The framework naturally extends to other LMO-based optimizers.

Timeliness & Relevance

This paper is extremely timely. Muon has seen rapid adoption in 2024-2025 but theoretical understanding has lagged behind. The heavy-tailed noise perspective is well-motivated by empirical evidence from language modeling, and the paper connects to the concurrent work of Zhang & Lin (2026), clearly articulating why native-geometry assumptions yield strictly better guarantees.

Strengths

  • Clean theoretical narrative: The progression from general Banach space results → uniformly convex norms → Muon → lower bounds is well-structured and each result builds naturally.
  • Tight bounds: Upper and lower bounds match in dimension and noise dependence, with exact ε-matching under mild conditions.
  • Novel technical tools: The martingale deviation bound (Lemma 3.3) in general norms is a genuinely useful contribution beyond this paper.
  • Empirical validation: Experiments are well-designed — the Schatten-r sweep precisely validates the theoretical prediction about optimal geometry, and noise ratio measurements connect theory to practice.
  • Optimality of Muon: Lemma C.7 shows Schatten-p is the best possible p-uniformly smooth norm for bounding nuclear norm, justifying Muon's automatic adaptation.
  • Limitations

  • Scale of experiments: Only 70M parameter models on one dataset. Validating at larger scales would strengthen claims significantly.
  • Expectation bounds only: High-probability guarantees remain open, which matters for practical reliability.
  • Idealized algorithm: Analysis assumes exact LMO computation, while practical Muon uses approximate Newton-Schulz iterations.
  • Noise assumption shift: While empirically validated, the native-geometry noise assumption is non-standard and complicates direct comparison with prior work.
  • Gap in lower bounds: Theorem 3.7 doesn't match ε-dependence; Theorem 3.8 requires the zero-respecting restriction and high dimensionality.
  • Overall Assessment

    This is a strong theoretical contribution that provides the first rigorous explanation for why Muon outperforms Euclidean methods, grounded in the natural heavy-tailed noise structure of deep learning. The combination of tight upper/lower bounds, novel functional-analytic tools, and well-designed experiments makes this a complete and convincing paper that will likely influence both the theory and practice of non-Euclidean optimization.

    Rating:8/ 10
    Significance 8Rigor 8.5Novelty 7.5Clarity 8.5

    Generated Jun 15, 2026

    Comparison History (45)

    Wonvs. Contextual Robust Optimization for AI Data Center Scheduling with Statistical Guarantees

    Paper 2 offers foundational theoretical justifications for non-Euclidean optimizers like Muon in training LLMs. By proving optimal sample complexity bounds in heavy-tailed regimes, it addresses a fundamental gap in deep learning optimization theory. While Paper 1 tackles a highly relevant real-world issue (AI energy consumption), Paper 2's theoretical breakthroughs have broader implications for the core methodology of training state-of-the-art AI models, likely leading to widespread algorithmic adoption and extensive follow-up theoretical research across the broader machine learning community.

    gemini-3.1-pro-preview·Jun 17, 2026
    Lostvs. Koopman Lifting with Certified Error Bounds for Joint Inference in Nonlinear Networks

    Paper 2 has higher potential impact due to broader applicability and clearer real-world relevance: joint state/topology inference in nonlinear networked dynamical systems spans biology, transportation, neuroscience, and engineering. It offers an integrated method (Koopman lifting + Kalman filtering + convex group-sparse ADMM) with certified convergence and explicit, decomposed MSE error bounds, strengthening methodological rigor. Its demonstrated performance on multiple synthetic and real datasets suggests near-term deployability. Paper 1 is theoretically strong and timely for deep learning optimization, but its impact is narrower (mainly LLM/optimization theory) and depends on adoption of specific non-Euclidean optimizers.

    gpt-5.2·Jun 17, 2026
    Wonvs. Service-Induced Congestion in Memory-Constrained LLM Serving

    Paper 1 has higher likely scientific impact due to a stronger novelty-to-breadth ratio: it provides optimal (and provably dimension-improving) sample-complexity theory for non-Euclidean matrix-valued optimization under heavy-tailed noise, directly explaining empirically successful methods (Muon/Scion) and suggesting broader Schatten-geometry design. This advances core optimization theory with implications across deep learning training and stochastic nonconvex optimization. Paper 2 is timely and practically important for LLM serving, but its impact is more systems-specific; its dynamical-model insights may influence scheduling, yet the theoretical framework is narrower in cross-field reach.

    gpt-5.2·Jun 16, 2026
    Wonvs. CVXPY 1.9: Recent Advances in Optimization Modeling Software

    Paper 2 offers higher potential impact due to a novel theoretical explanation for recently successful non-Euclidean optimizers in modern deep learning, with tight (optimal) sample-complexity results in heavy-tailed regimes and experiments on large language models. Its implications span optimization theory, matrix geometry, and practical training of Transformers, making it timely and broadly relevant. Paper 1 is a strong software/engineering contribution with wide adoption potential, but it is primarily incremental advances to an established modeling framework rather than a new theoretical breakthrough.

    gpt-5.2·Jun 16, 2026
    Wonvs. Accelerating SAV-based optimization via randomized low-rank Hessian approximation

    Paper 1 addresses the optimization of Transformer models, an area with massive current interest and broad applications in large language models. Providing theoretical justification for empirically successful methods (like Muon) under heavy-tailed noise directly impacts cutting-edge deep learning research. Paper 2's acceleration of SAV methods for PINNs is methodologically rigorous and useful, but its scope and potential community impact are more niche compared to the foundational advancements in LLM training offered by Paper 1.

    gemini-3.1-pro-preview·Jun 15, 2026
    Wonvs. Short-Range IAM Operations to Overcome Access Gaps Around Bologna Airport

    Paper 1 has higher potential scientific impact due to its novel theoretical contribution: optimal sample complexity guarantees for non-Euclidean, matrix-valued optimizers under heavy-tailed non-convex noise, including tight dimension dependence and lower bounds for all first-order methods. This advances core optimization/ML theory with broad relevance to training large Transformers and other matrix-structured problems, and is timely given interest in Muon/Scion. Paper 2 is a solid, applied operations study with localized scope (single airport, scenario assumptions) and narrower cross-field generality.

    gpt-5.2·Jun 15, 2026
    Wonvs. A Nonlinear Snell Envelope Representation for Path-Dependent Controller-Stopper Games

    Paper 1 addresses a timely and high-impact problem: providing theoretical justification for the empirical success of non-Euclidean optimizers (Muon, Scion) used in training Transformers. It bridges theory and practice in deep learning optimization, proves optimal sample complexity bounds in the heavy-tailed regime, and suggests new practical directions (alternative Schatten geometries). Its breadth of impact spans optimization theory, deep learning, and LLM training. Paper 2, while mathematically rigorous, addresses a more niche topic in stochastic game theory with narrower immediate applicability and audience.

    claude-opus-4-6·Jun 15, 2026
    Wonvs. A stochastic gradient algorithm for non-separable optimization with convergence guarantee

    Paper 1 addresses the theoretical foundations of optimization methods specifically used for training Transformer models and Large Language Models. Its focus on heavy-tailed, non-convex regimes perfectly aligns with modern deep learning challenges. Providing theoretical justification for empirically successful algorithms like Muon ensures high relevance and timeliness. In contrast, Paper 2 focuses on non-separable objectives with local convergence under convexity, which is less applicable to cutting-edge AI problems, giving Paper 1 a broader and more significant impact in the current research landscape.

    gemini-3.1-pro-preview·Jun 15, 2026
    Wonvs. Supermodularity and Submodularity in Network Interdiction

    Paper 2 is more likely to have higher impact: it provides a timely theoretical explanation for empirically successful optimizers in modern deep learning (Transformers), addressing a widely felt gap. The results give dimension-improved, heavy-tailed nonconvex sample-complexity guarantees under stronger stationarity notions, plus matching lower bounds (methodological rigor and novelty). Its implications span optimization, learning theory, and practical large-scale ML training. Paper 1 is rigorous and useful for network interdiction, but its impact is narrower to operations research/security and less broadly cross-field than optimization theory for LLM training.

    gpt-5.2·Jun 15, 2026
    Wonvs. Koopman Modeling and Stabilization of Discrete-Time Nonlinear Control Systems: Bilinearity on a Reproducing Kernel Hilbert Space

    Paper 1 addresses a highly timely topic—providing theoretical foundations for Muon and related non-Euclidean optimizers that are gaining rapid adoption in large-scale Transformer training. It establishes optimal sample complexity bounds in the heavy-tailed setting and reveals that matrix-valued updates avoid dimension-dependent costs, directly explaining empirical observations. The breadth of impact is large, spanning optimization theory and practical deep learning. Paper 2 makes a solid contribution to Koopman-based control theory but addresses a more niche problem with narrower immediate impact. Paper 1's relevance to the current wave of LLM training research gives it substantially higher potential impact.

    claude-opus-4-6·Jun 15, 2026