The Spectral Dynamics and Noise Geometry of Muon

Pierfrancesco Beneventano, Mahmoud Abdelmoneum, Tomaso Poggio

cs.LG · math.OC · stat.ML
#3066 of 5669 · cs.LG
Tournament Score: 1392±43 (scale 1050–1750)
Win Rate: 52%
Wins: 11
Losses: 10
Matches: 21
Rating: 6.2/10
Significance: 6.5
Rigor: 6
Novelty: 7
Clarity: 7.5

Abstract

Muon replaces a matrix gradient $G = U\Sigma V^\top$ by its polar factor $UV^\top$. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.
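To make the polar-factor operation concrete, here is a minimal NumPy sketch. It computes $UV^\top$ from an exact SVD (an assumption made for clarity, not necessarily how Muon is implemented) and compares it with a gradient that is only rescaled to unit spectral norm. The random matrix `G`, the helper names, and the entropy normalization (singular values divided by their sum) are illustrative assumptions and are not taken from the paper.

```python
import numpy as np


def polar_factor(G):
    """Polar factor U @ Vt of G, computed from the thin SVD G = U diag(s) Vt."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def spectral_entropy(M, tol=1e-12):
    """Shannon entropy of the nonzero singular values, normalized to sum to 1.

    This normalization is an assumption of the sketch, not the paper's definition.
    """
    s = np.linalg.svd(M, compute_uv=False)
    s = s[s > tol]
    p = s / s.sum()
    return float(-(p * np.log(p)).sum())


rng = np.random.default_rng(0)
G = rng.standard_normal((6, 4))        # hypothetical gradient matrix

P = polar_factor(G)                    # polar (Muon-style) direction
G_unit = G / np.linalg.norm(G, 2)      # plain rescaling to spectral norm 1

sv_polar = np.linalg.svd(P, compute_uv=False)
sv_unit = np.linalg.svd(G_unit, compute_uv=False)

print("polar singular values:   ", np.round(sv_polar, 3))
print("rescaled singular values:", np.round(sv_unit, 3))
print("entropy polar / rescaled / max:",
      spectral_entropy(P), spectral_entropy(G_unit), np.log(4))
```

Under this definition the polar factor reaches the maximum entropy $\log 4$ for a rank-4 matrix, while rescaling leaves the entropy of `G` unchanged. This is one simple way to see why scalar gradient normalization alone cannot reproduce the flat update spectrum.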

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "The Spectral Dynamics and Noise Geometry of Muon"

1. Core Contribution

This paper provides a theoretical characterization of the Muon optimizer's implicit bias through the lens of spectral dynamics. The central insight is that Muon, which replaces the gradient G = UΣV⊤ with its polar factor UV⊤, induces a flat-spectrum bias rather than the low-rank (nuclear-norm minimizing) bias commonly assumed. The paper makes four concrete contributions:

  • A one-step spectral-entropy maximization result for the polar update (under alignment assumptions)
  • Exact singular-value dynamics for a projected polar flow on the interpolation manifold, with a measurement-dependent sign criterion for spectral flattening
  • Separation of Muon from both gradient normalization and nuclear-norm minimization
  • Empirical evidence of regime-dependent behavior (Muon helps in NanoGPT but not in small-ViT)

The key theoretical object is the projected self-polar flow $\dot{W} = -\eta P_\perp P(W)$, which yields the clean singular-value dynamics $\dot{\sigma}_i = -\eta \alpha_i$, where $\alpha_i = \|P_\perp u_i\|^2$. This is elegant and provides genuine insight into what Muon selects.
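The step from the flow to the singular-value dynamics can be sketched in a few lines. The derivation below rests on assumptions that the review does not state explicitly: $P_\perp$ is taken to be an orthogonal projector (symmetric and idempotent) acting on the left, $P(W) = UV^\top$ is the polar factor from the thin SVD $W = U\Sigma V^\top$, and the nonzero singular values are simple, so that $\dot{\sigma}_i = u_i^\top \dot{W} v_i$ holds.

```latex
% Assumptions of this sketch (not confirmed by the review):
%   P_\perp^\top = P_\perp,\; P_\perp^2 = P_\perp, acting on the left;
%   W = U \Sigma V^\top with simple nonzero singular values.
\begin{align*}
\dot{\sigma}_i
  &= u_i^\top \dot{W} v_i
     && \text{(first-order perturbation of a simple singular value)} \\
  &= -\eta\, u_i^\top P_\perp U V^\top v_i
     && \text{(substitute } \dot{W} = -\eta P_\perp U V^\top \text{)} \\
  &= -\eta\, u_i^\top P_\perp u_i
     && \text{(} V^\top v_i = e_i,\; U e_i = u_i \text{)} \\
  &= -\eta\, \|P_\perp u_i\|^2 = -\eta\, \alpha_i
     && \text{(} P_\perp^\top P_\perp = P_\perp \text{)}
\end{align*}
```

Under these assumptions $\alpha_i \in [0, 1]$, so for $\eta > 0$ each singular value is nonincreasing and shrinks at a rate of at most $\eta$; directions with larger $\alpha_i$ shrink faster.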

2. Methodological Rigor

Strengths in rigor:

  • The paper is unusually careful about separating what is proved from what is conjectured. The distinction between the projected self-polar flow (Theorem 1, no SF assumption), the projected gradient/momentum polar flow (Theorem 2, requiring approximate SF), and literal Muon is clearly maintained.
  • The variational characterization (Theorem 1(iv)) via the signless graph Laplacian strict convexity argument is clean and correct.
  • The robustness theorem (Theorem 2) properly uses the gauge-invariant δ_P rather than subspace angles alone.
Concerns:

  • The gap between the analyzed object (projected polar flow on an affine manifold) and actual Muon training is substantial. The paper acknowledges this but the practical relevance of the theoretical results depends heavily on unverified conditions (e.g., Remark 6's conditional link to transformers).
  • The sign criterion for flattening (Theorem 1(iii)) is conditional: $\sum_i \alpha_i q_i \leq 0$ depends on the measurement geometry, and the paper does not characterize when this holds beyond noting it requires $\alpha_i$ to be "concentrated on large-singular-value directions."
  • R_pw is explicitly NOT a Lyapunov function — this is a significant limitation that the authors acknowledge but that weakens the convergence story considerably.
  • The NanoGPT experiments are single-seed, limiting statistical conclusions.
3. Potential Impact

    The paper addresses an important question: what does Muon actually optimize for? The answer — flat spectra rather than low-rank solutions — is counterintuitive and practically meaningful. If confirmed at scale, this changes how practitioners should think about when to use Muon.

    The regime-dependence finding (Muon helps when many spectral directions need to be active, hurts when the useful spectrum is low-dimensional) provides actionable guidance. The critical batch size formula (Theorem 4) connecting polar-map sensitivity S(μ) to training dynamics is potentially useful for large-scale training.

    However, the impact is somewhat limited by: (1) the gap between the analyzed model and real training, (2) the small scale of experiments (124M NanoGPT, 5K steps), and (3) the conditional nature of the transformer connection.

4. Timeliness & Relevance

    Extremely timely. Muon has become a significant optimizer in the LLM training community, with many concurrent papers (the related work section lists ~15 April-May 2026 preprints). The paper fills a genuine theoretical gap: most concurrent work focuses on convergence rates or max-margin classification, while this paper addresses the underdetermined regression regime where the optimizer must select among interpolants.

    The distinction from nuclear-norm minimization is particularly important given the community's tendency to assume spectral-norm geometry implies low-rank bias.
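The contrast with nuclear-norm minimization can be checked with a small numerical example. For a rank-$r$ spectrum $\sigma$, the Cauchy-Schwarz inequality gives $\|\sigma\|_2 \le \|\sigma\|_1 \le \sqrt{r}\,\|\sigma\|_2$, with the upper bound attained only by a flat spectrum and the lower bound only by a rank-one spectrum. So at fixed Frobenius norm the flat spectrum has the largest nuclear norm, the opposite of what a nuclear-norm minimizer prefers. The three spectra below are hypothetical illustrations chosen to have unit Frobenius norm; they are not taken from the paper.

```python
import numpy as np


def nuclear_norm(s):
    """Sum of singular values."""
    return float(np.sum(s))


def frobenius_norm(s):
    """Euclidean norm of the singular values."""
    return float(np.sqrt(np.sum(s ** 2)))


def stable_rank(s):
    """Squared Frobenius norm divided by squared spectral norm."""
    return float(np.sum(s ** 2) / np.max(s) ** 2)


# Hypothetical singular-value vectors, all with Frobenius norm 1.
spectra = {
    "flat": np.array([0.5, 0.5, 0.5, 0.5]),
    "mixed": np.array([0.8, 0.4, 0.4, 0.2]),
    "rank-one": np.array([1.0, 0.0, 0.0, 0.0]),
}

for name, s in spectra.items():
    print(f"{name:9s} frob={frobenius_norm(s):.3f} "
          f"nuclear={nuclear_norm(s):.3f} stable_rank={stable_rank(s):.3f}")
```

The flat spectrum has nuclear norm $\sqrt{4} = 2$ and stable rank 4, while the rank-one spectrum has nuclear norm 1 and stable rank 1, with the mixed case in between.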

5. Strengths & Limitations

Key Strengths:

  • Clean mathematical framework with exact dynamics (not just bounds)
  • Careful intellectual honesty about limitations (rare in the field)
  • The nuclear-norm falsification is convincing: 1.29×–2.02× gap across 10 seeds with zero convergence to the nuclear-norm minimum
  • The regime-dependence narrative (NanoGPT vs. ViT) is more nuanced and credible than a universal superiority claim
  • The pairwise functional R_pw and its variational characterization are novel analytical tools
Notable Weaknesses:

  • The projected self-polar flow is a mathematical idealization; literal Muon at zero loss stops (G=0), and the paper's bridge via momentum buffers is acknowledged to be incomplete
  • Single-seed NanoGPT experiments with no error bars
  • The transformer connection (Remark 6) requires unmeasured activation rank and unverified linearization error
  • Missing Figure 12 (placeholder noted in the paper) — suggests incomplete preparation
  • The "Paper written by prompting pAI/MSc" disclosure, while commendable for transparency, raises questions about the depth of human verification of all mathematical claims
Unusual Meta-Aspect:

    Section 2 explicitly describes AI-assisted authorship. While the transparency is admirable, the paper being a "testbed" for an agentic system raises concerns about whether the theoretical development was driven by genuine mathematical insight or pattern-matching. The authors note manual inspection of proofs and claims, but the 5-iteration pipeline with minimal human writing is unprecedented and warrants scrutiny of proof correctness.

Overall Assessment

    This paper makes a genuine theoretical contribution to understanding Muon's implicit bias, with the flat-spectrum characterization being its most important insight. The mathematical analysis is clean within its scope but limited by the gap between the idealized model and practice. The empirical work is directionally informative but insufficient for strong conclusions. The paper's greatest virtue is its intellectual honesty about limitations — a quality that ironically highlights how far the results are from a complete theory of Muon.

Rating: 6.2/10
Significance 6.5 · Rigor 6 · Novelty 7 · Clarity 7.5

    Generated Jun 9, 2026

Comparison History (21)

    Lost vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

    Paper 2 likely has higher impact: it delivers a large-scale, realistic benchmark (up to ~37k channels) with physically grounded targets (AC power-flow) and introduces constraint-aware probabilistic metrics, enabling standardized evaluation of safety-critical forecasting at unprecedented scale. This can catalyze broad follow-on work across ML, time-series, energy systems, and risk-aware decision-making. It also proposes a competitive baseline model (PowerForge). Paper 1 is novel and rigorous but narrower (a specific optimizer bias) and its real-world gains appear regime-dependent, limiting immediate cross-field adoption.

    gpt-5.2 · Jun 12, 2026

    Lost vs. CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

    Paper 1 addresses a critical bottleneck in large language models (inference speed) with a highly practical, computationally lightweight solution that provides measurable speedups with zero quality degradation. Its direct applicability to LLM deployment gives it a broader and more immediate real-world impact compared to Paper 2, which offers a valuable but more niche theoretical analysis of a specific optimizer with regime-dependent benefits.

    gemini-3.1-pro-preview · Jun 10, 2026

    Won vs. LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

    Paper 2 offers a deeper theoretical contribution: it characterizes the implicit bias of Muon via entropy-maximizing polar updates, derives exact spectral dynamics in a regression setting, and experimentally validates predicted spectrum-flattening effects, including nuanced regime dependence. This blend of theory + mechanistic insight + controlled ablations is likely to generalize across optimization, implicit regularization, and deep learning practice. Paper 1 is timely and practical for continual learning, but relies heavily on leveraging existing foundation models and system-level orchestration, making its core novelty less fundamental and potentially more benchmark-dependent.

    gpt-5.2 · Jun 9, 2026

    Lost vs. PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

    PBSD addresses the critical and timely challenge of credit assignment in long-horizon agentic RL tasks (e.g., multi-turn LLM agents), offering a principled Bayesian framework that converts sparse outcome rewards into turn-level signals. This has broad applicability to the rapidly growing field of LLM agent fine-tuning. Paper 1 provides valuable theoretical analysis of Muon optimizer's spectral dynamics but is more narrowly focused on understanding an existing method, with regime-dependent conclusions that limit universal applicability. PBSD's practical utility for training agentic systems and its compatibility with standard policy optimization give it higher near-term impact potential.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. A Universal Dense Football Event Representation Based on TabTransformer

    Paper 1 addresses foundational optimization techniques in deep learning with broad applicability to training large models (e.g., LLMs and ViTs). It offers rigorous theoretical proofs combined with controlled experiments. In contrast, Paper 2 is an applied study utilizing an existing architecture (TabTransformer) for a highly specific niche (football sports analytics), limiting its breadth of impact and foundational novelty compared to Paper 1.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

    Paper 1 likely has higher scientific impact due to clearer, immediate real-world applicability and broader systems-level relevance: it targets compiler/auto-scheduling efficiency with substantial measured gains on CPU/GPU and end-to-end model inference, and integrates into a widely used framework (TVM), enabling adoption. Its world-model latent dynamics idea is novel for tensor program search and could generalize to other sequential optimization problems. Paper 2 offers deeper theoretical insight into Muon’s spectral bias, but its practical impact appears more regime-dependent and currently narrower, with mixed results across models.

    gpt-5.2 · Jun 9, 2026

    Won vs. A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

    Paper 2 likely has higher impact: it analyzes a broadly relevant optimizer variant (polar-factor updates) with theoretical characterization (entropy-maximizing bias, exact spectral dynamics) and links to practical deep-learning regimes, potentially influencing optimization theory and algorithm design across many models. Its core concept generalizes beyond selective prediction to training dynamics, with timely relevance to LLM/Vision training. Paper 1 is methodologically solid and practically useful for certified selective conformal deployment, but its impact is more specialized to conformal risk control and selective prediction settings.

    gpt-5.2 · Jun 9, 2026

    Won vs. Accelerated Decentralized Stochastic Gradient Descent for Strongly Convex Optimization

    Paper 1 investigates the Muon optimizer, which has direct and immediate implications for the highly active and resource-intensive field of large language model (LLM) training. While Paper 2 presents excellent theoretical advancements in decentralized optimization, Paper 1's focus on practical deep learning optimizers addresses a critical bottleneck in modern AI, likely resulting in broader adoption and higher immediate scientific and practical impact.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

    Paper 1 investigates deep learning optimization, a highly active field with broad implications for training large-scale foundation models. It combines rigorous theoretical analysis of spectral dynamics with empirical validation on modern architectures like NanoGPT. In contrast, Paper 2 is a replication and methodological critique of a narrow application (airline profit clustering) using standard statistical techniques. Paper 1's theoretical insights into optimizer behavior have significantly higher potential to drive future algorithmic innovations and broad real-world AI applications.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. Towards Graph Foundation Models for Dynamics in Complex Networked Systems: Lessons from Super-Spreader Identification in Multilayer Networks

    Paper 2 provides rigorous theoretical foundations and empirical analysis for Muon, a highly relevant optimization algorithm for training large-scale deep learning models. Its insights into spectral dynamics and optimization bias have broad implications across AI, offering immediate practical utility. While Paper 1 presents an innovative step toward Graph Foundation Models, its scope is more confined to network dynamics, making Paper 2's potential impact on the broader machine learning community more significant and immediate.

    gemini-3.1-pro-preview · Jun 9, 2026