Pierfrancesco Beneventano, Mahmoud Abdelmoneum, Tomaso Poggio
Muon replaces a matrix gradient G = UΣV⊤ by its polar factor UV⊤. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.
This paper provides a theoretical characterization of the Muon optimizer's implicit bias through the lens of spectral dynamics. The central insight is that Muon, which replaces the gradient G = UΣV⊤ with its polar factor UV⊤, induces a flat-spectrum bias rather than the low-rank (nuclear-norm minimizing) bias commonly assumed. The paper makes four concrete contributions:
The key theoretical object is the projected self-polar flow, which yields clean singular-value dynamics. This is elegant and provides genuine insight into what Muon selects.
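To make the polar-factor operation concrete, the following is a minimal NumPy sketch, not the paper's implementation. It computes UV⊤ from a thin SVD of a random stand-in gradient and checks the two properties the summary relies on: the update's singular values are all equal to 1, and the singular directions of G are preserved. The function name `polar_factor`, the matrix shape, and the seed are illustrative assumptions.

```python
import numpy as np


def polar_factor(G):
    """Return U @ V^T, where G = U diag(s) V^T is the thin SVD of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


# Hypothetical stand-in for a matrix gradient (shape and seed are arbitrary).
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

P = polar_factor(G)

# Spectrum of the gradient versus spectrum of the polar update.
s_G = np.linalg.svd(G, compute_uv=False)
s_P = np.linalg.svd(P, compute_uv=False)

# Express P in the singular bases of G: U^T P V should be the identity,
# i.e. same directions as G, with every singular value replaced by 1.
U, _, Vt = np.linalg.svd(G, full_matrices=False)
aligned = U.T @ P @ Vt.T

print("gradient singular values:", s_G)
print("polar-factor singular values:", s_P)
```

The SVD is used here only because it makes the definition transparent; how a practical Muon implementation approximates the polar factor is outside what this sketch shows.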
The paper addresses an important question: what does Muon actually optimize for? The answer, flat spectra rather than low-rank solutions, is counterintuitive and practically meaningful. If confirmed at scale, this would change how practitioners think about when to use Muon.
The regime-dependence finding (Muon helps when many spectral directions need to be active, hurts when the useful spectrum is low-dimensional) provides actionable guidance. The critical batch size formula (Theorem 4) connecting polar-map sensitivity S(μ) to training dynamics is potentially useful for large-scale training.
However, the impact is somewhat limited by: (1) the gap between the analyzed model and real training, (2) the small scale of experiments (124M NanoGPT, 5K steps), and (3) the conditional nature of the transformer connection.
Extremely timely. Muon has become a significant optimizer in the LLM training community, with many concurrent papers (the related work section lists roughly 15 preprints from April-May 2026). The paper fills a genuine theoretical gap: most concurrent work focuses on convergence rates or max-margin classification, while this paper addresses the underdetermined regression regime where the optimizer must select among interpolants.
The distinction from nuclear-norm minimization is particularly important given the community's tendency to assume spectral-norm geometry implies low-rank bias.
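The contrast with nuclear-norm minimization can be checked on singular values alone. The sketch below is an illustration under stated assumptions, not a result from the paper: it compares a flat and a concentrated spectrum with the same Frobenius norm. By Cauchy-Schwarz, the sum of r singular values is at most √r times the Frobenius norm, with equality exactly when the spectrum is flat, so the flat state is the largest nuclear norm at fixed Frobenius norm, the opposite of what nuclear-norm minimization prefers. The helper names and the two example spectra are hypothetical, and stable rank is taken as the standard ratio of squared Frobenius norm to squared spectral norm, assumed to match the paper's usage.

```python
import math


def frobenius_norm(singular_values):
    """Frobenius norm: Euclidean norm of the singular values."""
    return math.sqrt(sum(s * s for s in singular_values))


def nuclear_norm(singular_values):
    """Nuclear norm: sum of the singular values."""
    return sum(singular_values)


def stable_rank(singular_values):
    """Stable rank: squared Frobenius norm over squared spectral norm."""
    return sum(s * s for s in singular_values) / max(singular_values) ** 2


# Two hypothetical spectra of rank budget r = 4 with Frobenius norm 1.
r = 4
flat = [0.5] * r                      # all singular values equal
concentrated = [1.0, 0.0, 0.0, 0.0]   # all mass on one direction

for name, spectrum in [("flat", flat), ("concentrated", concentrated)]:
    print(
        name,
        "frobenius =", frobenius_norm(spectrum),
        "nuclear =", nuclear_norm(spectrum),
        "stable rank =", stable_rank(spectrum),
    )
```

At equal Frobenius norm, the flat spectrum has nuclear norm √r = 2 and stable rank 4, while the concentrated one has nuclear norm 1 and stable rank 1.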
Section 2 explicitly describes AI-assisted authorship. While the transparency is admirable, the paper's role as a "testbed" for an agentic system raises concerns about whether the theoretical development was driven by genuine mathematical insight or by pattern-matching. The authors note manual inspection of proofs and claims, but the 5-iteration pipeline with minimal human writing is unprecedented and warrants scrutiny of proof correctness.
This paper makes a genuine theoretical contribution to understanding Muon's implicit bias, with the flat-spectrum characterization being its most important insight. The mathematical analysis is clean within its scope but limited by the gap between the idealized model and practice. The empirical work is directionally informative but insufficient for strong conclusions. The paper's greatest virtue is its intellectual honesty about limitations, a quality that ironically highlights how far the results are from a complete theory of Muon.
Generated Jun 9, 2026
Paper 2 likely has higher impact: it delivers a large-scale, realistic benchmark (up to ~37k channels) with physically grounded targets (AC power-flow) and introduces constraint-aware probabilistic metrics, enabling standardized evaluation of safety-critical forecasting at unprecedented scale. This can catalyze broad follow-on work across ML, time-series, energy systems, and risk-aware decision-making. It also proposes a competitive baseline model (PowerForge). Paper 1 is novel and rigorous but narrower (a specific optimizer bias) and its real-world gains appear regime-dependent, limiting immediate cross-field adoption.
Paper 1 addresses a critical bottleneck in large language models (inference speed) with a highly practical, computationally lightweight solution that provides measurable speedups with zero quality degradation. Its direct applicability to LLM deployment gives it a broader and more immediate real-world impact compared to Paper 2, which offers a valuable but more niche theoretical analysis of a specific optimizer with regime-dependent benefits.
Paper 2 offers a deeper theoretical contribution: it characterizes the implicit bias of Muon via entropy-maximizing polar updates, derives exact spectral dynamics in a regression setting, and experimentally validates predicted spectrum-flattening effects, including nuanced regime dependence. This blend of theory + mechanistic insight + controlled ablations is likely to generalize across optimization, implicit regularization, and deep learning practice. Paper 1 is timely and practical for continual learning, but relies heavily on leveraging existing foundation models and system-level orchestration, making its core novelty less fundamental and potentially more benchmark-dependent.
PBSD addresses the critical and timely challenge of credit assignment in long-horizon agentic RL tasks (e.g., multi-turn LLM agents), offering a principled Bayesian framework that converts sparse outcome rewards into turn-level signals. This has broad applicability to the rapidly growing field of LLM agent fine-tuning. Paper 1 provides valuable theoretical analysis of Muon optimizer's spectral dynamics but is more narrowly focused on understanding an existing method, with regime-dependent conclusions that limit universal applicability. PBSD's practical utility for training agentic systems and its compatibility with standard policy optimization give it higher near-term impact potential.
Paper 1 addresses foundational optimization techniques in deep learning with broad applicability to training large models (e.g., LLMs and ViTs). It offers rigorous theoretical proofs combined with controlled experiments. In contrast, Paper 2 is an applied study utilizing an existing architecture (TabTransformer) for a highly specific niche (football sports analytics), limiting its breadth of impact and foundational novelty compared to Paper 1.
Paper 1 likely has higher scientific impact due to clearer, immediate real-world applicability and broader systems-level relevance: it targets compiler/auto-scheduling efficiency with substantial measured gains on CPU/GPU and end-to-end model inference, and integrates into a widely used framework (TVM), enabling adoption. Its world-model latent dynamics idea is novel for tensor program search and could generalize to other sequential optimization problems. Paper 2 offers deeper theoretical insight into Muon’s spectral bias, but its practical impact appears more regime-dependent and currently narrower, with mixed results across models.
Paper 2 likely has higher impact: it analyzes a broadly relevant optimizer variant (polar-factor updates) with theoretical characterization (entropy-maximizing bias, exact spectral dynamics) and links to practical deep-learning regimes, potentially influencing optimization theory and algorithm design across many models. Its core concept generalizes beyond selective prediction to training dynamics, with timely relevance to LLM/Vision training. Paper 1 is methodologically solid and practically useful for certified selective conformal deployment, but its impact is more specialized to conformal risk control and selective prediction settings.
Paper 1 investigates the Muon optimizer, which has direct and immediate implications for the highly active and resource-intensive field of large language model (LLM) training. While Paper 2 presents excellent theoretical advancements in decentralized optimization, Paper 1's focus on practical deep learning optimizers addresses a critical bottleneck in modern AI, likely resulting in broader adoption and higher immediate scientific and practical impact.
Paper 1 investigates deep learning optimization, a highly active field with broad implications for training large-scale foundation models. It combines rigorous theoretical analysis of spectral dynamics with empirical validation on modern architectures like NanoGPT. In contrast, Paper 2 is a replication and methodological critique of a narrow application (airline profit clustering) using standard statistical techniques. Paper 1's theoretical insights into optimizer behavior have significantly higher potential to drive future algorithmic innovations and broad real-world AI applications.
Paper 2 provides rigorous theoretical foundations and empirical analysis for Muon, a highly relevant optimization algorithm for training large-scale deep learning models. Its insights into spectral dynamics and optimization bias have broad implications across AI, offering immediate practical utility. While Paper 1 presents an innovative step toward Graph Foundation Models, its scope is more confined to network dynamics, making Paper 2's potential impact on the broader machine learning community more significant and immediate.