Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang
Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
This paper shifts the discourse around Muon from optimization efficiency to feature quality, arguing that Muon learns representations that are (a) more robust to input corruptions and (b) more transferable to downstream tasks, compared to Adam and SGD. The paper identifies two mechanistic signatures of this advantage: larger layer-wise logit margins (linked to robustness) and higher effective rank of hidden-state matrices (linked to transferability). A theoretical analysis in a stylized one-layer classification setting with multi-component features formalizes these observations, proving that Muon's spectral normalization of the gradient induces a smaller "representation imbalance ratio," which monotonically maps to larger margins and higher effective rank at matched training loss.
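To make the margin signature concrete, here is a minimal sketch of a per-example logit margin, assuming the common definition (true-class logit minus the largest other logit). The review does not state the paper's exact definition, so the function name `logit_margins` and this formula are illustrative assumptions.

```python
import numpy as np

def logit_margins(logits, labels):
    """Per-example margin: true-class logit minus the best competing logit.

    Assumed definition for illustration; the paper may normalize differently.
    """
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(logits.shape[0])
    true_logit = logits[rows, labels]
    # Mask out the true class so the max runs over competitors only.
    masked = logits.copy()
    masked[rows, labels] = -np.inf
    runner_up = masked.max(axis=1)
    return true_logit - runner_up

# Hypothetical logits for two examples over three classes.
example = logit_margins([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]], [0, 2])
```

A positive margin means the example is classified correctly with room to spare; a negative margin means a competing class wins. Under this definition, a larger margin leaves more slack before an input corruption flips the prediction.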
The core novelty lies in reframing Muon's advantage through a feature-learning lens rather than a convergence/efficiency lens—connecting spectral gradient normalization to concrete representation-level quantities that practitioners care about (robustness, transfer). The construction of FineWeb10B-C as a corrupted language benchmark is a minor but useful contribution.
Empirical methodology is generally sound. The matched-budget protocol (same architecture, data, epochs, with independent hyperparameter tuning) is appropriate. Experiments span CNNs (ResNet-18), ViTs (ViT-S), and causal transformers (GPT-2, GPT-2 Medium), providing breadth across architectures and modalities. Standard deviations over three seeds are reported. The use of tuned-lens probes for layer-wise margin analysis and spectral analysis (effective rank, Top-k energy) provides interpretable intermediate diagnostics.
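The spectral diagnostics mentioned above can be sketched as follows. This assumes the entropy-based effective rank (the exponential of the Shannon entropy of the normalized singular values) and takes Top-k energy to be the share of squared singular values captured by the k largest. Both definitions, and the names `effective_rank` and `top_k_energy`, are assumptions; the paper may define these quantities differently.

```python
import numpy as np

def effective_rank(H, eps=1e-12):
    """Entropy-based effective rank of a hidden-state matrix H (samples x features)."""
    s = np.linalg.svd(np.asarray(H, dtype=float), compute_uv=False)
    p = s / (s.sum() + eps)   # normalize singular values into a distribution
    p = p[p > 0]              # drop exact zeros so log(p) is finite
    return float(np.exp(-(p * np.log(p)).sum()))

def top_k_energy(H, k):
    """Fraction of spectral energy (squared singular values) in the top k directions."""
    s = np.linalg.svd(np.asarray(H, dtype=float), compute_uv=False)
    energy = s ** 2
    return float(energy[:k].sum() / energy.sum())
```

With these definitions, a matrix whose singular values are all equal has effective rank equal to its dimension, while a rank-one matrix has effective rank close to 1. Higher effective rank together with lower Top-k energy indicates that hidden states spread over more directions.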
However, several concerns limit confidence:
Theoretical analysis is rigorous within its stylized setting. The one-layer linear classifier with block-structured multi-component features captures an interesting structural property (verified empirically via cosine similarity clustering in Figure 7). The proof technique—reducing each optimizer's trajectory to a 2D canonical plane parameterized by (u, v), then showing that Muon's spectral normalization produces the smallest imbalance ratio ρ = v/u—is clean and interpretable. The matched-loss comparison framework is well-motivated. However, the gap between the one-layer linear model and deep nonlinear networks remains significant, and the theory does not account for stochastic gradients, finite learning rates, or momentum.
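The intuition behind the spectral-normalization argument can be illustrated with a small sketch. It assumes an idealized Muon step that replaces the gradient matrix by its orthogonal polar factor, computed here with an exact SVD. Practical Muon implementations apply an approximate Newton-Schulz iteration to a momentum buffer, which this sketch omits, and the gradient values below are hypothetical.

```python
import numpy as np

def orthogonalized_update(G):
    """Idealized Muon direction: keep the singular vectors of G, set singular values to 1."""
    U, _, Vt = np.linalg.svd(np.asarray(G, dtype=float), full_matrices=False)
    return U @ Vt

# Hypothetical gradient with one dominant and one weak feature direction.
G = np.array([[3.0, 0.0],
              [0.0, 0.5]])

raw_spectrum = np.linalg.svd(G, compute_uv=False)                          # imbalanced
muon_spectrum = np.linalg.svd(orthogonalized_update(G), compute_uv=False)  # flat
```

A plain gradient step moves six times farther along the dominant direction than along the weak one in this example, whereas the orthogonalized step moves equally along both. This equalization is the kind of effect the review credits for the smaller imbalance ratio, although the paper's actual proof works through the (u, v) parameterization rather than this toy calculation.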
The paper addresses a question of growing practical importance: as Muon enters production-scale LLM training (DeepSeek, GLM-5, Kimi K2), understanding *what kind of features* it learns—beyond just training efficiency—is valuable for practitioners making optimizer choices. If Muon's robustness and transferability advantages are confirmed at scale, this could influence:
Effective rank and logit margin, used as diagnostics for comparing optimizers, could become standard evaluation metrics.
The paper is highly timely. Muon has rapidly gained traction in 2024–2025, with multiple production deployments and a flurry of variants. Most existing analyses focus on convergence properties; this is the first systematic study of Muon's feature-learning behavior. The robustness and transfer perspectives are well-chosen, as these are among the most practically relevant axes for evaluating pretrained representations.
This is a well-executed empirical and theoretical study that opens a new analytical angle on an increasingly important optimizer. The findings are consistent and the theoretical framework is elegant, though both the empirical scale and theoretical abstraction leave room for stronger validation. The paper makes a meaningful conceptual contribution to the optimizer landscape and provides useful diagnostic tools.
Generated Jun 9, 2026
Paper 1 is more novel methodologically (learning probability current directly from trajectories without estimating drift/diffusion/score) and targets a broad, high-impact domain: fast surrogate ensemble prediction for chaotic/turbulent/stochastic dynamical systems and PDEs, with clear real-world applications (climate, fluids, materials, uncertainty quantification). It includes stability analysis separating discretization vs sampling variance, suggesting stronger rigor. Paper 2 is timely and useful for ML practice, but is incremental (comparing optimizers’ learned features) and likely narrower in cross-field scientific reach than a new surrogate modeling framework for stochastic dynamics.
Paper 1 provides deeper theoretical and empirical insights into a fundamental question about optimizer behavior (Muon vs Adam/SGD), covering robustness, transferability, and formal proofs. Its breadth spans multiple architectures (transformers, CNNs), modalities (vision, language), and includes rigorous theoretical analysis. Paper 2 presents a useful but incremental improvement to GRPO rollout strategies for math reasoning, with narrower scope. Paper 1's findings about optimizer-driven feature quality have broader implications for the entire deep learning community, while Paper 2's contribution is more application-specific.
Paper 1 provides tight, fundamental theoretical bounds on the VC dimension and sample complexity of Transformers and chain-of-thought learning. Establishing definitive mathematical limits for the dominant architecture in AI offers lasting foundational impact, whereas optimizer analyses like Paper 2, while highly practical, are often tied to more transient empirical trends.
Paper 2 provides a foundational framework and organizing perspective for a highly impactful, cross-disciplinary field (data-driven discovery of physical laws). Review papers that unify rapidly expanding methodologies often become highly cited landmarks. While Paper 1 offers valuable insights into a specific optimization algorithm, its scope is narrower and confined to machine learning, whereas Paper 2 spans physics, AI, and adjacent sciences with profound implications for scientific discovery.
Paper 1 provides a deeper theoretical framework connecting plasticity loss to dynamical isometry via the Neural Tangent Kernel, proposes a novel optimizer (AdamO) with principled regularization, and reinterprets prior methods through a unifying lens. It addresses the fundamental and broadly relevant problem of continual learning plasticity with both theoretical grounding and practical solutions across supervised and RL domains. Paper 2 offers valuable empirical and theoretical insights into Muon's advantages but is more descriptive/analytical of an existing optimizer rather than introducing a fundamentally new framework or method.
Paper 1 addresses a broadly impactful topic—understanding why the Muon optimizer produces more robust and transferable features than Adam/SGD—relevant to the entire deep learning community working with LLMs, vision models, and beyond. It combines extensive empirical analysis across architectures with theoretical guarantees, offering insights that could influence optimizer selection in mainstream practice. Paper 2, while interesting, addresses a narrower problem (molecular force prediction) with a single minimal testbed (NaCl aqueous system), limiting its immediate breadth of impact and generalizability.
Paper 1 addresses a fundamental question about optimizer behavior in deep learning (why Muon produces more robust and transferable features than Adam/SGD) with both empirical evidence across architectures and theoretical guarantees. Given the centrality of optimizers in modern ML and the rapid adoption of Muon for LLM pretraining, this work has broad implications across all of deep learning. Paper 2, while methodologically sound, addresses a niche application in football analytics with a limited dataset (7 matches) and narrower cross-disciplinary impact.
Paper 1 likely has higher impact due to stronger novelty and clearer real-world applicability: a model-agnostic, self-supervised framework targeting a key bottleneck in ML interatomic potentials—transfer under limited expensive DFT labels—validated across major materials/chemistry benchmarks with large reported error reductions and released pretrained models. Its impact spans computational chemistry, materials science, and representation learning. Paper 2 is timely and useful for ML optimization practice, but the incremental nature of optimizer comparisons and potentially narrower downstream novelty (given many robustness/transfer studies) suggests comparatively lower scientific impact.
Paper 2 investigates fundamental properties of the Muon optimizer (robustness, transferability) with both empirical and theoretical analysis across multiple architectures. This has broader impact since optimizer choice affects all of deep learning, not just LLM RL. Paper 1 proposes an incremental improvement (DRPO) over existing trust-region methods (DPPO/PPO) for LLM RL—a narrower scope. Paper 2's insights into why Muon learns better features (margins, effective rank) provide foundational understanding applicable across vision and language, with stronger potential to influence optimizer design and model training practices broadly.
Paper 1 investigates fundamental optimization mechanisms in deep learning, demonstrating that the new Muon optimizer outperforms the ubiquitous Adam optimizer in feature robustness and transferability. Because optimization is central to all deep learning models (LLMs, CNNs), a proven improvement here has a massive, field-wide impact. Paper 2 introduces a valuable but more narrowly focused benchmark for smartphone AI agents, which, while highly relevant for applied AI, does not match the fundamental breadth and foundational theoretical contribution of Paper 1.