Muon Learns More Robust and Transferable Features than Adam

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

Jun 8, 2026 · arXiv:2606.09658v1

cs.LG · cs.AI

#1472 of 5669 · cs.LG

Tournament Score: 1453 ± 44

Win Rate: 70%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Muon Learns More Robust and Transferable Features than Adam"

1. Core Contribution

This paper shifts the discourse around Muon from optimization efficiency to feature quality, arguing that Muon learns representations that are (a) more robust to input corruptions and (b) more transferable to downstream tasks, compared to Adam and SGD. The paper identifies two mechanistic signatures of this advantage: larger layer-wise logit margins (linked to robustness) and higher effective rank of hidden-state matrices (linked to transferability). A theoretical analysis in a stylized one-layer classification setting with multi-component features formalizes these observations, proving that Muon's spectral normalization of the gradient induces a smaller "representation imbalance ratio," which monotonically maps to larger margins and higher effective rank at matched training loss.

The core novelty lies in reframing Muon's advantage through a feature-learning lens rather than a convergence/efficiency lens—connecting spectral gradient normalization to concrete representation-level quantities that practitioners care about (robustness, transfer). The construction of FineWeb10B-C as a corrupted language benchmark is a minor but useful contribution.
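The two diagnostics named in this section can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes effective rank is the exponential of the Shannon entropy of the normalized singular values (a common definition) and that the logit margin is the true-class logit minus the largest competing logit. The paper's exact definitions may differ, and the function names `effective_rank` and `logit_margin` are hypothetical.

```python
import numpy as np

def effective_rank(hidden_states):
    """Entropy-based effective rank of a (samples x features) matrix (assumed definition)."""
    s = np.linalg.svd(np.asarray(hidden_states, dtype=float), compute_uv=False)
    p = s / s.sum()          # normalize singular values into a distribution
    p = p[p > 0]             # drop exact zeros so log() is defined
    return float(np.exp(-(p * np.log(p)).sum()))

def logit_margin(logits, labels):
    """Per-example margin: true-class logit minus the best competing logit (assumed definition)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(len(labels))
    true_logit = logits[rows, labels]
    competitors = logits.copy()
    competitors[rows, labels] = -np.inf   # exclude the true class from the max
    return true_logit - competitors.max(axis=1)

# Toy inputs: an identity matrix spreads energy evenly, an all-ones matrix is rank one.
rank_even = effective_rank(np.eye(4))
rank_collapsed = effective_rank(np.ones((3, 3)))
margins = logit_margin([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0]], [0, 2])
print(rank_even, rank_collapsed, margins)
```

A positive margin means the example is classified correctly with room to spare, and a negative margin means it is misclassified. Under these definitions, a higher effective rank means the hidden states use more directions with comparable energy.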

2. Methodological Rigor

Empirical methodology is generally sound. The matched-budget protocol (same architecture, data, epochs, with independent hyperparameter tuning) is appropriate. Experiments span CNNs (ResNet-18), ViTs (ViT-S), and causal transformers (GPT-2, GPT-2 Medium), providing breadth across architectures and modalities. Standard deviations over three seeds are reported. The use of tuned-lens probes for layer-wise margin analysis and spectral analysis (effective rank, Top-k energy) provides interpretable intermediate diagnostics.

However, several concerns limit confidence:

Scale: The models studied (11M–354M parameters) are modest by modern standards. Whether Muon's feature-quality advantage persists at truly large scale (billions of parameters) is unaddressed.

Hyperparameter sensitivity: While learning rates are tuned, other choices (e.g., Muon's Newton-Schulz iterations, the use of Adam for non-matrix parameters in Muon runs) could confound comparisons.

GPT-2 Medium: Results are from a single seed, weakening statistical confidence for the largest model.

Transfer evaluation: Linear probing for vision is standard, but the downstream tasks are relatively simple. Language transfer uses only instruction-tuning perplexity, not task-specific metrics.
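To make the Newton-Schulz concern under hyperparameter sensitivity concrete, the sketch below orthogonalizes a gradient matrix with the classic cubic Newton-Schulz iteration. This is an assumption-laden illustration: Muon implementations commonly use a tuned polynomial with only a handful of steps, whereas this version uses the textbook cubic update and many steps so that it converges closely. The function name `newton_schulz_orthogonalize` and the step count are hypothetical.

```python
import numpy as np

def newton_schulz_orthogonalize(grad, steps=30):
    """Approximate the orthogonal polar factor of `grad` (classic cubic iteration, not Muon's tuned variant)."""
    x = np.asarray(grad, dtype=float)
    # Divide by the Frobenius norm so every singular value lies in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    x = x / (np.linalg.norm(x) + 1e-12)
    for _ in range(steps):
        # X <- 1.5 X - 0.5 X X^T X moves each singular value toward 1
        # while leaving the singular vectors unchanged.
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x

rng = np.random.default_rng(0)
grad = rng.standard_normal((8, 6))          # stand-in for a weight gradient
update = newton_schulz_orthogonalize(grad)
singular_values = np.linalg.svd(update, compute_uv=False)

# Reference: the exact polar factor U V^T from a full SVD.
u, _, vt = np.linalg.svd(grad, full_matrices=False)
exact = u @ vt
print(singular_values.round(4))
```

With fewer steps the singular values of the update only approximate 1, which is one way the iteration count could interact with the comparisons the review discusses.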

Theoretical analysis is rigorous within its stylized setting. The one-layer linear classifier with block-structured multi-component features captures an interesting structural property (verified empirically via cosine similarity clustering in Figure 7). The proof technique—reducing each optimizer's trajectory to a 2D canonical plane parameterized by (u, v), then showing that Muon's spectral normalization produces the smallest imbalance ratio ρ = v/u—is clean and interpretable. The matched-loss comparison framework is well-motivated. However, the gap between the one-layer linear model and deep nonlinear networks remains significant, and the theory does not account for stochastic gradients, finite learning rates, or momentum.
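The balancing effect attributed to spectral normalization can be seen in a toy example. The sketch below is not the paper's model or its ratio ρ = v/u; it only compares the singular-value spread of a raw gradient with that of its orthogonalized counterpart U V^T. The helper names `orthogonalized_update` and `spectral_spread` are hypothetical, and the gradient is synthetic.

```python
import numpy as np

def orthogonalized_update(grad):
    """Replace every singular value of `grad` with 1, keeping its singular vectors."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def spectral_spread(matrix):
    """Ratio of largest to smallest singular value (a stand-in imbalance measure)."""
    s = np.linalg.svd(matrix, compute_uv=False)
    return float(s[0] / s[-1])

# Synthetic 4x3 gradient with deliberately unequal component strengths 10, 1, 0.1.
rng = np.random.default_rng(0)
q_left, _ = np.linalg.qr(rng.standard_normal((4, 3)))
q_right, _ = np.linalg.qr(rng.standard_normal((3, 3)))
grad = q_left @ np.diag([10.0, 1.0, 0.1]) @ q_right.T

raw_spread = spectral_spread(grad)                             # plain gradient step
balanced_spread = spectral_spread(orthogonalized_update(grad)) # spectrally normalized step
print(raw_spread, balanced_spread)
```

A plain gradient step moves 100 times further along the strongest component than along the weakest, while the orthogonalized step treats all three equally. This is the intuition behind the smaller imbalance the review describes, shown here for a single step only.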

3. Potential Impact

The paper addresses a question of growing practical importance: as Muon enters production-scale LLM training (DeepSeek, GLM-5, Kimi K2), understanding *what kind of features* it learns—beyond just training efficiency—is valuable for practitioners making optimizer choices. If Muon's robustness and transferability advantages are confirmed at scale, this could influence:

Foundation model training: Optimizer selection for models intended to be fine-tuned on diverse downstream tasks.

Safety and reliability: Robustness to corruptions has implications for deployment in noisy real-world environments.

Optimizer design: The connection between spectral normalization and representation diversity (effective rank) could guide the design of new optimizers.

Effective rank and margin could also become standard diagnostics for comparing optimizers.

4. Timeliness & Relevance

The paper is highly timely. Muon has rapidly gained traction in 2024–2025, with multiple production deployments and a flurry of variants. Most existing analyses focus on convergence properties; this is the first systematic study of Muon's feature-learning behavior. The robustness and transfer perspectives are well-chosen, as these are among the most practically relevant axes for evaluating pretrained representations.

5. Strengths & Limitations

Key Strengths:

Novel perspective: first to systematically study Muon vs. Adam/SGD from a feature-quality viewpoint.

Multi-modal, multi-architecture empirical design with matched-budget protocol.

Clean theoretical framework that isolates the mechanism (spectral normalization → balanced representation → larger margin + higher effective rank).

Empirical validation of the theoretical assumption (block structure in embeddings, Figure 7).

The layer-wise analysis (margins, effective rank) provides mechanistic insight beyond end-to-end metrics.

Notable Limitations:

Scale gap: 124M–354M models are far from production-scale LLMs where Muon is deployed. The feature-quality advantage may or may not persist.

Theory-practice gap: The one-layer linear model with zero-momentum, continuous-time dynamics is a substantial simplification. Extensions to deep, nonlinear, stochastic settings would strengthen the claims.

Limited corruption types: ImageNet-C and simple typo corruptions do not cover adversarial robustness or more complex distribution shifts.

Confounding factors: Muon uses Adam for embeddings and 1D parameters, making the comparison impure—the "Muon features" are partly shaped by Adam.

Missing domains: No experiments on generative tasks (diffusion, autoregressive generation quality), which are among Muon's most impactful applications.

Effect sizes: Some improvements are modest (e.g., 1–2% accuracy differences on transfer tasks), though consistent across settings.

Summary

This is a well-executed empirical and theoretical study that opens a new analytical angle on an increasingly important optimizer. The findings are consistent and the theoretical framework is elegant, though both the empirical scale and theoretical abstraction leave room for stronger validation. The paper makes a meaningful conceptual contribution to the optimizer landscape and provides useful diagnostic tools.

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (23)

Lost vs. First-Order Trajectory Matching: Fast Ensemble Predictions of Chaotic, Turbulent, Stochastic Systems

Paper 1 is more novel methodologically (learning probability current directly from trajectories without estimating drift/diffusion/score) and targets a broad, high-impact domain: fast surrogate ensemble prediction for chaotic/turbulent/stochastic dynamical systems and PDEs, with clear real-world applications (climate, fluids, materials, uncertainty quantification). It includes stability analysis separating discretization vs sampling variance, suggesting stronger rigor. Paper 2 is timely and useful for ML practice, but is incremental (comparing optimizers’ learned features) and likely narrower in cross-field scientific reach than a new surrogate modeling framework for stochastic dynamics.

gpt-5.2 · Jun 10, 2026

Won vs. N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Paper 1 provides deeper theoretical and empirical insights into a fundamental question about optimizer behavior (Muon vs Adam/SGD), covering robustness, transferability, and formal proofs. Its breadth spans multiple architectures (transformers, CNNs), modalities (vision, language), and includes rigorous theoretical analysis. Paper 2 presents a useful but incremental improvement to GRPO rollout strategies for math reasoning, with narrower scope. Paper 1's findings about optimizer-driven feature quality have broader implications for the entire deep learning community, while Paper 2's contribution is more application-specific.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Tight Sample Complexity of Transformers

Paper 1 provides tight, fundamental theoretical bounds on the VC dimension and sample complexity of Transformers and chain-of-thought learning. Establishing definitive mathematical limits for the dominant architecture in AI offers lasting foundational impact, whereas optimizer analyses like Paper 2, while highly practical, are often tied to more transient empirical trends.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Data-driven discovery of governing differential equations across physical systems

Paper 2 provides a foundational framework and organizing perspective for a highly impactful, cross-disciplinary field (data-driven discovery of physical laws). Review papers that unify rapidly expanding methodologies often become highly cited landmarks. While Paper 1 offers valuable insights into a specific optimization algorithm, its scope is narrower and confined to machine learning, whereas Paper 2 spans physics, AI, and adjacent sciences with profound implications for scientific discovery.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. Preserving Plasticity in Continual Learning via Dynamical Isometry

Paper 1 provides a deeper theoretical framework connecting plasticity loss to dynamical isometry via the Neural Tangent Kernel, proposes a novel optimizer (AdamO) with principled regularization, and reinterprets prior methods through a unifying lens. It addresses the fundamental and broadly relevant problem of continual learning plasticity with both theoretical grounding and practical solutions across supervised and RL domains. Paper 2 offers valuable empirical and theoretical insights into Muon's advantages but is more descriptive/analytical of an existing optimizer rather than introducing a fundamentally new framework or method.

claude-opus-4-6 · Jun 9, 2026

Won vs. Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

Paper 1 addresses a broadly impactful topic—understanding why the Muon optimizer produces more robust and transferable features than Adam/SGD—relevant to the entire deep learning community working with LLMs, vision models, and beyond. It combines extensive empirical analysis across architectures with theoretical guarantees, offering insights that could influence optimizer selection in mainstream practice. Paper 2, while interesting, addresses a narrower problem (molecular force prediction) with a single minimal testbed (NaCl aqueous system), limiting its immediate breadth of impact and generalizability.

claude-opus-4-6 · Jun 9, 2026

Won vs. Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

Paper 1 addresses a fundamental question about optimizer behavior in deep learning—why Muon produces more robust and transferable features than Adam/SGD—with both empirical evidence across architectures and theoretical guarantees. Given the centrality of optimizers in modern ML and the rapid adoption of Muon for LLM pretraining, this work has broad implications across all of deep learning. Paper 2, while methodologically sound, addresses a niche application in football analytics with limited dataset (7 matches) and narrower cross-disciplinary impact.

claude-opus-4-6 · Jun 9, 2026

Lost vs. TriForces: Augmenting Atomistic GNNs for Transferable Representations

Paper 1 likely has higher impact due to stronger novelty and clearer real-world applicability: a model-agnostic, self-supervised framework targeting a key bottleneck in ML interatomic potentials—transfer under limited expensive DFT labels—validated across major materials/chemistry benchmarks with large reported error reductions and released pretrained models. Its impact spans computational chemistry, materials science, and representation learning. Paper 2 is timely and useful for ML optimization practice, but the incremental nature of optimizer comparisons and potentially narrower downstream novelty (given many robustness/transfer studies) suggests comparatively lower scientific impact.

gpt-5.2 · Jun 9, 2026

Won vs. Rethinking the Divergence Regularization in LLM RL

Paper 2 investigates fundamental properties of the Muon optimizer (robustness, transferability) with both empirical and theoretical analysis across multiple architectures. This has broader impact since optimizer choice affects all of deep learning, not just LLM RL. Paper 1 proposes an incremental improvement (DRPO) over existing trust-region methods (DPPO/PPO) for LLM RL—a narrower scope. Paper 2's insights into why Muon learns better features (margins, effective rank) provide foundational understanding applicable across vision and language, with stronger potential to influence optimizer design and model training practices broadly.

claude-opus-4-6 · Jun 9, 2026

Won vs. iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Paper 1 investigates fundamental optimization mechanisms in deep learning, demonstrating that the new Muon optimizer outperforms the ubiquitous Adam optimizer in feature robustness and transferability. Because optimization is central to all deep learning models (LLMs, CNNs), a proven improvement here has a massive, field-wide impact. Paper 2 introduces a valuable but more narrowly focused benchmark for smartphone AI agents, which, while highly relevant for applied AI, does not match the fundamental breadth and foundational theoretical contribution of Paper 1.

gemini-3.1-pro-preview · Jun 9, 2026
