Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Kirato Yoshihara

Jun 11, 2026 · arXiv:2606.13276v1

cs.LG · cs.AI

#4382 of 5669 · cs.LG

Tournament Score: 1327 ± 48

Win Rate: 32%

Rating: 4.8 / 10

Significance 5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Abstract

Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates whether different transformer modules (attention vs. MLP/FFN blocks) benefit from different weight-space manifold geometries during optimization. Using Manifold Muon on GPT-2 small pretraining, the authors compare five configurations combining Stiefel and DGram constraints across module types. The central finding is an asymmetry: Stiefel on attention + DGram on MLP ("HETERO") performs best, while any configuration placing DGram on attention becomes unstable under shared hyperparameters. The paper provides a mechanistic explanation rooted in singular value growth, logit amplification, and softmax saturation.

The contribution is conceptually clean and addresses a genuine gap: prior work on manifold-constrained optimization applies constraints uniformly, and this paper asks whether module-specific assignment matters. The answer—yes, and in a predictable direction—is intuitive in retrospect but has not been systematically demonstrated before.

Methodological Rigor

Strengths in experimental design: The paper controls for confounds reasonably well by fixing all hyperparameters across configurations, isolating the effect of manifold assignment. The instability criterion (Appendix A) is operationally defined with four specific conditions, which is more rigorous than many papers that simply describe runs as "unstable."

Weaknesses:

The experiments are limited to a single model (GPT-2 small, ~124M parameters), a single dataset (OpenWebText), and apparently a single random seed per configuration. The authors acknowledge this, but it substantially limits the strength of the empirical claims.

The validation loss differences between stable configurations are small (HETERO: 3.3544 vs. ALL-STIEFEL: 3.3679 vs. UNCONSTRAINED: 3.3855). Without multiple seeds and confidence intervals, it's difficult to assess statistical significance of the performance ordering among stable runs.

A critical confound is acknowledged but not resolved: DGram-managed weights receive no weight decay or explicit scale control, while Stiefel inherently controls scale. The instability could potentially be addressed by adding weight decay to DGram parameters or retuning hyperparameters, which would weaken the paper's main narrative.
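To make the size of the gaps among the stable runs concrete, the short sketch below recomputes them from the validation losses quoted above. The conversion to a perplexity ratio assumes the reported losses are mean cross-entropy in nats, which the assessment does not state; the helper names are illustrative only.

```python
import math

# Validation losses as quoted in the assessment above.
val_loss = {
    "HETERO": 3.3544,
    "ALL-STIEFEL": 3.3679,
    "UNCONSTRAINED": 3.3855,
}

def loss_gap(better, worse):
    """Absolute validation-loss gap between two configurations."""
    return val_loss[worse] - val_loss[better]

def perplexity_ratio(better, worse):
    """Perplexity ratio, ASSUMING the losses are mean cross-entropy in nats."""
    return math.exp(loss_gap(better, worse))

pairs = [
    ("HETERO", "ALL-STIEFEL"),
    ("ALL-STIEFEL", "UNCONSTRAINED"),
    ("HETERO", "UNCONSTRAINED"),
]
for better, worse in pairs:
    gap = loss_gap(better, worse)
    ratio = perplexity_ratio(better, worse)
    print(f"{better} vs {worse}: gap={gap:.4f}, perplexity ratio={ratio:.4f}")
```

Under that assumption, the HETERO vs. ALL-STIEFEL gap corresponds to a perplexity difference of roughly 1.4 percent, which is the scale at which seed-to-seed variation would need to be ruled out.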

Theoretical analysis: The propositions (4.1 and 4.2) are mathematically correct but relatively straightforward observations. Proposition 4.2 shows DGram *admits* gradient-degenerate directions—an existence result, not a dynamical inevitability. The gap between "DGram permits large singular values" and "DGram attention will become unstable in practice" is bridged primarily by the empirical observation, not by theory. The authors are appropriately careful about this distinction.
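For contrast with the unbounded singular values discussed above, the following sketch illustrates why a Stiefel constraint fixes scale: a matrix with orthonormal columns (W^T W = I) preserves the norm of every input. Gram-Schmidt is used here only as a simple stand-in for producing a point on the Stiefel manifold; it is not claimed to be the projection or retraction used by Manifold Muon, and the 8x4 shape, the scale factor of 10, and all helper names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormalize(columns):
    """Modified Gram-Schmidt: orthonormal columns spanning the same space."""
    basis = []
    for col in columns:
        v = list(col)
        for b in basis:
            coeff = dot(v, b)
            v = [vi - coeff * bi for vi, bi in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vi / norm for vi in v])
    return basis

def matvec(columns, x):
    """Compute W x, where W is stored as a list of its columns."""
    rows = len(columns[0])
    return [sum(col[i] * xj for col, xj in zip(columns, x)) for i in range(rows)]

def gain(columns, x):
    """Ratio ||W x|| / ||x||, i.e. how much W amplifies this input."""
    y = matvec(columns, x)
    return math.sqrt(dot(y, y) / dot(x, x))

rng = random.Random(0)
# Hypothetical 8x4 weight (4 columns of length 8) with a deliberately large scale.
raw = [[10.0 * rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
stiefel = orthonormalize(raw)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]

# Gram matrix W^T W of the constrained weight; should be the identity.
gram = [[dot(a, b) for b in stiefel] for a in stiefel]

print("gain of unconstrained weight:", gain(raw, x))
print("gain of Stiefel weight:", gain(stiefel, x))
```

Because the constrained matrix has all singular values equal to one, its gain is fixed regardless of how large the unconstrained weight was, which is the sense in which Stiefel "inherently controls scale."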

Potential Impact

The practical implications are modest but directionally useful. For practitioners using Manifold Muon or similar manifold-constrained optimizers, the finding provides a concrete recommendation: use Stiefel for attention, DGram for MLP. More broadly, the paper contributes to the growing understanding that transformer components are not interchangeable from an optimization perspective.

The mechanistic insight about softmax saturation from singular value growth connects to broader concerns about attention logit scaling (e.g., the √d normalization in standard attention, logit scaling in various attention variants). This could inform future optimizer design and manifold selection criteria.
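The pathway from singular value growth to softmax saturation can be illustrated with a small numerical sketch. Attention logits are bilinear in the query and key projections, so multiplying both projections by a common factor c multiplies every logit by c squared. The base logits below are made-up values for a single query over five keys, and `attention_stats` is a hypothetical helper, not code from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def attention_stats(base_logits, c):
    """Scale W_Q and W_K each by c, so every logit q.k is scaled by c**2."""
    probs = softmax([c * c * z for z in base_logits])
    return max(probs), entropy(probs)

# Hypothetical pre-softmax logits for one query attending over five keys.
base_logits = [0.9, 0.4, 0.1, -0.3, -0.8]
scales = [1.0, 2.0, 4.0, 8.0]
stats = [attention_stats(base_logits, c) for c in scales]

for c, (p_max, h) in zip(scales, stats):
    print(f"c={c:>4}: max attention weight={p_max:.6f}, entropy={h:.6f}")
```

As c grows, the attention distribution collapses toward a single key and its entropy approaches zero, which is the saturated regime where softmax gradients become very small.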

However, the impact is limited by the narrow experimental scope. The finding may not generalize to other architectures (e.g., vision transformers, encoder-decoder models), other scales, or other manifold-constrained optimizers beyond Manifold Muon.

Timeliness & Relevance

The paper is well-timed. Muon and its manifold variants are recent developments (2024-2026), and there is active interest in understanding when and how manifold constraints benefit large-scale training. The question of module-specific optimization is increasingly relevant as transformers scale and as the community moves beyond one-size-fits-all optimizer configurations (e.g., using different learning rates for different parameter groups is already common practice).

The paper also connects to the growing literature on weight-space symmetries, making it relevant for the workshop context.

Strengths

1. Clear, well-posed research question: The module-wise manifold assignment question is natural and underexplored.

2. Clean experimental design: Shared hyperparameters across configurations provide a controlled comparison.

3. Mechanistic explanation: The singular value growth → logit amplification → softmax saturation pathway is well-articulated and supported by spectral monitoring data.

4. Honest limitations section: The authors are transparent about the scope of their claims.

5. Detailed appendix: The implementation details (Appendix A) and proof sketches (Appendix C) are thorough and support reproducibility.

Limitations

1. Single scale, single dataset, apparently a single seed: This is the most significant limitation. The empirical evidence is thin for the claims made.

2. Confounded comparison: The lack of weight decay for DGram parameters means the instability could reflect a hyperparameter mismatch rather than a fundamental geometric incompatibility.

3. Limited practical novelty: The performance gap between HETERO and ALL-STIEFEL is small (~0.014 in validation loss), and the main finding is essentially "don't use DGram on attention without scale control," which could be addressed by simpler means than manifold switching.

4. Theoretical results are existence proofs: The propositions show what *can* happen, not what *will* happen, limiting their predictive power.

5. No downstream evaluation: Only validation loss is reported; no downstream task performance or generation quality metrics are provided.

6. Missing ablations: No experiments with weight decay applied to DGram attention, different learning rates for different manifold types, or intermediate scale controls that might disambiguate the mechanism.

Overall Assessment

This is a well-executed workshop paper that poses a sensible question and provides a preliminary but informative answer. The module-specific manifold assignment insight is genuinely useful for the manifold optimization community, and the mechanistic analysis adds value beyond pure empirical observation. However, the narrow experimental scope (single model, single scale, single seed, single dataset) and the unresolved weight-decay confound limit the strength and generalizability of the conclusions. The paper opens an interesting research direction but does not yet provide the evidence needed to establish it as a robust finding.

Rating: 4.8 / 10

Significance 5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Generated Jun 12, 2026

Comparison History (19)

Lost vs. MiniPIC: Flexible Position-Independent Caching in <100LOC

MiniPIC addresses a critical practical problem in LLM inference serving—efficient KV cache reuse for retrieval-augmented and agentic workloads—with a minimal, elegant solution (<100 LOC changes to vLLM). It demonstrates significant throughput improvements (49%) and latency reductions (up to 100x for cached spans) on a production-grade system. Its breadth of impact is larger: it unifies multiple PIC methods, integrates with existing infrastructure, and addresses a bottleneck affecting widespread LLM deployment. Paper 2 provides interesting geometric insights for transformer optimization but is narrower in scope (GPT-2 pretraining) and more incremental in its contributions.

claude-opus-4-6·Jun 12, 2026

Won vs. Accelerating Speculative Diffusions via Block Verification

Paper 1 offers a novel and fundamental insight into weight-space geometry in transformer optimization, demonstrating that different modules benefit from different manifold constraints. This finding has broad implications for optimizer design across all transformer-based models, potentially influencing how future optimizers are built. Paper 2 presents a useful engineering contribution for speeding up diffusion model inference, but the 6.3% speedup is incremental. Paper 1's conceptual contribution—module-specific geometric optimization—opens a new research direction with wider theoretical and practical impact across deep learning.

claude-opus-4-6·Jun 12, 2026

Lost vs. AgentDS Technical Report: Benchmarking the Future of Human-AI Collaboration in Domain-Specific Data Science

Paper 2 likely has higher impact: it introduces a broad, reusable benchmark/competition with open datasets spanning multiple industries, directly enabling standardized evaluation of AI agents and human-AI collaboration—high real-world relevance and timeliness. Benchmarks often catalyze rapid follow-on work across ML, HCI, and applied data science. Paper 1 is novel and mechanistically insightful for transformer optimization, but its contribution is narrower (specific manifold constraints in GPT-2 pretraining) and may have more limited adoption compared to a widely used evaluation suite.

gpt-5.2·Jun 12, 2026

Lost vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Paper 1 addresses a highly timely and impactful question—understanding the mechanics of reinforcement learning post-training for reasoning models, which is central to current LLM development. It provides actionable insights (strategy selection vs. improvement, role of data diversity and difficulty) with direct practical implications for scaling reasoning capabilities. Paper 2 offers interesting findings about module-specific optimization geometry but addresses a more niche topic (weight-space manifold constraints) with narrower applicability, limited to specific optimizer variants and small-scale GPT-2 experiments. Paper 1's broader relevance to the rapidly growing RL-for-reasoning field gives it higher potential impact.

claude-opus-4-6·Jun 12, 2026

Lost vs. MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling

Paper 2 demonstrates a major breakthrough in AI mathematical reasoning, achieving gold-medal thresholds on prestigious competitions like IMO. Its focus on test-time scaling and generative-verifier RL aligns with cutting-edge trends in reasoning, offering profound implications for AGI and automated theorem proving. Paper 1 is methodologically rigorous but its impact is narrower, primarily concerning transformer optimization configurations.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. Towards More General Control of Diffusion Models Using Jeffrey Guidance

Paper 1 offers a principled, general framework (Jeffrey guidance) that broadens diffusion-model control beyond standard conditioning, with clear demonstrations (FID improvements; fairness via attribute independence). This combination of theoretical novelty and broad applicability to controllable generative modeling, evaluation metrics, and responsible AI suggests wide uptake. Paper 2 provides insightful empirical findings for module-specific manifold constraints in transformer optimization, but its scope is narrower (specific to a particular geometry method and GPT-2 setting) and may translate less directly into widely adopted practice than a general diffusion guidance framework.

gpt-5.2·Jun 12, 2026

Won vs. Variance-Gated Ensembles: An Epistemic-Aware Framework for Uncertainty Estimation

Paper 2 has higher potential impact due to its timeliness and breadth: transformer optimization affects many domains, and the module-specific geometry finding could influence how manifold constraints, regularization, and optimization are designed for large-scale pretraining. It offers a concrete, mechanistic explanation (singular value growth → logit amplification → softmax saturation) and actionable guidance (different geometries for attention vs MLP). Paper 1 is useful and practical for uncertainty estimation, but its scope is narrower and more incremental relative to established ensemble-uncertainty literature.

gpt-5.2·Jun 12, 2026

Lost vs. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Paper 2 introduces SWITCH, a novel framework that solves two fundamental problems in latent reasoning—RL trainability and interpretability—with a single elegant mechanism (boundary tokens). It bridges latent chain-of-thought reasoning with standard on-policy RL, opening a new research direction with broad implications for efficient reasoning in LLMs. The mechanistic analysis adds scientific depth. Paper 1, while offering useful empirical insights about module-specific manifold constraints, is more incremental—it characterizes geometry preferences for a specific optimizer (Manifold Muon) on GPT-2, with narrower scope and applicability.

claude-opus-4-6·Jun 12, 2026

Won vs. Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

Paper 1 addresses a fundamental question in deep learning optimization—whether different transformer modules benefit from different manifold geometries—providing novel insights into weight-space geometry that could influence how large language models are trained. This has broad implications across all transformer-based architectures. Paper 2 presents a solid but incremental contribution combining known techniques (ordinal regression, multimodal fusion, attention mechanisms) for AD staging, with moderate performance improvements. Paper 1's novelty in module-specific geometric optimization and its potential to reshape training practices for widely-used architectures gives it higher impact potential.

claude-opus-4-6·Jun 12, 2026

Lost vs. Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

Paper 2 addresses a broadly important problem—how citation signals influence LLM hallucination—with a large-scale, rigorously designed benchmark (220K+ prompts, factorial design, multiple domains and models). Its findings that citations increase hallucination rates have immediate implications for RAG systems, AI safety, and deployment practices across many fields. It also releases a reusable benchmark. Paper 1, while technically interesting, addresses a narrower optimization question (module-specific manifold constraints for GPT-2 training) with more incremental findings relevant primarily to the optimization community.

claude-opus-4-6·Jun 12, 2026
