Kirato Yoshihara
Weight-space geometry plays a central role in neural network optimization, yet manifold constraints are often applied uniformly across all weight matrices. In this work, we ask whether different transformer modules prefer different manifold geometries. We study Manifold Muon for GPT-2 pretraining and compare layer-wise assignments of Stiefel and DGram constraints across attention and MLP blocks. Our results show a clear asymmetry: constraining attention layers with Stiefel geometry while assigning DGram geometry to MLP layers gives the best performance among the tested configurations, whereas the inverted assignment and all-DGram configuration become unstable under the shared hyperparameter setting. We trace this failure to singular value growth in DGram-constrained attention weights, which can amplify attention logits and induce softmax saturation. These findings suggest that symmetry-aware and geometry-aware optimization for transformers should be module-specific rather than uniform.
This paper investigates whether different transformer modules (attention vs. MLP/FFN blocks) benefit from different weight-space manifold geometries during optimization. Using Manifold Muon on GPT-2 small pretraining, the authors compare five configurations combining Stiefel and DGram constraints across module types. The central finding is an asymmetry: Stiefel on attention + DGram on MLP ("HETERO") performs best, while any configuration placing DGram on attention becomes unstable under shared hyperparameters. The paper provides a mechanistic explanation rooted in singular value growth, logit amplification, and softmax saturation.
The contribution is conceptually clean and addresses a genuine gap: prior work on manifold-constrained optimization applies constraints uniformly, and this paper asks whether module-specific assignment matters. The answer—yes, and in a predictable direction—is intuitive in retrospect but has not been systematically demonstrated before.
Strengths in experimental design: The paper controls for confounds reasonably well by fixing all hyperparameters across configurations, isolating the effect of manifold assignment. The instability criterion (Appendix A) is operationally defined with four specific conditions, which is more rigorous than many papers that simply describe runs as "unstable."
Theoretical analysis: The propositions (4.1 and 4.2) are mathematically correct but relatively straightforward observations. Proposition 4.2 shows DGram *admits* gradient-degenerate directions—an existence result, not a dynamical inevitability. The gap between "DGram permits large singular values" and "DGram attention will become unstable in practice" is bridged primarily by the empirical observation, not by theory. The authors are appropriately careful about this distinction.
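The distinction between a constraint that pins the spectrum and one that merely permits large singular values can be illustrated with a small numerical sketch. A Stiefel-constrained matrix (orthonormal columns) has all singular values equal to 1. The review does not define DGram here, so the second constraint below is a hypothetical stand-in (unit-norm columns, i.e. a Gram matrix with unit diagonal) chosen only to show how a constraint can be satisfied while the top singular value grows. It should not be read as the paper's definition or as a reproduction of Proposition 4.2.

```python
import numpy as np

def project_to_stiefel(w):
    """Return the polar factor of w, which has orthonormal columns."""
    u, _, vt = np.linalg.svd(w, full_matrices=False)
    return u @ vt

def normalize_columns(w):
    """Hypothetical stand-in constraint: rescale every column to unit norm."""
    return w / np.linalg.norm(w, axis=0, keepdims=True)

rng = np.random.default_rng(0)
n_rows, n_cols = 64, 16

# Stiefel: every singular value is 1, so the spectral norm cannot grow.
w_stiefel = project_to_stiefel(rng.standard_normal((n_rows, n_cols)))
stiefel_svals = np.linalg.svd(w_stiefel, compute_uv=False)

# Stand-in constraint: nearly parallel unit-norm columns still satisfy it,
# yet the top singular value approaches sqrt(n_cols).
base = rng.standard_normal((n_rows, 1))
w_cols = normalize_columns(base + 1e-3 * rng.standard_normal((n_rows, n_cols)))
cols_svals = np.linalg.svd(w_cols, compute_uv=False)

print("Stiefel singular value range:", stiefel_svals.min(), stiefel_svals.max())
print("Unit-column top singular value:", cols_svals[0], "bound:", np.sqrt(n_cols))
```

As in the review's reading of the propositions, this is an existence-style illustration: it shows that such a matrix is admissible, not that training dynamics would reach it.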
The practical implications are modest but directionally useful. For practitioners using Manifold Muon or similar manifold-constrained optimizers, the finding provides a concrete recommendation: use Stiefel for attention, DGram for MLP. More broadly, the paper contributes to the growing understanding that transformer components are not interchangeable from an optimization perspective.
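The recommendation can be expressed as a simple name-based routing rule. The sketch below is a minimal illustration, not the authors' implementation: the parameter names follow a common GPT-2 naming style and are assumed, the labels `"stiefel"` and `"dgram"` are hypothetical strings, and treating every other parameter as unconstrained is an assumption, since the review does not say how embeddings or normalization weights are handled.

```python
def assign_manifold(param_name):
    """Map a parameter name to a manifold label following the HETERO recipe.

    Assumption: attention weights contain ".attn." and MLP weights contain
    ".mlp." in their names; anything else is left unconstrained.
    """
    if not param_name.endswith(".weight"):
        return "unconstrained"
    if ".attn." in param_name:
        return "stiefel"
    if ".mlp." in param_name:
        return "dgram"
    return "unconstrained"

# Hypothetical parameter names in a GPT-2-like naming style.
example_names = [
    "h.0.attn.c_attn.weight",
    "h.0.attn.c_proj.weight",
    "h.0.mlp.c_fc.weight",
    "h.0.mlp.c_proj.weight",
    "h.0.ln_1.weight",
    "wte.weight",
]
assignment = {name: assign_manifold(name) for name in example_names}

for name, manifold in assignment.items():
    print(f"{name:28s} -> {manifold}")
```

Routing by name keeps the assignment in one place, which makes it easy to swap in the inverted or uniform configurations for comparison.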
The mechanistic insight about softmax saturation from singular value growth connects to broader concerns about attention logit scaling (e.g., the √d normalization in standard attention, logit scaling in various attention variants). This could inform future optimizer design and manifold selection criteria.
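The link between logit scale and saturation can be seen in a toy setting. The sketch below uses random queries and keys with the standard √d normalization and multiplies the queries by a `gain` factor, a stand-in for uniformly larger singular values in the query projection. The dimensions, gains, and random inputs are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_entropy(p):
    """Average Shannon entropy (in nats) of the rows of p."""
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

rng = np.random.default_rng(0)
seq_len, d = 32, 64
q = rng.standard_normal((seq_len, d))
k = rng.standard_normal((seq_len, d))

results = {}
for gain in (1.0, 4.0, 16.0, 64.0):
    # Scaling q by `gain` scales every attention logit by the same factor.
    logits = (gain * q) @ k.T / np.sqrt(d)
    probs = softmax(logits)
    results[gain] = (mean_entropy(probs), float(probs.max(axis=-1).mean()))

for gain, (entropy, max_prob) in results.items():
    print(f"gain={gain:5.1f}  entropy={entropy:.3f}  mean max prob={max_prob:.3f}")
```

As the gain grows, the attention rows lose entropy and concentrate on a single key, which is the saturation regime the review describes.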
However, the impact is limited by the narrow experimental scope. The finding may not generalize to other architectures (e.g., vision transformers, encoder-decoder models), other scales, or other manifold-constrained optimizers beyond Manifold Muon.
The paper is well-timed. Muon and its manifold variants are recent developments (2024-2026), and there is active interest in understanding when and how manifold constraints benefit large-scale training. The question of module-specific optimization is increasingly relevant as transformers scale and as the community moves beyond one-size-fits-all optimizer configurations (e.g., using different learning rates for different parameter groups is already common practice).
The paper also connects to the growing literature on weight-space symmetries, making it relevant for the workshop context.
1. Clear, well-posed research question: The module-wise manifold assignment question is natural and underexplored.
2. Clean experimental design: Shared hyperparameters across configurations provide a controlled comparison.
3. Mechanistic explanation: The singular value growth → logit amplification → softmax saturation pathway is well-articulated and supported by spectral monitoring data.
4. Honest limitations section: The authors are transparent about the scope of their claims.
5. Detailed appendix: The implementation details (Appendix A) and proof sketches (Appendix C) are thorough and support reproducibility.
1. Single scale, single dataset, and apparently a single seed: This is the most significant limitation. The empirical evidence is thin for the claims made.
2. Confounded comparison: The lack of weight decay for DGram parameters means the instability could reflect a hyperparameter mismatch rather than a fundamental geometric incompatibility.
3. Limited practical novelty: The performance gap between HETERO and ALL-STIEFEL is small (~0.014 in validation loss), and the main finding is essentially "don't use DGram on attention without scale control," which could be addressed by simpler means than manifold switching.
4. Theoretical results are existence proofs: The propositions show what *can* happen, not what *will* happen, limiting their predictive power.
5. No downstream evaluation: Only validation loss is reported; no downstream task performance or generation quality metrics are provided.
6. Missing ablations: No experiments with weight decay applied to DGram attention, different learning rates for different manifold types, or intermediate scale controls that might disambiguate the mechanism.
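To make the weight-decay confound concrete, the sketch below shows why weight decay acts as a scale control: a decoupled decay step multiplies the weight matrix by a constant below 1, so every singular value shrinks by the same factor. It assumes decoupled (AdamW-style) decay and omits the gradient term entirely. The review does not state which form of weight decay the paper uses, and the learning rate, decay coefficient, and step count here are hypothetical.

```python
import numpy as np

def decay_steps(w, lr, weight_decay, steps):
    """Apply only the decoupled weight-decay shrink, with no gradient term."""
    for _ in range(steps):
        w = (1.0 - lr * weight_decay) * w
    return w

rng = np.random.default_rng(0)
w0 = rng.standard_normal((32, 32))
lr, weight_decay, steps = 0.02, 0.1, 100  # hypothetical values

before = np.linalg.svd(w0, compute_uv=False)
after = np.linalg.svd(decay_steps(w0, lr, weight_decay, steps), compute_uv=False)
shrink = (1.0 - lr * weight_decay) ** steps

print("per-run shrink factor:", shrink)
print("top singular value before/after:", before[0], after[0])
```

A run without this shrink term has no comparable pull toward smaller singular values, which is why the ablations listed above would help separate a geometric effect from a missing regularizer.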
This is a well-executed workshop paper that poses a sensible question and provides a preliminary but informative answer. The module-specific manifold assignment insight is genuinely useful for the manifold optimization community, and the mechanistic analysis adds value beyond pure empirical observation. However, the narrow experimental scope (single model, single scale, single seed, single dataset) and the unresolved weight-decay confound limit the strength and generalizability of the conclusions. The paper opens an interesting research direction but does not yet provide the evidence needed to establish it as a robust finding.
Generated Jun 12, 2026
MiniPIC addresses a critical practical problem in LLM inference serving—efficient KV cache reuse for retrieval-augmented and agentic workloads—with a minimal, elegant solution (<100 LOC changes to vLLM). It demonstrates significant throughput improvements (49%) and latency reductions (up to 100x for cached spans) on a production-grade system. Its breadth of impact is larger: it unifies multiple PIC methods, integrates with existing infrastructure, and addresses a bottleneck affecting widespread LLM deployment. Paper 2 provides interesting geometric insights for transformer optimization but is narrower in scope (GPT-2 pretraining) and more incremental in its contributions.
Paper 1 offers a novel and fundamental insight into weight-space geometry in transformer optimization, demonstrating that different modules benefit from different manifold constraints. This finding has broad implications for optimizer design across all transformer-based models, potentially influencing how future optimizers are built. Paper 2 presents a useful engineering contribution for speeding up diffusion model inference, but the 6.3% speedup is incremental. Paper 1's conceptual contribution—module-specific geometric optimization—opens a new research direction with wider theoretical and practical impact across deep learning.
Paper 2 likely has higher impact: it introduces a broad, reusable benchmark/competition with open datasets spanning multiple industries, directly enabling standardized evaluation of AI agents and human-AI collaboration—high real-world relevance and timeliness. Benchmarks often catalyze rapid follow-on work across ML, HCI, and applied data science. Paper 1 is novel and mechanistically insightful for transformer optimization, but its contribution is narrower (specific manifold constraints in GPT-2 pretraining) and may have more limited adoption compared to a widely used evaluation suite.
Paper 1 addresses a highly timely and impactful question—understanding the mechanics of reinforcement learning post-training for reasoning models, which is central to current LLM development. It provides actionable insights (strategy selection vs. improvement, role of data diversity and difficulty) with direct practical implications for scaling reasoning capabilities. Paper 2 offers interesting findings about module-specific optimization geometry but addresses a more niche topic (weight-space manifold constraints) with narrower applicability, limited to specific optimizer variants and small-scale GPT-2 experiments. Paper 1's broader relevance to the rapidly growing RL-for-reasoning field gives it higher potential impact.
Paper 2 demonstrates a major breakthrough in AI mathematical reasoning, achieving gold-medal thresholds on prestigious competitions like IMO. Its focus on test-time scaling and generative-verifier RL aligns with cutting-edge trends in reasoning, offering profound implications for AGI and automated theorem proving. Paper 1 is methodologically rigorous but its impact is narrower, primarily concerning transformer optimization configurations.
Paper 1 offers a principled, general framework (Jeffrey guidance) that broadens diffusion-model control beyond standard conditioning, with clear demonstrations (FID improvements; fairness via attribute independence). This combination of theoretical novelty and broad applicability to controllable generative modeling, evaluation metrics, and responsible AI suggests wide uptake. Paper 2 provides insightful empirical findings for module-specific manifold constraints in transformer optimization, but its scope is narrower (specific to a particular geometry method and GPT-2 setting) and may translate less directly into widely adopted practice than a general diffusion guidance framework.
Paper 2 has higher potential impact due to its timeliness and breadth: transformer optimization affects many domains, and the module-specific geometry finding could influence how manifold constraints, regularization, and optimization are designed for large-scale pretraining. It offers a concrete, mechanistic explanation (singular value growth → logit amplification → softmax saturation) and actionable guidance (different geometries for attention vs MLP). Paper 1 is useful and practical for uncertainty estimation, but its scope is narrower and more incremental relative to established ensemble-uncertainty literature.
Paper 2 introduces SWITCH, a novel framework that solves two fundamental problems in latent reasoning—RL trainability and interpretability—with a single elegant mechanism (boundary tokens). It bridges latent chain-of-thought reasoning with standard on-policy RL, opening a new research direction with broad implications for efficient reasoning in LLMs. The mechanistic analysis adds scientific depth. Paper 1, while offering useful empirical insights about module-specific manifold constraints, is more incremental—it characterizes geometry preferences for a specific optimizer (Manifold Muon) on GPT-2, with narrower scope and applicability.
Paper 1 addresses a fundamental question in deep learning optimization—whether different transformer modules benefit from different manifold geometries—providing novel insights into weight-space geometry that could influence how large language models are trained. This has broad implications across all transformer-based architectures. Paper 2 presents a solid but incremental contribution combining known techniques (ordinal regression, multimodal fusion, attention mechanisms) for AD staging, with moderate performance improvements. Paper 1's novelty in module-specific geometric optimization and its potential to reshape training practices for widely-used architectures gives it higher impact potential.
Paper 2 addresses a broadly important problem—how citation signals influence LLM hallucination—with a large-scale, rigorously designed benchmark (220K+ prompts, factorial design, multiple domains and models). Its findings that citations increase hallucination rates have immediate implications for RAG systems, AI safety, and deployment practices across many fields. It also releases a reusable benchmark. Paper 1, while technically interesting, addresses a narrower optimization question (module-specific manifold constraints for GPT-2 training) with more incremental findings relevant primarily to the optimization community.