Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin
The router is a cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row encodes its expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, no design principle exists to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of the associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
The paper identifies a gap in conventional MoE router design: there is no explicit mechanism ensuring that each row of the router matrix faithfully represents its corresponding expert's characteristics. The authors propose aligning each router row with the principal singular direction of the associated expert weight matrix, arguing this direction maximally preserves the expert's information content. They implement this via Manifold Power Iteration (MPI), a "Power-then-Retract" paradigm where a single power iteration step is applied to the router weights at each training step, followed by L2 norm retraction for stability.
The key insight is elegant: if the router row serves as a proxy for an expert, it should capture the most informative direction of that expert's weight matrix. The principal singular direction is the natural candidate from linear algebra. Rather than computing expensive SVD at every step, a single power iteration gradually steers router rows toward this direction across training.
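The "Power-then-Retract" idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `power_then_retract`, the shapes, and the convention that an expert maps an input x to W @ x (so the router row lives in the input space and targets the principal right singular vector) are all assumptions made for illustration. The expert matrix is also held fixed here, whereas in training it would change between steps and only one power step would be applied per training step.

```python
import numpy as np

def power_then_retract(router_row, expert_weight):
    """One hypothetical MPI-style update for a single expert.

    router_row:    shape (d_model,), the router row acting as the expert's proxy.
    expert_weight: shape (d_out, d_model), assumed to map inputs x to W @ x.
    """
    # Power step: multiply by W^T W, whose top eigenvector is the
    # principal right singular vector of W.
    powered = expert_weight.T @ (expert_weight @ router_row)
    # Retraction: rescale back onto the unit sphere (L2 norm constraint).
    return powered / np.linalg.norm(powered)

rng = np.random.default_rng(0)

# Build a toy expert matrix with a known spectrum so the target is unambiguous.
U, _ = np.linalg.qr(rng.standard_normal((8, 6)))   # orthonormal columns, 8x6
V, _ = np.linalg.qr(rng.standard_normal((6, 6)))   # orthogonal, 6x6
singular_values = np.array([3.0, 2.0, 1.5, 1.0, 0.5, 0.25])
W = U @ np.diag(singular_values) @ V.T
v1 = V[:, 0]  # principal right singular vector, used only to measure alignment

# Random unit-norm starting router row.
r = rng.standard_normal(6)
r /= np.linalg.norm(r)

initial_alignment = abs(r @ v1)
for _ in range(50):
    r = power_then_retract(r, W)
final_alignment = abs(r @ v1)

print(f"alignment before: {initial_alignment:.4f}, after: {final_alignment:.6f}")
```

In this toy setting the alignment (absolute cosine with the principal direction) approaches 1 because the gap between the top two singular values is large; with a nearly degenerate spectrum convergence would be much slower.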
Theoretical grounding: The paper provides a reasonable theoretical framework connecting MPI to steepest ascent optimization of the Rayleigh quotient on a spherical manifold. The derivation showing structural alignment between the MPI update (Eq. 10) and gradient ascent on the manifold (Eq. 9) is convincing, though the approximation relies on the assumption that the orthogonal component becomes negligible as convergence proceeds—which is a somewhat circular argument during early training when alignment is low.
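For reference, the textbook relation between a normalized power step and Rayleigh-quotient ascent on the unit sphere can be written out as below. The notation is assumed for illustration (A = WᵀW for an expert matrix W, a unit-norm router row r, step size η) and is not copied from the paper's Eqs. 9 and 10, which may take a different form or include additional terms.

```latex
\begin{aligned}
\rho(r) &= r^\top A r, \qquad A = W^\top W,\quad \|r\|_2 = 1, \\
\operatorname{grad}\rho(r) &= 2\,(I - r r^\top)\,A r \;=\; 2\,\bigl(A r - \rho(r)\,r\bigr), \\
r + \eta\,\operatorname{grad}\rho(r) &= \frac{A r}{\rho(r)}
  \quad\text{for } \eta = \frac{1}{2\rho(r)},\ \rho(r) > 0, \\
\frac{A r}{\|A r\|_2} &= \frac{r + \eta\,\operatorname{grad}\rho(r)}{\bigl\|\,r + \eta\,\operatorname{grad}\rho(r)\bigr\|_2}.
\end{aligned}
```

In this idealized setting, with A held fixed, one power step followed by normalization coincides with one projected-gradient ascent step using the step size η = 1/(2ρ(r)).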
Experimental design: The experiments span 1B to 11B parameters, which is a respectable scale for an academic paper. The optimizer-agnostic evaluation across AdamW, Muon, AdamH, and MuonH strengthens the claim that improvements are intrinsic to the router design. The ablation studies effectively isolate the contributions of power iteration versus retraction.
Weaknesses in rigor:
Direct applications: MPI is a drop-in replacement for standard MoE routers with negligible training overhead (0.2% slowdown) and zero inference overhead (router weights can be pre-computed). This low barrier to adoption could facilitate widespread use.
Broader influence: The paper opens an interesting research direction—mathematically principled router design informed by expert weight structure. This could inspire further work connecting routing decisions to expert parameterization, potentially leading to better expert specialization or more efficient expert pruning.
Improved load balancing as a side effect is a notable practical benefit, since load imbalance is a persistent pain point in MoE deployment.
However, the improvements, while consistent, are modest. The 11B downstream improvements (Table 3) show GSM8K jumping from 17.89 to 27.60 (notable), but other metrics show smaller gains. The ~1.04× faster convergence claim, while meaningful at scale, is not transformative.
MoE architectures are central to current frontier model development (DeepSeek, GLM-5, GPT-oss). Router design is a known bottleneck—expert collapse, load imbalance, and suboptimal routing remain active research areas. The paper addresses a real need, though it's worth noting that production systems increasingly use more sophisticated routing mechanisms (shared experts, auxiliary-loss-free routing) that may complicate the applicability of this specific approach.
The retraction mechanism deserves more attention as an independent contribution. The ablation shows it alone achieves similar load balancing improvements, suggesting the balance benefits may be decoupled from the alignment benefits. The paper would benefit from a cleaner decomposition of these two effects.
The connection between power iteration and Rayleigh quotient optimization is well known in numerical linear algebra, but its application to MoE routing is novel. The adaptive step-size property (larger updates when misaligned, smaller when aligned) is a practically valuable feature.
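The adaptive step-size behaviour can be seen in a small numerical sketch. It measures the component of A @ r that is orthogonal to r, which is the part of a power step that actually rotates the vector. The 2x2 matrix, the helper name `tangential_update_norm`, and the chosen angles are hypothetical and serve only as an illustration, not as the paper's analysis.

```python
import numpy as np

def tangential_update_norm(r, A):
    """Norm of the component of A @ r orthogonal to the unit vector r."""
    Ar = A @ r
    # Subtract the part of A @ r that points along r (Rayleigh quotient times r).
    return np.linalg.norm(Ar - (r @ Ar) * r)

# Hypothetical example: A = diag(4, 1), so the principal direction is e1.
A = np.diag([4.0, 1.0])
angles_deg = [0.0, 5.0, 20.0, 45.0]  # angle between r and the principal direction

norms = []
for deg in angles_deg:
    theta = np.deg2rad(deg)
    r = np.array([np.cos(theta), np.sin(theta)])  # unit vector at angle theta from e1
    norms.append(tangential_update_norm(r, A))

for deg, n in zip(angles_deg, norms):
    print(f"misalignment {deg:5.1f} deg -> tangential update norm {n:.4f}")
```

For this matrix the tangential norm equals 1.5·sin(2θ): it vanishes at perfect alignment and grows with misalignment up to 45 degrees. Beyond that it shrinks again as r approaches the orthogonal eigenvector, which is also a stationary point of the Rayleigh quotient, so the "larger when misaligned" description applies to the regime approaching the principal direction.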
Generated Jun 11, 2026
Paper 1 addresses a fundamental architectural component of large language models (MoE routers) with a mathematically rigorous redesign. Its theoretical grounding combined with large-scale empirical validation (up to 11B parameters) suggests it could broadly influence foundational model architectures. While Paper 2 offers a practical system for AI agents, Paper 1's contribution is more fundamental, potentially improving the efficiency and performance of a wide range of future foundation models, leading to greater scientific and practical impact.
Paper 2 likely has higher impact: it targets Mixture-of-Experts routing, a central bottleneck in scaling foundation models, making it timely and broadly relevant across NLP/vision systems. The proposed Manifold Power Iteration provides a clear design principle (align routers with experts’ principal singular directions) with theoretical convergence arguments and large-scale (1B–11B) pretraining evidence, suggesting methodological rigor and immediate applicability. Paper 1 is novel and practical for irregular time series, but its impact is more domain-specific and depends on adoption of a new benchmark/metric.
Paper 1 introduces a fundamentally new learning-theoretic framework (simulatable processes) that bridges a major gap between classical PAC learning with independent data and learning under arbitrary dependencies. It provides deep theoretical contributions—recovering VC dimension-based guarantees without independence, demonstrating strict advantages of conditional sampling, and connecting to time-bounded Kolmogorov complexity. This broadens foundational understanding in learning theory with wide-reaching implications. Paper 2, while practically useful, offers an incremental architectural improvement to MoE routers with narrower scope of impact.
Paper 1 introduces a fundamental architectural and theoretical innovation to Mixture-of-Experts (MoE) models, offering a principled method (Manifold Power Iteration) to improve router efficiency. Because MoEs are the backbone of modern large-scale AI models, this foundational improvement has a broader, more pervasive potential impact across all LLM development compared to Paper 2, which focuses on advanced search and test-time scaling strategies for a specific domain (mathematical proofs).
Paper 1 addresses a fundamental architectural component of Mixture-of-Experts models, which are central to modern large-scale AI systems (e.g., GPT-4, Mixtral). The principled redesign of MoE routers via manifold power iteration offers a novel theoretical framework with broad applicability across all MoE-based architectures, validated at scales up to 11B parameters. Paper 2 makes a useful but more incremental contribution to LLM evaluation methodology. While practically valuable, it operates in a narrower domain (benchmarking/evaluation) with less potential to influence core model architectures and training paradigms.
Paper 2 addresses the fundamental and highly active question of efficient alternatives to quadratic attention in Transformers, comparing leading subquadratic architectures (xLSTM, Mamba-2, Gated DeltaNet) across diverse practical tasks and providing principled theoretical analysis. This has broader impact across NLP, time-series, and efficient ML. Paper 1, while technically sound, addresses a narrower optimization of MoE router design. Paper 2's breadth of applications, timeliness given the efficiency scaling crisis, and potential to guide future architecture design give it higher impact potential.
Paper 2 likely has higher impact: it proposes a novel, general router-design principle for Mixture-of-Experts with a theoretically motivated algorithm (Manifold Power Iteration) and validation across large-scale (1B–11B) pretraining, aligning with timely, high-interest foundation-model scaling. Its contributions could broadly affect NLP and systems/optimization communities and improve widely used MoE architectures. Paper 1 is rigorous and clinically relevant, but is more incremental within established multimodal AD staging and may have narrower cross-field influence and higher translational barriers.
Paper 1 proposes a concrete, theoretically grounded, and empirically validated enhancement to Mixture-of-Experts models, which are central to state-of-the-art large language models. Its successful application at scale (up to 11B parameters) promises high immediate real-world utility. Paper 2, while offering valuable conceptual clarity, lacks the immediate, broad technological applicability of Paper 1.
Paper 1 offers a foundational theoretical contribution to Mixture-of-Experts (MoE) architectures, a critical area in scaling LLMs. By introducing a mathematically rigorous design principle (Manifold Power Iteration) and demonstrating effectiveness at scale (up to 11B parameters), it has the potential to influence core architectural designs broadly. While Paper 2 presents a valuable practical optimization for inference speed, Paper 1's methodological novelty and deep implications for training efficiency and model capacity give it a higher potential for long-term scientific impact.
Paper 1 is more novel in framing unlabeled multi-policy behavior as an implicit neural representation problem with episode-level latents, yielding a general generative prior over policies and addressing variable-length/sampling naturally. It targets a broad, timely need in robotics and behavioral datasets (play, demos, games, racing) and introduces a new notion of policy-level OOD shifts beyond standard agent/env OOD. The evaluation spans many domains including real-world datasets, suggesting wide applicability. Paper 2 is a solid, rigorous MoE optimization/design contribution, but its impact is narrower (router parameterization) and incremental relative to ongoing MoE engineering.