When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

Youngsik Yoon, Siwei Wang, Wei Chen, Jungseul Ok

May 8, 2026 · arXiv:2605.07260v1

cs.LG, cs.CL

#792 of 5669 · cs.LG

Tournament Score: 1485 ± 42 (scale 1050 to 1750)

Win Rate: 88%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Abstract

Mixture-of-Experts (MoE) language models route each token to a small subset of experts, but whether the routes selected by a trained top-$k$ router are good ones is rarely evaluated directly. Holding the model fixed, we compare each standard route against sampled equal-compute alternatives for the same token and score each by the next-token probability it assigns to the realized token in a verified reasoning trajectory. The result is sharply token-conditional: the standard router is well-aligned with route utility on confident tokens but uninformative on the fragile tokens that drive hard reasoning, where lower-loss equal-compute routes consistently exist inside the frozen model but are not selected. The same pattern holds across Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, and OLMoE-1B-7B, and follows structurally from how standard top-$k$ training evaluates routing decisions: the language modeling loss scores only the executed route, and load balancing depends only on aggregate routing statistics. A minimal router-only update to the final-layer router, leaving every expert and every other router frozen, is sufficient to shift pass@K on AIME 2024+2025 and HMMT 2025 for both Qwen3-30B-A3B and GPT-OSS-20B, suggesting that at least part of the failure reflects router-reachable misallocation rather than expert capacity alone.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a counterfactual routing analysis framework for evaluating whether the expert routes selected by trained top-k routers in Mixture-of-Experts (MoE) language models are actually good choices. The key insight is methodologically clean: hold the model frozen, sample equal-compute alternative routes via Gumbel-top-k perturbation, and compare their next-token probabilities against the standard route on verified reasoning trajectories. The main finding is that routing quality is sharply token-conditional — the standard router performs well on confident tokens (where routing barely matters) but becomes essentially uninformative on "fragile" tokens (where route choice is most consequential), with the best sampled alternative outperforming the standard route by ~20 percentage points in next-token probability.
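The sampling step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the random logits, and the choice of unit-scale Gumbel noise are assumptions, while the pool size of 32, the 32 samples, and the 128-expert, 8-active configuration follow figures quoted elsewhere in this review.

```python
import numpy as np

def gumbel_top_k_routes(logits, k, pool_size, num_samples, rng):
    """Sample alternative k-expert routes for one token (hypothetical helper).

    Router logits of the `pool_size` highest-scoring experts are perturbed
    with i.i.d. Gumbel(0, 1) noise and the k largest are kept. This draws
    k experts without replacement from the softmax over the pool.
    """
    logits = np.asarray(logits, dtype=float)
    pool = np.argsort(-logits)[:pool_size]            # candidate expert ids
    noise = rng.gumbel(size=(num_samples, pool_size))
    perturbed = logits[pool] + noise
    picks = np.argsort(-perturbed, axis=1)[:, :k]     # positions inside the pool
    return np.sort(pool[picks], axis=1)               # expert ids per route

def standard_route(logits, k):
    """Deterministic top-k route that the trained router would execute."""
    return np.sort(np.argsort(-np.asarray(logits, dtype=float))[:k])

rng = np.random.default_rng(0)
logits = rng.normal(size=128)      # stand-in for one token's router logits
routes = gumbel_top_k_routes(logits, k=8, pool_size=32, num_samples=32, rng=rng)
std = standard_route(logits, k=8)
print(routes.shape, std)
```

Every sampled route activates the same number of experts as the standard route, which is what makes the comparison equal-compute. Scoring each route would additionally require a forward pass through the frozen model with that route forced at the chosen layer.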

The paper also identifies a structural explanation: standard MoE training only evaluates the executed route's loss and load-balancing losses depend on aggregate statistics, creating a "counterfactual blind spot" where the model never learns which alternative routes would have been better for individual tokens. This is formalized cleanly through the gradient analysis showing the indicator function $\mathbf{1}\{j \in S_t^{\text{std}}\}$ zeroes out gradients for unselected experts.
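A sketch of how the indicator arises, under an assumed parameterization that may differ from the paper's: the layer output is $y_t = \sum_{j \in S_t^{\text{std}}} g_{t,j} E_j(x_t)$, the gates are a softmax of router logits $z_{t,j}$ taken over the selected set only, and the selection itself is treated as locally constant. The symbols $g_{t,j}$, $z_{t,j}$, $a_{t,j}$, and $\mathcal{L}_t$ are notation introduced here for illustration.

```latex
g_{t,j} = \frac{\exp(z_{t,j})}{\sum_{l \in S_t^{\text{std}}} \exp(z_{t,l})},
\qquad
a_{t,j} = \left\langle \frac{\partial \mathcal{L}_t}{\partial y_t},\, E_j(x_t) \right\rangle

\frac{\partial \mathcal{L}_t}{\partial z_{t,j}}
= \mathbf{1}\{j \in S_t^{\text{std}}\}\; g_{t,j}
  \Bigl( a_{t,j} - \sum_{l \in S_t^{\text{std}}} g_{t,l}\, a_{t,l} \Bigr)
```

Under these assumptions the logit of an unselected expert receives no gradient from the language modeling loss, so that loss carries no signal about whether swapping the expert in would have helped this token.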

Methodological Rigor

The experimental design is well-constructed. The stratification into Confident/Ambiguous/Fragile bins based on route-averaged probability is a natural and informative decomposition that reveals structure hidden by aggregate metrics. The use of 32 Gumbel-top-k samples from the top-32 scoring experts provides reasonable coverage of the alternative route space while maintaining computational tractability.
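The stratification can be illustrated with a small sketch. The bin thresholds below are hypothetical placeholders (this review does not state the paper's cutoffs), as are the function names and the toy probabilities; the sketch also assumes that a high route-averaged probability corresponds to the Confident bin and a low one to the Fragile bin.

```python
import numpy as np

# Hypothetical cutoffs, chosen only for illustration.
CONFIDENT_MIN = 0.9
FRAGILE_MAX = 0.5

def stratify(route_probs):
    """Bin tokens by the probability of the realized token averaged over routes.

    route_probs has shape (num_tokens, num_routes).
    """
    mean_p = route_probs.mean(axis=1)
    labels = np.where(
        mean_p >= CONFIDENT_MIN, "confident",
        np.where(mean_p < FRAGILE_MAX, "fragile", "ambiguous"),
    )
    return mean_p, labels

def best_alternative_gap(std_probs, alt_probs):
    """Gap between the best sampled route and the standard route, per token."""
    return alt_probs.max(axis=1) - std_probs

# Toy numbers, not results from the paper.
alt_probs = np.array([
    [0.98, 0.97, 0.99],
    [0.70, 0.60, 0.80],
    [0.10, 0.40, 0.30],
])
std_probs = np.array([0.98, 0.65, 0.15])

mean_p, labels = stratify(alt_probs)
gap = best_alternative_gap(std_probs, alt_probs)
for label, g in zip(labels, gap):
    print(f"{label:10s} best-alternative gap = {g:+.2f}")
```

Reporting the gap per bin rather than as one aggregate is what exposes the token-conditional pattern: an average over all tokens is dominated by the confident bin, where the gap is close to zero.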

The analysis is thorough along three axes: layers (early/middle/final), domains (MATH, AIME, HMMT, GPQA-Diamond), and model families (Qwen3-30B-A3B, GPT-OSS-20B, DeepSeek-V2-Lite, OLMoE-1B-7B). The consistency of the pattern across all these dimensions substantially strengthens the claim that this is a structural phenomenon rather than a model-specific artifact.

However, there are methodological caveats worth noting:

1. The analysis uses verified-correct trajectories only, meaning every "realized token" lies on a successful reasoning path. This biases the proxy: the realized token may itself have been a fortunate sample, and alternatives assigning higher probability to it aren't necessarily producing better reasoning.

2. The analysis is per-layer with other layers held fixed, so it cannot capture cross-layer interactions in routing failures.

3. The pool of 32 alternatives from the top-32 experts is a reasonable but limited exploration of the combinatorial route space ($\binom{128}{8}$ for Qwen3).

4. The EPO experiment, while clever as an existence proof, shows relatively small pass@K shifts, and the confidence intervals in Figure 3 appear to overlap for some K values.
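To put caveat 3 in numbers, the following arithmetic compares the full route space with the restricted pool and the sample budget. The 128-expert, 8-active configuration and the 32/32 pool and sample sizes come from the text above; the variable names are illustrative.

```python
from math import comb

num_experts, k = 128, 8           # Qwen3 configuration quoted in caveat 3
pool_size, num_samples = 32, 32   # top-32 candidate pool, 32 sampled routes

full_space = comb(num_experts, k)   # all possible 8-expert routes
pool_space = comb(pool_size, k)     # routes reachable inside the top-32 pool

print(f"full route space : {full_space:,}")
print(f"pool route space : {pool_space:,}")
print(f"max pool coverage: {num_samples / pool_space:.2e}")
```

Even inside the restricted pool, 32 samples touch only a tiny fraction of the possible routes, so the reported best-alternative gap is better read as a lower bound on what the frozen model could reach at that layer.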

Potential Impact

This paper opens an important diagnostic lens for MoE architectures that dominate the current frontier model landscape. The practical implications are significant:

For MoE training: The identification of a structural blind spot in standard top-k training suggests a concrete research direction — designing counterfactual-aware training objectives that evaluate alternative routes during pretraining. If feasible at scale, this could meaningfully improve MoE model performance on hard reasoning without increasing parameter count.

For inference-time methods: The finding that better routes exist inside frozen models provides theoretical grounding for recent work on inference-time routing modifications (cited as Chen et al., 2026; Li et al., 2025a,b). It suggests that routing-focused test-time compute strategies could be productive.

For evaluation methodology: The counterfactual routing analysis framework itself is a contribution — it provides a model-agnostic diagnostic tool that can be applied to any MoE model to assess routing quality.

The paper does not yet deliver a scalable solution, which limits immediate practical impact. The EPO update is explicitly positioned as a "minimal existence check" rather than a method, and the pass@K improvements are modest.

Timeliness & Relevance

This paper is exceptionally timely. MoE architectures are now standard in frontier models (DeepSeek-V3/R1, Qwen3, GPT-OSS), yet routing quality has received remarkably little scrutiny relative to its importance. The connection to reasoning performance is particularly relevant given the current focus on mathematical and scientific reasoning capabilities. The observation that routing failures concentrate precisely on the hard tokens where reasoning success is determined connects to the active research area on token-level reasoning analysis (pivotal tokens, critical tokens, high-entropy minorities).

Strengths

1. Novel diagnostic framework: The counterfactual routing analysis is simple, principled, and reveals structure that aggregate metrics miss entirely.

2. Broad empirical validation: Consistent results across 4 models, 4 benchmarks, and multiple layers make the finding robust and general.

3. Clean theoretical explanation: The gradient analysis in Section 4.1 and the characterization of load-balancing losses as "aggregate" in Section 4.2 provide structural understanding of why the phenomenon occurs.

4. Minimal intervention experiment: The EPO proof-of-concept, modifying <0.001% of parameters, provides compelling evidence that the observed misalignment is partially actionable.

5. Clear writing and presentation: The paper presents a complex analysis accessibly, with Figure 1 effectively communicating the core finding.

Limitations

1. No scalable solution: The paper diagnoses a problem convincingly but does not provide a training-time fix, which limits transformative impact.

2. Proxy validity concerns: Using next-token probability on verified trajectories as a routing quality proxy has acknowledged limitations — it assumes the realized token is the right target and that local improvements translate to global reasoning improvements.

3. EPO results are modest: The pass@K shifts are small and don't clearly separate from baseline under bootstrap uncertainty at all K values, somewhat weakening the "reachable by routing alone" claim.

4. Single-layer analysis: The inability to assess multi-layer routing interactions is a meaningful gap, as routing failures may compound or cancel across layers.

5. Limited to open-weight models: The analysis cannot be applied to the most capable proprietary MoE models.
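For limitation 3, it helps to see how a pass@K estimate and its bootstrap interval are typically computed. The sketch below uses the widely used unbiased estimator $1 - \binom{n-c}{K} / \binom{n}{K}$ and a percentile bootstrap over problems; whether the paper uses exactly this procedure is an assumption, and the per-problem counts are made-up placeholders, not the paper's data.

```python
import numpy as np
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def bootstrap_ci(correct_counts, n, k, num_resamples=2000, alpha=0.05, seed=0):
    """Mean pass@k over problems with a percentile bootstrap interval."""
    rng = np.random.default_rng(seed)
    per_problem = np.array([pass_at_k(n, c, k) for c in correct_counts])
    size = (num_resamples, len(per_problem))
    idx = rng.integers(0, len(per_problem), size=size)   # resample problems
    means = per_problem[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return per_problem.mean(), lo, hi

# Placeholder data: correct samples out of n = 16 for eight problems.
counts = [0, 0, 1, 3, 8, 16, 0, 2]
mean, lo, hi = bootstrap_ci(counts, n=16, k=4)
print(f"pass@4 = {mean:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```

With only a few dozen competition problems per benchmark, such intervals are wide, which is why overlapping intervals between the baseline and the router-only update do not by themselves settle whether the shift is real.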

Overall Assessment

This is a well-executed diagnostic paper that identifies an important and previously uncharacterized failure mode in MoE language models. The finding that routing quality degrades precisely where it matters most, and the clean structural explanation for why, constitute a meaningful contribution to understanding MoE architectures. The work is more diagnostic than constructive — it opens research directions rather than closing them — but the quality of the analysis and its timeliness relative to the dominance of MoE architectures make it a valuable contribution.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Generated May 11, 2026

Comparison History (24)

Won vs. LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Paper 2 addresses a critical inefficiency in Mixture-of-Experts (MoE) architectures, which form the backbone of modern frontier LLMs. By proving that standard routing fails on complex reasoning tokens and demonstrating that minimal router updates significantly improve math reasoning (AIME), it offers highly actionable insights for SOTA foundation models. While Paper 1 presents an elegant and efficient simplification for world models, Paper 2's direct impact on the reasoning capabilities and training paradigms of large language models gives it broader and more immediate scientific and industrial relevance.

gemini-3.1-pro-preview · Jun 6, 2026

Lost vs. Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

Paper 2 likely has higher impact: it addresses the timely, broadly relevant problem of LLM unlearning (privacy, copyright, safety, compliance) with a principled formulation and a practical method (ATWU) that learns token-level importance without external supervision, showing SOTA forget–retain trade-offs on standard benchmarks and interpretability via alignment with ground-truth spans. This combination of theoretical framing + scalable algorithm + strong empirical validation is broadly applicable across models and domains. Paper 1 is novel and valuable for MoE reliability, but its scope is narrower to MoE routing and demonstrated gains appear more incremental.

gpt-5.2 · Jun 6, 2026

Won vs. Sparse Layers are Critical to Scaling Looped Language Models

Paper 2 addresses a critical and timely issue in modern MoE architectures—misrouting during complex reasoning tasks. By demonstrating that existing routers fail on fragile tokens and that minimal updates can significantly improve reasoning benchmarks like AIME, it offers highly actionable insights for current cutting-edge models. While Paper 1 provides an interesting architectural innovation for efficiency, Paper 2's direct applicability to improving the reasoning capabilities of widely used models suggests a more immediate and broader scientific impact.

gemini-3.1-pro-preview · May 16, 2026

Won vs. Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Paper 2 offers a novel, mechanistically grounded analysis of MoE routing that reveals a systematic failure mode (misrouting on fragile tokens) with immediate practical implications for improving existing large-scale models. The finding that a minimal router-only update improves reasoning benchmarks is actionable and broadly relevant to the rapidly growing MoE model ecosystem (DeepSeek, Qwen, etc.). Paper 1, while valuable as a benchmark for LLM agents in particle physics, serves a narrower community and its main finding (agents don't beat human physicists) is less surprising. Paper 2's insights are more likely to influence model architecture and training across the field.

claude-opus-4-6 · May 15, 2026

Won vs. Distributional simplicity bias and effective convexity in Energy Based Models

Paper 1 likely has higher impact due to timeliness and direct relevance to current large-scale MoE LLMs. It introduces a concrete counterfactual routing evaluation, demonstrates a systematic failure mode (misrouting on fragile reasoning tokens) across multiple major models, and shows an actionable, minimal intervention (router-only update) that improves competition-grade math benchmarks—clear real-world applicability. Methodologically, the cross-model empirical validation plus a targeted causal-style fix strengthens rigor. Paper 2 offers valuable theoretical insight into EBMs, but EBMs are less central in today’s dominant generative pipelines, potentially narrowing near-term breadth and application impact.

gpt-5.2 · May 11, 2026

Won vs. Enhancing Federated Quadruplet Learning: Stochastic Client Selection and Embedding Stability Analysis

Paper 2 addresses a fundamental and timely question about Mixture-of-Experts routing in large language models, providing novel counterfactual analysis showing routers are misaligned on hard reasoning tokens. The finding that a minimal router-only update improves math reasoning benchmarks is highly actionable. The work spans multiple prominent MoE models and offers both diagnostic insights and a practical intervention. Paper 1 contributes incremental improvements to federated learning via metric learning, which is useful but more narrowly scoped and less novel. Paper 2's relevance to the rapidly growing MoE/LLM ecosystem gives it substantially broader impact potential.

claude-opus-4-6 · May 11, 2026

Won vs. Diagnosing Capability Gaps in Fine-Tuning Data

Paper 2 has higher impact potential: it introduces a broadly applicable counterfactual evaluation of MoE routing, identifies a systematic failure mode on fragile reasoning tokens across multiple prominent MoE families, and demonstrates that a minimal router-only intervention can measurably improve competitive math benchmarks—implying actionable gains without retraining experts. This targets a central scaling paradigm (MoE) with implications for efficiency, training objectives, and interpretability across many tasks. Paper 1 is practical for dataset diagnostics, but its scope is narrower and more dependent on LLM-judge scoring and interactive decomposition.

gpt-5.2 · May 11, 2026

Won vs. AdaCubic: An Adaptive Cubic Regularization Optimizer for Deep Learning

Paper 1 is more impactful due to its timely, MoE-specific diagnosis of a core failure mode (router misallocation on hard reasoning tokens) and a practical, minimal intervention (router-only update) that improves real benchmark performance across multiple major MoE families. It introduces a counterfactual, equal-compute routing evaluation that can become a standard analysis tool, with implications for training objectives, interpretability, and scaling. Paper 2 is a solid optimizer contribution, but cubic-regularization/Newton-style methods and adaptive variants are a crowded area, and the practical adoption barrier (cost/complexity) may limit breadth relative to MoE routing insights.

gpt-5.2 · May 11, 2026

Won vs. Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Paper 1 addresses a critical bottleneck in Mixture-of-Experts (MoE) language models, a highly prevalent architecture in modern AI. By identifying and mitigating routing failures for 'fragile' tokens crucial to complex reasoning, it offers immediate, actionable improvements for LLM performance on rigorous benchmarks like AIME. While Paper 2 provides valuable theoretical insights into optimization dynamics, Paper 1's findings have broader, more immediate real-world applicability and timeliness given the current focus on scaling reasoning capabilities in foundation models.

gemini-3.1-pro-preview · May 11, 2026

Won vs. Predictive Entropy Links Calibration and Paraphrase Sensitivity in Medical Vision-Language Models

Paper 1 addresses a fundamental architectural question about Mixture-of-Experts models—whether routers actually select optimal expert paths—and provides novel counterfactual analysis across multiple major models (Qwen3, DeepSeek, OLMoE). The finding that routers are systematically misaligned on critical reasoning tokens, combined with a practical fix (router-only updates improving math benchmarks), has broad implications for MoE architecture design and training. Paper 2 provides useful empirical findings linking calibration and paraphrase sensitivity in medical VLMs but is more incremental and domain-specific, with narrower architectural scope and less transformative potential.

claude-opus-4-6 · May 11, 2026
