From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Leonard Engmann, Christian Medeiros Adriano, Holger Giese

Jun 9, 2026 · arXiv:2606.10703v1

cs.LG · cs.CL

#1286 of 5669 · cs.LG

Tournament Score: 1461 ± 44 (axis range 1050 to 1750)

Win Rate: 67%

Rating: 6.5 / 10

Significance 6.5 · Rigor 8 · Novelty 6 · Clarity 8.5

Abstract

Interpretability methods routinely use population-level summary statistics over observed model behaviour to license claims about the effects of targeted interventions on specific computations; in Pearl's terms, they treat rung-1 associational evidence as if it supported rung-2 interventional conclusions, a move whose validity is rarely tested. We examine one concrete instance: the use of routing statistics in Mixture-of-Experts (MoE) pruning, where utilization rates, activation norms, and routing weight distributions are treated as predictors of which experts can be removed without functional cost. A token-level interventional audit across three high-redundancy MoE architectures (OLMoE-1B-7B-0924, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite) finds no observational metric predicts causal expert importance after multiple-comparison correction in any model, with effect sizes below Cohen's $d = 0.17$ across all 60 metric-layer combinations. A per-token routing weight control rules out insufficient power, recovering a single Bonferroni-significant signal at OLMoE's final MoE layer ($d = +0.231$, $p = 0.0013$). Existing pruning methods succeed in this regime not by identifying dispensable experts but because early-layer redundancy renders most selection criteria interchangeable. Our results provide an explicit counterexample to the common inferential step from population-level observational summaries to token-level interventional claims about expert importance, and illustrate how interventional audits can calibrate the evidential standards for interpretability claims.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper tests a specific inferential assumption in MoE model pruning: that population-level observational routing statistics (utilization rate, activation norm, mean routing weight, activation standard deviation) predict the causal importance of individual experts at the token level. Framed through Pearl's causal hierarchy, the authors argue that the pruning literature conflates rung-1 (associational) evidence with rung-2 (interventional) claims. Through systematic token-level ablation experiments across three architectures (OLMoE-1B-7B, Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite), they find that no observational metric predicts causal expert importance after multiple-comparison correction, with universally small effect sizes (Cohen's d < 0.17 across 60 metric-layer combinations). The paper offers an alternative explanation for why pruning pipelines work: early-layer redundancy makes most expert selections approximately harmless, rendering the choice of selection criterion immaterial.

Methodological Rigor

The experimental design is notably careful for a workshop paper. Several features stand out:

Statistical discipline. The authors apply Bonferroni correction per model, require agreement between parametric (paired t-test) and non-parametric (Wilcoxon) tests, and report effect sizes throughout. This is substantially more rigorous than most ablation studies in the interpretability literature.
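As a rough illustration of the decision rule described above, the following Python sketch combines a per-model Bonferroni threshold with the requirement that the parametric and non-parametric tests agree, and computes a paired-sample Cohen's d. The function names, the paired-difference form of d, and the figure of 20 tests per model (60 metric-layer combinations split evenly over three models) are assumptions made for illustration; the paper's exact procedure may differ.

```python
from statistics import mean, stdev

def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests

def both_tests_significant(p_ttest: float, p_wilcoxon: float,
                           alpha: float, n_tests: int) -> bool:
    """Flag a cell only if the paired t-test and Wilcoxon p-values
    both fall below the corrected threshold."""
    thr = bonferroni_threshold(alpha, n_tests)
    return p_ttest < thr and p_wilcoxon < thr

def paired_cohens_d(a, b) -> float:
    """Cohen's d for paired samples: mean of the differences divided by
    the standard deviation of the differences (assumed form)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

# Assumption: 20 metric-layer tests per model (60 combinations / 3 models).
threshold = bonferroni_threshold(0.05, 20)
print(f"per-test threshold: {threshold:.4f}")
```

Under that assumed split, the per-test threshold is 0.0025, so the p = 0.0013 quoted in the abstract would clear it provided the second test agrees. Correcting across all 60 combinations at once would give a stricter threshold that the same p-value would not clear.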

Control experiment. The per-token routing weight control is a well-designed positive control that bounds the available signal. By showing that token-conditioned routing weights can recover a signal (at OLMoE Layer 15), the authors rule out the possibility that their null result stems from insufficient statistical power or a flawed experimental setup.
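The signal recovered by the control can be sanity-checked against the numbers quoted on this page. The sketch below assumes that the reported d = +0.231 is a paired-sample effect size and that it was computed on n = 200 samples (the per-cell sample size mentioned in this review); neither assumption is confirmed by the source. It uses a normal approximation rather than the exact t distribution, and the helper names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def paired_t_from_d(d: float, n: int) -> float:
    """Test statistic implied by a paired-sample effect size d at sample size n."""
    return d * sqrt(n)

def two_sided_p_normal(t: float) -> float:
    """Two-sided p-value using a normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))

# d is taken from the abstract, n from the review text (both assumed to apply here).
t_stat = paired_t_from_d(0.231, 200)
p_approx = two_sided_p_normal(t_stat)
print(f"t = {t_stat:.2f}, approximate two-sided p = {p_approx:.4f}")
```

Under these assumptions the approximate p-value is on the order of 0.001, the same order as the reported p = 0.0013. An exact t distribution with n − 1 degrees of freedom would give a slightly larger value than the normal approximation.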

Verification procedures. The four-test verification suite (Appendix B) checking cross-entropy alignment, hook clearing, position diversity, and position-specific ablation effects demonstrates attention to implementation correctness.

Limitations in rigor. The sample size of n=200 per cell, while sufficient to detect medium effects, may miss small but practically relevant effects. The authors partially address this through the routing weight control. However, the audit tests a specific operationalization of "metric validity" (Definition 2.1) that compares highest-ranked vs. lowest-ranked active experts at each token position. This is a reasonable but not unique formalization—one could test rank correlations across all active experts, or examine whether metrics predict importance ordinally rather than just at the extremes. The Spearman ρ values reported in the control tables are generally weak, supporting the null interpretation.
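The remark about sample size can be made concrete with a back-of-the-envelope power calculation. The sketch below estimates the smallest paired-sample effect size detectable at 80% power with n = 200, using a normal approximation. The 80% power target, the two-sided test, and the assumed 20 tests per model for the Bonferroni-corrected case are illustrative choices, not values stated in the source.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_d(n: int, alpha: float, power: float = 0.8) -> float:
    """Smallest paired-sample Cohen's d detectable with the given power
    in a two-sided test (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)  # critical value for the test
    z_power = z.inv_cdf(power)              # quantile for the target power
    return (z_alpha + z_power) / sqrt(n)

uncorrected = min_detectable_d(200, 0.05)
corrected = min_detectable_d(200, 0.05 / 20)  # assumed 20 tests per model
print(f"uncorrected: d >= {uncorrected:.2f}, Bonferroni-corrected: d >= {corrected:.2f}")
```

Under these assumptions the detectable effect is about d = 0.20 without correction and about d = 0.27 with it, so effects at or below the reported ceiling of d = 0.17 would often go undetected at this sample size. This is consistent with the caveat that small but practically relevant effects may be missed.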

Potential Impact

For MoE pruning. The most direct impact is on the MoE compression community. The finding that pruning succeeds due to redundancy rather than metric accuracy is practically important—it suggests that researchers should focus on identifying the few critical layers/positions rather than refining expert-level selection criteria. This could redirect research effort away from developing more sophisticated observational metrics and toward understanding redundancy structure.

For interpretability methodology. The paper contributes to a growing body of work (following Jain & Wallace, 2019; Adebayo et al., 2020) questioning whether observational statistics about model internals support interventional claims. By adding a third concrete instance of this failure pattern, it strengthens the case for interventional validation as a standard practice.

For causal reasoning in ML. The explicit use of Pearl's hierarchy to frame interpretability claims is pedagogically valuable, though the mapping is more illustrative than formally developed. The paper does not provide conditions under which observational-to-interventional inference would succeed, limiting its theoretical contribution.

Timeliness & Relevance

The paper is well-timed. MoE architectures are increasingly prominent (DeepSeek-V3, Mixtral, etc.), and efficient deployment through pruning is a practical necessity. The interpretability community is simultaneously grappling with questions about evidential standards (the Joshi et al. 2026 reference suggests this is an active conversation). The work sits at the intersection of these two trends.

Strengths

1. Clean experimental design with appropriate controls and corrections—a model for ablation-based audits.

2. Cross-architecture replication spanning meaningfully different design choices (shared experts, different top-k ratios, different training procedures).

3. Constructive explanation via progressive ablation: the redundancy regime provides a mechanistic account of why null metrics coexist with successful pruning pipelines.

4. Intellectual honesty in scope claims: the authors carefully distinguish the general null from the narrow OLMoE late-layer finding, and explicitly state what their audit does and does not show about deployed pruning pipelines.

5. Redistribution analysis (Appendix A.3) adds depth by tracing the signal chain from router to logits.

Limitations

1. Scope restriction to high-redundancy architectures. The paper acknowledges this but does not test low-redundancy models (Switch, Mixtral-8x7B), where observational metrics might be more valid. This limits generalizability.

2. Token-level vs. global pruning. Deployed pruning makes global one-shot decisions followed by fine-tuning. The token-level null is important but does not directly address whether population-aggregated metrics work for the intended use case of global expert removal.

3. Sample size and corpus. WikiText-2 test split is a standard but narrow evaluation corpus. Results on diverse domains (code, math, multilingual) might differ.

4. No alternative metric proposed. The paper identifies a problem but offers no constructive replacement, though this is appropriate for a workshop paper.

5. Three models. While spanning key design axes, three models provide limited statistical power for cross-architecture conclusions about the OLMoE late-layer effect.

Overall Assessment

This is a well-executed negative-result paper that makes a focused but important contribution: demonstrating that a widely-used inferential step in MoE pruning does not survive interventional testing. The statistical rigor exceeds the norm for this type of work. The main limitation is the gap between the token-level audit and deployed pruning pipelines, though the paper is transparent about this. As a workshop paper, it establishes a clear finding and methodology that merits full-paper development with expanded model coverage and alternative metric proposals.

Rating: 6.5 / 10

Significance 6.5 · Rigor 8 · Novelty 6 · Clarity 8.5

Generated Jun 10, 2026

Comparison History (21)

Lost vs. Unifying Local Communications and Local Updates for LLM Pretraining

Paper 2 likely has higher impact due to a more broadly applicable, timely contribution to scaling LLM pretraining under real distributed-systems constraints. GASLoC proposes a novel decentralized algorithm compatible with adaptive optimizers, local steps, and sparse randomized communication, and shows empirical gains over strong baselines (including heterogeneous bandwidth settings). This targets a major bottleneck for frontier training and can influence both ML systems and optimization research. Paper 1 is rigorous and valuable as a cautionary interpretability result, but its impact is narrower (MoE pruning/interpretability methodology) and primarily negative/diagnostic rather than enabling new capabilities.

gpt-5.2 · Jun 10, 2026

Lost vs. Flexible Kernels for Protein Property Prediction

Paper 2 introduces a novel and practical methodology for protein property prediction that addresses a significant challenge in protein design. Its contributions—evolutionary substitution matrix-based kernels, structure-aware learning, and multi-task capabilities—have broad applicability in biotechnology, drug design, and protein engineering. Paper 1, while methodologically rigorous and offering important cautionary insights about interpretability practices in MoE models, is primarily a negative result with narrower scope, showing that observational metrics don't predict causal importance. Paper 2's positive, constructive contribution with demonstrated practical utility gives it higher potential for real-world impact and broader adoption.

claude-opus-4-6 · Jun 10, 2026

Won vs. Optimal Post-Training Quantization Scales and Where to Find Them

Paper 2 is more novel and broadly impactful: it challenges a common inferential leap in interpretability (associational metrics → interventional claims) with a concrete, multi-model causal audit, yielding a general negative result and an evidential standard that can reshape evaluation practices beyond MoE pruning. Its methodological rigor is high (token-level interventions, multiple-comparison correction, power control), and the implications span interpretability, causal evaluation, pruning, and safety auditing. Paper 1 is strong and timely for PTQ efficiency, but its impact is likely more incremental and narrower to quantization workflows.

gpt-5.2 · Jun 10, 2026

Lost vs. K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

Paper 2 (K-Forcing) addresses a critical bottleneck in LLM deployment—inference efficiency—with a novel paradigm (push-forward language modeling) that offers concrete 2.4-3.5x speedups while maintaining compatibility with existing infrastructure. This has immediate, broad real-world applications in industrial-scale LLM serving. Paper 1, while methodologically rigorous and raising important points about the observation-intervention gap in MoE interpretability, is more narrowly focused on a negative/cautionary result about existing pruning metrics. Paper 2's potential to influence both research (new decoding paradigms) and practice (deployment costs) gives it broader impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. AuRA: Internalizing Audio Understanding into LLMs as LoRA

Paper 2 has higher potential impact: it challenges a widely used inferential leap in interpretability/pruning (observational routing stats → causal expert importance) with a systematic interventional audit across multiple popular MoE families. The negative result is methodologically rigorous (token-level interventions, multiple-comparison correction, power control) and broadly relevant to interpretability, causal evaluation, and model compression practices. Its implications generalize beyond MoE pruning to many observational interpretability claims, making it timely and cross-cutting. Paper 1 is useful engineering for speech+LLM efficiency, but is narrower and more incremental.

gpt-5.2 · Jun 10, 2026

Won vs. Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Paper 2 addresses a fundamental methodological gap in ML interpretability—the conflation of observational and interventional evidence (Pearl's causal hierarchy)—with rigorous experimental methodology including multiple-comparison correction and effect size reporting. Its findings challenge widely-used assumptions in MoE pruning and have broad implications for interpretability research standards across the field. Paper 1, while technically sound, offers an incremental improvement to PPO-style trust regions for LLM RL, a rapidly evolving area where methods are frequently superseded. Paper 2's contribution is more foundational and likely to influence research practices across multiple subfields.

claude-opus-4-6 · Jun 10, 2026

Won vs. Limitations of Learning Tanh Neural Networks with Finite Precision

Paper 1 addresses a timely and broadly impactful issue in AI interpretability and MoE model pruning—a rapidly growing area given the deployment of large MoE models. It introduces a rigorous causal framework (connecting Pearl's causal hierarchy to interpretability practices) and provides concrete empirical evidence challenging widely-used assumptions, which could reshape pruning methodologies and interpretability standards across the field. Paper 2, while technically strong, extends known impossibility results from ReLU to tanh networks in a narrower theoretical niche, with less immediate practical impact and a smaller affected community.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models

Flow-DPPO addresses a practical and timely problem in RL-based fine-tuning of flow matching models for image/video generation, proposing a principled replacement for ratio clipping with exact KL divergence constraints. It has immediate real-world applications in generative AI, demonstrates concrete improvements across multiple metrics, and provides open-source code. Paper 2 makes a valuable methodological point about the gap between observational and interventional evidence in MoE pruning, but its scope is narrower (a negative/cautionary result on existing metrics) with less direct applicability to improving systems, limiting its broader impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Test-Time Gradient Guidance of Flow Policies in Reinforcement Learning

Paper 1 (QGF) addresses a fundamental challenge in scaling RL with expressive policies by proposing test-time policy improvement that avoids actor-critic training instability. It has broad practical applications in robotics and control, demonstrates strong empirical results across multiple benchmarks, and offers favorable scaling properties. Paper 2 makes a valuable methodological point about the gap between observational and interventional evidence in MoE interpretability, but its scope is narrower—primarily a cautionary finding about existing pruning heuristics rather than enabling new capabilities. Paper 1's contribution is more actionable and broadly impactful across robotics and RL communities.

claude-opus-4-6 · Jun 10, 2026

Lost vs. OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

Paper 2 likely has higher impact: it introduces a generally applicable, plug-and-play gradient estimation framework for optimizing order-statistic (distributional/risk-sensitive) objectives in RL, covering many important criteria (VaR/CVaR, trimmed means, top-m) with unbiased estimators and broad real-world relevance (robustness, tail risk, best-of-K), plus code and demonstrations (incl. LLM post-training). Paper 1 is methodologically rigorous and timely as a cautionary interventional audit, but its impact is narrower (MoE pruning/interpretability) and primarily corrective rather than enabling a new widely reusable method.

gpt-5.2 · Jun 10, 2026
