Geometric Routing Enables Causal Expert Control in Mixture of Experts

Ivan Ternovtsii, Yurii Bilak

#180 of 2292 · Artificial Intelligence
Tournament Score: 1523±37
Win Rate: 51% (21 wins, 20 losses, 41 matches)
Rating: 5.2/10

Abstract

Sparse Mixture-of-Experts (MoE) models scale parameters while fixing active computation per token, but the specialization of individual experts remains opaque. In a companion paper we showed that routing topology is quality-neutral: five structurally different configurations converge to statistically equivalent language modeling quality. Here we show that expert identity is nonetheless causally meaningful: individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. We present four lines of evidence. First, projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary: 15% of experts are monosemantic specialists spanning 10 categories (temporal, geographic, cardinal, discourse, emotional, financial, military, scientific). Second, routing exhibits a frequency-to-syntax gradient: early layers separate tokens by word frequency, deeper layers by syntactic class (Zipf-confound controls, all p < 0.001). Third, causal interventions confirm these labels: steering toward a temporal expert's centroid increases P(temporal) by +321% (median across 44 prompts); suppressing a geographic expert drops P(geographic) by -23%; rewriting an expert's output vector halves target-category probability, and effects compose additively across layers. Fourth, the interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing provides geometric transparency -- expert specialization is readable directly from the centroid matrix. MoE expert-level specialization is a first-class interpretability primitive: architecturally monosemantic, causally validated, and controllable at inference with zero overhead.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper argues that individual rank-1 experts in sparse Mixture-of-Experts (MoE) models are "architecturally monosemantic" — each expert reads along one direction and writes along another in activation space — and that cosine-similarity routing in a shared metric space makes expert specialization directly inspectable via centroid positions. The authors present four main lines of evidence: (1) a "Semantic Dictionary" obtained by projecting expert output vectors through the unembedding matrix, (2) a frequency-to-syntax gradient across layers, (3) context-dependent routing (polysemy branching), and (4) causal interventions (steering, suppression, surgery) that validate the discovered semantic labels.

The central insight is that cosine routing provides "geometric transparency" — expert functions are readable from the centroid matrix without running probes — whereas linear routers support similar steering but require activation-based discovery. This positions MoE expert structure as an alternative to post-hoc interpretability methods like sparse autoencoders (SAEs).
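To make the two ideas above concrete, here is a minimal sketch of cosine-similarity routing followed by a rank-1 expert. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy vectors, and the choice to treat the routing vector `z` as already projected into the low-dimensional routing space are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def route(z, centroids, top_k=1):
    """Return the indices of the top_k centroids most similar to routing vector z."""
    scores = [cosine(z, c) for c in centroids]
    order = sorted(range(len(centroids)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]

def rank1_expert(x, u, v):
    """Rank-1 expert: read x along u (one scalar), then write that scalar along v."""
    activation = sum(xi * ui for xi, ui in zip(x, u))
    return [activation * vi for vi in v]

# Toy values (hypothetical): three expert centroids in a 2-D routing space.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
z = [0.9, 0.1]                 # routing vector, assumed already projected
chosen = route(z, centroids)   # expert 0 has the highest cosine score

x = [2.0, 1.0, 0.0]            # hidden state
u = [1.0, 0.0, 0.0]            # read direction of the chosen expert
v = [0.0, 0.5, 0.5]            # write direction of the chosen expert
out = rank1_expert(x, u, v)    # activation 2.0 scales v
```

Because each expert is selected by proximity to its centroid, inspecting the `centroids` list alone already says which region of routing space each expert serves, which is the sense of "geometric transparency" described above.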

Methodological Rigor

Strengths in methodology:

  • The paper employs multiple complementary causal interventions (knockout, steering, suppression, surgery), which is commendable. The suppression experiment comparing targeted vs. random expert removal is well-designed and shows clean selectivity.
  • The Zipf-confound control for the frequency-to-syntax gradient is a thoughtful experimental design choice, using matched-frequency pairs to disentangle frequency from syntactic class.
  • The polysemy branching analysis (86% expert divergence for "bank" in different contexts) provides compelling evidence that routing is genuinely semantic.
  • Statistical testing (permutation nulls, bootstrap CIs, TOST equivalence) is generally appropriate.

Weaknesses in methodology:

  • The entire study is conducted on WikiText-103 at 76–84M parameters, a very small scale by modern standards. The authors acknowledge this limitation, but it severely constrains the generalizability of the claims. At this scale the model reaches a perplexity (PPL) of about 34, which is quite high, and expert specialization patterns may be qualitatively different in models that actually achieve strong language modeling.
  • The "15% monosemantic" figure is based on manual inspection of top-10 decoded words across 8,192 experts. This is labor-intensive and subjective; the HDBSCAN automation is welcome but leaves 73.5% of experts unclustered.
  • The claim that rank-1 experts are "monosemantic by construction" is somewhat overstated. The rank-1 constraint means each expert has one input and one output direction, but this doesn't guarantee monosemanticity — a single direction in activation space can still correspond to a superposition of concepts. The authors' own finding that 65% of experts are "moderately polysemantic" undermines the "by construction" framing.
  • Steering effect sizes are reported as percentage increases (e.g., +321%), which can be misleading when base rates are low. The temporal steering shifts P(temporal) from 0.17 to 0.49 — impressive in relative terms, but the absolute probability mass is still modest.
  • The logit lens analysis in Section 3 uses an underfit model (PPL ~300–320), which weakens its connection to conclusions drawn from the converged model.
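The point about relative versus absolute effect sizes can be checked with simple arithmetic. The sketch below uses only the two probabilities quoted in the bullet on steering (0.17 and 0.49); the helper names are hypothetical. Note that the +321% figure is reported as a median across 44 prompts, so it need not equal the relative change between these two numbers.

```python
def relative_change(before, after):
    """Change expressed as a fraction of the starting value."""
    return (after - before) / before

def absolute_change(before, after):
    """Change in raw probability mass."""
    return after - before

p_before, p_after = 0.17, 0.49   # values quoted in the review
rel = relative_change(p_before, p_after)
abs_delta = absolute_change(p_before, p_after)

print(f"relative: {rel:+.0%}, absolute: {abs_delta:+.2f}")
# prints "relative: +188%, absolute: +0.32"
```

A large relative gain on a small base rate still leaves about half of the probability mass outside the target category, which is why reporting both numbers is more informative.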

Potential Impact

    The paper addresses an important question: can MoE architecture itself serve as an interpretability substrate? If the findings generalize to scale, this would be significant because:

    1. Alternative to SAEs: End-to-end trained monosemantic units avoid the reconstruction-fidelity gap, feature absorption, and non-canonical feature problems plaguing SAEs.

    2. Zero-overhead control: Steering via centroid biasing requires no additional training or probing datasets, making it deployment-friendly.

    3. Design principle: The finding that cosine routing provides geometric transparency could influence future MoE architecture design toward interpretability-aware routing.
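One plausible reading of "steering via centroid biasing" is an additive per-expert bias on the routing scores at inference time. The sketch below illustrates that reading only; the review does not specify the paper's actual mechanism, and the function names, bias value, and toy vectors are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def routing_scores(z, centroids, bias=None):
    """Cosine score per expert plus an optional additive per-expert bias."""
    bias = bias or [0.0] * len(centroids)
    return [cosine(z, c) + b for c, b in zip(centroids, bias)]

def top_expert(scores):
    """Index of the highest-scoring expert (top-1 routing)."""
    return max(range(len(scores)), key=lambda i: scores[i])

# Toy values (hypothetical): two experts, unit-length routing vector.
centroids = [[1.0, 0.0], [0.0, 1.0]]
z = [0.8, 0.6]

baseline = top_expert(routing_scores(z, centroids))
# Assumed steering: push routing toward expert 1 with a fixed bias of 0.5.
steered = top_expert(routing_scores(z, centroids, bias=[0.0, 0.5]))
```

Under this assumption the intervention is a single vector addition per routing decision and needs no extra training or probing data, which is consistent with the "zero-overhead" description in item 2.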

    However, the practical impact is limited by scale. No production MoE system uses rank-1 experts (typical expert ranks are much higher), and the relationship between rank-1 monosemanticity and higher-rank expert behavior is unclear. The paper acknowledges MONET (262K monosemantic experts) but does not explore the comparison in depth.

    Timeliness & Relevance

    The paper is well-timed. MoE architectures are dominant in frontier models (Mixtral, DeepSeek, Gemini), and interpretability of these systems is an active concern. The connection to SAE limitations (non-canonical features, absorption, dark matter) positions the work against a current and recognized problem. The concurrent SteerMoE work on expert-level safety steering shows this is an active research direction.

    Strengths

    1. Comprehensive evidence structure: Four distinct intervention types (knockout, steering, suppression, surgery) provide triangulating evidence.

    2. Good experimental controls: Zipf-confound controls, random-suppression baselines, adversarial prompts for steering robustness.

    3. Compositional analysis: The finding that cross-layer expert composition is additive while within-layer composition interferes is a useful structural insight.

    4. Honest limitations: The paper clearly states scale limitations and does not overclaim generalization.

    Limitations

    1. Scale: 76–84M parameters on WikiText-103 is far from the regime where MoE models are practically relevant. The gap to Mixtral (47B) or DeepSeek-V3 (671B) is enormous.

    2. Architecture specificity: Rank-1 experts are not standard in production MoE models. The claim that experts are "monosemantic by construction" is really a claim about the architectural constraint, not a general MoE property.

    3. Companion paper dependency: Key architectural details and the equifinality result are in a companion paper, making this work harder to evaluate independently.

    4. Limited category coverage: Only 10 categories are causally validated; 65% of experts remain polysemantic and uncategorized.

    5. Reproducibility concerns: The specific model architecture (ST-MOE with multi-hop routing, d_space=64, τ=30) is non-standard, and it's unclear how results depend on these specific choices.

    6. Overstatement of novelty: Projecting expert vectors through the unembedding matrix is, as the authors acknowledge, "a straightforward application of existing techniques." The cosine routing is novel but the interpretability methods are largely borrowed.
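For readers unfamiliar with the technique named in item 6, projecting an expert's output vector through the unembedding matrix amounts to scoring every vocabulary token by a dot product and reading off the top entries. The following is a minimal sketch with a made-up five-word vocabulary and a 2-D model dimension; all values and names are hypothetical and are not taken from the paper.

```python
def decode_expert(v, unembedding, vocab, top_k=3):
    """Rank vocabulary tokens by the dot product of their unembedding row with v."""
    logits = [sum(w * vi for w, vi in zip(row, v)) for row in unembedding]
    ranked = sorted(range(len(vocab)), key=lambda i: logits[i], reverse=True)
    return [vocab[i] for i in ranked[:top_k]]

# Toy vocabulary and unembedding matrix (one row per token), purely illustrative.
vocab = ["monday", "paris", "july", "seven", "yesterday"]
unembedding = [
    [0.9, 0.1],
    [0.0, 1.0],
    [0.8, 0.2],
    [0.1, 0.1],
    [0.7, 0.0],
]

v = [1.0, 0.0]                              # expert output (write) direction
top = decode_expert(v, unembedding, vocab)  # ['monday', 'july', 'yesterday']
```

In this toy case the top decoded tokens are all temporal, which is the kind of pattern a reviewer would then label by hand when building a dictionary of expert meanings.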

    Overall Assessment

    This paper presents a well-structured interpretability study with good experimental methodology for its scale. The geometric transparency argument for cosine routing is the most novel contribution. However, the extremely small model scale, non-standard architecture, and rank-1 expert constraint limit the paper's immediate impact on the broader MoE interpretability landscape. The work would be substantially more impactful with validation at even moderate scale (1–10B parameters) with standard expert architectures.

    Rating: 5.2/10
    Significance 5.5 · Rigor 6 · Novelty 5.5 · Clarity 7

    Generated May 5, 2026

    Comparison History (41)

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gemini-3 · 5/5/2026

    Paper 2 highlights a critical, measurable flaw in current AI safety paradigms where safety filters cause active medical harm through identity-contingent withholding. Its interdisciplinary relevance across AI, medicine, and ethics, combined with a pre-registered rigorous methodology and immediate real-world policy implications, gives it exceptional breadth of impact compared to the algorithmic, albeit valuable, contributions of Paper 1.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gemini-3 · 5/5/2026

    Paper 1 presents a multimodal foundation model bridging genomics, transcriptomics, and proteomics. Its demonstrated success in clinically relevant tasks, such as mutation correction and targeted protein design, gives it profound and immediate real-world utility in biotech and medicine. While Paper 2 offers valuable insights into AI interpretability, Paper 1's broad applicability to fundamental biological problems and therapeutics suggests a higher and more diverse potential scientific impact.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    gemini-3 · 5/5/2026

    Paper 1 offers higher scientific impact due to its profound real-world applicability and urgent cross-disciplinary relevance. While Paper 2 provides valuable technical advancements in LLM interpretability, Paper 1 exposes a critical, life-threatening flaw in current AI alignment: iatrogenic harm via identity-contingent withholding. By rigorously quantifying how safety filters actively withhold medical knowledge from laypersons, Paper 1 challenges existing safety paradigms. Its pre-registered methodology and direct implications for public health, medical AI, and AI policy give it a significantly broader and more immediate societal and scientific footprint.

    vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
    claude-opus-4.6 · 5/5/2026

    Paper 2 addresses a fundamental, widely applicable problem—LLM prior contamination—that affects every domain using LLMs for data analysis. Its epistemic blinding protocol is simple, generalizable (demonstrated in biology and finance), immediately actionable (open-source tool), and addresses a critical trust/auditability gap in the rapidly growing field of LLM-assisted scientific reasoning. Paper 1 makes solid contributions to MoE interpretability, but targets a narrower architectural community. Paper 2's breadth of impact across fields, timeliness given surging agentic AI adoption, and practical tooling give it higher potential impact.

    vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis
    claude-opus-4.6 · 5/5/2026

    Paper 2 addresses a fundamental and broadly applicable problem—the inability to distinguish data-driven inference from memorized priors in LLM outputs—that affects every field using LLMs for analysis. Its epistemic blinding protocol is simple, generalizable (demonstrated in both biology and finance), and immediately actionable with open-source tools. Paper 1, while technically rigorous in MoE interpretability, addresses a narrower architectural concern. Paper 2's timeliness is exceptional given the rapid adoption of LLM-assisted scientific analysis, and it establishes a new auditing paradigm relevant across all domains using agentic LLM systems.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 5/5/2026

    Paper 1 addresses a fundamental question about the epistemic validity of AI-driven scientific research—a rapidly growing practice with enormous implications across all scientific fields. Its finding that LLM agents fail to exhibit genuine scientific reasoning (ignoring evidence 68% of the time, rare belief revision) challenges core assumptions about autonomous AI research and has broad policy, methodological, and safety implications. Paper 2 makes a solid interpretability contribution to MoE architectures, but its scope is narrower, primarily impacting the ML/NLP community. Paper 1's timeliness and breadth of impact across all sciences gives it the edge.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gpt-5.2 · 5/5/2026

    Paper 2 has higher likely impact due to a real-world, conference-scale deployment (22,977 papers) addressing an urgent bottleneck in science, with immediate applicability across fields and strong timeliness. Its contributions (end-to-end system, safeguards, benchmark, and large survey evidence) could rapidly influence peer-review policy, tooling, and research evaluation practices. Paper 1 is novel and methodologically interesting for MoE interpretability and controllability, but its direct downstream impact is narrower and more contingent on adoption in specific model architectures and research communities.

    vs. MIMIC: A Generative Multimodal Foundation Model for Biomolecules
    gemini-3 · 5/5/2026

    Paper 1 introduces a unifying multimodal foundation model for biomolecules, bridging sequence, structure, and evolution for both prediction and design. Its demonstrated applications in complex tasks like clinically relevant RNA edits and targeted protein design offer transformative potential for drug discovery, bioengineering, and molecular biology. While Paper 2 presents valuable advances in AI interpretability and MoE control, Paper 1's direct, tangible impacts on the life sciences and medicine represent a broader and more profound scientific and societal contribution.

    vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors
    claude-opus-4.6 · 5/5/2026

    Hodoscope introduces a novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a critical and timely AI safety challenge. Its practical impact is demonstrated by discovering a previously unknown benchmark vulnerability (Commit0) and recovering known exploits, with significant review effort reduction. The formulation of unsupervised monitoring is broadly applicable across AI safety. Paper 2 contributes valuable interpretability insights for MoE models, but its scope is narrower (specific architecture) and builds on a companion paper. Paper 1's broader applicability to AI safety, novel problem formulation, and demonstrated real-world impact give it higher potential scientific impact.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gpt-5.2 · 5/5/2026

    Paper 2 has higher estimated impact: it reports an end-to-end autonomous discovery system validated on a real optical platform, including experimental reproduction and a previously unreported mechanism with potential hardware implications. This is novel (autonomous closed-loop discovery), timely (agentic LLMs), and has broad cross-field relevance (AI, optics, scientific automation, hardware acceleration). The methodological bar is higher due to physical experiments and validation. Paper 1 advances interpretability/control for MoE routing with solid causal evidence, but its applications are narrower and incremental relative to the broader scientific and technological implications of autonomous experimental discovery.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 5/5/2026

    Paper 2 addresses a fundamental question about AI agents conducting autonomous science—a rapidly growing deployment area—revealing that LLMs fail to exhibit genuine scientific reasoning (ignoring evidence 68% of the time, rarely performing belief revision). This finding has broad implications across all fields using AI for research, challenges current evaluation paradigms, and identifies a critical gap (reasoning as a training target). Its 25,000+ agent runs across 8 domains provide strong empirical grounding. Paper 1, while technically rigorous in MoE interpretability, addresses a narrower architectural question with more specialized impact.

    vs. End-to-end autonomous scientific discovery on a real optical platform
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher scientific impact: it reports an end-to-end autonomous agent that conducts real-world experiments and claims a previously unreported, experimentally validated physical mechanism with potential implications for optical computing hardware—high novelty, strong real-world application potential, and broad cross-field relevance (AI agents, experimental physics, photonics hardware). Paper 1 is a rigorous and timely interpretability/control advance for MoE routing, but its impact is more scoped to ML model analysis and may be less transformative than a validated autonomous-discovery milestone plus new physical mechanism.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.6 · 5/5/2026

    HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements. Its ability to simulate clinical interventions in silico, validated against 41 randomized trial comparisons, and transfer to four independent cohorts for disease/mortality prediction has enormous real-world clinical impact. It addresses a fundamental medical challenge (personalized intervention prediction) with broad applications as clinical digital twins. Paper 2 makes a solid interpretability contribution to MoE architectures but has narrower scope, primarily advancing mechanistic understanding of expert specialization in language models.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    claude-opus-4.6 · 5/5/2026

    HealthFormer addresses a central challenge in medicine—personalized health forecasting and intervention simulation—with broad clinical applications including risk stratification, digital twins, and in silico trial simulation. It demonstrates transfer across four independent cohorts, outperforms established clinical risk scores on 27/30 endpoints, and validates intervention predictions against published randomized trials. Its breadth of impact across medicine, public health, and precision nutrition far exceeds Paper 1's contribution, which, while novel for MoE interpretability, addresses a narrower ML architecture concern with less immediate real-world impact.

    vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors
    claude-opus-4.6 · 5/5/2026

    Paper 2 (Hodoscope) addresses a more broadly impactful and timely problem—unsupervised monitoring of AI agent misbehaviors—which is critical for AI safety as autonomous agents become widespread. It introduces a novel conceptual framework (unsupervised monitoring), demonstrates practical real-world impact by discovering a previously unknown benchmark vulnerability, and provides a generalizable tool applicable across diverse AI systems. Paper 1 makes solid contributions to MoE interpretability but is more niche, focusing on architectural transparency within a specific model family. Hodoscope's broader applicability to AI safety and its demonstrated practical discoveries give it higher potential impact.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher impact due to its unprecedented real-world, conference-scale deployment (22,977 papers) with direct operational relevance and immediate applicability to the scientific ecosystem. It addresses a timely, high-stakes bottleneck (peer review), provides empirical evidence via field data and surveys, and introduces a benchmark, making it broadly influential across disciplines and research governance. Paper 1 is novel and rigorous for MoE interpretability/control, but its impact is more specialized to ML architecture/interpretability, whereas Paper 2 could reshape review workflows and policy across fields.

    vs. TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
    gemini-3 · 5/5/2026

    Paper 2 offers foundational advancements in mechanistic interpretability and causal control for Mixture-of-Experts (MoE) architectures. Given the pervasive use of MoEs in frontier AI models, enabling zero-overhead causal steering and making expert specialization geometrically transparent addresses critical bottlenecks in AI safety, alignment, and reliability. This core AI breakthrough will likely generate significantly broader impact and higher citation volume across the field compared to Paper 1's domain-specific, albeit highly innovative, application of LLM agents to urban traffic control.

    vs. TrafficClaw: Generalizable Urban Traffic Control via Unified Physical Environment Modeling
    gemini-3 · 5/5/2026

    Paper 2 addresses fundamental interpretability and control mechanisms in Mixture-of-Experts (MoE) models, a critical topic in scaling large language models. The discovery of causally controllable, monosemantic experts provides a major breakthrough in AI transparency and alignment, promising broad applicability across all MoE-based foundation models. Paper 1 offers an innovative approach, but its impact is largely restricted to the specific applied domain of urban traffic control.

    vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
    gpt-5.2 · 5/5/2026

    Paper 1 likely has higher scientific impact: it introduces a large-scale, domain-specific foundation model trained on unprecedented nationwide claims-scale data, demonstrates broad and externally validated performance gains across 1,000+ clinical tasks, and shows direct improvements in real-world evidence workflows (expenditure forecasting, reduced bias in target trial emulation). Its applications are immediate for healthcare, regulation, surveillance, and policy, with wide cross-disciplinary relevance (clinical ML, epidemiology, health economics). Paper 2 is novel for MoE interpretability/control, but its near-term real-world impact is more indirect and narrower.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    claude-opus-4.6 · 5/5/2026

    Paper 1 presents a fundamentally new paradigm ('machine collective intelligence') for autonomous scientific discovery that addresses a central bottleneck in AI-driven science. It demonstrates broad applicability across deterministic, stochastic, and uncharacterized systems, achieving dramatic improvements (up to 6 orders of magnitude) in extrapolation over deep neural networks while providing interpretable equations. Its breadth of impact spans virtually all empirical sciences. Paper 2 makes valuable contributions to MoE interpretability and control, but its scope is narrower, focused on understanding routing mechanisms within a specific architecture class, with more incremental impact on the interpretability subfield.