Geometric Routing Enables Causal Expert Control in Mixture of Experts
Ivan Ternovtsii, Yurii Bilak
Abstract
Sparse Mixture-of-Experts (MoE) models scale parameters while fixing active computation per token, but the specialization of individual experts remains opaque. In a companion paper we showed that routing topology is quality-neutral: five structurally different configurations converge to statistically equivalent language modeling quality. Here we show that expert identity is nonetheless causally meaningful: individual rank-1 experts are monosemantic by construction, and cosine-similarity routing in a low-dimensional metric space makes their specialization directly inspectable. We present four lines of evidence. First, projecting expert output vectors through the unembedding matrix yields a Semantic Dictionary: 15% of experts are monosemantic specialists spanning 10 categories (temporal, geographic, cardinal, discourse, emotional, financial, military, scientific). Second, routing exhibits a frequency-to-syntax gradient: early layers separate tokens by word frequency, deeper layers by syntactic class (Zipf-confound controls, all ). Third, causal interventions confirm these labels: steering toward a temporal expert's centroid increases P(temporal) by +321% (median across 44 prompts); suppressing a geographic expert drops P(geographic) by -23%; rewriting an expert's output vector halves target-category probability, and effects compose additively across layers. Fourth, the interventions are not unique to cosine routing: linear routers support comparable steering, but only cosine routing provides geometric transparency -- expert specialization is readable directly from the centroid matrix. MoE expert-level specialization is a first-class interpretability primitive: architecturally monosemantic, causally validated, and controllable at inference with zero overhead.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper argues that individual rank-1 experts in sparse Mixture-of-Experts (MoE) models are "architecturally monosemantic" — each expert reads along one direction and writes along another in activation space — and that cosine-similarity routing in a shared metric space makes expert specialization directly inspectable via centroid positions. The authors present four main lines of evidence: (1) a "Semantic Dictionary" obtained by projecting expert output vectors through the unembedding matrix, (2) a frequency-to-syntax gradient across layers, (3) context-dependent routing (polysemy branching), and (4) causal interventions (steering, suppression, surgery) that validate the discovered semantic labels.
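The "reads along one direction, writes along another" description, together with cosine-similarity routing, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The projection matrix `proj`, the top-k softmax gating, the toy sizes, and the absence of a nonlinearity are all assumptions made for illustration; only the idea of cosine routing with a temperature (τ=30 is the value quoted in this review) comes from the source.

```python
import numpy as np

def cosine_route(x, proj, centroids, tau=30.0, top_k=2):
    """Pick top_k experts for one token by scaled cosine similarity."""
    z = proj @ x                                   # token in routing space
    z = z / np.linalg.norm(z)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    logits = tau * (c @ z)                         # one score per expert
    idx = np.argsort(logits)[::-1][:top_k]         # best experts first
    gates = np.exp(logits[idx] - logits[idx].max())
    return idx, gates / gates.sum()                # softmax over the winners

def rank1_moe(x, read, write, idx, gates):
    """Each expert reads one scalar (read_e . x) and writes it along write_e."""
    acts = read[idx] @ x                           # shape (top_k,)
    return (gates * acts) @ write[idx]             # shape (d_model,)

rng = np.random.default_rng(0)
d_model, d_space, n_experts = 16, 4, 8             # toy sizes (hypothetical)
proj = rng.normal(size=(d_space, d_model))
centroids = rng.normal(size=(n_experts, d_space))
read = rng.normal(size=(n_experts, d_model))
write = rng.normal(size=(n_experts, d_model))

x = rng.normal(size=d_model)
centroids[3] = 2.5 * (proj @ x)                    # place expert 3 on the token's direction
idx, gates = cosine_route(x, proj, centroids)
y = rank1_moe(x, read, write, idx, gates)
```

Because the score depends only on the angle between the token and each centroid, rescaling a centroid (the factor 2.5 above) does not change which expert wins, and the layer output always lies in the span of the selected experts' write vectors.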
The central insight is that cosine routing provides "geometric transparency" — expert functions are readable from the centroid matrix without running probes — whereas linear routers support similar steering but require activation-based discovery. This positions MoE expert structure as an alternative to post-hoc interpretability methods like sparse autoencoders (SAEs).
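The "Semantic Dictionary" idea of projecting expert output vectors through the unembedding matrix can be sketched as follows. This is a toy illustration under stated assumptions: the six-word vocabulary, the one-hot unembedding matrix, and the function name `semantic_dictionary` are hypothetical and chosen so the result is easy to verify by hand. The paper's actual labeling procedure may differ.

```python
import numpy as np

def semantic_dictionary(write, unembed, vocab, top_n=2):
    """For each expert, list the tokens its write vector promotes most."""
    logits = write @ unembed.T                     # (n_experts, vocab_size)
    top = np.argsort(-logits, axis=1)[:, :top_n]   # highest logits first
    return [[vocab[j] for j in row] for row in top]

# Hypothetical toy vocabulary and unembedding matrix (one-hot rows).
vocab = ["monday", "friday", "paris", "berlin", "three", "seven"]
unembed = np.eye(6, 8)                             # (vocab_size, d_model)

# Two hand-built experts: one writes toward weekday tokens, one toward cities.
write = np.stack([
    2.0 * unembed[0] + 1.0 * unembed[1],
    2.0 * unembed[2] + 1.0 * unembed[3],
])

labels = semantic_dictionary(write, unembed, vocab)
```

Here `labels[0]` lists weekday tokens and `labels[1]` lists city tokens, which is the sense in which an expert's top projected tokens can serve as a candidate semantic label to be checked by causal intervention.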
Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
Potential Impact
The paper addresses an important question: can MoE architecture itself serve as an interpretability substrate? If the findings generalize to scale, this would be significant because:
1. Alternative to SAEs: End-to-end trained monosemantic units avoid the reconstruction-fidelity gap, feature absorption, and non-canonical feature problems plaguing SAEs.
2. Zero-overhead control: Steering via centroid biasing requires no additional training or probing datasets, making it deployment-friendly.
3. Design principle: The finding that cosine routing provides geometric transparency could influence future MoE architecture design toward interpretability-aware routing.
However, the practical impact is limited by scale. No production MoE system uses rank-1 experts (typical expert ranks are much higher), and the relationship between rank-1 monosemanticity and higher-rank expert behavior is unclear. The paper's comparison to MONET (262K monosemantic experts) is acknowledged but not deeply explored.
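The inference-time control mentioned in item 2 above can be sketched in a few lines. The exact intervention used in the paper is not specified in this review, so the mechanics below are assumptions: steering is modeled as interpolating the token's routing vector toward the target centroid with a hypothetical strength `alpha`, and suppression as masking the expert's routing logit.

```python
import numpy as np

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def route_logits(z, centroids, tau=30.0, steer_to=None, alpha=0.0, suppress=()):
    """Cosine routing logits with optional steering and suppression."""
    z = unit(z)
    c = unit(centroids)
    if steer_to is not None:
        # Assumed steering rule: move the token toward the target centroid.
        z = unit((1.0 - alpha) * z + alpha * c[steer_to])
    logits = tau * (c @ z)
    for e in suppress:
        logits[e] = -np.inf                        # expert can no longer be selected
    return logits

centroids = np.eye(4)                              # four toy experts (hypothetical)
z = np.array([1.0, 0.2, 0.0, 0.0])                 # token closest to expert 0

base = route_logits(z, centroids)
steered = route_logits(z, centroids, steer_to=2, alpha=0.9)
suppressed = route_logits(z, centroids, suppress=(0,))
```

Neither operation adds parameters or extra forward passes, which is one way to read the "zero overhead" claim: the intervention only changes which existing experts are selected.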
Timeliness & Relevance
The paper is well-timed. MoE architectures are dominant in frontier models (Mixtral, DeepSeek, Gemini), and interpretability of these systems is an active concern. The connection to SAE limitations (non-canonical features, absorption, dark matter) positions the work against a current and recognized problem. The concurrent SteerMoE work on expert-level safety steering shows this is an active research direction.
Strengths
1. Comprehensive evidence structure: Four distinct intervention types (knockout, steering, suppression, surgery) provide triangulating evidence.
2. Good experimental controls: Zipf-confound controls, random-suppression baselines, adversarial prompts for steering robustness.
3. Compositional analysis: The finding that cross-layer expert composition is additive while within-layer composition interferes is a useful structural insight.
4. Honest limitations: The paper clearly states scale limitations and does not overclaim generalization.
Limitations
1. Scale: 76–84M parameters on WikiText-103 is far from the regime where MoE models are practically relevant. The gap to Mixtral (47B) or DeepSeek-V3 (671B) is enormous.
2. Architecture specificity: Rank-1 experts are not standard in production MoE models. The claim that experts are "monosemantic by construction" is really a claim about the architectural constraint, not a general MoE property.
3. Companion paper dependency: Key architectural details and the equifinality result are in a companion paper, making this work harder to evaluate independently.
4. Limited category coverage: Only 10 categories are causally validated; 65% of experts remain polysemantic and uncategorized.
5. Reproducibility concerns: The specific model architecture (ST-MOE with multi-hop routing, d_space=64, τ=30) is non-standard, and it's unclear how results depend on these specific choices.
6. Overstatement of novelty: Projecting expert vectors through the unembedding matrix is, as the authors acknowledge, "a straightforward application of existing techniques." The cosine routing is novel but the interpretability methods are largely borrowed.
Overall Assessment
This paper presents a well-structured interpretability study with good experimental methodology for its scale. The geometric transparency argument for cosine routing is the most novel contribution. However, the extremely small model scale, non-standard architecture, and rank-1 expert constraint limit the paper's immediate impact on the broader MoE interpretability landscape. The work would be substantially more impactful with validation at even moderate scale (1–10B parameters) with standard expert architectures.
Generated May 5, 2026
Comparison History (41)
Paper 2 highlights a critical, measurable flaw in current AI safety paradigms where safety filters cause active medical harm through identity-contingent withholding. Its interdisciplinary relevance across AI, medicine, and ethics, combined with a pre-registered rigorous methodology and immediate real-world policy implications, gives it exceptional breadth of impact compared to the algorithmic, albeit valuable, contributions of Paper 1.
Paper 1 presents a multimodal foundation model bridging genomics, transcriptomics, and proteomics. Its demonstrated success in clinically relevant tasks, such as mutation correction and targeted protein design, gives it profound and immediate real-world utility in biotech and medicine. While Paper 2 offers valuable insights into AI interpretability, Paper 1's broad applicability to fundamental biological problems and therapeutics suggests a higher and more diverse potential scientific impact.
Paper 1 offers higher scientific impact due to its profound real-world applicability and urgent cross-disciplinary relevance. While Paper 2 provides valuable technical advancements in LLM interpretability, Paper 1 exposes a critical, life-threatening flaw in current AI alignment: iatrogenic harm via identity-contingent withholding. By rigorously quantifying how safety filters actively withhold medical knowledge from laypersons, Paper 1 challenges existing safety paradigms. Its pre-registered methodology and direct implications for public health, medical AI, and AI policy give it a significantly broader and more immediate societal and scientific footprint.
Paper 2 addresses a fundamental, widely applicable problem—LLM prior contamination—that affects every domain using LLMs for data analysis. Its epistemic blinding protocol is simple, generalizable (demonstrated in biology and finance), immediately actionable (open-source tool), and addresses a critical trust/auditability gap in the rapidly growing field of LLM-assisted scientific reasoning. Paper 1 makes solid contributions to MoE interpretability, but targets a narrower architectural community. Paper 2's breadth of impact across fields, timeliness given surging agentic AI adoption, and practical tooling give it higher potential impact.
Paper 2 addresses a fundamental and broadly applicable problem—the inability to distinguish data-driven inference from memorized priors in LLM outputs—that affects every field using LLMs for analysis. Its epistemic blinding protocol is simple, generalizable (demonstrated in both biology and finance), and immediately actionable with open-source tools. Paper 1, while technically rigorous in MoE interpretability, addresses a narrower architectural concern. Paper 2's timeliness is exceptional given the rapid adoption of LLM-assisted scientific analysis, and it establishes a new auditing paradigm relevant across all domains using agentic LLM systems.
Paper 1 addresses a fundamental question about the epistemic validity of AI-driven scientific research, a rapidly growing practice with enormous implications across all scientific fields. Its finding that LLM agents fail to exhibit genuine scientific reasoning (ignoring evidence 68% of the time, rare belief revision) challenges core assumptions about autonomous AI research and has broad policy, methodological, and safety implications. Paper 2 makes a solid interpretability contribution to MoE architectures, but its scope is narrower, primarily impacting the ML/NLP community. Paper 1's timeliness and breadth of impact across all sciences give it the edge.
Paper 2 has higher likely impact due to a real-world, conference-scale deployment (22,977 papers) addressing an urgent bottleneck in science, with immediate applicability across fields and strong timeliness. Its contributions (end-to-end system, safeguards, benchmark, and large survey evidence) could rapidly influence peer-review policy, tooling, and research evaluation practices. Paper 1 is novel and methodologically interesting for MoE interpretability and controllability, but its direct downstream impact is narrower and more contingent on adoption in specific model architectures and research communities.
Paper 1 introduces a unifying multimodal foundation model for biomolecules, bridging sequence, structure, and evolution for both prediction and design. Its demonstrated applications in complex tasks like clinically relevant RNA edits and targeted protein design offer transformative potential for drug discovery, bioengineering, and molecular biology. While Paper 2 presents valuable advances in AI interpretability and MoE control, Paper 1's direct, tangible impacts on the life sciences and medicine represent a broader and more profound scientific and societal contribution.
Hodoscope introduces a novel paradigm—unsupervised monitoring for AI misbehaviors—that addresses a critical and timely AI safety challenge. Its practical impact is demonstrated by discovering a previously unknown benchmark vulnerability (Commit0) and recovering known exploits, with significant review effort reduction. The formulation of unsupervised monitoring is broadly applicable across AI safety. Paper 2 contributes valuable interpretability insights for MoE models, but its scope is narrower (specific architecture) and builds on a companion paper. Paper 1's broader applicability to AI safety, novel problem formulation, and demonstrated real-world impact give it higher potential scientific impact.
Paper 2 has higher estimated impact: it reports an end-to-end autonomous discovery system validated on a real optical platform, including experimental reproduction and a previously unreported mechanism with potential hardware implications. This is novel (autonomous closed-loop discovery), timely (agentic LLMs), and has broad cross-field relevance (AI, optics, scientific automation, hardware acceleration). The methodological bar is higher due to physical experiments and validation. Paper 1 advances interpretability/control for MoE routing with solid causal evidence, but its applications are narrower and incremental relative to the broader scientific and technological implications of autonomous experimental discovery.
Paper 2 addresses a fundamental question about AI agents conducting autonomous science—a rapidly growing deployment area—revealing that LLMs fail to exhibit genuine scientific reasoning (ignoring evidence 68% of the time, rarely performing belief revision). This finding has broad implications across all fields using AI for research, challenges current evaluation paradigms, and identifies a critical gap (reasoning as a training target). Its 25,000+ agent runs across 8 domains provide strong empirical grounding. Paper 1, while technically rigorous in MoE interpretability, addresses a narrower architectural question with more specialized impact.
Paper 2 likely has higher scientific impact: it reports an end-to-end autonomous agent that conducts real-world experiments and claims a previously unreported, experimentally validated physical mechanism with potential implications for optical computing hardware—high novelty, strong real-world application potential, and broad cross-field relevance (AI agents, experimental physics, photonics hardware). Paper 1 is a rigorous and timely interpretability/control advance for MoE routing, but its impact is more scoped to ML model analysis and may be less transformative than a validated autonomous-discovery milestone plus new physical mechanism.
HealthFormer represents a paradigm-shifting contribution to precision medicine by creating a generative 'health world model' trained on deeply phenotyped longitudinal data from 15,000+ individuals across 667 measurements. Its ability to simulate clinical interventions in silico, validated against 41 randomized trial comparisons, and transfer to four independent cohorts for disease/mortality prediction has enormous real-world clinical impact. It addresses a fundamental medical challenge (personalized intervention prediction) with broad applications as clinical digital twins. Paper 2 makes a solid interpretability contribution to MoE architectures but has narrower scope, primarily advancing mechanistic understanding of expert specialization in language models.
HealthFormer addresses a central challenge in medicine—personalized health forecasting and intervention simulation—with broad clinical applications including risk stratification, digital twins, and in silico trial simulation. It demonstrates transfer across four independent cohorts, outperforms established clinical risk scores on 27/30 endpoints, and validates intervention predictions against published randomized trials. Its breadth of impact across medicine, public health, and precision nutrition far exceeds Paper 1's contribution, which, while novel for MoE interpretability, addresses a narrower ML architecture concern with less immediate real-world impact.
Paper 2 (Hodoscope) addresses a more broadly impactful and timely problem—unsupervised monitoring of AI agent misbehaviors—which is critical for AI safety as autonomous agents become widespread. It introduces a novel conceptual framework (unsupervised monitoring), demonstrates practical real-world impact by discovering a previously unknown benchmark vulnerability, and provides a generalizable tool applicable across diverse AI systems. Paper 1 makes solid contributions to MoE interpretability but is more niche, focusing on architectural transparency within a specific model family. Hodoscope's broader applicability to AI safety and its demonstrated practical discoveries give it higher potential impact.
Paper 2 likely has higher impact due to its unprecedented real-world, conference-scale deployment (22,977 papers) with direct operational relevance and immediate applicability to the scientific ecosystem. It addresses a timely, high-stakes bottleneck (peer review), provides empirical evidence via field data and surveys, and introduces a benchmark, making it broadly influential across disciplines and research governance. Paper 1 is novel and rigorous for MoE interpretability/control, but its impact is more specialized to ML architecture/interpretability, whereas Paper 2 could reshape review workflows and policy across fields.
Paper 2 offers foundational advancements in mechanistic interpretability and causal control for Mixture-of-Experts (MoE) architectures. Given the pervasive use of MoEs in frontier AI models, enabling zero-overhead causal steering and making expert specialization geometrically transparent addresses critical bottlenecks in AI safety, alignment, and reliability. This core AI breakthrough will likely generate significantly broader impact and higher citation volume across the field compared to Paper 1's domain-specific, albeit highly innovative, application of LLM agents to urban traffic control.
Paper 2 addresses fundamental interpretability and control mechanisms in Mixture-of-Experts (MoE) models, a critical topic in scaling large language models. The discovery of causally controllable, monosemantic experts provides a major breakthrough in AI transparency and alignment, promising broad applicability across all MoE-based foundation models. Paper 1 offers an innovative approach, but its impact is largely restricted to the specific applied domain of urban traffic control.
Paper 1 likely has higher scientific impact: it introduces a large-scale, domain-specific foundation model trained on unprecedented nationwide claims-scale data, demonstrates broad and externally validated performance gains across 1,000+ clinical tasks, and shows direct improvements in real-world evidence workflows (expenditure forecasting, reduced bias in target trial emulation). Its applications are immediate for healthcare, regulation, surveillance, and policy, with wide cross-disciplinary relevance (clinical ML, epidemiology, health economics). Paper 2 is novel for MoE interpretability/control, but its near-term real-world impact is more indirect and narrower.
Paper 1 presents a fundamentally new paradigm ('machine collective intelligence') for autonomous scientific discovery that addresses a central bottleneck in AI-driven science. It demonstrates broad applicability across deterministic, stochastic, and uncharacterized systems, achieving dramatic improvements (up to 6 orders of magnitude) in extrapolation over deep neural networks while providing interpretable equations. Its breadth of impact spans virtually all empirical sciences. Paper 2 makes valuable contributions to MoE interpretability and control, but its scope is narrower, focused on understanding routing mechanisms within a specific architecture class, with more incremental impact on the interpretability subfield.