Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs
Charles Ye, Bo Yuan, Lee Sharkey
Abstract
An LLM's residual stream is both state and instruction: it encodes the current context and determines the next transformation. We introduce a parameter-free decomposition for Mixture-of-Experts models that splits each layer's hidden state into a control signal that causally drives routing and an orthogonal content channel invisible to the router. Across six MoE architectures, we find that models preserve surface-level features (language, token identity, position) in the content channel, while the control signal encodes an abstract function that rotates from layer to layer. Because each routing decision is low-bandwidth, this hand-off forces compositional specialization across layers. While individual experts remain polysemantic, expert paths become monosemantic, clustering tokens by semantic function across languages and surface forms. The same token (e.g., ":") follows distinct trajectories depending on whether it serves as a type annotation, an introductory colon, or a time separator. Our decomposition identifies the source of this structure: clusters in the control subspace are substantially more monosemantic than those in the full representation. As a result, the natural unit of interpretability in MoEs is not the expert but the trajectory.
AI Impact Assessments (3 models)
Scientific Impact Assessment: "Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs"
1. Core Contribution
This paper introduces a parameter-free decomposition of the residual stream in MoE models into two orthogonal components: a control signal (router-visible) that causally determines expert selection, and a content channel (router-blind) that carries information invisible to the router. The decomposition is mathematically exact—derived via SVD of the routing matrix—and requires no learned parameters or approximations. The key insight is reframing the residual stream as simultaneously encoding *state* (what is being processed) and *instruction* (what computation to apply next), and showing that MoEs make this separation recoverable due to the linearity of routing.
The paper's central empirical finding is that while individual experts are polysemantic (confirming prior work), expert paths—sequences of expert assignments across layers—are monosemantic, clustering tokens by semantic function rather than surface form. The decomposition explains why: the control subspace encodes abstract functional information that rotates across layers, while content preserves surface-level features.
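To make the notion of an expert path concrete, here is a minimal sketch that groups tokens by their sequence of expert assignments across layers. It assumes top-1 routing for simplicity (real MoEs typically select several experts per layer), and the function name `group_by_path` and the toy records are hypothetical, hand-made for illustration rather than taken from the paper.

```python
from collections import defaultdict

def group_by_path(records):
    """Group token contexts by their expert path.

    records: iterable of (context, path) pairs, where path lists the
    expert chosen at each layer (top-1 routing assumed).
    """
    groups = defaultdict(list)
    for context, path in records:
        groups[tuple(path)].append(context)
    return dict(groups)

# Hand-made toy data: expert ids are invented, not measured from any model.
records = [
    ("x: int",    [3, 1, 7]),   # colon as type annotation
    ("y: str",    [3, 1, 7]),   # colon as type annotation
    ("10:30",     [0, 4, 2]),   # colon as time separator
    ("Note: see", [5, 4, 2]),   # introductory colon
]

groups = group_by_path(records)
for path, contexts in groups.items():
    print(path, contexts)
```

In this toy setup the second-layer expert 4 is shared by two different uses of the colon, while the full path separates them, which is the sense in which a single expert can be polysemantic and a path monosemantic.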
2. Methodological Rigor
The decomposition itself is mathematically clean and the causal claim is well-grounded: since routing is linear (the router logits are a linear function of the hidden state), anything in the null space of the routing matrix literally cannot influence expert selection. This is not a statistical argument but a mathematical guarantee, which is a genuine strength.
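The null-space argument can be checked numerically. The sketch below is one plausible way to realize an SVD-based split, assuming a bias-free linear router whose logits are `W @ h`; the function name `split_control_content`, the matrix sizes, and the random stand-in values are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def split_control_content(W, h, tol=1e-10):
    """Split h into a router-visible part and a router-blind part.

    W: (n_experts, d_model) router weights, routing logits = W @ h.
    h: (d_model,) hidden state.
    """
    # Right singular vectors with non-negligible singular values
    # span the row space of W.
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    V = Vt[s > tol * s[0]]       # (rank, d_model), orthonormal rows
    control = V.T @ (V @ h)      # orthogonal projection onto the row space
    content = h - control        # remainder lies in the null space of W
    return control, content, V

# Toy stand-ins: random values, not weights from any real model.
rng = np.random.default_rng(0)
n_experts, d_model = 8, 64
W = rng.normal(size=(n_experts, d_model))
h = rng.normal(size=d_model)

control, content, V = split_control_content(W, h)
logits = W @ h

# The content part contributes nothing to the routing logits.
assert np.allclose(W @ content, 0.0, atol=1e-9)
assert np.allclose(W @ control, logits)
# The two parts are orthogonal.
assert abs(control @ content) < 1e-9
# Rescaling the content part leaves the top-2 expert choice unchanged.
h_edit = control + 5.0 * content
assert set(np.argsort(W @ h_edit)[-2:]) == set(np.argsort(logits)[-2:])
```

Because the rank of `W` is at most the number of experts, the control part lives in a subspace of at most `n_experts` dimensions, however large `d_model` is.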
The empirical validation is thorough for a workshop paper, spanning six diverse MoE architectures (7B–106B parameters, different sparsities, shared experts, hybrid attention). Key findings are replicated across all models.
However, there are methodological limitations. The monosemanticity claim for paths relies on automated scoring with limited detail about the scoring methodology—only ~100 clusters are sampled per subspace, and the diversity metric (unique token IDs per cluster) is a rough proxy. The "4× more diverse" claim for control-subspace clusters is compelling but would benefit from more rigorous semantic evaluation. The path analysis focuses on a specific layer range (8–16) in one model (GPT-OSS-20B), raising questions about generalizability of the path monosemanticity finding.
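For reference, the diversity proxy described above (unique token IDs per cluster) can be sketched as follows. The helper names and the toy clusters are hypothetical, and the numbers are invented to show the computation, not to reproduce any reported result.

```python
def unique_token_count(cluster_token_ids):
    """Number of distinct token IDs in one cluster."""
    return len(set(cluster_token_ids))

def mean_diversity(clusters):
    """Average unique-token count over a list of clusters."""
    counts = [unique_token_count(ids) for ids in clusters]
    return sum(counts) / len(counts)

# Invented toy clusters of token IDs.
control_clusters = [[11, 42, 42, 97], [5, 8, 13]]
full_clusters = [[11, 11, 11], [42, 42]]

print(mean_diversity(control_clusters))  # 3.0
print(mean_diversity(full_clusters))     # 1.0
```

The metric only counts surface variety: it cannot tell whether the distinct tokens in a cluster share a function, which is why a separate semantic evaluation would strengthen the claim.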
3. Potential Impact
Interpretability: The paper shifts the unit of analysis from individual experts to trajectories, which could redirect a significant body of MoE interpretability research. Prior work finding weak expert-level specialization may have been asking the wrong question entirely.
Steering and control: The orthogonal decomposition provides a natural coordinate system for interventions—perturbing control without corrupting content, or vice versa. This has immediate practical implications for model editing and alignment work.
Architecture design: Understanding that routing operates in a low-dimensional subspace (~2-5% of dimensions) could inform more efficient router designs or training objectives.
Dense model interpretability: The speculation that state-instruction duality extends to dense models, but is harder to identify without discrete routing, is thought-provoking and could inspire new decomposition methods for dense transformers.
4. Timeliness & Relevance
This paper is highly timely. MoE architectures dominate frontier models (as the paper notes, all top-10 disclosed architectures are MoEs), yet mechanistic understanding of how they organize computation is limited. The paper directly addresses the gap between MoE's practical success and theoretical understanding. The interpretability community has increasingly focused on MoEs, and this work provides new tools and conceptual framing.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment:
This is an unusually strong workshop paper that introduces a clean theoretical framework backed by solid empirical evidence across multiple architectures. The core insight—that the linearity of routing enables exact separation of control from content—is simple but powerful, and the "rotating control, accumulating content" finding provides a compelling mechanistic story. The shift from expert-level to trajectory-level analysis could meaningfully redirect MoE interpretability research. The main limitation is depth: several claims (especially around monosemanticity) need more rigorous evaluation that a full paper could provide.
Generated Apr 21, 2026
Comparison History (91)
Paper 1 likely has higher scientific impact: it introduces a unified, physically grounded framework (GSS) that bridges diffusion generation and random structure search, delivering large practical gains (10× lower sampling cost) and demonstrating out-of-distribution effectiveness—key for materials/molecular discovery. Its real-world applicability (structure prediction, catalyst/battery/drug/material screening) is immediate and cross-disciplinary (chemistry, physics, materials science, ML). Paper 2 is novel and timely for MoE interpretability, but is primarily analytical/diagnostic with less direct downstream utility compared to a method enabling new structure discovery.
HealthFormer demonstrates broader scientific impact by addressing a central challenge in medicine with a novel generative model that serves as a 'health world model.' It shows impressive real-world clinical applications: transferring to independent cohorts, outperforming established clinical risk scores across 27/30 endpoints, and accurately simulating clinical interventions validated against published trials. The concept of clinical digital twins has transformative potential for personalized medicine. While Paper 1 offers valuable interpretability insights for MoE architectures, its impact is narrower, primarily advancing mechanistic understanding within the AI/ML community rather than enabling broad translational applications.
Paper 1 introduces a groundbreaking foundational model for human physiology with direct, transformative applications in personalized medicine and clinical trial simulation. Its ability to accurately predict disease endpoints and simulate interventions in silico gives it massive potential for real-world impact across medicine and healthcare, outweighing Paper 2's specialized focus on the mechanistic interpretability of Mixture-of-Experts models in AI.
Paper 2 offers a fundamental conceptual breakthrough in understanding Mixture-of-Experts (MoEs) by reframing the unit of interpretability from individual experts to expert trajectories. This parameter-free decomposition has profound implications for AI interpretability, safety, and model design. While Paper 1 provides a highly valuable dataset for AI-driven scientific discovery, Paper 2's insights address a critical gap in our understanding of state-of-the-art LLM architectures, likely leading to broad methodological shifts in how large models are analyzed and controlled.
Paper 2 likely has higher scientific impact because it turns interpretability into a scalable, actionable training method with clear real-world utility (data-efficient fine-tuning) and broad applicability across tasks and model families. The reported gains and comparisons to baselines suggest stronger practical relevance and timeliness for current LLM optimization workflows. Paper 1 is novel and conceptually valuable for MoE interpretability, but its impact is narrower (MoE-specific) and more descriptive than actionable, with less immediate downstream leverage for training or deployment.
Paper 2 likely has higher impact: it turns interpretability into an actionable training method (interpretability-guided data selection) with strong, quantifiable downstream gains and data-efficiency across multiple tasks and model families, making it broadly applicable and timely for LLM optimization. Its claims are experimentally grounded (baselines, multiple models/tasks, measurable improvements), and the real-world utility (cheaper fine-tuning, targeted data curation) is immediate. Paper 1 is novel and valuable for MoE interpretability, but its applications are more indirect and narrower to MoE routing analyses.
Paper 1 offers a paradigm-shifting insight for mechanistic interpretability by demonstrating that trajectories, not individual experts, are the natural unit of interpretability in MoEs. This fundamental theoretical contribution and parameter-free decomposition approach have broad implications for understanding and controlling large language models, giving it a deeper and more lasting scientific impact compared to the practical, domain-specific training improvements of Paper 2.
Paper 1 introduces a novel theoretical framework for understanding MoE models that reveals fundamental architectural principles—decomposing hidden states into control and content channels, showing expert paths are monosemantic even when individual experts are polysemantic. This reframes interpretability for MoE architectures (shifting the unit from expert to trajectory) with broad implications across mechanistic interpretability research. Paper 2, while practically useful, is more incremental—combining existing techniques (majority voting, formal verification, RLVR) in a clever engineering pipeline. Paper 1's conceptual contribution is more likely to influence future research directions across multiple subfields.
Paper 1 likely has higher impact: it proposes a multimodal generative foundation model for biomolecules plus a new aligned dataset (LORE), unifying prediction and conditional design across RNA/protein modalities with concrete biological and translational demonstrations (splicing, disease mutation editing suggestions, binding-site–conditioned protein design). This broadens real-world applicability and cross-field relevance (genomics, structural biology, drug design). Paper 2 is novel and timely for MoE interpretability/control but is primarily conceptual/diagnostic; near-term applications are less direct, so expected downstream scientific and societal impact is narrower.
Paper 1 likely has higher scientific impact due to a more novel and broadly applicable multimodal generative framework for biomolecules, backed by a new aligned dataset (LORE), strong downstream results (e.g., SOTA splicing), and clear translational applications in RNA editing and protein design. Its methodology spans multiple biological modalities and tasks, potentially influencing genomics, structural biology, and drug discovery. Paper 2 is innovative and timely for interpretability/control in MoE LLMs, but its primary impact is narrower (mainly ML theory/analysis) and less directly tied to immediate real-world deployment outcomes.
Paper 2 offers a fundamental breakthrough in mechanistic interpretability for Mixture-of-Experts (MoE) architectures, shifting the paradigm from analyzing individual experts to understanding expert trajectories. Given the dominance of MoEs in state-of-the-art AI, this insight has profound implications for AI safety, alignment, and model design. While Paper 1 provides a valuable tool for addressing the replication crisis, Paper 2's theoretical and architectural insights have a deeper, more immediate structural impact on the rapidly advancing field of AI development.
Paper 2 offers a fundamental breakthrough in AI interpretability by redefining how Mixture-of-Experts (MoEs) process information, shifting the focus from individual experts to monosemantic trajectories. This theoretical advancement is likely to broadly influence future MoE architecture design and safety research. Paper 1, while highly valuable for addressing the reproducibility crisis in social sciences, represents an applied use of LLMs rather than a foundational algorithmic or theoretical advancement.
While Paper 1 addresses an important socio-technical issue regarding LLM bias, Paper 2 offers a foundational breakthrough in mechanistic interpretability for Mixture-of-Experts (MoE) architectures. By shifting the unit of analysis from individual experts to routing trajectories, Paper 2 provides a novel, parameter-free framework that solves a critical bottleneck in understanding state-of-the-art LLMs. This structural insight into how MoEs process information compositionally will likely catalyze significant downstream research in AI alignment, model steering, and architecture design, giving it a higher potential for deep, long-lasting scientific impact in machine learning.
Paper 1 introduces a novel, parameter-free decomposition revealing fundamental structure in MoE models—showing that interpretability units are trajectories, not individual experts. This insight is broadly applicable across architectures (tested on 6 MoE models), offers a new theoretical framework for mechanistic interpretability, and addresses a timely question as MoE models become dominant. Paper 2 proposes an incremental engineering contribution (hierarchical error correction for VLAs) with narrower scope. Paper 1's conceptual contribution—separating control signals from content in routing—has deeper implications for understanding and improving large language models.
Paper 1 introduces a novel, parameter-free decomposition revealing fundamental structure in MoE models—showing that interpretability's natural unit is the trajectory, not the expert. This insight is broadly applicable across architectures (validated on 6 MoE models), offers deep mechanistic understanding of how routing creates compositional specialization, and contributes to the critical field of LLM interpretability. Paper 2 proposes a useful but more incremental engineering framework for VLA error correction. Paper 1's conceptual contribution (control vs. content decomposition, monosemantic paths) is more likely to influence future research directions across multiple subfields.
Paper 1 demonstrates that brief AI chatbot conversations can durably shift human moral values without participants' awareness, a finding with profound implications for AI safety, policy, ethics, and society at large. Its breadth of impact extends beyond computer science to psychology, law, and governance. The timeliness is exceptional given rapid chatbot adoption. While Paper 2 offers valuable mechanistic insights into MoE interpretability, its impact is more narrowly technical. Paper 1's findings are likely to influence regulation, public discourse, and interdisciplinary research more broadly.
Paper 2 demonstrates that brief AI chatbot interactions can produce lasting, undetected shifts in human moral values—a finding with profound implications for AI policy, ethics, regulation, and society. Its accessibility and direct relevance to billions of chatbot users gives it enormous breadth of impact across psychology, AI ethics, law, and public policy. While Paper 1 makes a sophisticated technical contribution to MoE interpretability, its audience is narrower (mechanistic interpretability researchers). Paper 2's large effect sizes and real-world urgency position it for substantially higher citation counts and policy influence.
While Paper 1 offers profound insights into AI interpretability for MoE architectures, Paper 2 provides a methodological breakthrough with far broader cross-disciplinary impact. By solving a fundamental statistical problem (prevalence estimation under covariate shift) applicable to public health, social sciences, and beyond, Paper 2 extends its utility across nearly all academic disciplines that rely on classification models, granting it higher potential for widespread scientific impact.
Paper 1 addresses a fundamental statistical problem—prevalence estimation under covariate shift—with implications spanning public health, social sciences, and AI safety. By connecting fairness theory (multicalibration) to a ubiquitous measurement challenge, it offers a rigorous, broadly applicable solution. While Paper 2 provides valuable insights into LLM interpretability, Paper 1's potential for cross-disciplinary real-world impact and its fundamental methodological contribution give it a higher overall scientific impact.
Paper 2 introduces a fundamental mechanistic insight about how MoE models organize computation, proposing that trajectories (not individual experts) are the natural unit of interpretability. This parameter-free decomposition validated across six architectures offers broad theoretical contributions to mechanistic interpretability, a rapidly growing field. Paper 1, while addressing an important practical problem in RLHF safety, is more incremental—combining known techniques (red-teaming, RM fine-tuning) into a pipeline. Paper 2's conceptual reframing of MoE interpretability has greater potential to influence future research directions across multiple subfields.