
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

cs.LG · cs.AI · cs.CL
#1078 of 5669 · cs.LG
Tournament Score: 1469 ± 44
Win Rate: 59% (10 wins, 7 losses, 17 matches)
Rating: 7.2 / 10
Significance 7.5 · Rigor 7.8 · Novelty 6.8 · Clarity 8.2

Abstract

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a principled framework for studying the reproducibility of sparse autoencoder (SAE) features through feature stability—the probability that a given SAE feature reappears across independently trained runs. The key insight is that unstable features are not merely noise or training artifacts, but reflect basis ambiguity within reproducible low-rank subspaces. The paper decomposes the instability problem into functional, geometric, and constructive components: (1) stable features dominate reconstruction and prediction performance while unstable features contribute minimally; (2) unstable features, though individually non-reproducible, collectively span reproducible lower-dimensional subspaces across seeds; and (3) pooling high-stability features across seeds produces more robust SAE dictionaries without sacrificing explained variance.

This reframing—from "are SAE features reproducible?" to "what structure underlies non-reproducible features?"—is a meaningful conceptual advance for the mechanistic interpretability community.

Methodological Rigor

The experimental design is thorough and well-controlled. Training 96 SAEs with different random seeds provides substantial statistical power for estimating reappearance probabilities. The binomial framework for per-feature stability estimation is clean and principled, and the Glivenko-Cantelli convergence guarantee for the empirical CDF is appropriate.
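
As a rough illustration of the binomial estimate described above, the sketch below counts, for each feature of a reference SAE, the fraction of other seeds whose dictionary contains a direction with cosine similarity of at least θ. The function names, the toy 2-D dictionaries, and the choice of raw direction vectors as the matching signal are assumptions made for illustration; the paper's exact matching procedure may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def reappearance_probability(reference, others, theta=0.7):
    """Per reference feature: fraction of other dictionaries with a match.

    A "match" here means the best cosine similarity against that dictionary
    is at least theta (an assumed matching rule, argmax-cosine style).
    """
    probs = []
    for feat in reference:
        hits = sum(
            1
            for dictionary in others
            if max(cosine(feat, cand) for cand in dictionary) >= theta
        )
        probs.append(hits / len(others))
    return probs

# Hypothetical toy data: one reference SAE and four other seeds, 2 features each.
reference = [[1.0, 0.0], [0.0, 1.0]]
others = [
    [[0.99, 0.10], [0.10, 0.99]],   # both reference features reappear
    [[1.00, 0.05], [0.80, 0.60]],   # only the first reappears
    [[0.95, -0.20], [0.60, -0.80]], # only the first reappears
    [[0.90, 0.30], [0.20, 0.90]],   # both reappear
]

probs = reappearance_probability(reference, others)
print(probs)
```

Each entry of `probs` is a success fraction over independent seeds, which is why a binomial model is a natural fit for its uncertainty.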

Several methodological choices strengthen the work:

  • Frequency-matched masking protocol: By masking 4× more unstable features to match activation mass, they control for the confound that unstable features simply activate less frequently.
  • Hungarian matching robustness check: IoU of 0.978 between argmax-cosine and one-to-one matching validates the simpler many-to-one approach.
  • Dead-salmon control: Training SAEs on randomly initialized transformers provides a critical null baseline, showing that stability metrics can distinguish meaningful from meaningless features where auto-interpretation scores cannot.
  • Synthetic validation: The controlled low-rank model makes the subspace recovery mechanism explicit and testable.

However, some limitations are notable. The cosine threshold θ=0.7 and endpoint cutoff ε=0.05 are somewhat arbitrary; sensitivity analyses are provided, but the precise boundary between stable and unstable features remains threshold-dependent. The feature-pool construction experiment, while promising, involves only brief post-training on ~2M tokens, so it is unclear whether the advantage persists at scale or with longer training.
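
The frequency-matched masking control mentioned above can be sketched as follows. This is a minimal illustration, assuming that "activation mass" is measured as a per-feature activation count and that unstable features are added in a fixed order until their total matches the masked stable features; the function name and the toy counts are hypothetical and not taken from the paper.

```python
def frequency_matched_mask(stable_counts, unstable_counts, n_stable_masked):
    """Pick unstable features whose total activation count matches the stable arm.

    stable_counts / unstable_counts: per-feature activation counts (assumed unit).
    n_stable_masked: how many stable features are masked in the comparison arm.
    Returns (chosen unstable indices, target mass, achieved mass).
    """
    target = sum(stable_counts[:n_stable_masked])
    chosen, total = [], 0
    for idx, count in enumerate(unstable_counts):
        if total >= target:
            break
        chosen.append(idx)
        total += count
    return chosen, target, total

# Hypothetical counts: stable features fire often, unstable ones rarely.
stable_counts = [400, 400, 200]
unstable_counts = [100] * 12

chosen, target, total = frequency_matched_mask(stable_counts, unstable_counts, 2)
print(len(chosen), target, total)
```

In this toy setup, matching the mass of two stable features takes eight unstable ones, which mirrors the 4× ratio quoted above without claiming to reproduce the paper's selection rule.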

    Potential Impact

    For mechanistic interpretability: This paper has high practical relevance. It provides a concrete, scalable diagnostic (reappearance probability) that practitioners can use to assess which SAE features are trustworthy for downstream analysis. The finding that auto-interpretation scores can be high even for random-model SAEs (Figure 23 vs. Figure 8) is a cautionary result that strengthens the case for stability as a complementary evaluation criterion.

    For SAE architecture design: The stability–reconstruction trade-off analysis across SAE variants (Table 1) provides actionable guidance. The finding that Vanilla ReLU+ℓ₁ is extremely stable but has lower EV, while TopK variants show the opposite pattern, suggests clear directions for future architecture design.

    For representation learning more broadly: The insight that learned dictionary elements can be individually non-identifiable yet collectively span reproducible subspaces connects to fundamental questions about identifiability in dictionary learning and independent component analysis. This extends beyond SAEs to any sparse coding method.

    Limitations on impact: The work is primarily diagnostic rather than prescriptive. While the feature-pool construction demonstrates a route to more stable SAEs, it requires training multiple SAEs first—a computationally expensive proposition. The paper identifies the problem structure but doesn't fully resolve it through a single-run training objective.
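
To make the cost structure of the feature-pool route concrete, here is a minimal sketch of pooling unique features across several already-trained seeds. It assumes a greedy rule that skips any candidate whose cosine similarity to an already pooled feature reaches θ; the rule, the function names, and the toy dictionaries are illustrative assumptions, and the brief post-training step mentioned in the review is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pool_unique_features(dictionaries, theta=0.7):
    """Greedily pool features across seeds, skipping near-duplicates.

    A candidate is kept only if its cosine similarity to every feature
    already in the pool is below theta (assumed deduplication rule).
    """
    pool = []
    for dictionary in dictionaries:
        for feat in dictionary:
            if all(cosine(feat, kept) < theta for kept in pool):
                pool.append(feat)
    return pool

# Hypothetical dictionaries from three seeds, 2 features each.
seeds = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.99, 0.10], [0.6, -0.8]],
    [[0.10, 0.99], [0.6, -0.8]],
]

pool = pool_unique_features(seeds)
print(len(pool))
```

The loop touches every feature of every seed, so the pool can only be built after all contributing SAEs have been trained, which is the expense noted above.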

    Timeliness & Relevance

    This paper is highly timely. SAEs have become the dominant tool for mechanistic interpretability in 2024-2025, with major labs (Anthropic, OpenAI, DeepMind) investing heavily. Concurrent work by Paulo and Belrose (2025), Leask et al. (2025), and Bhalla et al. (2026) all raise concerns about SAE reproducibility, but this paper provides the most systematic empirical characterization and the clearest geometric explanation. The growing reliance on SAE-derived features for safety-relevant analyses (e.g., steering, monitoring) makes understanding their reliability a pressing concern.

    Strengths

    1. Scale and comprehensiveness: 96 seeds, multiple models (GPT-2, Pythia, Gemma-2), 5 SAE variants, multiple layers and dictionary sizes—this is one of the most thorough empirical studies on SAE behavior.

    2. Multi-level analysis: The paper connects activation statistics, token-level patterns, automatic interpretability, reconstruction impact, next-token loss, and geometric structure into a coherent narrative.

    3. The subspace recovery finding (Section 6.2, Figure 5) is the paper's strongest result—showing that cross-seed transfer of SVD subspaces works nearly as well as within-seed projection is compelling evidence for the basis ambiguity interpretation.

    4. Practical utility: The stability metric is cheap to compute, interpretable, and immediately useful for SAE practitioners.

    5. Intellectual honesty: The limitations section is unusually forthcoming, clearly delineating what the results do and do not show.
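
The cross-seed subspace transfer highlighted in strength 3 can be illustrated with a small synthetic check. The sketch below is an assumption-laden toy, not the paper's experiment: two "seeds" are generated as different random mixtures of the same low-rank subspace, an SVD basis is fitted on one seed, and the fraction of squared norm it retains is measured on both. The dimensions, noise level, and helper names are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def top_subspace(features, rank):
    """Orthonormal basis (rank x d) spanning the top singular directions
    of a feature matrix with shape (n_features, d)."""
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    return vt[:rank]

def captured_energy(features, basis):
    """Fraction of the squared norm of `features` kept after projecting
    onto the row space of `basis`."""
    projected = features @ basis.T
    return float((projected ** 2).sum() / (features ** 2).sum())

rng = np.random.default_rng(0)
d, r, n = 16, 3, 10  # ambient dim, subspace rank, features per seed (toy values)

# Shared low-rank subspace with orthonormal rows, shape (r, d).
shared = np.linalg.qr(rng.normal(size=(d, r)))[0].T

# Each seed uses its own random mixing of the shared subspace plus small noise,
# so individual vectors differ across seeds while the span is (nearly) the same.
seed_a = rng.normal(size=(n, r)) @ shared + 0.01 * rng.normal(size=(n, d))
seed_b = rng.normal(size=(n, r)) @ shared + 0.01 * rng.normal(size=(n, d))

basis_a = top_subspace(seed_a, r)
within = captured_energy(seed_a, basis_a)  # within-seed projection
cross = captured_energy(seed_b, basis_a)   # cross-seed transfer
print(within, cross)
```

If the basis ambiguity reading holds, `cross` should sit close to `within`, since both seeds draw from the same subspace even though no individual vector is shared.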

    Limitations & Weaknesses

    1. Causality gap: The paper demonstrates correlation between instability and low functional impact but doesn't establish whether low-impact features become unstable or whether instability causes features to be low-impact.

    2. Scale concerns: Most experiments use GPT-2 (124M parameters) and relatively small dictionary sizes. The Gemma-2 2B experiments are more limited (fewer seeds). Whether the stable/unstable dichotomy maintains the same character in frontier-scale models is unclear.

    3. The feature-pool construction requires training many SAEs upfront, limiting practical applicability. The brief 2M-token post-training is minimal.

    4. Missing comparison to concurrent methods: Archetypal SAEs (Fel et al., 2025) and ordered latent SAEs (Wang et al., 2025b) are mentioned but not rigorously compared.

    5. No mechanistic explanation for why certain activation space regions are low-rank is provided, though this is acknowledged as future work.

    Overall Assessment

    This is a well-executed empirical study that provides the mechanistic interpretability community with both a useful diagnostic tool and a deeper understanding of SAE failure modes. The subspace recovery insight is genuine and important. While the work is primarily observational rather than providing a definitive solution, it establishes the right conceptual framework and empirical baselines for future work on SAE reliability.

    Rating: 7.2 / 10
    Significance 7.5 · Rigor 7.8 · Novelty 6.8 · Clarity 8.2

    Generated Jun 11, 2026

    Comparison History (17)

    Won vs. VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

    Paper 2 addresses a fundamental question about the reliability and interpretability of sparse autoencoders, a widely-used tool in mechanistic interpretability research. Its findings—that stable features carry most functional signal while unstable features reflect basis ambiguity in reproducible subspaces—have broad implications for the rapidly growing field of AI interpretability. The methodological rigor (large-scale study across seeds, models, layers, plus synthetic verification) and practical contribution (constructing more stable SAEs) give it wide applicability. Paper 1, while solid, offers incremental improvements to motion generation with a narrower application domain.

    claude-opus-4-6 · Jun 12, 2026

    Won vs. Getting Better at Working With You: Compiling User Corrections into Runtime Enforcement for Coding Agents

    Paper 2 likely has higher scientific impact: it addresses a foundational, widely relevant issue in mechanistic interpretability—seed dependence and reproducibility of SAE features—using a scalable per-feature stability metric, extensive cross-condition experiments, geometric/subspace framing, and a synthetic model to establish mechanism. Its findings inform how SAEs should be evaluated, compared, and aggregated across runs, influencing interpretability, representation learning, and reliability/benchmarking practices across many labs. Paper 1 is practically valuable for coding agents, but is more application-specific and may generalize less broadly than Paper 2’s conceptual and methodological contribution.

    gpt-5.2 · Jun 12, 2026

    Lost vs. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

    Paper 1 addresses a fundamental challenge in latent reasoning for LLMs—making hidden-state recurrence compatible with on-policy RL and interpretable. This has broad implications for efficient reasoning in language models, a highly active research area. The SWITCH framework offers practical advances (RL-trainability, mechanistic interpretability) with clear real-world applications in deploying efficient reasoning models. Paper 2, while rigorous and insightful regarding SAE feature stability, addresses a more niche interpretability methodology question. Paper 1's novelty in bridging latent reasoning with standard RL training and its timeliness in the reasoning-model era give it higher potential impact.

    claude-opus-4-6 · Jun 12, 2026

    Lost vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

    Post-training for reasoning is currently a central focus in advancing LLMs. Paper 1 provides highly timely, mechanistic insights into how reinforcement learning enhances reasoning, offering actionable interventions for scaling model capabilities. While Paper 2 offers valuable insights into mechanistic interpretability, Paper 1 has broader and more immediate implications for advancing state-of-the-art AI performance and real-world applications.

    gemini-3.1-pro-preview · Jun 12, 2026

    Won vs. Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

    While Paper 1 presents a highly innovative application of diffusion models to neuroscience, Paper 2 addresses a fundamental, timely bottleneck in AI mechanistic interpretability. By analyzing seed dependence and feature stability in Sparse Autoencoders (SAEs), Paper 2 provides crucial theoretical and empirical insights for understanding neural network representations. This foundational work in AI safety and interpretability will likely have a broader and more immediate impact on how researchers analyze and align large language models, making its methodological contributions highly influential for the rapidly moving AI research community.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. Mos-Gen: A Generative Molecular Framework for Mosquito Insecticide Design

    While Paper 1 offers valuable theoretical insights into AI interpretability, Paper 2 demonstrates profound real-world scientific impact by applying generative AI to a critical global health crisis (mosquito-borne diseases). Notably, Paper 2 goes beyond computational predictions by synthesizing and experimentally validating the generated compounds in the wet lab, achieving a 78% hit rate. This rigorous end-to-end validation bridging computational design and biological efficacy gives it exceptionally high potential for immediate, cross-disciplinary scientific impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. SPADE: Faster Drug Discovery by Learning from Sparse Data

    Paper 2 addresses a critical bottleneck in drug discovery, offering substantial improvements in sample efficiency and a 10x speedup over existing methods. Its application directly impacts healthcare, pharmaceuticals, and computational biology, providing immense real-world value. While Paper 1 makes strong contributions to the important niche of AI interpretability, Paper 2's cross-disciplinary utility and potential to accelerate life-saving therapeutics give it a broader and more immediate scientific and societal impact.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

    Paper 2 addresses a critical, fundamental bottleneck in mechanistic interpretability: the reproducibility and reliability of Sparse Autoencoders (SAEs). By theoretically and empirically explaining seed dependence and feature stability, it fundamentally shifts how researchers interpret neural network representations. While Paper 1 offers a valuable algorithmic efficiency improvement for agentic RL, Paper 2 has broader implications for AI safety, alignment, and our foundational understanding of model interpretability tools.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

    K-Forcing addresses a critical bottleneck in LLM inference—sequential decoding speed—which is highly relevant to industrial-scale deployment. The 2.4-3.5x speedup for batch serving fills an important gap not addressed by speculative decoding. Its compatibility with existing AR infrastructure increases adoption potential. While Paper 2 provides valuable insights into SAE feature stability and interpretability, it primarily deepens understanding of an existing tool rather than enabling new capabilities. K-Forcing's direct applicability to the massive and growing LLM serving ecosystem gives it broader practical impact.

    claude-opus-4-6 · Jun 11, 2026

    Won vs. Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

    Paper 1 addresses a fundamental methodological concern about sparse autoencoders in mechanistic interpretability—whether learned features are reproducible—with rigorous large-scale empirical analysis and theoretical grounding via synthetic models. It provides actionable insights (stable vs. unstable features, subspace reproducibility, basis ambiguity) that impact the entire field of neural network interpretability. Paper 2 presents a useful but more incremental contribution to AI control/monitoring with a narrower scope. Paper 1's breadth of impact across interpretability research, its methodological depth, and its relevance to the rapidly growing SAE literature give it higher potential impact.

    claude-opus-4-6 · Jun 11, 2026