Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov
Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
This paper introduces a principled framework for studying the reproducibility of sparse autoencoder (SAE) features through feature stability—the probability that a given SAE feature reappears across independently trained runs. The key insight is that unstable features are not merely noise or training artifacts, but reflect basis ambiguity within reproducible low-rank subspaces. The paper decomposes the instability problem into functional, geometric, and constructive components: (1) stable features dominate reconstruction and prediction performance while unstable features contribute minimally; (2) unstable features, though individually non-reproducible, collectively span reproducible lower-dimensional subspaces across seeds; and (3) pooling high-stability features across seeds produces more robust SAE dictionaries without sacrificing explained variance.
This reframing—from "are SAE features reproducible?" to "what structure underlies non-reproducible features?"—is a meaningful conceptual advance for the mechanistic interpretability community.
The experimental design is thorough and well-controlled. Training 96 SAEs with different random seeds provides substantial statistical power for estimating reappearance probabilities. The binomial framework for per-feature stability estimation is clean and principled, and the Glivenko-Cantelli convergence guarantee for the empirical CDF is appropriate.
Several methodological choices strengthen the work:
However, some limitations are notable. The cosine threshold θ=0.7 and endpoint cutoff ε=0.05 are somewhat arbitrary, and while sensitivity analyses are provided, the precise boundary between stable and unstable features remains threshold-dependent. The feature-pool construction experiment, while promising, involves brief post-training on only ~2M tokens, so it is unclear whether the advantage persists at scale or with longer training.
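The per-feature estimate discussed above can be illustrated in a few lines. This is a minimal Python sketch, not the authors' implementation. It assumes that similarity is measured as cosine similarity between decoder directions, and that a feature is labeled stable when its estimated reappearance probability is at least 1 - ε and unstable when it is at most ε. The function names `reappearance_probability` and `label_features` are hypothetical.

```python
import numpy as np

def reappearance_probability(ref_decoder, other_decoders, theta=0.7):
    """Estimate, for each reference feature, the fraction of other runs
    that contain at least one feature with cosine similarity >= theta.

    ref_decoder:    array of shape (n_features, d_model)
    other_decoders: list of arrays of shape (n_features_k, d_model)
    Assumption: features are compared via their decoder directions.
    """
    ref = ref_decoder / np.linalg.norm(ref_decoder, axis=1, keepdims=True)
    hits = np.zeros(ref.shape[0])
    for dec in other_decoders:
        other = dec / np.linalg.norm(dec, axis=1, keepdims=True)
        # Best match in this run for every reference feature.
        max_cos = (ref @ other.T).max(axis=1)
        hits += max_cos >= theta
    # Binomial proportion: matches divided by number of comparison runs.
    return hits / len(other_decoders)

def label_features(p_hat, eps=0.05):
    """Assumed labeling rule: the endpoints of [0, 1] define the two groups."""
    labels = np.full(p_hat.shape, "intermediate", dtype=object)
    labels[p_hat >= 1 - eps] = "stable"
    labels[p_hat <= eps] = "unstable"
    return labels
```

Because both `theta` and `eps` are explicit arguments, rerunning the two functions over a grid of values is a direct way to check how much the stable/unstable split depends on the thresholds.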
For mechanistic interpretability: This paper has high practical relevance. It provides a concrete, scalable diagnostic (reappearance probability) that practitioners can use to assess which SAE features are trustworthy for downstream analysis. The finding that auto-interpretation scores can be high even for random-model SAEs (Figure 23 vs. Figure 8) is a cautionary result that strengthens the case for stability as a complementary evaluation criterion.
For SAE architecture design: The stability–reconstruction trade-off analysis across SAE variants (Table 1) provides actionable guidance. The finding that Vanilla ReLU+ℓ₁ is extremely stable but has lower EV, while TopK variants show the opposite pattern, suggests clear directions for future architecture design.
For representation learning more broadly: The insight that learned dictionary elements can be individually non-identifiable yet collectively span reproducible subspaces connects to fundamental questions about identifiability in dictionary learning and independent component analysis. This extends beyond SAEs to any sparse coding method.
Limitations on impact: The work is primarily diagnostic rather than prescriptive. While the feature-pool construction demonstrates a route to more stable SAEs, it requires training multiple SAEs first—a computationally expensive proposition. The paper identifies the problem structure but doesn't fully resolve it through a single-run training objective.
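To make the cost structure of pooling concrete, the following Python sketch shows one way a cross-seed feature pool could be assembled. It is an illustration under stated assumptions, not the paper's procedure: it assumes greedy deduplication by cosine similarity between decoder directions, reuses θ=0.7 as the deduplication threshold, and omits the post-training step entirely. The function name `pool_unique_features` is hypothetical.

```python
import numpy as np

def pool_unique_features(decoders, theta=0.7):
    """Greedily pool decoder directions from several runs.

    A direction is kept only if its cosine similarity with every
    direction already in the pool is below theta (assumed rule).
    decoders: list of arrays of shape (n_features_k, d_model)
    Returns an array of unit-norm pooled directions.
    """
    pooled = []
    for dec in decoders:
        unit = dec / np.linalg.norm(dec, axis=1, keepdims=True)
        for v in unit:
            # Compare against everything kept so far.
            if not pooled or np.max(np.stack(pooled) @ v) < theta:
                pooled.append(v)
    return np.stack(pooled)
```

The input list already requires one trained SAE per seed, which is the upfront cost noted above. The greedy loop is also quadratic in the number of pooled features, so a real pipeline would likely need batching or approximate nearest-neighbor search.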
This paper is highly timely. SAEs have become the dominant tool for mechanistic interpretability in 2024-2025, with major labs (Anthropic, OpenAI, DeepMind) investing heavily. Concurrent work by Paulo and Belrose (2025), Leask et al. (2025), and Bhalla et al. (2026) also raises concerns about SAE reproducibility, but this paper provides the most systematic empirical characterization and the clearest geometric explanation. The growing reliance on SAE-derived features for safety-relevant analyses (e.g., steering, monitoring) makes understanding their reliability a pressing concern.
Strengths:
1. Scale and comprehensiveness: 96 seeds, multiple models (GPT-2, Pythia, Gemma-2), 5 SAE variants, and multiple layers and dictionary sizes make this one of the most thorough empirical studies of SAE behavior.
2. Multi-level analysis: The paper connects activation statistics, token-level patterns, automatic interpretability, reconstruction impact, next-token loss, and geometric structure into a coherent narrative.
3. The subspace recovery finding (Section 6.2, Figure 5) is the paper's strongest result—showing that cross-seed transfer of SVD subspaces works nearly as well as within-seed projection is compelling evidence for the basis ambiguity interpretation.
4. Practical utility: The stability metric is cheap to compute, interpretable, and immediately useful for SAE practitioners.
5. Intellectual honesty: The limitations section is unusually forthcoming, clearly delineating what the results do and do not show.
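The subspace transfer test highlighted in item 3 can be sketched as follows. This is a hedged Python illustration rather than the paper's exact protocol: it assumes the subspace is taken from the top right-singular vectors of one seed's unstable decoder directions, and that transfer quality is measured as the fraction of squared norm of another seed's directions that falls inside that subspace. The names `top_subspace` and `captured_fraction` are hypothetical.

```python
import numpy as np

def top_subspace(directions, rank):
    """Orthonormal basis (rank, d_model) of the top right-singular
    subspace of the stacked feature directions."""
    _, _, vt = np.linalg.svd(directions, full_matrices=False)
    return vt[:rank]

def captured_fraction(directions, basis):
    """Fraction of the squared norm of `directions` that lies in span(basis).

    basis must have orthonormal rows, as returned by top_subspace.
    """
    coords = directions @ basis.T  # coordinates inside the subspace
    return float((coords ** 2).sum() / (directions ** 2).sum())

# Toy setup: two "seeds" whose features are different random mixtures
# of the same 2-D plane inside a 5-D activation space.
rng = np.random.default_rng(0)
plane = np.eye(5)[:2]
seed_a = rng.normal(size=(10, 2)) @ plane
seed_b = rng.normal(size=(10, 2)) @ plane

basis_a = top_subspace(seed_a, rank=2)
within = captured_fraction(seed_a, basis_a)  # within-seed projection
cross = captured_fraction(seed_b, basis_a)   # cross-seed transfer
```

In this toy case no individual direction in `seed_b` needs to match any direction in `seed_a`, yet the subspace fitted on one seed captures the other almost entirely. That is the pattern the basis ambiguity interpretation predicts.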
Weaknesses:
1. Causality gap: The paper demonstrates a correlation between instability and low functional impact but does not establish whether low-impact features become unstable or whether instability causes features to be low-impact.
2. Scale concerns: Most experiments use GPT-2 (124M parameters) and relatively small dictionary sizes. The Gemma-2 2B experiments are more limited (fewer seeds). Whether the stable/unstable dichotomy maintains the same character in frontier-scale models is unclear.
3. The feature-pool construction requires training many SAEs upfront, limiting practical applicability, and the post-training stage covers only about 2M tokens.
4. Missing comparison to concurrent methods: Archetypal SAEs (Fel et al., 2025) and ordered latent SAEs (Wang et al., 2025b) are mentioned but not rigorously compared.
5. No mechanistic explanation for *why* certain activation space regions are low-rank is provided, though this is acknowledged as future work.
This is a well-executed empirical study that provides the mechanistic interpretability community with both a useful diagnostic tool and a deeper understanding of SAE failure modes. The subspace recovery insight is genuine and important. While the work is primarily observational rather than providing a definitive solution, it establishes the right conceptual framework and empirical baselines for future work on SAE reliability.
Generated Jun 11, 2026
Paper 2 addresses a fundamental question about the reliability and interpretability of sparse autoencoders, a widely used tool in mechanistic interpretability research. Its findings, that stable features carry most functional signal while unstable features reflect basis ambiguity in reproducible subspaces, have broad implications for the rapidly growing field of AI interpretability. The methodological rigor (large-scale study across seeds, models, layers, plus synthetic verification) and practical contribution (constructing more stable SAEs) give it wide applicability. Paper 1, while solid, offers incremental improvements to motion generation with a narrower application domain.
Paper 2 likely has higher scientific impact: it addresses a foundational, widely relevant issue in mechanistic interpretability—seed dependence and reproducibility of SAE features—using a scalable per-feature stability metric, extensive cross-condition experiments, geometric/subspace framing, and a synthetic model to establish mechanism. Its findings inform how SAEs should be evaluated, compared, and aggregated across runs, influencing interpretability, representation learning, and reliability/benchmarking practices across many labs. Paper 1 is practically valuable for coding agents, but is more application-specific and may generalize less broadly than Paper 2’s conceptual and methodological contribution.
Paper 1 addresses a fundamental challenge in latent reasoning for LLMs—making hidden-state recurrence compatible with on-policy RL and interpretable. This has broad implications for efficient reasoning in language models, a highly active research area. The SWITCH framework offers practical advances (RL-trainability, mechanistic interpretability) with clear real-world applications in deploying efficient reasoning models. Paper 2, while rigorous and insightful regarding SAE feature stability, addresses a more niche interpretability methodology question. Paper 1's novelty in bridging latent reasoning with standard RL training and its timeliness in the reasoning-model era give it higher potential impact.
Post-training for reasoning is currently a central focus in advancing LLMs. Paper 1 provides highly timely, mechanistic insights into how reinforcement learning enhances reasoning, offering actionable interventions for scaling model capabilities. While Paper 2 offers valuable insights into mechanistic interpretability, Paper 1 has broader and more immediate implications for advancing state-of-the-art AI performance and real-world applications.
While Paper 1 presents a highly innovative application of diffusion models to neuroscience, Paper 2 addresses a fundamental, timely bottleneck in AI mechanistic interpretability. By analyzing seed dependence and feature stability in Sparse Autoencoders (SAEs), Paper 2 provides crucial theoretical and empirical insights for understanding neural network representations. This foundational work in AI safety and interpretability will likely have a broader and more immediate impact on how researchers analyze and align large language models, making its methodological contributions highly influential for the rapidly moving AI research community.
While Paper 1 offers valuable theoretical insights into AI interpretability, Paper 2 demonstrates profound real-world scientific impact by applying generative AI to a critical global health crisis (mosquito-borne diseases). Notably, Paper 2 goes beyond computational predictions by synthesizing and experimentally validating the generated compounds in the wet lab, achieving a 78% hit rate. This rigorous end-to-end validation bridging computational design and biological efficacy gives it exceptionally high potential for immediate, cross-disciplinary scientific impact.
Paper 2 addresses a critical bottleneck in drug discovery, offering substantial improvements in sample efficiency and a 10x speedup over existing methods. Its application directly impacts healthcare, pharmaceuticals, and computational biology, providing immense real-world value. While Paper 1 makes strong contributions to the important niche of AI interpretability, Paper 2's cross-disciplinary utility and potential to accelerate life-saving therapeutics give it a broader and more immediate scientific and societal impact.
Paper 2 addresses a critical, fundamental bottleneck in mechanistic interpretability: the reproducibility and reliability of Sparse Autoencoders (SAEs). By theoretically and empirically explaining seed dependence and feature stability, it fundamentally shifts how researchers interpret neural network representations. While Paper 1 offers a valuable algorithmic efficiency improvement for agentic RL, Paper 2 has broader implications for AI safety, alignment, and our foundational understanding of model interpretability tools.
K-Forcing addresses a critical bottleneck in LLM inference—sequential decoding speed—which is highly relevant to industrial-scale deployment. The 2.4-3.5x speedup for batch serving fills an important gap not addressed by speculative decoding. Its compatibility with existing AR infrastructure increases adoption potential. While Paper 2 provides valuable insights into SAE feature stability and interpretability, it primarily deepens understanding of an existing tool rather than enabling new capabilities. K-Forcing's direct applicability to the massive and growing LLM serving ecosystem gives it broader practical impact.
Paper 1 addresses a fundamental methodological concern about sparse autoencoders in mechanistic interpretability—whether learned features are reproducible—with rigorous large-scale empirical analysis and theoretical grounding via synthetic models. It provides actionable insights (stable vs. unstable features, subspace reproducibility, basis ambiguity) that impact the entire field of neural network interpretability. Paper 2 presents a useful but more incremental contribution to AI control/monitoring with a narrower scope. Paper 1's breadth of impact across interpretability research, its methodological depth, and its relevance to the rapidly growing SAE literature give it higher potential impact.