
Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

Evan Duan

cs.LG · cs.AI
Rank: #4155 of 5669 · cs.LG
Tournament Score: 1339 ± 43
Win Rate: 42% (8 wins, 11 losses, 19 matches)
Rating: 6/10
Significance 5.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

Abstract

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a pre-intervention screening framework for predicting side effects when steering language models via SAE features. The key insight is that cheap, intervention-free statistics—decoder geometry, activation distributions, co-activation structure, and direct-logit footprint—can forecast two axes of "steering modularity": (1) effect stability (cross-context consistency of the steering effect) and (2) collateral spread (unintended perturbation of unrelated features). The framework is evaluated across four models (GPT-2-small, Pythia-70M, Gemma-2-2B, Llama-3.1-8B) and three SAE dictionary families (ReLU, JumpReLU, TopK).
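
As an illustration of what an intervention-free decoder-geometry statistic could look like, the sketch below computes, for each feature, its largest cosine similarity to any other decoder direction. This particular statistic, the function name `max_decoder_cosine`, and the toy dictionary are assumptions made for illustration; the assessment does not specify the paper's exact predictors.

```python
import math


def max_decoder_cosine(decoder):
    """For each decoder row, return its largest cosine similarity to any other row.

    `decoder` is a list of feature directions (one list of floats per feature).
    A high value means the feature has a close neighbour in the dictionary.
    """
    norms = [math.sqrt(sum(x * x for x in row)) for row in decoder]
    result = []
    for i, row_i in enumerate(decoder):
        best = -1.0
        for j, row_j in enumerate(decoder):
            if i == j:
                continue
            dot = sum(a * b for a, b in zip(row_i, row_j))
            best = max(best, dot / (norms[i] * norms[j]))
        result.append(best)
    return result


# Toy 3-feature dictionary in a 2-dimensional residual stream (made-up values).
toy_decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
crowding = max_decoder_cosine(toy_decoder)
print(crowding)
```

The pairwise loop is quadratic in the number of features, which is fine for a toy example; a real dictionary with tens of thousands of features would call for a vectorized or approximate nearest-neighbour computation.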

The practical contribution is a feature-selection method that can identify cleaner-steering features without running expensive per-feature intervention sweeps. The conceptual contribution is the distinction between method-level transfer (the general screening procedure works across models) and mechanism-level transfer (the specific dominant predictor does not generalize).

Methodological Rigor

The experimental design is commendably thorough in several respects:

  • Multiple baselines: The paper consistently compares against frequency-only and activation-magnitude-only baselines, and introduces residualization against magnitude-related confounds to check whether predictors merely proxy for intervention strength.
  • Cross-validation: Five-fold CV with Spearman rank correlations provides appropriate evaluation for the regression-based predictive task.
  • Held-out screening: The separation of feature selection (training set) from evaluation (fresh contexts from Wikitext validation) provides a genuine out-of-sample test.
  • Multiple-comparison corrections: Holm and Benjamini-Hochberg adjustments are reported alongside raw p-values, with honest acknowledgment of which effects survive correction.
  • Effect-size balance checks: The paper carefully examines whether clean-messy group differences could be explained by realized intervention magnitude.

However, there are notable limitations. The sample sizes are relatively small: 300 features per model for prediction, and 25-32 features per screening group. The single intervention strength (α=1.0) is limiting, though a GPT-2 sweep over {0.5, 1.0, 2.0} partially addresses this. The collateral measurement relies on a downstream SAE proxy, which introduces its own reconstruction artifacts. Model and SAE family are confounded (not crossed), so it's impossible to isolate whether differences stem from model architecture or dictionary type. The paper is admirably transparent about all these limitations.
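
The residualization check and the Spearman evaluation described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes a single magnitude confound removed with a one-variable least-squares fit, and all numbers and names (`magnitude`, `predictor`, `stability`) are made up.

```python
def ranks(values):
    """Average 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def residualize(y, confound):
    """Residuals of y after a one-variable least-squares fit on `confound`."""
    n = len(y)
    mc, my = sum(confound) / n, sum(y) / n
    slope = (sum((c - mc) * (v - my) for c, v in zip(confound, y))
             / sum((c - mc) ** 2 for c in confound))
    return [v - (my + slope * (c - mc)) for c, v in zip(confound, y)]


# Made-up per-feature values: a magnitude confound, a candidate predictor,
# and a measured stability score.
magnitude = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
predictor = [0.9, 2.3, 2.8, 4.4, 4.7, 6.5]
stability = [0.2, 0.5, 0.4, 0.8, 0.7, 0.95]

raw_rho = spearman(predictor, stability)
partial_rho = spearman(residualize(predictor, magnitude),
                       residualize(stability, magnitude))
print(raw_rho, partial_rho)
```

If `partial_rho` stays clearly away from zero, the predictor carries information beyond what the magnitude confound already explains, which is the question the residualization check is meant to answer.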

    Potential Impact

    Near-term practical utility: For practitioners using SAE steering for model control or safety interventions, this framework provides a concrete tool: compute cheap statistics, rank features, and select those predicted to steer cleanly. This could reduce the cost of finding well-behaved steering features by avoiding exhaustive intervention sweeps.

    Interpretability research: The finding that different predictor signatures dominate in different model/SAE settings provides diagnostic information about the geometric and statistical structure of SAE dictionaries. The dictionary-level diagnostics (Table C1) connecting crowding, norm dispersion, and coactivation density to screening success could inform SAE training objectives.

    Limitations on impact: The practical screening payoff is moderate and inconsistent—only GPT-2 shows improvement on both axes, while other models improve on only one axis or partially. The model-dependency of the dominant predictor means practitioners cannot simply apply a universal recipe. The paper does not demonstrate integration into an end-to-end steering pipeline where cleaner feature selection leads to measurably better downstream task performance.

    Timeliness & Relevance

    This paper is well-timed. SAE-based steering is an active area in mechanistic interpretability, with recent work (Arad et al., 2025; O'Brien et al., 2024) highlighting that feature selection matters for steering quality and that steering can degrade unrelated capabilities. The problem of steering reliability is a genuine bottleneck for safety applications. The paper leverages recently released infrastructure (Gemma Scope, Llama Scope) to provide cross-model evidence that would not have been possible a year earlier.

    The work also addresses an underexplored gap: most steering work characterizes failures *post hoc*, while this paper asks whether failures can be predicted *a priori*. This shift from reactive to proactive evaluation is valuable for the field.

    Strengths

    1. Honest calibration of claims: The paper is unusually careful about stating what works where and what doesn't. The explicit method-vs-mechanism distinction is a mature contribution.

    2. Comprehensive cross-model evaluation: Four models × multiple SAE families × multiple predictor sets × predictive + screening evaluations provide a broad evidence base.

    3. Mechanistic motivation: The local-linear approximation (Section 3.6) provides clean intuition for why decoder geometry and direct-logit statistics should predict collateral spread.

    4. Reproducibility: Code is provided, configurations are fully specified, and the extensive appendix (Tables A1-G4) enables detailed scrutiny.

    5. Robustness checks: Residualization, multiple-comparison correction, and effect-size balance checks go beyond standard practice.
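
For readers unfamiliar with the two corrections mentioned under robustness, the sketch below computes Holm and Benjamini-Hochberg adjusted p-values from scratch. The input p-values are made up and the function names are illustrative; nothing here reproduces the paper's numbers.

```python
def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1).
        candidate = min(1.0, (m - rank) * pvals[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        idx = order[rank]
        # The k-th smallest p-value is multiplied by m / k.
        candidate = min(1.0, pvals[idx] * m / (rank + 1))
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[idx] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.20]  # made-up raw p-values
holm_p = holm(raw_p)
bh_p = benjamini_hochberg(raw_p)
print(holm_p)
print(bh_p)
```

Holm controls the family-wise error rate and is the stricter of the two; Benjamini-Hochberg controls the false discovery rate, so an effect can survive it while failing Holm.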

    Limitations

    1. Scale of screening evaluation: 25-32 features per group is small; statistical power for detecting moderate effect sizes is limited, and some null results may reflect insufficient power rather than absence of signal.

    2. Limited intervention regime: A single α=1.0 (with only a brief GPT-2 sweep) leaves open whether the screening framework generalizes across intervention strengths relevant to practical use.

    3. No task-level evaluation: The paper measures modularity via cosine similarity and downstream feature counts but does not connect these to behavioral outcomes (e.g., whether "cleaner" features actually produce more targeted behavioral changes on specific tasks).

    4. Model-SAE confounding: The four settings vary simultaneously in model size, architecture, and SAE family, making it impossible to attribute differences cleanly.

    5. Modest effect sizes: The screening improvements, while statistically significant in some settings, are relatively small in absolute terms (e.g., stability differences of 0.01-0.08).
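
To put the power concern in point 1 in rough numbers, the sketch below estimates the smallest standardized mean difference (Cohen's d) detectable with 80% power at 25 and 32 features per group. It assumes a two-sided two-sample comparison with equal group sizes and uses a normal approximation; the paper's actual tests may differ, so the figures are indicative only.

```python
from statistics import NormalDist


def min_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Approximate minimum detectable Cohen's d for two equal-sized groups.

    Normal approximation: d = (z_{1 - alpha/2} + z_{power}) * sqrt(2 / n).
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * (2 / n_per_group) ** 0.5


d_25 = min_detectable_d(25)  # roughly 0.79
d_32 = min_detectable_d(32)  # roughly 0.70
print(round(d_25, 2), round(d_32, 2))
```

Under these assumptions, only fairly large standardized differences are reliably detectable at these group sizes, which is consistent with the caution that some null results may reflect limited power.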

    Overall Assessment

    This is a solid, well-executed empirical contribution that opens a new research direction—predicting steering side effects before intervention. The careful cross-model evaluation and honest reporting of heterogeneous results add credibility. The practical payoff is currently moderate and setting-dependent, but the conceptual framework and methodology are valuable foundations for future work. The paper's impact will likely be strongest within the mechanistic interpretability and AI safety communities, where SAE steering is actively being developed for model control.

    Rating: 6/10
    Significance 5.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

    Generated Jun 9, 2026

    Comparison History (19)

    Won vs. Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

    Paper 1 addresses a highly timely and critical problem in AI safety and interpretability: controlling the steering side effects of Large Language Models using Sparse Autoencoders. Its empirical breadth across major models (Llama-3, Gemma-2, etc.) and direct applicability to LLM alignment gives it broader immediate relevance and higher potential impact compared to Paper 2, which offers a solid but more specialized theoretical contribution to contextual bandit algorithms.

    gemini-3.1-pro-preview · Jun 9, 2026

    Won vs. BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

    Paper 2 addresses a critical and timely scientific problem in mechanistic interpretability and AI safety by analyzing and predicting the side effects of SAE feature steering across state-of-the-art language models. This contributes fundamental insights into model representations. In contrast, Paper 1 presents a highly useful engineering and workflow tool for manipulating model weights. While valuable for reproducibility, Paper 2's direct theoretical and empirical contributions to understanding LLM behavior offer higher potential for broad scientific impact.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

    Paper 2 achieves a rate-optimal result for contextual queueing bandits, improving the regret rate from O(T^{-1/4}) to O(T^{-1/2}) and proving a matching minimax lower bound. This fully characterizes the fundamental complexity of the problem, which is a clean and definitive theoretical contribution with broad implications for operations research, scheduling, and online learning theory. Paper 1 addresses an important practical concern (SAE steering side effects) but its contributions are more empirical and incremental, with findings that are model- and setting-dependent, limiting generalizability. Paper 2's tight theoretical result has more lasting and broadly applicable scientific impact.

    claude-opus-4-6 · Jun 9, 2026

    Lost vs. Physically Consistent Null Space Alignment for Detection of Low-Magnitude False Data Injection Attacks

    Paper 2 addresses a critical cybersecurity problem in power systems with a principled, physically-grounded solution (PCNSA) that has immediate real-world applications in protecting critical infrastructure. It offers a novel theoretical contribution (proving that conventional preprocessing violates subspace alignment) with rigorous validation across standard benchmarks. Paper 1, while technically interesting for the mechanistic interpretability community, addresses a narrower problem (predicting SAE steering side effects) with model-dependent results and limited generalizability, serving primarily the AI safety/interpretability niche rather than broader scientific or engineering communities.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

    Paper 2 likely has higher scientific impact due to timeliness and broad relevance to mechanistic interpretability and controllability of modern LLMs. It addresses a practical bottleneck—unpredictable side effects of SAE feature steering—by proposing a general pre-intervention screening framework validated across multiple model families and SAE variants, with methodological controls (held-out screening, residualization, dictionary-width comparison). The resulting tooling could improve safety/alignment workflows and steerability across many applications. Paper 1 is novel and useful for geolocalizing time series, but its impact is more domain-specific (e.g., power systems) and less cross-cutting across fields.

    gpt-5.2 · Jun 9, 2026

    Won vs. EinSort: Sorting is All We Need for Tensorizing LLM

    Paper 2 addresses a critical bottleneck in mechanistic interpretability—predicting and mitigating side effects in feature steering. Its comprehensive evaluation across multiple state-of-the-art architectures and SAE variants provides foundational insights crucial for AI alignment and safety. While Paper 1 offers a valuable practical compression technique, Paper 2 provides a novel theoretical framework with profound implications for understanding and safely controlling large language models.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

    Paper 1 presents a foundational advance in representation learning that spans multiple modalities (images, 3D shapes, climate data), offering massive efficiency gains (42x less memory) over existing meta-learning approaches. Its general-purpose nature ensures broad applicability across various scientific and engineering disciplines. In contrast, while Paper 2 is highly relevant to the timely subfield of LLM interpretability and AI safety, its scope is narrower and its findings are highly dependent on specific model and dictionary settings.

    gemini-3.1-pro-preview · Jun 9, 2026

    Lost vs. A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

    Paper 2 has higher likely impact due to broader real-world applicability and cross-domain relevance: it introduces a general, rigorously controlled audit framework for aperiodic (1/f) reliance that affects interpretability and clinical validity across EEG and ECG tasks, architectures, and even foundation models. Its intervention + sham + simulation methodology strengthens causal claims and suggests an actionable standard practice for biomedical ML. Paper 1 is timely and novel for mechanistic interpretability, but its applications are narrower (SAE steering) and findings are more model/dictionary dependent, limiting immediate generalization.

    gpt-5.2 · Jun 9, 2026

    Lost vs. Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway

    Paper 2 has higher potential impact due to a clearer theoretical contribution with broad relevance: it revises established gradient-flow predictions by proving qualitatively different behavior under practical discrete GD with large step size (edge-of-stability regime). The results connect sharpness, step size, depth, and pathway symmetry—concepts central across deep learning theory and optimization—and may influence how researchers interpret specialization, stability, and representation sharing in overparameterized models. Paper 1 is timely and useful for mechanistic interpretability workflows, but its effects are model/dictionary-dependent and more narrowly scoped to SAE steering practice.

    gpt-5.2 · Jun 9, 2026

    Lost vs. Learning to Route LLMs from Implicit Cost-Performance Preferences via Meta-Learning

    Paper 2 addresses a broadly applicable problem (LLM routing with cost-performance trade-offs) relevant to the rapidly growing LLM deployment ecosystem, with a novel meta-learning formulation that handles heterogeneous user preferences. Its practical applicability is immediate and spans many real-world settings. Paper 1, while technically rigorous and addressing an important interpretability question about SAE steering side effects, is narrower in scope—focused on a specific mechanistic interpretability technique—and its findings are model/dictionary-dependent, limiting generalizability. Paper 2's broader audience and direct practical utility give it higher potential impact.

    claude-opus-4-6 · Jun 9, 2026