Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro

#18 of 2821 · Artificial Intelligence
Tournament Score: 1595±46
Win Rate: 96% (27 wins, 1 loss, 28 matches)
Rating: 9/10

Abstract

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"

1. Core Contribution

This paper addresses a central open question in mechanistic interpretability: whether sparse autoencoders (SAEs), previously demonstrated only on small toy models, can extract interpretable features from production-scale language models. The authors train SAEs with up to 34 million features on the middle-layer residual stream of Claude 3 Sonnet and demonstrate that the resulting features are genuinely interpretable, multilingual, multimodal (generalizing to images despite text-only training), abstract, and causally influential on model behavior.

The key novelty is not the SAE method itself, but the demonstration that it scales — and the rich characterization of what emerges at scale. The paper bridges the gap between toy mechanistic interpretability and practical application to frontier models, which had been a major credibility concern for the field.

2. Methodological Rigor

The methodology is thorough, and the authors are honest about its limitations. Several aspects stand out:

Scaling laws analysis: The authors apply neural scaling laws to SAE training itself, treating dictionary learning as a standard ML optimization problem. This is methodologically sound and practically useful — they demonstrate power-law relationships between compute budget and optimal feature count/training steps, enabling principled hyperparameter selection for large runs.
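As an illustration of the kind of fit described above, here is a minimal sketch of estimating a power law between compute budget and an optimal hyperparameter by linear regression in log-log space. The function names, the data points, and the exponent are all made up for this example; none of them are taken from the paper.

```python
import numpy as np

def fit_power_law(compute, optimum):
    """Fit optimum ~= a * compute**b by least squares in log-log space."""
    log_c = np.log(np.asarray(compute, dtype=float))
    log_y = np.log(np.asarray(optimum, dtype=float))
    # A straight line in log-log space: log y = b * log c + log a
    b, log_a = np.polyfit(log_c, log_y, deg=1)
    return float(np.exp(log_a)), float(b)

# Synthetic data for illustration only (NOT values from the paper):
# pretend the best feature count at each budget follows 2 * C**0.4.
compute = np.array([1e15, 1e16, 1e17, 1e18])
n_features = 2.0 * compute ** 0.4

a, b = fit_power_law(compute, n_features)

def predict(compute_budget):
    """Extrapolate the fitted law to a new (hypothetical) compute budget."""
    return a * compute_budget ** b
```

In practice the measured optima would be noisy, so the fitted exponent would carry uncertainty that this sketch does not estimate.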

Multi-faceted feature validation: Features are validated through (a) automated specificity scoring using Claude 3 Opus, (b) causal steering experiments showing features influence behavior consistent with their interpretations, (c) attribution and ablation analyses demonstrating computational relevance, and (d) comparison against neuron-level baselines showing SAE features are substantially more interpretable.

Honest limitations: The authors are commendably transparent. They acknowledge that 65% of features in the 34M SAE are dead, the L1 objective is only a proxy for interpretability, cross-layer superposition remains unsolved, and they lack rigorous methods for evaluating faithfulness. The inability to evaluate ground-truth interpretability is presented as a fundamental challenge rather than glossed over.
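To make the terms "L1 objective" and "dead features" concrete, the following is a minimal NumPy sketch of a one-hidden-layer sparse autoencoder with a ReLU encoder, a reconstruction loss, and an L1 sparsity penalty weighted by decoder column norms. The toy sizes, random weights, and the penalty coefficient are assumptions chosen for illustration and should not be read as the authors' implementation or settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64  # toy sizes, far smaller than any real run

# Randomly initialized (untrained) parameters, for illustration only.
W_enc = rng.normal(scale=0.1, size=(n_features, d_model))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(d_model, n_features))
b_dec = np.zeros(d_model)

def encode(x):
    """Feature activations: ReLU of an affine map of the input activations."""
    return np.maximum(0.0, x @ W_enc.T + b_enc)

def decode(f):
    """Reconstruct the input as a linear combination of decoder columns."""
    return f @ W_dec.T + b_dec

def sae_loss(x, l1_coeff=5.0):
    """Mean of squared reconstruction error plus a weighted L1 penalty.

    l1_coeff is a placeholder value, not a recommended setting.
    """
    f = encode(x)
    recon = np.sum((x - decode(f)) ** 2, axis=-1)
    dec_norms = np.linalg.norm(W_dec, axis=0)  # one norm per feature
    sparsity = np.sum(f * dec_norms, axis=-1)
    return float(np.mean(recon + l1_coeff * sparsity))

def dead_fraction(x):
    """Fraction of features that never activate on any row of x."""
    f = encode(x)
    return float(np.mean(~np.any(f > 0, axis=0)))

x = rng.normal(size=(1000, d_model))  # stand-in for residual-stream samples
```

The sparsity term rewards few active features per input, which is only a stand-in for interpretability; that is the sense in which the objective is a proxy.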

Weaknesses in rigor: The feature selection for detailed analysis is necessarily cherry-picked — the authors acknowledge this but the paper would benefit from more systematic sampling. The automated interpretability rubric, while useful, relies on another LLM (Claude 3 Opus) for evaluation, creating a circularity concern. The comparison to few-shot probe-based steering (Appendix D.2.1) is limited to 7 examples and acknowledged as non-systematic.

3. Potential Impact

Mechanistic interpretability: This paper effectively de-risks the SAE approach for the interpretability community. By showing that interpretable features emerge at production scale, it validates years of theoretical work on superposition and linear representation hypotheses. The scaling laws for SAEs are a practical contribution enabling future work.

AI Safety: The identification of safety-relevant features (deception, power-seeking, sycophancy, bias, dangerous content) and demonstration that they causally influence model behavior is potentially transformative for AI safety. The deception case study — where clamping an "internal conflict" feature causes a model to stop complying with an impossible "forget" instruction — is particularly compelling as a proof of concept.

Model steering: Feature-based steering at this scale opens practical applications for behavior modification, red-teaming, and safety interventions. The Golden Gate Bridge demonstration became a cultural touchstone in the AI community.
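One way feature-based steering can be sketched: decompose an activation vector into the SAE reconstruction plus a residual error, overwrite a single feature's activation with a chosen value, decode, and add the unchanged error back. The code below is a toy NumPy illustration under that assumption, with random weights and hypothetical names (`clamp_feature`, `index`, `value`); it is not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 8, 32  # toy sizes

# Random stand-ins for trained SAE parameters.
W_enc = rng.normal(size=(n_features, d_model))
b_enc = rng.normal(size=n_features)
W_dec = rng.normal(size=(d_model, n_features))
b_dec = rng.normal(size=d_model)

def encode(x):
    return np.maximum(0.0, W_enc @ x + b_enc)

def decode(f):
    return W_dec @ f + b_dec

def clamp_feature(x, index, value):
    """Return x with one SAE feature forced to `value`.

    The reconstruction error is kept unchanged, so only the chosen
    feature's contribution to the activation vector is altered.
    """
    f = encode(x)
    error = x - decode(f)        # part of x the SAE does not explain
    f_clamped = f.copy()
    f_clamped[index] = value     # overwrite a single feature activation
    return decode(f_clamped) + error

x = rng.normal(size=d_model)     # stand-in for one residual-stream vector
steered = clamp_feature(x, index=3, value=10.0)
```

Because the decoder is linear, this is equivalent to adding a multiple of the feature's decoder direction to `x`, which is why clamping a feature to its current value leaves the activation unchanged.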

Broader field influence: This work catalyzed an explosion of SAE research across multiple labs (Google DeepMind, OpenAI, independent researchers), effectively establishing dictionary learning on production models as a mainstream research direction.

4. Timeliness & Relevance

This paper arrived at a critical juncture. As frontier models became increasingly capable, the gap between interpretability research (focused on small models) and actual deployment raised serious concerns about whether interpretability could ever be practically relevant for safety. By demonstrating scalability on a production model, this work directly addressed the most pressing bottleneck in the field.

The multimodal generalization finding — features trained on text activating meaningfully on images — is particularly timely given the proliferation of multimodal models, and suggests that interpretability methods may generalize better than expected across modalities.

5. Strengths & Limitations

Key strengths:

  • Scale and ambition: 34M features on a production model is a massive leap from prior work on 1-layer transformers.
  • Feature sophistication: The code error feature (detecting diverse bugs across languages), function-tracking features (handling composition), and abstract safety features demonstrate that SAEs capture genuinely complex model representations, not just surface statistics.
  • Feature completeness analysis: The systematic study of how concept frequency relates to dictionary size needed is a novel and practically important finding (roughly: features appear when alive feature count exceeds inverse frequency of the concept).
  • Concrete-abstract generalization: Features responding to both specific code vulnerabilities and abstract discussions of security is a strong signal for safety applicability.
  • Geometric structure: Feature neighborhoods showing semantic clustering provides evidence that the learned features reflect meaningful model structure.
Notable limitations:

  • Single layer: Analysis is restricted to the middle-layer residual stream, missing potentially important features in other layers.
  • Incomplete coverage: Even the 34M SAE captures only ~60% of London boroughs, suggesting orders-of-magnitude more features exist.
  • Evaluation gap: The fundamental inability to evaluate whether features faithfully capture model computations is acknowledged but unresolved. The loss function remains a proxy.
  • Compute cost: The authors note that extracting "all features" across all layers could require more compute than training the model itself, raising scalability concerns.
  • Shrinkage and dead features: 65% dead features in the 34M SAE suggests significant room for methodological improvement.
  • Proprietary model: The work cannot be fully reproduced because the underlying model is proprietary.
Additional Observations

    The paper's influence extends beyond its technical contributions. It established a new standard for interpretability publications — combining detailed feature case studies, systematic analyses, interactive visualizations, and honest discussion of limitations. The "Golden Gate Claude" demonstration became one of the most widely discussed AI results of 2024, bringing mechanistic interpretability to mainstream attention.

    The feature completeness analysis revealing a sigmoid relationship between concept frequency and feature presence (rescaled by alive feature count) provides the field's first quantitative handle on "how many features do we need?" — a question of fundamental importance.
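The relationship described above can be illustrated with a toy model in which the probability that a concept has a dedicated feature is a sigmoid of the log of (concept frequency × alive feature count), crossing 50% where the alive feature count equals the inverse frequency. The functional form, the `slope` parameter, and the example numbers are assumptions for illustration, not fitted values from the paper.

```python
import math

def presence_probability(concept_freq, n_alive, slope=1.0):
    """Toy sigmoid model of whether a concept gets its own feature.

    concept_freq: fraction of tokens on which the concept occurs
    n_alive:      number of alive features in the dictionary
    slope:        hypothetical steepness parameter (not from the paper)
    """
    z = slope * math.log10(concept_freq * n_alive)
    return 1.0 / (1.0 + math.exp(-z))

def alive_features_needed(concept_freq):
    """Rule-of-thumb threshold: alive feature count ~ 1 / frequency."""
    return 1.0 / concept_freq

# Made-up example: a concept seen once per million tokens.
for n_alive in (1e4, 1e6, 1e8):
    p = presence_probability(1e-6, n_alive)
    print(f"alive features = {n_alive:.0e}: p(feature present) = {p:.2f}")
```

Under this toy model, rarer concepts require proportionally larger dictionaries before a dedicated feature becomes likely.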

    Rating: 9/10
    Significance 9.5 · Rigor 7.5 · Novelty 8.5 · Clarity 9

    Generated May 29, 2026

    Comparison History (28)

    vs. A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher near-term scientific impact: it introduces a scalable signal-language foundation model trained on a very large clinical dataset, rigorously validated across nine external cohorts (~1.5M ECGs) and 89 tasks, with clear metrics and demonstrated generalization and data efficiency. Its applications (broad cardiovascular assessment and opportunistic screening) are directly deployable and clinically consequential. Paper 1 is novel and timely for mechanistic interpretability and AI safety, but acknowledges major evaluation limitations and has a less immediate real-world deployment pathway; its impact may be longer-term and narrower in practice.

    vs. ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure
    claude-opus-4.6 · 5/29/2026

    Paper 2 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability/mechanistic interpretability, demonstrating that sparse autoencoders scale to production-level models. It has profound implications for AI safety (identifying deception, power-seeking features), model steering, and understanding neural network internals. Its breadth of impact spans AI safety, interpretability, neuroscience-inspired ML, and AI governance. Paper 1 introduces a useful benchmark for evaluating LLM scientific reasoning but is more incremental—benchmarks are valuable but narrower in scope. Paper 2's findings are more foundational and have already catalyzed significant follow-up research across the field.

    vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
    claude-opus-4.6 · 5/29/2026

    Paper 2 (Scaling Monosemanticity) addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—with broad implications for understanding, controlling, and aligning large language models. Its discovery of safety-relevant features (deception, power-seeking) and demonstration of causal steering has transformative implications across AI safety, interpretability, and governance. While Paper 1 (AutoScientists) presents strong engineering contributions to AI-driven scientific experimentation with impressive benchmarks, Paper 2 opens a new research paradigm with deeper theoretical significance and wider cross-disciplinary impact.

    vs. Governing Technical Debt in Agentic AI Systems
    gpt-5.2 · 5/29/2026

    Paper 2 has higher likely scientific impact: it demonstrates a scalable, technically novel interpretability method (large sparse autoencoders with scaling-law guidance) on a frontier production model, with compelling evidence of multilingual/multimodal generalization and causal feature steering—including safety-relevant features. This advances mechanistic interpretability and AI safety with broad cross-field relevance and timely importance. Paper 1 introduces useful governance concepts (Agentic Technical Debt, Stochastic Tax) with practical managerial framing, but appears more conceptual and less methodologically rigorous or empirically validated, making its academic impact potentially narrower and less durable.

    vs. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders
    claude-opus-4.6 · 5/29/2026

    Paper 1 ('Scaling Monosemanticity') is a landmark contribution that demonstrates dictionary learning scales to production-level LLMs, introduces foundational methodology (sparse autoencoders at 34M features), discovers multilingual/multimodal features, identifies safety-relevant features (deception, power-seeking), and establishes scaling laws for interpretability. It opened an entirely new research direction in mechanistic interpretability. Paper 2 applies SAE-based auditing to biosecurity refusal in small models but is preliminary, narrow in scope (one hackathon weekend, consumer hardware, limited models), and builds directly on the foundations Paper 1 established. Paper 1's breadth, rigor, and field-defining nature give it vastly greater impact.

    vs. Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure
    claude-opus-4.6 · 5/29/2026

    Paper 1 (Scaling Monosemanticity) demonstrates a fundamental breakthrough in mechanistic interpretability by scaling sparse autoencoders to production-level LLMs, revealing multilingual/multimodal features and safety-relevant concepts (deception, power-seeking) with causal influence on model behavior. This has enormous breadth of impact across AI safety, interpretability, and alignment research. Paper 2 identifies an important failure mode in reward model debiasing, which is valuable but narrower in scope. Paper 1's results reshape how the field approaches understanding large models and have already catalyzed significant follow-up research across multiple subfields.

    vs. Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration
    gpt-5.2 · 5/29/2026

    Paper 1 has higher potential impact due to strong novelty (scaling sparse autoencoders to production-scale LMs with tens of millions of features), broad applicability to mechanistic interpretability, alignment/safety, and controllability, plus timeliness for frontier-model governance. It demonstrates causal steering and identifies harm-relevant features, suggesting downstream tools for auditing and mitigation across AI research and deployment. Paper 2 is useful but largely descriptive/observational with limited methodological innovation and narrower domain scope; its findings depend on registry text quality and small labeled samples, reducing rigor and generalizability.

    vs. MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
    claude-opus-4.6 · 5/29/2026

    Paper 2 (Scaling Monosemanticity) has significantly higher scientific impact. It addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—and demonstrates that sparse autoencoders can extract meaningful, causally relevant features from frontier LLMs. The discovery of multimodal, multilingual features and safety-relevant features (deception, power-seeking) has broad implications for AI alignment, model understanding, and governance. Paper 1 contributes a useful benchmark and dataset for multi-agent LLM evaluation, but is more incremental in scope. Paper 2 has already catalyzed substantial follow-up research across the interpretability field.

    vs. Robust and Efficient Guardrails with Latent Reasoning
    gemini-3.1 · 5/29/2026

    Paper 1 represents a foundational breakthrough in mechanistic interpretability, proving for the first time that dictionary learning and sparse autoencoders can scale to state-of-the-art, production-level LLMs. Its discovery of interpretable, steerable features for abstract and safety-relevant concepts has massive implications for understanding black-box AI models. Paper 2 offers a highly practical but more incremental methodological improvement for safety guardrail efficiency. Thus, Paper 1 has significantly broader and deeper scientific impact.

    vs. GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation
    gemini-3.1 · 5/29/2026

    Paper 1 addresses a fundamental problem in AI—mechanistic interpretability of production-scale large language models. Its findings on extracting interpretable features related to deception, bias, and abstract concepts have profound implications for AI safety, alignment, and regulation, offering broad impact across the rapidly growing AI field. Paper 2, while methodologically sound, focuses on a much narrower application in urban transportation and tourist mobility modeling, limiting its broader scientific influence compared to a foundational breakthrough in AI interpretability.

    vs. The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher scientific impact: it advances mechanistic interpretability with a demonstrated scaling of sparse autoencoders to a production-scale frontier model, yielding broadly useful tools (feature discovery and causal steering) with implications for safety, alignment, and basic understanding of deep learning. Its results are timely and relevant across multiple fields (ML interpretability, safety, NLP, multimodal generalization) and show methodological rigor via large-scale training, scaling-law-guided design, and causal interventions. Paper 1 is practically important for agent security architecture but is more application/system-design specific and less broadly generalizable scientifically.

    vs. It's All About Speed: AI's Impact on Workflow in Music Production
    gemini-3.1 · 5/29/2026

    Paper 1 tackles a critical challenge in AI safety and mechanistic interpretability by scaling sparse autoencoders to a production-level LLM. Its ability to identify, interpret, and steer features related to deception, bias, and abstract concepts has massive implications for AI alignment, regulation, and future model development across various domains. Paper 2 provides a valuable, yet relatively niche, ethnographic study on AI in music production. Paper 1's broader applicability, methodological scale, and profound relevance to the globally pressing issue of AI safety give it significantly higher potential scientific impact.

    vs. Why LLMs Fail at Causal Discovery and How Interventional Agents Escape
    claude-opus-4.6 · 5/29/2026

    Paper 1 demonstrates interpretable feature extraction from a production-scale LLM (Claude 3 Sonnet), establishing that mechanistic interpretability scales to frontier models. This has enormous implications for AI safety, alignment, and understanding model internals. The discovery of safety-relevant features (deception, power-seeking, sycophancy) with causal influence on outputs is groundbreaking. Its breadth of impact spans AI safety, interpretability, neuroscience-inspired ML, and governance. Paper 2 makes a solid theoretical contribution on LLM limitations in causal discovery with a practical workaround, but its scope is narrower and its impact more domain-specific.

    vs. GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents
    claude-opus-4.6 · 5/29/2026

    Paper 1 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability, demonstrating that mechanistic interpretability techniques scale to production-level models. It reveals multilingual/multimodal features, identifies safety-relevant features (deception, power-seeking), and enables causal steering of model behavior. Its breadth of impact spans AI safety, interpretability, and fundamental understanding of large language models. Paper 2, while methodologically sound with strong empirical results on agent self-improvement, addresses a narrower problem with more incremental contributions. Paper 1's implications for AI alignment and safety give it substantially broader and more lasting scientific impact.

    vs. MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models
    claude-opus-4.6 · 5/29/2026

    Paper 1 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability, demonstrating for the first time that mechanistic interpretability methods scale to production-level language models. Its discovery of safety-relevant features (deception, power-seeking, sycophancy) with causal influence on model behavior has profound implications for AI alignment and safety. The work has broad impact across interpretability, alignment, and the broader ML community. Paper 2 (MiraBench) is a solid benchmark contribution for robotic world models but addresses a narrower domain with more incremental impact. Paper 1's novelty, breadth, and timeliness for AI safety give it substantially higher impact.

    vs. The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF
    gpt-5.2 · 5/29/2026

    Paper 1 likely has higher scientific impact due to greater novelty and cross-cutting implications: scaling sparse autoencoders to production-scale LLM internals with millions of interpretable features advances mechanistic interpretability, offers a potential toolkit for auditing/steering harmful behaviors, and spans multilingual/multimodal generalization. Its breadth touches safety, interpretability, and model understanding across domains. Paper 2 is timely and practically valuable (new benchmark, inverse scaling in robustness, RL fix), but is more application-narrow (RAG/agent robustness) and less foundational than a scalable interpretability method that could influence many subfields.

    vs. A Versatile AI Agent for Rare Disease Diagnosis and Risk Gene Prioritization
    gpt-5.2 · 5/29/2026

    Paper 2 has higher estimated scientific impact due to its novelty and breadth: scaling sparse autoencoders to production LLMs with tens of millions of features tackles a central open problem in mechanistic interpretability, with cross-domain implications for AI safety, alignment, auditing, and governance. The demonstrated causal steering and identification of harm-related features increases real-world relevance and timeliness. While Paper 1 shows strong applied clinical utility, its impact is narrower to rare-disease diagnostics and depends heavily on deployment, regulation, and dataset generalization. Paper 2 is more likely to influence multiple fields and future foundational methods.

    vs. Anchorless Diversification for Parallel LLM Ideation
    gemini-3.1 · 5/29/2026

    Paper 2 addresses a fundamental challenge in AI: interpreting production-scale LLMs. By successfully scaling sparse autoencoders to Claude 3 Sonnet, it provides critical insights into AI safety, bias, and deception, offering mechanisms to causally steer model behavior. This foundational breakthrough has profound implications for AI alignment, safety, and regulation. Paper 1, while practically useful for improving generation diversity in creative tasks, addresses a much narrower application and lacks the broad, transformative impact of Paper 2 across multiple domains.

    vs. The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure
    gemini-3.1 · 5/29/2026

    Paper 2 represents a major breakthrough in mechanistic interpretability by scaling sparse autoencoders to a production-grade LLM (Claude 3 Sonnet). It opens new avenues for AI safety, model steering, and understanding complex model behaviors. Paper 1 offers a valuable but more narrow behavioral analysis of reasoning models under adversarial dialogue. Paper 2's methodological innovation and broad implications for AI alignment and transparency give it significantly higher potential scientific impact.

    vs. History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher scientific impact due to a more foundational methodological advance: demonstrating that sparse autoencoders/dictionary learning can scale to a production frontier model with tens of millions of features, yielding interpretable, multilingual/multimodal representations and causal steering—including for safety-relevant traits. This creates broadly reusable tools for mechanistic interpretability, model control, and safety across many tasks and research areas. Paper 1 is timely and important for agent safety, but is more narrowly scoped as a behavioral vulnerability study with a specific trigger/prompting regime, likely yielding fewer cross-field downstream methods.