Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro
Abstract
We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.
AI Impact Assessments
Scientific Impact Assessment: "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"
1. Core Contribution
This paper addresses a central open question in mechanistic interpretability: whether sparse autoencoders (SAEs), previously demonstrated only on small transformers, can extract interpretable features from production-scale language models. The authors train SAEs with up to 34 million features on the middle-layer residual stream of Claude 3 Sonnet and show that the resulting features are interpretable, multilingual, multimodal (generalizing to images despite text-only training), abstract, and causally influential on model behavior.
The key novelty is not the SAE method itself, but the demonstration that it scales — and the rich characterization of what emerges at scale. The paper bridges the gap between toy mechanistic interpretability and practical application to frontier models, which had been a major credibility concern for the field.
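To make the object under discussion concrete, here is a minimal sketch of a sparse autoencoder forward pass and training loss. It is an illustration under stated assumptions, not the authors' implementation: the sizes are toy values, the weights are random rather than trained, and the names `W_enc`, `W_dec`, `encode`, `decode`, `sae_loss`, and `l1_coeff` are hypothetical. Weighting each activation by its decoder norm is one common way to write the L1 penalty.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 16, 64   # toy sizes, chosen only for illustration

# Hypothetical parameter names; randomly initialised, not trained.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))
b_dec = np.zeros(d_model)


def encode(x):
    """Feature activations: ReLU of an affine map, so they are non-negative."""
    return np.maximum(0.0, x @ W_enc + b_enc)


def decode(f):
    """Reconstruct the activation as a weighted sum of decoder directions."""
    return f @ W_dec + b_dec


def sae_loss(x, l1_coeff=5.0):
    """Mean reconstruction error plus an L1 sparsity penalty on the features.

    Each activation is weighted by the norm of its decoder row (an assumed,
    commonly used form); l1_coeff is an arbitrary illustrative value.
    """
    f = encode(x)
    x_hat = decode(f)
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    sparsity = np.sum(f * np.linalg.norm(W_dec, axis=1), axis=-1)
    return np.mean(recon + l1_coeff * sparsity)


x = rng.normal(size=(8, d_model))   # stand-in for residual-stream activations
features = encode(x)
loss = sae_loss(x)
```

In this framing, "scaling" means growing `n_features` from the toy 64 used here toward the tens of millions reported in the paper, with the input `x` taken from a real model's residual stream.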
2. Methodological Rigor
The methodology is thorough and candid about its limitations. Several aspects stand out:
Scaling laws analysis: The authors apply neural scaling laws to SAE training itself, treating dictionary learning as a standard ML optimization problem. This is methodologically sound and practically useful — they demonstrate power-law relationships between compute budget and optimal feature count/training steps, enabling principled hyperparameter selection for large runs.
Multi-faceted feature validation: Features are validated through (a) automated specificity scoring using Claude 3 Opus, (b) causal steering experiments showing features influence behavior consistent with their interpretations, (c) attribution and ablation analyses demonstrating computational relevance, and (d) comparison against neuron-level baselines showing SAE features are substantially more interpretable.
Honest limitations: The authors are commendably transparent. They acknowledge that 65% of features in the 34M SAE are dead, that the L1 objective is only a proxy for interpretability, that cross-layer superposition remains unsolved, and that they lack rigorous methods for evaluating faithfulness. The inability to evaluate ground-truth interpretability is presented as a fundamental challenge rather than glossed over.
Weaknesses in rigor: The features selected for detailed analysis are necessarily cherry-picked; the authors acknowledge this, but the paper would benefit from more systematic sampling. The automated interpretability rubric, while useful, relies on another LLM (Claude 3 Opus) for evaluation, which raises a circularity concern. The comparison to few-shot probe-based steering (Appendix D.2.1) is limited to 7 examples and is acknowledged as non-systematic.
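The power-law relationship mentioned under the scaling laws analysis above can be illustrated with a short sketch. A power law is a straight line in log-log space, so its exponent can be estimated with an ordinary linear fit. The data below are synthetic and generated from invented constants (`true_a`, `true_b`); nothing here reproduces the paper's measurements, and `predict_optimal_features` is a hypothetical helper.

```python
import numpy as np

# Synthetic (compute, optimal feature count) pairs generated from a made-up
# power law n = a * C**b with a little multiplicative noise. Not paper data.
rng = np.random.default_rng(0)
true_a, true_b = 2.0, 0.6
compute = np.logspace(15, 19, num=9)
noise = np.exp(rng.normal(scale=0.02, size=compute.shape))
optimal_features = true_a * compute**true_b * noise

# A power law is linear in log-log space: log n = log a + b * log C.
slope, intercept = np.polyfit(np.log(compute), np.log(optimal_features), deg=1)
fit_b, fit_a = slope, np.exp(intercept)


def predict_optimal_features(c):
    """Extrapolate the fitted power law to a new compute budget."""
    return fit_a * c**fit_b
```

Once fitted on small runs, such a curve can be extrapolated to pick a feature count for a larger compute budget, which is the practical use the assessment attributes to the scaling analysis.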
3. Potential Impact
Mechanistic interpretability: This paper effectively de-risks the SAE approach for the interpretability community. By showing that interpretable features emerge at production scale, it validates years of theoretical work on superposition and linear representation hypotheses. The scaling laws for SAEs are a practical contribution enabling future work.
AI Safety: The identification of safety-relevant features (deception, power-seeking, sycophancy, bias, dangerous content), together with the demonstration that they causally influence model behavior, is potentially transformative for AI safety. The deception case study, in which clamping an "internal conflict" feature causes the model to stop complying with an impossible "forget" instruction, is particularly compelling as a proof of concept.
Model steering: Feature-based steering at this scale opens practical applications for behavior modification, red-teaming, and safety interventions. The Golden Gate Bridge demonstration became a cultural touchstone in the AI community.
Broader field influence: This work catalyzed an explosion of SAE research across multiple labs (Google DeepMind, OpenAI, independent researchers), effectively establishing dictionary learning on production models as a mainstream research direction.
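The feature clamping behind the steering results discussed above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: the SAE weights are random stand-ins, biases are omitted, and `clamp_feature` is a hypothetical helper. The idea shown is to split an activation into the SAE reconstruction plus a residual error, overwrite one feature's activation, and rebuild the activation with the error left untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 16, 64

# Hypothetical, untrained SAE weights standing in for a trained dictionary.
W_enc = rng.normal(scale=0.1, size=(d_model, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, d_model))


def encode(x):
    """Non-negative feature activations (biases omitted for brevity)."""
    return np.maximum(0.0, x @ W_enc)


def clamp_feature(x, feature_idx, value):
    """Return x with one feature's activation forced to `value`.

    The SAE reconstruction error is kept unchanged, so x moves only along
    the chosen feature's decoder direction.
    """
    f = encode(x)
    error = x - f @ W_dec              # part of x the SAE fails to explain
    f_clamped = f.copy()
    f_clamped[..., feature_idx] = value
    return f_clamped @ W_dec + error


x = rng.normal(size=(4, d_model))      # stand-in residual-stream activations
steered = clamp_feature(x, feature_idx=3, value=10.0)
```

In a real model the steered activations would replace the originals at that layer for the rest of the forward pass; here the sketch stops at the modified activation itself.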
4. Timeliness & Relevance
This paper arrived at a critical juncture. As frontier models became increasingly capable, the gap between interpretability research (focused on small models) and actual deployment raised serious concerns about whether interpretability could ever be practically relevant for safety. By demonstrating scalability on a production model, this work directly addressed the most pressing bottleneck in the field.
The multimodal generalization finding — features trained on text activating meaningfully on images — is particularly timely given the proliferation of multimodal models, and suggests that interpretability methods may generalize better than expected across modalities.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Additional Observations
The paper's influence extends beyond its technical contributions. It established a new standard for interpretability publications — combining detailed feature case studies, systematic analyses, interactive visualizations, and honest discussion of limitations. The "Golden Gate Claude" demonstration became one of the most widely discussed AI results of 2024, bringing mechanistic interpretability to mainstream attention.
The feature completeness analysis revealing a sigmoid relationship between concept frequency and feature presence (rescaled by alive feature count) provides the field's first quantitative handle on "how many features do we need?" — a question of fundamental importance.
Generated May 29, 2026
Comparison History (28)
Paper 2 likely has higher near-term scientific impact: it introduces a scalable signal-language foundation model trained on a very large clinical dataset, rigorously validated across nine external cohorts (~1.5M ECGs) and 89 tasks, with clear metrics and demonstrated generalization and data efficiency. Its applications (broad cardiovascular assessment and opportunistic screening) are directly deployable and clinically consequential. Paper 1 is novel and timely for mechanistic interpretability and AI safety, but acknowledges major evaluation limitations and has less immediate real-world deployment pathway; its impact may be longer-term and narrower in practice.
Paper 2 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability/mechanistic interpretability, demonstrating that sparse autoencoders scale to production-level models. It has profound implications for AI safety (identifying deception, power-seeking features), model steering, and understanding neural network internals. Its breadth of impact spans AI safety, interpretability, neuroscience-inspired ML, and AI governance. Paper 1 introduces a useful benchmark for evaluating LLM scientific reasoning but is more incremental—benchmarks are valuable but narrower in scope. Paper 2's findings are more foundational and have already catalyzed significant follow-up research across the field.
Paper 2 (Scaling Monosemanticity) addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—with broad implications for understanding, controlling, and aligning large language models. Its discovery of safety-relevant features (deception, power-seeking) and demonstration of causal steering has transformative implications across AI safety, interpretability, and governance. While Paper 1 (AutoScientists) presents strong engineering contributions to AI-driven scientific experimentation with impressive benchmarks, Paper 2 opens a new research paradigm with deeper theoretical significance and wider cross-disciplinary impact.
Paper 2 has higher likely scientific impact: it demonstrates a scalable, technically novel interpretability method (large sparse autoencoders with scaling-law guidance) on a frontier production model, with compelling evidence of multilingual/multimodal generalization and causal feature steering—including safety-relevant features. This advances mechanistic interpretability and AI safety with broad cross-field relevance and timely importance. Paper 1 introduces useful governance concepts (Agentic Technical Debt, Stochastic Tax) with practical managerial framing, but appears more conceptual and less methodologically rigorous or empirically validated, making its academic impact potentially narrower and less durable.
Paper 1 ('Scaling Monosemanticity') is a landmark contribution that demonstrates dictionary learning scales to production-level LLMs, introduces foundational methodology (sparse autoencoders at 34M features), discovers multilingual/multimodal features, identifies safety-relevant features (deception, power-seeking), and establishes scaling laws for interpretability. It opened an entirely new research direction in mechanistic interpretability. Paper 2 applies SAE-based auditing to biosecurity refusal in small models but is preliminary, narrow in scope (one hackathon weekend, consumer hardware, limited models), and builds directly on the foundations Paper 1 established. Paper 1's breadth, rigor, and field-defining nature give it vastly greater impact.
Paper 1 (Scaling Monosemanticity) demonstrates a fundamental breakthrough in mechanistic interpretability by scaling sparse autoencoders to production-level LLMs, revealing multilingual/multimodal features and safety-relevant concepts (deception, power-seeking) with causal influence on model behavior. This has enormous breadth of impact across AI safety, interpretability, and alignment research. Paper 2 identifies an important failure mode in reward model debiasing, which is valuable but narrower in scope. Paper 1's results reshape how the field approaches understanding large models and has already catalyzed significant follow-up research across multiple subfields.
Paper 1 has higher potential impact due to strong novelty (scaling sparse autoencoders to production-scale LMs with tens of millions of features), broad applicability to mechanistic interpretability, alignment/safety, and controllability, plus timeliness for frontier-model governance. It demonstrates causal steering and identifies harm-relevant features, suggesting downstream tools for auditing and mitigation across AI research and deployment. Paper 2 is useful but largely descriptive/observational with limited methodological innovation and narrower domain scope; its findings depend on registry text quality and small labeled samples, reducing rigor and generalizability.
Paper 2 (Scaling Monosemanticity) has significantly higher scientific impact. It addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—and demonstrates that sparse autoencoders can extract meaningful, causally relevant features from frontier LLMs. The discovery of multimodal, multilingual features and safety-relevant features (deception, power-seeking) has broad implications for AI alignment, model understanding, and governance. Paper 1 contributes a useful benchmark and dataset for multi-agent LLM evaluation, but is more incremental in scope. Paper 2 has already catalyzed substantial follow-up research across the interpretability field.
Paper 1 represents a foundational breakthrough in mechanistic interpretability, proving for the first time that dictionary learning and sparse autoencoders can scale to state-of-the-art, production-level LLMs. Its discovery of interpretable, steerable features for abstract and safety-relevant concepts has massive implications for understanding black-box AI models. Paper 2 offers a highly practical but more incremental methodological improvement for safety guardrail efficiency. Thus, Paper 1 has significantly broader and deeper scientific impact.
Paper 1 addresses a fundamental problem in AI—mechanistic interpretability of production-scale large language models. Its findings on extracting interpretable features related to deception, bias, and abstract concepts have profound implications for AI safety, alignment, and regulation, offering broad impact across the rapidly growing AI field. Paper 2, while methodologically sound, focuses on a much narrower application in urban transportation and tourist mobility modeling, limiting its broader scientific influence compared to a foundational breakthrough in AI interpretability.
Paper 2 likely has higher scientific impact: it advances mechanistic interpretability with a demonstrated scaling of sparse autoencoders to a production-scale frontier model, yielding broadly useful tools (feature discovery and causal steering) with implications for safety, alignment, and basic understanding of deep learning. Its results are timely and relevant across multiple fields (ML interpretability, safety, NLP, multimodal generalization) and show methodological rigor via large-scale training, scaling-law-guided design, and causal interventions. Paper 1 is practically important for agent security architecture but is more application/system-design specific and less broadly generalizable scientifically.
Paper 1 tackles a critical challenge in AI safety and mechanistic interpretability by scaling sparse autoencoders to a production-level LLM. Its ability to identify, interpret, and steer features related to deception, bias, and abstract concepts has massive implications for AI alignment, regulation, and future model development across various domains. Paper 2 provides a valuable, yet relatively niche, ethnographic study on AI in music production. Paper 1's broader applicability, methodological scale, and profound relevance to the globally pressing issue of AI safety give it significantly higher potential scientific impact.
Paper 1 demonstrates interpretable feature extraction from a production-scale LLM (Claude 3 Sonnet), establishing that mechanistic interpretability scales to frontier models. This has enormous implications for AI safety, alignment, and understanding model internals. The discovery of safety-relevant features (deception, power-seeking, sycophancy) with causal influence on outputs is groundbreaking. Its breadth of impact spans AI safety, interpretability, neuroscience-inspired ML, and governance. Paper 2 makes a solid theoretical contribution on LLM limitations in causal discovery with a practical workaround, but its scope is narrower and its impact more domain-specific.
Paper 1 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability, demonstrating that mechanistic interpretability techniques scale to production-level models. It reveals multilingual/multimodal features, identifies safety-relevant features (deception, power-seeking), and enables causal steering of model behavior. Its breadth of impact spans AI safety, interpretability, and fundamental understanding of large language models. Paper 2, while methodologically sound with strong empirical results on agent self-improvement, addresses a narrower problem with more incremental contributions. Paper 1's implications for AI alignment and safety give it substantially broader and more lasting scientific impact.
Paper 1 (Scaling Monosemanticity) represents a landmark contribution to AI interpretability, demonstrating for the first time that mechanistic interpretability methods scale to production-level language models. Its discovery of safety-relevant features (deception, power-seeking, sycophancy) with causal influence on model behavior has profound implications for AI alignment and safety. The work has broad impact across interpretability, alignment, and the broader ML community. Paper 2 (MiraBench) is a solid benchmark contribution for robotic world models but addresses a narrower domain with more incremental impact. Paper 1's novelty, breadth, and timeliness for AI safety give it substantially higher impact.
Paper 1 likely has higher scientific impact due to greater novelty and cross-cutting implications: scaling sparse autoencoders to production-scale LLM internals with millions of interpretable features advances mechanistic interpretability, offers a potential toolkit for auditing/steering harmful behaviors, and spans multilingual/multimodal generalization. Its breadth touches safety, interpretability, and model understanding across domains. Paper 2 is timely and practically valuable (new benchmark, inverse scaling in robustness, RL fix), but is more application-narrow (RAG/agent robustness) and less foundational than a scalable interpretability method that could influence many subfields.
Paper 2 has higher estimated scientific impact due to its novelty and breadth: scaling sparse autoencoders to production LLMs with tens of millions of features tackles a central open problem in mechanistic interpretability, with cross-domain implications for AI safety, alignment, auditing, and governance. The demonstrated causal steering and identification of harm-related features increases real-world relevance and timeliness. While Paper 1 shows strong applied clinical utility, its impact is narrower to rare-disease diagnostics and depends heavily on deployment, regulation, and dataset generalization. Paper 2 is more likely to influence multiple fields and future foundational methods.
Paper 2 addresses a fundamental challenge in AI: interpreting production-scale LLMs. By successfully scaling sparse autoencoders to Claude 3 Sonnet, it provides critical insights into AI safety, bias, and deception, offering mechanisms to causally steer model behavior. This foundational breakthrough has profound implications for AI alignment, safety, and regulation. Paper 1, while practically useful for improving generation diversity in creative tasks, addresses a much narrower application and lacks the broad, transformative impact of Paper 2 across multiple domains.
Paper 2 represents a major breakthrough in mechanistic interpretability by scaling sparse autoencoders to a production-grade LLM (Claude 3 Sonnet). It opens new avenues for AI safety, model steering, and understanding complex model behaviors. Paper 1 offers a valuable but narrower behavioral analysis of reasoning models under adversarial dialogue. Paper 2's methodological innovation and broad implications for AI alignment and transparency give it significantly higher potential scientific impact.
Paper 2 likely has higher scientific impact due to a more foundational methodological advance: demonstrating that sparse autoencoders/dictionary learning can scale to a production frontier model with tens of millions of features, yielding interpretable, multilingual/multimodal representations and causal steering—including for safety-relevant traits. This creates broadly reusable tools for mechanistic interpretability, model control, and safety across many tasks and research areas. Paper 1 is timely and important for agent safety, but is more narrowly scoped as a behavioral vulnerability study with a specific trigger/prompting regime, likely yielding fewer cross-field downstream methods.