
ICA Lens: Interpreting Language Models Without Training Another Dictionary

Sida Liu, Feijiang Han

cs.LG · cs.AI · cs.CL
#849 of 5669 · cs.LG
Tournament Score: 1481±44
Win Rate: 71% (17 wins, 7 losses, 24 matches)
Rating: 7/10
Significance: 7 · Rigor: 7.5 · Novelty: 6.5 · Clarity: 8.5

Abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ICA Lens

Core Contribution

This paper argues that ICA has been underestimated for LLM interpretability due to implementation brittleness and lack of systematic evaluation infrastructure, rather than fundamental limitations. The authors introduce ICALens, a complete workflow comprising: (1) a GPU-parallel FastICA pipeline with three stabilization recipes (row normalization, p95-LIM acceptance, adaptive refitting), (2) an interactive component explorer for human annotation, and (3) systematic evaluation against SAEs on standardized benchmarks. The key conceptual insight is that many interpretable directions are statistically non-Gaussian, and ICA—which directly optimizes for non-Gaussianity—can recover interpretable structure without the expense of training overcomplete dictionaries.
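As a rough illustration of what a FastICA fit involves, the sketch below implements a parallel (symmetric) FastICA update with a tanh contrast in NumPy and checks it on synthetic Laplace sources. It is not the ICALens pipeline: the function names, the L2 form of `row_normalize`, the mixing matrix, and all hyperparameters are assumptions chosen for illustration, and the GPU parallelism, p95-LIM acceptance, and adaptive refitting recipes are not reproduced.

```python
import numpy as np

def row_normalize(X, eps=1e-8):
    """Scale each activation vector (row) to unit L2 norm (assumed form);
    this discards norm information."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)

def whiten(X):
    """Center X and transform it so the sample covariance is the identity."""
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / Xc.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    K = eigvecs / np.sqrt(eigvals)  # scale each eigenvector by 1/sqrt(eigenvalue)
    return Xc @ K, K

def sym_decorrelate(W):
    """Symmetric decorrelation: W <- (W W^T)^(-1/2) W."""
    s, u = np.linalg.eigh(W @ W.T)
    return (u / np.sqrt(s)) @ u.T @ W

def fastica(Z, max_iter=200, tol=1e-6, seed=0):
    """Parallel FastICA with a tanh contrast on whitened data Z of shape (n, d)."""
    n, d = Z.shape
    rng = np.random.default_rng(seed)
    W = sym_decorrelate(rng.standard_normal((d, d)))
    for _ in range(max_iter):
        Y = Z @ W.T                       # current component activations, (n, d)
        G = np.tanh(Y)                    # contrast nonlinearity g
        g_prime = 1.0 - G ** 2            # its derivative g'
        # Fixed-point update per row: E[z g(w.z)] - E[g'(w.z)] w
        W_new = (G.T @ Z) / n - g_prime.mean(axis=0)[:, None] * W
        W_new = sym_decorrelate(W_new)
        # Converged when every row is (anti)parallel to its previous value.
        lim = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if lim < tol:
            break
    return W

# Toy check on synthetic data (not LLM activations).
rng = np.random.default_rng(0)
S = rng.laplace(size=(20000, 3))          # independent super-Gaussian sources
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.2, 0.6, 1.0]])           # hypothetical mixing matrix
X = S @ A.T
Z, K = whiten(X)
W = fastica(Z)
Y = Z @ W.T                               # recovered components
corr = np.abs(np.corrcoef(S.T, Y.T)[:3, 3:])
print(corr.round(2))
```

In this toy setting each true source should match one recovered component up to sign and order. `row_normalize` is defined but left out of the recovery check, because rescaling rows breaks the exact linear-mixture assumption of the synthetic data.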

The paper also introduces the Effective Receptive Field (ERF) diagnostic, which measures how much context is needed for a component to activate, providing an operational link between non-Gaussianity and interpretability: high-kurtosis components tend to be more local and easier to annotate.

Methodological Rigor

The methodology is thorough and well-structured. The three fitting recipes are each motivated by concrete failure modes (outlier activations, slow-converging tail components, layer-specific convergence difficulties) and validated with ablations (Table 1 showing improvements from 2 to 10 accepted layers on GPT-2 Small). The convergence diagnostics are comprehensive, with full layer-wise curves provided in the appendix.

The evaluation is multi-faceted: statistical validation (excess kurtosis comparisons), human annotation (150 randomly sampled components across three models with a secondary expert audit), and task-based evaluation (SAEBench sparse probing and TPP). The annotation protocol is notably rigorous—featuring ERF-guided inspection, hypothesis testing with targeted prompts, and a secondary contrastive audit where 121/127 high-confidence labels were supported. This level of annotation quality control is uncommon in interpretability work.
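The idea behind an excess-kurtosis comparison can be illustrated with a small self-contained sketch. The "selective" and "random" projections below are synthetic stand-ins, not measurements from any model; the 2% firing rate and the noise scales are assumptions made for the example.

```python
import random

def excess_kurtosis(xs):
    """Population excess kurtosis: m4 / m2^2 - 3 (zero for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 ** 2) - 3.0

rng = random.Random(0)
n_tokens = 50_000

# Hypothetical token-selective direction: fires strongly on about 2% of
# tokens and stays near zero on the rest.
selective = [
    rng.gauss(5.0, 1.0) if rng.random() < 0.02 else rng.gauss(0.0, 0.1)
    for _ in range(n_tokens)
]

# Hypothetical random direction: projections that look roughly Gaussian.
random_dir = [rng.gauss(0.0, 1.0) for _ in range(n_tokens)]

k_selective = excess_kurtosis(selective)
k_random = excess_kurtosis(random_dir)
print(f"selective: {k_selective:.2f}, random: {k_random:.2f}")
```

The selective projection has a heavy tail and therefore a large positive excess kurtosis, while the Gaussian projection stays near zero, which is the statistical signal ICA exploits.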

However, there are methodological limitations. The ERF computation uses somewhat arbitrary thresholds (top-15 ranking, half-maximum score, K_max=11). The sparse probing evaluation averages over only two layers per model, and the TPP comparison uses only public SAEs as baselines (not Matryoshka variants or ITDA). Row normalization discards norm information; the authors acknowledge this but do not quantify its effect on downstream tasks.

Potential Impact

Practical impact: The most immediate value is as a rapid exploration tool. Researchers studying new models or layers can run ICA in minutes rather than spending days training SAEs. This could accelerate the pace of mechanistic interpretability research, particularly for under-resourced labs or new model architectures without pre-existing SAE checkpoints.

Conceptual impact: The paper provides an important conceptual reframing. By showing that non-Gaussianity alone recovers much interpretable structure, it challenges the assumption that sparse reconstruction is necessary for finding meaningful directions. The finding that ICA outperforms SAEs in TPP under small intervention budgets is particularly noteworthy—it suggests that for lightweight interventions, compact bases may be preferable.

Methodological impact: The ERF diagnostic is a genuinely useful contribution that could be applied beyond ICA. It provides a principled way to characterize whether a direction is local or contextual, and offers an operational explanation for the previously observed kurtosis-interpretability correlation. The annotation protocol and explorer tool contribute infrastructure for the field.

Broader influence: The paper could influence the SAE community to more seriously consider what statistical signals their methods implicitly learn, potentially leading to hybrid approaches or better-motivated objectives.

Timeliness & Relevance

This paper addresses a real bottleneck in mechanistic interpretability. The proliferation of open LLMs has far outpaced the availability of trained SAE dictionaries. Gemma Scope required enormous compute, and coverage remains limited. The paper arrives at a moment when the field is actively debating SAE evaluation methodology (citing Chanin 2026 on SAEBench reliability) and exploring alternatives (ITDA, Matryoshka SAEs). Positioning ICA as a "first lens" rather than a replacement is strategically sound and timely.

Strengths

1. Complete workflow: Unlike prior ICA-for-LLMs attempts, this provides the full stack—fitting, diagnostics, exploration, annotation, and evaluation—making it immediately usable.

2. Breadth of evaluation: Three model families (GPT-2 Small, Gemma 2 2B, Qwen 3.5 2B Base), multiple evaluation modalities, and strong baselines.

3. ERF as a diagnostic: A novel and useful contribution connecting statistical properties to interpretive difficulty.

4. Honest framing: The paper carefully positions ICA as complementary rather than superior, acknowledging SAE advantages for high-resolution feature discovery. The limitations section is substantive and identifies concrete future directions.

5. Reproducibility: Release of checkpoints, explorer, code, and annotations.

6. Polysemy case study: The "bank" decomposition across layers (Figures 7, 20, 21) is a compelling illustration of what component-level analysis can reveal, including the insight about autoregressive conditioning affecting R2.

Limitations

1. Compact basis constraint: ICA returns at most d components, while SAEs can have 16k+ features. For fine-grained feature catalogs, ICA is fundamentally limited.

2. No automated interpretability scores: The paper relies on human annotation. While this is more reliable, it limits scalability of claims about interpretability rates.

3. Limited causal evaluation: TPP is a probe-based proxy. The paper lacks direct steering/intervention experiments showing ICA components can causally influence model behavior on downstream tasks.

4. Model scale: The largest model tested is 2B parameters. Whether these findings extend to 7B+ models is unclear.

5. No direct comparison with overcomplete ICA: The paper mentions overcomplete ICA as future work but doesn't test it, leaving a natural extension unexplored.

Overall, this is a well-executed paper that makes a convincing case for a previously overlooked method, backed by thorough evaluation and useful infrastructure contributions. Its impact will likely be as a practical tool and conceptual reframing rather than as a fundamental methodological breakthrough.

Rating: 7/10
Significance: 7 · Rigor: 7.5 · Novelty: 6.5 · Clarity: 8.5

Generated Jun 11, 2026

Comparison History (24)

Lost vs. Demystifying Hidden-State Recurrence: Switchable Latent Reasoning with On-Policy Reinforcement Learning

Paper 1 addresses a highly timely frontier in AI: optimizing continuous latent reasoning with on-policy RL (like GRPO). This has profound implications for scaling inference-time compute efficiently. While Paper 2 offers a valuable, computationally efficient tool for interpretability, Paper 1 introduces fundamental architectural and training advancements that enhance both model capability and mechanistic interpretability, promising broader impact on next-generation reasoning models.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Learning with Simulators: No Regret in a Computationally Bounded World

Paper 1 makes a fundamental theoretical contribution to learning theory by introducing simulatable processes, broadening the PAC framework to handle dependent data with VC-dimension-based guarantees. It connects learning theory to Kolmogorov complexity and computational complexity in novel ways, with broad implications across machine learning theory. Paper 2 presents a practical engineering contribution (applying ICA to LLM interpretability) that, while useful, revisits a classical method and offers incremental improvements over SAEs in specific settings. Paper 1's conceptual novelty and theoretical depth give it substantially higher long-term scientific impact.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Select and Improve: Understanding the Mechanics of Post-Training for Reasoning

Paper 1 addresses the most critical and timely frontier in LLM research: understanding and scaling reinforcement learning post-training for reasoning capabilities. Its insights into strategy selection and improvement offer practical pathways to enhance model reasoning, directly impacting the development of next-generation AI systems. While Paper 2 presents a valuable and efficient methodological advancement for interpretability, Paper 1's findings have broader implications for fundamentally advancing core AI capabilities and performance across diverse domains.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Neuron Populations Exhibit Divergent Selectivity with Scale

Paper 1 likely has higher scientific impact due to stronger novelty and breadth: it proposes neuron-level scaling laws (sublinear Rosetta neuron growth, polarization toward monosemanticity, increasing specialization) across both language and vision models, backed by an analytical model and a practical data-filtering case study. This connects interpretability to scaling theory and could influence model design, training, and evaluation across domains. Paper 2 is highly practical and timely, but relies on a classical method (ICA) and primarily improves workflows/engineering; its impact is more incremental and narrower to interpretability tooling.

gpt-5.2 · Jun 11, 2026

Won vs. When Are Multimodal Predictions Biologically Supported? A Diagnostic Evaluation Framework

Paper 1 introduces a highly efficient, training-free alternative to sparse autoencoders for LLM interpretability. Given the massive computational bottlenecks of current interpretability methods and the explosive growth of LLM research, this optimized ICA approach will likely see rapid, widespread adoption across the AI community. While Paper 2 addresses a critical trustworthiness issue in medical AI, Paper 1's foundational contribution to understanding general-purpose language models offers broader immediate applicability and a higher potential for rapid, widespread scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. A Riemannian Approach to Low-Rank Optimal Transport

Paper 2 tackles a highly timely and critical bottleneck in LLM interpretability by reviving a classical method (ICA) to replace computationally expensive Sparse Autoencoders (SAEs). Its potential for immediate, widespread adoption in the booming fields of AI safety and interpretability gives it exceptional real-world applicability and breadth of impact. While Paper 1 offers a rigorous and innovative theoretical advancement in optimal transport optimization, Paper 2's practical utility in democratizing LLM analysis positions it for broader and faster scientific influence.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Paper 2 (ICA Lens) introduces a fundamentally new perspective on LLM interpretability by revisiting ICA as a training-free alternative to sparse autoencoders. This has broader scientific impact: it challenges the dominant SAE paradigm, offers a practical and efficient methodology applicable across models, and addresses the critical problem of understanding neural network representations. Paper 1 is a useful engineering benchmark contribution but is more incremental—extending SWE-bench to evaluate agent harnesses. Paper 2's novelty, methodological contribution, and relevance to the growing mechanistic interpretability field give it higher potential impact across the ML community.

claude-opus-4-6 · Jun 11, 2026

Lost vs. A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

Paper 1 offers a fundamental theoretical reframing of Supervised Fine-Tuning (SFT), a core component of modern LLM training. By generalizing the target distribution and proposing Target-SFT, it provides a method that directly improves reasoning capabilities across multiple models. While Paper 2 presents a valuable and efficient tool for mechanistic interpretability, Paper 1 has a broader potential impact as its findings can be immediately integrated into the training pipelines of nearly all instruction-tuned language models, affecting a wider segment of AI research and application.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. How Low Can You Go? Active Learning for Sparse Model Discovery in the Ultra-Low-Data Limit

Paper 2 addresses a critical bottleneck in LLM interpretability by reviving and optimizing Independent Component Analysis (ICA) as a highly efficient alternative to expensive Sparse Autoencoders. Given the massive scale and urgent need for AI safety and alignment tools, a method that provides cheaper interpretable directions for modern LLMs has immense and immediate real-world utility. While Paper 1 offers valuable advances in computational physics and active learning, Paper 2's potential breadth of impact across the rapidly expanding AI landscape makes it more timely and scientifically impactful.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

Paper 1 is likely to have higher impact: it introduces a practical, GPU-optimized, auditable ICA workflow for LLM interpretability that removes a major bottleneck (training/storing SAEs) and shows competitive-to-better results on established benchmarks across multiple prominent models. This is timely and broadly relevant to mechanistic interpretability, model debugging, and control—areas with wide cross-field interest and rapid uptake. Paper 2 is a solid methodological improvement for NRI via diffusion-learned priors, but is more niche (trajectory-to-graph discovery) and may have narrower immediate adoption outside structured latent-variable modeling.

gpt-5.2 · Jun 11, 2026