Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Yongzhong Xu

Jun 8, 2026 · arXiv:2606.09607v1

cs.LG · cs.AI

#3486 of 5669 · cs.LG

Tournament Score: 1374 ± 43

Win Rate: 50%

Rating: 5.5 / 10

Significance 5.5 · Rigor 6 · Novelty 5 · Clarity 7.5

Abstract

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a methodological question in mechanistic interpretability: does unsupervised co-activation clustering of attention heads actually identify load-bearing circuits, or merely correlated groups? The authors adapt the binarized-Ising clustering framework of Bhalla et al. (originally for SAE features) to attention heads, then validate discovered communities through a "closure test" — ablating the community and comparing per-example damage against matched random controls across multiple metrics (cross-entropy loss, accuracy, target-token logit).
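As a rough illustration of the proposal step described above, the sketch below reduces each head's attention pattern to its maximum weight, binarizes that scalar per head, and correlates the on/off indicators across examples. It is a simplified stand-in, not the Ising fit of Bhalla et al. and not the paper's code: the median threshold, the toy data, and every function name here are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def max_attention_signal(attn):
    """attn: (examples, heads, queries, keys). One scalar per (example, head)."""
    return attn.max(axis=(2, 3))

def binarize(signal):
    """Assumed rule: a head is 'on' when its signal exceeds its own median."""
    return (signal > np.median(signal, axis=0, keepdims=True)).astype(float)

def coactivation(binary):
    """Pearson correlation between heads' on/off indicators across examples."""
    return np.corrcoef(binary, rowvar=False)

# Toy data: heads 0 and 1 share a per-example sharpness, heads 2..5 do not.
rng = np.random.default_rng(0)
n_examples, n_heads, seq = 200, 6, 8
logits = rng.normal(size=(n_examples, n_heads, seq, seq))
sharpness = rng.choice([0.3, 3.0], size=(n_examples, 1, 1, 1))
logits[:, :2] *= sharpness
attn = softmax(logits, axis=-1)

corr = coactivation(binarize(max_attention_signal(attn)))
print("heads 0-1:", round(float(corr[0, 1]), 2))
print("heads 0-2:", round(float(corr[0, 2]), 2))
```

A clustering step would then group heads with high mutual correlation. Such a group is only a proposal; whether it is load-bearing is a separate, causal question.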

The main finding is a clear separation: in dense 1B-scale models (Pythia 1B, OLMo 1B), co-activation communities pass closure across both synthetic and natural text distributions (4/4 tests). In the MoE model (OLMoE-1B-7B), even after careful route-conditional stratification that recovers statistical signal +3σ above a null, the discovered community fails closure in the *wrong direction* — ablation improves loss. This establishes that statistical community recovery ≠ circuit discovery.
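The pass/fail logic of a closure test can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the z-score formula, the sign convention (positive damage means loss went up), and the synthetic numbers are all hypothetical.

```python
import numpy as np

def closure_z(base, community, controls):
    """Compare community-ablation damage with random-control damage.

    base, community: per-example loss arrays of shape (n,)
    controls: per-example losses under k random ablations, shape (k, n)
    Returns (mean damage of the community, z-score against the controls).
    """
    base = np.asarray(base, dtype=float)
    damage = float(np.mean(np.asarray(community, dtype=float) - base))
    ctrl = np.mean(np.asarray(controls, dtype=float) - base, axis=1)
    z = float((damage - ctrl.mean()) / ctrl.std(ddof=1))
    return damage, z

# Synthetic per-example losses (not data from the paper).
rng = np.random.default_rng(0)
n = 500
base = rng.normal(3.0, 0.5, size=n)
controls = base + 0.10 + rng.normal(0, 0.05, size=(5, n))   # 5 random head sets
passing = base + 0.50 + rng.normal(0, 0.05, size=n)         # hurts more than random
wrong_way = base - 0.05 + rng.normal(0, 0.05, size=n)       # ablation lowers loss

damage_pass, z_pass = closure_z(base, passing, controls)
damage_fail, z_fail = closure_z(base, wrong_way, controls)
print(f"pass:      damage={damage_pass:+.3f}  z={z_pass:+.1f}")
print(f"wrong way: damage={damage_fail:+.3f}  z={z_fail:+.1f}")
```

Under this convention a community passes when its damage is positive and well above the controls. Negative damage corresponds to the wrong-direction failure reported for the MoE model.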

A secondary contribution is the "downstream redundancy signature" in Pythia 1B, where a 25× divergence between loss z-score (+1.95σ) and target-logit z-score (−52.5σ) reveals that the model can reconstruct ablated outputs downstream, masking the circuit's causal role when measured by loss alone.

Methodological Rigor

The experimental design is generally careful. The multi-metric closure protocol (loss + accuracy + target-logit) with matched random controls is well-motivated and the paper demonstrates convincingly why single-metric reporting is insufficient. The random-partition null for the MoE route-conditional case is a thoughtful control that strengthens the negative result.

However, several methodological concerns limit confidence:

1. Small model/test sample: Only three models, all ~1B scale. The dense-vs-MoE asymmetry claim rests on a single MoE model with a single route-conditioning protocol (k-means, K=4). The authors acknowledge this but it substantially limits generalizability.

2. Zeroing ablation: The destructive ablation protocol (zeroing head outputs) is standard but known to overestimate effects. The MoE wrong-direction result could potentially reverse under mean or counterfactual ablation, as the authors note.

3. Supervised label leakage in pipeline: While the Ising clustering itself is unsupervised, the hyperparameter selection (choosing k) and candidate selection (choosing which sub-cluster to test) both use supervised probe classifications. This makes the method not truly unsupervised — it's more accurately "unsupervised proposal with supervised candidate selection and causal validation."

4. Only 5 random controls per test: The z-scores are computed against only 5 matched random ablations. While the extreme z-scores (50σ+) are obviously meaningful, the moderate ones (1.83–2.68σ) could be sensitive to this small control sample. The authors' statistical conclusions at the moderate end should be treated cautiously.

5. The "template-free" signal (max attention weight) is a reasonable choice but quite lossy — it collapses the entire attention pattern to a single scalar. Alternative signals might yield different clustering outcomes.
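Concern 4 can be made concrete with a small simulation. The sketch below assumes control damages are normally distributed and that the true effect sits exactly two standard deviations above the control mean; both are illustrative assumptions, not properties of the paper's data. It shows how widely an estimated z-score varies when the control mean and standard deviation come from 5 samples rather than 50.

```python
import numpy as np

def simulated_z(observed, n_controls, trials, rng):
    """Estimated z-scores when control mean and std come from n_controls draws."""
    draws = rng.normal(0.0, 1.0, size=(trials, n_controls))
    return (observed - draws.mean(axis=1)) / draws.std(axis=1, ddof=1)

rng = np.random.default_rng(0)
observed = 2.0  # assumed true effect, in units of the control std

z5 = simulated_z(observed, n_controls=5, trials=10_000, rng=rng)
z50 = simulated_z(observed, n_controls=50, trials=10_000, rng=rng)

lo5, hi5 = np.percentile(z5, [5, 95])
lo50, hi50 = np.percentile(z50, [5, 95])
print(f"5 controls:  90% of estimated z in [{lo5:.2f}, {hi5:.2f}]")
print(f"50 controls: 90% of estimated z in [{lo50:.2f}, {hi50:.2f}]")
```

In this toy setting the same underlying effect can look unremarkable or strongly significant with 5 controls, which is why the moderate z-scores deserve caution while the very large ones do not.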

Potential Impact

The paper's central methodological thesis — that clustering proposes but ablation disposes — is valuable for the interpretability community, which increasingly relies on unsupervised discovery methods. The concrete demonstration that statistically significant clustering can fail causally (and in the wrong direction) provides an important cautionary example.

The multi-metric closure protocol is a useful practical contribution. The observation that loss alone would misrank several tests, while target-token logit provides more stable signal, could influence standard reporting practices.

The MoE negative result, while based on a single model, opens a research direction: understanding why co-activation-based methods break down in sparse expert architectures. The route-modulated noise interpretation is interesting but preliminary.

The training-axis analysis (§7), showing bidirectional decoupling between attention selectivity/participation ratio and causal function, is a genuinely interesting finding. The "function without form" (BOS heads load-bearing before attention pattern forms) and "form without function" (previous-token heads with sharp patterns but no closure signal) observations challenge the common assumption that attention pattern sharpness implies circuit formation.
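For readers unfamiliar with the participation ratio, one common definition is the squared sum of a set of non-negative weights divided by the sum of their squares. The sketch below uses that definition; whether the paper uses exactly this form, and which quantity it applies it to, is an assumption here.

```python
import numpy as np

def participation_ratio(weights):
    """(sum w)^2 / sum(w^2): the effective number of equally weighted entries."""
    w = np.asarray(weights, dtype=float)
    return float(w.sum() ** 2 / (w ** 2).sum())

uniform = participation_ratio([1, 1, 1, 1])   # mass spread evenly over 4 entries
peaked = participation_ratio([1, 0, 0, 0])    # all mass on one entry
print(uniform, peaked)  # 4.0 1.0
```

A low value indicates a concentrated pattern and a high value a diffuse one. The decoupling finding says that neither extreme, on its own, predicts whether ablating the head causes damage.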

Timeliness & Relevance

The paper addresses a current need in mechanistic interpretability, where the field is moving toward unsupervised and scalable circuit discovery methods. The question of validation — when can you trust what clustering finds? — is increasingly important as these methods are applied to larger models and more complex behaviors. The MoE angle is timely given the growing deployment of MoE architectures.

Strengths

Clean negative result: The MoE wrong-direction failure is the paper's strongest contribution. It cannot be explained away by low power or threshold choice.

Multi-metric framework: Demonstrating metric divergence (the 25× Pythia gap) is methodologically valuable.

Careful scoping: The "what we do not claim" section (§8.3) is unusually responsible and specific.

Training-axis analysis: The bidirectional decoupling finding adds genuine insight beyond the main clustering story.

Reproducibility: Code and full results are provided.

Limitations

Scale: All models are ~1B parameters. The findings may not transfer to frontier-scale models.

Narrow architectural coverage: One MoE model is insufficient for architectural claims.

Supervised selection contamination: The pipeline isn't truly unsupervised end-to-end.

Limited ablation protocols: Only zeroing ablation tested; no activation patching or mean ablation comparison.

The positive results are somewhat expected: That co-activation clusters in dense models contain some load-bearing structure is not deeply surprising; the MoE negative is the novel finding but rests on thin evidence.
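The difference between zeroing and mean ablation mentioned in these limitations can be sketched on toy per-head outputs. The array layout, the function name `ablate_heads`, and the synthetic data are hypothetical; this is not the paper's ablation code.

```python
import numpy as np

def ablate_heads(head_out, heads, mode="zero"):
    """Replace selected heads' outputs.

    head_out: (examples, heads, d_model) per-head contributions
    heads: list of head indices to ablate
    mode: "zero" sets outputs to 0; "mean" sets them to the dataset mean
    """
    out = head_out.copy()
    if mode == "zero":
        out[:, heads, :] = 0.0
    elif mode == "mean":
        out[:, heads, :] = head_out[:, heads, :].mean(axis=0, keepdims=True)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return out

def residual_shift(head_out, ablated):
    """Mean L2 distance between the summed head outputs before and after ablation."""
    delta = ablated.sum(axis=1) - head_out.sum(axis=1)
    return float(np.linalg.norm(delta, axis=1).mean())

# Toy head outputs with a large constant offset, so zeroing is far off-distribution.
rng = np.random.default_rng(0)
head_out = rng.normal(2.0, 1.0, size=(100, 4, 16))
heads = [1, 3]

zeroed = ablate_heads(head_out, heads, mode="zero")
meaned = ablate_heads(head_out, heads, mode="mean")
shift_zero = residual_shift(head_out, zeroed)
shift_mean = residual_shift(head_out, meaned)
print(f"zero ablation shift: {shift_zero:.2f}")
print(f"mean ablation shift: {shift_mean:.2f}")
```

Zeroing removes the head's average contribution along with its example-specific signal, so it perturbs downstream computation more than mean ablation does. That is one reason a result obtained under zeroing alone might change under a gentler protocol.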

Overall Assessment

This is a solid methodological contribution to mechanistic interpretability that makes a clear and well-scoped argument. The central insight — validation by closure, not by statistical recovery — is sound and well-demonstrated. The paper's impact is primarily methodological rather than scientific: it establishes a validation protocol and provides cautionary evidence rather than revealing new computational mechanisms. The work is limited by its small model sample and single MoE architecture, but the quality of the negative results and the multi-metric analysis provide genuine value to the field.

Rating: 5.5 / 10

Significance 5.5 · Rigor 6 · Novelty 5 · Clarity 7.5

Generated Jun 9, 2026

Comparison History (16)

Won vs. Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Paper 1 offers a more novel, broadly relevant methodological contribution: it critically tests co-activation-based circuit discovery with a causal “closure” ablation validation across multiple 1B-scale models, distributions, and even training time, yielding insights that challenge common interpretability practice. This can reshape how mechanistic interpretability claims are evaluated and generalized. Paper 2 is timely and practically valuable for agent evaluation reproducibility, but is primarily an engineering/benchmarking contribution with impact concentrated in coding-agent assessment. Overall, Paper 1 has higher potential cross-field scientific impact and conceptual novelty.

gpt-5.2·Jun 11, 2026

Won vs. XtrAIn: Training-Guided Occlusion for Feature Attribution

Paper 2 addresses a fundamental methodological question in mechanistic interpretability—whether co-activation statistics reliably identify functional circuits—and provides a principled validation framework (closure testing) with clear negative results in MoE models. This has broad implications for the growing interpretability field, offering a critical methodological contribution that could reshape how circuits are discovered and validated. Paper 1, while technically sound, offers an incremental improvement to occlusion-based attribution methods in a more crowded subfield, with applications limited to specific domains.

claude-opus-4-6·Jun 10, 2026

Lost vs. Trajectory Geometry of Transformer Representations Across Layers

Paper 2 introduces a novel, probe-free geometric framework bridging computational neuroscience and LLM interpretability. Its discovery of universal trajectory patterns, attractor-like dynamics, and curvature-complexity links offers broader theoretical insights and applications across architectures than Paper 1's targeted, albeit rigorous, methodological critique of attention head clustering.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. Distilling Safe LLM Systems via Soft Prompts for On Device Settings

Paper 1 addresses a highly relevant and immediate bottleneck in AI deployment: running safe LLMs on resource-constrained edge devices. Its practical approach combining soft prompts and distillation offers clear, high-impact real-world applications in mobile and IoT computing. While Paper 2 presents rigorous advancements in mechanistic interpretability, Paper 1's methodology directly enables safer, broader accessibility of LLMs, giving it a higher potential for widespread technological and societal impact.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. What Structural Inductive Bias Helps Transformers Reason Over Knowledge Graphs? A Study with Tabula RASA

Paper 2 has higher potential impact because it introduces a broadly applicable, causally grounded validation criterion (“closure”) for circuit discovery that generalizes across models, data distributions, and training stages, and reveals a key failure mode in MoE settings. This directly affects mechanistic interpretability methodology, improving rigor (causal ablation over correlational clustering) and offering a reusable evaluation protocol. Paper 1 provides clear, valuable insight for KGQA transformers, but its scope is narrower and more domain-specific, whereas Paper 2’s framework can influence a wider swath of interpretability and model analysis work.

gpt-5.2·Jun 9, 2026

Won vs. Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

Paper 1 has higher impact potential: it introduces a clear, causally grounded validation criterion (“closure”) for circuit discovery from co-activation clustering, directly addressing a widely used but weakly validated interpretability workflow. It demonstrates the method across multiple 1B dense models and input distributions, and provides a salient negative result in an MoE setting plus training-time analyses showing metric/function decoupling—findings likely to reshape best practices across mechanistic interpretability and model analysis. Paper 2 offers a useful evaluation metric for diffusion representations, but is more incremental within existing SSL-style probing and may have narrower cross-field methodological implications.

gpt-5.2·Jun 9, 2026

Lost vs. CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

CaliDist introduces a novel and practically useful calibration framework for LLMs that bridges behavioral robustness and confidence estimation—a broadly applicable idea with strong empirical results (70% relative ECE reduction across 7 benchmarks and 6 LLMs). Its real-world applicability to trustworthy AI deployment gives it wider impact. Paper 1 offers a valuable but narrower methodological contribution to mechanistic interpretability, primarily showing that co-activation clustering alone is insufficient without causal validation—an important but more incremental finding within a specialized subfield.

claude-opus-4-6·Jun 9, 2026

Lost vs. Escaping the KL Agreement Trap in On-Policy Distillation

Paper 2 introduces a clear failure mode in a widely used training paradigm (on-policy distillation), proposes a simple, deployable fix (online termination rule), and demonstrates consistent gains plus large efficiency improvements across multiple benchmarks. This is timely and likely to be adopted broadly in RLHF/agentic training and distillation workflows, giving strong real-world and cross-domain impact. Paper 1 is methodologically thoughtful and valuable for interpretability rigor, but its primary contribution is a caution/validation framework with narrower immediate applicability and less direct performance impact.

gpt-5.2·Jun 9, 2026

Won vs. A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning

Paper 2 introduces a broadly applicable, methodologically rigorous framework (closure-validated circuit discovery) that directly links correlational “cheap signals” (co-activation clustering) to causal function via ablation, across multiple architectures and distributions, including negative results in MoE. This offers a general tool and standard for interpretability that can reshape how circuits are claimed and evaluated, with cross-field relevance (mechanistic interpretability, model evaluation, safety). Paper 1 is insightful but narrower (AIME-style math reasoning, one flagship model) and more descriptive/diagnostic than providing a reusable causal validation method.

gpt-5.2·Jun 9, 2026

Won vs. Operator learning for solving Fokker-Planck equations with various initial conditions

Paper 1 addresses a critical bottleneck in LLM mechanistic interpretability by introducing a rigorous validation framework (closure via causal ablation) for circuit discovery. Given the massive ongoing efforts in AI safety and model understanding, challenging current heuristic assumptions will significantly influence the field. While Paper 2 offers a strong methodological advance for solving Fokker-Planck equations using PINNs, its impact is likely more confined to specialized computational physics and numerical methods communities compared to the broader and highly active LLM research landscape.

gemini-3.1-pro-preview·Jun 9, 2026
