Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri
Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.
This paper introduces a data-centric post-training pipeline that uses interpretability tools (primarily sparse autoencoders, SAEs) to diagnose which behaviors a preference dataset will teach a model, and then to intervene on those behaviors before or during training. The key intellectual contributions are threefold:
1. A unifying framework that casts activation steering, reward shaping, inoculation prompting, and data filtering as different instantiations of "explaining away" latent concept-level classifiers from the exponential tilt formulation of optimal RLHF policies (Eq. 1-3).
2. Two hypothesis generation pipelines — prompt-conditioned and feature-conditioned — that use SAE feature clusters and two-sample hypothesis tests to identify concepts statistically distinguishing chosen from rejected responses in preference data.
3. Extensive empirical validation across controlled poisoning experiments and realistic post-training on the Dolci dataset with models from 7B to 70B parameters.
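The two-sample testing step in contribution 2 can be illustrated with a small sketch. The review does not say which test the paper uses, so the permutation test below is a stand-in, and the function name `permutation_test` and the toy activation values are hypothetical. It asks whether one SAE feature's mean activation differs between chosen and rejected responses.

```python
import random
from statistics import mean

def permutation_test(chosen, rejected, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in mean activation.

    `chosen` and `rejected` are per-response activations of a single
    feature (hypothetical inputs; the paper's exact test may differ).
    """
    rng = random.Random(seed)
    observed = mean(chosen) - mean(rejected)
    pooled = list(chosen) + list(rejected)
    n = len(chosen)
    hits = 0
    for _ in range(n_perm):
        # Reassign labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = mean(pooled[:n]) - mean(pooled[n:])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy example: the feature fires more strongly on chosen responses.
chosen_acts = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.95, 1.05, 0.9, 1.1]
rejected_acts = [0.1, 0.0, 0.2, 0.1, 0.0, 0.15, 0.05, 0.1, 0.2, 0.0]
effect, p_value = permutation_test(chosen_acts, rejected_acts)
```

In a real run this would be repeated over many features or feature clusters, so some multiple-comparison correction would presumably be needed. That detail is not covered in this review.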
The theoretical grounding is sound but involves important simplifying assumptions. The decomposition of rewards into independent concept-specific classifiers (Eq. 2-3) is a useful modeling choice, but the paper honestly acknowledges this independence assumption and documents its failures systematically — particularly in the over-stylization experiments where targeting one formatting attribute affects others. This intellectual honesty strengthens the work.
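For readers without the paper at hand, the structure under discussion can be written out roughly as follows. The first line is the standard closed-form optimum of a KL-regularized RLHF objective. The additive split of the reward into per-concept terms, the symbols \beta, Z(x), K and r_k, and the step of dropping concept j are assumptions made here for illustration and may not match the paper's Eq. 1-3 exactly.

```latex
% Standard optimum of the KL-regularized objective:
% an exponential tilt of the reference policy.
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\left(\frac{r(x, y)}{\beta}\right)

% Assumed additive decomposition into K concept-level terms.
r(x, y) = \sum_{k=1}^{K} r_k(x, y)
\quad\Longrightarrow\quad
\pi^{*}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)
  \prod_{k=1}^{K} \exp\!\left(\frac{r_k(x, y)}{\beta}\right)

% "Explaining away" concept j under the same assumption: remove its factor.
\pi^{*}_{\setminus j}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)
  \prod_{k \neq j} \exp\!\left(\frac{r_k(x, y)}{\beta}\right)
```

The product form makes the independence concern concrete: removing factor j leaves the remaining factors untouched only if r_j does not covary with the other r_k, which is the condition that entangled formatting attributes would violate.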
The hypothesis generation pipeline is well-constructed. The feature-conditioned pipeline reaches R² = 0.9 between predicted feature changes and actual post-DPO changes, which is impressive. The prompt-conditioned pipeline shows a lower but still meaningful fit (R² = 0.58), which the authors attribute to the sparsity of specific prompt clusters.
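To make the R² figures concrete, here is a minimal sketch of how such a score could be computed from per-feature predictions. It assumes R² means the squared Pearson correlation between predicted and observed changes; the paper may instead use a regression-based coefficient of determination. The function name and the toy numbers are hypothetical.

```python
def r_squared(predicted, actual):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov * cov / (var_p * var_a)

# Hypothetical per-feature values: the shift predicted from the preference
# data versus the shift measured after DPO training.
predicted_shift = [0.10, 0.40, -0.20, 0.05, 0.30]
measured_shift = [0.12, 0.35, -0.25, 0.00, 0.33]
score = r_squared(predicted_shift, measured_shift)
```

A score near 1 means that the ranking and relative size of the predicted feature shifts carry over to the trained model. It does not by itself say that the absolute magnitudes match.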
However, several methodological concerns arise:
Practical impact: The pipeline addresses a genuine pain point in production LLM post-training. The finding that Dolci DPO training *degrades* safety relative to the SFT baseline (Fig. 6) is practically significant and demonstrates the pipeline's diagnostic value. The interactive viewer (Figs. 4-5) represents a genuinely useful tool for practitioners.
Methodological impact: The unifying framework connecting PPS, CAFT, data filtering, and inoculation prompting is a valuable conceptual contribution. By showing these are all operationalizations of the same "explaining away" principle, the paper provides practitioners with a principled basis for choosing interventions.
Broader influence: This work bridges interpretability research (traditionally focused on understanding trained models) and alignment/post-training (traditionally focused on reward engineering). This bridge is timely and could catalyze cross-pollination between these communities.
The paper is extremely timely. The problems it addresses — sycophancy (OpenAI's public acknowledgment), over-stylization (the "bold formatting" complaints), reward hacking (the "goblins" incident) — are all issues that have received significant public attention in 2025-2026. The fact that these problems persist despite years of methodological refinement in RLHF underscores the need for fundamentally different approaches.
The emergence of high-quality SAEs and the maturation of mechanistic interpretability tools make this work feasible now in a way it wouldn't have been 18 months ago. The paper effectively rides the wave of SAE development while demonstrating a concrete downstream application.
1. Scale and breadth of experiments: Testing across 4 model families (7B–70B), 5+ intervention types, multiple interpretability tools, and diverse behavioral targets provides unusually comprehensive coverage.
2. Honest treatment of failures: The paper is refreshingly candid about when methods don't work — physics sycophancy being essentially unmodulable due to sparse signal (Fig. 12), off-target effects in style modulation (Fig. 9), and the general difficulty of modulating entangled concepts. This builds trust in the positive results.
3. Actionable diagnostics: The discovery that Dolci degrades safety is not just an academic finding — it's immediately actionable for teams using this dataset.
4. Reward shaping emerges as the best general-purpose intervention: The consistent finding across experiments that reward shaping (which operates on the loss directly) outperforms representation-level interventions provides practical guidance.
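One way to picture an intervention that "operates on the loss directly" is to add a concept score to the DPO logit, so that the concept absorbs part of the preference signal. The sketch below follows that product-of-experts style of debiasing. It is an assumption about the general shape of such a method, not the paper's formulation, and every name in it (`shaped_dpo_loss`, `concept_w`, `concept_l`, `lam`) is hypothetical.

```python
import math

def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_w, policy_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    `_w` marks the chosen response and `_l` the rejected one.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin)

def shaped_dpo_loss(policy_w, policy_l, ref_w, ref_l,
                    concept_w, concept_l, beta=0.1, lam=1.0):
    """DPO loss with an added concept term (illustrative assumption).

    If the chosen response already scores higher on the unwanted concept,
    the concept term raises the logit, the loss shrinks, and the policy
    receives less gradient from that pair.
    """
    margin = (policy_w - ref_w) - (policy_l - ref_l)
    return -log_sigmoid(beta * margin + lam * (concept_w - concept_l))

# A pair in which the chosen response carries more of the concept.
plain = dpo_loss(-10.0, -10.0, -10.0, -10.0)
shaped = shaped_dpo_loss(-10.0, -10.0, -10.0, -10.0,
                         concept_w=0.9, concept_l=0.1)
```

Setting `lam=0` recovers plain DPO, so the strength of the correction can be tuned per concept.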
1. Independence assumption is the Achilles' heel: The paper's central formalism (Eq. 3) assumes concept independence, and most failure modes trace back to violations of this assumption. The discussion acknowledges this, but the proposed solution ("structure-aware reward shaping") remains vague.
2. SAE quality dependence: The pipeline inherits all limitations of current SAEs — feature splitting, incomplete coverage, and potential sensitivity to training corpus. The paper does not systematically assess how SAE quality affects downstream results.
3. Evaluation challenges: For the most interesting behaviors (prompt-conditioned ones), the improvements from the interventions are not statistically significant (Fig. 11). The paper frames these as informative negative results, but they weaken the overall practical case.
4. Scalability questions: While tested up to 70B parameters, the hypothesis generation pipeline's computational cost (running SAEs over entire datasets, clustering, testing) is not discussed, nor is the human effort required to interpret the viewer output.
5. No comparison to simpler baselines: The paper doesn't compare against, e.g., simply training a classifier on chosen/rejected responses and examining top features, or using reward model probing approaches.
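The simpler baseline mentioned in weakness 5 could look roughly like the sketch below: fit a logistic regression that predicts chosen versus rejected from per-response feature activations, then rank features by absolute weight. This is a hypothetical illustration written in plain Python. The toy feature vectors, function names and hyperparameters are all assumptions.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=200):
    """Batch gradient descent for logistic regression (no regularization)."""
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def top_features(w, k=3):
    """Indices of the k features with the largest absolute weight."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

# Toy data: feature 0 separates chosen (1) from rejected (0) responses,
# while feature 1 takes the same values in both classes.
X = [[1.0, 0.5], [0.9, 0.1], [0.0, 0.5], [0.1, 0.1]]
y = [1, 1, 0, 0]
weights, bias = train_logistic(X, y)
ranking = top_features(weights, k=1)
```

Comparing such a ranking against the pipeline's hypotheses would show how much the clustering and testing machinery adds over a linear probe.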
This is an ambitious systems paper that makes a meaningful contribution by connecting interpretability research to practical post-training challenges. The unifying framework is elegant, the experiments are extensive, and the honest treatment of limitations adds scientific value. The diagnostic pipeline is more convincing than the intervention pipeline — the paper is better at finding problems than solving them, particularly for complex or entangled behaviors. Nevertheless, it opens a productive research direction and provides immediately useful tools.
Generated Jun 11, 2026
Paper 1 addresses a fundamental and timely problem in LLM post-training by bridging interpretability and alignment, offering a practical pipeline that diagnoses spurious correlations, mitigates sycophancy, and shapes model behavior at the concept level. Given the enormous current investment in LLM alignment and the broad applicability across the AI industry, this work has potential for very wide adoption. Paper 2 is methodologically strong with nice theoretical guarantees for latent dynamics recovery, but targets a narrower scientific audience. Paper 1's combination of novelty (unifying interpretability with training signal design), practical utility, and timeliness in the rapidly growing alignment field gives it higher expected impact.
Paper 1 addresses a broadly impactful problem—making LLM post-training more transparent and controllable using interpretability—relevant to the massive and growing LLM alignment community. It offers a practical, actionable pipeline (data auditing, reward shaping) with immediate real-world applications for mitigating sycophancy and over-stylization. Paper 2, while theoretically rigorous with novel certified horizon guarantees for equivariant world models, targets a narrower audience (equivariant dynamics/world models). Paper 1's breadth of impact across AI safety, alignment, and practical ML deployment, combined with its timeliness given current LLM scaling trends, gives it higher estimated impact.
Paper 1 addresses a critical and highly timely bottleneck in AI alignment: the opacity of LLM post-training and reward modeling. By unifying interpretability with preference optimization, it offers an immediate, practical solution to widespread issues like sycophancy and over-stylization. While Paper 2 presents profound theoretical advancements in PAC learning and non-i.i.d. generalization, Paper 1 is poised for broader, more immediate real-world adoption across both academia and industry in the rapidly moving field of large language models.
Paper 2 likely has higher impact due to a clearer, timely control/oversight contribution: a concrete protocol (bootstrapped monitoring) addressing the capability gap in monitoring, evaluated on multi-turn agentic tasks with adversarial collusion. If its assumptions hold (access to raw chain-of-thought), it offers near-term, broadly relevant applications for AI safety, agent deployment, and governance. Paper 1 is novel and valuable for post-training interpretability and data auditing, but its impact may be more incremental and toolchain-dependent, with narrower immediate deployment paths.
Paper 1 addresses a fundamental and widely-relevant problem in LLM post-training—making the learning signal interpretable and controllable—with broad practical applications across the entire LLM development ecosystem. It unifies multiple training protocols under a principled interpretability framework, offering immediate utility to practitioners. Paper 2 is innovative in using brain signals to guide LLM reasoning, but its impact is narrower: it requires fMRI data, addresses a more niche intersection of neuroscience and AI, and the practical scalability of brain-guided approaches remains limited compared to the broadly applicable data-centric pipeline of Paper 1.
Paper 1 is more novel and broadly impactful: it proposes an interpretability-driven, data-centric post-training framework to audit and sculpt the learning signal, addressing widely relevant issues (spurious correlations, sycophancy, over-stylization) across RLHF/DPO-style pipelines and model governance. Its potential applications span safety, alignment, personalization, and dataset curation for many model classes, making it timely and field-shaping. Paper 2 is rigorous and practically valuable for diffusion-model deployment on specific hardware, but is more incremental, model/hardware-specific, and narrower in cross-field impact.
Paper 1 introduces a novel, actionable pipeline that bridges interpretability and post-training at scale, addressing a fundamental problem in LLM alignment. It provides both diagnostic tools and concrete interventions (feature/data shaping) with broad applicability across post-training workflows. While Paper 2 identifies an important and well-characterized failure mode (epistemic blind spots in source evaluation), its contribution is more diagnostic than constructive—it characterizes a problem without offering effective solutions. Paper 1's framework has broader methodological impact, unifying multiple training protocols and enabling practitioners to systematically audit and sculpt learning signals.
Paper 2 introduces a novel data-centric post-training pipeline that bridges interpretability and preference optimization, addressing fundamental problems (sycophancy, over-stylization) in LLM alignment. It offers a unifying framework with broad practical applications across the entire post-training ecosystem. Paper 1, while valuable, is primarily a negative/cautionary empirical re-evaluation of an existing technique (confidence remasking in masked diffusion LMs), with narrower scope and incremental contribution. Paper 2's methodological innovation, broader applicability, and relevance to the critical challenge of LLM alignment give it significantly higher impact potential.
Paper 1 presents a fundamental leap in AI for Science by enabling the automated discovery of mathematical formulas in multiscale complex systems. While Paper 2 offers significant improvements for LLM alignment and interpretability, Paper 1 has a broader potential scientific impact across diverse disciplines (e.g., physics, biology, chemistry) by providing a novel, highly efficient tool for fundamental scientific discovery.
Paper 1 introduces a fundamentally new paradigm for post-training that bridges interpretability and alignment—two of the most critical areas in AI safety and LLM development. Its concept-level auditing of preference data addresses widely recognized problems (sycophancy, over-stylization) and offers a general framework unifying multiple training protocols. Paper 2 makes solid engineering/theoretical contributions to efficient attention, but operates in an increasingly crowded space of attention approximations. Paper 1's breadth of impact on alignment practices, its novelty in connecting interpretability to training signals, and its timeliness give it higher potential impact.