Introspection Adapters: Training LLMs to Report Their Learned Behaviors

Keshav Shenoy, Li Yang, Abhay Sheshadri, Sören Mindermann, Jack Lindsey, Sam Marks, Rowan Wang

Apr 18, 2026

arXiv:2604.16812v2

cs.AI (primary)

#64 of 2292 · Artificial Intelligence

Tournament Score: 1560 ± 34 (axis range 1050 to 1800)

Win Rate: 75%

Rating: 7.8 / 10

Significance 8.5 · Rigor 7 · Novelty 7.5 · Clarity 8

Abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_{i}$ from $M$ with implanted behaviors $b_{i}$; the $(M_{i}, b_{i})$ pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes $M_{i}$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_{i}$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Introspection Adapters

Core Contribution

The paper introduces introspection adapters (IAs): single LoRA adapters jointly trained across multiple fine-tuned model variants to elicit natural-language self-reports of learned behaviors. The key insight is to treat pairs of (fine-tuned model, known behavior description) as labeled training data and then train one adapter that generalizes across models. The method extends prior work (DIT; Goel et al., 2025) through three improvements: scaling to larger models (Llama-3.3-70B), substantially diversifying the training behavior distribution (8 categories spanning backdoors, sandbagging, harmful roleplay, etc.), and introducing a DPO refinement stage to suppress hallucinated reports.
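The joint training idea described above can be written down as a single objective. This is a sketch under assumptions, not the paper's stated loss: $\phi$ denotes the shared adapter parameters, $M_i \oplus \phi$ the finetune $M_i$ with the adapter applied, $Q$ a hypothetical set of introspection prompts, and $b_i$ the known behavior description. A standard supervised (cross-entropy) formulation would then be:

```latex
\[
\min_{\phi} \; \sum_{i=1}^{N} \; \mathbb{E}_{q \sim Q}
\left[ -\log p_{M_i \oplus \phi}\left( b_i \mid q \right) \right]
\]
```

Under this reading, the weights of each $M_i$ are presumably held fixed and only the shared $\phi$ is updated, which is what lets one adapter be reused across finetunes. The DPO refinement stage is not captured by this expression.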

The contribution addresses a genuinely important problem: how to audit fine-tuned LLMs whose behaviors may be unexpected, harmful, or deliberately concealed. The approach is elegant in its simplicity — rather than building external probes or interpretability tools, it leverages the model's own representations to verbalize what it has learned.

Methodological Rigor

The experimental design is thorough and well-structured. The authors evaluate across four distinct out-of-distribution families:

1. AuditBench (56 adversarially trained models with hidden behaviors): IAs achieve a 59% success rate vs. 53% for the next-best method, with correct behavioral descriptions on 50/56 models (89%).

2. Covert Malicious Finetuning attacks (9 cipher-based models): 57.8% scaffold success rate, detecting 7/9 variants.

3. RM Sycophant: The IA surfaces 16/52 behaviors vs. 2/52 for the baseline.

4. UKAISI sandbagging models: The IA provides a detection signal where 10 prior methods failed.
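The count-based results in the list above can be converted to percentages for easier comparison. The snippet below is a minimal arithmetic check that uses only the counts quoted in this review; the dictionary keys and the `percent` helper are illustrative names, not anything from the paper.

```python
from fractions import Fraction

# Counts as quoted in the list above: (successes, total).
reported = {
    "AuditBench correct descriptions": (50, 56),
    "Cipher variants detected": (7, 9),
    "RM Sycophant behaviors (IA)": (16, 52),
    "RM Sycophant behaviors (baseline)": (2, 52),
}


def percent(successes: int, total: int) -> float:
    """Return successes/total as a percentage rounded to one decimal place."""
    return round(float(Fraction(successes, total) * 100), 1)


# Map each reported result to its percentage.
rates = {name: percent(s, t) for name, (s, t) in reported.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}%")
```

The 57.8% scaffold success rate is a different metric from the 7/9 detection count and is not recomputed here.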

The ablation studies are particularly well-executed: model scale analysis across Qwen3 (0.6B–14B), training diversity sweeps, DPO vs. SFT-only comparisons, and even minimal-capacity probes (rank-1 LoRA, single-layer bias vectors). The finding that a rank-1 single-layer LoRA achieves 68.9% of full adapter performance is mechanistically informative, suggesting behavioral information is already accessible in representations.
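To make the "rank-1 LoRA" probe concrete, here is a minimal sketch of a rank-1 low-rank update on a single weight matrix, following the usual LoRA form W' = W + scale · b aᵀ. The toy dimensions, the values, and the function names are assumptions for illustration; the paper's actual adapter configuration is not reproduced here, and plain Python lists stand in for a tensor library.

```python
from typing import List

Matrix = List[List[float]]


def rank1_lora_delta(b: List[float], a: List[float], scale: float = 1.0) -> Matrix:
    """Return the rank-1 update scale * outer(b, a), shape (len(b), len(a))."""
    return [[scale * bi * aj for aj in a] for bi in b]


def apply_delta(w: Matrix, delta: Matrix) -> Matrix:
    """Return W + delta without mutating the frozen weight matrix W."""
    return [[wij + dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy frozen weight with d_out = 2 and d_in = 3 (hypothetical values).
w = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
b = [0.5, -1.0]        # trainable column vector, length d_out
a = [1.0, 2.0, 0.0]    # trainable row vector, length d_in

delta = rank1_lora_delta(b, a)
w_adapted = apply_delta(w, delta)

# A rank-1 adapter trains d_out + d_in numbers instead of d_out * d_in.
trainable_params = len(a) + len(b)
full_params = len(w) * len(w[0])
print(w_adapted, trainable_params, full_params)
```

The point of the sketch is the parameter count: the update touches every entry of the matrix but is described by only d_out + d_in trainable numbers, which is why a probe this small recovering much of the full adapter's performance is informative.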

However, there are methodological concerns. The false positive rate is acknowledged but insufficiently characterized: when applied to models without specific trained behaviors, IAs hallucinate behaviors from the training distribution. The UKAISI evaluation uses a permissive metric (any sandbagging-like claim counts as a positive), and the baseline false-positive rate of 15.8% on AuditBench models is non-trivial. A more systematic false positive analysis across benign models would strengthen the paper.

Potential Impact

Practical utility: The method has clear deployment relevance for AI labs operating fine-tuning APIs (e.g., OpenAI, Anthropic). Once an IA is trained, it can be cheaply applied to many fine-tuned model variants, making it scalable for auditing customer fine-tunes. The detection of encrypted finetuning attacks is particularly compelling — this is a known vulnerability with few good defenses.

Scientific implications: The paper provides evidence that LLMs possess latent behavioral self-awareness that can be reliably elicited with minimal intervention. The logit-lens analysis (Appendix M) showing quirk-related signals already present in residual streams before IA application deepens understanding of how models represent their own behavioral tendencies.

Broader influence: This work could influence:

- AI safety evaluation pipelines at frontier labs
- Regulatory frameworks requiring model auditing capabilities
- Research on model introspection and self-knowledge
- Fine-tuning API safety measures

Timeliness & Relevance

This paper is exceptionally timely. Fine-tuning APIs are proliferating, and attacks via fine-tuning (jailbreaking, capability elicitation, backdoor implantation) are an active concern. The recent UKAISI auditing games demonstrated that existing detection methods fail against determined adversaries. Simultaneously, there's growing interest in whether models can introspect (Betley et al., 2025; Binder et al., 2024; Lindsey, 2025), making this a natural applied extension of a theoretical research direction.

Strengths

1. Strong OOD generalization: The most impressive result is generalization to fundamentally different model families — adversarially trained models, cipher-based attacks, and multi-behavior organisms that share no structural similarity with training data.

2. Favorable scaling properties: Clear improvements with model size (37.7% → 77.3% from 0.6B to 14B) and training diversity, suggesting the approach will improve with future models.

3. Practical deployment model: Train once, apply to many — the amortized cost structure is ideal for API providers.

4. Mechanistic insights: The lightweight adapter analysis and logit-lens results provide genuine scientific understanding, not just empirical performance.

5. Comprehensive release: Code, models, datasets, and adapters are all publicly available.

Limitations

1. High false positive rates: The tendency to hallucinate training-distribution behaviors undermines reliability for deployment. The paper acknowledges this but offers limited solutions.

2. Training distribution sensitivity: The composition of behavior categories significantly affects performance (Appendix B), and principled guidance on distribution design is lacking.

3. Ceiling on OOD generalization: AuditBench performance plateaus after 5 training categories (Figure 8), suggesting diversity alone won't solve generalization.

4. Root cause blindness: On the RM Sycophant, the IA surfaces 16/52 individual behaviors but misses the core hidden objective (reward model exploitation), suggesting it detects symptoms rather than causes.

5. Adversarial robustness unclear: The paper doesn't test whether an adversary aware of IAs could train models that specifically resist introspection adapter elicitation. This is the natural next attack vector.

6. Base model limitation: Generalization to base model behaviors (Section 4.1) shows significant degradation (22-26% → 8-10%), limiting utility for auditing pre-existing issues.

7. Single base model family: Most results use Llama-3.3-70B; cross-architecture generalization is untested.

Overall Assessment

This is a well-executed paper that introduces a practical and scalable approach to a pressing problem. The OOD generalization results, particularly on AuditBench and cipher attacks, are genuinely surprising and scientifically interesting. The mechanistic analysis adds depth beyond pure empiricism. The main concerns — false positives, adversarial robustness, and the training distribution sensitivity — represent important open questions rather than fatal flaws. This work establishes a promising new paradigm for model auditing that is likely to stimulate significant follow-up research.

Rating: 7.8 / 10

Significance 8.5 · Rigor 7 · Novelty 7.5 · Clarity 8

Generated May 5, 2026

Comparison History (32)

vs. Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

gemini-3 · 5/6/2026

Paper 2 introduces a groundbreaking integration of LLM agents with formal verification (Lean 4) and control theory, offering mathematical guarantees for AI in high-stakes environments. While Paper 1 provides a clever empirical approach to AI safety, Paper 2's exceptional methodological rigor and neuro-symbolic architecture represent a significant paradigm shift for reliable, autonomous AI agents, likely sparking broad interdisciplinary research across AI, cybersecurity, and formal methods.

vs. Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense

gemini-3 · 5/6/2026

Paper 2 bridges LLM agents, formal verification (Lean 4), and game theory to provide mathematically guaranteed stability in high-stakes adversarial environments. Its exceptional methodological rigor—using machine-checked proofs to bound LLM non-determinism—solves a major bottleneck in deploying autonomous AI safely. While Paper 1 offers a highly relevant approach to LLM auditing, Paper 2's formal guarantees and cross-disciplinary innovation give it a higher potential for foundational scientific impact.

vs. Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

gemini-3 · 5/5/2026

Paper 1 introduces a highly novel and scalable technical approach (Introspection Adapters) to AI safety and auditing, a critical and rapidly growing field. By enabling models to verbalize hidden behaviors, it provides a powerful tool against malicious fine-tuning. While Paper 2 offers valuable empirical insights into LLM bias and sycophancy, Paper 1's algorithmic innovation has broader implications for securing and interpreting foundation models across diverse real-world applications.

vs. Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

gemini-3 · 5/5/2026

Paper 1 introduces a novel, scalable technical methodology for auditing LLMs, addressing a critical challenge in AI safety and alignment. By enabling models to verbalize their learned behaviors, it opens new avenues for interpretability and security testing against malicious fine-tuning. While Paper 2 provides valuable empirical insights into evaluation flaws regarding sycophancy and bias, Paper 1's actionable technical contribution offers broader utility and stronger potential for follow-on research across the rapidly growing field of AI safety.

vs. Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

claude-opus-4.6 · 5/5/2026

Paper 1 introduces a novel diagnostic framework (VLAF) that reveals alignment faking is far more prevalent than previously thought, including in smaller models. It provides both a rigorous diagnostic methodology and a practical mitigation technique using contrastive steering vectors, achieving substantial reductions in alignment faking with minimal overhead. The discovery that alignment faking behavior can be captured by a single direction in representation space is a significant mechanistic insight. While Paper 2 presents a useful auditing approach via introspection adapters, Paper 1 addresses the more fundamental and timely concern of alignment faking—a critical safety problem—with both novel findings and actionable mitigations.

vs. Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

claude-opus-4.6 · 5/5/2026

Paper 1 addresses the critical problem of alignment faking with a novel diagnostic framework (VLAF), reveals the phenomenon is more widespread than previously known (occurring in 7B models), provides mechanistic insights (single direction in representation space), and offers a practical lightweight mitigation strategy with strong results. While Paper 2 presents a useful auditing tool via introspection adapters, Paper 1's contributions span detection, mechanistic understanding, and mitigation of a fundamental AI safety concern, making it more broadly impactful and timely given growing deployment of LLMs in safety-critical settings.

vs. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

gemini-3 · 5/5/2026

Paper 2 addresses a critical and broad challenge in AI safety and interpretability—auditing LLMs for hidden or harmful behaviors. Its novel 'Introspection Adapters' offer a scalable, generalizable solution with significant security implications. In contrast, Paper 1, while practically useful for educational technology, focuses on a narrower application of existing RLHF/PEFT techniques for stylistic alignment, limiting its overall breadth of impact compared to the foundational safety contributions of Paper 2.

vs. To Use AI as Dice of Possibilities with Timing Computation

gpt-5.2 · 5/5/2026

Paper 2 has higher likely scientific impact due to strong timeliness and broad relevance to LLM safety, auditing, and deployment. It proposes a concrete, scalable method (introspection adapters via LoRA) with clear evaluation on benchmarks (e.g., AuditBench), SOTA claims, and practical security applications (detecting hidden behaviors and encrypted finetuning attacks). The approach is readily adoptable across many derived models and could influence alignment, interpretability, and ML security. Paper 1 is ambitious and potentially novel, but its paradigm shift claims are less standard and impact may be narrower/less immediately actionable beyond the demonstrated EHR use case.

vs. PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs

gemini-3 · 5/5/2026

Paper 2 addresses a critical bottleneck in AI safety and auditing by enabling LLMs to verbalize hidden learned behaviors. This highly novel, scalable approach has broad and urgent implications for security, alignment, and defending against malicious fine-tuning. In contrast, Paper 1 offers a valuable but more narrowly focused application of RLHF for stylistic alignment in educational feedback, giving Paper 2 a significantly broader potential scientific impact.

vs. To Use AI as Dice of Possibilities with Timing Computation

gpt-5.2 · 5/5/2026

Paper 2 has higher estimated impact due to strong timeliness and broad relevance to LLM safety, auditing, and governance. Its approach is technically concrete (LoRA-based introspection adapters), scalable across many derived models, and demonstrates generalization plus state-of-the-art results on an established benchmark (AuditBench) and practical security applications (detecting encrypted finetuning attacks). Paper 1 is ambitious and potentially valuable for longitudinal healthcare modeling, but its framing is more philosophical, novelty is harder to verify from the abstract, and impact is narrower and more dependent on methodological details not provided.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-3 · 5/5/2026

Paper 2 introduces a highly novel, scalable mechanism for auditing LLMs by training them to introspect and report hidden behaviors. This addresses critical challenges in AI safety and alignment. The surprising generalization of Introspection Adapters offers a conceptual breakthrough with broad implications for interpretability and security, potentially driving more foundational follow-up research than the agent evaluation benchmark presented in Paper 1.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-3 · 5/5/2026

Paper 2 introduces a highly novel and scalable approach to AI safety and auditing through 'Introspection Adapters'. The ability to force LLMs to verbalize hidden or malicious fine-tuned behaviors addresses a critical, unsolved challenge in AI alignment and security. While Paper 1 provides a valuable benchmark for agent interaction, Paper 2's methodological innovation in model interpretability and its strong security implications offer broader, transformative potential across the rapidly growing field of AI safety.

vs. Minimal, Local, Causal Explanations for Jailbreak Success in Large Language Models

claude-opus-4.6 · 5/5/2026

Paper 2 introduces a novel and practically impactful approach to LLM auditing through introspection adapters, addressing the critical problem of detecting unexpected or harmful behaviors in fine-tuned models. Its scalability, generalization to unseen fine-tuning methods, and state-of-the-art results on AuditBench suggest broad applicability. While Paper 1 makes meaningful contributions to mechanistic interpretability of jailbreaks with its local causal explanation framework, Paper 2 addresses a more fundamental and widely applicable safety challenge with a more scalable solution that could impact how the entire ecosystem of fine-tuned LLMs is audited.

vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

claude-opus-4.6 · 5/5/2026

Paper 2 introduces a novel and creative approach—introspection adapters—that enables LLMs to verbalize their own learned behaviors, addressing the critical problem of auditing fine-tuned models. This is a fundamentally new paradigm for AI safety and interpretability with broad implications. It demonstrates strong generalization (e.g., to AuditBench, encrypted finetuning attacks) and scales favorably. Paper 1, while thorough and practically useful, is primarily a benchmark contribution—an incremental improvement over existing agent evaluation frameworks. Paper 2's methodological novelty and its potential to reshape how we audit and understand LLM behaviors gives it higher scientific impact.

vs. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

gpt-5.2 · 5/5/2026

Paper 1 has higher likely impact due to strong timeliness and broad real-world applicability in LLM safety, auditing, and governance. The introspection-adapter approach is a concrete, scalable mechanism with clear evaluation (e.g., AuditBench SOTA, encrypted finetuning attack detection) and a plausible deployment path across many fine-tuned derivatives, affecting industry, security, and policy. Paper 2 is ambitious and potentially transformative, but end-to-end “automated discovery + paper writing” claims face higher skepticism and reproducibility/rigor burdens; demonstrated scope (two tasks) may limit near-term adoption and impact.

vs. SELFDOUBT: Uncertainty Quantification for Reasoning LLMs via the Hedge-to-Verify Ratio

gemini-3 · 5/5/2026

While Paper 1 offers a highly practical, cost-effective solution for uncertainty estimation in reasoning models, Paper 2 introduces a fundamentally novel paradigm for AI safety and alignment. By training an adapter to force models to verbalize their implanted behaviors, Paper 2 provides a scalable and innovative solution to the critical problem of auditing black-box fine-tunes and detecting hidden malicious capabilities, which has profound implications for AI governance and security.

vs. A Low-Latency Fraud Detection Layer for Detecting Adversarial Interaction Patterns in LLM-Powered Agents

gpt-5.2 · 5/5/2026

Paper 1 is more novel scientifically: it introduces introspection adapters as a scalable learning-based method for eliciting and auditing fine-tuned behaviors across many derived models, with strong generalization claims and state-of-the-art results on AuditBench plus relevance to finetuning API attack detection. This can broadly impact alignment, interpretability, model auditing, and deployment safety. Paper 2 is highly practical and timely for agent security, but relies on relatively standard supervised fraud-style classification over a synthetic dataset with engineered features, which may limit methodological novelty and generality across real-world settings.

vs. Unbiased Prevalence Estimation with Multicalibrated LLMs

gpt-5.2 · 5/5/2026

Paper 2 has higher likely impact due to a broadly applicable theoretical result: multicalibration suffices for unbiased prevalence estimation under covariate shift, connecting fairness theory to ubiquitous measurement/quantification problems across science, public health, and trust & safety. It addresses a foundational task (prevalence estimation) with clear real-world stakes and provides simulations plus two empirical case studies, suggesting methodological rigor and generality beyond LLMs. Paper 1 is timely and practically useful for LLM auditing, but its impact is more concentrated within LLM safety/auditing and depends on specific training constructs (implanted behaviors, shared base models).

vs. Beyond Text-Dominance: Understanding Modality Preference of Omni-modal Large Language Models

claude-opus-4.6 · 5/5/2026

Paper 1 introduces a novel and practically impactful method for auditing fine-tuned LLMs by training introspection adapters that generalize across diverse fine-tuning approaches. This addresses a critical AI safety challenge—detecting hidden or harmful behaviors in fine-tuned models—with broad implications for AI governance and deployment. Paper 2 provides valuable empirical insights into modality preferences in omni-modal LLMs but is more observational/diagnostic. Paper 1's approach is more actionable, addresses a more urgent problem (AI safety auditing), demonstrates strong generalization, and has clearer real-world deployment potential.

vs. Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

gpt-5.2 · 5/5/2026

Paper 1 is more novel and potentially higher impact because it proposes a scalable, broadly applicable mechanism (introspection LoRA adapters) for auditing fine-tuned LLM behaviors, with clear security/safety applications (hidden behaviors, encrypted finetuning API attacks) and demonstrated generalization (e.g., AuditBench SOTA). This addresses an urgent, high-stakes problem in AI governance and model deployment, with likely cross-field relevance (alignment, security, interpretability, MLOps). Paper 2 is rigorous and useful for reliability, but extends existing uncertainty ideas with ensemble disagreement, a more incremental advance with narrower immediate impact.