Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko, Mauro Conti, Matteo Zavatteri, Roberto Confalonieri

Rank: #1206 of 2682 · Artificial Intelligence
Tournament Score: 1420±44
Win Rate: 44% (7 wins, 9 losses, 16 matches)
Rating: 4.8/10

Abstract

In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.
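
To make the probing setup concrete, the following is a minimal sketch of a linear probe, assuming synthetic stand-ins for residual-stream activations. The data generator, the dimensions, and the names `make_activations` and `train_probe` are hypothetical and not taken from the paper; a real probe would be fit on activations extracted from a model at a chosen block.

```python
import math
import random

def make_activations(n, dim, shift, seed):
    """Synthetic 'activations' (assumption): label 1 is shifted along axis 0."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift if label == 1 else -shift
        data.append((x, label))
    return data

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(data, dim, lr=0.1, epochs=50):
    """Logistic-regression probe trained with plain SGD."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss w.r.t. the logit
            for j in range(dim):
                w[j] -= lr * g * x[j]
            b -= lr * g
    return w, b

def accuracy(data, w, b):
    correct = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) + b > 0) == (y == 1)
        for x, y in data
    )
    return correct / len(data)

train = make_activations(200, 8, shift=2.0, seed=0)
test = make_activations(100, 8, shift=2.0, seed=1)
w, b = train_probe(train, dim=8)
probe_accuracy = accuracy(test, w, b)
print(f"held-out probe accuracy: {probe_accuracy:.2f}")
```

In this toy setting the classes are separable by construction, so high held-out accuracy says nothing about real models; the sketch only shows what "linearly decodable" means operationally.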

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper makes two interrelated claims: (1) refusal behavior in LLMs is linearly decodable from intermediate residual-stream activations well before the final transformer layer, and (2) this signal is not merely diagnostic but *actionable* — it can replace full-model output evaluation in a jailbreak prompt search, yielding comparable attack success rates (ASR) with significant computational savings. The concrete instantiation is Mechanistic AutoDAN, which substitutes AutoDAN's log-likelihood fitness function (requiring full forward passes and target-string matching) with probe-based scoring from partial forward passes.
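
The cost argument behind partial forward passes can be sketched on a toy stack of blocks. Everything here is an assumption made for illustration: the depth `NUM_BLOCKS`, the probed block `PROBE_BLOCK`, the `toy_block` update, and the probe weights are hypothetical and do not reproduce the paper's implementation.

```python
import math

NUM_BLOCKS = 12   # hypothetical model depth
PROBE_BLOCK = 4   # hypothetical block where the probe reads the residual stream

def toy_block(x, idx):
    """Toy residual block: adds a small deterministic update to the state."""
    return [xi + 0.1 * math.tanh(xi + idx) for xi in x]

def run_blocks(x, num_blocks):
    """Run the first `num_blocks` blocks; return the state and the block count."""
    calls = 0
    for idx in range(num_blocks):
        x = toy_block(x, idx)
        calls += 1
    return x, calls

def probe_score(hidden, w, b):
    """Linear probe followed by a sigmoid, giving a score in (0, 1)."""
    z = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical probe weights and toy input vectors.
probe_w, probe_b = [1.0, -0.5, 0.25, 0.0], 0.0
candidates = [
    [0.5, 0.1, -0.2, 0.0],
    [-1.0, 0.3, 0.4, 0.2],
    [2.0, -0.7, 0.1, -0.3],
]

scores, partial_calls, full_calls = [], 0, 0
for x in candidates:
    hidden, calls = run_blocks(x, PROBE_BLOCK)   # stop at the probed block
    scores.append(probe_score(hidden, probe_w, probe_b))
    partial_calls += calls
    _, calls = run_blocks(x, NUM_BLOCKS)         # full pass, for comparison only
    full_calls += calls

saving = 1 - partial_calls / full_calls
print(f"block evaluations: {partial_calls} partial vs {full_calls} full")
print(f"fraction of block evaluations saved: {saving:.2f}")
```

The block-count ratio is only a rough proxy. The 72% figure reported in the paper is a measured per-iteration search-time reduction and need not equal this ratio.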

The novelty is moderate. The finding that safety-related features are linearly separable in intermediate activations is well-established by prior work (Arditi et al., 2024; Zou et al., 2023a; Li et al., 2025). The paper's incremental contribution is demonstrating that this signal can serve as a *fitness function* in a discrete genetic search loop, rather than being used for detection, defense, or direct activation manipulation. This is a meaningful but narrow extension — it shows operational utility of a known phenomenon in a specific attack framework.

2. Methodological Rigor

Strengths:

  • The experimental design includes appropriate controls: a template-only baseline, vanilla AutoDAN comparisons under identical hyperparameters, and an opposite-direction ablation that tests whether the probe signal is actually informative versus search success being dominated by synonym substitution.
  • Evaluation spans three models of different scales (3B, 4B, 27B) and two families (Llama, Qwen), providing some generalizability evidence.
  • The counterfactual dataset construction with clustered train-test splits is a reasonable approach to mitigate shortcut learning.
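
A clustered split keeps every variant derived from the same seed prompt on one side of the train-test boundary. The sketch below assumes each example already carries a cluster id; the field names and the `clustered_split` helper are hypothetical, and the paper's actual clustering procedure is not described in this review.

```python
import random

def clustered_split(examples, test_fraction=0.25, seed=0):
    """Split so that no cluster id appears in both train and test."""
    clusters = sorted({ex["cluster"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_ids = set(clusters[:n_test])
    train = [ex for ex in examples if ex["cluster"] not in test_ids]
    test = [ex for ex in examples if ex["cluster"] in test_ids]
    return train, test

# Toy dataset: 8 seed prompts (clusters), 4 counterfactual variants each.
examples = [
    {"text": f"variant {v} of prompt {c}", "cluster": c, "label": v % 2}
    for c in range(8)
    for v in range(4)
]

train, test = clustered_split(examples)
train_clusters = {ex["cluster"] for ex in train}
test_clusters = {ex["cluster"] for ex in test}
print(len(train), len(test), train_clusters & test_clusters)
```

Splitting by cluster rather than by example prevents a probe from scoring well simply because it saw a near-duplicate of a test item during training.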

Weaknesses:

  • The opposite-direction ablation is perhaps the most revealing result — and it undermines the core claim for smaller models. On Llama-3B and Qwen-4B, reversing the optimization direction barely affects ASR, meaning the probe signal is essentially irrelevant for these models. The authors acknowledge this but frame it as a "scale effect," which is somewhat generous.
  • Only one jailbreak method (AutoDAN) is tested. The generalizability to GCG, SAA, or other attack families is entirely unknown.
  • The statistical rigor is limited. Standard deviations in iteration counts are often comparable to or larger than means (Tables 7-9), making many comparisons statistically ambiguous. No significance tests are reported.
  • Population sizes differ across models (32, 16, 128) due to hardware constraints, complicating cross-model comparisons.
  • The LLM-as-a-judge evaluation using GPT-5-mini introduces an uncontrolled dependency; no inter-annotator agreement or calibration is reported.
  • Layer selection is acknowledged as ad hoc — held-out accuracy is "a weak criterion" for identifying useful layers, and Block 10 for Qwen-27B was chosen pragmatically rather than systematically.
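
To illustrate the statistical concern, the sketch below computes Welch's t statistic from summary statistics. The means, standard deviations, and sample sizes are made up for illustration and are not taken from Tables 7-9; `welch_t` is a hypothetical helper.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical iteration counts: standard deviations close to the means.
t, df = welch_t(mean1=20.0, sd1=18.0, n1=30, mean2=26.0, sd2=25.0, n2=30)
print(f"t = {t:.2f}, df = {df:.1f}")
```

With these made-up numbers, |t| is well below 2, so a 30% difference in mean iteration count would not be distinguishable from noise at conventional significance levels. This is the kind of check the reported comparisons lack.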

3. Potential Impact

The practical impact is limited but non-trivial. The 72% search time reduction on the 27B model is meaningful for researchers conducting red-teaming at scale. However, the requirement for white-box access to intermediate activations limits direct adversarial applicability. The transferability results partially address this but are mixed: probe-guided prompts sometimes transfer better, sometimes worse, with no clear systematic advantage.

The defensive implications are perhaps more interesting than the offensive ones. If refusal is reliably detectable at early layers, this could enable early-exit detection mechanisms or more efficient safety classifiers. The paper mentions this but does not explore it, which is a missed opportunity.
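
One way to picture an early-exit detector is sketched below, assuming a toy block stack and a hand-set linear probe. The paper does not describe such a mechanism; `early_exit_check`, the probe weights, and the threshold are hypothetical.

```python
import math

def toy_block(x, idx):
    """Toy residual block: adds a small deterministic update to the state."""
    return [xi + 0.1 * math.tanh(xi + idx) for xi in x]

def early_exit_check(x, probe_w, probe_b, probe_block, num_blocks, threshold=0.5):
    """Return (flagged, blocks_run); stop at probe_block if the probe fires."""
    for idx in range(num_blocks):
        x = toy_block(x, idx)
        if idx + 1 == probe_block:
            z = sum(w * h for w, h in zip(probe_w, x)) + probe_b
            score = 1.0 / (1.0 + math.exp(-z))
            if score >= threshold:
                return True, idx + 1  # flagged early, remaining blocks skipped
    return False, num_blocks

# Hypothetical probe that reads only the first coordinate.
probe_w, probe_b = [1.0, 0.0, 0.0, 0.0], 0.0

flagged_a, blocks_a = early_exit_check([3.0, 0.0, 0.0, 0.0], probe_w, probe_b, 4, 12)
flagged_b, blocks_b = early_exit_check([-3.0, 0.0, 0.0, 0.0], probe_w, probe_b, 4, 12)
print(flagged_a, blocks_a)
print(flagged_b, blocks_b)
```

Flagged inputs cost only the blocks up to the probe, while unflagged inputs run the full stack. Whether such a detector would be accurate enough in practice is exactly the question the paper leaves open.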

The broader influence on the field is incremental. The paper sits at the intersection of mechanistic interpretability and adversarial robustness, both active areas. It provides another data point supporting the linear representation hypothesis for safety-relevant features but doesn't substantially advance theoretical understanding of *why* or *how* these representations form.

4. Timeliness & Relevance

The paper addresses a timely topic: understanding and probing LLM safety mechanisms is of significant current interest. The connection between mechanistic interpretability and adversarial attacks is an emerging research direction. However, several concurrent works (JBShield, NeuroStrike, LatentBreak, SSR) already explore related territory, and this paper's contribution relative to that landscape is incremental.

5. Strengths & Limitations

Key Strengths:

  • Clean experimental setup with multiple baselines and ablations
  • The scale-dependent effect is a genuinely interesting finding: probe guidance matters more when the search problem is harder (larger, more robust models)
  • Practical efficiency gains are substantial for the 27B model
  • Honest and thorough limitations section

Key Limitations:

  • The core finding (linear decodability of refusal) is not novel
  • The probe signal is demonstrably uninformative for smaller models (opposite-direction ablation)
  • Single attack method limits generalizability claims
  • High variance in experimental results undermines statistical confidence
  • No defensive applications are explored despite being a natural extension
  • Code and data not yet released (promised after acceptance)

Additional Observations

The paper's framing as "detecting and exploiting refusal signals" slightly oversells what is demonstrated. The detection is well-established in prior work; the exploitation is shown only for one attack method and is only clearly meaningful at 27B scale. The most interesting scientific insight, that probe utility scales with model robustness, could have been explored more deeply with analysis of *why* early layers fail on larger models (e.g., what additional information develops between layers 1 and 10).

The dataset construction pipeline involving LLM-based counterfactual augmentation and LLM-based filtering introduces circular dependencies that could bias probe training, though the clustered splits partially mitigate this concern.

Rating: 4.8/10
Significance 4.5 · Rigor 5 · Novelty 4 · Clarity 7

Generated May 28, 2026

Comparison History (16)

    vs. VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact: it introduces a broadly useful, verifiable benchmark for open-web, multimodal travel-planning agents—an area with immediate real-world relevance and wide applicability to retrieval, grounding, evaluation, and agent reliability. The MRB+VKB design and cell-wise verification protocol can standardize evaluation across models and spur progress across multiple subfields. Paper 1 is novel mechanistic/safety work with practical attack-acceleration implications, but its impact is narrower (focused on refusal/jailbreak dynamics and specific probing/optimization) and may face deployment/ethics constraints that limit uptake.

    vs. Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness
    claude-opus-4.6 · 5/28/2026

    Paper 1 offers a more novel and mechanistically grounded contribution: demonstrating that refusal behavior is linearly decodable from intermediate LLM activations and leveraging this for efficient adversarial attacks. This has broad implications for AI safety, interpretability, and red-teaming, with a concrete 72% speedup. Paper 2 addresses an important but more incremental problem in XAI faithfulness verification with a narrower scope. Paper 1's findings about structured safety signals in intermediate representations are more likely to inspire follow-up research across multiple subfields of AI safety and mechanistic interpretability.

    vs. Prompt Codebooks: Discrete Compositional Optimization for Language Model Instruction Refinement
    gemini-3.1 · 5/28/2026

    Paper 1 introduces a foundational shift in prompt optimization by moving from monolithic prompts to a compositional, instance-aware approach. This has broad applicability across nearly all LLM workflows, improving both performance and efficiency. Paper 2 is highly relevant for AI safety and mechanistic interpretability, but its immediate impact is narrower, primarily benefiting red-teaming and alignment research. Therefore, Paper 1 has a higher potential for widespread adoption and cross-disciplinary impact.

    vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
    claude-opus-4.6 · 5/28/2026

    AIBuildAI-2 addresses a broader and more impactful problem—democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent achieves state-of-the-art results on MLE-Bench (70.7% medal rate) and demonstrates real-world competitiveness against human experts. The self-evolving knowledge system is a novel contribution with wide applicability. Paper 2, while technically interesting in revealing refusal signals in intermediate activations and offering efficiency gains for adversarial attacks, addresses a narrower problem in LLM safety/red-teaming with more limited breadth of impact across fields.

    vs. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a fundamental and timely problem in AI safety for healthcare: chain-of-thought distillation can improve answer accuracy while degrading reasoning quality. This has broad implications for deploying LLMs in high-stakes domains, challenges widely-used evaluation practices, and could reshape how the community evaluates distilled models. The finding that accuracy and reasoning faithfulness can diverge is a critical insight with immediate practical consequences. Paper 1, while technically sound, represents an incremental improvement to adversarial attack efficiency rather than revealing a fundamentally new problem, limiting its broader impact.

    vs. Finite-Time Analysis of MCTS in Continuous POMDP Planning
    claude-opus-4.6 · 5/28/2026

    Paper 1 offers a novel intersection of mechanistic interpretability and adversarial attacks on LLMs, showing refusal signals are linearly decodable in intermediate activations and exploitable for efficient jailbreaking. This is highly timely given the surge in LLM safety research, has immediate practical implications for both red-teaming and defense, and opens new research directions connecting interpretability with adversarial robustness. Paper 2 provides valuable theoretical contributions to MCTS in continuous POMDPs, but addresses a more niche audience. Paper 1's broader relevance to the rapidly growing AI safety community gives it higher impact potential.

    vs. PetroBench: A Benchmark for Large Language Models in Petroleum Engineering
    gemini-3.1 · 5/28/2026

    Paper 1 addresses fundamental AI safety and mechanistic interpretability, proposing a novel method to detect and exploit LLM refusal signals prior to output generation. This has broad, high-impact implications for AI red-teaming, alignment, and security. In contrast, Paper 2 introduces a domain-specific benchmark for petroleum engineering. While practically useful for that specific industry, its methodological novelty is lower and its scientific impact is significantly narrower compared to the foundational AI safety research presented in Paper 1.

    vs. From Feasible to Practical: Pareto-Optimal Synthesis Planning
    claude-opus-4.6 · 5/28/2026

    Paper 1 introduces a fundamentally new formulation of synthesis planning as a multi-objective optimization problem with theoretical guarantees, addressing a significant gap between CASP tools and real-world chemical practice. It has broad industrial applicability across pharmaceuticals, materials science, and green chemistry. Paper 2 makes a solid contribution to LLM safety/interpretability by showing refusal signals are linearly decodable in intermediate layers and exploiting this for faster adversarial attacks, but it is more incremental—building on existing probing and AutoDAN methods—and has narrower impact scope within AI safety. Paper 1's cross-disciplinary relevance and practical utility give it higher potential impact.

    vs. OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol
    gemini-3.1 · 5/28/2026

    While Paper 1 offers valuable optimizations for LLM red-teaming, Paper 2 presents a foundational, open-source hardware implementation of a next-generation interconnect standard (Unified Bus). By drastically reducing RDMA latency (4.37x improvement) and connection state overhead, OpenURMA tackles a critical bottleneck in modern datacenter and distributed AI training architectures. Providing open RTL and simulation frameworks for a previously closed-silicon specification unlocks broad, long-term systems research and hardware innovation, likely yielding a more profound and lasting scientific impact across high-performance computing and networking.

    vs. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
    claude-opus-4.6 · 5/28/2026

    SAGE addresses a fundamental and broadly impactful bottleneck—long-term memory for language agents—with a novel self-evolving graph memory framework combining a writer and Graph Foundation Model-based reader with feedback loops. It demonstrates strong results across diverse benchmarks (multi-hop QA, open-domain retrieval, agent memory, hallucination detection), suggesting broad applicability. Paper 1 makes a solid contribution to AI safety/red-teaming by showing refusal signals are linearly decodable in intermediate activations, but its scope is narrower (adversarial attacks on alignment). SAGE's architecture has wider potential to influence agent design, retrieval systems, and memory-augmented LLMs across many domains.

    vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning
    claude-opus-4.6 · 5/28/2026

    TRACER addresses a fundamental challenge in combining reinforcement learning with multi-agent LLM reasoning, introducing a novel framework grounded in game theory (regret matching) with mathematical convergence guarantees. It tackles multiple important problems (sparse rewards, free-riding, fixed protocols) and demonstrates generalization across benchmarks. Paper 2 presents interesting mechanistic interpretability findings about refusal signals, but its scope is narrower—optimizing an existing attack method (AutoDAN) with efficiency gains. TRACER's broader applicability to cooperative multi-agent systems, novel theoretical contributions, and potential to reshape how LLMs collaborate give it higher impact potential.

    vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning
    gpt-5.2 · 5/28/2026

    Paper 2 has higher potential impact due to a clearer, timely real-world application: understanding and exploiting safety/refusal mechanisms. It demonstrates an actionable mechanistic signal (refusal decodable pre-output) and leverages it to materially improve an established attack method (AutoDAN) with large efficiency gains and competitive success, implying broader relevance to safety evaluation, red-teaming, and defense design. Paper 1 offers valuable mechanistic insights into agent depth usage, but its immediate applications and cross-field consequences are less direct than Paper 2’s security- and governance-relevant contributions.

    vs. Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory
    claude-opus-4.6 · 5/28/2026

    Paper 1 introduces a fundamentally new conceptual framework (GEM) for agent memory as a data-management workload, addressing a critical gap in the rapidly growing AI agent ecosystem. It formalizes correctness conditions and proves limitations of existing paradigms, opening entirely new research directions. Paper 2 makes a solid but more incremental contribution—applying linear probes to detect refusal signals and using them for adversarial attacks. While useful for AI safety, it builds on well-established mechanistic interpretability techniques. Paper 1's broader scope, formal foundations, and relevance to the booming agent infrastructure space give it higher long-term impact potential.

    vs. Can LLMs Introspect? A Reality Check
    gemini-3.1 · 5/28/2026

    Paper 1 offers a highly actionable, novel methodology (Mechanistic AutoDAN) that significantly improves the efficiency of LLM red-teaming and provides concrete insights into mechanistic interpretability. While Paper 2 provides an important critical analysis of LLM metacognition, Paper 1 introduces a practical tool and optimization technique that will likely see broader immediate adoption and follow-up work in the rapidly growing field of AI safety evaluation.

    vs. Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?
    gpt-5.2 · 5/28/2026

    Paper 2 has higher estimated impact due to strong novelty and timeliness in mechanistic interpretability and LLM safety: it shows refusal is linearly decodable from intermediate activations and exploits this signal to speed jailbreak prompt search via probe-guided optimization. The method is broadly applicable across models and directly informs both attack and defense research, with cross-field relevance (security, interpretability, alignment). Paper 1 is valuable for biomedical hypothesis generation, but impact may be more domain-specific and dependent on downstream validation and dataset/schema integration details.

    vs. LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
    gemini-3.1 · 5/28/2026

    Paper 2 provides fundamental insights into LLM interpretability and safety by demonstrating early refusal detection in intermediate activations. Its novel mechanistic adversarial attack significantly improves red-teaming efficiency. This addresses critical challenges in AI alignment and security, offering broader foundational implications across AI development compared to Paper 1's domain-specific educational benchmark, despite its useful dynamic design.