Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations
Matteo Gioele Collu, Riccardo Conte, Alberto Giaretta, Denis Kleyko, Mauro Conti, Matteo Zavatteri, Roberto Confalonieri
Abstract
In this paper, we investigate whether refusal behavior can be predicted from LLM intermediate activations before decoding using linear probes trained on residual stream activations at each transformer block. We find that refusal is linearly decodable well before the final layer, indicating that safety-relevant behavior is represented in intermediate activations before output generation. To test whether this signal is actionable, we introduce Mechanistic AutoDAN, a probe-guided variant of AutoDAN that replaces full-model fitness evaluation with partial forward passes and probe-based scoring inside a genetic prompt search loop. Across the evaluated models, our method achieves attack success rates competitive with vanilla AutoDAN while reducing per-iteration search time by up to 72%, and probe-guided prompts match or exceed AutoDAN's cross-model transfer in several configurations. We further find that the usefulness of probe guidance increases with model scale. Our results show that refusal is not only observable at the output level, but is encoded as a structured and actionable signal in intermediate LLM activations.
AI Impact Assessments
Scientific Impact Assessment
1. Core Contribution
The paper makes two interrelated claims: (1) refusal behavior in LLMs is linearly decodable from intermediate residual-stream activations well before the final transformer layer, and (2) this signal is not merely diagnostic but *actionable* — it can replace full-model output evaluation in a jailbreak prompt search, yielding comparable attack success rates (ASR) with significant computational savings. The concrete instantiation is Mechanistic AutoDAN, which substitutes AutoDAN's log-likelihood fitness function (requiring full forward passes and target-string matching) with probe-based scoring from partial forward passes.
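To make the first claim concrete, the following is a minimal sketch of layer-wise linear probing. It is not the authors' pipeline: the activations are synthetic Gaussian vectors, the per-layer separation values in `layer_signals` are hypothetical, and all function names are introduced here for illustration. It only shows the general pattern of fitting one logistic-regression probe per layer and comparing held-out accuracy across depth.

```python
import math
import random


def make_layer_data(n, dim, signal, rng):
    """Synthetic 'activations': the refusal label shifts coordinate 0 by +/- signal."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal if y else -signal
        data.append((x, y))
    return data


def train_probe(data, dim, lr=0.1, epochs=50):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp for numerical stability
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log-loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def accuracy(probe, data):
    """Fraction of examples whose predicted label matches the true label."""
    w, b = probe
    hits = 0
    for x, y in data:
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += (z > 0) == bool(y)
    return hits / len(data)


def probe_accuracy_by_layer(signals, dim=8, n_train=200, n_test=200, seed=0):
    """Train and evaluate one probe per (synthetic) layer."""
    rng = random.Random(seed)
    accs = []
    for s in signals:
        train = make_layer_data(n_train, dim, s, rng)
        test = make_layer_data(n_test, dim, s, rng)
        accs.append(accuracy(train_probe(train, dim), test))
    return accs


# Hypothetical separation per layer: no signal early, strong signal later.
layer_signals = [0.0, 0.5, 1.5, 3.0]
accs = probe_accuracy_by_layer(layer_signals)
for layer, acc in enumerate(accs):
    print(f"layer {layer}: held-out probe accuracy = {acc:.2f}")
```

In a real study the synthetic vectors would be replaced by residual-stream activations captured at each transformer block, and the accuracy-by-layer curve is what would indicate how early refusal becomes linearly decodable.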
The novelty is moderate. The finding that safety-related features are linearly separable in intermediate activations is well-established by prior work (Arditi et al., 2024; Zou et al., 2023a; Li et al., 2025). The paper's incremental contribution is demonstrating that this signal can serve as a *fitness function* in a discrete genetic search loop, rather than being used for detection, defense, or direct activation manipulation. This is a meaningful but narrow extension — it shows operational utility of a known phenomenon in a specific attack framework.
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The practical impact is limited but non-trivial. The 72% search time reduction on the 27B model is meaningful for researchers conducting red-teaming at scale. However, the requirement for white-box access to intermediate activations limits direct adversarial applicability. The transferability results partially address this but are mixed — probe-guided prompts sometimes transfer better, sometimes worse, with no clear systematic advantage.
The defensive implications are perhaps more interesting than the offensive ones. If refusal is reliably detectable at early layers, this could enable early-exit detection mechanisms or more efficient safety classifiers. The paper mentions this but does not explore it, which is a missed opportunity.
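As a rough illustration of the early-exit idea, the sketch below runs a stack of toy blocks and consults a linear probe after a fixed number of them, stopping if the probe's refusal probability crosses a threshold. Everything here is an assumption made for illustration: the blocks are simple residual updates rather than transformer layers, and `early_exit_refusal`, `make_block`, the probe weights, and the threshold are hypothetical. The paper does not describe such a mechanism.

```python
import math


def early_exit_refusal(x, layers, probe, probe_layer, threshold=0.5):
    """Run `layers` in order and consult the probe after `probe_layer` blocks.

    Returns (refused_early, layers_executed, final_state).
    """
    w, b = probe
    h = list(x)
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i == probe_layer:
            z = sum(wi * hi for wi, hi in zip(w, h)) + b
            p_refuse = 1.0 / (1.0 + math.exp(-z))
            if p_refuse >= threshold:
                return True, i, h  # stop here; remaining blocks are skipped
    return False, len(layers), h


def make_block(scale):
    """Hypothetical toy block: adds a small residual update to each coordinate."""
    return lambda h: [v + scale * math.tanh(v) for v in h]


layers = [make_block(0.1) for _ in range(6)]
probe = ([1.0, 0.0, 0.0], 0.0)  # hypothetical probe reading coordinate 0

flagged = early_exit_refusal([2.0, 0.0, 0.0], layers, probe, probe_layer=2)
passed = early_exit_refusal([-2.0, 0.0, 0.0], layers, probe, probe_layer=2)
print(flagged[0], flagged[1])
print(passed[0], passed[1])
```

The design choice worth noting is the trade-off in `probe_layer`: an earlier probe saves more computation on flagged inputs but, per the review's own observation about early layers on larger models, may be less reliable.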
The broader influence on the field is incremental. The paper sits at the intersection of mechanistic interpretability and adversarial robustness, both active areas. It provides another data point supporting the linear representation hypothesis for safety-relevant features but does not substantially advance theoretical understanding of *why* or *how* these representations form.
4. Timeliness & Relevance
The paper addresses a timely topic — understanding and probing LLM safety mechanisms is of significant current interest. The connection between mechanistic interpretability and adversarial attacks is an emerging research direction. However, several concurrent works (JBShield, NeuroStrike, LatentBreak, SSR) already explore related territory, and this paper's contribution relative to that landscape is incremental.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Additional Observations
The paper's framing as "detecting and exploiting refusal signals" slightly oversells what is demonstrated. The detection is well-established in prior work; the exploitation is shown only for one attack method and is only clearly meaningful at 27B scale. The most interesting scientific insight — that probe utility scales with model robustness — could have been explored more deeply with analysis of *why* early layers fail on larger models (e.g., what additional information develops between layers 1 and 10).
The dataset construction pipeline involving LLM-based counterfactual augmentation and LLM-based filtering introduces circular dependencies that could bias probe training, though the clustered splits partially mitigate this concern.
Generated May 28, 2026
Comparison History (16)
Paper 2 likely has higher impact: it introduces a broadly useful, verifiable benchmark for open-web, multimodal travel-planning agents—an area with immediate real-world relevance and wide applicability to retrieval, grounding, evaluation, and agent reliability. The MRB+VKB design and cell-wise verification protocol can standardize evaluation across models and spur progress across multiple subfields. Paper 1 is novel mechanistic/safety work with practical attack-acceleration implications, but its impact is narrower (focused on refusal/jailbreak dynamics and specific probing/optimization) and may face deployment/ethics constraints that limit uptake.
Paper 1 offers a more novel and mechanistically grounded contribution: demonstrating that refusal behavior is linearly decodable from intermediate LLM activations and leveraging this for efficient adversarial attacks. This has broad implications for AI safety, interpretability, and red-teaming, with a concrete 72% speedup. Paper 2 addresses an important but more incremental problem in XAI faithfulness verification with a narrower scope. Paper 1's findings about structured safety signals in intermediate representations are more likely to inspire follow-up research across multiple subfields of AI safety and mechanistic interpretability.
Paper 1 introduces a foundational shift in prompt optimization by moving from monolithic prompts to a compositional, instance-aware approach. This has broad applicability across nearly all LLM workflows, improving both performance and efficiency. Paper 2 is highly relevant for AI safety and mechanistic interpretability, but its immediate impact is narrower, primarily benefiting red-teaming and alignment research. Therefore, Paper 1 has a higher potential for widespread adoption and cross-disciplinary impact.
AIBuildAI-2 addresses a broader and more impactful problem—democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent achieves state-of-the-art results on MLE-Bench (70.7% medal rate) and demonstrates real-world competitiveness against human experts. The self-evolving knowledge system is a novel contribution with wide applicability. Paper 2, while technically interesting in revealing refusal signals in intermediate activations and offering efficiency gains for adversarial attacks, addresses a narrower problem in LLM safety/red-teaming with more limited breadth of impact across fields.
Paper 2 addresses a fundamental and timely problem in AI safety for healthcare: chain-of-thought distillation can improve answer accuracy while degrading reasoning quality. This has broad implications for deploying LLMs in high-stakes domains, challenges widely-used evaluation practices, and could reshape how the community evaluates distilled models. The finding that accuracy and reasoning faithfulness can diverge is a critical insight with immediate practical consequences. Paper 1, while technically sound, represents an incremental improvement to adversarial attack efficiency rather than revealing a fundamentally new problem, limiting its broader impact.
Paper 1 offers a novel intersection of mechanistic interpretability and adversarial attacks on LLMs, showing refusal signals are linearly decodable in intermediate activations and exploitable for efficient jailbreaking. This is highly timely given the surge in LLM safety research, has immediate practical implications for both red-teaming and defense, and opens new research directions connecting interpretability with adversarial robustness. Paper 2 provides valuable theoretical contributions to MCTS in continuous POMDPs, but addresses a more niche audience. Paper 1's broader relevance to the rapidly growing AI safety community gives it higher impact potential.
Paper 1 addresses fundamental AI safety and mechanistic interpretability, proposing a novel method to detect and exploit LLM refusal signals prior to output generation. This has broad, high-impact implications for AI red-teaming, alignment, and security. In contrast, Paper 2 introduces a domain-specific benchmark for petroleum engineering. While practically useful for that specific industry, its methodological novelty is lower and its scientific impact is significantly narrower compared to the foundational AI safety research presented in Paper 1.
Paper 1 introduces a fundamentally new formulation of synthesis planning as a multi-objective optimization problem with theoretical guarantees, addressing a significant gap between CASP tools and real-world chemical practice. It has broad industrial applicability across pharmaceuticals, materials science, and green chemistry. Paper 2 makes a solid contribution to LLM safety/interpretability by showing refusal signals are linearly decodable in intermediate layers and exploiting this for faster adversarial attacks, but it is more incremental—building on existing probing and AutoDAN methods—and has narrower impact scope within AI safety. Paper 1's cross-disciplinary relevance and practical utility give it higher potential impact.
While Paper 1 offers valuable optimizations for LLM red-teaming, Paper 2 presents a foundational, open-source hardware implementation of a next-generation interconnect standard (Unified Bus). By drastically reducing RDMA latency (4.37x improvement) and connection state overhead, OpenURMA tackles a critical bottleneck in modern datacenter and distributed AI training architectures. Providing open RTL and simulation frameworks for a previously closed-silicon specification unlocks broad, long-term systems research and hardware innovation, likely yielding a more profound and lasting scientific impact across high-performance computing and networking.
SAGE addresses a fundamental and broadly impactful bottleneck—long-term memory for language agents—with a novel self-evolving graph memory framework combining a writer and Graph Foundation Model-based reader with feedback loops. It demonstrates strong results across diverse benchmarks (multi-hop QA, open-domain retrieval, agent memory, hallucination detection), suggesting broad applicability. Paper 1 makes a solid contribution to AI safety/red-teaming by showing refusal signals are linearly decodable in intermediate activations, but its scope is narrower (adversarial attacks on alignment). SAGE's architecture has wider potential to influence agent design, retrieval systems, and memory-augmented LLMs across many domains.
TRACER addresses a fundamental challenge in combining reinforcement learning with multi-agent LLM reasoning, introducing a novel framework grounded in game theory (regret matching) with mathematical convergence guarantees. It tackles multiple important problems (sparse rewards, free-riding, fixed protocols) and demonstrates generalization across benchmarks. Paper 2 presents interesting mechanistic interpretability findings about refusal signals, but its scope is narrower—optimizing an existing attack method (AutoDAN) with efficiency gains. TRACER's broader applicability to cooperative multi-agent systems, novel theoretical contributions, and potential to reshape how LLMs collaborate give it higher impact potential.
Paper 2 has higher potential impact due to a clearer, timely real-world application: understanding and exploiting safety/refusal mechanisms. It demonstrates an actionable mechanistic signal (refusal decodable pre-output) and leverages it to materially improve an established attack method (AutoDAN) with large efficiency gains and competitive success, implying broader relevance to safety evaluation, red-teaming, and defense design. Paper 1 offers valuable mechanistic insights into agent depth usage, but its immediate applications and cross-field consequences are less direct than Paper 2’s security- and governance-relevant contributions.
Paper 1 introduces a fundamentally new conceptual framework (GEM) for agent memory as a data-management workload, addressing a critical gap in the rapidly growing AI agent ecosystem. It formalizes correctness conditions and proves limitations of existing paradigms, opening entirely new research directions. Paper 2 makes a solid but more incremental contribution—applying linear probes to detect refusal signals and using them for adversarial attacks. While useful for AI safety, it builds on well-established mechanistic interpretability techniques. Paper 1's broader scope, formal foundations, and relevance to the booming agent infrastructure space give it higher long-term impact potential.
Paper 1 offers a highly actionable, novel methodology (Mechanistic AutoDAN) that significantly improves the efficiency of LLM red-teaming and provides concrete insights into mechanistic interpretability. While Paper 2 provides an important critical analysis of LLM metacognition, Paper 1 introduces a practical tool and optimization technique that will likely see broader immediate adoption and follow-up work in the rapidly growing field of AI safety evaluation.
Paper 2 has higher estimated impact due to strong novelty and timeliness in mechanistic interpretability and LLM safety: it shows refusal is linearly decodable from intermediate activations and exploits this signal to speed jailbreak prompt search via probe-guided optimization. The method is broadly applicable across models and directly informs both attack and defense research, with cross-field relevance (security, interpretability, alignment). Paper 1 is valuable for biomedical hypothesis generation, but impact may be more domain-specific and dependent on downstream validation and dataset/schema integration details.
Paper 2 provides fundamental insights into LLM interpretability and safety by demonstrating early refusal detection in intermediate activations. Its novel mechanistic adversarial attack significantly improves red-teaming efficiency. This addresses critical challenges in AI alignment and security and offers broader foundational implications across AI development than Paper 1's domain-specific educational benchmark, whose dynamic design is nonetheless useful.