Latent-space Attacks for Refusal Evasion in Language Models
Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio
Abstract
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
AI Impact Assessments
Scientific Impact Assessment: "Latent-space Attacks for Refusal Evasion in Language Models"
1. Core Contribution
This paper makes two interconnected contributions. First, it provides a theoretical reframing of existing refusal-ablation methods (DiM, RDO, PS, MD) as latent-space evasion attacks against linear probes. Under this lens, the widely-used difference-in-means ablation is shown to be equivalent to the DeepFool minimum-confidence attack—a projection onto the linear decision boundary separating refused from compliant representations. This is an elegant insight: it unifies disparate methods under a single adversarial ML framework and immediately reveals their shared limitation—they only push representations to the decision boundary, not past it.
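The equivalence described above can be checked on synthetic vectors. The sketch below is a toy illustration, not the paper's implementation: the function names, the Gaussian "activations", and the bias-free probe (b = 0) are all assumptions made for the example. It builds a difference-in-means direction, applies the minimum-norm projection onto the probe's hyperplane, and compares the result with plain directional ablation.

```python
import numpy as np

def diff_in_means_direction(refused, answered):
    """Direction from the mean answered activation to the mean refused activation."""
    return refused.mean(axis=0) - answered.mean(axis=0)

def project_to_boundary(h, w, b=0.0):
    """Minimum-norm shift of each row of h onto the hyperplane w.h + b = 0."""
    score = h @ w + b                        # signed probe score per row
    return h - np.outer(score, w) / (w @ w)

def ablate_direction(h, w):
    """Directional ablation: remove the component of each row of h along w."""
    w_hat = w / np.linalg.norm(w)
    return h - np.outer(h @ w_hat, w_hat)

# Synthetic stand-ins for activations (assumption: two Gaussian clusters).
rng = np.random.default_rng(0)
d = 16
shift = np.zeros(d)
shift[0] = 3.0
refused = rng.normal(size=(200, d)) + shift
answered = rng.normal(size=(200, d)) - shift

w = diff_in_means_direction(refused, answered)
projected = project_to_boundary(refused, w)   # b = 0: hyperplane through the origin
ablated = ablate_direction(refused, w)

print("max |score| after projection:", np.abs(projected @ w).max())
print("projection equals ablation:", np.allclose(projected, ablated))
```

With a zero bias the two operations coincide, and every projected vector has a probe score of zero: it sits on the boundary rather than on the other side of it, which is the limitation the review points to.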
Second, building on this insight, the paper proposes Controlled Latent-space Evasion (CLE), which adds an optimized confidence margin to push representations into the compliant region. Two variants are introduced: CLE-P (projective, reprojecting every token) and CLE-A (additive, computing perturbation once and reusing it). The surprising finding that CLE-A outperforms CLE-P suggests that continuous reprojection is unnecessary and potentially harmful—a fixed additive shift suffices to sustain evasion throughout generation.
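The difference between the two variants can be pictured with a toy geometric sketch. Everything here is an assumption for illustration: the helper names, the random vectors, the hand-picked margin, and the choice of the first token as the reference for the fixed shift. The review only states that CLE-P reprojects every token while CLE-A computes a perturbation once and reuses it; how the paper optimizes the margin and selects the reference is not shown.

```python
import numpy as np

def project_with_margin(h, w, margin, b=0.0):
    """Move each row of h to signed distance -margin from the hyperplane w.h + b = 0."""
    norm = np.linalg.norm(w)
    dist = (h @ w + b) / norm                 # signed distance per row
    return h - np.outer(dist + margin, w / norm)

def additive_shift(h_ref, w, margin, b=0.0):
    """One fixed perturbation, computed from a single reference vector h_ref."""
    norm = np.linalg.norm(w)
    dist = (h_ref @ w + b) / norm
    return -(dist + margin) * w / norm

rng = np.random.default_rng(1)
d, n_tokens = 16, 5
w = rng.normal(size=d)
unit = w / np.linalg.norm(w)
tokens = rng.normal(size=(n_tokens, d)) + 2.0 * unit   # synthetic token vectors

margin = 1.5                                           # hand-picked, not optimized
projective = project_with_margin(tokens, w, margin)    # recomputed for every token
delta = additive_shift(tokens[0], w, margin)           # computed once
additive = tokens + delta                              # same shift reused per token

print("projective distances:", projective @ unit)
print("additive distances:  ", additive @ unit)
```

In this toy setting the projective variant pins every token at the same distance past the boundary, while the additive variant translates all tokens by one vector and so preserves their relative spread along the probe direction.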
2. Methodological Rigor
The theoretical framework is cleanly developed. The decomposition of the gradient into direct and indirect components (Eq. 5), the truncated gradient approximation, and the reduction of arbitrary latent perturbations to 2L scalar parameters are well-motivated. The connection between margin and logistic confidence (Appendix E) is mathematically precise.
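For readers without access to Appendix E, the margin-to-confidence link can be written out for a generic logistic probe. This is a standard identity under assumed notation (probe weights w, bias b, signed distance d, margin c), and may not match the paper's own symbols or parameterization.

```latex
\begin{align}
  p(\text{refuse} \mid h) &= \sigma\!\left(w^\top h + b\right),
  & \sigma(z) &= \frac{1}{1 + e^{-z}}, \\
  d(h) &= \frac{w^\top h + b}{\lVert w \rVert_2},
  & p(\text{refuse} \mid h) &= \sigma\!\left(\lVert w \rVert_2 \, d(h)\right), \\
  d(h') &= -c
  & \Longrightarrow\quad p(\text{comply} \mid h')
    &= 1 - \sigma\!\left(-c \,\lVert w \rVert_2\right)
     = \sigma\!\left(c \,\lVert w \rVert_2\right).
\end{align}
```

Setting c = 0 gives a confidence of exactly 0.5, which corresponds to a representation lying on the decision boundary; any c > 0 raises the probe's confidence in the compliant class monotonically.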
However, several methodological choices warrant scrutiny:
3. Potential Impact
For AI safety research: This work is significant because it demonstrates that current safety alignment based on linearly separable refusal representations is fundamentally fragile. The fact that a simple linear probe + margin optimization can defeat alignment across 15 diverse models (including defended models like Mistral-7B-RR and reasoning models like DeepSeek-R1-8B) is a strong negative result for the robustness of current alignment approaches.
For adversarial ML: The connection between refusal ablation and DeepFool-style evasion attacks is theoretically valuable, bridging interpretability/alignment literature with classical adversarial robustness. This cross-pollination could inspire new attack-defense dynamics.
For defenses: The paper explicitly suggests that alignment procedures should discourage linear separability of refusal representations. This is actionable guidance—though implementing it without degrading model quality is non-trivial.
Dual-use concerns: The attack is powerful, computationally cheap at inference time (single forward pass after optimization), and universal across prompts. The white-box access requirement limits immediate misuse to open-weight models, but the growing availability of such models makes this a real concern.
4. Timeliness & Relevance
This work arrives at a critical juncture. The rapid deployment of open-weight LLMs, combined with increasing evidence that safety alignment is brittle, makes understanding and characterizing failure modes urgent. The paper addresses the specific bottleneck of explaining *why* refusal ablation works—prior work was largely empirical. The extension to reasoning and multimodal models (including recent architectures like Qwen3.5, Phi-4, Gemma3) demonstrates relevance to the current model landscape.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
6. Additional Observations
The qualitative examples (Appendix J) are striking—models produce detailed harmful content that would clearly fail safety standards. The Mistral-7B-RR example is particularly notable: the defended model produces garbled refusals without CLE but coherent harmful content with CLE-P, suggesting the defense primarily disrupts refusal token generation rather than fundamentally changing representation geometry.
The paper's framing as an "attack" paper rather than a "defense" paper is appropriate—it advances understanding of vulnerabilities. However, the defensive implications (Section 6) remain speculative.
Generated May 22, 2026
Comparison History (17)
Paper 2 addresses a critical and highly active area of AI safety: model alignment and jailbreaking. By providing a principled framework for latent-space attacks and demonstrating state-of-the-art success rates across 15 diverse models, it exposes severe vulnerabilities in current safety mechanisms. This will likely drive significant follow-up research in both mechanistic interpretability and robust alignment, giving it a broader and more urgent impact than Paper 1's focus on XAI evaluation metrics.
Paper 1 addresses a fundamental scientific problem in AI safety by reframing LLM refusal suppression as a latent-space evasion attack, advancing theoretical understanding and empirical evaluation across 15 models. Paper 2, while offering a highly useful engineering framework for reducing API boilerplate, is primarily a software engineering tool rather than a contribution to fundamental scientific knowledge. Consequently, Paper 1 has much higher potential for broad scientific impact in the critical field of AI alignment and security.
Paper 1 addresses AI safety and LLM jailbreaking, a critically important field with broad implications for the safe deployment of foundation models. Its novel theoretical framework for understanding refusal suppression provides fundamental insights that could significantly influence future alignment strategies. While Paper 2 offers strong contributions to collaborative autonomous driving, the widespread use of LLMs and the urgent need for robust safety mechanisms give Paper 1 a broader and more immediate scientific impact.
Paper 2 likely has higher scientific impact: it proposes a broadly useful architectural improvement for efficient long-context modeling (decoupled erase/write gating in linear attention), provides algorithmic/optimization contributions (chunkwise WY, gate-aware backward pass), and reports large-scale training with consistent gains across multiple benchmarks and settings, enabling real-world deployment benefits (constant-memory decoding, long-context retrieval). Paper 1 is insightful and timely for safety research, but its primary contribution is an attack/analysis on refusal mechanisms with narrower positive application scope and potentially shorter-lived impact as defenses evolve.
Paper 1 addresses a critical bottleneck in scaling reinforcement learning for multimodal reasoning by introducing role-aware token-level credit assignment. Improving how large vision-language models learn to reason from verifiable rewards has massive implications for developing more capable, reliable AI systems. While Paper 2 offers valuable insights into AI safety and jailbreaking, Paper 1's methodology fundamentally advances model capabilities in a highly active and impactful frontier (RL-based reasoning for multimodal models), likely leading to broader adoption and downstream applications.
Paper 1 addresses a critical and highly timely issue in AI safety: LLM jailbreaking and alignment. By providing a novel theoretical framing of refusal suppression as latent-space evasion and achieving state-of-the-art results across 15 modern models, it has profound implications for the secure deployment of foundation models. In contrast, Paper 2 applies standard reinforcement learning (PPO) to a classic operations research problem (FJSP). While practically useful, Paper 2 represents an incremental methodological advance with a narrower domain of impact compared to the broad, urgent relevance of Paper 1 in generative AI.
Paper 2 likely has higher scientific impact due to its clearer conceptual reframing of refusal suppression as a principled latent-space evasion attack, unifying prior heuristics with probe geometry and yielding a general, extensible attack method. It demonstrates broad empirical validation across 15 diverse models (including multimodal/reasoning) and sets new state-of-the-art attack success, making it timely and highly relevant to safety research and evaluation. Paper 1 is promising for alignment architecture, but its impact is narrower, and robustness/rigor beyond benchmark harm reduction is less clearly established from the abstract.
Paper 1 provides a critical theoretical foundation for understanding refusal suppression in AI safety. By recasting refusal ablation as a latent-space evasion attack, it unifies empirical observations and proposes a state-of-the-art attack. This conceptual breakthrough will broadly impact how the community evaluates and improves LLM safety alignment, a highly urgent and relevant field. While Paper 2 offers significant efficiency improvements for multimodal reasoning, Paper 1's contribution addresses foundational security vulnerabilities across a wide range of state-of-the-art models, suggesting broader scientific impact.
AI safety and jailbreak robustness are currently critical, high-priority areas in LLM research. Paper 2 provides a principled theoretical reframing of refusal suppression and demonstrates state-of-the-art results across 15 foundational models, offering broad applicability and significant implications for AI alignment. While Paper 1 presents an innovative approach to proactive task-oriented dialogue, its impact is concentrated in a narrower application domain (e.g., sales agents), making Paper 2 more timely and universally impactful.
Paper 1 addresses a critical and highly active area of research: LLM safety, alignment, and jailbreaking. By providing a principled understanding of refusal suppression in latent space and demonstrating state-of-the-art attack success across many models, it exposes fundamental vulnerabilities in current AI systems. While Paper 2 offers a valuable framework for embodied AI simulation, the broader implications, urgency, and cross-disciplinary relevance of securing foundational models give Paper 1 a higher potential for immediate and widespread scientific impact.
Paper 1 addresses a highly active and critical area (LLM safety and jailbreaking), offering a novel theoretical perspective and achieving state-of-the-art results across 15 modern models. Its findings have immediate, widespread applicability in AI safety. Paper 2, while methodologically sound, focuses on a much more niche subfield (formal assurance arguments), limiting its breadth of impact and real-world adoption compared to Paper 1.
Paper 2 has higher likely impact: it offers a principled theoretical reframing of refusal suppression (as latent-space evasion against linear probes) and derives a stronger, controllable attack that achieves state-of-the-art results across many model families, making it broadly relevant to LLM security, alignment, and interpretability. Its methodological rigor and generality (clear formalization + extensive multi-model evaluation) support wide adoption and follow-on work. Paper 1 is timely and useful for political speech analysis, but is based on a small single-speech case study and narrower domain scope.
Paper 1 addresses a critical and highly active area of AI safety (LLM alignment and jailbreaking). By providing a principled theoretical framework for refusal suppression and demonstrating state-of-the-art attack success across 15 models, it offers broad foundational impact for both attacking and defending AI systems. In contrast, Paper 2 focuses on a relatively niche application (political speech analysis) with a very small case study, limiting its broader methodological and scientific applicability.
Paper 1 bridges mechanistic interpretability and adversarial robustness by theoretically reframing refusal ablation as a latent-space evasion attack. This principled understanding explains existing empirical successes and generates a more powerful attack mechanism with broad implications for AI safety. While Paper 2 addresses a timely issue in reasoning efficiency, Paper 1 offers deeper theoretical insights into model internals and vulnerabilities, likely driving broader foundational impact across the alignment and security communities.
Paper 2 offers a principled theoretical framework that reinterprets and improves upon existing refusal-suppression methods in LLMs, achieving state-of-the-art results across 15 diverse models. Its contributions—formalizing latent-space attacks, explaining why prior methods work, and proposing a superior approach—have broad implications for AI safety, alignment, and adversarial robustness. Paper 1, while addressing an important BCI problem, reports modest improvements (cosine similarity of 0.181 vs 0.139) on a single dataset and acknowledges that EEG-to-text decoding remains far from practical utility. Paper 2's timeliness given current AI safety concerns and broader applicability give it higher impact potential.
Paper 1 tackles AI safety and LLM alignment, an urgent and universally critical challenge in modern AI. By providing a principled theoretical framing for refusal evasion (recasting it as a latent-space evasion attack) and demonstrating state-of-the-art results across 15 models, it offers broad scientific impact for mechanistic interpretability and AI security. Paper 2 showcases impressive real-world commercial results in digital advertising, but its scientific scope is narrower and primarily represents an applied engineering optimization in ad-tech rather than a foundational shift in general AI research.
Paper 2 likely has higher scientific impact: it proposes a broadly applicable, constructive framework for reliable automatic equation discovery with clear real-world use in scientific modeling, engineering, and automated discovery. Its agentic loop (generation, mixed fitting, repair, and semantic memory) is methodologically rich and can generalize across domains and LLM backbones, supporting breadth and durability. Paper 1 is novel and rigorous but primarily advances jailbreak/refusal-evasion techniques, which may face limited publish-and-deploy applicability due to safety constraints and narrower positive downstream adoption.