Latent-space Attacks for Refusal Evasion in Language Models

Giorgio Piras, Raffaele Mura, Fabio Brau, Maura Pintor, Luca Oneto, Fabio Roli, Battista Biggio

#356 of 2292 · Artificial Intelligence
Tournament Score: 1492±48
Win Rate: 82% (14 wins, 3 losses, 17 matches)
Rating: 7.8/10

Abstract

Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Latent-space Attacks for Refusal Evasion in Language Models"

1. Core Contribution

This paper makes two interconnected contributions. First, it provides a theoretical reframing of existing refusal-ablation methods (DiM, RDO, PS, MD) as latent-space evasion attacks against linear probes. Under this lens, the widely-used difference-in-means ablation is shown to be equivalent to the DeepFool minimum-confidence attack—a projection onto the linear decision boundary separating refused from compliant representations. This is an elegant insight: it unifies disparate methods under a single adversarial ML framework and immediately reveals their shared limitation—they only push representations to the decision boundary, not past it.
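The equivalence described above can be illustrated with a minimal sketch. It assumes a linear probe with score s(h) = w·h + b, where a positive score means "refused", and it takes w to be the difference-in-means direction with b = 0. The function names and the toy activations are illustrative assumptions, not the paper's code or data, and plain Python lists stand in for residual-stream vectors.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def diff_in_means(refused, answered):
    """Difference-in-means direction: mean(refused) - mean(answered)."""
    mu_r, mu_a = mean(refused), mean(answered)
    return [r - a for r, a in zip(mu_r, mu_a)]

def ablate_direction(h, w):
    """Directional ablation: remove the component of h along w."""
    coef = dot(h, w) / dot(w, w)
    return [x - coef * wi for x, wi in zip(h, w)]

def project_to_boundary(h, w, b=0.0):
    """Minimum-norm move of h onto the hyperplane w.h + b = 0."""
    coef = (dot(h, w) + b) / dot(w, w)
    return [x - coef * wi for x, wi in zip(h, w)]

# Toy 2-D activations (made up for illustration).
refused = [[2.0, 1.0], [3.0, 0.0]]
answered = [[-1.0, 1.0], [0.0, 0.0]]

w = diff_in_means(refused, answered)
h = [2.0, 1.0]
ablated = ablate_direction(h, w)
projected = project_to_boundary(h, w, b=0.0)
```

With b = 0 the two functions perform the same computation, so the ablated vector lands exactly on the probe's decision boundary (score 0) and no further. A nonzero bias would shift the boundary away from the origin; how the paper treats the bias term is not stated in this review.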

Second, building on this insight, the paper proposes Controlled Latent-space Evasion (CLE), which adds an optimized confidence margin to push representations into the compliant region. Two variants are introduced: CLE-P (projective, reprojecting every token) and CLE-A (additive, computing perturbation once and reusing it). The surprising finding that CLE-A outperforms CLE-P suggests that continuous reprojection is unnecessary and potentially harmful—a fixed additive shift suffices to sustain evasion throughout generation.
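The difference between the two variants can be sketched as follows, under stated assumptions: the probe score is s(h) = w·h + b with positive meaning "refused", the target is a score of -margin on the compliant side, and the additive variant computes its shift from one reference token (index 0 here). The choice of reference token, the function names, and the toy numbers are assumptions for illustration and are not taken from the paper.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def score(h, w, b):
    """Linear probe score; positive is assumed to mean 'refused'."""
    return dot(h, w) + b

def evasion_shift(h, w, b, margin):
    """Smallest shift along w that moves h to probe score -margin."""
    coef = (score(h, w, b) + margin) / dot(w, w)
    return [-coef * wi for wi in w]

def cle_projective(tokens, w, b, margin):
    """Projective variant: recompute the shift for every token."""
    out = []
    for h in tokens:
        delta = evasion_shift(h, w, b, margin)
        out.append([x + d for x, d in zip(h, delta)])
    return out

def cle_additive(tokens, w, b, margin, ref_index=0):
    """Additive variant: compute one shift and reuse it for all tokens."""
    delta = evasion_shift(tokens[ref_index], w, b, margin)
    return [[x + d for x, d in zip(h, delta)] for h in tokens]

# Toy example: two token representations and a unit-norm probe direction.
w, b, margin = [1.0, 0.0], 0.0, 2.0
tokens = [[3.0, 1.0], [1.0, 1.0]]

proj_scores = [score(h, w, b) for h in cle_projective(tokens, w, b, margin)]
add_scores = [score(h, w, b) for h in cle_additive(tokens, w, b, margin)]
```

In this sketch the projective variant pins every token to the same score, while the additive variant leaves tokens at different scores because they all receive the same shift. Setting margin to 0 recovers plain boundary projection.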

2. Methodological Rigor

The theoretical framework is cleanly developed. The decomposition of the gradient into direct and indirect components (Eq. 5), the truncated gradient approximation, and the reduction of arbitrary latent perturbations to 2L scalar parameters are well-motivated. The connection between margin and logistic confidence (Appendix E) is mathematically precise.
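Appendix E is not reproduced in this review, so the following is only a sketch of one standard way to relate margin and confidence. It assumes a logistic link on the probe score, so that a representation placed at score -margin on the compliant side has compliance confidence sigmoid(margin). Both function names are hypothetical.

```python
import math

def compliance_confidence(margin):
    """Logistic confidence of the compliant class at probe score -margin."""
    return 1.0 / (1.0 + math.exp(-margin))

def margin_for_confidence(p):
    """Inverse map (logit): margin needed to reach confidence p, 0 < p < 1."""
    return math.log(p / (1.0 - p))

# A representation on the decision boundary has margin 0.
boundary_confidence = compliance_confidence(0.0)

# Margin needed for 90% compliance confidence under this assumption.
m90 = margin_for_confidence(0.9)
```

Under this assumption, stopping at the boundary corresponds to a confidence of 0.5, which matches the abstract's description of ablation as a minimum-confidence attack. Larger margins raise the confidence monotonically toward 1.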

However, several methodological choices warrant scrutiny:

  • Bayesian Optimization: The use of BO with 500-700 trials over a validation set introduces a non-trivial optimization cost (~6 hours per model). The contiguous window constraint on layer selection, while reducing search space, is an assumption that may not hold universally. The paper acknowledges this as a limitation.
  • Probe training: SVM probes are trained on only 128 harmful and 128 harmless samples. While the ROC analysis (Fig. 5) suggests strong separability, the generalization of these probes to diverse harmful categories is not thoroughly validated.
  • Evaluation protocol: ASR is measured using a single classifier judge (HarmBench-Llama-2-13B-cls), averaged over 3 seeds. While standard for the field, this single-judge evaluation may not capture nuances in response quality or harmfulness. The model coherence evaluation (MMLU, ARC, TruthfulQA) shows modest drops, but the benchmarks selected are relatively coarse measures.
  • Truncated gradient approximation: Dropping indirect gradient contributions through subsequent layers is a convenience that may leave performance on the table, though the strong empirical results suggest it works well in practice.
3. Potential Impact

    For AI safety research: This work is significant because it demonstrates that current safety alignment based on linearly separable refusal representations is fundamentally fragile. The fact that a simple linear probe + margin optimization can defeat alignment across 15 diverse models (including defended models like Mistral-7B-RR and reasoning models like DeepSeek-R1-8B) is a strong negative result for the robustness of current alignment approaches.

    For adversarial ML: The connection between refusal ablation and DeepFool-style evasion attacks is theoretically valuable, bridging interpretability/alignment literature with classical adversarial robustness. This cross-pollination could inspire new attack-defense dynamics.

    For defenses: The paper explicitly suggests that alignment procedures should discourage linear separability of refusal representations. This is actionable guidance—though implementing it without degrading model quality is non-trivial.

    Dual-use concerns: The attack is powerful, computationally cheap at inference time (single forward pass after optimization), and universal across prompts. The white-box access requirement limits immediate misuse to open-weight models, but the growing availability of such models makes this a real concern.

4. Timeliness & Relevance

    This work arrives at a critical juncture. The rapid deployment of open-weight LLMs, combined with increasing evidence that safety alignment is brittle, makes understanding and characterizing failure modes urgent. The paper addresses the specific bottleneck of explaining *why* refusal ablation works—prior work was largely empirical. The extension to reasoning and multimodal models (including recent architectures like Qwen3.5, Phi-4, Gemma3) demonstrates relevance to the current model landscape.

5. Strengths & Limitations

Key Strengths:

  • Unifying theoretical framework that elegantly explains multiple prior methods as special cases of latent-space evasion
  • State-of-the-art results across a diverse set of 15 models, including defended and reasoning models
  • Surprising CLE-A finding: a single additive perturbation outperforms per-token reprojection, which offers mechanistic insight into how refusal trajectories propagate
  • Comprehensive evaluation with ablation studies isolating each component's contribution
  • The monotonic relationship between compliance confidence and ASR (Fig. 3a) strongly validates the theoretical framework
Notable Limitations:

  • BO optimization cost is non-negligible and model-specific; no theoretical guidance on margin selection
  • Linear probe assumption may not hold for future models with non-linear refusal representations
  • Limited evaluation of output quality beyond ASR—harmful completions may be incoherent or incomplete
  • The 128-sample probe training set is small; sensitivity to training data composition is unexplored
  • No evaluation against more sophisticated defenses beyond Representation Rerouting
  • The paper does not discuss how CLE interacts with system prompts or multi-turn conversations
6. Additional Observations

    The qualitative examples (Appendix J) are striking—models produce detailed harmful content that would clearly fail safety standards. The Mistral-7B-RR example is particularly notable: the defended model produces garbled refusals without CLE but coherent harmful content with CLE-P, suggesting the defense primarily disrupts refusal token generation rather than fundamentally changing representation geometry.

    The paper's framing as an "attack" paper rather than a "defense" paper is appropriate—it advances understanding of vulnerabilities. However, the defensive implications (Section 6) remain speculative.

    Rating: 7.8/10
    Significance 8 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

    Generated May 22, 2026

    Comparison History (17)

    vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
    gemini-3.1 · 5/22/2026

    Paper 2 addresses a critical and highly active area of AI safety: model alignment and jailbreaking. By providing a principled framework for latent-space attacks and demonstrating state-of-the-art success rates across 15 diverse models, it exposes severe vulnerabilities in current safety mechanisms. This will likely drive significant follow-up research in both mechanistic interpretability and robust alignment, giving it a broader and more urgent impact than Paper 1's focus on XAI evaluation metrics.

    vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a fundamental scientific problem in AI safety by reframing LLM refusal suppression as a latent-space evasion attack, advancing theoretical understanding and empirical evaluation across 15 models. Paper 2, while offering a highly useful engineering framework for reducing API boilerplate, is primarily a software engineering tool rather than a contribution to fundamental scientific knowledge. Consequently, Paper 1 has much higher potential for broad scientific impact in the critical field of AI alignment and security.

    vs. LACO: Adaptive Latent Communication for Collaborative Driving
    gemini-3.1 · 5/22/2026

    Paper 1 addresses AI safety and LLM jailbreaking, a critically important field with broad implications for the safe deployment of foundation models. Its novel theoretical framework for understanding refusal suppression provides fundamental insights that could significantly influence future alignment strategies. While Paper 2 offers strong contributions to collaborative autonomous driving, the widespread use of LLMs and the urgent need for robust safety mechanisms give Paper 1 a broader and more immediate scientific impact.

    vs. Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact: it proposes a broadly useful architectural improvement for efficient long-context modeling (decoupled erase/write gating in linear attention), provides algorithmic/optimization contributions (chunkwise WY, gate-aware backward pass), and reports large-scale training with consistent gains across multiple benchmarks and settings, enabling real-world deployment benefits (constant-memory decoding, long-context retrieval). Paper 1 is insightful and timely for safety research, but its primary contribution is an attack/analysis on refusal mechanisms with narrower positive application scope and potentially shorter-lived impact as defenses evolve.

    vs. Structured Role-Aware Policy Optimization for Multimodal Reasoning
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical bottleneck in scaling reinforcement learning for multimodal reasoning by introducing role-aware token-level credit assignment. Improving how large vision-language models learn to reason from verifiable rewards has massive implications for developing more capable, reliable AI systems. While Paper 2 offers valuable insights into AI safety and jailbreaking, Paper 1's methodology fundamentally advances model capabilities in a highly active and impactful frontier (RL-based reasoning for multimodal models), likely leading to broader adoption and downstream applications.

    vs. Deep Reinforcement Learning for Flexible Job Shop Scheduling with Random Job Arrivals
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical and highly timely issue in AI safety: LLM jailbreaking and alignment. By providing a novel theoretical framing of refusal suppression as latent-space evasion and achieving state-of-the-art results across 15 modern models, it has profound implications for the secure deployment of foundation models. In contrast, Paper 2 applies standard reinforcement learning (PPO) to a classic operations research problem (FJSP). While practically useful, Paper 2 represents an incremental methodological advance with a narrower domain of impact compared to the broad, urgent relevance of Paper 1 in generative AI.

    vs. Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact due to its clearer conceptual reframing of refusal suppression as a principled latent-space evasion attack, unifying prior heuristics with probe geometry and yielding a general, extensible attack method. It demonstrates broad empirical validation across 15 diverse models (including multimodal/reasoning) and sets new state-of-the-art attack success, making it timely and highly relevant to safety research and evaluation. Paper 1 is promising for alignment architecture, but its impact is narrower, and robustness/rigor beyond benchmark harm reduction is less clearly established from the abstract.

    vs. TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens
    gemini-3.1 · 5/22/2026

    Paper 1 provides a critical theoretical foundation for understanding refusal suppression in AI safety. By recasting refusal ablation as a latent-space evasion attack, it unifies empirical observations and proposes a state-of-the-art attack. This conceptual breakthrough will broadly impact how the community evaluates and improves LLM safety alignment, a highly urgent and relevant field. While Paper 2 offers significant efficiency improvements for multimodal reasoning, Paper 1's contribution addresses foundational security vulnerabilities across a wide range of state-of-the-art models, suggesting broader scientific impact.

    vs. Unlocking Proactivity in Task-Oriented Dialogue
    gemini-3.1 · 5/22/2026

    AI safety and jailbreak robustness are currently critical, high-priority areas in LLM research. Paper 2 provides a principled theoretical reframing of refusal suppression and demonstrates state-of-the-art results across 15 foundational models, offering broad applicability and significant implications for AI alignment. While Paper 1 presents an innovative approach to proactive task-oriented dialogue, its impact is concentrated in a narrower application domain (e.g., sales agents), making Paper 2 more timely and universally impactful.

    vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical and highly active area of research: LLM safety, alignment, and jailbreaking. By providing a principled understanding of refusal suppression in latent space and demonstrating state-of-the-art attack success across many models, it exposes fundamental vulnerabilities in current AI systems. While Paper 2 offers a valuable framework for embodied AI simulation, the broader implications, urgency, and cross-disciplinary relevance of securing foundational models give Paper 1 a higher potential for immediate and widespread scientific impact.

    vs. Towards a compositional semantics for quantitative confidence assessment in assurance arguments
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a highly active and critical area (LLM safety and jailbreaking), offering a novel theoretical perspective and achieving state-of-the-art results across 15 modern models. Its findings have immediate, widespread applicability in AI safety. Paper 2, while methodologically sound, focuses on a much more niche subfield (formal assurance arguments), limiting its breadth of impact and real-world adoption compared to Paper 1.

    vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
    gpt-5.2 · 5/22/2026

    Paper 2 has higher likely impact: it offers a principled theoretical reframing of refusal suppression (as latent-space evasion against linear probes) and derives a stronger, controllable attack that achieves state-of-the-art results across many model families, making it broadly relevant to LLM security, alignment, and interpretability. Its methodological rigor and generality (clear formalization + extensive multi-model evaluation) support wide adoption and follow-on work. Paper 1 is timely and useful for political speech analysis, but is based on a small single-speech case study and narrower domain scope.

    vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models
    gemini-3.1 · 5/22/2026

    Paper 1 addresses a critical and highly active area of AI safety (LLM alignment and jailbreaking). By providing a principled theoretical framework for refusal suppression and demonstrating state-of-the-art attack success across 15 models, it offers broad foundational impact for both attacking and defending AI systems. In contrast, Paper 2 focuses on a relatively niche application (political speech analysis) with a very small case study, limiting its broader methodological and scientific applicability.

    vs. CLORE: Content-Level Optimization for Reasoning Efficiency
    gemini-3.1 · 5/22/2026

    Paper 1 bridges mechanistic interpretability and adversarial robustness by theoretically reframing refusal ablation as a latent-space evasion attack. This principled understanding explains existing empirical successes and generates a more powerful attack mechanism with broad implications for AI safety. While Paper 2 addresses a timely issue in reasoning efficiency, Paper 1 offers deeper theoretical insights into model internals and vulnerabilities, likely driving broader foundational impact across the alignment and security communities.

    vs. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs
    claude-opus-4.6 · 5/22/2026

    Paper 2 offers a principled theoretical framework that reinterprets and improves upon existing refusal-suppression methods in LLMs, achieving state-of-the-art results across 15 diverse models. Its contributions—formalizing latent-space attacks, explaining why prior methods work, and proposing a superior approach—have broad implications for AI safety, alignment, and adversarial robustness. Paper 1, while addressing an important BCI problem, reports modest improvements (cosine similarity of 0.181 vs 0.139) on a single dataset and acknowledges that EEG-to-text decoding remains far from practical utility. Paper 2's timeliness given current AI safety concerns and broader applicability give it higher impact potential.

    vs. Generative Auto-Bidding with Unified Modeling and Exploration
    gemini-3.1 · 5/22/2026

    Paper 1 tackles AI safety and LLM alignment, an urgent and universally critical challenge in modern AI. By providing a principled theoretical framing for refusal evasion (recasting it as a latent-space evasion attack) and demonstrating state-of-the-art results across 15 models, it offers broad scientific impact for mechanistic interpretability and AI security. Paper 2 showcases impressive real-world commercial results in digital advertising, but its scientific scope is narrower and primarily represents an applied engineering optimization in ad-tech rather than a foundational shift in general AI research.

    vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher scientific impact: it proposes a broadly applicable, constructive framework for reliable automatic equation discovery with clear real-world use in scientific modeling, engineering, and automated discovery. Its agentic loop (generation, mixed fitting, repair, and semantic memory) is methodologically rich and can generalize across domains and LLM backbones, supporting breadth and durability. Paper 1 is novel and rigorous but primarily advances jailbreak/refusal-evasion techniques, which may face limited publish-and-deploy applicability due to safety constraints and narrower positive downstream adoption.