Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

#1270 of 2292 · Artificial Intelligence
Tournament Score: 1399±44
Win Rate: 53% (10 wins, 9 losses, 19 matches)
Rating: 6.8/10

Abstract

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper identifies a novel empirical observation about attention patterns in Large Reasoning Models (LRMs) during jailbreak attacks: successful jailbreaks correlate with *lower* attention to harmful tokens in the input prompt but *higher* attention to harmful tokens in the chain-of-thought (CoT) reasoning content. Building on this observation, the authors propose AGR (Attention-Guided Reward), an RL-based jailbreak framework that uses the signed distance to a linear SVM decision boundary in (AP_p, AP_r) space as a reward signal, guiding prompt refinement toward attention patterns associated with successful attacks. The method also introduces a 17-action space incorporating cognitive persuasion strategies to diversify prompt mutations.
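
The signed-distance reward described above can be illustrated with a minimal sketch. It assumes a linear boundary w·x + b = 0 over the two features (AP_p, AP_r) has already been fitted. The weights, bias, function name, and sign convention (positive on the side associated with successful jailbreaks) are hypothetical placeholders for illustration, not values or code from the paper.

```python
import math

# Hypothetical boundary parameters, NOT taken from the paper.
# Features are (AP_p, AP_r): attention to harmful tokens in the prompt
# and in the reasoning content, respectively.
W = (-1.0, 1.0)  # assumed: lower AP_p and higher AP_r point toward "success"
B = 0.0          # assumed bias term


def signed_distance_reward(ap_p: float, ap_r: float, w=W, b=B) -> float:
    """Signed Euclidean distance from (ap_p, ap_r) to the line w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    if norm == 0.0:
        raise ValueError("weight vector must be non-zero")
    return (w[0] * ap_p + w[1] * ap_r + b) / norm


if __name__ == "__main__":
    # Low prompt attention, high reasoning attention: positive reward.
    print(signed_distance_reward(0.1, 0.4))
    # The reverse pattern: negative reward.
    print(signed_distance_reward(0.4, 0.1))
```

Dividing by the norm of the weight vector turns the raw classifier score into a geometric distance, so the reward scale does not depend on how the boundary happens to be parameterized.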

The core insight—that the attention allocation discrepancy between prompt and reasoning content discriminates successful from failed jailbreaks—is genuinely interesting and specific to LRMs. This distinguishes the work from prior methods that simply apply LLM-oriented jailbreak techniques to reasoning models.

2. Methodological Rigor

Strengths in methodology:

  • The attention analysis is systematic: 100 manually labeled cases are analyzed across multiple models (Qwen3-1.7B, Qwen3-8B, DeepSeek-R1-Distill-Llama-8B), with consistent patterns observed across architectures.
  • The RL formulation is clean: MDP definition, state representation, PPO with GAE, and the SVM-based reward are well-specified and reproducible.
  • The ablation study (Table 5) effectively disentangles contributions of the reward signal, PPO optimization, and action space diversity.
  • Judge consistency is validated with human annotators (Fleiss' κ = 0.97) and two LLM judges, with GPT-4 achieving Cohen's κ = 0.92.
  • Robustness analysis for harmful word extraction noise (Table 7) addresses a practical concern.
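
For readers unfamiliar with the "PPO with GAE" component mentioned in the list above, the following is a generic sketch of Generalized Advantage Estimation for a single finished episode. It is a textbook formulation, not the authors' implementation; the function name and the default `gamma` and `lam` values are common choices assumed here for illustration.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one finished episode.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_{T-1}); the value after the final step is taken as 0.
    Returns a list of advantages A_0 .. A_{T-1}.
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


if __name__ == "__main__":
    print(compute_gae([1.0, 1.0], [0.0, 0.0], gamma=1.0, lam=1.0))
```

With `lam=0` the estimate reduces to the one-step TD residual, and with `lam=1` it becomes the full discounted return minus the value baseline; intermediate values trade bias against variance.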

Weaknesses:

  • The foundational observation (Figure 2) is based on only 100 samples with a specific set of jailbreak techniques. The SVM decision boundary is linear and trained on this small dataset—overfitting risk is real, though the consistent cross-model results partially mitigate this concern.
  • The causal direction is unclear: does manipulating attention *cause* successful jailbreaks, or is the attention pattern merely a correlate of successful attacks? The paper implicitly assumes causality but doesn't rigorously establish it.
  • The harmful word extraction relies on GPT-4 + a predefined dictionary. The robustness study shows degradation at 40% noise, and real-world extraction quality is not independently validated.
  • The 100-sample SVM training set is small and drawn exclusively from AdvBench; generalization of the decision boundary to other distributions is assumed but not rigorously tested.

3. Potential Impact

Positive contributions:

  • The attention-pattern finding provides a mechanistic lens for understanding LRM safety failures, which could inform defense development. Understanding *why* jailbreaks succeed at the attention level is valuable for the safety community.
  • The RL framework with explicit optimization objectives represents a methodological advance over heuristic prompt-rewriting approaches.
  • Transfer results to closed-source models (o4-mini: 64%, Gemini-2.5-Flash: 71.3%) have practical security implications.

Dual-use concerns:

This paper provides a highly effective automated jailbreak tool. The 96-98% ASR on open-source models and 64-71% on closed-source models, combined with an average attack time of ~10 seconds, represents a potent offensive capability. The persuasion-strategy-based action space essentially automates social engineering of AI systems. While red-teaming research is valuable, the completeness of the attack pipeline (code, prompt templates, and detailed methodology) raises responsible disclosure questions.

Broader impact:

The work could influence: (1) LRM safety alignment research by highlighting the prompt-vs-reasoning attention asymmetry; (2) defense mechanisms specifically targeting attention manipulation; (3) future RL-based red-teaming frameworks.

4. Timeliness & Relevance

The paper is highly timely. LRMs (DeepSeek-R1, Qwen3, o4-mini, Gemini 2.5) are rapidly being deployed, and their safety properties are less understood than those of standard LLMs. The observation that exposed reasoning traces create additional attack surfaces is increasingly relevant as more models adopt CoT generation. The paper directly addresses the gap between LLM jailbreak research and the emerging LRM paradigm.

5. Strengths & Limitations

Key strengths:

  • Novel, interpretable finding about attention patterns specific to LRM jailbreaking
  • Principled RL framework with a well-motivated reward function
  • Comprehensive evaluation: 5 models, 3 benchmarks, 4 defense mechanisms, ablation studies
  • Strong empirical results: 96-98% ASR outperforming all baselines, with 10.8s per attack
  • Practical efficiency: lightweight MLP agent, single GPU training

Key limitations:

  • Small sample size (100 cases) for the foundational observation and SVM training
  • Correlation vs. causation issue in the attention pattern analysis
  • Dependence on white-box access for reward computation limits direct applicability to closed-source models (transfer is indirect)
  • The SVM is target-model-specific, requiring retraining for each new model
  • The persuasion strategies are borrowed from prior work [33]; their integration is useful but not deeply analyzed
  • Limited analysis of what *types* of harmful content are most/least susceptible
  • Defense evaluation is somewhat limited—more recent or LRM-specific defenses are not tested
  • Missing comparisons: Mousetrap [30] is discussed in related work but not included in baselines. The paper also does not compare against other RL-based jailbreak methods for LLMs.

Overall Assessment

This is a solid, timely contribution that combines an interesting empirical observation with a well-engineered RL framework. The attention-pattern finding is the paper's most valuable contribution and could influence both offensive and defensive research on LRM safety. The experimental results are strong and comprehensive. However, the small sample foundation for the core observation, the lack of causal analysis, and the inherent dual-use concerns somewhat temper the impact. The work represents a meaningful advance in understanding and exploiting LRM vulnerabilities, though the mechanistic understanding remains surface-level.

Rating: 6.8/10
Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (19)

    vs. COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
    claude-opus-4.6 · 5/21/2026

    Paper 1 addresses the critical and timely topic of AI safety for Large Reasoning Models, which are rapidly being deployed. The discovery that attention patterns correlate with jailbreak success is novel and provides mechanistic insight. The RL-based attack method with attention-guided rewards is innovative and has broad implications for AI safety research. Given the explosive growth of LLM/LRM deployment, this work has high relevance and urgency. Paper 2, while solid, addresses a more established optimization domain (VRP) with incremental improvements over existing methods, limiting its broader impact.

    vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions
    gpt-5.2 · 5/21/2026

    Paper 1 likely has higher scientific impact: it introduces a practical framework to distill LLM agent interaction traces into reusable, efficient RPA code, addressing a clear deployment bottleneck (cost/latency) and enabling broad real-world automation across enterprise and consumer GUI workflows. The translator–builder pipeline plus hybrid repair strategy suggests methodological completeness, and the large token/runtime reductions indicate immediate utility. Paper 2 is novel and timely in LRM security, but as an offensive jailbreak method its broader adoption and positive downstream impact may be constrained; its contributions may be more specialized to safety research.

    vs. Interference-Aware Multi-Task Unlearning
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses a timely and critical safety concern with Large Reasoning Models (LRMs), a rapidly emerging area. Its novel insight connecting attention patterns to jailbreak vulnerability, combined with an RL-based attack framework, has broader impact across AI safety, alignment research, and policy. The findings affect both open-source and closed-source models, increasing practical relevance. Paper 1, while methodologically sound in extending unlearning to multi-task settings, addresses a more incremental and narrower problem. The AI safety implications and timeliness of Paper 2 give it higher potential impact across the research community.

    vs. Neurosymbolic Learning for Inference-Time Argumentation
    gemini-3.1 · 5/20/2026

    Paper 2 addresses a highly critical and timely issue—the safety vulnerabilities of Large Reasoning Models (LRMs) to jailbreak attacks. Its novel use of attention patterns combined with reinforcement learning to expose and exploit these vulnerabilities provides significant contributions to the rapidly growing field of AI safety and red-teaming. While Paper 1 offers valuable advancements in neurosymbolic claim verification, the immediate security implications, broader interest in LRM safety, and extensive experimental validation in Paper 2 give it a higher potential for broad scientific and practical impact.

    vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
    gpt-5.2 · 5/20/2026

    Paper 1 likely has higher broad scientific impact: it introduces a large, multilingual benchmark (10,372 pairs) and a multi-metric evaluation framework for programmatic spatiotemporal reasoning, uncovering a general “execution–spatial” failure mode relevant to code generation, vision-language, and evaluation research. Its outputs are readily reusable by the community and can standardize comparisons across models. Paper 2 is novel and timely for AI safety, but jailbreak methods are narrower in positive scientific utility and can face dissemination/ethical constraints, limiting adoption and cross-field impact despite methodological sophistication.

    vs. Memory-Augmented Reinforcement Learning Agent for CAD Generation
    gpt-5.2 · 5/20/2026

    Paper 2 likely has higher scientific impact due to strong real-world applicability (manufacturing/CAD automation), broad relevance across AI, robotics/agents, geometry, and design tools, and a constructive methodology (toolchain + verification + memory + RL) that can generalize to other long-horizon, constraint-heavy generation tasks. It targets a timely, high-value industrial bottleneck and proposes an architecture enabling self-correction without large new annotations. Paper 1 is novel but primarily advances offensive jailbreak capabilities, which may limit adoption and downstream impact despite security relevance.

    vs. OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models
    gemini-3.1 · 5/20/2026

    Paper 1 addresses the urgent and highly relevant issue of safety and vulnerabilities in emerging Large Reasoning Models (LRMs). Given the rapid deployment of reasoning LLMs, understanding and mitigating their jailbreak vulnerabilities is critical. The novel use of attention-guided reinforcement learning provides a rigorous approach to AI safety, likely leading to broader immediate impact and real-world application in securing foundational models compared to the more mature field of vision model explainability addressed in Paper 2.

    vs. What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses a fundamental question about what drives reasoning improvements in LLMs, providing systematic evidence through large-scale controlled experiments on a 10T-token corpus. Its findings—that structured reasoning traces, not executable code per se, improve mathematical reasoning—have broad implications for LLM pretraining data strategies across the entire field. Paper 1, while methodologically sound, focuses on a narrower adversarial attack technique. Paper 2's insights into data-centric optimization and cross-domain interactions will likely influence foundation model training practices more broadly and durably.

    vs. Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation
    gemini-3.1 · 5/20/2026

    Paper 2 addresses critical safety vulnerabilities in cutting-edge Large Reasoning Models, a highly timely and impactful area in AI alignment. By linking attention mechanisms to jailbreak success and leveraging RL, it offers novel insights that will significantly influence robust model design and security, providing broader real-world implications than the methodological evaluation improvements in Paper 1.

    vs. Agentic Trading: When LLM Agents Meet Financial Markets
    gemini-3.1 · 5/20/2026

    Paper 2 addresses a critical, highly timely issue: the safety vulnerabilities of emerging Large Reasoning Models (LRMs). By proposing a novel attention-guided reinforcement learning attack, it exposes significant security flaws and provides a methodological advancement that will stimulate crucial downstream research in AI alignment and defense. While Paper 1 is a rigorous and valuable systematic review, it focuses on reproducibility issues within the narrower, applied domain of algorithmic trading. The foundational AI security implications of Paper 2 give it a significantly higher potential for widespread scientific impact.

    vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents
    gemini-3.1 · 5/20/2026

    Paper 1 addresses a critical bottleneck in the rapidly growing field of agentic AI by providing a robust, scalable framework and benchmark for evaluating computer-use agents. Infrastructure and evaluation frameworks typically have broader and longer-lasting scientific impact across multiple disciplines compared to specific attack methodologies like the jailbreak technique presented in Paper 2.

    vs. TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?
    claude-opus-4.6 · 5/20/2026

    Paper 1 addresses a timely and critical safety concern for Large Reasoning Models, introducing a novel attention-guided RL framework for jailbreaking that reveals fundamental insights about attention patterns and safety vulnerabilities. This has broad impact across AI safety, alignment research, and red-teaming communities. Paper 2, while valuable as a domain-specific benchmark for telecom, has narrower impact limited primarily to the telecommunications NLP community. Paper 1's methodological novelty (attention-based reward design for RL jailbreaking) and its implications for AI safety across multiple model families give it higher potential for cross-disciplinary influence.

    vs. Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses a critical and timely safety vulnerability in Large Reasoning Models, a rapidly growing area. It provides novel mechanistic insights (attention patterns correlating with jailbreak success) and proposes a principled RL-based attack method with strong empirical results across multiple models and benchmarks. This has broad implications for AI safety research, red-teaming, and alignment. Paper 1, while practically useful, is more of an engineering contribution (a runtime framework for LLM agents) with less fundamental scientific novelty. Safety/security findings tend to have outsized impact as they influence model development practices across the field.

    vs. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling
    gpt-5.2 · 5/20/2026

    Paper 1 likely has higher scientific impact: it introduces a general, constructive framework (signed-graph modeling and conflict-aware message passing) that can improve robustness and aggregation in multi-agent LLM systems, with broad applicability to coordination, decision support, and ensemble reasoning across domains. Its methodological contribution is more reusable and extensible beyond the specific benchmarks. Paper 2 is novel and timely for AI safety evaluation, but its primary contribution advances offensive jailbreak capability; practical deployment and downstream adoption may be limited by ethical constraints, and its impact may be narrower to security research despite strong relevance.

    vs. Generative Auto-Bidding with Unified Modeling and Exploration
    gpt-5.2 · 5/20/2026

    Paper 1 offers a novel, system-level framework that unifies generative sequence modeling with directed exploration and an explicit safety fallback for a high-stakes, high-volume real-world domain. Its methodological rigor is strengthened by evaluation across public data, simulation, and large-scale online deployment with clear business metrics, supporting reproducibility and practical validity. The potential applications are immediate and economically significant for ad platforms, and the explore–safeguard–select idea may transfer to other safe decision-making/control settings. Paper 2 is timely for AI safety research but is primarily an offensive jailbreak technique with narrower positive deployment pathways.

    vs. AgentNLQ: A General-Purpose Agent for Natural Language to SQL
    claude-opus-4.6 · 5/20/2026

    Paper 1 addresses a timely and critical problem in AI safety—jailbreaking Large Reasoning Models—with a novel approach combining attention-pattern analysis with RL-based attacks. The finding that attention patterns correlate with jailbreak success is a genuinely novel insight with broad implications for AI alignment and safety research. Paper 2 presents an incremental multi-agent approach to NL2SQL, a well-studied problem, achieving competitive but not groundbreaking results on BIRD benchmark. Paper 1's novelty, timeliness given the rapid deployment of LRMs, and broader impact on AI safety give it higher scientific impact potential.

    vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
    gpt-5.2 · 5/20/2026

    Paper 2 offers a broadly enabling, constructive contribution: a programmatic representation for generating editable, physically interactable indoor scenes with articulated objects, validated via execution and exported to simulators. This has clear real-world applications in robotics, embodied AI, simulation, and content creation, and its “executable world programs” framing could influence multiple communities (graphics, HCI, planning, sim2real). Paper 1 is novel but primarily advances offensive jailbreak techniques; impact is narrower and may face dissemination/ethical constraints despite relevance to safety research.

    vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows
    claude-opus-4.6 · 5/20/2026

    Paper 2 addresses a critical and timely AI safety concern—jailbreaking Large Reasoning Models—with a novel mechanistic insight linking attention patterns to attack success, plus a concrete RL-based method. This has broader immediate impact across AI safety, alignment, and model deployment. The attention-guided reward mechanism is innovative and actionable for both offensive and defensive research. Paper 1, while thorough as a benchmarking substrate for agentic delegation, addresses a narrower community and its key finding—that quality is indistinguishable across conditions—limits its immediate transformative impact.

    vs. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine
    gpt-5.2 · 5/20/2026

    Paper 2 has higher potential impact due to broader, constructive real-world applications: an evidence-grounded literature mapping and hypothesis generation system for nanomedicine, a large translational domain. It introduces a multi-stage, auditable workflow with retrospective benchmarks and human evaluation, suggesting stronger methodological rigor and practical utility across drug delivery, biomaterials, and related fields. Paper 1 is novel but primarily advances jailbreak effectiveness (dual-use, potentially harmful), which may limit dissemination, adoption, and cross-field benefit despite timeliness in LRM safety research.