Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao
Abstract
Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.
AI Impact Assessments
(1 model) Scientific Impact Assessment
1. Core Contribution
This paper identifies a novel empirical observation about attention patterns in Large Reasoning Models (LRMs) during jailbreak attacks: successful jailbreaks correlate with *lower* attention to harmful tokens in the input prompt but *higher* attention to harmful tokens in the chain-of-thought (CoT) reasoning content. Building on this observation, the authors propose AGR (Attention-Guided Reward), an RL-based jailbreak framework that uses the signed distance to a linear SVM decision boundary in (AP_p, AP_r) space as a reward signal, guiding prompt refinement toward attention patterns associated with successful attacks. The method also introduces a 17-action space incorporating cognitive persuasion strategies to diversify prompt mutations.
The core insight—that the attention allocation discrepancy between prompt and reasoning content discriminates successful from failed jailbreaks—is genuinely interesting and specific to LRMs. This distinguishes the work from prior methods that simply apply LLM-oriented jailbreak techniques to reasoning models.
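The reward idea described in this section can be illustrated with a minimal sketch. It assumes that the two attention statistics (AP_p for the prompt, AP_r for the reasoning content) are already available as scalars, and that a linear boundary `w·x + b = 0` has been fitted elsewhere. The weights `W`, the bias `B`, and the function name are hypothetical placeholders for illustration, not values or code from the paper.

```python
import math

def signed_distance_reward(ap_p, ap_r, w, b):
    """Signed Euclidean distance from the point (ap_p, ap_r) to the line w.x + b = 0.

    Positive values lie on the side the weight vector points to,
    negative values on the opposite side.
    """
    norm = math.hypot(w[0], w[1])
    if norm == 0:
        raise ValueError("weight vector must be non-zero")
    return (w[0] * ap_p + w[1] * ap_r + b) / norm

# Hypothetical boundary (assumption, not from the paper): the positive side
# corresponds to low prompt attention and high reasoning attention.
W = (-1.0, 1.0)
B = 0.0

# Low attention in the prompt, high in the reasoning content: positive reward.
r_good = signed_distance_reward(0.1, 0.4, W, B)
# The reverse pattern: negative reward of the same magnitude.
r_bad = signed_distance_reward(0.4, 0.1, W, B)

print(r_good > 0, r_bad < 0)
```

Dividing by the norm of `w` gives a geometric distance rather than a raw decision score, so the reward scale does not depend on how the boundary's weights happen to be scaled.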
2. Methodological Rigor
Strengths in methodology:
Weaknesses:
3. Potential Impact
Positive contributions:
Dual-use concerns:
This paper provides a highly effective automated jailbreak tool. The 96-98% ASR on open-source models and 64-71% on closed-source models, combined with an average attack time of ~10 seconds, represents a potent offensive capability. The persuasion-strategy-based action space essentially automates social engineering of AI systems. While red-teaming research is valuable, the completeness of the attack pipeline (code, prompt templates, and detailed methodology) raises responsible disclosure questions.
Broader impact:
The work could influence: (1) LRM safety alignment research by highlighting the prompt-vs-reasoning attention asymmetry; (2) defense mechanisms specifically targeting attention manipulation; (3) future RL-based red-teaming frameworks.
4. Timeliness & Relevance
The paper is highly timely. LRMs (DeepSeek-R1, Qwen3, o4-mini, Gemini 2.5) are rapidly being deployed, and their safety properties are less understood than those of standard LLMs. The observation that exposed reasoning traces create additional attack surfaces is increasingly relevant as more models adopt CoT generation. The paper directly addresses the gap between LLM jailbreak research and the emerging LRM paradigm.
5. Strengths & Limitations
Key strengths:
Key limitations:
Missing comparisons: Mousetrap [30] is discussed in related work but not included in baselines. The paper also does not compare against other RL-based jailbreak methods for LLMs.
Overall Assessment
This is a solid, timely contribution that combines an interesting empirical observation with a well-engineered RL framework. The attention-pattern finding is the paper's most valuable contribution and could influence both offensive and defensive research on LRM safety. The experimental results are strong and comprehensive. However, the small sample underlying the core observation, the lack of causal analysis, and the inherent dual-use concerns somewhat temper the impact. The work represents a meaningful advance in understanding and exploiting LRM vulnerabilities, though the mechanistic understanding remains surface-level.
Generated May 20, 2026
Comparison History (19)
Paper 1 addresses the critical and timely topic of AI safety for Large Reasoning Models, which are rapidly being deployed. The discovery that attention patterns correlate with jailbreak success is novel and provides mechanistic insight. The RL-based attack method with attention-guided rewards is innovative and has broad implications for AI safety research. Given the explosive growth of LLM/LRM deployment, this work has high relevance and urgency. Paper 2, while solid, addresses a more established optimization domain (VRP) with incremental improvements over existing methods, limiting its broader impact.
Paper 1 likely has higher scientific impact: it introduces a practical framework to distill LLM agent interaction traces into reusable, efficient RPA code, addressing a clear deployment bottleneck (cost/latency) and enabling broad real-world automation across enterprise and consumer GUI workflows. The translator–builder pipeline plus hybrid repair strategy suggests methodological completeness, and the large token/runtime reductions indicate immediate utility. Paper 2 is novel and timely in LRM security, but as an offensive jailbreak method its broader adoption and positive downstream impact may be constrained; its contributions may be more specialized to safety research.
Paper 2 addresses a timely and critical safety concern with Large Reasoning Models (LRMs), a rapidly emerging area. Its novel insight connecting attention patterns to jailbreak vulnerability, combined with an RL-based attack framework, has broader impact across AI safety, alignment research, and policy. The findings affect both open-source and closed-source models, increasing practical relevance. Paper 1, while methodologically sound in extending unlearning to multi-task settings, addresses a more incremental and narrower problem. The AI safety implications and timeliness of Paper 2 give it higher potential impact across the research community.
Paper 2 addresses a highly critical and timely issue—the safety vulnerabilities of Large Reasoning Models (LRMs) to jailbreak attacks. Its novel use of attention patterns combined with reinforcement learning to expose and exploit these vulnerabilities provides significant contributions to the rapidly growing field of AI safety and red-teaming. While Paper 1 offers valuable advancements in neurosymbolic claim verification, the immediate security implications, broader interest in LRM safety, and extensive experimental validation in Paper 2 give it a higher potential for broad scientific and practical impact.
Paper 1 likely has higher broad scientific impact: it introduces a large, multilingual benchmark (10,372 pairs) and a multi-metric evaluation framework for programmatic spatiotemporal reasoning, uncovering a general “execution–spatial” failure mode relevant to code generation, vision-language, and evaluation research. Its outputs are readily reusable by the community and can standardize comparisons across models. Paper 2 is novel and timely for AI safety, but jailbreak methods are narrower in positive scientific utility and can face dissemination/ethical constraints, limiting adoption and cross-field impact despite methodological sophistication.
Paper 2 likely has higher scientific impact due to strong real-world applicability (manufacturing/CAD automation), broad relevance across AI, robotics/agents, geometry, and design tools, and a constructive methodology (toolchain + verification + memory + RL) that can generalize to other long-horizon, constraint-heavy generation tasks. It targets a timely, high-value industrial bottleneck and proposes an architecture enabling self-correction without large new annotations. Paper 1 is novel but primarily advances offensive jailbreak capabilities, which may limit adoption and downstream impact despite security relevance.
Paper 1 addresses the urgent and highly relevant issue of safety and vulnerabilities in emerging Large Reasoning Models (LRMs). Given the rapid deployment of reasoning LLMs, understanding and mitigating their jailbreak vulnerabilities is critical. The novel use of attention-guided reinforcement learning provides a rigorous approach to AI safety, likely leading to broader immediate impact and real-world application in securing foundational models compared to the more mature field of vision model explainability addressed in Paper 2.
Paper 2 addresses a fundamental question about what drives reasoning improvements in LLMs, providing systematic evidence through large-scale controlled experiments on a 10T-token corpus. Its findings—that structured reasoning traces, not executable code per se, improve mathematical reasoning—have broad implications for LLM pretraining data strategies across the entire field. Paper 1, while methodologically sound, focuses on a narrower adversarial attack technique. Paper 2's insights into data-centric optimization and cross-domain interactions will likely influence foundation model training practices more broadly and durably.
Paper 2 addresses critical safety vulnerabilities in cutting-edge Large Reasoning Models, a highly timely and impactful area in AI alignment. By linking attention mechanisms to jailbreak success and leveraging RL, it offers novel insights that will significantly influence robust model design and security, providing broader real-world implications than the methodological evaluation improvements in Paper 1.
Paper 2 addresses a critical, highly timely issue: the safety vulnerabilities of emerging Large Reasoning Models (LRMs). By proposing a novel attention-guided reinforcement learning attack, it exposes significant security flaws and provides a methodological advancement that will stimulate crucial downstream research in AI alignment and defense. While Paper 1 is a rigorous and valuable systematic review, it focuses on reproducibility issues within the narrower, applied domain of algorithmic trading. The foundational AI security implications of Paper 2 give it a significantly higher potential for widespread scientific impact.
Paper 1 addresses a critical bottleneck in the rapidly growing field of agentic AI by providing a robust, scalable framework and benchmark for evaluating computer-use agents. Infrastructure and evaluation frameworks typically have broader and longer-lasting scientific impact across multiple disciplines compared to specific attack methodologies like the jailbreak technique presented in Paper 2.
Paper 1 addresses a timely and critical safety concern for Large Reasoning Models, introducing a novel attention-guided RL framework for jailbreaking that reveals fundamental insights about attention patterns and safety vulnerabilities. This has broad impact across AI safety, alignment research, and red-teaming communities. Paper 2, while valuable as a domain-specific benchmark for telecom, has narrower impact limited primarily to the telecommunications NLP community. Paper 1's methodological novelty (attention-based reward design for RL jailbreaking) and its implications for AI safety across multiple model families give it higher potential for cross-disciplinary influence.
Paper 2 addresses a critical and timely safety vulnerability in Large Reasoning Models, a rapidly growing area. It provides novel mechanistic insights (attention patterns correlating with jailbreak success) and proposes a principled RL-based attack method with strong empirical results across multiple models and benchmarks. This has broad implications for AI safety research, red-teaming, and alignment. Paper 1, while practically useful, is more of an engineering contribution (a runtime framework for LLM agents) with less fundamental scientific novelty. Safety/security findings tend to have outsized impact as they influence model development practices across the field.
Paper 1 likely has higher scientific impact: it introduces a general, constructive framework (signed-graph modeling and conflict-aware message passing) that can improve robustness and aggregation in multi-agent LLM systems, with broad applicability to coordination, decision support, and ensemble reasoning across domains. Its methodological contribution is more reusable and extensible beyond the specific benchmarks. Paper 2 is novel and timely for AI safety evaluation, but its primary contribution advances offensive jailbreak capability; practical deployment and downstream adoption may be limited by ethical constraints, and its impact may be narrower to security research despite strong relevance.
Paper 1 offers a novel, system-level framework that unifies generative sequence modeling with directed exploration and an explicit safety fallback for a high-stakes, high-volume real-world domain. Its methodological rigor is strengthened by evaluation across public data, simulation, and large-scale online deployment with clear business metrics, supporting reproducibility and practical validity. The potential applications are immediate and economically significant for ad platforms, and the explore–safeguard–select idea may transfer to other safe decision-making/control settings. Paper 2 is timely for AI safety research but is primarily an offensive jailbreak technique with narrower positive deployment pathways.
Paper 1 addresses a timely and critical problem in AI safety, jailbreaking Large Reasoning Models, with a novel approach combining attention-pattern analysis with RL-based attacks. The finding that attention patterns correlate with jailbreak success is a genuinely novel insight with broad implications for AI alignment and safety research. Paper 2 presents an incremental multi-agent approach to NL2SQL, a well-studied problem, achieving competitive but not groundbreaking results on the BIRD benchmark. Paper 1's novelty, timeliness given the rapid deployment of LRMs, and broader impact on AI safety give it higher scientific impact potential.
Paper 2 offers a broadly enabling, constructive contribution: a programmatic representation for generating editable, physically interactable indoor scenes with articulated objects, validated via execution and exported to simulators. This has clear real-world applications in robotics, embodied AI, simulation, and content creation, and its “executable world programs” framing could influence multiple communities (graphics, HCI, planning, sim2real). Paper 1 is novel but primarily advances offensive jailbreak techniques; impact is narrower and may face dissemination/ethical constraints despite relevance to safety research.
Paper 2 addresses a critical and timely AI safety concern—jailbreaking Large Reasoning Models—with a novel mechanistic insight linking attention patterns to attack success, plus a concrete RL-based method. This has broader immediate impact across AI safety, alignment, and model deployment. The attention-guided reward mechanism is innovative and actionable for both offensive and defensive research. Paper 1, while thorough as a benchmarking substrate for agentic delegation, addresses a narrower community and its key finding—that quality is indistinguishable across conditions—limits its immediate transformative impact.
Paper 2 has higher potential impact due to broader, constructive real-world applications: an evidence-grounded literature mapping and hypothesis generation system for nanomedicine, a large translational domain. It introduces a multi-stage, auditable workflow with retrospective benchmarks and human evaluation, suggesting stronger methodological rigor and practical utility across drug delivery, biomaterials, and related fields. Paper 1 is novel but primarily advances jailbreak effectiveness (dual-use, potentially harmful), which may limit dissemination, adoption, and cross-field benefit despite timeliness in LRM safety research.