Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

#505 of 2682 · Artificial Intelligence

Tournament Score: 1480±44 · Win Rate: 71% · Wins: 15 · Losses: 6 · Matches: 21

Rating: 5.8/10

Abstract

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike in instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT. This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper investigates a mechanistically important question: how does chain-of-thought (CoT) reasoning interact with refusal mechanisms in Large Reasoning Models (LRMs)? Prior work (Arditi et al., 2024) established that refusal in standard LLMs is mediated by a single directional subspace in the residual stream. This paper demonstrates that in LRMs, refusal is jointly encoded across two channels—residual stream activations and the generated CoT trace—creating a dual-signal mechanism that is more robust to single-channel interventions.
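
For readers unfamiliar with the single-direction picture, the following is a minimal sketch of the difference-of-means direction and directional ablation described by Arditi et al. (2024). It uses toy three-dimensional vectors in place of real residual stream activations; the function names and the toy data are illustrative assumptions, and nothing here is taken from the reviewed paper's code.

```python
from math import sqrt

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit length."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def refusal_direction(harmful_acts, harmless_acts):
    """Difference of mean activations (harmful minus harmless), normalized."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])

def ablate(x, r_hat):
    """Directional ablation: remove the component of x along r_hat."""
    proj = sum(a * b for a, b in zip(x, r_hat))
    return [a - proj * b for a, b in zip(x, r_hat)]

def add_direction(x, r_hat, alpha):
    """Activation addition: shift x by alpha along r_hat."""
    return [a + alpha * b for a, b in zip(x, r_hat)]

# Toy activations (hypothetical): harmful prompts differ along the first axis.
harmful = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
harmless = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

r_hat = refusal_direction(harmful, harmless)
x = [5.0, 2.0, -1.0]
ablated = ablate(x, r_hat)               # component along r_hat is removed
steered = add_direction(x, r_hat, -4.0)  # pushed away from the refusal direction
print(r_hat, ablated, steered)
```

In a real model the same projection would be applied to hidden states at chosen layers and token positions; the review's point is that in an LRM this intervention alone leaves the refusal signal in the CoT text untouched.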

The key finding is decomposed through a clean experimental progression:

  • Fixed CoT + steering: 39–43% compliance (steering alone is insufficient)
  • No CoT + steering: 70% compliance (CoT actively reinforces refusal)
  • Regenerated CoT + steering: 94% compliance (steering biases CoT generation, which amplifies compliance)
  • Swapped compliant CoT, no steering: 48% compliance (CoT independently carries partial compliance signal)

This decomposition provides a clear mechanistic picture: the CoT is not a passive byproduct but an active participant in the refusal decision, capable of both reinforcing refusal (counteracting steering) and independently carrying compliance signals.

    Methodological Rigor

    The experimental design is well-structured, with each experiment isolating a specific variable (steering presence, CoT condition). The use of 100 harmful instructions from JailbreakBench as a held-out test set with a 0% baseline compliance rate provides a clean starting point.

    However, there are notable methodological concerns:

    1. Single model: All experiments use only DeepSeek-R1-Distill-LLaMA-8B. This is a distilled model, not a natively trained reasoning model, which raises questions about whether the findings reflect properties of CoT reasoning generally or this specific model's training procedure.

    2. Evaluation metric: The paper uses Llama-3.2-8B-Guard as the refusal classifier, which achieves 86% accuracy against human annotations. While better than phrase-based heuristics, this introduces ~14% noise into all measurements, and the gap between conditions (e.g., 39% vs. 43%) may fall within this margin of error.

    3. No statistical significance testing: Results are reported as point estimates without confidence intervals or significance tests. With 100 test examples and an imperfect classifier, the differences between some conditions (e.g., EOI vs. EOT at 39% vs. 43%) may not be statistically meaningful.

    4. Layer selection: The paper shows compliance rates across layers but reports peak performance. The optimal layer varies across conditions, and the paper doesn't address how to select the optimal layer in practice.

    5. Greedy decoding only: Using a single decoding strategy limits understanding of the robustness of findings.
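
To make concerns 2 and 3 concrete, here is a rough sketch of a 95% Wilson score interval for the two fixed-CoT conditions. It assumes the reported 39% and 43% correspond to 39 and 43 compliant responses out of the 100 test instructions, and it ignores classifier error entirely; both are simplifying assumptions, not figures from the paper.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Assumed counts: 39/100 (EOI) and 43/100 (EOT).
eoi = wilson_interval(39, 100)
eot = wilson_interval(43, 100)

# Two intervals overlap when each starts before the other ends.
overlap = eoi[0] <= eot[1] and eot[0] <= eoi[1]

print(f"EOI: {eoi[0]:.3f} to {eoi[1]:.3f}")
print(f"EOT: {eot[0]:.3f} to {eot[1]:.3f}")
print("intervals overlap:", overlap)
```

Each interval spans close to ten percentage points on either side of its point estimate, so a four-point gap sits well inside the sampling uncertainty. Because both conditions are presumably run on the same prompts, a paired test such as McNemar's would be the more appropriate formal check.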

    Potential Impact

    The paper has implications for both AI safety and mechanistic interpretability:

    Safety implications: The finding that CoT creates a dual encoding of refusal cuts both ways. On one hand, it suggests LRMs are more robust to activation-level jailbreaking attacks (39% compliance with a fixed CoT, versus the high success rates typically reported for instruction-tuned LLMs). On the other hand, it reveals that CoT manipulation (a surface-level attack) can serve as an alternative attack vector: the 48% compliance from CoT swapping alone demonstrates this vulnerability.

    Mechanistic understanding: The paper extends our understanding of how safety behaviors are encoded in reasoning models. The insight that refusal is distributed across tokens (including self-generated tokens) rather than localized at template positions challenges the simple linear probe view of safety mechanisms.

    Practical defense implications: The paper suggests that monitoring the residual stream for refusal signals while constraining the CoT could be an effective defense strategy. The suggestion is actionable but untested.

    Timeliness & Relevance

    This paper addresses a very current need. With the rapid deployment of reasoning models (DeepSeek-R1, OpenAI o-series), understanding how safety mechanisms operate in these models is critical. The paper correctly identifies that prior work on refusal mechanisms focused on standard instruction-tuned LLMs, and the extension to LRMs fills a timely gap. The security implications are particularly relevant given growing concerns about jailbreaking attacks on deployed systems.

    Strengths

    1. Clean experimental isolation: The four-condition design effectively decomposes the contributions of steering and CoT to refusal behavior.

    2. Novel insight: The dual-signal mechanism is a genuinely new observation that extends our mechanistic understanding beyond prior work.

    3. Clear narrative: The paper builds a logical progression through experiments, with each condition motivated by the previous result.

    4. Practical relevance: The findings have direct implications for both attack and defense strategies for reasoning models.

    5. Informative appendices: The qualitative examples in the appendices effectively illustrate the different behaviors across conditions, including the interesting phenomenon of "CoT leakage" where harmful reasoning appears in the think trace despite final refusal.

    Limitations & Weaknesses

    1. Single model, single family: The biggest limitation. Without testing on other LRMs (e.g., QwQ, other DeepSeek variants, or different distillation targets), the generalizability is unknown.

    2. No faithfulness analysis: The paper acknowledges but does not address whether the CoT is causally responsible for the final output or merely correlated. This is crucial for the mechanistic claims.

    3. Limited scale: 100 test examples is relatively small for robust conclusions.

    4. No comparison with concurrent work: Yamaguchi et al. (2025) also study activation steering in reasoning models. A direct comparison would strengthen positioning.

    5. The "dual-signal" framing may be an oversimplification: The CoT is itself generated by the model's activations, so the two signals are not truly independent. The paper doesn't fully address this circularity.

    6. Missing analysis of CoT content: Beyond brief qualitative inspection, there's no systematic analysis of what features of the CoT carry the compliance/refusal signal.

    7. No exploration of scaling: The single 8B model tested is relatively small; it's unclear if larger models would show the same pattern.

    Overall Assessment

    This is a focused, well-executed empirical study that makes a clear and timely contribution to understanding safety mechanisms in reasoning models. The core finding—that CoT creates a distributed refusal mechanism resistant to simple activation steering—is novel and practically relevant. However, the single-model limitation significantly constrains the generalizability of the claims, and the lack of statistical analysis and faithfulness investigation weakens the mechanistic conclusions. The paper would benefit substantially from multi-model validation and more rigorous statistical treatment.

    Rating: 5.8/10
    Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 7.5

    Generated May 27, 2026

    Comparison History (21)

    vs. A Unified Framework for the Evaluation of LLM Agentic Capabilities
    claude-opus-4.6 · 5/28/2026

    Paper 1 offers a more novel and fundamental insight into how chain-of-thought reasoning interacts with safety mechanisms in large reasoning models, revealing a dual encoding of refusal that has significant implications for AI safety and alignment. The finding that CoT can independently carry compliance signals is a mechanistic discovery with broad theoretical impact. Paper 2, while practically useful as an engineering contribution for standardizing agentic benchmarks, is more incremental—unification frameworks tend to have shorter-lived impact as benchmarks evolve rapidly. Paper 1's findings are more likely to influence future safety research directions.

    vs. AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
    claude-opus-4.6 · 5/28/2026

    AutoScientists presents a broadly applicable framework for autonomous scientific experimentation with demonstrated improvements across multiple domains (biomedical ML, language model optimization, protein fitness prediction). Its practical impact potential is substantial—automating and accelerating scientific discovery across diverse fields. Paper 1, while offering important mechanistic insights about refusal in reasoning models, addresses a narrower topic (AI safety/alignment of a specific model architecture) with findings that, though interesting, are more incremental. Paper 2's breadth of validated applications, state-of-the-art improvements, and paradigm-shifting approach to AI-driven research give it significantly higher impact potential.

    vs. Calibrating Conservatism for Scalable Oversight
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher impact due to a more novel, end-to-end oversight framework with formal finite-time guarantees (distribution-free conformal calibration) and demonstrated effectiveness in sequential, agentic settings—directly addressing a timely core problem in AI alignment/control with clear real-world applicability. Its methodology combines theory and empirical evaluation across two meaningful benchmarks, suggesting broader cross-field relevance (RL safety, governance, statistics). Paper 2 offers important mechanistic insight into refusal/steering in LRMs and highlights a new attack surface, but is narrower in scope and less directly translated into general control solutions.

    vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses a fundamental mechanistic question about how chain-of-thought reasoning interacts with safety mechanisms (refusal) in large reasoning models, revealing that refusal is jointly encoded in activations and CoT. This has broad implications for AI safety, interpretability, and alignment research. The finding that CoT both reinforces safety mechanisms and creates new attack surfaces is novel and timely given the rapid deployment of reasoning models. Paper 1, while thorough as a benchmark contribution, is more incremental—extending evaluation methodology for agents—and benchmarks tend to have shorter-lived impact than mechanistic insights.

    vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
    gpt-5.2 · 5/27/2026

    Paper 2 likely has higher impact: it identifies a broadly relevant failure mode (monitoring-control gap) in retrieval-augmented LLMs, evaluated at scale (50k+ turn-level evals) across multiple model families, with human validation and converging mechanistic analyses. The results directly affect real-world, high-stakes RAG deployments and challenge common evaluation assumptions, making it timely and widely applicable across safety, HCI, and applied NLP. Paper 1 is novel mechanistically for LRMs/CoT steering, but is narrower in scope and model-specific, with more limited immediate deployment implications.

    vs. StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and timely question about the safety and controllability of large reasoning models (LRMs), revealing that chain-of-thought creates a dual encoding of refusal that both strengthens robustness against activation steering and exposes new attack surfaces. This has broad implications for AI safety, alignment, and mechanistic interpretability—fields of intense current interest. Paper 2 presents a useful but more incremental contribution to agent RL credit assignment with narrower scope and limited model scales. Paper 1's insights about CoT's role in safety mechanisms are likely to influence multiple research directions more broadly.

    vs. TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a fundamental issue in AI safety and alignment for Large Reasoning Models, exploring how Chain-of-Thought interacts with refusal mechanisms. Its insights into mechanistic interpretability and model steering have broad implications for the secure deployment of advanced AI systems across all domains. While Paper 1 provides a valuable tool and benchmark for academic peer review, Paper 2's focus on the core mechanics of reasoning and safety in LRMs offers a wider and more foundational scientific impact.

    vs. Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses a more novel and timely question about AI safety and mechanistic interpretability of reasoning models. It reveals that chain-of-thought in LRMs creates a dual encoding of refusal (in activations and CoT), making simple steering attacks less effective but exposing new attack surfaces. This has significant implications for AI alignment and safety research. Paper 1, while methodologically sound, tests a relatively incremental question about code execution vs. CoT robustness on grade-school math with non-significant results on a single model, limiting its broader impact.

    vs. 2-ASP(Q) programs with weak constraints: Complexity and efficient implementation
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a highly timely and broadly relevant topic—safety and alignment of large reasoning models (LRMs)—revealing that chain-of-thought reasoning fundamentally changes how refusal mechanisms work and complicates existing steering interventions. This has immediate implications for AI safety, red-teaming, and alignment research, which are areas of intense current interest across multiple communities. Paper 2 makes solid theoretical and practical contributions to ASP(Q) with weak constraints, but targets a narrower audience in computational logic and knowledge representation, limiting its broader impact.

    vs. Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems
    gemini-3.1 · 5/27/2026

    Paper 1 addresses AI safety and control mechanisms in cutting-edge Large Reasoning Models (LRMs). Given the rapid adoption of CoT-based models (e.g., DeepSeek-R1), understanding their vulnerability to steering and jailbreaking has immediate, high-stakes implications for AI alignment and security. Paper 2, while methodologically rigorous, focuses on a more niche reinforcement learning problem with simulated environments, making its potential impact narrower and less urgent compared to the global relevance of LLM safety.

    vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs
    gemini-3.1 · 5/27/2026

    Paper 2 investigates the safety and refusal mechanisms of emerging Large Reasoning Models (LRMs), revealing that Chain-of-Thought acts as an independent, dynamic state reinforcing refusal. This provides novel, fundamental insights into the mechanistic interpretability of a rapidly growing class of models, offering broader implications for AI safety, adversarial attacks, and alignment than the applied, framework-based approach of Paper 1.

    vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and timely question about the safety and controllability of large reasoning models (LRMs), revealing that chain-of-thought creates a dual encoding of refusal that complicates existing alignment techniques. This has immediate implications for AI safety research and the rapidly growing deployment of reasoning models like DeepSeek-R1 and OpenAI o1. The finding that CoT both strengthens robustness against activation steering but opens new attack surfaces is novel and consequential. Paper 2 presents a useful engineering contribution for multi-agent scaling, but its impact is more incremental within the agent framework literature.

    vs. JobBench: Aligning Agent Work With Human Will
    gemini-3.1 · 5/27/2026

    Paper 1 investigates the mechanistic interpretability of Large Reasoning Models (LRMs), a highly timely frontier in AI. Its discovery that Chain-of-Thought traces dynamically interact with residual streams to encode refusal presents a fundamental shift in AI safety and steering. While Paper 2 introduces a valuable, human-centric benchmark for AI agents, Paper 1 offers deeper methodological insights into the internal mechanics and vulnerabilities of state-of-the-art models. This mechanistic discovery will have a broader and more immediate scientific impact on fundamental AI architecture, alignment, and security research.

    vs. From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation
    gpt-5.2 · 5/27/2026

    Paper 2 is more novel and broadly impactful: it identifies a distinct control/failure mode specific to large reasoning models where chain-of-thought jointly encodes refusal/compliance, quantifies intervention effects, and reveals a new attack surface relevant to AI safety, alignment, and interpretability. The methodological contribution (two-stage intervention disentangling activations vs regenerated CoT) is timely and generalizable across safety research and model control. Paper 1 is valuable and applied, but is narrower to legal indicator computation and depends on an in-house corpus, limiting breadth and likely downstream impact compared to core LRM safety/control insights.

    vs. What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation
    gemini-3.1 · 5/27/2026

    Paper 2 fundamentally challenges the prevailing assumption that Chain-of-Thought efficacy stems from logical derivation, demonstrating it relies primarily on short-range token co-occurrence. This insight has broad, paradigm-shifting implications for understanding LLM reasoning mechanisms across the entire field. Paper 1 offers valuable insights into AI safety and refusal mechanisms via activation steering, but its scope is narrower and more specialized compared to the foundational nature of Paper 2's findings.

    vs. The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a timely and critical topic—the safety and controllability of large reasoning models (LRMs)—which is highly relevant given the rapid deployment of CoT-based models like DeepSeek-R1 and OpenAI's o-series. It reveals a fundamental mechanistic insight: that chain-of-thought creates a dual encoding of refusal that makes simple activation steering insufficient, while also exposing a new attack surface. This has broad implications for AI safety, alignment, and interpretability research. Paper 2 offers useful but more incremental insights about KG utility in hypothesis generation within a narrower domain (battery materials). Paper 1's findings are more likely to influence safety practices and future model design across the field.

    vs. Proper Scoring Rules for Agentic Uncertainty Quantification
    claude-opus-4.6 · 5/27/2026

    Paper 1 introduces a rigorous mathematical framework (Trajectory Proper Score) addressing a fundamental gap in evaluating uncertainty quantification for language model agents—a rapidly growing area. Its contributions are broadly applicable across agentic AI systems, providing theoretical guarantees (strict properness proofs) and practical tools (censored trajectory handling). Paper 2 offers valuable empirical insights into refusal mechanisms in reasoning models, but its scope is narrower (specific to activation steering of a single model family) and more incremental. Paper 1's methodological contribution has broader potential to shape evaluation standards across the field.

    vs. Generating Robust Portfolios of Optimization Models using Large Language Models
    gpt-5.2 · 5/27/2026

    Paper 1 offers a broadly applicable, timely framework for using LLMs to generate robust portfolios of optimization models, addressing a major real-world bottleneck in operations research and decision support. It claims theoretical guarantees under clear alignment assumptions and demonstrates empirical validation across tasks, suggesting methodological rigor and generality. Its impact could span optimization, AI-assisted modeling, and human-in-the-loop systems with immediate practical relevance. Paper 2 is novel and relevant for mechanistic interpretability and safety, but is narrower (focused on refusal steering in a specific LRM) and primarily exposes an attack surface rather than delivering a generalizable constructive method.

    vs. LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
    claude-opus-4.6 · 5/27/2026

    Paper 1 addresses a fundamental and timely question about the mechanistic nature of refusal in large reasoning models, revealing that chain-of-thought creates a dual encoding of safety behaviors that complicates existing control mechanisms. This has significant implications for AI safety, interpretability, and alignment research—areas of critical importance as reasoning models become widespread. Paper 2, while useful as a benchmark contribution, is more incremental (another evaluation benchmark) and its impact is bounded by the typical lifecycle of benchmarks. Paper 1's insights about CoT as an independent carrier of compliance/refusal signals open new research directions in both safety and mechanistic interpretability.

    vs. From Model Scaling to System Scaling: Scaling the Harness in Agentic AI
    gemini-3.1 · 5/27/2026

    Paper 1 provides concrete, novel empirical insights into the internal mechanics of Large Reasoning Models, specifically how Chain-of-Thought interacts with activation steering and refusal. Its rigorous mechanistic approach directly addresses critical AI safety and alignment challenges in cutting-edge models. Paper 2, while relevant, is primarily a position paper on system architecture, offering conceptual frameworks rather than novel foundational discoveries. Thus, Paper 1's specific methodological breakthroughs offer higher potential for scientific impact.