Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

Kyungmin Park, Taesup Kim

Rank: #502 of 3355 · Artificial Intelligence
Tournament Score: 1485±44
Win Rate: 69% (11 wins, 5 losses, 16 matches)
Rating: 7.4/10
Rated on: Significance, Rigor, Novelty, Clarity

Abstract

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper makes three interconnected contributions. First, it demonstrates that the previously identified "shallow safety" phenomenon — where safety alignment concentrates in the first few output tokens — is a special case of a broader inference-time vulnerability. Short token injections at *any* generation step can substantially redirect safety behavior, not just at the initial prefix. Second, the paper shows that a model's alignment with known refusal directions in hidden-state space does not predict robustness to injection, establishing a disconnect between internal representational state and behavioral resilience. Third, the paper proposes trajectory-level safety alignment: constructing training data by simulating mid-sequence perturbation (bidirectional trajectory augmentation) and training models via preference optimization (SimPO) over these augmented trajectories.

The key conceptual insight — that safety should be understood as a trajectory-level property rather than an initial-token or representation-level property — is both intuitive and well-supported by the experiments. The "refused but didn't resist" finding is particularly compelling: models occupy refusal-like hidden states yet are trivially redirected by injection.
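
The preference-optimization step named above (SimPO) can be made concrete with a small sketch. The function below follows the published form of the SimPO objective as commonly stated: a reference-free loss on length-normalized log-probabilities with a target margin. The function name, the default `beta` and `gamma` values, and the scalar inputs are assumptions chosen for illustration; nothing here is taken from the paper's code.

```python
import math


def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=1.0):
    """SimPO loss for a single preference pair (illustrative sketch).

    logp_*: summed token log-probabilities of each response under the policy.
    len_*:  response lengths in tokens, used for length normalization.
    beta, gamma: reward scale and target margin (placeholder values).
    """
    # Length-normalized implicit rewards; no reference model is involved.
    reward_chosen = beta * logp_chosen / len_chosen
    reward_rejected = beta * logp_rejected / len_rejected
    margin = reward_chosen - reward_rejected - gamma
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In a trajectory-level setting, the chosen and rejected responses would be continuations of the same perturbed prefix, so the loss rewards the safe continuation relative to the harmful one.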

2. Methodological Rigor

The experimental design is thorough and well-controlled. The paper evaluates across three model families (Llama-3.1-8B, Mistral-7B, Qwen2.5-7B), four safety benchmarks (AdvBench as in-domain, plus HarmBench, HEx-PHI, and JailbreakBench as out-of-domain), and two independent safety judges: LlamaGuard (LG) and the OpenAI Moderation API (OM). Using two judges reduces the risk that the results reflect a single judge's bias.
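
As a minimal sketch of what dual-judge reporting involves, the code below computes the attack success rate (the percentage of responses a judge flags as unsafe) separately for two judges, plus how often they agree. The judges are represented only as lists of booleans; the function names are hypothetical, and no real LlamaGuard or OpenAI Moderation calls are made.

```python
def attack_success_rate(unsafe_flags):
    """Percentage of responses that a judge flagged as unsafe."""
    if not unsafe_flags:
        raise ValueError("need at least one judged response")
    return 100.0 * sum(bool(f) for f in unsafe_flags) / len(unsafe_flags)


def dual_judge_report(judge_a_flags, judge_b_flags):
    """ASR under each judge and the percentage of responses they agree on."""
    if len(judge_a_flags) != len(judge_b_flags):
        raise ValueError("both judges must score the same responses")
    agree = [bool(a) == bool(b) for a, b in zip(judge_a_flags, judge_b_flags)]
    return {
        "asr_a": attack_success_rate(judge_a_flags),
        "asr_b": attack_success_rate(judge_b_flags),
        "agreement": 100.0 * sum(agree) / len(agree),
    }
```

Reporting both rates side by side, rather than a single pooled number, makes it visible when a defense looks strong under one judge and weaker under the other.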

The ablation studies are comprehensive, covering injection phrase length, semantic diversity, timing, multiple injections, threshold sensitivity, and scaling to 70B. The finding that even semantically benign phrases ("Let me think") raise the base model's attack success rate (ASR) from 0% to 20% or more strengthens the argument that the vulnerability is structural rather than content-dependent.

However, several methodological choices warrant scrutiny:

  • The prototype computation uses difference-in-means at the last layer only; vulnerability patterns at intermediate layers remain unexplored.
  • The harmful injection sequence is selected via ablation as the strongest attacker, which may overfit evaluation to a specific injection style, though the phrase diversity ablation partially addresses this.
  • All evaluation uses greedy decoding, which limits ecological validity since most deployed systems use temperature-based sampling.
  • The paper trains and evaluates primarily on AdvBench-derived data (520 instructions), which is relatively small.
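
The difference-in-means prototype mentioned in the first bullet can be sketched as follows. This assumes the prototype is the mean hidden state over refused responses minus the mean over complied responses, and that similarity to it is measured by cosine similarity; both are assumptions about the setup, and the plain-list vectors and function names are illustrative only.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_prototype(refusal_states, compliance_states):
    """Difference-in-means direction: mean(refusal) minus mean(compliance)."""
    mu_refuse = mean_vector(refusal_states)
    mu_comply = mean_vector(compliance_states)
    return [r - c for r, c in zip(mu_refuse, mu_comply)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)
```

Running the same computation on hidden states from intermediate layers, not only the last one, is the kind of extension the first bullet points to.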

3. Potential Impact

Practical impact: The method is lightweight: it uses QLoRA on a single RTX 3090 GPU, takes roughly 30-40 minutes per training iteration, and needs only a few hundred trajectory pairs. This puts it within reach of practitioners fine-tuning open-weight models. The iterative refinement procedure that drives ASR toward zero is analogous to iterative adversarial training in computer vision and offers a reusable paradigm.

Transferability to unseen attacks: The strongest practical result is generalization to prefilling attacks (ASR dropping from 84.62% to 0.00% on AdvBench under the LlamaGuard judge), I-GCG suffix attacks, and PAIR semantic attacks, despite training only on a single injection scenario. This suggests the method teaches a general recovery mechanism rather than pattern-matching specific attack signatures.

Conceptual impact: The framing of safety as a trajectory-level property could shift how the community thinks about alignment. Current methods (RLHF, DPO, Circuit Breakers, representation engineering) all operate either at the input level or at fixed internal representations. This paper argues persuasively that the generation dynamics themselves must be part of the training signal.

Broader field influence: The trajectory augmentation idea could extend beyond safety to other alignment properties (factuality, consistency, instruction-following under perturbation). The bidirectional augmentation framework, which trains models both to resist harmful redirection and to recover from unsafe states, is a general technique.
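
One way to picture the two directions of augmentation is as two kinds of preference pair that share a perturbed prefix. The sketch below is only a plausible reading of the description in this review, not the paper's construction: the `PreferencePair` type, both helper functions, and the idea of concatenating an injected phrase onto a partial response are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str    # the original instruction
    prefix: str    # partial response so far, including any injected tokens
    chosen: str    # preferred continuation from this prefix
    rejected: str  # dispreferred continuation from this prefix


def resist_pair(prompt, safe_prefix, injection,
                safe_continuation, harmful_continuation):
    """Direction 1: a safe trajectory is perturbed mid-sequence by an
    injected phrase; prefer the continuation that stays safe."""
    return PreferencePair(prompt, safe_prefix + injection,
                          safe_continuation, harmful_continuation)


def recover_pair(prompt, unsafe_prefix,
                 recovery_continuation, harmful_continuation):
    """Direction 2: the trajectory is already in an unsafe state; prefer
    the continuation that steers back to a safe response."""
    return PreferencePair(prompt, unsafe_prefix,
                          recovery_continuation, harmful_continuation)
```

Because both pair types condition on the same kind of perturbed prefix, the same structure would carry over to other properties (for example, preferring a factual continuation after an injected false premise) by changing what counts as chosen and rejected.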

4. Timeliness & Relevance

This work is highly timely. Prefilling attacks have been shown to achieve near-perfect ASR on major commercial APIs, and the proliferation of open-weight models (Llama 3, Mistral, Qwen) makes inference-time manipulation increasingly practical and a first-order concern. The paper directly addresses a gap identified by Qi et al. (2025) on shallow safety, extending the analysis and providing a concrete defense.

5. Strengths & Limitations

Key Strengths:

  • The diagnostic analysis (Sections 2.2–2.4) is clean and insightful. The "refused but didn't resist" finding with per-bin ASR vs. refusal-prototype similarity is a memorable and important result.
  • Strong out-of-domain generalization across four benchmarks and three attack families.
  • Minimal degradation on general capabilities (MMLU within 1.1 points) and low over-refusal (≤11.6% on XSTest).
  • Scaling demonstration to 70B and cross-lingual transfer (Korean, Swahili) without language-specific augmentation.
  • Reproducible setup: single GPU, public models, detailed hyperparameters.
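
The per-bin analysis praised in the first bullet can be sketched as a simple grouping: bucket samples by their similarity to the refusal prototype, then compute ASR within each bucket. The bin edges, the function name, and the handling of empty bins below are assumptions for illustration, and no values from the paper are reproduced.

```python
def asr_by_similarity_bin(similarities, attack_succeeded, edges):
    """ASR (%) per similarity bin.

    Bins are [edges[i], edges[i+1]); the last bin also includes its upper
    edge. Empty bins yield None. Samples outside all bins are ignored.
    """
    n_bins = len(edges) - 1
    counts = [0] * n_bins
    hits = [0] * n_bins
    for sim, succeeded in zip(similarities, attack_succeeded):
        for i in range(n_bins):
            is_last = i == n_bins - 1
            in_bin = edges[i] <= sim < edges[i + 1]
            if in_bin or (is_last and sim == edges[-1]):
                counts[i] += 1
                hits[i] += int(bool(succeeded))
                break
    return [100.0 * h / c if c else None for h, c in zip(hits, counts)]
```

A "refused but didn't resist" pattern would show up here as ASR that stays high even in the bins with the highest refusal-prototype similarity.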

Key Limitations:

  • The attack model (fixed injection at a threshold-triggered step) is relatively simple compared to adaptive adversaries who might vary injection content, timing, and frequency simultaneously. The multiple-injection ablation shows resilience, but a truly adaptive attacker is not tested.
  • Comparison fairness: Circuit Breakers uses pre-released checkpoints on slightly different base models (Mistral-v0.2 vs. v0.3, Llama-3 vs. 3.1), and LAT is only evaluated on Llama, limiting direct comparison.
  • On Qwen, SafeProbing outperforms trajectory alignment under the OpenAI Moderation (OM) judge in some settings, suggesting model-specific variability.
  • The paper does not evaluate against representation-level attacks (e.g., CAVGAN) that directly manipulate hidden states rather than injecting tokens.
  • The injection framework assumes white-box access to hidden states for triggering augmentation, though the resulting defense does not require this at inference time.

6. Additional Observations

The qualitative examples (Tables 14-15) are compelling: the aligned model absorbs adversarial prefixes and semantically reframes within the same sentence ("here's a detailed guide on how to write a secure script instead of an exploit script"). This suggests the model learns genuine compositional recovery rather than simple pattern suppression.

The dataset sizes are remarkably small (298-342 pairs in iteration 1), which makes the efficiency impressive but also raises the question of whether the method would scale to more diverse attack distributions.

Rating: 7.4/10
Significance 7.5 · Rigor 7.5 · Novelty 7.5 · Clarity 8.5

Generated Jun 5, 2026

Comparison History (16)

    vs. InfoDensity: Rewarding Information-Dense Traces for Efficient Reasoning
    gpt-5.2 · 6/6/2026

    Paper 2 likely has higher impact due to strong timeliness and broad relevance: inference-time attacks on LLM safety are an urgent, widely applicable problem across deployed systems. It reframes “shallow safety” as a general trajectory-level vulnerability, introduces a practically important threat model (mid-sequence token injection), and proposes a training approach (trajectory-based alignment via simulated perturbations) that could influence safety evaluation and alignment methods across labs. Paper 1 is solid and useful for efficiency, but its scope is narrower (reasoning cost/verbosity) and less societally critical than robust safety under adversarial inference-time interventions.

    vs. ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation
    gpt-5.2 · 6/6/2026

    Paper 2 targets a timely, high-stakes problem—LLM safety under inference-time attacks—with clear implications for real-world deployments. Its framing generalizes “shallow safety” into a broader vulnerability class, and it proposes a conceptually direct mitigation (trajectory-level alignment via simulated mid-sequence perturbations) that can influence both academic research and industry practice. The impact spans security, alignment, and evaluation. Paper 1 is innovative for agent adaptivity and self-reconfiguration, but its impact may be more niche and benchmark-dependent compared to broadly relevant safety robustness findings.

    vs. Where does Absolute Position come from in decoder-only Transformers?
    claude-opus-4.6 · 6/5/2026

    Paper 2 provides a fundamental mechanistic understanding of how absolute position information emerges in RoPE-based transformers despite only relative positional encoding being explicitly applied. This addresses a core architectural mystery with broad implications for transformer design, context window extension, and interpretability research. Paper 1 addresses an important but more incremental safety concern—extending the understanding of inference-time vulnerabilities beyond shallow safety. While practically relevant, Paper 2's insights are more foundational, affecting how the community understands and designs transformer architectures, giving it broader and longer-lasting impact.

    vs. Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a critical and highly timely bottleneck in deploying Large Reasoning Models: the massive inference cost and memory footprint of extended reasoning traces. By introducing a method to dynamically select and retain only decision-critical KV cache states, it offers immediate, practical efficiency gains that will significantly impact the scalability and real-world deployment of cutting-edge AI systems across the industry.

    vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher impact because it targets a broadly relevant, timely safety failure mode—inference-time adversarial perturbations across the whole generation trajectory—and proposes a general training remedy (trajectory-based alignment) with clear real-world implications for deployed LLM robustness. Its claims connect to security, alignment, and sequence modeling, with potential downstream influence on evaluation protocols and training curricula. Paper 1 is novel and actionable (role-label artifact), but its mechanism may be more template-specific and narrower in scope compared to trajectory-level robustness that applies across interfaces and deployment settings.

    vs. Minimizing the Hidden Cost of Scales: Graph-Guided Ultra-Low-Bit Quantization for Large Language Models
    gemini-3.1 · 6/5/2026

    While Paper 1 addresses an important AI safety issue, Paper 2 offers a massive leap in practical LLM deployment. Achieving ~1-bit weight quantization with usable perplexity and significantly reduced memory overhead democratizes the use of massive models (e.g., LLaMA-2-70B on a single GPU). This breakthrough in efficiency has immediate, widespread real-world applications and methodological rigor that will highly impact both academia and industry.

    vs. Log analysis is necessary for credible evaluation of AI agents
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a fundamental vulnerability in LLM safety alignment with a novel finding that shallow safety is a special case of broader inference-time vulnerability. It proposes a concrete training methodology (aligning on generation trajectories) with demonstrated improvements. This has immediate implications for AI safety research and deployment. Paper 1 makes an important but more incremental methodological contribution about evaluation practices (log analysis), which, while valuable, is more of a best-practices recommendation than a novel scientific discovery. Paper 2's findings are more likely to spawn follow-up research and influence safety training paradigms.

    vs. SentinelBench: A Benchmark for Long-Running Monitoring Agents
    gpt-5.2 · 6/5/2026

    Paper 2 has higher potential impact: it identifies a broadly applicable and timely safety vulnerability (mid-sequence inference-time token injections) and proposes a concrete mitigation (trajectory-based alignment) with likely relevance across many LLM deployments. Its implications span alignment, robustness, security, and evaluation, and could influence both research directions and practical safety training protocols. Paper 1 is novel and useful as an evaluation benchmark for monitoring agents, but its scope is narrower (web-based long-running monitoring) and impact is primarily in agent evaluation methodology rather than a cross-cutting concern like safety robustness.

    vs. Adaptive auditing of AI systems with anytime-valid guarantees
    gpt-5.2 · 6/5/2026

    Paper 2 is likely to have higher impact because it contributes a broadly applicable, statistically rigorous framework for adaptive auditing with anytime-valid guarantees—directly addressing a major practical bottleneck in AI evaluation. Its methodology (e-processes/SAVI, dueling hypotheses, asymptotic inverse guarantees) is principled and generalizable across models, domains, and audit setups, with clear real-world deployment value under small-sample regimes. Paper 1 is timely and relevant for LLM safety, but its contribution is more specialized to inference-time token-injection robustness and may have narrower cross-field reach.

    vs. Safety Certification is Classification
    gemini-3.1 · 6/5/2026

    Paper 2 addresses a critical, highly timely issue in LLM safety alignment, impacting the widespread deployment of generative AI. By demonstrating vulnerabilities throughout the generation trajectory and proposing a mitigation strategy, it has broad, immediate real-world applications. While Paper 1 offers a strong methodological advance in control theory, Paper 2's focus on LLM inference-time vulnerabilities gives it a higher potential for cross-disciplinary impact and rapid adoption in the current AI landscape.

    vs. Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization
    claude-opus-4.6 · 6/5/2026

    Paper 1 addresses a fundamental and broadly impactful problem in LLM safety alignment, revealing that inference-time vulnerabilities extend beyond 'shallow safety' to any point in generation. This finding has immediate implications for the entire field of AI safety and alignment, affecting all deployed LLMs. The proposed trajectory-based training approach is novel and addresses a critical gap. Paper 2, while solid, addresses a more niche application in molecular optimization with a relatively incremental multi-agent tree-search framework. LLM safety research has broader cross-field impact and higher urgency given widespread deployment.

    vs. Pramana: Fine-Tuning Large Language Models for Epistemic Reasoning through Navya-Nyaya
    gpt-5.2 · 6/5/2026

    Paper 1 likely has higher impact: it targets a widely recognized, high-stakes problem (LLM jailbreaks/inference-time safety) with a general vulnerability claim (token injections at any step), provides a training-time mitigation (trajectory-based alignment via simulated perturbations), and reports generalization to multiple attack regimes—suggesting broader applicability and methodological strength. Its implications span AI safety, deployment robustness, and alignment research and are timely given rapid real-world adoption. Paper 2 is novel culturally/epistemically but appears limited by small training set (55 problems) and narrower immediate applicability.

    vs. MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
    gpt-5.2 · 6/5/2026

    Paper 2 (MIRAGE) has higher likely impact: it introduces a broadly applicable paradigm—implicit latent reasoning plus a generative world-model objective—for efficient, deployable mobile/UI agents, with clear real-world utility (faster, cheaper inference) and strong empirical gains on established benchmarks. Its ideas can transfer across agentic settings (web, robotics, HCI, multimodal planning). Paper 1 is timely and valuable for LLM safety, but is more specialized to robustness against inference-time token injection and may have narrower application scope than MIRAGE’s efficiency and agent-control contributions.

    vs. Where Do Deep-Research Agents Go Wrong? Span-Level Error Localization in Agent Trajectories
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a critical and foundational issue in LLM safety by exposing vulnerabilities beyond 'shallow safety' and proposing a novel trajectory-based alignment method. Because safety and robustness against jailbreaks are paramount for real-world LLM deployment, this fundamental shift in alignment training has broader implications across the AI field. Paper 2 is highly relevant for agentic AI evaluation, but its focus on debugging deep-research agents is comparatively more niche than foundational model safety.

    vs. CP-Agent: Context-Aware Multimodal Reasoning for Cellular Morphological Profiling under Chemical Perturbations
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher scientific impact due to broad relevance and timeliness: inference-time safety vulnerabilities affect nearly all deployed LLMs across domains. It extends “shallow safety” to a more general, actionable threat model (mid-sequence token injections) and proposes a methodology (trajectory-based alignment via simulated perturbations) that could influence alignment training practices widely. Paper 1 is innovative for phenotypic screening and drug discovery, but its impact is narrower (Cell Painting workflows) and depends on dataset access and biological validation; Paper 2’s insights generalize across models and applications.

    vs. ToolGate: Token-Efficient Pre-Call Control for Tool-Augmented Vision-Language Agents
    gpt-5.2 · 6/5/2026

    Paper 2 targets a broadly important and timely problem—LLM safety robustness under inference-time adversarial interventions—extending “shallow safety” into a more general trajectory-level vulnerability and proposing trajectory-based alignment as a mitigation. If validated, this reframes how safety should be trained and evaluated, with implications across security, alignment, and deployment of generative systems. Paper 1 is useful and practical for reducing tool-call costs in VLM agents, but is more incremental and narrower in scope, with impact concentrated in agent efficiency rather than foundational safety concerns.