Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

May 28, 2026

arXiv:2605.29396v1

cs.AI (primary)

#816 of 2821 · Artificial Intelligence

Tournament Score: 1453 ± 45 (scale 1050 to 1800)

Win Rate: 60%

Rating: 5.2 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper addresses the fragility of safety-aligned LLMs under post-alignment perturbations (weight noise, activation noise, quantization) and proposes an optimizer-centric solution. The core contribution is a hybrid FO-ZO framework: first-order (FO) optimization performs standard safety alignment, then zeroth-order (ZO) optimization is applied as a refinement stage to improve robustness. The key insight is that ZO optimization inherently evaluates the loss under parameter perturbations, effectively optimizing a smoothed objective that encourages flatter minima around the safety-aligned solution. Additionally, the paper introduces a robustness-aware layer selection mechanism that exploits the perturbation-based nature of ZO to identify which layers are most vulnerable to safety degradation, concentrating refinement updates on those layers.

The conceptual link between ZO optimization and robustness is well-motivated: ZO gradient estimates naturally probe the loss landscape in a neighborhood around current parameters, which aligns with the goal of making safety alignment robust to small parameter perturbations.
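The idea that a ZO estimator probes the loss in a neighborhood of the current parameters can be made concrete with a small sketch. The code below is not the paper's implementation: it is a generic two-point Gaussian estimator applied to a toy quadratic, and the names `zo_gradient`, `zo_refine`, `toy_loss`, `CURVATURE`, and `theta_fo`, as well as the step size, smoothing radius `sigma`, and number of directions, are assumptions chosen for illustration. Only the step count of 10 mirrors the review.

```python
import numpy as np


def zo_gradient(loss, theta, sigma, rng, num_dirs=8):
    """Two-point zeroth-order gradient estimate built from loss values only."""
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = rng.standard_normal(theta.shape)  # random probe direction
        # Central finite difference of the loss along u.
        slope = (loss(theta + sigma * u) - loss(theta - sigma * u)) / (2.0 * sigma)
        grad += slope * u
    return grad / num_dirs


def zo_refine(loss, theta, steps=10, lr=0.05, sigma=0.1, seed=0):
    """Run a few ZO descent steps starting from an already-trained theta."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta, dtype=float)  # copy so the input is not modified
    for _ in range(steps):
        theta -= lr * zo_gradient(loss, theta, sigma, rng)
    return theta


# Toy stand-in for a safety loss (assumption): a bowl with uneven curvature.
CURVATURE = np.array([1.0, 0.5, 0.25, 0.1])


def toy_loss(theta):
    return 0.5 * float(np.sum(CURVATURE * theta ** 2))


theta_fo = np.array([0.3, -0.2, 0.1, 0.4])  # pretend FO alignment ended here
theta_zo = zo_refine(toy_loss, theta_fo)
print(toy_loss(theta_fo), toy_loss(theta_zo))
```

Every update uses only forward evaluations at perturbed parameters, which is why the same evaluations can double as a robustness signal. On a real model, `loss` would be a forward pass over a batch of safety data rather than a closed-form function.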

2. Methodological Rigor

Theoretical analysis: The paper provides two theoretical results: (1) a convergence guarantee for ZO optimization under the Polyak-Łojasiewicz (PL) condition, which is used to justify applying ZO as a refinement stage rather than as a replacement for FO (Theorem 4.2), and (2) a one-step robustness improvement guarantee showing that ZO refinement can reduce the perturbation robustness gap even at FO stationary points (Theorem 4.4). While these results are technically sound, they rely on standard assumptions (Lipschitz continuity, smoothness, PL condition), and the proofs follow relatively standard ZO analysis patterns. The assumption that the FO solution is a stationary point (∇f(θ_fo) = 0) is idealized: in practice, FO alignment rarely reaches exact stationarity after only 100 steps.
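For reference, the standard textbook forms of the objects named above are written out below. These are not the paper's exact theorem statements, which are not reproduced in this assessment. The symbols μ (PL constant), f* (optimal value), and σ (smoothing radius) are notation assumed here, and the expansion assumes f is smooth enough for a second-order Taylor argument.

```latex
% PL condition with constant \mu > 0 and optimal value f^{*}:
\[
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^{2} \;\ge\; \mu\,\bigl(f(\theta) - f^{*}\bigr)
\quad \text{for all } \theta .
\]

% Gaussian-smoothed objective targeted by a ZO estimator with radius \sigma:
\[
f_{\sigma}(\theta) \;=\; \mathbb{E}_{u \sim \mathcal{N}(0, I)}\bigl[\, f(\theta + \sigma u) \,\bigr].
\]

% Second-order expansion (the first-order term vanishes because E[u] = 0):
\[
f_{\sigma}(\theta) \;\approx\; f(\theta) + \frac{\sigma^{2}}{2}\,\operatorname{tr}\!\bigl(\nabla^{2} f(\theta)\bigr),
\qquad
\nabla f_{\sigma}(\theta) \;\approx\; \nabla f(\theta) + \frac{\sigma^{2}}{2}\,\nabla \operatorname{tr}\!\bigl(\nabla^{2} f(\theta)\bigr).
\]
```

Under this reading, at an exact FO stationary point the first term of the smoothed gradient vanishes while the curvature term generally does not. That is one way to interpret the claim that ZO refinement can still reduce the robustness gap there.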

Experimental evaluation: Experiments are conducted on two models (Llama-3-8B-Instruct, Qwen2-7B-Instruct) with three perturbation types and multiple safety benchmarks (HarmBench, LlamaGuard3, AdvBench). The evaluation is reasonably comprehensive. However, several concerns arise:

The absolute improvements from ZO refinement are often quite small (e.g., ASR reductions of 0.01-0.07 in many settings), and some entries show no change or slight degradation.

Only 10 ZO refinement steps are used, which is a strength for efficiency but raises questions about the magnitude of achievable improvements.

The set of comparison baselines is limited: the paper does not compare against existing robust alignment methods (e.g., Vaccine, representation noising, SafeRLHF) that target the same problem.

The perturbation model (random Gaussian noise to parameters/activations) is relatively simple compared to adversarial fine-tuning attacks or targeted jailbreaks.
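The random-noise perturbation model discussed above can be sketched as a Monte Carlo estimate of how much a loss rises when Gaussian noise is added to the parameters. This is an illustrative sketch only: `robustness_gap`, `make_bowl`, the noise scale, and the sample count are hypothetical, and the quadratic bowls stand in for a real safety loss such as one derived from ASR.

```python
import numpy as np


def robustness_gap(loss, theta, sigma, num_samples=256, seed=0):
    """Monte Carlo estimate of E[loss(theta + sigma*u)] - loss(theta), u ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    base = loss(theta)
    noisy = [
        loss(theta + sigma * rng.standard_normal(theta.shape))
        for _ in range(num_samples)
    ]
    return float(np.mean(noisy) - base)


def make_bowl(curvature):
    """Hypothetical loss with its minimum at zero; larger curvature is sharper."""
    return lambda theta: 0.5 * curvature * float(np.sum(theta ** 2))


theta_star = np.zeros(16)  # both bowls share the same minimizer
gap_sharp = robustness_gap(make_bowl(10.0), theta_star, sigma=0.05)
gap_flat = robustness_gap(make_bowl(1.0), theta_star, sigma=0.05)
print(gap_sharp, gap_flat)
```

Both bowls have the same loss at the minimizer, yet the sharper one degrades more under identical noise draws. This is the flat-minimum intuition behind the approach, and it also shows why unstructured noise is a weaker test than a targeted attack: the estimate averages over directions instead of searching for the worst one.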

3. Potential Impact

Practical relevance: The framework is lightweight (13.3% additional training time, 0.32× memory) and can be applied as a plug-in post-processing step after any FO-based safety alignment. This modularity is appealing for practitioners. The robustness-aware layer selection provides interpretable insights about which layers are critical for safety robustness.
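One way to picture robustness-aware layer selection is to perturb one layer at a time and rank layers by the resulting loss increase. The sketch below follows that simple scheme under stated assumptions: the paper is described as reusing ZO perturbation evaluations for this purpose, and its exact scoring rule is not given here, so `layer_sensitivity`, `select_layers`, `LAYER_WEIGHT`, and the toy loss are all hypothetical.

```python
import numpy as np


def layer_sensitivity(loss, layers, sigma, num_samples=32, seed=0):
    """Mean loss increase when Gaussian noise is added to one layer at a time."""
    rng = np.random.default_rng(seed)
    base = loss(layers)
    scores = {}
    for name, weights in layers.items():
        increases = []
        for _ in range(num_samples):
            perturbed = dict(layers)  # shallow copy; other layers stay untouched
            perturbed[name] = weights + sigma * rng.standard_normal(weights.shape)
            increases.append(loss(perturbed) - base)
        scores[name] = float(np.mean(increases))
    return scores


def select_layers(scores, k):
    """Return the k layer names with the largest sensitivity scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy model (assumption): each layer contributes to the loss with a fixed weight.
LAYER_WEIGHT = {"layer_0": 0.01, "layer_1": 1.0, "layer_2": 100.0}


def toy_loss(layers):
    return sum(LAYER_WEIGHT[n] * float(np.sum(w ** 2)) for n, w in layers.items())


layers = {name: np.zeros(16) for name in LAYER_WEIGHT}
scores = layer_sensitivity(toy_loss, layers, sigma=0.05)
top = select_layers(scores, k=1)
print(scores, top)
```

Restricting later refinement steps to the selected layers is what would keep the overhead modest. The scores also serve as a diagnostic, independent of whether any refinement is run afterwards.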

Scope limitations: The threat model is restricted to random/unstructured perturbations. Real-world safety failures more commonly arise from adversarial fine-tuning, jailbreak prompts, or targeted weight manipulation. The paper does not evaluate against these more realistic attack vectors. The improvements against quantization (a highly practical deployment scenario) are more compelling but still modest.

Broader influence: The optimizer-centric perspective is a genuinely underexplored angle in the alignment robustness literature. This could inspire follow-up work on optimizer design for alignment, beyond the specific FO-ZO hybrid proposed here.

4. Timeliness & Relevance

The paper is timely: LLM safety alignment is a critical and active research area, and the fragility of alignment under post-deployment modifications (quantization, noise) is a recognized problem. The connection to ZO optimization capitalizes on recent interest in ZO methods for LLM fine-tuning. However, the concurrent work by Lang et al. (2026, cited as [26]) on ZO for LLM unlearning robustness somewhat diminishes the novelty of the optimizer-centric perspective.

5. Strengths & Limitations

Strengths:

Novel and well-motivated perspective connecting ZO optimization to alignment robustness through the smoothed objective interpretation

Lightweight and modular—can be applied post-hoc to any aligned model

Both theoretical and empirical support for the approach

Robustness-aware layer selection is practical and principled, with clear comparison showing it outperforms pruning-based alternatives (SNIP, WANDA)

Efficiency is demonstrated with concrete runtime/memory measurements

Limitations:

Magnitude of robustness improvements is often marginal (many ASR changes < 0.03)

No comparison against existing robust alignment baselines (Vaccine [18], representation noising [19], dual-objective optimization [17])

Threat model limited to random perturbations; no evaluation against adversarial fine-tuning or jailbreak attacks

Only two models tested, both in the 7-8B range; scalability to larger models is unknown

The theoretical analysis, while correct, offers limited additional insight beyond standard ZO convergence theory

Some results show no improvement or slight degradation (e.g., AdvBench under W4A4 quantization in Table 4)

The W4A4 quantization results on Qwen2 show catastrophic degradation (PPL > 2000) regardless of alignment method, suggesting fundamental model limitations that ZO refinement cannot address

Additional Observations

The paper's claim of being "the first to study robustness of safety alignment from the perspective of the base optimizer" is somewhat overstated given concurrent work [26]. The robustness-aware layer selection (Section 4.2) may be more impactful than the ZO refinement itself, as it provides a diagnostic tool for identifying safety-critical layers. The experimental protocol of using only 800 training samples and 100 FO steps is somewhat unusual for safety alignment and may affect generalizability of findings.

Rating: 5.2 / 10

Significance 5.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated May 29, 2026

Comparison History (15)

vs. Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

claude-opus-4.6 · 5/29/2026

Paper 1 demonstrates higher scientific impact through several factors: (1) it produces a tangible, large-scale scientific resource—the largest integrated marine Pb database—with immediate utility for oceanography and environmental science; (2) it introduces a generalizable expert-guided LLM framework applicable across geosciences; (3) it bridges AI and domain science with rigorous validation (92% expert-verified accuracy); (4) it has broad interdisciplinary impact spanning NLP, marine science, and environmental monitoring. Paper 2, while technically sound in improving LLM safety robustness via zeroth-order optimization, addresses a narrower problem within AI safety with less cross-disciplinary reach.

vs. MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it introduces a new benchmark and dataset (16k human judgments) targeting an under-evaluated but crucial property—action-conditioned reliability of robotic world models—yielding broadly applicable diagnostics and findings across 12 diverse systems (open/closed, text/vector, scales). Benchmarks often become community standards, influencing many downstream works in robotics, world modeling, and evaluation. Paper 1 is novel in optimizer-centric robustness for LLM safety, but its scope is narrower (alignment robustness) and may see more incremental adoption compared to a widely usable evaluation suite.

vs. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical, ecosystem-level problem of multi-model self-consuming loops and presents a highly counter-intuitive finding that human curation can backfire. This has profound implications for how future foundation models will be trained on web data, likely sparking widespread follow-up research across the AI community. Paper 2 offers a valuable but narrower technical optimization solution for safety fragility.

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it targets LLM safety robustness, a timely, widely relevant problem affecting deployment across many domains. Its optimizer-centric framing is relatively novel and proposes a broadly applicable hybrid first-order + zeroth-order refinement with theoretical and empirical support, potentially influencing both alignment research and robust optimization practice. Paper 1 is innovative and valuable for materials discovery, but its impact is more domain-specific, whereas Paper 2’s methods and implications can transfer across models, modalities, and safety-critical applications.

vs. VikingMem: A Memory Base Management System for Stateful LLM-based Applications

claude-opus-4.6 · 5/29/2026

Paper 1 addresses a fundamental and timely problem in LLM safety alignment robustness from a novel optimizer-centric perspective, introducing zeroth-order optimization as a principled approach. It offers both theoretical and empirical contributions, with broad implications for all safety-aligned LLMs. Paper 2 presents an engineering-oriented memory management system that, while practically useful, is more incremental and application-specific. Paper 1's novelty in connecting optimization theory to safety robustness and its potential to influence future alignment research gives it higher scientific impact.

vs. MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

gemini-3.1 · 5/29/2026

Paper 1 tackles the critical and timely issue of LLM safety robustness. By introducing zeroth-order optimization to prevent safety degradation from perturbations like quantization or noise, it provides a novel theoretical and methodological contribution to AI alignment. While Paper 2 offers highly valuable practical improvements for training efficiency and data curation, Paper 1 addresses fundamental safety vulnerabilities, giving it broader societal relevance and higher potential scientific impact in the high-stakes field of AI safety.

vs. Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

gpt-5.2 · 5/29/2026

Paper 1 introduces an optimizer-centric lens on safety robustness—an underexplored angle—using zeroth-order refinement to explicitly optimize alignment under perturbations, with theoretical and empirical support plus an efficiency improvement via layer-wise sensitivity. This targets a high-stakes, timely problem (robust LLM safety under deployment transformations like quantization/noise) with clear real-world implications and potentially broad impact across alignment, robustness, and optimization. Paper 2 is a clever, low-cost RLVR tweak with practical gains, but is more incremental and narrower in scope.

vs. Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation

claude-opus-4.6 · 5/29/2026

Paper 2 addresses a fundamental and broadly impactful problem—the fragility of LLM safety alignment—with a novel optimizer-centric perspective (zeroth-order optimization) that is both theoretically grounded and practically applicable. Safety robustness is a critical concern across the entire LLM ecosystem, giving it broader impact. Paper 1, while valuable, is more narrowly focused on web code generation evaluation, a niche within LLM benchmarking. Paper 2's methodological innovation (hybrid first-order/zeroth-order framework with layer-wise sensitivity estimation) offers generalizable insights applicable beyond safety alignment to robust optimization more broadly.

vs. HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

claude-opus-4.6 · 5/29/2026

Paper 1 introduces a novel optimizer-centric perspective on LLM safety robustness, proposing a concrete hybrid zeroth-order optimization framework with both theoretical and empirical contributions. This addresses a fundamental and timely problem—fragility of safety alignment—with a methodologically innovative approach that opens a new research direction. Paper 2, while useful as a benchmarking framework for hybrid-reasoning mode switching, is primarily a systematization and comparison of existing methods rather than introducing a fundamentally new technique. Paper 1's novelty, theoretical grounding, and direct impact on the critical problem of AI safety give it higher potential impact.

vs. Measuring Progress Toward AGI: A Cognitive Framework

claude-opus-4.6 · 5/29/2026

Paper 1 addresses a fundamental gap in AI research—how to measure progress toward AGI—by proposing a comprehensive cognitive taxonomy and evaluation framework grounded in decades of cognitive science. This has broader cross-disciplinary impact spanning AI, cognitive science, policy, and governance. Its timeliness is high given current AGI debates. Paper 2, while technically solid and addressing an important LLM safety robustness problem with a novel optimizer-centric approach, is more incremental and narrower in scope, focused on a specific technical improvement within the existing safety alignment paradigm.

vs. You Live More Than Once: Towards Hierarchical Skill Meta-Evolving

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability: robustness of LLM safety alignment is a widely recognized, high-stakes problem affecting deployment across domains. Its optimizer-centric framing and hybrid first-/zeroth-order refinement is a clear conceptual contribution with direct practical implications (robustness to quantization/noise) and potential to influence alignment practice and tooling. The inclusion of theoretical and empirical evidence plus an efficiency method (layer-wise sensitivity) suggests solid rigor. Paper 1 is innovative for agent skill meta-evolving, but its impact is narrower and more benchmark-dependent.

vs. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

gpt-5.2 · 5/29/2026

Paper 2 is likely higher impact: it introduces a broadly applicable, conceptually novel training signal (Belief Entropy) for diagnosing and optimizing memory quality in long-horizon LLM agents, addressing a central bottleneck for agentic systems. The approach has clear real-world applications (autonomous assistants, tool-using agents) and strong timeliness given rapid growth of long-context and agent research. Its proxy-based, fine-grained supervision may generalize across tasks and architectures. Paper 1 is valuable and novel optimizer-centrically, but its impact is narrower (safety robustness under perturbations) and may be more incremental to existing alignment/robustness lines.

vs. Defending LLM-based Multi-Agent Systems Against Cooperative Attacks with Sentence-Level Rectification

gpt-5.2 · 5/29/2026

Paper 1 likely has higher scientific impact: it introduces an optimizer-centric angle on safety alignment robustness and leverages zeroth-order refinement to directly optimize robustness under perturbations, with theoretical support and an efficiency improvement via layer-wise sensitivity—broadly applicable across alignment methods and deployment perturbations (noise, quantization). Paper 2 addresses an important, timely MAS security setting, but the contribution is more application-specific (agent communication defenses/attacks) and may generalize less beyond multi-agent prompting setups. Overall, Paper 1’s methodological novelty and cross-cutting relevance to LLM safety/robustness are stronger.

vs. MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

claude-opus-4.6 · 5/29/2026

MolLingo addresses the high-impact problem of AI-driven molecular design with a novel multi-agent framework combining chemically meaningful representations (BFE), multi-agent coordination, and biologically grounded reasoning. It demonstrates strong empirical results across four benchmarks, including state-of-the-art on TOMG-Bench and a fourfold docking score improvement over GPT-5.4. Its practical applications in drug discovery give it broader real-world impact. Paper 1, while addressing an important LLM safety robustness problem with a novel optimizer-centric perspective, has a narrower scope primarily within the alignment community.

vs. Utility-Aware Multimodal Contrastive Learning for Product Image Generation

gemini-3.1 · 5/29/2026

Paper 1 addresses a fundamental and critical challenge in AI safety—making LLM alignment robust against perturbations. Its foundational approach to safety has broad implications across all LLM deployments and societal impacts. In contrast, Paper 2 focuses on a niche, commercially-driven application (optimizing product images for e-commerce), which, while practically useful, offers narrower scientific impact and breadth compared to core AI safety research.