Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization
Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu
Abstract
Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
AI Impact Assessments
Scientific Impact Assessment (1 model)
1. Core Contribution
The paper addresses the fragility of safety-aligned LLMs under post-alignment perturbations (weight noise, activation noise, quantization) and proposes an optimizer-centric solution. The core contribution is a hybrid FO-ZO framework: first-order (FO) optimization performs standard safety alignment, then zeroth-order (ZO) optimization is applied as a refinement stage to improve robustness. The key insight is that ZO optimization inherently evaluates the loss under parameter perturbations, effectively optimizing a smoothed objective that encourages flatter minima around the safety-aligned solution. Additionally, the paper introduces a robustness-aware layer selection mechanism that exploits the perturbation-based nature of ZO to identify which layers are most vulnerable to safety degradation, concentrating refinement updates on those layers.
The conceptual link between ZO optimization and robustness is well-motivated: ZO gradient estimates naturally probe the loss landscape in a neighborhood around current parameters, which aligns with the goal of making safety alignment robust to small parameter perturbations.
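To make the link between ZO estimation and neighborhood probing concrete, here is a minimal sketch of a two-point randomized gradient estimator followed by a few refinement steps. It is an illustration under stated assumptions, not the paper's implementation: the estimator form, the names `zo_gradient`, `zo_refine`, and `toy_loss`, and all hyperparameter values are hypothetical, and a toy quadratic stands in for the safety-alignment loss.

```python
import numpy as np


def zo_gradient(loss_fn, theta, mu=1e-3, num_samples=8, rng=None):
    """Two-point randomized gradient estimate built from loss evaluations only.

    Each sample evaluates the loss at theta + mu*u and theta - mu*u, so the
    estimate reflects how the loss behaves in a neighborhood of theta.
    """
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)
    for _ in range(num_samples):
        u = rng.standard_normal(theta.shape)  # random probing direction
        delta = loss_fn(theta + mu * u) - loss_fn(theta - mu * u)
        grad += (delta / (2.0 * mu)) * u
    return grad / num_samples


def zo_refine(loss_fn, theta, steps=50, lr=0.05, mu=1e-3, num_samples=8, seed=0):
    """Apply a few ZO descent steps starting from an already-trained theta."""
    rng = np.random.default_rng(seed)
    theta = theta.copy()
    for _ in range(steps):
        theta -= lr * zo_gradient(loss_fn, theta, mu, num_samples, rng)
    return theta


def toy_loss(theta):
    """Hypothetical stand-in for the alignment loss (a simple quadratic)."""
    return 0.5 * float(np.sum(theta ** 2))


# Hypothetical "FO solution" used as the starting point for refinement.
theta_fo = np.array([1.0, -2.0, 0.5])
theta_refined = zo_refine(toy_loss, theta_fo)
print(toy_loss(theta_fo), toy_loss(theta_refined))
```

In an LLM setting the perturbation would be applied to model weights and the loss computed on safety data; the sketch only shows that each update is driven entirely by loss values at perturbed parameters.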
2. Methodological Rigor
Theoretical analysis: The paper provides two theoretical results: (1) a convergence guarantee for ZO optimization under the Polyak-Łojasiewicz (PL) condition, which justifies using ZO as a refinement stage rather than as a replacement for FO (Theorem 4.2), and (2) a one-step robustness improvement guarantee showing that ZO refinement can reduce the perturbation robustness gap even at FO stationary points (Theorem 4.4). While these results are technically sound, they rely on standard assumptions (Lipschitz continuity, smoothness, PL condition), and the proofs follow relatively standard ZO analysis patterns. The assumption that the FO solution is a stationary point (∇f(θ_fo) = 0) is idealized: in practice, FO alignment rarely reaches exact stationarity after only 100 steps.
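For reference, the standard textbook forms of the objects named in this paragraph are given below. These are the usual definitions from the ZO literature, not the paper's exact statements; the symbols $\lambda$ (PL constant), $\mu$ (smoothing radius), $f^{*}$ (minimum loss value), and $\hat{g}$ (gradient estimate) are assumed notation.

```latex
% PL condition with constant \lambda > 0:
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^{2} \;\ge\; \lambda \bigl( f(\theta) - f^{*} \bigr)

% Gaussian-smoothed objective with smoothing radius \mu > 0:
f_{\mu}(\theta) \;=\; \mathbb{E}_{u \sim \mathcal{N}(0, I)} \bigl[ f(\theta + \mu u) \bigr]

% Two-point ZO estimator, unbiased for the gradient of the smoothed objective:
\hat{g}(\theta) \;=\; \frac{f(\theta + \mu u) - f(\theta - \mu u)}{2\mu}\, u,
\qquad
\mathbb{E}_{u}\bigl[ \hat{g}(\theta) \bigr] \;=\; \nabla f_{\mu}(\theta)
```

The last identity is why a ZO step can be nonzero at an FO stationary point: $\nabla f(\theta_{fo}) = 0$ does not imply $\nabla f_{\mu}(\theta_{fo}) = 0$ when the loss is asymmetric around $\theta_{fo}$.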
Experimental evaluation: Experiments are conducted on two models (Llama-3-8B-Instruct, Qwen2-7B-Instruct) with three perturbation types and multiple safety benchmarks (HarmBench, LlamaGuard3, AdvBench). The evaluation is reasonably comprehensive. However, several concerns arise:
3. Potential Impact
Practical relevance: The framework is lightweight (13.3% additional training time, 0.32× memory) and can be applied as a plug-in post-processing step after any FO-based safety alignment. This modularity is appealing for practitioners. The robustness-aware layer selection provides interpretable insights about which layers are critical for safety robustness.
Scope limitations: The threat model is restricted to random/unstructured perturbations. Real-world safety failures more commonly arise from adversarial fine-tuning, jailbreak prompts, or targeted weight manipulation. The paper does not evaluate against these more realistic attack vectors. The improvements against quantization (a highly practical deployment scenario) are more compelling but still modest.
Broader influence: The optimizer-centric perspective is a genuinely underexplored angle in the alignment robustness literature. This could inspire follow-up work on optimizer design for alignment, beyond the specific FO-ZO hybrid proposed here.
4. Timeliness & Relevance
The paper is timely: LLM safety alignment is a critical and active research area, and the fragility of alignment under post-deployment modifications (quantization, noise) is a recognized problem. The connection to ZO optimization capitalizes on recent interest in ZO methods for LLM fine-tuning. However, the concurrent work by Lang et al. (2026, cited as [26]) on ZO for LLM unlearning robustness somewhat diminishes the novelty of the optimizer-centric perspective.
5. Strengths & Limitations
Strengths:
Limitations:
Additional Observations
The paper's claim of being "the first to study the robustness of safety alignment from the perspective of the base optimizer" is somewhat overstated given concurrent work [26]. The robustness-aware layer selection (Section 4.2) may be more impactful than the ZO refinement itself, as it provides a diagnostic tool for identifying safety-critical layers. The experimental protocol of using only 800 training samples and 100 FO steps is somewhat unusual for safety alignment and may affect the generalizability of the findings.
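The diagnostic use of layer selection noted above can be illustrated with a small sketch that scores each layer by the average loss increase when only that layer is perturbed. This is an assumption-laden toy: the paper reportedly reuses ZO perturbation evaluations for this purpose, whereas the sketch below runs separate per-layer perturbations, and the names `layer_sensitivity`, `select_layers`, `toy_loss`, and the `curvature` table are hypothetical.

```python
import numpy as np


def layer_sensitivity(loss_fn, layers, sigma=0.05, num_samples=16, seed=0):
    """Score each layer by the mean loss increase under Gaussian noise
    applied to that layer alone (all other layers left unchanged)."""
    rng = np.random.default_rng(seed)
    base = loss_fn(layers)
    scores = {}
    for name, weights in layers.items():
        increases = []
        for _ in range(num_samples):
            perturbed = dict(layers)  # shallow copy; only one entry is replaced
            perturbed[name] = weights + sigma * rng.standard_normal(weights.shape)
            increases.append(loss_fn(perturbed) - base)
        scores[name] = float(np.mean(increases))
    return scores


def select_layers(scores, k):
    """Return the k layer names with the largest sensitivity scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy model: each layer contributes a quadratic loss term with
# a different curvature, so sharper layers are more sensitive to noise.
curvature = {"layer0": 0.1, "layer1": 10.0, "layer2": 1.0}


def toy_loss(layers):
    return sum(0.5 * curvature[n] * float(np.sum(w ** 2)) for n, w in layers.items())


layers = {name: np.zeros(4) for name in curvature}
scores = layer_sensitivity(toy_loss, layers)
critical = select_layers(scores, k=1)
print(scores, critical)
```

In this toy setup the highest-curvature layer ranks first; a refinement stage could then restrict its updates to the selected layers.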
Generated May 29, 2026
Comparison History (15)
Paper 1 demonstrates higher scientific impact through several factors: (1) it produces a tangible, large-scale scientific resource—the largest integrated marine Pb database—with immediate utility for oceanography and environmental science; (2) it introduces a generalizable expert-guided LLM framework applicable across geosciences; (3) it bridges AI and domain science with rigorous validation (92% expert-verified accuracy); (4) it has broad interdisciplinary impact spanning NLP, marine science, and environmental monitoring. Paper 2, while technically sound in improving LLM safety robustness via zeroth-order optimization, addresses a narrower problem within AI safety with less cross-disciplinary reach.
Paper 2 likely has higher impact: it introduces a new benchmark and dataset (16k human judgments) targeting an under-evaluated but crucial property—action-conditioned reliability of robotic world models—yielding broadly applicable diagnostics and findings across 12 diverse systems (open/closed, text/vector, scales). Benchmarks often become community standards, influencing many downstream works in robotics, world modeling, and evaluation. Paper 1 is novel in optimizer-centric robustness for LLM safety, but its scope is narrower (alignment robustness) and may see more incremental adoption compared to a widely usable evaluation suite.
Paper 1 addresses a critical, ecosystem-level problem of multi-model self-consuming loops and presents a highly counter-intuitive finding that human curation can backfire. This has profound implications for how future foundation models will be trained on web data, likely sparking widespread follow-up research across the AI community. Paper 2 offers a valuable but narrower technical optimization solution for safety fragility.
Paper 2 likely has higher impact: it targets LLM safety robustness, a timely, widely relevant problem affecting deployment across many domains. Its optimizer-centric framing is relatively novel and proposes a broadly applicable hybrid first-order + zeroth-order refinement with theoretical and empirical support, potentially influencing both alignment research and robust optimization practice. Paper 1 is innovative and valuable for materials discovery, but its impact is more domain-specific, whereas Paper 2’s methods and implications can transfer across models, modalities, and safety-critical applications.
Paper 1 addresses a fundamental and timely problem in LLM safety alignment robustness from a novel optimizer-centric perspective, introducing zeroth-order optimization as a principled approach. It offers both theoretical and empirical contributions, with broad implications for all safety-aligned LLMs. Paper 2 presents an engineering-oriented memory management system that, while practically useful, is more incremental and application-specific. Paper 1's novelty in connecting optimization theory to safety robustness and its potential to influence future alignment research gives it higher scientific impact.
Paper 1 tackles the critical and timely issue of LLM safety robustness. By introducing zeroth-order optimization to prevent safety degradation from perturbations like quantization or noise, it provides a novel theoretical and methodological contribution to AI alignment. While Paper 2 offers highly valuable practical improvements for training efficiency and data curation, Paper 1 addresses fundamental safety vulnerabilities, giving it broader societal relevance and higher potential scientific impact in the high-stakes field of AI safety.
Paper 1 introduces an optimizer-centric lens on safety robustness—an underexplored angle—using zeroth-order refinement to explicitly optimize alignment under perturbations, with theoretical and empirical support plus an efficiency improvement via layer-wise sensitivity. This targets a high-stakes, timely problem (robust LLM safety under deployment transformations like quantization/noise) with clear real-world implications and potentially broad impact across alignment, robustness, and optimization. Paper 2 is a clever, low-cost RLVR tweak with practical gains, but is more incremental and narrower in scope.
Paper 2 addresses a fundamental and broadly impactful problem—the fragility of LLM safety alignment—with a novel optimizer-centric perspective (zeroth-order optimization) that is both theoretically grounded and practically applicable. Safety robustness is a critical concern across the entire LLM ecosystem, giving it broader impact. Paper 1, while valuable, is more narrowly focused on web code generation evaluation, a niche within LLM benchmarking. Paper 2's methodological innovation (hybrid first-order/zeroth-order framework with layer-wise sensitivity estimation) offers generalizable insights applicable beyond safety alignment to robust optimization more broadly.
Paper 1 introduces a novel optimizer-centric perspective on LLM safety robustness, proposing a concrete hybrid zeroth-order optimization framework with both theoretical and empirical contributions. This addresses a fundamental and timely problem—fragility of safety alignment—with a methodologically innovative approach that opens a new research direction. Paper 2, while useful as a benchmarking framework for hybrid-reasoning mode switching, is primarily a systematization and comparison of existing methods rather than introducing a fundamentally new technique. Paper 1's novelty, theoretical grounding, and direct impact on the critical problem of AI safety give it higher potential impact.
Paper 1 addresses a fundamental gap in AI research—how to measure progress toward AGI—by proposing a comprehensive cognitive taxonomy and evaluation framework grounded in decades of cognitive science. This has broader cross-disciplinary impact spanning AI, cognitive science, policy, and governance. Its timeliness is high given current AGI debates. Paper 2, while technically solid and addressing an important LLM safety robustness problem with a novel optimizer-centric approach, is more incremental and narrower in scope, focused on a specific technical improvement within the existing safety alignment paradigm.
Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability: robustness of LLM safety alignment is a widely recognized, high-stakes problem affecting deployment across domains. Its optimizer-centric framing and hybrid first-/zeroth-order refinement is a clear conceptual contribution with direct practical implications (robustness to quantization/noise) and potential to influence alignment practice and tooling. The inclusion of theoretical and empirical evidence plus an efficiency method (layer-wise sensitivity) suggests solid rigor. Paper 1 is innovative for agent skill meta-evolving, but its impact is narrower and more benchmark-dependent.
Paper 2 is likely higher impact: it introduces a broadly applicable, conceptually novel training signal (Belief Entropy) for diagnosing and optimizing memory quality in long-horizon LLM agents, addressing a central bottleneck for agentic systems. The approach has clear real-world applications (autonomous assistants, tool-using agents) and strong timeliness given rapid growth of long-context and agent research. Its proxy-based, fine-grained supervision may generalize across tasks and architectures. Paper 1 is valuable and novel in its optimizer-centric framing, but its impact is narrower (safety robustness under perturbations) and may be more incremental relative to existing alignment/robustness lines of work.
Paper 1 likely has higher scientific impact: it introduces an optimizer-centric angle on safety alignment robustness and leverages zeroth-order refinement to directly optimize robustness under perturbations, with theoretical support and an efficiency improvement via layer-wise sensitivity—broadly applicable across alignment methods and deployment perturbations (noise, quantization). Paper 2 addresses an important, timely MAS security setting, but the contribution is more application-specific (agent communication defenses/attacks) and may generalize less beyond multi-agent prompting setups. Overall, Paper 1’s methodological novelty and cross-cutting relevance to LLM safety/robustness are stronger.
MolLingo addresses the high-impact problem of AI-driven molecular design with a novel multi-agent framework combining chemically meaningful representations (BFE), multi-agent coordination, and biologically grounded reasoning. It demonstrates strong empirical results across four benchmarks, including state-of-the-art on TOMG-Bench and a fourfold docking score improvement over GPT-5.4. Its practical applications in drug discovery give it broader real-world impact. Paper 1, while addressing an important LLM safety robustness problem with a novel optimizer-centric perspective, has a narrower scope primarily within the alignment community.
Paper 1 addresses a fundamental and critical challenge in AI safety—making LLM alignment robust against perturbations. Its foundational approach to safety has broad implications across all LLM deployments and societal impacts. In contrast, Paper 2 focuses on a niche, commercially-driven application (optimizing product images for e-commerce), which, while practically useful, offers narrower scientific impact and breadth compared to core AI safety research.