On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
Bo Yin, Qi Li, Xinchao Wang
Abstract
Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.
AI Impact Assessments
Scientific Impact Assessment: FATE (On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment)
1. Core Contribution
FATE addresses a genuine and increasingly important problem: how to align tool-using LLM agents at the trajectory level rather than the response level. The key insight is that agent safety failures manifest across entire interaction sequences (tool calls, state changes, injected instruction compliance), not just in final outputs. The paper's primary novelty is a self-evolving framework that (1) mines the current policy's own failures, (2) uses the same policy to propose repairs conditioned on failure diagnostics, (3) filters repairs via multi-objective Pareto-front selection across security, utility, over-refusal, and trajectory control, and (4) trains the policy on the filtered repairs using SFT warmup followed by Pareto-Front Policy Optimization (PFPO).
The central methodological contribution, transforming verifier-scored failures into on-policy repair supervision without expert demonstrations, is a meaningful advance over existing approaches that rely on response-level signals (RLHF/DPO), external demonstrations, or inference-time guardrails that leave the underlying policy unchanged.
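The Pareto-front selection step described above can be illustrated with a minimal sketch. It assumes each repair candidate carries four verifier scores where higher is better; the class name, field names, and example values are hypothetical and are not taken from the paper, whose actual filter may add thresholds or other criteria.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RepairCandidate:
    """Hypothetical container for one repair candidate and its verifier scores."""
    name: str
    security: float         # higher = safer
    utility: float          # higher = task better completed
    refusal_control: float  # higher = less over-refusal
    validity: float         # higher = better-formed trajectory

    def scores(self) -> Tuple[float, float, float, float]:
        return (self.security, self.utility, self.refusal_control, self.validity)


def dominates(a: RepairCandidate, b: RepairCandidate) -> bool:
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    pairs = list(zip(a.scores(), b.scores()))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(cands: List[RepairCandidate]) -> List[RepairCandidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in cands if not any(dominates(o, c) for o in cands)]


# Illustrative scores only; these are not results from the paper.
candidates = [
    RepairCandidate("careful_complete", 1.0, 0.8, 0.9, 1.0),
    RepairCandidate("fast_complete", 0.9, 1.0, 1.0, 1.0),
    RepairCandidate("refuse_all", 1.0, 0.0, 0.0, 1.0),
    RepairCandidate("follows_injection", 0.0, 0.9, 1.0, 1.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

In this toy example the blanket refusal and the injection-following candidate are both dominated, so only the two candidates that trade a little security against a little utility survive. A pure dominance filter can still retain an unsafe candidate that happens to be best on one axis, which is one reason a real pipeline might combine it with minimum-score gates.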
2. Methodological Rigor
Strengths: The framework is well-formalized. The multi-objective formulation (Eq. 1-12) is mathematically clean, and the formal analysis in Appendix D (showing q*_t is a KL-regularized projection onto the Pareto front) provides theoretical grounding. The experimental protocol is commendable: strict dev/test splits, three random seeds, evaluation across five backbone families (Qwen3, Llama-3.1, Ministral, Gemma-3, Phi-4), six model scales (0.6B–32B), and an external generalization benchmark (ATBench).
Concerns: Several aspects weaken rigor. Verifier quality is a critical bottleneck that the paper acknowledges but does not analyze in depth: if the verifiers are imperfect, the entire supervision pipeline inherits their systematic biases. The paper does not clearly quantify verifier accuracy or analyze failure modes of the verification step itself. The K=8 repair candidates per failure, while tested in ablation (Table 14), may be insufficient for complex failures. The improvements are substantial on paper (33.5% ASR reduction, 82.6% HCR reduction), but some baselines seem weak: ReAct and Reflexion are prompting strategies rather than training-time interventions, which makes the comparison somewhat asymmetric. The "SFT + safety-only GRPO" ablation is more informative because it is also a training-time intervention. Additionally, statistical significance is not formally tested despite reporting standard deviations.
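To make the significance concern concrete, the following is a minimal sketch of one test that could be run on per-seed scores: Welch's t statistic with the Welch-Satterthwaite degrees of freedom. The function name and the numbers are placeholders for illustration, not values from the paper, and the choice of Welch's test is an assumption rather than something the paper or review prescribes.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom for two samples."""
    # Squared standard error of each sample mean (sample variance / n).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Placeholder per-seed scores for two methods (three seeds each); not real results.
seeds_a = [1.0, 2.0, 3.0]
seeds_b = [4.0, 5.0, 6.0]
t_stat, dof = welch_t(seeds_a, seeds_b)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

With only three seeds per method the degrees of freedom are small, so even a large t statistic corresponds to a fairly wide confidence interval. Converting the statistic to a p-value requires a Student-t CDF, which the Python standard library does not provide.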
The ATBench generalization results (Table 4) are impressive but somewhat confounded—the FATE-refined model outperforms even closed-source models like GPT-5.4 on coarse classification, which raises questions about whether the evaluation task partially overlaps with FATE's training signal in subtle ways, despite no ATBench data being used directly.
3. Potential Impact
Direct impact: This work could meaningfully influence how the community approaches agent safety alignment. The shift from response-level to trajectory-level safety supervision is conceptually important and practically necessary as agents become more autonomous. The framework's ability to work across model families and scales suggests broad applicability.
Practical applications: Organizations deploying tool-using agents (customer service, code execution, data analysis) could use FATE-style pipelines to iteratively harden their agents against prompt injection and harmful compliance without sacrificing task utility.
Broader influence: The Pareto-front multi-objective optimization for safety-utility trade-offs could inspire similar approaches in other alignment settings. The self-evolving paradigm—where the policy's own failures drive improvement—connects to broader themes in self-play and self-improvement that are increasingly central to LLM development.
4. Timeliness & Relevance
This paper is highly timely. As LLM agents are deployed in production environments with tool access (function calling, API integration, web browsing), trajectory-level safety has become an urgent concern. The benchmarks used (AgentDojo, AgentHarm, ATBench) are recent and represent the current evaluation frontier. The safety-utility trade-off problem—where safety improvements cause over-refusal—is a well-known practical pain point that this work directly addresses.
The paper also arrives at a moment when the field is transitioning from single-turn safety alignment to multi-step agent alignment, filling a genuine gap between evaluation-focused work and training-time solutions.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional Observations
The paper's formal analysis (Theorem 1) is clean but somewhat predictable—showing that the supervision distribution is a KL-regularized projection is a standard result given the distributional form. The practical value lies more in the engineering of the pipeline than in theoretical novelty. The iterative evolution results (Figure 3) show diminishing returns after round 2, suggesting the framework may quickly saturate on current benchmarks.
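For readers unfamiliar with why such a result is considered standard, here is a generic illustration of a KL projection onto a restricted support. The symbols are assumptions introduced for this sketch: \pi is a reference distribution over candidate trajectories \tau, and \mathcal{P} is the set of Pareto-optimal candidates with \pi(\mathcal{P}) > 0. This is not the paper's Theorem 1, whose q*_t may include additional terms.

```latex
\[
q^{\star} \;=\; \arg\min_{q \,:\, \operatorname{supp}(q) \subseteq \mathcal{P}} \mathrm{KL}\!\left(q \,\|\, \pi\right),
\qquad
q^{\star}(\tau) \;=\; \frac{\pi(\tau)\,\mathbf{1}[\tau \in \mathcal{P}]}{\sum_{\tau' \in \mathcal{P}} \pi(\tau')} .
\]
% Why: write \pi_{\mathcal{P}}(\tau) = \pi(\tau) / \pi(\mathcal{P}) for \tau \in \mathcal{P}. Then
\[
\mathrm{KL}\!\left(q \,\|\, \pi\right)
\;=\; \mathrm{KL}\!\left(q \,\|\, \pi_{\mathcal{P}}\right) \;-\; \log \pi(\mathcal{P})
\;\ge\; -\log \pi(\mathcal{P}),
\]
% with equality if and only if q = \pi_{\mathcal{P}}.
```

In words, the closest distribution to the reference that places mass only on the Pareto set is the reference itself, restricted to that set and renormalized.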
Generated May 13, 2026
Comparison History (22)
Paper 1 is more novel and high-impact: it identifies a new multi-agent attack class (semantic hijacking) and a counterintuitive, broadly relevant “capability paradox” showing stronger components can worsen system security, supported by large-scale experiments and mediation analysis that offers a mechanistic explanation. The proposed defense (heterogeneous ensemble verification) is simple, actionable, and yields a dramatic ASR reduction with minimal utility loss, making it immediately relevant for real-world MAS deployments. Paper 2 is timely and useful, but aligns with an active line of trajectory-level/on-policy safety training; impact is likely incremental and verifier-dependent.
TRACE addresses the critical problem of LLM hallucinations with a novel, training-free, inference-time intervention based on internal cross-layer evidence. Its universality and significant empirical gains across 15 models without requiring fine-tuning, labels, or external retrieval give it massive potential for immediate real-world adoption. While Paper 1 makes strong contributions to agent safety, Paper 2's fundamental approach to internal model mechanics and broader generalizability across all LLM use cases suggests a higher potential for widespread scientific and practical impact.
Paper 1 introduces a novel paradigm—world models for clinical ECG simulation under interventions—combining physiological ODE priors with latent diffusion in a principled way. This addresses a significant gap in computational cardiology and clinical decision support, with direct real-world medical applications. Its interdisciplinary nature (ML + clinical medicine + physiology) broadens impact. Paper 2 makes solid contributions to LLM agent safety alignment but operates in an increasingly crowded space. While impactful for AI safety, Paper 1's methodological novelty (physiology-informed world models) and potential to transform clinical practice give it higher long-term scientific impact.
Paper 1 fundamentally challenges the dominant paradigm of Chain-of-Thought reasoning in LLMs, advocating a shift toward latent-state dynamics. This theoretical paradigm shift has profound implications across AI interpretability, evaluation, and fundamental model design, offering a broader and deeper scientific impact compared to Paper 2's more specialized, albeit important, framework for agent safety alignment.
Paper 1 (FATE) addresses a critical and timely problem—safety alignment for tool-using LLM agents—with a novel framework that tackles trajectory-level failures rather than just response-level issues. Its Pareto-front optimization for balancing safety and utility is methodologically innovative. The problem has immediate real-world implications as agents are deployed in high-stakes settings. Paper 2 (DiffMAS) proposes interesting latent communication optimization for multi-agent systems, but addresses a less urgent problem with more incremental gains. FATE's broader safety implications, novel self-evolution approach without expert demonstrations, and strong empirical results across multiple safety benchmarks give it higher potential impact.
Paper 1 (FATE) addresses a critical and timely problem in AI safety alignment for tool-using LLM agents, proposing a novel on-policy self-evolution framework with Pareto-front optimization. Its contributions—trajectory-level failure repair, multi-objective safety-utility balancing, and strong empirical results across multiple benchmarks—have broader impact potential given the urgency of AI safety research. Paper 2 (UAF) makes a solid engineering contribution to full-duplex speech interaction by unifying front-end tasks, but it is more incremental and narrower in scope. The safety alignment problem addressed by Paper 1 has wider cross-field implications and greater timeliness.
Paper 1 (FATE) addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel framework combining trajectory-level failure repair, multi-objective optimization via Pareto-front methods, and on-policy self-evolution without expert demonstrations. Its impact spans AI safety, alignment, and agent deployment, which are high-stakes areas. Paper 2 (PRISM-MCTS) offers efficiency improvements to MCTS-based reasoning via shared memory, which is valuable but more incremental. FATE's broader applicability to real-world agent safety, strong empirical results across multiple benchmarks, and novel Pareto-aware optimization give it higher potential impact.
Paper 1 (FATE) addresses a critical and timely problem—safety alignment for tool-using LLM agents—with a novel framework combining trajectory-level failure repair, multi-objective optimization via Pareto-front methods, and on-policy self-evolution. This tackles fundamental challenges in deploying agents safely at scale, with broad implications across AI safety, alignment, and agentic systems. Paper 2 (PyRAG) offers a clever reformulation of multi-hop RAG as code execution, but addresses a more incremental improvement in retrieval-augmented QA. FATE's novelty in safety-utility Pareto optimization and its applicability to the rapidly growing agent ecosystem gives it higher potential impact.
Paper 1 likely has higher impact due to broader scope and methodological contributions: it targets agentic safety at the trajectory/tool-use level (more general than reasoning-only), introduces a concrete self-evolution pipeline (failure→repair supervision) plus a multi-objective Pareto-aware optimization method (PFPO) to manage safety–utility–over-refusal trade-offs. It reports improvements across multiple agent benchmarks and metrics relevant to deployment. Paper 2 is timely and useful but narrower (unsafe reasoning recovery) and framed as a single-objective RL robustness approach, with potentially less cross-field applicability.
Paper 2 addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel framework (FATE) that introduces several innovative contributions: on-policy self-evolution from failure trajectories, Pareto-front policy optimization balancing safety and utility, and dense trajectory-level supervision. The substantial empirical improvements (33.5% reduction in attack success, 82.6% in harmful compliance) across multiple benchmarks demonstrate strong results. Safety alignment is a foundational concern with broad impact across all LLM agent applications, making this work more broadly impactful than Paper 1's more specialized contribution to structured knowledge in deep research reports.
Paper 2 addresses the critical and highly relevant challenge of LLM agent safety alignment, specifically tackling the pervasive safety-utility trade-off. By focusing on trajectory-level failures rather than just final responses, it offers a more robust methodology for real-world agent deployments. The approach has broader societal implications and higher potential applicability across AI safety domains compared to the more specialized systems-level memory management problem addressed in Paper 1.
Paper 1 addresses a fundamental and broadly applicable challenge in AI safety alignment for tool-using LLM agents, proposing a novel on-policy self-evolving framework (FATE) with Pareto-front optimization. Its methodological contributions—trajectory-level failure repair, multi-objective safety-utility balancing, and dense supervision from failures—are generalizable across models and scales. The significant quantitative improvements (33.5% ASR reduction, 82.6% harmful compliance reduction) demonstrate strong results. While Paper 2 makes a valuable contribution to healthcare QI, its impact is more domain-specific, and its concordance threshold (≥70%) is modest. Paper 1's broader applicability to the rapidly growing field of AI agent safety gives it higher potential impact.
Paper 2 tackles a critical, high-impact challenge in AI—agentic safety alignment and the safety-utility trade-off. Its methodological innovations, specifically trajectory-level on-policy self-evolution and Pareto-Front Policy Optimization, offer broad applicability across all tool-using LLM agents. While Paper 1 provides a valuable benchmark for data analysis, Paper 2's focus on foundational AI safety mechanisms gives it greater potential for widespread, cross-disciplinary impact and urgent real-world implementation.
Paper 2 addresses a critical and highly timely challenge: the safety-utility trade-off in autonomous tool-using LLM agents. By focusing on trajectory-level failures rather than just final responses, the FATE framework offers a robust, on-policy self-evolution method for agentic safety alignment. This has significant real-world applications in deploying secure AI agents. While Paper 1 provides valuable mechanistic insights into multimodal LLMs, Paper 2's focus on autonomous agent safety and its strong empirical improvements across multiple benchmarks suggest a broader and more immediate scientific and practical impact.
Paper 1 addresses a critical and highly timely bottleneck in AI: the safety and alignment of autonomous, tool-using LLM agents. By focusing on trajectory-level failures rather than just final responses, and proposing an automated self-evolution framework that mitigates the safety-utility trade-off, it offers broad, real-world utility for deploying safe AI. While Paper 2 presents a clever stealthy backdoor attack for VLMs, Paper 1's focus on foundational alignment and self-improving safety mechanisms across different models promises a wider, more constructive impact on the rapidly growing field of agentic AI.
FATE addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel on-policy self-evolution framework that transforms failure trajectories into repair supervision. Its multi-objective Pareto optimization for safety-utility trade-offs is methodologically innovative. The breadth of impact is larger given the widespread deployment of LLM agents. Paper 2 makes a solid contribution to DRL backdoor defense with an online, trigger-agnostic approach, but targets a narrower domain with fewer immediate real-world applications compared to the rapidly growing LLM agent ecosystem.
Paper 1 has higher likely scientific impact due to a concrete, novel training framework (FATE + PFPO) addressing a pressing real-world problem—agentic safety for tool-using LLMs—with on-policy trajectory-level supervision and multi-objective filtering. It reports quantitative improvements on multiple established benchmarks and offers an actionable method that can be adopted and extended across agent systems, RLHF/verification pipelines, and safety tooling. Paper 2 presents a valuable conceptual framing, but it is largely theoretical/conjectural with examples and less direct empirical or engineering leverage, limiting near-term uptake and measurable downstream impact.
Paper 1 (FATE) addresses the critical and broadly relevant problem of safety alignment for tool-using LLM agents, introducing novel concepts like trajectory-level failure-based self-evolution and Pareto-front policy optimization for safety-utility trade-offs. Its impact spans the entire LLM agent safety community with strong empirical results across multiple benchmarks. Paper 2 (ToolCUA) makes a solid contribution to GUI-tool orchestration but addresses a narrower problem domain. FATE's focus on safety alignment is more timely and foundational, with broader implications for deploying agents in real-world settings.
Paper 2 addresses the critical and timely problem of safety alignment for LLM agents, proposing a novel framework (FATE) that tackles trajectory-level failures rather than just response-level issues. Its contributions—on-policy self-evolution from failures, Pareto-front optimization for safety-utility trade-offs, and dense trajectory-level supervision—are more innovative and broadly impactful. The dramatic improvements (33.5% reduction in attack success, 82.6% in harmful compliance) demonstrate strong practical value. LLM agent safety is a rapidly growing area with wide real-world implications, giving Paper 2 greater breadth and timeliness compared to Paper 1's incremental improvements in domain incremental learning.
Paper 1 addresses a fundamental, urgent challenge in AI: safety alignment for tool-using LLM agents. By introducing trajectory-level supervision and Pareto-Front Policy Optimization, it provides a novel algorithmic advance that solves the pervasive safety-utility trade-off, backed by rigorous quantitative results across multiple benchmarks. In contrast, Paper 2 presents a valuable but highly applied framework for supply chain automation using existing multi-agent paradigms without specific quantitative metrics in the abstract. Paper 1's foundational contributions have critical implications for the secure deployment of AI agents across all domains, yielding higher scientific impact.