On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

Bo Yin, Qi Li, Xinchao Wang

#196 of 2292 · Artificial Intelligence
Tournament Score: 1520±44
Win Rate: 81% (17 wins, 4 losses, 21 matches)
Rating: 7/10

Abstract

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FATE — On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

1. Core Contribution

FATE addresses a genuine and increasingly important problem: how to align tool-using LLM agents at the trajectory level rather than the response level. The key insight is that agent safety failures manifest across entire interaction sequences (tool calls, state changes, injected instruction compliance), not just in final outputs. The paper's primary novelty is a self-evolving framework that (1) mines the current policy's own failures, (2) uses the same policy to propose repairs conditioned on failure diagnostics, (3) filters repairs via multi-objective Pareto-front selection across security, utility, over-refusal, and trajectory control, and (4) trains the policy on the filtered repairs using SFT warmup followed by Pareto-Front Policy Optimization (PFPO).

The central methodological contribution, transforming verifier-scored failures into on-policy repair supervision without expert demonstrations, is a meaningful advance over existing approaches that rely on response-level signals (RLHF/DPO), on external demonstrations, or on inference-time guardrails that leave the underlying policy unchanged.
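Step (3) of the pipeline, Pareto-front selection over repair candidates, can be illustrated with a plain non-dominated filter. This is a minimal sketch under stated assumptions, not the paper's implementation: the objective names, the "higher is better" score convention, and the candidate records are all hypothetical.

```python
from typing import Dict, List

# Hypothetical objective names; every score is assumed to be "higher is better".
OBJECTIVES = ("security", "utility", "non_over_refusal", "trajectory_validity")


def dominates(a: Dict, b: Dict) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    at_least_as_good = all(a[k] >= b[k] for k in OBJECTIVES)
    strictly_better = any(a[k] > b[k] for k in OBJECTIVES)
    return at_least_as_good and strictly_better


def pareto_front(candidates: List[Dict]) -> List[Dict]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]


# Hypothetical verifier scores for four repair candidates of one failure.
repairs = [
    {"id": "r1", "security": 1.0, "utility": 0.4, "non_over_refusal": 1.0, "trajectory_validity": 1.0},
    {"id": "r2", "security": 1.0, "utility": 0.9, "non_over_refusal": 1.0, "trajectory_validity": 1.0},
    {"id": "r3", "security": 0.0, "utility": 1.0, "non_over_refusal": 1.0, "trajectory_validity": 1.0},
    {"id": "r4", "security": 1.0, "utility": 0.9, "non_over_refusal": 0.0, "trajectory_validity": 1.0},
]

front = pareto_front(repairs)
print([c["id"] for c in front])  # ['r2', 'r3']
```

Note that the insecure but highly useful candidate `r3` survives a pure non-dominated filter, because nothing beats it on utility. A practical filter would presumably combine Pareto selection with some minimum requirement on security; how the paper handles this is not stated in the review.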

2. Methodological Rigor

Strengths: The framework is well-formalized. The multi-objective formulation (Eq. 1-12) is mathematically clean, and the formal analysis in Appendix D (showing q*_t is a KL-regularized projection onto the Pareto front) provides theoretical grounding. The experimental protocol is commendable: strict dev/test splits, three random seeds, evaluation across five backbone families (Qwen3, Llama-3.1, Ministral, Gemma-3, Phi-4), six model scales (0.6B–32B), and an external generalization benchmark (ATBench).
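For readers unfamiliar with the term, a KL-regularized projection of the kind mentioned above has a well-known closed form. The notation here is an assumption made for illustration and is not taken from the paper's Appendix D: $\pi_t$ is the current policy's distribution over repair candidates $\tau$, $\mathcal{P}_t$ is the set of Pareto-front candidates (assumed to have nonzero mass under $\pi_t$), $s(\tau)$ is a scalar score, and $\beta > 0$ is a temperature.

```latex
\begin{aligned}
q_t^{*} &= \arg\min_{q \,:\, \operatorname{supp}(q) \subseteq \mathcal{P}_t}
  \; \mathrm{KL}\!\left(q \,\|\, \pi_t\right)
  - \frac{1}{\beta}\, \mathbb{E}_{\tau \sim q}\!\left[ s(\tau) \right], \\
q_t^{*}(\tau) &= \frac{\pi_t(\tau)\, \mathbf{1}[\tau \in \mathcal{P}_t]\, \exp\!\big(s(\tau)/\beta\big)}
  {\sum_{\tau' \in \mathcal{P}_t} \pi_t(\tau')\, \exp\!\big(s(\tau')/\beta\big)} .
\end{aligned}
```

The second line follows from setting the derivative of the Lagrangian (with a normalization constraint on $q$) to zero. When $s$ is constant, the solution reduces to $\pi_t$ restricted to $\mathcal{P}_t$ and renormalized, which is the sense in which the result is a projection onto the Pareto front.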

Concerns: Several aspects weaken rigor. Verifier quality is a critical bottleneck that the paper acknowledges but does not analyze in depth: if verifiers are imperfect, the entire supervision pipeline inherits their systematic biases. The paper does not clearly quantify verifier accuracy or analyze failure modes of the verification step itself. The K=8 repair candidates per failure, while tested in ablation (Table 14), may be insufficient for complex failures. The improvements are substantial on paper (33.5% ASR reduction, 82.6% HCR reduction), but some baselines seem weak: ReAct and Reflexion are prompting strategies rather than training-time interventions, which makes the comparison somewhat asymmetric. The "SFT + safety-only GRPO" ablation is a more informative comparison. Additionally, statistical significance is not formally tested despite reporting standard deviations.

The ATBench generalization results (Table 4) are impressive but somewhat confounded—the FATE-refined model outperforms even closed-source models like GPT-5.4 on coarse classification, which raises questions about whether the evaluation task partially overlaps with FATE's training signal in subtle ways, despite no ATBench data being used directly.
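The significance concern raised above could in principle be checked from the reported summary statistics alone. The sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom from a mean, standard deviation, and seed count per condition. The numbers in the example are hypothetical placeholders, not values from the paper.

```python
import math
from typing import Tuple


def welch_t(mean_a: float, std_a: float, n_a: int,
            mean_b: float, std_b: float, n_b: int) -> Tuple[float, float]:
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    std_a and std_b are sample standard deviations; n_a and n_b are the
    number of independent runs (e.g. random seeds) per condition.
    """
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical placeholder values: a baseline vs. a method, three seeds each.
t, df = welch_t(mean_a=10.0, std_a=2.0, n_a=3, mean_b=6.0, std_b=2.0, n_b=3)
print(f"t = {t:.3f}, df = {df:.1f}")  # t = 2.449, df = 4.0
```

A p-value would then come from the Student t distribution with `df` degrees of freedom. With only three seeds per condition the test has little power, so even sizable gaps can fail to reach conventional thresholds.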

3. Potential Impact

Direct impact: This work could meaningfully influence how the community approaches agent safety alignment. The shift from response-level to trajectory-level safety supervision is conceptually important and practically necessary as agents become more autonomous. The framework's ability to work across model families and scales suggests broad applicability.

Practical applications: Organizations deploying tool-using agents (customer service, code execution, data analysis) could use FATE-style pipelines to iteratively harden their agents against prompt injection and harmful compliance without sacrificing task utility.

Broader influence: The Pareto-front multi-objective optimization for safety-utility trade-offs could inspire similar approaches in other alignment settings. The self-evolving paradigm—where the policy's own failures drive improvement—connects to broader themes in self-play and self-improvement that are increasingly central to LLM development.

4. Timeliness & Relevance

This paper is highly timely. As LLM agents are deployed in production environments with tool access (function calling, API integration, web browsing), trajectory-level safety has become an urgent concern. The benchmarks used (AgentDojo, AgentHarm, ATBench) are recent and represent the current evaluation frontier. The safety-utility trade-off problem—where safety improvements cause over-refusal—is a well-known practical pain point that this work directly addresses.

The paper also arrives at a moment when the field is transitioning from single-turn safety alignment to multi-step agent alignment, filling a genuine gap between evaluation-focused work and training-time solutions.

5. Strengths & Limitations

Key strengths:

  • Conceptual clarity: The failure-to-repair supervision pipeline is intuitive and well-motivated. The paper clearly articulates why response-level signals and inference-time defenses are insufficient.
  • Multi-objective formulation: Explicitly optimizing across security, utility, over-refusal, and trajectory control avoids the degenerate solutions common in single-objective safety optimization.
  • Comprehensive evaluation: Five model families, six scales, iterative evolution analysis, extensive ablations, and external generalization testing.
  • On-policy design: Using the same policy for both failure generation and repair proposal ensures relevance to the current model's capability and failure distribution.
  • Reproducibility: Detailed prompt templates, hyperparameters, and algorithmic pseudocode are provided.

Notable weaknesses:

  • Verifier dependence: The entire framework's quality ceiling is set by verifier accuracy, which is neither independently validated nor stress-tested against adversarial examples designed to fool verifiers.
  • Scalability concerns: K×|F_t| verifier calls per round, plus G completions for PFPO, creates significant computational overhead that may limit applicability at scale.
  • Limited benchmark diversity: Despite three benchmarks, the agent settings remain relatively constrained. Real-world agents face longer horizons, multi-user interactions, and composable tool chains not represented here.
  • Baseline fairness: Comparing training-time FATE against inference-time baselines (ReAct, Reflexion, Tool Filter) conflates orthogonal approaches. More training-time baselines would strengthen claims.
  • Pareto weight sensitivity: While ablated (Table 15), the default weights appear hand-tuned per task mode (Table 9) with limited guidance on selection for new domains.
  • No human evaluation: All evaluation relies on automated metrics and verifiers; human judgment of trajectory safety quality would strengthen validation.
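The scalability bullet above can be made concrete with simple arithmetic. This is a rough sketch under assumptions: K = 8 is the value reported in the review, while the number of failures, the number of training prompts, the value of G, and the reading of G as "completions per training prompt" are hypothetical.

```python
def per_round_cost(num_failures: int, k_repairs: int,
                   num_prompts: int, g_completions: int) -> dict:
    """Count verifier calls for repair scoring and rollouts for policy optimization.

    Assumes one verifier call per repair candidate (K x |F_t|) and
    G sampled completions per training prompt (an assumption).
    """
    return {
        "verifier_calls": k_repairs * num_failures,
        "pfpo_completions": g_completions * num_prompts,
    }


# Hypothetical round: 500 failures, K = 8 repairs each, 500 prompts, G = 4.
cost = per_round_cost(num_failures=500, k_repairs=8, num_prompts=500, g_completions=4)
print(cost)  # {'verifier_calls': 4000, 'pfpo_completions': 2000}
```

Both terms grow linearly with the size of the failure set, so the overhead is driven mainly by how many failures each round surfaces and by the per-call cost of the verifiers.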

Additional Observations

The paper's formal analysis (Theorem 1) is clean but somewhat predictable: showing that the supervision distribution is a KL-regularized projection is a standard result given the distributional form. The practical value lies more in the engineering of the pipeline than in theoretical novelty. The iterative evolution results (Figure 3) show diminishing returns after round 2, suggesting the framework may quickly saturate on current benchmarks.

Rating: 7/10
Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 13, 2026

    Comparison History (22)

    vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
    gpt-5.2 · 5/19/2026

    Paper 1 is more novel and high-impact: it identifies a new multi-agent attack class (semantic hijacking) and a counterintuitive, broadly relevant “capability paradox” showing stronger components can worsen system security, supported by large-scale experiments and mediation analysis that offers a mechanistic explanation. The proposed defense (heterogeneous ensemble verification) is simple, actionable, and yields a dramatic ASR reduction with minimal utility loss, making it immediately relevant for real-world MAS deployments. Paper 2 is timely and useful, but aligns with an active line of trajectory-level/on-policy safety training; impact is likely incremental and verifier-dependent.

    vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
    gemini-3.1 · 5/19/2026

    TRACE addresses the critical problem of LLM hallucinations with a novel, training-free, inference-time intervention based on internal cross-layer evidence. Its universality and significant empirical gains across 15 models without requiring fine-tuning, labels, or external retrieval give it massive potential for immediate real-world adoption. While Paper 1 makes strong contributions to agent safety, Paper 2's fundamental approach to internal model mechanics and broader generalizability across all LLM use cases suggests a higher potential for widespread scientific and practical impact.

    vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
    claude-opus-4.6 · 5/19/2026

    Paper 1 introduces a novel paradigm—world models for clinical ECG simulation under interventions—combining physiological ODE priors with latent diffusion in a principled way. This addresses a significant gap in computational cardiology and clinical decision support, with direct real-world medical applications. Its interdisciplinary nature (ML + clinical medicine + physiology) broadens impact. Paper 2 makes solid contributions to LLM agent safety alignment but operates in an increasingly crowded space. While impactful for AI safety, Paper 1's methodological novelty (physiology-informed world models) and potential to transform clinical practice give it higher long-term scientific impact.

    vs. LLM Reasoning Is Latent, Not the Chain of Thought
    gemini-3.1 · 5/16/2026

    Paper 1 fundamentally challenges the dominant paradigm of Chain-of-Thought reasoning in LLMs, advocating a shift toward latent-state dynamics. This theoretical paradigm shift has profound implications across AI interpretability, evaluation, and fundamental model design, offering a broader and deeper scientific impact compared to Paper 2's more specialized, albeit important, framework for agent safety alignment.

    vs. Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems
    claude-opus-4.6 · 5/16/2026

    Paper 1 (FATE) addresses a critical and timely problem—safety alignment for tool-using LLM agents—with a novel framework that tackles trajectory-level failures rather than just response-level issues. Its Pareto-front optimization for balancing safety and utility is methodologically innovative. The problem has immediate real-world implications as agents are deployed in high-stakes settings. Paper 2 (DiffMAS) proposes interesting latent communication optimization for multi-agent systems, but addresses a less urgent problem with more incremental gains. FATE's broader safety implications, novel self-evolution approach without expert demonstrations, and strong empirical results across multiple safety benchmarks give it higher potential impact.

    vs. UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
    claude-opus-4.6 · 5/16/2026

    Paper 1 (FATE) addresses a critical and timely problem in AI safety alignment for tool-using LLM agents, proposing a novel on-policy self-evolution framework with Pareto-front optimization. Its contributions—trajectory-level failure repair, multi-objective safety-utility balancing, and strong empirical results across multiple benchmarks—have broader impact potential given the urgency of AI safety research. Paper 2 (UAF) makes a solid engineering contribution to full-duplex speech interaction by unifying front-end tasks, but it is more incremental and narrower in scope. The safety alignment problem addressed by Paper 1 has wider cross-field implications and greater timeliness.

    vs. PRISM-MCTS: Learning from Reasoning Trajectories with Metacognitive Reflection
    claude-opus-4.6 · 5/16/2026

    Paper 1 (FATE) addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel framework combining trajectory-level failure repair, multi-objective optimization via Pareto-front methods, and on-policy self-evolution without expert demonstrations. Its impact spans AI safety, alignment, and agent deployment, which are high-stakes areas. Paper 2 (PRISM-MCTS) offers efficiency improvements to MCTS-based reasoning via shared memory, which is valuable but more incremental. FATE's broader applicability to real-world agent safety, strong empirical results across multiple benchmarks, and novel Pareto-aware optimization give it higher potential impact.

    vs. Retrieval is Cheap, Show Me the Code: Executable Multi-Hop Reasoning for Retrieval-Augmented Generation
    claude-opus-4.6 · 5/16/2026

    Paper 1 (FATE) addresses a critical and timely problem—safety alignment for tool-using LLM agents—with a novel framework combining trajectory-level failure repair, multi-objective optimization via Pareto-front methods, and on-policy self-evolution. This tackles fundamental challenges in deploying agents safely at scale, with broad implications across AI safety, alignment, and agentic systems. Paper 2 (PyRAG) offers a clever reformulation of multi-hop RAG as code execution, but addresses a more incremental improvement in retrieval-augmented QA. FATE's novelty in safety-utility Pareto optimization and its applicability to the rapidly growing agent ecosystem gives it higher potential impact.

    vs. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories
    gpt-5.2 · 5/16/2026

    Paper 1 likely has higher impact due to broader scope and methodological contributions: it targets agentic safety at the trajectory/tool-use level (more general than reasoning-only), introduces a concrete self-evolution pipeline (failure→repair supervision) plus a multi-objective Pareto-aware optimization method (PFPO) to manage safety–utility–over-refusal trade-offs. It reports improvements across multiple agent benchmarks and metrics relevant to deployment. Paper 2 is timely and useful but narrower (unsafe reasoning recovery) and framed as a single-objective RL robustness approach, with potentially less cross-field applicability.

    vs. Towards Knowledgeable Deep Research: Framework and Benchmark
    claude-opus-4.6 · 5/16/2026

    Paper 2 addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel framework (FATE) that introduces several innovative contributions: on-policy self-evolution from failure trajectories, Pareto-front policy optimization balancing safety and utility, and dense trajectory-level supervision. The substantial empirical improvements (33.5% reduction in attack success, 82.6% in harmful compliance) across multiple benchmarks demonstrate strong results. Safety alignment is a foundational concern with broad impact across all LLM agent applications, making this work more broadly impactful than Paper 1's more specialized contribution to structured knowledge in deep research reports.

    vs. MEMOREPAIR: Barrier-First Cascade Repair in Agentic Memory
    gemini-3.1 · 5/16/2026

    Paper 2 addresses the critical and highly relevant challenge of LLM agent safety alignment, specifically tackling the pervasive safety-utility trade-off. By focusing on trajectory-level failures rather than just final responses, it offers a more robust methodology for real-world agent deployments. The approach has broader societal implications and higher potential applicability across AI safety domains compared to the more specialized systems-level memory management problem addressed in Paper 1.

    vs. From Fuzzy to Formal: Scaling Hospital Quality Improvement with AI
    claude-opus-4.6 · 5/16/2026

    Paper 1 addresses a fundamental and broadly applicable challenge in AI safety alignment for tool-using LLM agents, proposing a novel on-policy self-evolving framework (FATE) with Pareto-front optimization. Its methodological contributions—trajectory-level failure repair, multi-objective safety-utility balancing, and dense supervision from failures—are generalizable across models and scales. The significant quantitative improvements (33.5% ASR reduction, 82.6% harmful compliance reduction) demonstrate strong results. While Paper 2 makes a valuable contribution to healthcare QI, its impact is more domain-specific, and its concordance threshold (≥70%) is modest. Paper 1's broader applicability to the rapidly growing field of AI agent safety gives it higher potential impact.

    vs. DataClaw: A Process-Oriented Agent Benchmark for Exploratory Real-World Data Analysis
    gemini-3.1 · 5/13/2026

    Paper 2 tackles a critical, high-impact challenge in AI—agentic safety alignment and the safety-utility trade-off. Its methodological innovations, specifically trajectory-level on-policy self-evolution and Pareto-Front Policy Optimization, offer broad applicability across all tool-using LLM agents. While Paper 1 provides a valuable benchmark for data analysis, Paper 2's focus on foundational AI safety mechanisms gives it greater potential for widespread, cross-disciplinary impact and urgent real-world implementation.

    vs. Probing Cross-modal Information Hubs in Audio-Visual LLMs
    gemini-3.1 · 5/13/2026

    Paper 2 addresses a critical and highly timely challenge: the safety-utility trade-off in autonomous tool-using LLM agents. By focusing on trajectory-level failures rather than just final responses, the FATE framework offers a robust, on-policy self-evolution method for agentic safety alignment. This has significant real-world applications in deploying secure AI agents. While Paper 1 provides valuable mechanistic insights into multimodal LLMs, Paper 2's focus on autonomous agent safety and its strong empirical improvements across multiple benchmarks suggest a broader and more immediate scientific and practical impact.

    vs. CBV: Clean-label Backdoor Attacks on Vision Language Models via Diffusion Models
    gemini-3.1 · 5/13/2026

    Paper 1 addresses a critical and highly timely bottleneck in AI: the safety and alignment of autonomous, tool-using LLM agents. By focusing on trajectory-level failures rather than just final responses, and proposing an automated self-evolution framework that mitigates the safety-utility trade-off, it offers broad, real-world utility for deploying safe AI. While Paper 2 presents a clever stealthy backdoor attack for VLMs, Paper 1's focus on foundational alignment and self-improving safety mechanisms across different models promises a wider, more constructive impact on the rapidly growing field of agentic AI.

    vs. BehaviorGuard: Online Backdoor Defense for Deep Reinforcement Learning
    claude-opus-4.6 · 5/13/2026

    FATE addresses a critical and timely problem—safety alignment of tool-using LLM agents—with a novel on-policy self-evolution framework that transforms failure trajectories into repair supervision. Its multi-objective Pareto optimization for safety-utility trade-offs is methodologically innovative. The breadth of impact is larger given the widespread deployment of LLM agents. Paper 2 makes a solid contribution to DRL backdoor defense with an online, trigger-agnostic approach, but targets a narrower domain with fewer immediate real-world applications compared to the rapidly growing LLM agent ecosystem.

    vs. A CAP-like Trilemma for Large Language Models: Correctness, Non-bias, and Utility under Semantic Underdetermination
    gpt-5.2 · 5/13/2026

    Paper 1 has higher likely scientific impact due to a concrete, novel training framework (FATE + PFPO) addressing a pressing real-world problem—agentic safety for tool-using LLMs—with on-policy trajectory-level supervision and multi-objective filtering. It reports quantitative improvements on multiple established benchmarks and offers an actionable method that can be adopted and extended across agent systems, RLHF/verification pipelines, and safety tooling. Paper 2 presents a valuable conceptual framing, but it is largely theoretical/conjectural with examples and less direct empirical or engineering leverage, limiting near-term uptake and measurable downstream impact.

    vs. ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents
    claude-opus-4.6 · 5/13/2026

    Paper 1 (FATE) addresses the critical and broadly relevant problem of safety alignment for tool-using LLM agents, introducing novel concepts like trajectory-level failure-based self-evolution and Pareto-front policy optimization for safety-utility trade-offs. Its impact spans the entire LLM agent safety community with strong empirical results across multiple benchmarks. Paper 2 (ToolCUA) makes a solid contribution to GUI-tool orchestration but addresses a narrower problem domain. FATE's focus on safety alignment is more timely and foundational, with broader implications for deploying agents in real-world settings.

    vs. HEDP: A Hybrid Energy-Distance Prompt-based Framework for Domain Incremental Learning
    claude-opus-4.6 · 5/13/2026

    Paper 2 addresses the critical and timely problem of safety alignment for LLM agents, proposing a novel framework (FATE) that tackles trajectory-level failures rather than just response-level issues. Its contributions—on-policy self-evolution from failures, Pareto-front optimization for safety-utility trade-offs, and dense trajectory-level supervision—are more innovative and broadly impactful. The dramatic improvements (33.5% reduction in attack success, 82.6% in harmful compliance) demonstrate strong practical value. LLM agent safety is a rapidly growing area with wide real-world implications, giving Paper 2 greater breadth and timeliness compared to Paper 1's incremental improvements in domain incremental learning.

    vs. Flowr -- Scaling Up Retail Supply Chain Operations Through Agentic AI in Large Scale Supermarket Chains
    gemini-3.1 · 5/13/2026

    Paper 1 addresses a fundamental, urgent challenge in AI: safety alignment for tool-using LLM agents. By introducing trajectory-level supervision and Pareto-Front Policy Optimization, it provides a novel algorithmic advance that solves the pervasive safety-utility trade-off, backed by rigorous quantitative results across multiple benchmarks. In contrast, Paper 2 presents a valuable but highly applied framework for supply chain automation using existing multi-agent paradigms without specific quantitative metrics in the abstract. Paper 1's foundational contributions have critical implications for the secure deployment of AI agents across all domains, yielding higher scientific impact.