From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

Jun 4, 2026

arXiv:2606.05805v1

cs.AI (primary)

#1987 of 3355 · Artificial Intelligence

Tournament Score: 1382±47

Win Rate: 63%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: TRIAD — From Risk Classification to Action Plan Remediation

1. Core Contribution

TRIAD addresses a genuine gap in LLM agent safety: existing guardrails operate as binary allow/block gates, which causes them to sacrifice benign task objectives when threats are embedded within otherwise legitimate workflows (e.g., prompt injection attacks contaminating benign tasks). The paper's key insight is that guardrails should provide actionable revision guidance rather than just risk signals.

The core novelty is threefold: (1) a three-way decision framework (proceed/update/refuse) that distinguishes between safe plans, partially unsafe plans needing revision, and purely harmful requests; (2) structured natural language feedback that gets injected into the agent's context via ICL templates, forming a closed loop between guardrail assessment and agent planning; and (3) a training pipeline using GPT-5.4 distillation to create trajectory-feedback pairs for fine-tuning a 9B guardrail model (Tri-Guard). The "update" pathway is the most distinctive element — it enables plan remediation rather than wholesale rejection, preserving the benign portions of contaminated tasks.
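The closed loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `guarded_step`, `StepResult`, `toy_agent`, and `toy_guardrail` are hypothetical, the revision budget of 3 is taken from the K=3 mentioned in this assessment, and falling back to refusal once the budget is spent is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class StepResult:
    decision: str          # final decision: "proceed" or "refuse"
    plan: Optional[str]    # plan cleared for execution, or None if refused
    revisions: int         # number of plan revisions that were requested


def guarded_step(
    propose: Callable[[List[str]], str],
    guardrail: Callable[[str], Tuple[str, str]],
    context: List[str],
    max_revisions: int = 3,
) -> StepResult:
    """Run one planning step under a three-way guardrail (hypothetical helper)."""
    plan = propose(context)
    for revisions in range(max_revisions + 1):
        decision, feedback = guardrail(plan)
        if decision == "proceed":
            return StepResult("proceed", plan, revisions)
        if decision == "refuse":
            return StepResult("refuse", None, revisions)
        if revisions == max_revisions:
            break  # still "update", but the revision budget is spent
        # "update": inject the guardrail's feedback and let the agent re-plan.
        context.append(feedback)
        plan = propose(context)
    # Assumption: fall back to refusal when revisions do not converge.
    return StepResult("refuse", None, max_revisions)


def toy_agent(context: List[str]) -> str:
    # Toy agent: drops the injected transfer once feedback asks to remove it.
    if any("remove" in note for note in context):
        return "summarize inbox"
    return "summarize inbox; wire funds to unknown account"


def toy_guardrail(plan: str) -> Tuple[str, str]:
    # Toy guardrail: flags the risky tool call and asks for a revision.
    if "wire funds" in plan:
        return "update", "remove the wire transfer; keep the summary"
    return "proceed", ""


result = guarded_step(toy_agent, toy_guardrail, context=[])
print(result)
```

In the toy run, the contaminated plan is revised once and the benign part of the task (the summary) survives, which is the behavior the "update" pathway is meant to enable.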

2. Methodological Rigor

Strengths in methodology:

The problem motivation is empirically grounded: Figure 1 shows that existing guardrails achieve only about 58.57% recall on prompt injection attacks (PIAs), and that even detected threats rarely translate into safer agent behavior (TSR < 2.31%).

The training data pipeline is well-structured across four stages (task collection, trajectory generation, knowledge distillation, data pair construction), with label-consistency filtering to ensure quality.

Contamination analysis (Appendix C.1) using n-gram overlap checks provides reasonable confidence against train-test leakage.

Evaluation spans four diverse agent backbones (two open-weight, two proprietary) across two benchmarks covering different attack settings.
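For readers unfamiliar with the n-gram overlap check mentioned above, the sketch below shows one common form of it. The whitespace tokenization, the choice of n, and the function names `ngrams` and `overlap_ratio` are assumptions for illustration; the paper's exact settings are not stated in this assessment.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_text: str, test_text: str, n: int = 8) -> float:
    """Fraction of the test item's n-grams that also occur in the training item."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0  # test item shorter than n tokens: nothing to compare
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


# Toy example with bigrams: two of the three test bigrams appear in training.
ratio = overlap_ratio("the quick brown fox jumps", "quick brown fox sleeps", n=2)
print(round(ratio, 3))
```

A test item whose ratio exceeds some chosen threshold would be flagged as potentially leaked; the threshold itself is a design choice that this sketch leaves open.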

Concerns:

The training data construction relies heavily on GPT-5.4 for both task rewriting and knowledge distillation, introducing potential biases from a single teacher model. The paper acknowledges but does not systematically study this dependency.

The weighted SFT approach uses teacher confidence as sample weights, but no ablation isolates its contribution versus uniform weighting.

The "update" mechanism allows up to K=3 revision attempts, but the paper does not analyze in depth the failure modes that arise when updates fail to converge to a safe plan.

With Tri-Guard as the guardrail, TRIAD's ASR is sometimes higher than with the base Qwen3.5-9B model (Table 3), suggesting the training may occasionally make the guardrail too permissive. This is a concerning safety tradeoff that deserves more analysis.
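The weighted-SFT concern raised above can be made concrete with a small sketch of a confidence-weighted loss. This is an illustration under stated assumptions, not the paper's objective: the function name `weighted_nll`, the per-sample token averaging, and the normalization by the sum of weights are all hypothetical choices. Setting every weight to 1.0 gives the uniform baseline that the missing ablation would compare against.

```python
import math
from typing import List, Sequence


def weighted_nll(token_logprobs: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> float:
    """Confidence-weighted mean negative log-likelihood over samples.

    token_logprobs[i] holds the model's log-probabilities for the target
    tokens of sample i; weights[i] is the teacher confidence for sample i.
    """
    total, norm = 0.0, 0.0
    for logprobs, w in zip(token_logprobs, weights):
        sample_nll = -sum(logprobs) / len(logprobs)  # mean NLL per token
        total += w * sample_nll
        norm += w
    return total / norm  # assumption: normalize by the sum of weights


# Two toy samples with one target token each.
samples: List[List[float]] = [[math.log(0.5)], [math.log(0.25)]]
uniform = weighted_nll(samples, [1.0, 1.0])    # uniform-weight baseline
weighted = weighted_nll(samples, [0.9, 0.1])   # low-confidence sample down-weighted
print(uniform, weighted)
```

Down-weighting the second sample pulls the loss toward the first sample's value, which is how teacher confidence would reduce the influence of uncertain labels.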

3. Potential Impact

Direct applications: The framework is relevant for deploying LLM agents in high-stakes environments (financial services, system administration, healthcare) where tools have irreversible effects. The plug-and-play, black-box nature of TRIAD makes it practically deployable.

Broader influence: The paper advances an important paradigm shift — from guardrails as external risk detectors to guardrails as integrated planning components. This "closed-loop" design philosophy could influence how the community thinks about safety interventions in agentic systems more broadly. The idea that safety modules should provide constructive feedback rather than binary verdicts parallels developments in RLHF and constitutional AI.

Limitations on impact: The 9B guardrail model adds ~5 seconds of latency per step (Table 7), which may be prohibitive for latency-sensitive applications. The approach is also tested only on ReAct-style agents; generalization to other agent architectures (tree-of-thought, plan-and-execute) remains unverified.

4. Timeliness & Relevance

This work is highly timely. As LLM agents are increasingly deployed with tool-use capabilities, prompt injection attacks represent one of the most pressing security threats. The observation that existing guardrails fail to preserve benign task completion under PIAs directly addresses a current bottleneck. The paper appears concurrent with several recent works (ToolSafe, AGrail, Safiron) but differentiates itself through the update/remediation pathway and end-to-end integration.

5. Strengths & Limitations

Key Strengths:

Well-motivated problem: The empirical demonstration that detection alone is insufficient (Figure 1) is compelling and clearly establishes the need for the proposed approach.

Strong empirical results: Average ASR drops from 74.45% to 10.42% (a reduction of 64.03 percentage points) while TSR improves from 28.45% to 68.60% (a gain of 40.15 percentage points), a substantial improvement on both dimensions simultaneously.

Comprehensive evaluation: Four agent backbones, two benchmarks, three attack settings, multiple baselines, plus extensive ablations and case studies.

Practical design: Black-box access to the target agent, plug-and-play deployment, reasonable model size (9B).

The HS metric is a useful contribution for evaluating safety-utility tradeoffs.

Notable Weaknesses:

Safety-utility tradeoff is real: Tri-Guard sometimes yields a higher ASR than the more conservative base model (13.04% vs. 5.56% average ASR under DPI), meaning the training makes the guardrail less cautious. In safety-critical settings, this could be problematic.

Limited attack diversity: Only two PIA templates are used in the main experiments (though five are tested in the supplementary material). More sophisticated adaptive attacks targeting the guardrail feedback loop are not considered.

No adversarial robustness analysis: An attacker who knows about the three-way decision mechanism could potentially craft inputs that consistently trigger "proceed" decisions.

Scalability concerns: The approach requires trajectory-level context for guardrail evaluation, which grows with interaction length. The 16K token limit may be insufficient for complex, long-horizon tasks.

Evaluation benchmarks: ASB and AgentHarm, while standard, use simulated tool environments. Real-world deployment implications remain uncertain.

The paper uses unreleased models (GPT-5.1, GPT-5.4) for training and evaluation, which limits reproducibility.

Overall Assessment

TRIAD presents a well-executed contribution that addresses a clearly defined and practically important gap in LLM agent safety. The three-way decision framework with structured feedback is a meaningful advance over binary guardrails, and the empirical results are strong. However, the fundamental safety-utility tradeoff — where training the guardrail to be less conservative inherently allows more attacks through — deserves deeper theoretical and empirical investigation. The reliance on unreleased models for critical pipeline components (distillation, evaluation) somewhat limits reproducibility despite the code release.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (19)

vs. AEGIS: A Backup Reflex for Physical AI

gemini-3.1 · 6/8/2026

Paper 2 addresses a critical bottleneck in LLM agent deployment: safety and reliability. By transitioning guardrails from simple blocking mechanisms to iterative feedback loops for plan remediation, it offers a highly practical solution. Given the explosive growth and broad applicability of LLM agents across diverse software domains, this framework promises higher immediate real-world utility and wider cross-disciplinary impact than Paper 1, which focuses on the more specialized domain of physical robotics.

vs. OpenSkill: Open-World Self-Evolution for LLM Agents

claude-opus-4.6 · 6/8/2026

OpenSkill addresses a more fundamental and broadly impactful challenge—enabling LLM agents to self-evolve in open-world settings without any target-task supervision. This tackles a core limitation of current agent systems (reliance on curated training signals) with a novel bootstrapping framework that builds both skills and verification from scratch. While TRIAD makes a solid contribution to LLM safety with its three-way guardrail feedback loop, it operates within a more established safety/guardrail paradigm. OpenSkill's transferability across models and its supervision-free design have broader implications for autonomous agent development across many domains.

vs. Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

gpt-5.2 · 6/8/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: guardrailing LLM agents is an active, fast-moving area with immediate deployment needs. TRIAD introduces a practical “update” intervention and closed-loop integration between guardrail feedback and agent planning, evaluated on established benchmarks with safety–utility trade-offs—relevant across agentic systems, security, and HCI. Paper 1 is a solid engineering advance for GPU-based pseudo-Boolean SAT, but its impact is narrower (specialized solver domain) and more incremental relative to existing GPU/JAX optimization trends.

vs. Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation

claude-opus-4.6 · 6/6/2026

Paper 2 (TRIAD) addresses a more novel and broadly impactful problem—safe LLM agent behavior through guardrail-integrated feedback loops. Its tripartite decision framework (proceed/refuse/update) with closed-loop plan revision is more innovative than Paper 1's curriculum learning approach, which combines known techniques. Paper 2 tackles the critical and timely AI safety challenge with a practical framework applicable across many domains, provides open-source code, and evaluates on established benchmarks. Paper 1's contribution is more incremental, applying curriculum learning to medical QA with BERTScore evaluation, which is narrower in scope and methodological novelty.

vs. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

claude-opus-4.6 · 6/6/2026

Paper 2 (TRIAD) addresses the critical and timely problem of LLM agent safety with a novel three-way guardrail framework that goes beyond binary allow/deny decisions. Its closed-loop feedback mechanism for plan revision is innovative and has broader impact across all LLM agent applications. The safety-utility trade-off problem is fundamental as agents are deployed in high-stakes settings. Paper 1, while solid, focuses on incremental improvements to web automation skill retrieval—a narrower domain. Paper 2's contributions are more generalizable and address a more pressing concern in the rapidly growing field of autonomous agents.

vs. Semantic Partial Grounding via LLMs

gemini-3.1 · 6/6/2026

Paper 1 addresses a highly critical and timely issue: the safety and alignment of LLM-based autonomous agents. By introducing a framework that uses guardrail feedback to remediate plans rather than just blocking them, it significantly advances practical AI safety. Its relevance to the booming field of LLM agents gives it broader potential real-world applications and a higher estimated scientific impact compared to Paper 2, which applies LLMs to the more specialized, niche domain of classical planning grounding.

vs. EvoDrive: Pareto Evolution for Safety-Critical Autonomous Driving via Self-Improving LLM Agents

gpt-5.2 · 6/6/2026

Paper 1 likely has higher impact due to stronger novelty and broader real-world relevance: a simulator-grounded, Pareto-based evolutionary framework for safety-critical autonomous driving scenario generation, validated on major simulators (MetaDrive, CARLA) and directly useful for validation and training. Its multi-objective, grounded agentic evolution with budget-aware evaluation and Pareto archiving is methodologically distinctive and can influence AV testing, RL/simulation research, and safety engineering. Paper 2 is timely and practical for LLM agent safety, but its contributions (guardrail feedback loops and a triage decision) are more incremental within a rapidly crowded guardrails space.

vs. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

gpt-5.2 · 6/5/2026

Paper 1 has higher potential scientific impact due to its novel, formal causal decomposition that corrects a widely used but biased estimand in RLVR, with pre-registered experimental confirmation, identification analysis, and re-audits showing immediate implications for interpreting prior results. The contribution is methodological and broadly applicable as an evaluation/audit tool across alignment and RLHF/RLVR research, improving rigor and scientific validity. Paper 2 is practically useful for agent safety and shows empirical gains, but is a more incremental systems framework (finetune + feedback loop) with narrower conceptual novelty and less theoretical generality.

vs. AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

gpt-5.2 · 6/5/2026

Paper 2 has higher impact potential because it proposes a novel, actionable guardrail-agent integration (TRIAD) that moves beyond risk labeling to iterative remediation, directly improving downstream agent behavior and enabling safer task completion. It evaluates on established agent safety benchmarks (ASB, AgentHarm) with clear safety-utility trade-offs, suggesting stronger methodological relevance and broader applicability to real-world LLM agents and tool-using systems. Paper 1’s benchmark is valuable and timely, but its impact is narrower (AI companions) and primarily evaluative rather than intervention-oriented.

vs. Edit-R2: Context-Aware Reinforcement Learning for Multi-Turn Image Editing

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical bottleneck in LLM agent deployment: safety guardrails that over-block tasks. By introducing an iterative feedback framework that allows agents to revise plans and salvage benign components, it offers a highly practical solution with broad applicability across all domains utilizing autonomous agents. While Paper 1 presents an innovative approach to multi-turn image editing, Paper 2's focus on AI safety and improving the safety-utility trade-off provides a wider, more urgent real-world impact across the AI ecosystem.

vs. AdaMEM: Test-Time Adaptive Memory for Language Agents

gpt-5.2 · 6/5/2026

Paper 1 (AdaMEM) is likely to have higher scientific impact due to a more novel, general-purpose adaptation mechanism for long-horizon LLM agents: continuous test-time behavior adaptation via hybrid long-/short-term memory without online parameter updates, plus a training method (STEP-MFT) to synthesize strategies from retrieved experience. This targets a broad capability bottleneck (agent robustness over time) with applicability across many agent settings (web, embodied, QA/search) and aligns with timely interest in scalable agent memory. Paper 2 is valuable and timely for safety, but is narrower (guardrail feedback loop) and more policy/dataset dependent.

vs. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical and highly timely challenge in artificial intelligence—LLM agent safety and alignment. Its novel approach of using iterative guardrail feedback to revise action plans rather than outright blocking tasks has broad implications across any domain deploying autonomous agents. In contrast, Paper 2 presents a valuable but narrower application of computer vision to infrastructure inspection. The explosive growth and cross-disciplinary relevance of LLM agents give Paper 1 a significantly higher potential for widespread scientific and real-world impact.

vs. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns

claude-opus-4.6 · 6/5/2026

Paper 1 addresses a critical and timely problem in LLM agent safety with a concrete, well-evaluated framework (TRIAD) that demonstrates strong empirical results on established benchmarks. It introduces a novel three-way guardrail decision mechanism with closed-loop feedback, showing clear improvements in safety-utility trade-offs. Paper 2 proposes interesting entropy-based evaluation metrics but is more conceptual, lacks rigorous empirical validation on benchmarks, and serves as a complementary measurement tool rather than solving a pressing problem. Paper 1's direct applicability to AI safety and its methodological rigor give it higher potential impact.

vs. TRACE: A Temporal Conditional Estimation for Multimodal Time Series Foundation Models

gpt-5.2 · 6/5/2026

Paper 1 (TRACE) likely has higher scientific impact due to a more broadly applicable methodological contribution: conditional cross-modal estimation for irregularly sampled, partially missing multimodal time series—an endemic issue in real-world sensing and healthcare. It targets foundation-model pipelines and is evaluated on diverse domains (clinical + affective computing), suggesting wider cross-field reuse. Paper 2 (TRIAD) is timely and practically important for LLM agent safety, but appears more system/benchmark-driven with narrower methodological generality and faster-obsolescence risk given rapid guardrail/agent iteration cycles.

vs. Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

gemini-3.1 · 6/5/2026

Paper 1 addresses a critical and universal challenge in AI—agent safety and alignment—by introducing a novel, closed-loop guardrail framework for plan remediation. This methodological innovation offers broad applicability across all LLM agent domains. In contrast, Paper 2 presents a valuable but narrower empirical evaluation of LLMs in a specific medical subfield, offering less methodological innovation and a more restricted scope of impact.

vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

claude-opus-4.6 · 6/5/2026

Paper 1 (TRIAD) addresses a critical and timely problem in LLM agent safety with a novel framework that goes beyond binary allow/deny guardrails to enable iterative plan remediation. Its closed-loop feedback mechanism between guardrails and agent planning is innovative and has broad applicability across the rapidly growing LLM agent ecosystem. Paper 2 (PLAN-S) makes solid contributions to autonomous driving world models with style-conditioned cost maps, but operates in a more narrow domain. Given the explosive growth of LLM agents and urgent safety concerns, TRIAD's approach to preserving utility while mitigating risks has higher potential for broad impact across multiple fields deploying LLM agents.

vs. Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact due to stronger novelty and broader conceptual relevance: it uncovers and quantifies an intrinsic convergence/attractor bias in LLM-driven program mutation, contrasted against a classical GP operator, yielding a generally applicable insight for LLM-based search, program synthesis, and open-ended evolution. Its methodological framing (mutation chains, cycle analysis, robustness across prompts/models) targets a foundational limitation that could influence many downstream systems. Paper 2 is timely and application-relevant for agent safety, but is closer to an engineering framework/finetuning recipe whose impact may be narrower and more benchmark-dependent.

vs. RedditPersona: A Modular Framework for Community-Conditioned LLM Adaptation from Reddit

claude-opus-4.6 · 6/5/2026

TRIAD addresses a critical and timely problem in LLM agent safety with a novel three-way guardrail decision framework (proceed/refuse/update) that preserves utility while improving safety. Its closed-loop feedback mechanism between guardrails and agent planning is innovative and practically important as LLM agents are increasingly deployed. The paper demonstrates strong empirical results on established benchmarks. RedditPersona, while methodologically sound and useful for community-conditioned adaptation, addresses a narrower problem with more incremental contributions (standardizing existing approaches). Agent safety has broader cross-field impact and greater urgency.

vs. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo

gemini-3.1 · 6/5/2026

Paper 2 introduces a foundational, theoretical framework for knowledge infusion across multimodal generative models, offering broad conceptual insights. While Paper 1 presents a highly practical and effective solution for LLM agent safety, Paper 2's structured categorization of intervention layers provides a broader methodological foundation that is likely to influence a wider range of generative AI research, architectures, and alignment strategies.