From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents
Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan
Abstract
LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.
AI Impact Assessments
(1 models)Scientific Impact Assessment: TRIAD — From Risk Classification to Action Plan Remediation
1. Core Contribution
TRIAD addresses a genuine gap in LLM agent safety: existing guardrails operate as binary allow/block gates, which causes them to sacrifice benign task objectives when threats are embedded within otherwise legitimate workflows (e.g., prompt injection attacks contaminating benign tasks). The paper's key insight is that guardrails should provide actionable revision guidance rather than just risk signals.
The core novelty is threefold: (1) a three-way decision framework (proceed/update/refuse) that distinguishes between safe plans, partially unsafe plans needing revision, and purely harmful requests; (2) structured natural language feedback that gets injected into the agent's context via ICL templates, forming a closed loop between guardrail assessment and agent planning; and (3) a training pipeline using GPT-5.4 distillation to create trajectory-feedback pairs for fine-tuning a 9B guardrail model (Tri-Guard). The "update" pathway is the most distinctive element — it enables plan remediation rather than wholesale rejection, preserving the benign portions of contaminated tasks.
2. Methodological Rigor
Strengths in methodology:
Concerns:
3. Potential Impact
Direct applications: The framework is relevant for deploying LLM agents in high-stakes environments (financial services, system administration, healthcare) where tools have irreversible effects. The plug-and-play, black-box nature of TRIAD makes it practically deployable.
Broader influence: The paper advances an important paradigm shift — from guardrails as external risk detectors to guardrails as integrated planning components. This "closed-loop" design philosophy could influence how the community thinks about safety interventions in agentic systems more broadly. The idea that safety modules should provide constructive feedback rather than binary verdicts parallels developments in RLHF and constitutional AI.
Limitations on impact: The 9B guardrail model adds ~5 seconds of latency per step (Table 7), which may be prohibitive for latency-sensitive applications. The approach is also tested only on ReAct-style agents; generalization to other agent architectures (tree-of-thought, plan-and-execute) remains unverified.
4. Timeliness & Relevance
This work is highly timely. As LLM agents are increasingly deployed with tool-use capabilities, prompt injection attacks represent one of the most pressing security threats. The observation that existing guardrails fail to preserve benign task completion under PIAs directly addresses a current bottleneck. The paper appears concurrent with several recent works (ToolSafe, AGrail, Safiron) but differentiates itself through the update/remediation pathway and end-to-end integration.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
TRIAD presents a well-executed contribution that addresses a clearly defined and practically important gap in LLM agent safety. The three-way decision framework with structured feedback is a meaningful advance over binary guardrails, and the empirical results are strong. However, the fundamental safety-utility tradeoff — where training the guardrail to be less conservative inherently allows more attacks through — deserves deeper theoretical and empirical investigation. The reliance on unreleased models for critical pipeline components (distillation, evaluation) somewhat limits reproducibility despite the code release.
Generated Jun 5, 2026
Comparison History (19)
Paper 2 addresses a critical bottleneck in LLM agent deployment: safety and reliability. By transitioning guardrails from simple blocking mechanisms to iterative feedback loops for plan remediation, it offers a highly practical solution. Given the explosive growth and broad applicability of LLM agents across diverse software domains, this framework promises higher immediate real-world utility and wider cross-disciplinary impact than Paper 1, which focuses on the more specialized domain of physical robotics.
OpenSkill addresses a more fundamental and broadly impactful challenge—enabling LLM agents to self-evolve in open-world settings without any target-task supervision. This tackles a core limitation of current agent systems (reliance on curated training signals) with a novel bootstrapping framework that builds both skills and verification from scratch. While TRIAD makes a solid contribution to LLM safety with its three-way guardrail feedback loop, it operates within a more established safety/guardrail paradigm. OpenSkill's transferability across models and its supervision-free design have broader implications for autonomous agent development across many domains.
Paper 2 likely has higher impact due to timeliness and broad applicability: guardrailing LLM agents is an active, fast-moving area with immediate deployment needs. TRIAD introduces a practical “update” intervention and closed-loop integration between guardrail feedback and agent planning, evaluated on established benchmarks with safety–utility trade-offs—relevant across agentic systems, security, and HCI. Paper 1 is a solid engineering advance for GPU-based pseudo-Boolean SAT, but its impact is narrower (specialized solver domain) and more incremental relative to existing GPU/JAX optimization trends.
Paper 2 (TRIAD) addresses a more novel and broadly impactful problem—safe LLM agent behavior through guardrail-integrated feedback loops. Its tripartite decision framework (proceed/refuse/update) with closed-loop plan revision is more innovative than Paper 1's curriculum learning approach, which combines known techniques. Paper 2 tackles the critical and timely AI safety challenge with a practical framework applicable across many domains, provides open-source code, and evaluates on established benchmarks. Paper 1's contribution is more incremental, applying curriculum learning to medical QA with BERTScore evaluation, which is narrower in scope and methodological novelty.
Paper 2 (TRIAD) addresses the critical and timely problem of LLM agent safety with a novel three-way guardrail framework that goes beyond binary allow/deny decisions. Its closed-loop feedback mechanism for plan revision is innovative and has broader impact across all LLM agent applications. The safety-utility trade-off problem is fundamental as agents are deployed in high-stakes settings. Paper 1, while solid, focuses on incremental improvements to web automation skill retrieval—a narrower domain. Paper 2's contributions are more generalizable and address a more pressing concern in the rapidly growing field of autonomous agents.
Paper 1 addresses a highly critical and timely issue: the safety and alignment of LLM-based autonomous agents. By introducing a framework that uses guardrail feedback to remediate plans rather than just blocking them, it significantly advances practical AI safety. Its relevance to the booming field of LLM agents gives it broader potential real-world applications and a higher estimated scientific impact compared to Paper 2, which applies LLMs to the more specialized, niche domain of classical planning grounding.
Paper 1 likely has higher impact due to stronger novelty and broader real-world relevance: a simulator-grounded, Pareto-based evolutionary framework for safety-critical autonomous driving scenario generation, validated on major simulators (MetaDrive, CARLA) and directly useful for validation and training. Its multi-objective, grounded agentic evolution with budget-aware evaluation and Pareto archiving is methodologically distinctive and can influence AV testing, RL/simulation research, and safety engineering. Paper 2 is timely and practical for LLM agent safety, but its contributions (guardrail feedback loops and a triage decision) are more incremental within a rapidly crowded guardrails space.
Paper 1 has higher potential scientific impact due to its novel, formal causal decomposition that corrects a widely used but biased estimand in RLVR, with pre-registered experimental confirmation, identification analysis, and re-audits showing immediate implications for interpreting prior results. The contribution is methodological and broadly applicable as an evaluation/audit tool across alignment and RLHF/RLVR research, improving rigor and scientific validity. Paper 2 is practically useful for agent safety and shows empirical gains, but is a more incremental systems framework (finetune + feedback loop) with narrower conceptual novelty and less theoretical generality.
Paper 2 has higher impact potential because it proposes a novel, actionable guardrail-agent integration (TRIAD) that moves beyond risk labeling to iterative remediation, directly improving downstream agent behavior and enabling safer task completion. It evaluates on established agent safety benchmarks (ASB, AgentHarm) with clear safety-utility trade-offs, suggesting stronger methodological relevance and broader applicability to real-world LLM agents and tool-using systems. Paper 1’s benchmark is valuable and timely, but its impact is narrower (AI companions) and primarily evaluative rather than intervention-oriented.
Paper 2 addresses a critical bottleneck in LLM agent deployment: safety guardrails that over-block tasks. By introducing an iterative feedback framework that allows agents to revise plans and salvage benign components, it offers a highly practical solution with broad applicability across all domains utilizing autonomous agents. While Paper 1 presents an innovative approach to multi-turn image editing, Paper 2's focus on AI safety and improving the safety-utility trade-off provides a wider, more urgent real-world impact across the AI ecosystem.
Paper 1 (AdaMEM) is likely to have higher scientific impact due to a more novel, general-purpose adaptation mechanism for long-horizon LLM agents: continuous test-time behavior adaptation via hybrid long-/short-term memory without online parameter updates, plus a training method (STEP-MFT) to synthesize strategies from retrieved experience. This targets a broad capability bottleneck (agent robustness over time) with applicability across many agent settings (web, embodied, QA/search) and aligns with timely interest in scalable agent memory. Paper 2 is valuable and timely for safety, but is narrower (guardrail feedback loop) and more policy/dataset dependent.
Paper 1 addresses a critical and highly timely challenge in artificial intelligence—LLM agent safety and alignment. Its novel approach of using iterative guardrail feedback to revise action plans rather than outright blocking tasks has broad implications across any domain deploying autonomous agents. In contrast, Paper 2 presents a valuable but narrower application of computer vision to infrastructure inspection. The explosive growth and cross-disciplinary relevance of LLM agents give Paper 1 a significantly higher potential for widespread scientific and real-world impact.
Paper 1 addresses a critical and timely problem in LLM agent safety with a concrete, well-evaluated framework (TRIAD) that demonstrates strong empirical results on established benchmarks. It introduces a novel three-way guardrail decision mechanism with closed-loop feedback, showing clear improvements in safety-utility trade-offs. Paper 2 proposes interesting entropy-based evaluation metrics but is more conceptual, lacks rigorous empirical validation on benchmarks, and serves as a complementary measurement tool rather than solving a pressing problem. Paper 1's direct applicability to AI safety and its methodological rigor give it higher potential impact.
Paper 1 (TRACE) likely has higher scientific impact due to a more broadly applicable methodological contribution: conditional cross-modal estimation for irregularly sampled, partially missing multimodal time series—an endemic issue in real-world sensing and healthcare. It targets foundation-model pipelines and is evaluated on diverse domains (clinical + affective computing), suggesting wider cross-field reuse. Paper 2 (TRIAD) is timely and practically important for LLM agent safety, but appears more system/benchmark-driven with narrower methodological generality and faster-obsolescence risk given rapid guardrail/agent iteration cycles.
Paper 1 addresses a critical and universal challenge in AI—agent safety and alignment—by introducing a novel, closed-loop guardrail framework for plan remediation. This methodological innovation offers broad applicability across all LLM agent domains. In contrast, Paper 2 presents a valuable but narrower empirical evaluation of LLMs in a specific medical subfield, offering less methodological innovation and a more restricted scope of impact.
Paper 1 (TRIAD) addresses a critical and timely problem in LLM agent safety with a novel framework that goes beyond binary allow/deny guardrails to enable iterative plan remediation. Its closed-loop feedback mechanism between guardrails and agent planning is innovative and has broad applicability across the rapidly growing LLM agent ecosystem. Paper 2 (PLAN-S) makes solid contributions to autonomous driving world models with style-conditioned cost maps, but operates in a more narrow domain. Given the explosive growth of LLM agents and urgent safety concerns, TRIAD's approach to preserving utility while mitigating risks has higher potential for broad impact across multiple fields deploying LLM agents.
Paper 1 likely has higher scientific impact due to stronger novelty and broader conceptual relevance: it uncovers and quantifies an intrinsic convergence/attractor bias in LLM-driven program mutation, contrasted against a classical GP operator, yielding a generally applicable insight for LLM-based search, program synthesis, and open-ended evolution. Its methodological framing (mutation chains, cycle analysis, robustness across prompts/models) targets a foundational limitation that could influence many downstream systems. Paper 2 is timely and application-relevant for agent safety, but is closer to an engineering framework/finetuning recipe whose impact may be narrower and more benchmark-dependent.
TRIAD addresses a critical and timely problem in LLM agent safety with a novel three-way guardrail decision framework (proceed/refuse/update) that preserves utility while improving safety. Its closed-loop feedback mechanism between guardrails and agent planning is innovative and practically important as LLM agents are increasingly deployed. The paper demonstrates strong empirical results on established benchmarks. RedditPersona, while methodologically sound and useful for community-conditioned adaptation, addresses a narrower problem with more incremental contributions (standardizing existing approaches). Agent safety has broader cross-field impact and greater urgency.
Paper 2 introduces a foundational, theoretical framework for knowledge infusion across multimodal generative models, offering broad conceptual insights. While Paper 1 presents a highly practical and effective solution for LLM agent safety, Paper 2's structured categorization of intervention layers provides a broader methodological foundation that is likely to influence a wider range of generative AI research, architectures, and alignment strategies.