Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents
Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng
Abstract
Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.
AI Impact Assessments
(1 models)Scientific Impact Assessment: "Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents"
1. Core Contribution
This paper formalizes Sleeper Attack, a cross-interaction threat model for LLM agents in which adversarial content injected via external observations (tool returns, webpages, MCP context) persists in agent state across interactions and is later activated by benign user queries. The key insight is that existing agent safety evaluations focus almost exclusively on single-interaction attacks — where injection and exploitation happen within one user request — while modern agents maintain persistent state (session context, memory, reusable skills) that creates temporal attack surfaces.
The paper contributes three attack strategies: Latent Instruction Planting (LIP), where dormant malicious instructions are stored and later triggered; Proactive Information Elicitation (PIE), where underspecified planted instructions force the agent to solicit missing sensitive parameters from users; and Persistent Information Corruption (PIC), where stored facts are poisoned to corrupt future agent outputs. These strategies are evaluated across three agent state targets (session, memory, skill), yielding a systematic taxonomy.
2. Methodological Rigor
The benchmark construction is methodologically thorough. The 1,896 instances span six harm domains grounded in existing taxonomies, and the evaluation uses a rule-based protocol rather than LLM judges, which the authors correctly identify as insufficiently reliable for trajectory-level safety evaluation. The structured evaluation employs ordered trace matching and argument-value matching with a three-model quality-control pipeline (proposer, critic, arbiter) for eval_config validation — a sensible design choice that improves reproducibility.
The experimental design includes several well-motivated controls: a direct single-interaction baseline enables direct-versus-sleeper comparison, fresh-session replay tests whether hazards survive session boundaries, and longer-horizon sweeps (up to 20 interactions) test temporal persistence. The evaluation of seven models spanning open and closed-source families (Gemini, DeepSeek, Qwen, GPT, Llama) provides reasonable coverage.
However, there are notable methodological limitations. The simulated environment (ToolEmu-derived) with DeepSeek-v3.2 generating tool outputs introduces an abstraction gap — real deployments have authentication, rate limits, and confirmation dialogs that would materially affect attack feasibility. The ask-user simulator is cooperative by design, which inflates PIE success rates relative to realistic user behavior. The template optimization procedure (Appendix A.10) uses iterative prompt engineering to maximize ASR, which, while transparent, means the reported rates reflect optimized attack templates rather than naive adversarial attempts.
3. Potential Impact
This work addresses a genuine deployment concern. As LLM agents are increasingly deployed with persistent memory (e.g., ChatGPT memory, Claude's artifacts), MCP integrations, and skill libraries, the cross-interaction attack surface is expanding rapidly. The paper's demonstration that agents with low single-interaction ASR can have dramatically higher sleeper ASR (e.g., Gemini-3.1-Pro: 6.2% direct → 92.6% on skill for LIP) is a practically important finding that should influence both agent design and safety evaluation practices.
The benchmark itself could serve as a standard evaluation tool, though its reliance on a simulated environment limits direct applicability to production systems. The finding that lightweight defenses (rule-based instructions, LlamaGuard filtering) provide only partial mitigation is valuable for practitioners, though the defense evaluation is relatively shallow — only two defenses on one model.
The work's influence could extend to: (1) agent framework design, motivating stricter state isolation and provenance tracking; (2) safety evaluation standards, pushing for multi-interaction assessment; (3) MCP and tool ecosystem security, given the paper's relevance to emerging agentic infrastructure.
4. Timeliness & Relevance
The paper is highly timely. The rapid deployment of persistent-state LLM agents (memory-equipped assistants, MCP-connected systems, coding agents with skill libraries) creates exactly the attack surfaces this paper studies. The references include 2025-2026 papers on MCP poisoning, skill injection, and memory attacks, positioning this work at the intersection of several active research threads. The unification of session, memory, and skill attack surfaces into a single framework is a timely contribution given the fragmented nature of prior work.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper effectively argues that single-interaction safety evaluation is insufficient, but the practical exploitability of these attacks depends heavily on deployment-specific factors (state persistence policies, user confirmation requirements) that the benchmark cannot capture. The open-model scaling results showing that larger Qwen3 models have *higher* sleeper ASR (especially PIC: 30.2% at 4B → 49.2% at 32B) is a particularly interesting finding suggesting that improved capability may increase vulnerability to state-based attacks.
Generated May 28, 2026
Comparison History (17)
Paper 2 likely has higher scientific impact due to its timely, security-critical framing (persistent, stateful “Sleeper Attacks” on LLM agents), broad relevance across agent frameworks, safety, and deployment contexts, and clear real-world implications for tool-using systems. It introduces a novel threat model beyond single-turn jailbreaks and provides a sizable benchmark spanning outcomes, strategies, and state targets, with evidence across multiple open/closed models. Paper 1 is valuable for scalable model routing and benchmarking, but its impact is narrower and more systems/ML-infra focused.
Paper 1 identifies a novel, persistent security vulnerability in LLM agents (Sleeper Attacks), moving beyond standard single-interaction exploits. Given the rapid, widespread deployment of autonomous LLM agents across diverse industries, addressing long-term memory and state-based vulnerabilities is critical for AI safety. Paper 2 presents an interesting interdisciplinary application of LLMs for reward shaping and fairness in building energy management, but its scope is heavily restricted to HVAC and thermal comfort. Consequently, Paper 1 has a significantly broader potential impact, higher relevance to core AI development, and addresses a more urgent generalized security threat.
CaMBRAIN introduces a fundamentally new architecture for EEG processing that addresses critical limitations (quadratic scaling, fixed-length inputs) with a novel causal SSM approach and custom training pipeline. It achieves SOTA across 3 datasets with 10x throughput gains, enabling real-time continuous EEG monitoring with clear clinical applications. Paper 1, while identifying an important LLM security threat (sleeper attacks), is more incremental within the adversarial AI safety space, extending known attack paradigms to multi-interaction settings. Paper 2's cross-disciplinary impact (ML + neuroscience + clinical medicine) and practical applicability give it broader potential impact.
Paper 2 addresses a critical and timely security vulnerability in LLM agents—persistent cross-interaction sleeper attacks—which has immediate practical implications for AI safety as LLM agents are increasingly deployed. It introduces a novel threat formalization, a comprehensive benchmark (1,896 instances), and evaluates across seven LLMs, providing actionable insights for the safety community. Paper 1, while intellectually interesting in studying perceptual geometry in LLMs, is more observational and narrower in its impact scope, primarily contributing to interpretability research without clear downstream applications.
Paper 1 likely has higher impact because it identifies and formalizes a new, practically critical threat model for LLM agents—persistent, cross-interaction “sleeper” attacks—directly relevant to deploying agents with tools, memory, and skills. It contributes a sizable benchmark (1,896 instances), evaluates across multiple leading open/closed models, and targets real-world safety outcomes, making it timely and broadly applicable to security, safety, and agent design. Paper 2 is innovative for multimodal alignment, but its impact is narrower and more contingent on adoption within RLHF pipelines.
Paper 2 introduces a novel and practically significant threat model ('Sleeper Attack') for LLM agents that formalizes cross-interaction persistence of adversarial content—a largely unexplored attack surface. This has broad implications for AI safety and security as LLM agents become widely deployed. The comprehensive benchmark (1,896 instances, multiple attack strategies, seven LLMs) demonstrates methodological rigor. Paper 1, while solid, represents an incremental improvement in self-evolving LLMs using confidence signals. Paper 2's novelty in identifying a new class of vulnerabilities and its timeliness given rapid LLM agent adoption give it higher potential impact.
Paper 1 introduces a novel and important security threat model ('Sleeper Attack') for LLM agents that reveals vulnerabilities persisting across interactions—a largely unexplored attack surface. Its comprehensive benchmark (1,896 instances, 7 LLMs, multiple attack strategies) and formalization of cross-interaction adversarial persistence address a critical and timely safety concern as LLM agents are widely deployed. Paper 2, while technically solid, offers an incremental optimization framework for agent skills with narrower scope. The security implications of Paper 1 have broader impact across the AI safety community and are more likely to influence future research and deployment practices.
Paper 2 addresses a highly timely and critical issue in AI safety: vulnerabilities in LLM agents. While Paper 1 offers a solid methodological improvement for encoder-based masked language modeling, Paper 2's focus on a novel, persistent 'Sleeper Attack' across multi-turn interactions has broader implications for the real-world deployment and security of modern LLM agents. The introduction of a new benchmark and the exploration of a previously understudied attack vector give Paper 2 a higher potential for widespread impact and future citations in the rapidly growing field of AI safety.
SAGE addresses a fundamental infrastructure challenge (long-term memory for language agents) with a novel self-evolving graph memory framework combining theoretical analysis and strong empirical results across multiple benchmarks. Its contributions—dynamic graph memory, reader-writer feedback loops, and graph foundation model integration—have broader applicability across many agent systems. While Paper 1 identifies an important security threat (sleeper attacks on LLM agents), it is more narrowly focused on a specific attack vector. SAGE's architectural innovation is likely to influence more downstream research in agent memory, RAG systems, and knowledge graphs.
Paper 2 offers a profound theoretical contribution to a fundamental debate in AI: whether LLMs build internal world models. By establishing a universal 'L3 reasoning cliff' in spatial reasoning across languages and scales, and validating it against human baselines, it reveals inherent limitations in text-only working memory. While Paper 1 identifies a critical security vulnerability in stateful agents, Paper 2's rigorous methodological hierarchy and implications for future architectural designs give it broader foundational impact across AI, cognitive science, and NLP.
Paper 2 likely has higher impact due to its timely, high-stakes security framing for LLM agents, introducing a broadly applicable threat model (persistent, dormant “sleeper” injections) that spans session context, memory, and skills. The benchmark (1,896 instances) across multiple real-world harmful outcomes and evaluation on seven models increases methodological strength and reproducibility, and the findings directly inform mitigation, policy, and agent design across many domains. Paper 1 is innovative for agent memory, but its impact is more specialized and less urgent than systemic safety vulnerabilities.
Paper 1 introduces a frontier-tier foundation model featuring novel agent-native RL and self-evolution capabilities. Foundation model papers that demonstrate architectural efficiency (MoE) and new training paradigms typically achieve massive adoption, set new industry baselines, and drive broader scientific impact across the AI community compared to domain-specific security benchmarks, despite the strong novelty of Paper 2.
Paper 2 addresses a novel and critical security vulnerability in LLM agents—cross-interaction sleeper attacks that persist in agent state and activate later. This formalizes a new threat model with broader implications for AI safety, trust, and deployment across many domains. Its comprehensive benchmark (1,896 instances, multiple attack strategies, multiple LLMs) and the timeliness of LLM agent security make it highly impactful. Paper 1, while solid in improving algorithmic testing, addresses a narrower software engineering problem with more incremental contributions.
Paper 2 likely has higher impact: it introduces a timely, broadly relevant security threat model (persistent “Sleeper Attacks” on agent state) with clear real-world implications for deployed LLM agents. It formalizes the attack, provides a sizable benchmark (1,896 instances) spanning multiple harms, strategies, and state targets, and validates across seven strong models—supporting methodological rigor and wide applicability across AI safety, security, HCI, and agent systems. Paper 1 is useful and user-centric, but text detection is a narrower, more volatile area with weaker long-term robustness guarantees.
Paper 1 introduces a novel and concerning security threat ('Sleeper Attack') for LLM agents that persists across interactions—a fundamentally new attack paradigm with broad implications for AI safety. It formalizes the threat model, provides a comprehensive benchmark, and demonstrates vulnerability across seven major LLMs. Given the rapid deployment of LLM agents in real-world systems, this work has urgent, cross-cutting impact on AI security. Paper 2, while valuable for CAD/engineering evaluation, addresses a narrower application domain with less transformative implications for the broader AI research community.
Paper 1 addresses a critical and timely security vulnerability in LLM agents—sleeper attacks that persist across interactions—which is highly relevant given the rapid deployment of LLM agents in real-world applications. It introduces a novel threat formalization, a comprehensive benchmark, and evaluates across multiple LLMs. The breadth of impact is significant as it affects AI safety, security, and trustworthiness across many domains. Paper 2, while technically solid, addresses a narrower problem in computational advertising with more incremental contributions over existing auto-bidding methods.
Paper 1 introduces a novel and fundamental vulnerability ('Sleeper Attack') in LLM agents, addressing the critical area of AI safety and security. By formalizing a new threat model that exploits persistent agent states, it is likely to spur extensive follow-up research on defensive mechanisms across the broader AI community. Paper 2 is highly practical but more focused on a specific application domain, making Paper 1's conceptual innovation more impactful.