Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu, Dongrui Liu, Wenjie Wang, Fuli Feng

#500 of 2682 · Artificial Intelligence
Share
Tournament Score
1480±48
10501800
76%
Win Rate
13
Wins
4
Losses
17
Matches
Rating
7/ 10
Significance
Rigor
Novelty
Clarity

Abstract

Large Language Model (LLM) agents remain vulnerable to safety threats from the external environment, where attackers inject adversarial content into external observations such as tool-returned data, webpages, or MCP context, causing harmful agentic behaviors such as unsafe actions or incorrect outputs. Existing studies typically focus on single-interaction attacks, where the agent observes adversarial content and immediately exhibits harmful behavior within one user request. However, we show that adversarial content can also persist across interactions served by the same agent, making such threats harder to detect and mitigate. Specifically, adversarial content may persist in the agent state, remain dormant across interactions, and later be activated by a benign user query. We formalize this type of safety threat as Sleeper Attack. To evaluate it, we construct a benchmark with 1,896 instances covering six real-world harmful outcomes, three attack strategies, and three agent state targets: session context, memory, and reusable skills. Experiments on seven strong open-source and closed-source LLMs show that state-of-the-art LLM agents remain vulnerable to Sleeper Attack, even when they achieve low attack success rates under a single-interaction baseline. Our code and data are available at https://anonymous.4open.science/r/skdvnfu23ihr9wdscnksf1asdffsaef.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: "Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents"

1. Core Contribution

This paper formalizes Sleeper Attack, a cross-interaction threat model for LLM agents in which adversarial content injected via external observations (tool returns, webpages, MCP context) persists in agent state across interactions and is later activated by benign user queries. The key insight is that existing agent safety evaluations focus almost exclusively on single-interaction attacks — where injection and exploitation happen within one user request — while modern agents maintain persistent state (session context, memory, reusable skills) that creates temporal attack surfaces.

The paper contributes three attack strategies: Latent Instruction Planting (LIP), where dormant malicious instructions are stored and later triggered; Proactive Information Elicitation (PIE), where underspecified planted instructions force the agent to solicit missing sensitive parameters from users; and Persistent Information Corruption (PIC), where stored facts are poisoned to corrupt future agent outputs. These strategies are evaluated across three agent state targets (session, memory, skill), yielding a systematic taxonomy.

2. Methodological Rigor

The benchmark construction is methodologically thorough. The 1,896 instances span six harm domains grounded in existing taxonomies, and the evaluation uses a rule-based protocol rather than LLM judges, which the authors correctly identify as insufficiently reliable for trajectory-level safety evaluation. The structured evaluation employs ordered trace matching and argument-value matching with a three-model quality-control pipeline (proposer, critic, arbiter) for eval_config validation — a sensible design choice that improves reproducibility.

The experimental design includes several well-motivated controls: a direct single-interaction baseline enables direct-versus-sleeper comparison, fresh-session replay tests whether hazards survive session boundaries, and longer-horizon sweeps (up to 20 interactions) test temporal persistence. The evaluation of seven models spanning open and closed-source families (Gemini, DeepSeek, Qwen, GPT, Llama) provides reasonable coverage.

However, there are notable methodological limitations. The simulated environment (ToolEmu-derived) with DeepSeek-v3.2 generating tool outputs introduces an abstraction gap — real deployments have authentication, rate limits, and confirmation dialogs that would materially affect attack feasibility. The ask-user simulator is cooperative by design, which inflates PIE success rates relative to realistic user behavior. The template optimization procedure (Appendix A.10) uses iterative prompt engineering to maximize ASR, which, while transparent, means the reported rates reflect optimized attack templates rather than naive adversarial attempts.

3. Potential Impact

This work addresses a genuine deployment concern. As LLM agents are increasingly deployed with persistent memory (e.g., ChatGPT memory, Claude's artifacts), MCP integrations, and skill libraries, the cross-interaction attack surface is expanding rapidly. The paper's demonstration that agents with low single-interaction ASR can have dramatically higher sleeper ASR (e.g., Gemini-3.1-Pro: 6.2% direct → 92.6% on skill for LIP) is a practically important finding that should influence both agent design and safety evaluation practices.

The benchmark itself could serve as a standard evaluation tool, though its reliance on a simulated environment limits direct applicability to production systems. The finding that lightweight defenses (rule-based instructions, LlamaGuard filtering) provide only partial mitigation is valuable for practitioners, though the defense evaluation is relatively shallow — only two defenses on one model.

The work's influence could extend to: (1) agent framework design, motivating stricter state isolation and provenance tracking; (2) safety evaluation standards, pushing for multi-interaction assessment; (3) MCP and tool ecosystem security, given the paper's relevance to emerging agentic infrastructure.

4. Timeliness & Relevance

The paper is highly timely. The rapid deployment of persistent-state LLM agents (memory-equipped assistants, MCP-connected systems, coding agents with skill libraries) creates exactly the attack surfaces this paper studies. The references include 2025-2026 papers on MCP poisoning, skill injection, and memory attacks, positioning this work at the intersection of several active research threads. The unification of session, memory, and skill attack surfaces into a single framework is a timely contribution given the fragmented nature of prior work.

5. Strengths & Limitations

Key Strengths:

  • Novel formalization: The sleeper attack threat model is well-defined with clear mathematical notation, distinguishing planting, persistence, and triggering phases.
  • Comprehensive taxonomy: The two-axis design (attack strategy × agent state target) provides systematic coverage that individual memory-only or skill-only papers lack.
  • Robust evaluation protocol: Rule-based structured evaluation avoids LLM judge unreliability; the three-model QC pipeline adds rigor.
  • Revealing empirical findings: The large direct-to-sleeper safety gaps (e.g., PIE: 0.6% → 41.6%) demonstrate that current safety alignment is insufficient for cross-interaction threats.
  • Well-designed ablations: Fresh-session replay, conditional triggers, longer horizons, scaling experiments, and defense evaluations provide multi-faceted analysis.
  • Notable Limitations:

  • Simulated environment: All experiments run in ToolEmu, not against real services. The authors acknowledge this but it limits ecological validity.
  • Cooperative user simulator: The PIE strategy's ASR is inflated by a user that almost always provides requested information.
  • Template optimization: Iterative optimization of attack templates means results represent a capable attacker, potentially overstating risk for naive adversaries.
  • Limited defense evaluation: Only two lightweight defenses on one model; no evaluation of architectural mitigations (state provenance tracking, sandboxed state).
  • Inconsistent model coverage: Supplementary experiments use narrow model slices, making cross-experiment comparison difficult.
  • No analysis of false positives: The benchmark evaluates attack success but not whether defenses would degrade normal agent utility.
  • Additional Observations

    The paper effectively argues that single-interaction safety evaluation is insufficient, but the practical exploitability of these attacks depends heavily on deployment-specific factors (state persistence policies, user confirmation requirements) that the benchmark cannot capture. The open-model scaling results showing that larger Qwen3 models have *higher* sleeper ASR (especially PIC: 30.2% at 4B → 49.2% at 32B) is a particularly interesting finding suggesting that improved capability may increase vulnerability to state-based attacks.

    Rating:7/ 10
    Significance 7.5Rigor 6.5Novelty 7Clarity 7.5

    Generated May 28, 2026

    Comparison History (17)

    vs. Continual Model Routing in Evolving Model Hubs
    gpt-5.25/28/2026

    Paper 2 likely has higher scientific impact due to its timely, security-critical framing (persistent, stateful “Sleeper Attacks” on LLM agents), broad relevance across agent frameworks, safety, and deployment contexts, and clear real-world implications for tool-using systems. It introduces a novel threat model beyond single-turn jailbreaks and provides a sizable benchmark spanning outcomes, strategies, and state targets, with evidence across multiple open/closed models. Paper 1 is valuable for scalable model routing and benchmarking, but its impact is narrower and more systems/ML-infra focused.

    vs. OccuReward: LLM-Guided Occupant-Centric Reward Shaping for Demographic Equity in Grid-Interactive Buildings
    gemini-3.15/28/2026

    Paper 1 identifies a novel, persistent security vulnerability in LLM agents (Sleeper Attacks), moving beyond standard single-interaction exploits. Given the rapid, widespread deployment of autonomous LLM agents across diverse industries, addressing long-term memory and state-based vulnerabilities is critical for AI safety. Paper 2 presents an interesting interdisciplinary application of LLMs for reward shaping and fairness in building energy management, but its scope is heavily restricted to HVAC and thermal comfort. Consequently, Paper 1 has a significantly broader potential impact, higher relevance to core AI development, and addresses a more urgent generalized security threat.

    vs. CaMBRAIN: Real-time, Continuous EEG Inference with Causal State Space Models
    claude-opus-4.65/28/2026

    CaMBRAIN introduces a fundamentally new architecture for EEG processing that addresses critical limitations (quadratic scaling, fixed-length inputs) with a novel causal SSM approach and custom training pipeline. It achieves SOTA across 3 datasets with 10x throughput gains, enabling real-time continuous EEG monitoring with clear clinical applications. Paper 1, while identifying an important LLM security threat (sleeper attacks), is more incremental within the adversarial AI safety space, extending known attack paradigms to multi-interaction settings. Paper 2's cross-disciplinary impact (ML + neuroscience + clinical medicine) and practical applicability give it broader potential impact.

    vs. Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations
    claude-opus-4.65/28/2026

    Paper 2 addresses a critical and timely security vulnerability in LLM agents—persistent cross-interaction sleeper attacks—which has immediate practical implications for AI safety as LLM agents are increasingly deployed. It introduces a novel threat formalization, a comprehensive benchmark (1,896 instances), and evaluates across seven LLMs, providing actionable insights for the safety community. Paper 1, while intellectually interesting in studying perceptual geometry in LLMs, is more observational and narrower in its impact scope, primarily contributing to interpretability research without clear downstream applications.

    vs. Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
    gpt-5.25/28/2026

    Paper 1 likely has higher impact because it identifies and formalizes a new, practically critical threat model for LLM agents—persistent, cross-interaction “sleeper” attacks—directly relevant to deploying agents with tools, memory, and skills. It contributes a sizable benchmark (1,896 instances), evaluates across multiple leading open/closed models, and targets real-world safety outcomes, making it timely and broadly applicable to security, safety, and agent design. Paper 2 is innovative for multimodal alignment, but its impact is narrower and more contingent on adoption within RLHF pipelines.

    vs. Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback
    claude-opus-4.65/28/2026

    Paper 2 introduces a novel and practically significant threat model ('Sleeper Attack') for LLM agents that formalizes cross-interaction persistence of adversarial content—a largely unexplored attack surface. This has broad implications for AI safety and security as LLM agents become widely deployed. The comprehensive benchmark (1,896 instances, multiple attack strategies, seven LLMs) demonstrates methodological rigor. Paper 1, while solid, represents an incremental improvement in self-evolving LLMs using confidence signals. Paper 2's novelty in identifying a new class of vulnerabilities and its timeliness given rapid LLM agent adoption give it higher potential impact.

    vs. SkillGrad: Optimizing Agent Skills Like Gradient Descent
    claude-opus-4.65/28/2026

    Paper 1 introduces a novel and important security threat model ('Sleeper Attack') for LLM agents that reveals vulnerabilities persisting across interactions—a largely unexplored attack surface. Its comprehensive benchmark (1,896 instances, 7 LLMs, multiple attack strategies) and formalization of cross-interaction adversarial persistence address a critical and timely safety concern as LLM agents are widely deployed. Paper 2, while technically solid, offers an incremental optimization framework for agent skills with narrower scope. The security implications of Paper 1 have broader impact across the AI safety community and are more likely to influence future research and deployment practices.

    vs. Entropy-aware Masking for Masked Language Modeling
    gemini-3.15/28/2026

    Paper 2 addresses a highly timely and critical issue in AI safety: vulnerabilities in LLM agents. While Paper 1 offers a solid methodological improvement for encoder-based masked language modeling, Paper 2's focus on a novel, persistent 'Sleeper Attack' across multi-turn interactions has broader implications for the real-world deployment and security of modern LLM agents. The introduction of a new benchmark and the exploration of a previously understudied attack vector give Paper 2 a higher potential for widespread impact and future citations in the rapidly growing field of AI safety.

    vs. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory
    claude-opus-4.65/28/2026

    SAGE addresses a fundamental infrastructure challenge (long-term memory for language agents) with a novel self-evolving graph memory framework combining theoretical analysis and strong empirical results across multiple benchmarks. Its contributions—dynamic graph memory, reader-writer feedback loops, and graph foundation model integration—have broader applicability across many agent systems. While Paper 1 identifies an important security threat (sleeper attacks on LLM agents), it is more narrowly focused on a specific attack vector. SAGE's architectural innovation is likely to influence more downstream research in agent memory, RAG systems, and knowledge graphs.

    vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
    gemini-3.15/28/2026

    Paper 2 offers a profound theoretical contribution to a fundamental debate in AI: whether LLMs build internal world models. By establishing a universal 'L3 reasoning cliff' in spatial reasoning across languages and scales, and validating it against human baselines, it reveals inherent limitations in text-only working memory. While Paper 1 identifies a critical security vulnerability in stateful agents, Paper 2's rigorous methodological hierarchy and implications for future architectural designs give it broader foundational impact across AI, cognitive science, and NLP.

    vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents
    gpt-5.25/28/2026

    Paper 2 likely has higher impact due to its timely, high-stakes security framing for LLM agents, introducing a broadly applicable threat model (persistent, dormant “sleeper” injections) that spans session context, memory, and skills. The benchmark (1,896 instances) across multiple real-world harmful outcomes and evaluation on seven models increases methodological strength and reproducibility, and the findings directly inform mitigation, policy, and agent design across many domains. Paper 1 is innovative for agent memory, but its impact is more specialized and less urgent than systemic safety vulnerabilities.

    vs. The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
    gemini-3.15/28/2026

    Paper 1 introduces a frontier-tier foundation model featuring novel agent-native RL and self-evolution capabilities. Foundation model papers that demonstrate architectural efficiency (MoE) and new training paradigms typically achieve massive adoption, set new industry baselines, and drive broader scientific impact across the AI community compared to domain-specific security benchmarks, despite the strong novelty of Paper 2.

    vs. STAB: Specification-driven Testing for Algorithmic Bottlenecks
    claude-opus-4.65/28/2026

    Paper 2 addresses a novel and critical security vulnerability in LLM agents—cross-interaction sleeper attacks that persist in agent state and activate later. This formalizes a new threat model with broader implications for AI safety, trust, and deployment across many domains. Its comprehensive benchmark (1,896 instances, multiple attack strategies, multiple LLMs) and the timeliness of LLM agent security make it highly impactful. Paper 1, while solid in improving algorithmic testing, addresses a narrower software engineering problem with more incremental contributions.

    vs. Show, Don't TELL: Explainable AI-Generated Text Detection
    gpt-5.25/28/2026

    Paper 2 likely has higher impact: it introduces a timely, broadly relevant security threat model (persistent “Sleeper Attacks” on agent state) with clear real-world implications for deployed LLM agents. It formalizes the attack, provides a sizable benchmark (1,896 instances) spanning multiple harms, strategies, and state targets, and validates across seven strong models—supporting methodological rigor and wide applicability across AI safety, security, HCI, and agent systems. Paper 1 is useful and user-centric, but text detection is a narrower, more volatile area with weaker long-term robustness guarantees.

    vs. MUSE: Benchmarking Manufacturable, Functional, and Assemblable Text-to-CAD Generation
    claude-opus-4.65/28/2026

    Paper 1 introduces a novel and concerning security threat ('Sleeper Attack') for LLM agents that persists across interactions—a fundamentally new attack paradigm with broad implications for AI safety. It formalizes the threat model, provides a comprehensive benchmark, and demonstrates vulnerability across seven major LLMs. Given the rapid deployment of LLM agents in real-world systems, this work has urgent, cross-cutting impact on AI security. Paper 2, while valuable for CAD/engineering evaluation, addresses a narrower application domain with less transformative implications for the broader AI research community.

    vs. Constrained Auto-Bidding via Generative Response Modeling
    claude-opus-4.65/28/2026

    Paper 1 addresses a critical and timely security vulnerability in LLM agents—sleeper attacks that persist across interactions—which is highly relevant given the rapid deployment of LLM agents in real-world applications. It introduces a novel threat formalization, a comprehensive benchmark, and evaluates across multiple LLMs. The breadth of impact is significant as it affects AI safety, security, and trustworthiness across many domains. Paper 2, while technically solid, addresses a narrower problem in computational advertising with more incremental contributions over existing auto-bidding methods.

    vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
    gemini-3.15/28/2026

    Paper 1 introduces a novel and fundamental vulnerability ('Sleeper Attack') in LLM agents, addressing the critical area of AI safety and security. By formalizing a new threat model that exploits persistent agent states, it is likely to spur extensive follow-up research on defensive mechanisms across the broader AI community. Paper 2 is highly practical but more focused on a specific application domain, making Paper 1's conceptual innovation more impactful.