Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

#105 of 2292 · Artificial Intelligence
Tournament Score: 1541±43 · Win Rate: 80% (20 wins, 5 losses, 25 matches)
Rating: 7.3/10

Abstract

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes temporal memory contamination — a non-adversarial failure mode where memory-equipped LLM agents become progressively less safe as benign memory accumulates across independent tasks over time. The key insight is that memory safety should be treated as a *longitudinal* property rather than a single-state snapshot property. The authors introduce a trigger-probe protocol that isolates memory accumulation effects from stream non-stationarity by evaluating a fixed probe set against read-only memory snapshots at varying prefix lengths, using a NullMemory counterfactual baseline to attribute violations specifically to memory.
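
As a rough illustration of the protocol summarized above, the Python sketch below computes a memory-induced violation rate at several prefix lengths. Everything named here is an assumption made for illustration: the `agent` and `judge` callables, the representation of a memory snapshot as a tuple of raw stream items, and the use of all probes as the denominator. The paper's memory architectures and scoring rules may differ.

```python
from typing import Callable, Dict, Sequence, Tuple

Memory = Tuple[str, ...]              # hypothetical: a snapshot is an immutable tuple
Agent = Callable[[str, Memory], str]  # hypothetical: (probe, memory) -> response
Judge = Callable[[str, str], bool]    # hypothetical: (probe, response) -> violation?

NULL_MEMORY: Memory = ()              # counterfactual baseline: no memory at all


def memory_induced_rate(probes: Sequence[str], snapshot: Memory,
                        agent: Agent, judge: Judge) -> float:
    """Fraction of probes that violate with memory but not without it."""
    induced = 0
    for probe in probes:
        with_memory = judge(probe, agent(probe, snapshot))
        without_memory = judge(probe, agent(probe, NULL_MEMORY))
        # Probes that violate in both runs are not attributed to memory.
        if with_memory and not without_memory:
            induced += 1
    return induced / len(probes)


def exposure_curve(stream: Sequence[str], probes: Sequence[str],
                   prefix_lengths: Sequence[int],
                   agent: Agent, judge: Judge) -> Dict[int, float]:
    """Evaluate the same fixed probe set against read-only prefix snapshots."""
    return {t: memory_induced_rate(probes, tuple(stream[:t]), agent, judge)
            for t in prefix_lengths}


# Toy stand-ins so the sketch runs end to end (not the paper's agents).
def toy_agent(probe: str, memory: Memory) -> str:
    if probe == "always bad":
        return "LEAK"
    if "summary" in probe and any("SSN" in item for item in memory):
        return "LEAK"
    return "ok"


def toy_judge(probe: str, response: str) -> bool:
    return response == "LEAK"


stream = ["note A", "record with SSN", "note B"]
probes = ["write summary", "always bad", "say hi"]
curve = exposure_curve(stream, probes, [0, 1, 2, 3], toy_agent, toy_judge)
print(curve)
```

Because the probe set is fixed and each snapshot is never written to during evaluation, any change in the curve can only come from the memory prefix, which is the point of the design.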

This reframing is conceptually significant. Prior work on memory-related agent safety focused on adversarial settings (prompt injection, memory poisoning), whereas this paper demonstrates that *routine, benign* operation alone can produce monotonically increasing violation rates — a qualitatively different threat model with arguably greater practical relevance since it requires no attacker.

Methodological Rigor

The experimental design is thoughtful and addresses confounds systematically:

1. NullMemory counterfactual: The paired-run protocol (memory vs. no-memory, same model, same decoding parameters) provides a clean causal attribution mechanism. Cases where both runs violate are excluded, ensuring measured violations are genuinely memory-induced.

2. Stream non-stationarity control: The paper explicitly motivates the protocol by demonstrating (Figure 1) that naive temporal analysis on Enron data confounds memory effects with distribution shift. The read-only snapshot evaluation elegantly sidesteps this.

3. Order-randomization experiments: Block shuffle and full shuffle experiments disentangle content accumulation from encounter order effects, finding content is the primary driver while order is a secondary modulator (architecture-dependent).

4. Judge validation: 855 human-annotated evaluation pairs show ~91% pooled agreement, and a detailed failure-mode analysis indicates that judge errors are predominantly false positives, which inflate rates uniformly and so cannot by themselves create spurious temporal trends.
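
The order-randomization control in item 3 can be sketched as follows. The review does not define the two shuffles precisely, so this is one plausible reading, stated as an assumption: a full shuffle permutes individual stream items, while a block shuffle permutes contiguous blocks and keeps the order inside each block. The function names and the `block_size` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def full_shuffle(stream: Sequence[str], seed: int) -> List[str]:
    """Permute individual items; content is unchanged, order is destroyed."""
    rng = random.Random(seed)
    shuffled = list(stream)
    rng.shuffle(shuffled)
    return shuffled


def block_shuffle(stream: Sequence[str], block_size: int, seed: int) -> List[str]:
    """Permute contiguous blocks; local order inside each block is preserved."""
    rng = random.Random(seed)
    blocks = [list(stream[i:i + block_size])
              for i in range(0, len(stream), block_size)]
    rng.shuffle(blocks)
    return [item for block in blocks for item in block]


stream = [f"task-{i}" for i in range(10)]
print(block_shuffle(stream, block_size=3, seed=0))
print(full_shuffle(stream, seed=0))
```

Both variants keep the multiset of accumulated content identical to the original stream, so a violation curve that survives shuffling points to content rather than encounter order.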

However, there are notable methodological concerns:

  • The synthetic datasets (Medical Practice, University Registrar) are generated by GPT-4 Turbo and Claude Sonnet 4, respectively. While this avoids pretraining contamination, it may not capture the full complexity of real-world interactions. The Enron evaluation partially addresses this but at lower absolute violation rates.
  • The probe set construction for T_hard (sampling inputs that produced violations in a pilot run) creates a selection bias. The authors acknowledge this explicitly, but it means absolute rates are not deployment-representative.
  • The judge's precision is notably weak (0.50 for U_a), meaning roughly half of flagged memory-induced violations may be false positives. While the authors argue this shifts levels uniformly rather than creating trends, this is a meaningful limitation for the absolute magnitude of reported effects.

Potential Impact

Immediate practical implications: The finding that memory-induced violation rates increase monotonically with exposure length across 8 architectures and 2 agent classes has direct implications for deployment of persistent LLM agents (customer service bots, developer assistants, medical office systems). Organizations deploying such systems need temporal safety evaluation protocols.

Memory architecture design: The identification of retrieval scope and summarization level as the primary design factors associated with amplification provides actionable guidance. Broad semantic retrieval combined with aggressive summarization amplifies risk most; recency-biased retrieval is naturally protective. This creates an explicit utility-safety tradeoff framework.

Retrieval-time monitoring: The event decomposition insight, that memory-induced risk is structurally visible at retrieval time before generation, enables a practical mitigation strategy. The high-recall diagnostic monitor (0.970/0.984 recall) could be deployed as a safety filter without requiring response generation.
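
One way such a pre-generation filter could be wired in is sketched below. This is not the paper's monitor: the `retrieve`, `monitor`, and `generate` callables are hypothetical, and the policy of falling back to an empty context when the monitor fires is an assumption chosen only to make the control flow concrete.

```python
from typing import Callable, List, Sequence, Tuple

Retrieve = Callable[[str], List[str]]        # hypothetical: probe -> retrieved items
Monitor = Callable[[str, List[str]], bool]   # hypothetical: True means "risky"
Generate = Callable[[str, List[str]], str]   # hypothetical: (probe, context) -> response


def gated_answer(probe: str, retrieve: Retrieve, monitor: Monitor,
                 generate: Generate) -> Tuple[str, bool]:
    """Inspect the retrieval state first; only then decide what to generate with."""
    retrieved = retrieve(probe)
    if monitor(probe, retrieved):
        # Assumed fallback: answer without memory when the monitor flags risk.
        return generate(probe, []), True
    return generate(probe, retrieved), False


def monitor_recall(flags: Sequence[bool], labels: Sequence[bool]) -> float:
    """Share of truly memory-induced violations that the monitor flagged."""
    caught = [flag for flag, label in zip(flags, labels) if label]
    if not caught:
        raise ValueError("no positive labels to compute recall on")
    return sum(caught) / len(caught)


# Toy stand-ins so the sketch runs.
memory = ["patient Lee: diagnosis X", "parking memo"]
response, flagged = gated_answer(
    "draft parking reminder",
    retrieve=lambda probe: list(memory),
    monitor=lambda probe, items: any("diagnosis" in item for item in items),
    generate=lambda probe, context: f"{probe} | context={len(context)}",
)
print(response, flagged)
```

A high-recall monitor used this way trades utility for safety: every false positive discards memory that would have been harmless, a cost the review notes is not yet evaluated.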

Broader influence: This work could catalyze a shift in how the community evaluates memory-equipped agents, from snapshot evaluations to trajectory-based assessment. The protocol is general enough to apply beyond the specific scenarios studied.

Timeliness & Relevance

This paper is exceptionally timely. Memory-equipped LLM agents (Mem0, MemGPT, OpenClaw) are being deployed at scale in 2025-2026, and the safety evaluation infrastructure has not kept pace. The paper addresses a genuine blind spot: most safety benchmarks evaluate single interactions or short sessions, missing the longitudinal dimension entirely. The inclusion of Claw-like agents (developer-facing autonomous agents with file system access) is particularly relevant given the rapid adoption of such tools.

Strengths

1. Novel problem formulation: Temporal memory contamination is a genuinely new safety concern that is well-motivated and practically important.

2. Rigorous experimental design: The trigger-probe protocol with NullMemory counterfactual is a clean, reusable evaluation methodology.

3. Breadth of evaluation: 8 memory architectures × 3 deployment scenarios + Claw-like agents with 4 LLM backends × 2 platforms provides substantial evidence for generality.

4. Actionable findings: The architecture-dependent amplification patterns and the retrieval-time monitor provide concrete guidance for practitioners.

5. Transparent limitations: The paper is candid about judge precision issues and the trigger-conditional nature of reported rates.

Limitations

1. Scale of probe sets: 40 probes per checkpoint and 20 probes for Claw-like agents are relatively small samples, potentially limiting statistical power for detecting subtle effects.

2. Limited model diversity: The base models for office-assistant scenarios are not extensively varied; the Claw-like experiments use 4 backends but with limited probes.

3. Mitigation evaluation is thin: The retrieval-time monitor is demonstrated but its deployment impact (e.g., false positive costs, utility degradation from filtering) is not evaluated.

4. No formal statistical tests: The paper reports trends and shaded bands but does not provide formal hypothesis tests for the monotonic increase claim.

5. Synthetic data dominance: Two of three office-assistant scenarios are synthetic, and real-world deployment conditions may differ substantially.
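
Limitations 1 and 4 can be made concrete with a short sketch. The first function computes a Wilson score interval to show how wide the uncertainty on a violation rate is at 20 or 40 probes; the second runs an exact one-sided permutation test on Kendall's S statistic as one possible trend test. The violation counts and checkpoint rates below are made-up illustrative numbers, not results from the paper, and the permutation test treats checkpoint rates as exchangeable under the null while ignoring per-checkpoint sampling noise.

```python
import math
from itertools import permutations
from typing import Sequence, Tuple


def wilson_interval(violations: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = violations / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def kendall_s(values: Sequence[float]) -> int:
    """Mann-Kendall S: concordant minus discordant pairs in time order."""
    return sum((values[j] > values[i]) - (values[j] < values[i])
               for i in range(len(values))
               for j in range(i + 1, len(values)))


def trend_p_value(values: Sequence[float]) -> float:
    """Exact one-sided p-value for an upward trend (feasible for few checkpoints)."""
    observed = kendall_s(values)
    perms = list(permutations(values))
    return sum(kendall_s(p) >= observed for p in perms) / len(perms)


# Illustrative only: the same 20% observed rate at two probe-set sizes.
print(wilson_interval(8, 40))
print(wilson_interval(4, 20))

# Illustrative only: five strictly increasing checkpoint rates give p = 1/120.
print(trend_p_value([0.05, 0.10, 0.15, 0.20, 0.30]))
```

The interval at 20 probes is visibly wider than at 40, which is the statistical-power concern in item 1; a test like the second function is the kind of evidence item 4 asks for.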

Additional Observations

The paper's event decomposition (precondition → trigger → violation) is a clean abstraction that could generalize beyond memory safety to other stateful agent failure modes. The extensive appendix with violation examples (spanning ~20 pages) provides valuable qualitative evidence and reproducibility support.

The F1 drop from Medical (0.692) to Registrar (0.573) for the event-structure monitor suggests domain sensitivity that warrants further investigation before deployment.
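
To see what these F1 values imply when recall is high, the sketch below inverts the F1 formula, F1 = 2PR / (P + R), to recover precision as P = F1·R / (2R − F1). The pairing of recall values with domains (0.970 with Medical, 0.984 with Registrar) is an assumption based only on the order in which the review lists them; it is not confirmed by the source.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("F1 cannot be at least twice the recall")
    return f1 * recall / denom


# Assumed pairing of reported figures (see the note above the code).
medical = implied_precision(0.692, 0.970)
registrar = implied_precision(0.573, 0.984)
print(round(medical, 3), round(registrar, 3))
```

If the assumed pairing holds, the implied precision is roughly 0.54 for Medical and 0.40 for Registrar, so the F1 drop would be driven almost entirely by precision rather than recall.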

Rating: 7.3/10
Significance 8 · Rigor 7 · Novelty 8.5 · Clarity 7.5

Generated May 19, 2026

    Comparison History (25)

    vs. Echo: Learning from Experience Data via User-Driven Refinement
    gemini-3.1 · 5/22/2026

    Paper 1 identifies a novel, fundamental failure mode in agentic AI (temporal memory contamination) and introduces a rigorous evaluation protocol for longitudinal safety. As long-term memory becomes standard in LLMs, establishing how accumulated context degrades safety will have profound implications across AI alignment, safety evaluations, and architecture design, offering broader scientific impact than Paper 2's application-focused data pipeline.

    vs. Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a novel architectural framework (SR²AM) decomposing agentic reasoning into three systems with self-regulated planning, demonstrating dramatic efficiency gains (25-95% fewer tokens) while maintaining competitive accuracy against much larger models. This addresses a fundamental challenge in LLM agent design with broad applicability. Paper 2 identifies an important but narrower safety concern (temporal memory contamination) in memory-equipped agents. While valuable, it is primarily diagnostic rather than offering new capabilities. Paper 1's contributions to efficient reasoning architecture and the self-regulation principle have broader potential to influence agent design across the field.

    vs. Echo: Learning from Experience Data via User-Driven Refinement
    gemini-3.1 · 5/22/2026

    Paper 1 offers higher scientific impact by identifying a novel, fundamental vulnerability in AI agents: temporal memory contamination. While Paper 2 provides a highly practical, production-validated framework for continuous learning from user feedback, it builds on established paradigms of interaction-based alignment. Paper 1 pioneers a new longitudinal evaluation paradigm for AI safety, demonstrating that risks compound over time across unrelated tasks. Its rigorous trigger-probe protocol and early detection mechanism provide foundational tools for future research in secure, long-horizon autonomous agents, making its conceptual contributions more broadly impactful.

    vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure
    gemini-3.1 · 5/19/2026

    Paper 2 presents a highly counter-intuitive 'capability paradox' where upgrading to smarter models degrades system security, a finding likely to spark significant discourse. Its identification of 'semantic hijacking' and the robust mediation analysis explaining the mechanism offer profound insights into multi-agent system vulnerabilities. Furthermore, it proposes a novel defense mechanism with striking empirical success. While Paper 1 addresses an important temporal issue, Paper 2's unexpected findings and actionable architectural solutions give it a broader and more disruptive potential impact across AI safety and multi-agent systems research.

    vs. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning
    claude-opus-4.6 · 5/19/2026

    Paper 1 introduces a novel, generalizable framework for understanding LLM planning by extracting search trees from reasoning traces, revealing a fundamental dissociation between LLM and human planning (myopic vs. deep search). This provides deep mechanistic insight into how reasoning models actually work, with broad implications for improving LLM reasoning capabilities. Paper 2 identifies an important but more incremental safety concern (temporal memory contamination) in memory-equipped agents. While practically relevant, Paper 1's methodological innovation and fundamental insights into LLM cognition have broader scientific impact across AI, cognitive science, and alignment research.

    vs. The Wittgensteinian Representation Hypothesis: Is Language the Attractor of Multimodal Convergence?
    gpt-5.2 · 5/19/2026

    Paper 1 is likely to have higher impact due to timeliness and direct real-world relevance: memory-equipped LLM agents are rapidly deploying, and longitudinal safety failures are a practical, under-evaluated risk. It offers a concrete evaluation protocol (trigger-probe, NullMemory baseline), tests across multiple scenarios and memory architectures, and provides an actionable diagnostic insight (risk detectable pre-generation). Paper 2 is conceptually novel and cross-modal, but its central hypothesis may be harder to validate broadly and translate into immediate applications compared to Paper 1’s deployment-facing methodology and safety implications.

    vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics
    gemini-3.1 · 5/19/2026

    Paper 1 identifies a novel and critical safety vulnerability in long-horizon LLM agents (temporal memory contamination). Given the rapid adoption of memory-equipped autonomous agents in real-world applications, exposing and mitigating longitudinal safety risks has profound and broad implications for AI safety. While Paper 2 offers a valuable methodology for scientific reasoning in physics, Paper 1 addresses an overarching, urgent systemic risk that affects the broader deployment of LLM agents across multiple domains.

    vs. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact: it identifies a deployment-realistic, under-evaluated safety failure mode (temporal memory contamination) and proposes a general evaluation methodology (trigger-probe + NullMemory) applicable across memory architectures and agent platforms. Its implications span safety, agent design, evaluation standards, and monitoring, making it broadly relevant and timely given rapid adoption of memory-equipped agents. Paper 1 is innovative and shows strong benchmark gains, but is narrower (RLVR post-training via population self-play) and more incremental within existing LLM training paradigms, with less immediate cross-field impact.

    vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
    gpt-5.2 · 5/19/2026

    Paper 1 targets a high-stakes, under-evaluated regime: longitudinal safety in memory-equipped LLM agents, introducing a clear new failure mode (temporal memory contamination) and an evaluation methodology (trigger-probe + NullMemory counterfactual) that can become a standard for deployed agents. Its findings generalize across scenarios, memory architectures, and agent platforms, and it yields actionable monitoring hooks (pre-generation retrieval-state diagnostics). This is timely as memory/personalization is rapidly deployed, and the work impacts safety, evaluation science, and real-world agent deployments. Paper 2 is strong but narrower to factuality metrics and internal steering.

    vs. When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a highly timely and critical issue: longitudinal safety risks in memory-equipped LLMs. As LLM agents with long-term memory become ubiquitous, understanding 'temporal memory contamination' will have broad impact across AI safety and deployment. Paper 2, while methodologically sound, focuses on RL in specific economic environments (e.g., hotel pricing), giving it a narrower scope and likely fewer immediate real-world applications across diverse fields compared to the universal relevance of LLM safety.

    vs. LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning
    gpt-5.2 · 5/19/2026

    Paper 1 has higher potential impact: it identifies a deployment-realistic, under-evaluated safety failure mode (temporal memory contamination) and introduces a concrete longitudinal evaluation methodology (trigger-probe + NullMemory) validated across multiple scenarios, memory architectures, and agent platforms. Its findings are timely and broadly relevant to LLM agent safety, auditing, and monitoring, with clear real-world implications for any memory-enabled assistant. Paper 2 is promising for MARL performance, but appears narrower in application scope and depends on LLM-guided protocol design that may be less methodologically general than Paper 1’s evaluation framework.

    vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning
    gemini-3.1 · 5/19/2026

    Paper 2 addresses a highly timely and critical issue: the longitudinal safety of memory-equipped LLM agents. Given the rapid deployment of autonomous agents, understanding how accumulated memory introduces temporal contamination has immense real-world implications and broad relevance across AI safety, alignment, and systems. While Paper 1 offers strong methodological advancements in classical planning, Paper 2's focus on LLM safety vulnerabilities guarantees a broader, more immediate scientific and societal impact.

    vs. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact due to a clearer conceptual novelty (temporal memory contamination) and a general evaluation methodology (trigger-probe, NullMemory counterfactual) applicable across many memory-equipped agent designs and deployment settings. Its findings address a timely, high-stakes problem in LLM deployment (longitudinal safety), with potential influence on benchmarks, monitoring, and standards across AI safety, HCI, and ML systems. Paper 1 is strong industrially and methodologically practical, but is more application-specific (POI IVR) and less broadly generalizable as a scientific contribution.

    vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
    claude-opus-4.6 · 5/19/2026

    Paper 1 identifies a novel and fundamental safety concern—temporal memory contamination—in memory-equipped LLM agents, introducing a rigorous evaluation protocol and demonstrating consistent risks across multiple architectures. This addresses a critical gap in AI safety that will grow increasingly important as persistent-memory agents become widespread. Its breadth of impact spans AI safety, alignment, and deployment policy. Paper 2, while practically useful in democratizing OR re-optimization, represents a more incremental application of LLMs to an established domain with narrower impact scope.

    vs. Prediction of Challenging Behaviors Associated with Profound Autism in a Classroom Setting Using Wearable Sensors
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a critical and emerging vulnerability in LLM agents by identifying a novel failure mode (temporal memory contamination) and introducing a new evaluation protocol. Given the rapid, widespread deployment of memory-equipped AI agents across numerous domains, this foundational safety research is likely to have a broader and more immediate scientific impact across the AI community compared to the specific, albeit highly valuable, application of existing wearable ML techniques to autism in Paper 2.

    vs. Reinforcing VLAs in Task-Agnostic World Models
    claude-opus-4.6 · 5/19/2026

    Paper 2 (RAW-Dream) introduces a novel paradigm for task-agnostic world model learning that decouples world/reward models from downstream tasks, enabling zero-shot VLA adaptation. This has broader impact across robotics and embodied AI, with demonstrated real-world applicability and strong scalability implications. While Paper 1 identifies an important longitudinal safety concern for memory-equipped LLM agents, it is primarily a diagnostic/evaluation contribution. Paper 2's methodological innovation—task-agnostic world models with dual-noise verification—addresses a fundamental scalability bottleneck and offers a more transformative contribution to the rapidly growing VLA/robotics field.

    vs. Data Language Models: A New Foundation Model Class for Tabular Data
    claude-opus-4.6 · 5/19/2026

    Paper 2 introduces a fundamentally new foundation model class (Data Language Models) for tabular data, which is arguably the most widely used data modality in industry yet lacks native foundation models. This addresses a significant gap in the AI stack with broad applications across virtually every domain that uses structured data. While Paper 1 makes a solid contribution by identifying longitudinal safety risks in memory-equipped LLM agents—an important and timely concern—its scope is narrower, focused on a specific failure mode. Paper 2's potential to reshape how tabular data is consumed across the entire AI ecosystem gives it broader impact potential.

    vs. TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
    claude-opus-4.6 · 5/19/2026

    Paper 1 introduces a novel concept—temporal memory contamination—that identifies a previously understudied longitudinal safety risk in memory-equipped LLM agents. It proposes a rigorous evaluation protocol (trigger-probe with NullMemory baseline), demonstrates the phenomenon across multiple architectures, and offers a practical detection mechanism. This addresses a fundamental and increasingly critical safety concern as persistent-memory agents become widespread. Paper 2, while valuable as a benchmark contribution, is more incremental—adding another multi-modal tool-use benchmark to an already crowded space. Paper 1's conceptual novelty and safety implications give it broader and more lasting impact.

    vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher impact: it introduces a timely, broadly applicable evaluation paradigm for longitudinal safety in memory-equipped LLM agents (a rapidly emerging deployment setting). The trigger-probe protocol, NullMemory counterfactual, cross-architecture experiments, and diagnostic monitoring suggest strong methodological rigor and immediate real-world relevance for safety governance across many domains. Paper 1 is innovative and includes wet-lab results, but its impact is narrower (optimization workflows in AI-for-science) and may depend on how reliably LLM “preferences” generalize and remain controllable across tasks and objectives.

    vs. From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning
    gpt-5.2 · 5/19/2026

    Paper 1 has higher likely impact: it introduces a novel, deployment-relevant failure mode (temporal memory contamination) and a clear evaluation protocol (trigger-probe + NullMemory counterfactual) applicable across memory architectures and agent platforms. The work is timely for real-world LLM agent deployment, has immediate safety/governance applications, and offers a methodological contribution (longitudinal evaluation + retrieval-state diagnostics) that can generalize across domains. Paper 2 is a solid empirical study in a niche game; its novelty and cross-field breadth are more limited.