MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, Liang Lu, Feng Liu

#1274 of 2525 · Artificial Intelligence
Tournament Score: 1411±42
Win Rate: 62% (13 wins, 8 losses, 21 matches)
Rating: 4.8/10

Abstract

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MemAudit

1. Core Contribution

MemAudit introduces a post-hoc auditing framework for identifying and removing poisoned memories in memory-augmented LLM agents. The central insight is that once harmful agent behavior has been observed, the critical question shifts from preventing harm to diagnosing *which* stored memories caused it. The framework combines two signals: (1) a Counterfactual Memory Influence Score (CMIS) that measures each memory's causal contribution to harmful outputs via replay-based ablation, and (2) a Memory Consistency Graph (MCG) that detects structurally anomalous memories using semantic similarity and NLI-based contradiction scoring. These are fused into a detoxification score for ranking and removing suspicious memories.
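The fusion step can be illustrated with a minimal sketch. The review does not reproduce the paper's exact fusion formula, so the convex combination below, the min-max normalization, and the names `min_max`, `fuse_scores`, and `rank_suspicious` are assumptions made for illustration only; the default weight of 0.6 is the α value the review reports.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; an assumed normalization, not taken from the paper."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse_scores(cmis: Dict[str, float], mcg: Dict[str, float],
                alpha: float = 0.6) -> Dict[str, float]:
    """Assumed fusion: alpha * causal signal + (1 - alpha) * structural signal."""
    c, g = min_max(cmis), min_max(mcg)
    return {k: alpha * c[k] + (1 - alpha) * g[k] for k in cmis}


def rank_suspicious(fused: Dict[str, float]) -> List[str]:
    """Memory ids ordered from most to least suspicious."""
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical scores for three memories.
cmis = {"m1": 0.05, "m2": 0.80, "m3": 0.10}  # counterfactual influence
mcg = {"m1": 0.20, "m2": 0.60, "m3": 0.90}   # structural anomaly
fused = fuse_scores(cmis, mcg)
ranking = rank_suspicious(fused)
print(ranking)
```

In this toy case the memory with the strongest causal signal ranks first, and the structurally anomalous but causally weak memory ranks second, which is the kind of trade-off a single fusion weight has to settle.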

The problem formulation—post-hoc causal auditing of agent memory—is a genuinely useful framing that fills a gap between online defenses (which block at inference time) and the reality that harmful behavior is often discovered only retrospectively. This is a timely contribution given the rapid deployment of memory-augmented agents.

2. Methodological Rigor

The methodology, while intuitive, raises several concerns about rigor:

CMIS design: The counterfactual replay approach (Equation 7) is conceptually straightforward—remove one memory, re-run, measure harm reduction. However, this is essentially a leave-one-out ablation, which does not capture combinatorial effects. The authors acknowledge this limitation implicitly when discussing dense poisoning failure, but the method lacks any mechanism for handling memory interactions (e.g., Shapley-value-based attribution or grouped ablations).
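A toy example makes the leave-one-out concern concrete. The `harm` function below is a hypothetical stand-in for replaying the agent and scoring the output, not the paper's Equation 7; it assumes two redundant poisoned records, either of which is sufficient to trigger the harmful behavior.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Tuple


def harm(memories: FrozenSet[str]) -> float:
    """Hypothetical replay: harmful if either poisoned record is still present."""
    return 1.0 if {"p1", "p2"} & memories else 0.0


def leave_one_out(store: List[str],
                  harm_fn: Callable[[FrozenSet[str]], float]) -> Dict[str, float]:
    """Harm reduction observed when a single memory is removed."""
    full = frozenset(store)
    base = harm_fn(full)
    return {m: base - harm_fn(full - {m}) for m in store}


def pair_ablation(store: List[str],
                  harm_fn: Callable[[FrozenSet[str]], float]) -> Dict[Tuple[str, str], float]:
    """Harm reduction observed when two memories are removed together."""
    full = frozenset(store)
    base = harm_fn(full)
    return {pair: base - harm_fn(full - set(pair))
            for pair in combinations(store, 2)}


store = ["b1", "b2", "p1", "p2"]  # two benign, two poisoned records
loo = leave_one_out(store, harm)
pairs = pair_ablation(store, harm)
print(loo)
print(pairs[("p1", "p2")])
```

Every single-memory ablation reports zero influence because the other poisoned record keeps the attack alive, while removing the pair together exposes the full effect. Grouped ablation costs a number of replays that grows quadratically in the store size for pairs, which is one reason Shapley-style estimators are usually sampled rather than enumerated.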

MCG design: The structural anomaly score (Equation 8) multiplies inconsistency weight by semantic similarity, which captures the intuition that a memory contradicting its nearest neighbors is suspicious. However, the detection threshold (μ + 2σ) is a simple statistical heuristic, and the paper offers no formal justification for it. The reliance on DeBERTa-v3 for NLI is reasonable, but the paper provides no analysis of how NLI model quality affects downstream auditing performance.
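The scoring and thresholding described here can be sketched as follows. This is an illustration under stated assumptions, not the paper's Equation 8: the score sums contradiction weight times similarity over all other memories (rather than a nearest-neighbor subset), the matrices are hand-built toy values instead of embedding and NLI outputs, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List

Matrix = List[List[float]]


def anomaly_scores(sim: Matrix, contra: Matrix) -> List[float]:
    """Per-memory score: sum over other memories of contradiction * similarity."""
    n = len(sim)
    return [sum(contra[i][j] * sim[i][j] for j in range(n) if j != i)
            for i in range(n)]


def flag_outliers(scores: List[float], k: float = 2.0) -> List[int]:
    """Indices whose score exceeds mean + k * (population) standard deviation."""
    threshold = mean(scores) + k * pstdev(scores)
    return [i for i, s in enumerate(scores) if s > threshold]


# Toy store: 8 mutually similar memories; index 7 contradicts all the others.
n, poisoned = 8, 7
sim = [[0.0 if i == j else 0.8 for j in range(n)] for i in range(n)]
contra = [[0.9 if poisoned in (i, j) and i != j else 0.05 for j in range(n)]
          for i in range(n)]

scores = anomaly_scores(sim, contra)
flagged = flag_outliers(scores)
print(flagged)
```

The heuristic nature of the threshold shows up even in this toy setting: with one outlier among n otherwise identical scores, the outlier sits sqrt(n - 1) population standard deviations above the mean, so a fixed cutoff of 2 cannot flag it at all when the store holds five or fewer memories, regardless of how anomalous the record is.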

Evaluation concerns: The most striking result—ASR dropping to 0.0% in most settings—is almost suspiciously perfect. While impressive, this raises questions about whether the evaluation scenario is sufficiently challenging. The contamination ratios tested (ρ = 0.15–0.20) represent relatively sparse poisoning. The contamination analysis (Tables 4-5) reveals a sharp phase transition where the method completely fails (ASR near 100%) at ρ ≥ 0.25, suggesting the method operates effectively only in a narrow regime. The binary nature of this transition (from 0% to 60-100%) suggests the method may be brittle rather than gracefully degrading.

Baselines: The three baselines (random deletion, retrieval-frequency, nearest-neighbor contradiction) are reasonable but relatively weak. There is no comparison against more sophisticated attribution methods (e.g., influence functions, gradient-based attribution) or against existing memory defense methods like A-MemGuard adapted to post-hoc settings. The NNC baseline is essentially the MCG component without the causal signal, making the baseline comparison partially redundant with the ablation study.

Statistical reporting: Results are averaged over 10 runs but no variance or confidence intervals are reported, making it difficult to assess reliability, especially for the 0.0% ASR claims.
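To show why a bare 0.0% is hard to interpret without interval estimates, the sketch below computes a Wilson score interval for an observed attack success rate. The trial counts are hypothetical, since the number of attack attempts behind each reported figure is not given here, and the 95% level (z = 1.96) is an assumed choice.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Zero successful attacks observed, under two hypothetical sample sizes.
lo10, hi10 = wilson_interval(0, 10)
lo100, hi100 = wilson_interval(0, 100)
print(f"n=10:  upper bound {hi10:.3f}")
print(f"n=100: upper bound {hi100:.3f}")
```

With zero successes the upper bound reduces to z² / (n + z²), so an observed 0% over 10 attempts is still compatible with a true success rate above 25%, whereas 100 attempts tighten that bound to under 4%.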

3. Potential Impact

The paper addresses a real and growing concern. As LLM agents are deployed with persistent memory in production settings (e.g., personal assistants, coding agents), the ability to audit and repair compromised memory states has genuine practical value. The post-hoc framing is particularly relevant for enterprise deployments where incident response and forensic analysis are standard practices.

However, the practical impact is tempered by several factors: (1) the method requires replaying harmful events, which may be costly for agents with expensive API calls or long-horizon trajectories; (2) the sharp failure mode at moderate contamination levels limits applicability to early-stage poisoning detection; and (3) the evaluation is limited to a single attack framework (MINJA) with only two task settings.

4. Timeliness & Relevance

The paper is timely. Memory-augmented agents are becoming mainstream (Voyager, Reflexion, generative agents), and memory poisoning has been demonstrated as practical (MINJA, AgentPoison, MemoryGraft). The post-hoc auditing perspective is genuinely underexplored—most prior work focuses on prevention. The paper correctly identifies that real-world failure patterns typically involve discovering problems after deployment, making forensic memory auditing a natural complement to online defenses.

5. Strengths & Limitations

Strengths:

  • Novel and well-motivated problem formulation (post-hoc memory auditing as a security primitive)
  • Clean framework design combining complementary signals
  • Evaluation across three LLM backbones (GPT-4o, GPT-4o-mini, DeepSeek)
  • Honest discussion of operating boundaries and failure modes under dense contamination
  • The batch auditing protocol (analyzing all events against the same memory state) is a sensible design choice that avoids cascading attribution errors

Limitations:

  • Extremely narrow evaluation scope (one attack, two tasks)
  • No analysis of false positive rates or benign memory preservation—critical for practical deployment
  • The method's computational cost scales linearly with |R*| × |E| in replay calls, which could be prohibitive for large memory stores
  • Leave-one-out attribution ignores synergistic effects between poisoned memories
  • The fusion weight α = 0.6 is selected based on grid search over a small set of values; no principled method for setting this hyperparameter
  • No comparison with training-data influence methods (influence functions, datamodels) adapted to this setting
  • The 0% ASR results across multiple settings, combined with the absence of error bars, weaken confidence in the generalizability of these findings
  • The paper does not discuss or evaluate the time/cost overhead of the counterfactual replay process
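The replay-cost limitation above can be made concrete with simple arithmetic. The sketch reads |R*| as the number of candidate memories audited and |E| as the number of harmful events replayed, which is an interpretation of the notation; `steps_per_replay` is a hypothetical multiplier for multi-step trajectories, and all counts are illustrative.

```python
def replay_calls(num_candidates: int, num_events: int, steps_per_replay: int = 1) -> int:
    """Model calls needed to ablate every candidate memory once per harmful event."""
    if min(num_candidates, num_events, steps_per_replay) < 0:
        raise ValueError("counts must be non-negative")
    return num_candidates * num_events * steps_per_replay


# Hypothetical audit sizes.
small = replay_calls(num_candidates=20, num_events=5)
large = replay_calls(num_candidates=2000, num_events=50, steps_per_replay=10)
print(small, large)
```

The small audit needs 100 calls, while the larger one needs a million, which is why a reported time or cost overhead would matter for judging deployability.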
Overall Assessment

    MemAudit makes a meaningful conceptual contribution by formalizing post-hoc memory auditing as a security problem for LLM agents, filling a genuine gap in the defense literature. The dual-signal framework is intuitive and the problem framing is timely. However, the empirical evaluation is narrow (single attack, two tasks, no false positive analysis), the baselines are weak, and the sharp failure mode under moderate contamination raises questions about practical robustness. The work would benefit significantly from broader evaluation, stronger baselines, cost analysis, and utility-preservation metrics.

    Rating: 4.8/10
    Significance: 5.5 · Rigor: 4 · Novelty: 5.5 · Clarity: 7

    Generated May 25, 2026

    Comparison History (21)

    vs. Energy Shields for Fairness
    claude-opus-4.6 · 5/26/2026

    Paper 1 introduces a fundamentally novel concept—energy shields for runtime fairness—that combines physics-inspired energy functions with formal guarantees (safety and liveness), representing a significant theoretical contribution with broad applicability across any sequential decision-making system requiring fairness. It provides both conceptual innovation and a synthesis procedure. Paper 2, while addressing an important and timely security problem in LLM agents, is more narrowly scoped as an application-specific defense mechanism evaluated against a single attack type. Paper 1's formal framework and generalizability suggest broader and more lasting impact across fairness, formal methods, and control theory.

    vs. How Well Do Models Follow Their Constitutions?
    gemini-3.1 · 5/26/2026

    Paper 1 addresses a critical and highly relevant challenge in AI governance and alignment: evaluating whether frontier models actually adhere to their stated behavioral specifications. By proposing a comprehensive audit pipeline and analyzing the trajectory of state-of-the-art models, it provides valuable insights for policy, safety, and model development. While Paper 2 offers a strong technical defense against a specific agent vulnerability, Paper 1 has broader implications for the safe deployment and regulation of general-purpose AI systems.

    vs. Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care
    gemini-3.1 · 5/26/2026

    Paper 1 addresses a highly timely and critical vulnerability in modern AI systems—LLM agent memory poisoning. Its proposed post-hoc causal auditing framework provides a novel defense mechanism with striking empirical results, reducing attack success rates to 0%. In contrast, Paper 2 applies established interpretable ML methods to standard acoustic/linguistic features for mental health. While Paper 2 has strong clinical relevance, Paper 1 demonstrates greater methodological innovation and immediate, high-impact relevance to the rapidly accelerating field of LLM agent security.

    vs. Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework
    claude-opus-4.6 · 5/26/2026

    MemAudit addresses a critical and emerging security vulnerability in memory-augmented LLM agents with a novel post-hoc causal auditing framework, combining counterfactual influence scoring with structural anomaly detection. It demonstrates dramatic results (reducing attack success to 0%) against realistic attacks. This tackles a timely, practical problem as LLM agents with persistent memory become widespread. Paper 2 proposes a useful evaluation framework but is more incremental—multi-dimensional evaluation frameworks exist in related forms, and its contributions, while methodologically sound, are less novel and have narrower security implications compared to Paper 1's pioneering approach to memory poisoning defense.

    vs. MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional
    gpt-5.2 · 5/26/2026

    Paper 1 likely has higher scientific impact due to stronger novelty and broader relevance: it targets a general, emerging security vulnerability in memory-augmented LLM agents and proposes a post-hoc causal auditing framework combining counterfactual attribution with structural anomaly detection. This is methodologically oriented and applicable across many agent domains beyond QA, including enterprise and autonomous systems. Paper 2 is a useful systems/architecture contribution for clinical benchmarking, but its gains are modest, potentially model- and grader-dependent, and more domain-specific, limiting breadth and generalizability.

    vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization
    gpt-5.2 · 5/25/2026

    Paper 1 targets a timely, high-stakes security vulnerability in memory-augmented LLM agents and introduces a broadly applicable post-hoc auditing method (causal attribution + structural anomaly detection). The problem is likely to grow as persistent agent memory becomes standard, and the approach generalizes across agent types and domains, with strong empirical reductions in attack success under realistic scenarios. Paper 2 is innovative and valuable for formal methods and theorem-proving workflows, but its impact is narrower (Lean-centric) and more specialized, whereas MemAudit addresses cross-domain AI safety/security with wider near-term adoption potential.

    vs. Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework
    claude-opus-4.6 · 5/25/2026

    Paper 1 addresses a fundamental limitation in diffusion model safety—showing concept erasure methods are inherently flawed by analyzing denoising trajectories. Its black-box, training-free attack framework (ConceptAgent) has broader implications for AI safety policy and model deployment. The theoretical insight about early-stage disruption vs. later semantic propagation is novel and generalizable. Paper 2, while practically valuable for LLM agent security, addresses a narrower problem (memory poisoning auditing) with a more incremental contribution combining existing techniques (counterfactual analysis + anomaly detection). Paper 1's findings challenge foundational assumptions in content safety, likely spurring broader research.

    vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems
    gpt-5.2 · 5/25/2026

    Paper 2 is more likely to have higher scientific impact: it targets a timely, high-stakes problem (security of agent memory) with clear real-world applicability and provides an evaluative methodology with quantitative results on concrete attacks (MINJA), indicating stronger empirical rigor. Its causal attribution + structural anomaly approach could be adopted across many memory-augmented agent systems, influencing both ML security research and deployment practices. Paper 1 is conceptually novel (event-sourced reactive graphs) but is more architectural and exploratory, with less demonstrated empirical validation, making its near-term impact less certain.

    vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
    claude-opus-4.6 · 5/25/2026

    Trace2Skill addresses a broader and more impactful problem: improving LLM agents for complex hardware design tasks through a novel test-time scaling framework that avoids fine-tuning. Its skill evolution approach (mining rollout traces, oracle/mutator/selector loop) is more generalizable across verifiable domains beyond EDA. While MemAudit makes a solid contribution to LLM agent security via post-hoc memory auditing, it addresses a narrower problem (memory poisoning defense) with a more incremental methodological advance. Trace2Skill's potential to transform hardware design automation and its transferable test-time scaling paradigm give it broader cross-field impact.

    vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
    gpt-5.2 · 5/25/2026

    Paper 2 likely has higher scientific impact because it tackles a timely, broadly relevant security problem for memory-augmented LLM agents (post-hoc auditing after harm), with clear cross-domain implications for agent safety, governance, and deployment. Its causal attribution + structural anomaly approach is conceptually novel and generalizable across agent architectures and applications, and it reports large risk reductions under a realistic attack model. Paper 1 is valuable and application-focused, but its impact is narrower (spreadsheet automation) and more incremental (RL fine-tuning + benchmark/environment engineering) with modest performance gains.

    vs. The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning
    claude-opus-4.6 · 5/25/2026

    Paper 1 addresses the broadly significant and timely question of how AI assistance affects human learning and skill development—a topic with immense societal implications as AI tools become ubiquitous in education and workplaces. Its findings on AI usage intensity and informativeness mediating learning outcomes are novel and policy-relevant, with potential impact across education, cognitive science, and AI policy. Paper 2 presents a solid security contribution (MemAudit) but addresses a narrower technical problem in LLM agent memory poisoning. While important, its impact is more confined to the AI security community, whereas Paper 1's insights have broader cross-disciplinary relevance.

    vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
    gpt-5.2 · 5/25/2026

    Paper 2 is likely higher impact: it tackles a timely, high-stakes security problem for memory-augmented LLM agents (poisoned persistent memory) with a novel post-hoc auditing angle, combining causal attribution with structural anomaly detection. The approach has clear real-world applicability for deploying safer agents and could influence both ML security and agent design. While Paper 1 provides a valuable, scalable benchmark for terminal agents, benchmarks are more incremental and domain-specific; MemAudit’s methods and framing generalize across agent systems and align with urgent concerns about agent robustness and adversarial misuse.

    vs. Agentic Proving for Program Verification
    gemini-3.1 · 5/25/2026

    Paper 1 introduces a novel, foundational framework for auditing memory-augmented agents, addressing a critical and emerging vulnerability in AI security (memory poisoning). Paper 2, while demonstrating impressive empirical results in formal verification, primarily evaluates an existing proprietary agent (Claude) to highlight benchmark saturation. Paper 1's methodological innovation in causal attribution and its broad applicability to AI safety give it a higher potential for sustained scientific impact.

    vs. Foundation Protocol: A Coordination Layer for Agentic Society
    claude-opus-4.6 · 5/25/2026

    MemAudit addresses a concrete, well-defined security vulnerability in memory-augmented LLM agents with a novel post-hoc causal auditing framework. It demonstrates strong empirical results (reducing attack success to 0%), offers methodological rigor with counterfactual attribution and structural anomaly detection, and tackles a timely problem as memory-augmented agents proliferate. Paper 2 proposes a coordination protocol but is more of an architectural vision/position paper without empirical validation, making its actual scientific impact more speculative and harder to build upon incrementally.

    vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
    gemini-3.1 · 5/25/2026

    Paper 1 identifies a novel, counterintuitive inverse scaling phenomenon in critical domains like finance and epidemiology. By exposing a fundamental flaw in how advanced LLMs handle tail-risk and challenging standard evaluation metrics, it offers broader implications for AI safety, benchmarking, and real-world deployment compared to Paper 2's specialized security defense tool.

    vs. Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems
    claude-opus-4.6 · 5/25/2026

    Paper 2 (MemAudit) addresses a concrete, timely security vulnerability in memory-augmented LLM agents with a novel technical framework combining causal attribution and structural anomaly detection. It demonstrates strong empirical results (reducing attack success to 0%) against realistic attacks. While Paper 1 offers a valuable theoretical contribution to organizational theory around agentic AI boundaries, Paper 2's methodological rigor, immediate practical applicability to a rapidly growing area (LLM agent security), and reproducible quantitative results give it higher near-term scientific impact and broader relevance across AI safety and security communities.

    vs. Unlocking Proactivity in Task-Oriented Dialogue
    claude-opus-4.6 · 5/25/2026

    Paper 1 presents a more comprehensive and novel framework addressing a fundamental challenge in task-oriented dialogue—proactivity in LLM agents. It introduces multiple interconnected innovations (Cognitive User Simulator, Asymmetric-View Policy Optimization, state-transition refinement) with broader applicability to persuasive AI, sales, and general agent behavior. Paper 2 addresses an important but narrower security problem (memory poisoning auditing) with strong empirical results but limited scope. Paper 1's contributions to RL training methodology, user simulation, and proactive dialogue have wider cross-field impact potential.

    vs. Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions
    claude-opus-4.6 · 5/25/2026

    MemAudit addresses a timely, practical security vulnerability in LLM agents with persistent memory—a rapidly growing deployment paradigm. It provides a novel post-hoc auditing framework combining causal attribution and structural anomaly detection, with strong empirical results (attack success reduced to 0%). Its immediate applicability to real-world AI safety, methodological rigor with quantitative evaluation, and relevance to the booming LLM agent ecosystem give it broader near-term impact. Paper 1, while theoretically ambitious in extending mediative fuzzy logic across multiple type levels, addresses a more niche theoretical area with less empirical validation and narrower audience.

    vs. Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
    claude-opus-4.6 · 5/25/2026

    Paper 1 (IDS) addresses a fundamental challenge in software engineering—automating formal verification of distributed systems—achieving 7/7 success where SOTA agents manage only 2/7. It demonstrates a 200x speedup over expert effort with practical cost savings, representing a potentially transformative advance in verified system development. Paper 2 (MemAudit) addresses an important but narrower security concern (memory poisoning in LLM agents) with strong results, but its scope and transformative potential are more limited. IDS's impact spans formal methods, distributed systems, and AI-assisted programming, with broader implications for trustworthy software at scale.

    vs. The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems
    gpt-5.2 · 5/25/2026

    Paper 2 has higher likely impact because it targets an immediate, widely relevant security failure mode of memory-augmented LLM agents with a concrete, testable framework and strong empirical results (attack success reduced to 0% in two settings). Its applications are practical (post-hoc forensics, remediation, monitoring) and timely as agentic systems deploy. Paper 1 is ambitious and broad, but makes sweeping theoretical claims that are hard to validate from the abstract and may face higher skepticism/verification burden, reducing near-term uptake despite potential long-run significance.