Beyond Similarity: Trustworthy Memory Search for Personal AI Agents
Jiawen Zhang, Kejia Chen, Jiachen Ma, Yangfan Hu, Lipeng He, Yechao Zhang, Jian Liu, Xiaohu Yang
Abstract
Personal AI agents increasingly rely on long-term memory to provide persistent personalization across sessions. However, existing memory pipelines are largely driven by semantic similarity: memory data close to the current query is retrieved and injected into the model context. This creates a critical trustworthiness gap, since a semantically related memory may still be contextually inappropriate, leading to threats such as cross-domain leakage, sycophancy, tool-call drift, or memory-induced jailbreaks. In this paper, we study memory search as a trust boundary in personal AI agents. We evaluate representative agentic memory frameworks, including A-Mem, Mem0, and MemOS, together with OpenClaw, a real-world personal-agent environment with persistent state and tool-use capability. Our results show that long-term memory is not merely a utility layer, but a durable control channel that can reshape how agents interpret tasks and execute actions, leaving them highly susceptible to the aforementioned threats. To mitigate these vulnerabilities, we propose MemGate, a lightweight and deployable memory plug-in for trustworthy memory search, with only 9M parameters and a 35.1MB footprint. MemGate is inserted between the vector memory store and the backbone LLM, requiring no LLM modification, memory-database rewriting, or inference-time LLM judge. It applies a query-conditioned neural gate to candidate memory representations, turning raw similarity search into task-conditioned memory admission. Across multiple mainstream memory frameworks, real-world agent settings, and diverse LLM backbones, MemGate reduces memory-induced threats while preserving long-term memory utility.
AI Impact Assessments
Scientific Impact Assessment: "Beyond Similarity: Trustworthy Memory Search for Personal AI Agents"
1. Core Contribution
This paper identifies and formalizes a previously under-examined vulnerability in personal AI agents: that semantic similarity-based memory retrieval creates a trust boundary problem. The key insight is that memories semantically close to a query may be contextually inappropriate, leading to four failure modes: cross-domain leakage, sycophancy amplification, tool-call drift, and memory-induced jailbreaks. The paper contributes both a systematic empirical characterization of these threats and a mitigation mechanism called MemGate—a 9M-parameter neural gate that performs query-conditioned filtering of memory embeddings before they enter the LLM context.
The conceptual reframing of memory search as a "trust boundary" rather than a utility layer is the paper's strongest intellectual contribution. By formalizing the distinction between semantic similarity and contextual admissibility (Equation 3), the authors provide a clean abstraction that unifies diverse failure modes under a single framework.
2. Methodological Rigor
Strengths in evaluation design: The paper evaluates across four agent frameworks (A-Mem, Mem0, MemOS, OpenClaw), six LLM backbones spanning open-source and proprietary models, and four distinct threat categories. This breadth substantially strengthens the generalizability claims. The inclusion of both benign over-personalization and malicious manipulation threat models provides comprehensive coverage.
MemGate architecture: The design is technically sound—using concatenation, element-wise product, and the original embeddings as input features to an MLP that produces a continuous mask is a well-motivated approach. The DPO-based training objective with positive-memory preservation penalty is elegant, though the theoretical analysis (Theorems 1 and 2) provides relatively standard bounds that, while correct, offer limited practical insight beyond confirming the regularizer's basic properties.
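To make the gating description concrete, here is a minimal sketch of a query-conditioned gate in plain Python. It follows only what the paragraph above states: the query embedding, the memory embedding, and their element-wise product are concatenated and fed to an MLP that emits a continuous mask. Everything else is an assumption made for illustration and not confirmed by the paper: the two-layer shape, the ReLU and sigmoid activations, a per-dimension mask (rather than one scalar per memory), the random untrained weights, and the hypothetical names make_mlp, linear, and gate_memory.

```python
import math
import random


def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Build randomly initialised two-layer MLP weights (untrained, illustration only)."""
    rng = random.Random(seed)

    def layer(n_in, n_out):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        return weights, biases

    return layer(in_dim, hidden_dim), layer(hidden_dim, out_dim)


def linear(x, weights, biases):
    """Affine map: one dot product per output unit, plus its bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def gate_memory(query, memory, mlp):
    """Return (mask, gated_memory) for one candidate memory embedding."""
    (w1, b1), (w2, b2) = mlp
    # Input features: [query ; memory ; query * memory] (element-wise product).
    features = query + memory + [q * m for q, m in zip(query, memory)]
    hidden = [max(0.0, h) for h in linear(features, w1, b1)]  # ReLU (assumed)
    # Sigmoid keeps each mask entry strictly between 0 and 1 (assumed activation).
    mask = [1.0 / (1.0 + math.exp(-z)) for z in linear(hidden, w2, b2)]
    # The mask rescales the memory embedding before it reaches the LLM context.
    gated = [g * m for g, m in zip(mask, memory)]
    return mask, gated


# Toy 4-dimensional embeddings; real embeddings would come from the vector store.
query = [0.2, -0.1, 0.7, 0.4]
memory = [0.5, 0.3, -0.6, 0.1]
mlp = make_mlp(in_dim=3 * len(query), hidden_dim=8, out_dim=len(memory))
mask, gated = gate_memory(query, memory, mlp)
```

Because the mask depends on the query as well as the memory, the same stored memory can be admitted for one task and attenuated for another, which is the behaviour the review attributes to task-conditioned admission.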
Concerns: The training data consists of only 1,640 synthetically generated preference pairs from GPT-4o-mini. While the authors claim no test-set leakage, the reliance on a single model for preference pair generation raises questions about the diversity and coverage of the training signal. The paper would benefit from analysis of failure cases where MemGate incorrectly gates beneficial memories or fails to gate harmful ones. Additionally, the judge model (GPT-4o-mini) used for evaluation is the same model used to generate training data, creating a potential circularity concern.
The evaluation metrics rely on LLM-as-judge (GPT-4o-mini for most benchmarks, Llama-Guard for jailbreak), with ordinal scoring and thresholding. While common practice, the reliability of these judgments is not validated against human annotations.
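One way the missing validation could be run is sketched below: ordinal judge scores are thresholded into binary failure labels and compared with human labels using Cohen's kappa. The 1-to-5 scale, the threshold of 3, the sample scores, and the helper names threshold_scores and cohens_kappa are all hypothetical; the paper's actual rubric and cut-offs are not given in this assessment.

```python
def threshold_scores(scores, threshold):
    """Map ordinal judge scores to binary labels: 1 if score >= threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    rate_a = sum(labels_a) / n
    rate_b = sum(labels_b) / n
    # Chance agreement: both say 1, or both say 0, if the raters were independent.
    expected = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if expected == 1.0:
        return 1.0  # both raters are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical data: judge scores on a 1-5 scale and human binary labels.
judge_scores = [1, 2, 4, 5, 3, 1]
human_labels = [0, 0, 1, 1, 0, 0]

judge_labels = threshold_scores(judge_scores, threshold=3)
failure_rate = sum(judge_labels) / len(judge_labels)
kappa = cohens_kappa(judge_labels, human_labels)
```

Reporting kappa alongside the thresholded failure rate would show how much of the measured effect depends on where the ordinal cut-off is placed.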
3. Potential Impact
Practical deployment value: MemGate's lightweight footprint (35.1MB, ~40ms latency overhead) and plug-in architecture make it genuinely deployable. The fact that it requires no LLM modification, memory rewriting, or inference-time LLM judge calls addresses real engineering constraints. This positions it well for adoption in production memory systems.
Broader implications: The paper's framing of memory-as-attack-surface has significant implications for the AI safety community. The demonstration that memory can serve as a "durable control channel"—where planted benign memories later legitimize harmful requests—represents a practically important attack vector that differs from standard prompt injection. This could catalyze research into memory hygiene, provenance tracking, and contextual access control in agent systems.
Limitations in scope: The work focuses exclusively on retrieval-time gating. It does not address write-time memory sanitization, memory lifecycle management, or the interaction between memory gating and other agent components (e.g., planning, reflection). The paper also doesn't explore adversarial robustness of MemGate itself—a sophisticated attacker might craft memories designed to bypass the gate.
4. Timeliness & Relevance
This paper is highly timely. Personal AI agents with persistent memory (e.g., ChatGPT's memory feature, various open-source agent frameworks) are rapidly proliferating. The security implications of persistent memory have received minimal systematic attention compared to prompt injection or jailbreaking. The paper fills this gap at a moment when the industry is actively building these systems, making the findings immediately actionable.
The cited benchmarks (PersistBench, MemDrift, PS-Bench) are all from 2025-2026, indicating this is an active research frontier. The evaluated systems (Mem0, A-Mem, MemOS, OpenClaw) represent current state-of-the-art memory frameworks.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional Observations
The paper's utility improvement is a particularly compelling finding. It suggests that raw similarity retrieval is suboptimal even for benign use cases, and that task-conditioned gating provides a general improvement to memory quality. This dual benefit (safety + utility) significantly strengthens the practical case for adoption.
The choice of DPO over simpler contrastive objectives is interesting but not ablated—it would be valuable to know whether simpler training objectives (e.g., triplet loss, InfoNCE) achieve comparable results.
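For readers comparing the objectives named above, the following sketch writes the standard per-example forms of the DPO, triplet, and InfoNCE losses side by side. These are the textbook definitions, not the paper's training code: how MemGate computes the log-probabilities or similarity scores that feed them is not described in this assessment, and the default values of beta, margin, and temperature are placeholders.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the policy's preference margin above the reference model's."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def triplet_loss(dist_pos, dist_neg, margin=0.5):
    """Triplet: the positive must be closer than the negative by at least `margin`."""
    return max(0.0, dist_pos - dist_neg + margin)


def info_nce_loss(pos_score, neg_scores, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    top = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[0]
```

The practical difference is in the supervision each objective needs: DPO consumes preference pairs scored against a frozen reference, while the triplet and InfoNCE losses need only distances or similarity scores, which is why an ablation between them would be informative.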
The paper opens several important research directions: write-time memory validation, multi-turn adversarial probing of memory gates, and integration with memory provenance systems. These are acknowledged implicitly but could be discussed more explicitly.
Generated Jun 5, 2026
Comparison History (23)
Paper 1 targets an immediate, widely encountered bottleneck in personal AI agents: trustworthy long-term memory retrieval as a security/control boundary. It introduces a lightweight, deployable gating module that plugs into existing memory stacks without LLM modification, and evaluates across multiple mainstream frameworks and real-world agent settings—suggesting strong methodological rigor and high near-term adoption potential. Its impact spans security, alignment, retrieval, and agentic systems, and directly addresses timely failure modes (jailbreaks, sycophancy, tool drift). Paper 2 is promising but likely narrower and harder to standardize across diverse simulators.
Paper 1 targets a fundamental bottleneck in AI development: data curation. By introducing a novel benchmark and demonstrating that scaffolded agents can autonomously discover data-selection policies that outperform human baselines, it promises to accelerate the entire AI training pipeline. While Paper 2 addresses an important security vulnerability in personal AI agents, Paper 1 has broader, field-wide implications for how AI research and model training are conducted. Automating the data-curation loop could fundamentally shift the paradigm of machine learning development, giving it higher potential scientific impact.
Paper 1 addresses the critical, widespread problem of AI-generated content attribution by combining mechanistic interpretability with watermarking. Its novel approach of steering activation signatures without degrading text quality offers a fundamental advancement over traditional external watermarking. While Paper 2 tackles an important security vulnerability in agent memory systems, Paper 1's impact is likely broader and more foundational, affecting general LLM deployment, AI safety, and regulatory compliance regarding synthetic media.
While Paper 1 addresses important security vulnerabilities in personal AI agents, Paper 2 tackles a universal bottleneck in LLM infrastructure: KV cache management for long-context serving. By fundamentally redesigning the KV cache abstraction to be head-aware, RedKnot offers broad, immediate improvements to GPU memory efficiency, scalability, and serving costs across almost all LLM deployments, giving it a higher potential for widespread foundational impact.
Paper 1 (SENSEI) introduces a fundamentally novel paradigm shift in human-AI collaboration—moving from correcting actions to correcting underlying misconceptions via knowledge-gap localization. It demonstrates strong zero-shot compositional generalization and validates with a user study showing 90% misconception correction. This has broad implications across education, assistive systems, and human-AI interaction. Paper 2 (MemGate) addresses an important but more incremental trustworthiness concern in memory-augmented agents with a practical lightweight solution. While valuable, it operates within a narrower scope and represents more of an engineering contribution than a conceptual advance.
Paper 1 addresses a ubiquitous challenge in personal AI agents (memory-induced vulnerabilities) with a highly practical, lightweight, and framework-agnostic solution. Its plug-and-play nature ensures broad and immediate applicability across various LLM backbones. While Paper 2 tackles an important emerging area (Computer Use Agents), its proposed upfront planning architecture is more restrictive and acknowledges unresolved vulnerabilities (branch steering), making Paper 1's contribution more complete and widely deployable.
Paper 1 addresses a timely and critical problem—trustworthiness of memory in personal AI agents—with a practical, deployable solution (MemGate). The rapid adoption of LLM-based agents with persistent memory makes this highly relevant. It identifies novel threat categories (memory-induced jailbreaks, cross-domain leakage) and provides a lightweight mitigation. Paper 2 offers a useful conceptual framework for knowledge infusion in generative models but is more taxonomic in nature. Paper 1's direct applicability to AI safety, its evaluation across multiple real-world frameworks, and the growing importance of agentic AI give it broader and more immediate impact potential.
Paper 1 (CMTF) demonstrates higher potential impact due to its more rigorous experimental evaluation (2448 runs across 4 LLM backends, 102 tasks, 100 tools), its training-free approach requiring no additional model parameters, and its dramatic practical benefits (90% token reduction while maintaining success rates). The causal sufficiency framework offers a more fundamental theoretical contribution compared to Paper 2's neural gating approach. Paper 2 (MemGate) addresses important trustworthiness concerns in memory systems but is more narrowly focused on a specific vulnerability class. CMTF's broader applicability to any tool-using agent and its principled causal reasoning framework give it wider cross-field impact.
Paper 1 has higher potential scientific impact due to its broad, unifying framework for intervention-aware clinical trajectory modeling, explicitly integrating forecasting, counterfactuals, and policy evaluation with identifiability considerations (treatment assignment, time-varying confounding, observation bias). Its real-world stakes are high (clinical decision support), and it connects multiple methodological areas (causal inference, time-series, point processes, deep models, trial emulation), likely influencing both ML and biostatistics practice. Paper 2 is timely and practical for agent safety, but is narrower (memory retrieval gating) and may be overtaken quickly as agent architectures evolve.
Paper 2 has higher likely impact due to stronger methodological rigor (comparative evaluation across multiple memory frameworks, environments, and LLM backbones), clear timeliness (trust/safety issues in agentic long-term memory are immediate), and a deployable, lightweight mitigation (MemGate) with concrete real-world applicability. Its contribution targets a broadly relevant failure mode—memory as a control channel—affecting many personal-agent systems, thus offering wider cross-field impact (ML systems, security, HCI) than Paper 1’s more conceptual motivational-architecture proposal with less demonstrated empirical validation.
Paper 1 addresses a critical and timely trustworthiness gap in personal AI agents' memory systems—an area with significant security implications as AI agents become more prevalent. It identifies novel threat categories (cross-domain leakage, memory-induced jailbreaks) and proposes MemGate, a practical, lightweight solution that works across multiple frameworks without modifying the LLM. The security/trust angle has broader impact across the AI safety community. Paper 2, while valuable for skill distillation, addresses a more incremental improvement in agent capability. Paper 1's focus on trust boundaries in agentic memory is more foundational and likely to influence policy, standards, and future system design.
Paper 2 addresses a critical and highly timely vulnerability in the rapidly expanding field of personal AI agents. By identifying memory search as a trust boundary and proposing a lightweight, universally applicable mitigation (MemGate), it offers broad implications for AI safety and agent design. While Paper 1 presents a strong architectural contribution for constrained optimization, Paper 2's focus on LLM trustworthiness gives it significantly higher potential for immediate, widespread impact across both academic research and industry applications.
Paper 2 proposes a broad paradigm shift—biomedical world models—that could transform drug discovery, clinical trials, surgical planning, and fundamental biology by enabling prospective simulation rather than static pattern recognition. Its breadth of impact spans molecular biology, clinical medicine, and AI, addressing a fundamental limitation of current foundation models. While Paper 1 addresses an important and timely security concern (trustworthy memory for AI agents) with a concrete solution (MemGate), its scope is narrower, targeting a specific vulnerability in agentic memory systems. Paper 2's vision has greater potential to catalyze new research directions across multiple fields.
Paper 2 addresses a critical and highly timely issue in AI safety and personal agent architectures. By identifying vulnerabilities in semantic memory search and proposing a lightweight, universally deployable mitigation (MemGate), it offers broad impact across the rapidly expanding field of LLMs. In contrast, Paper 1 presents a valuable but domain-specific application of existing memory-augmented networks to maritime trajectory prediction, resulting in a significantly narrower scope of scientific impact.
Paper 2 addresses a more fundamental and broadly impactful problem—trustworthiness and security of memory systems in AI agents. It identifies a novel threat model (memory as a control channel) with concrete attack categories (cross-domain leakage, jailbreaks, sycophancy), evaluates multiple existing frameworks, and proposes MemGate, a practical lightweight solution. The security/trust angle has broader implications across the AI safety community. Paper 1 tackles preference learning with a decoupled architecture, which is useful but narrower in scope. Paper 2's focus on trustworthiness is more timely given growing AI safety concerns and has wider cross-field relevance.
Paper 1 targets a timely, high-stakes problem—trust and security failures from long-term memory in personal agents—and frames memory retrieval as a control boundary. Its contribution (MemGate) is lightweight, deployable across multiple existing memory frameworks and LLM backbones, and evaluated in a real-world agent environment with tool use, suggesting broad practical adoption and cross-field impact (LLM security, agent alignment, RAG). Paper 2 offers a strong systems idea for long-horizon state management, but is more task/performance oriented and appears evaluated in a narrower benchmark setting.
Paper 2 addresses a critical and rapidly growing area of AI safety and agentic systems: the security and trustworthiness of long-term memory in personal AI agents. By identifying novel threats like memory-induced jailbreaks and proposing a lightweight, universally applicable mitigation tool (MemGate), it offers high real-world applicability and broad impact across the booming LLM agent ecosystem. Paper 1 offers a valuable optimization for classical planning (PDDL grounding), but its impact is confined to a much narrower, specialized subfield of AI compared to the ubiquitous relevance of secure LLM agents.
Paper 2 likely has higher impact due to timeliness and broad real-world relevance: trustworthy long-term memory is central to deployed personal agents and intersects security, privacy, alignment, and systems. It frames memory retrieval as a trust boundary, evaluates multiple mainstream frameworks in a realistic environment, and proposes a deployable, model-agnostic plugin (MemGate) with small footprint—facilitating adoption across products and research. Paper 1 identifies an important RL failure mode and offers a simple fix with strong gains, but its scope is narrower (specific RL training dynamics) and may translate less directly to widely deployed agent stacks than memory safety controls.
Paper 2 targets a timely, high-impact problem: safety and trustworthiness of long-term memory in personal AI agents, with broad real-world deployment implications. It frames memory retrieval as a security boundary, evaluates multiple existing frameworks in a realistic environment, and proposes a lightweight, modular mitigation (MemGate) applicable across systems and LLM backbones without modifying the LLM or database—supporting wide adoption. Paper 1 is novel within heuristic search for longest-path variants but is more specialized with narrower cross-field impact and application scope.
Paper 1 addresses a critical and emerging security gap in personal AI agents (memory-induced threats), offering high novelty and significant real-world applications for AI safety. Its lightweight mitigation strategy is highly practical. In contrast, Paper 2 presents a multi-agent framework for mathematical reasoning, which, while effective, contributes more incrementally to an already heavily researched area of LLM capabilities.