Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

Abdelghny Orogat, Essam Mansour

May 25, 2026

arXiv:2605.26252v1

cs.AI (primary) · cs.DB

#292 of 2525 · Artificial Intelligence

Tournament Score: 1504 ± 48 (axis range 1050 to 1800)

Win Rate: 82%

Rating: 6.2 / 10

Significance 7.5 · Rigor 4.5 · Novelty 7 · Clarity 8.5

Abstract

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper argues that long-term AI agent memory constitutes a fundamentally new data-management workload that cannot be adequately served by existing database paradigms or current agent memory systems. The authors identify four recurring failure modes in existing approaches (unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval) and propose Governed Evolving Memory (GEM), a state-level abstraction where correctness is defined over the memory trajectory {M_t} rather than individual records. GEM introduces four state-level operators (ingestion, revision, forgetting, retrieval) governed by six correctness conditions (C1–C6). The paper also presents MemState, a prototype on the Kuzu property-graph engine, and outlines a research agenda.

The central intellectual contribution is the reframing: shifting from record-level CRUD to trajectory-level correctness. This is a meaningful conceptual move that draws a clear line between what databases do and what agent memory needs.
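To make the record-level versus trajectory-level distinction concrete, here is a minimal sketch. It is not taken from the paper: the rule checked below ("a value that was dropped for a key must not silently reappear later") is a hypothetical stand-in for a trajectory-level condition, not one of C1–C6, and the names `record_valid` and `trajectory_valid` are illustrative only.

```python
def record_valid(key: str, value: str) -> bool:
    # Record-level check: inspects one record in isolation.
    return bool(key) and bool(value)


def trajectory_valid(states: list[dict[str, str]]) -> bool:
    # Trajectory-level check (hypothetical rule, not from the paper):
    # once a (key, value) pair has been dropped from memory, it must
    # not reappear in any later state.
    retired: set[tuple[str, str]] = set()
    previous: set[tuple[str, str]] = set()
    for state in states:
        current = set(state.items())
        if current & retired:
            return False
        retired |= previous - current  # pairs present before, gone now
        previous = current
    return True


# Three successive memory states M_0, M_1, M_2 (toy data).
states = [{"city": "Montreal"}, {"city": "Toronto"}, {"city": "Montreal"}]

all_records_ok = all(record_valid(k, v) for s in states for k, v in s.items())
trajectory_ok = trajectory_valid(states)
```

In this toy run every individual record passes its own check, yet the sequence as a whole violates the rule, which is the kind of error a purely record-level view cannot see.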

Methodological Rigor

The paper is primarily a vision paper with formalization, not an empirical study, and it should be evaluated on those terms.

Strengths in formalization: The memory state definition M_t = (D_t, S_t, P_t) is clean and well-structured. The six correctness conditions are clearly stated and individually motivated. The distinction between extension edges (propagation-bearing) and association edges (context only) is a useful design choice that prevents unnecessary cascading updates.
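The extension/association distinction can be illustrated with a small sketch. Everything below is an assumption made for illustration: the edge representation, the direction in which a revision flows (source to target), and the function name `units_to_revisit` are hypothetical and are not MemState's implementation.

```python
from collections import deque


def units_to_revisit(edges: list[tuple[str, str, str]], revised: str) -> set[str]:
    """Return units reachable from `revised` through extension edges only.

    Each edge is (source, target, kind), with kind either "extension"
    (assumed to carry revisions forward) or "association" (context only).
    """
    adjacency: dict[str, list[str]] = {}
    for source, target, kind in edges:
        if kind == "extension":  # association edges are ignored on purpose
            adjacency.setdefault(source, []).append(target)

    reached: set[str] = set()
    queue = deque([revised])
    while queue:
        unit = queue.popleft()
        for nxt in adjacency.get(unit, []):
            if nxt != revised and nxt not in reached:
                reached.add(nxt)
                queue.append(nxt)
    return reached


# Toy memory graph (hypothetical unit names).
edges = [
    ("diet", "meal_plan", "extension"),
    ("meal_plan", "grocery_list", "extension"),
    ("diet", "favorite_restaurant", "association"),
]
affected = units_to_revisit(edges, "diet")
```

Under these assumptions, revising `diet` flags `meal_plan` and `grocery_list` for re-examination while `favorite_restaurant` is left alone, which is how the split avoids unnecessary cascading updates.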

Weaknesses in rigor: The three "structural observations" (Section 3.4) are explicitly acknowledged as structural claims rather than theorems. While the authors are honest about this, the lack of formal proofs weakens the central argument that no CRUD-based system can satisfy GEM's conditions. For instance, Observation 1 claims that pure-function retrieval cannot satisfy C6, but one could argue that a database trigger on read operations achieves something functionally equivalent. The paper partially addresses this by noting that such approaches decouple the state-modifying step, but this distinction could be more rigorously drawn.
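The contrast at stake in Observation 1 can be sketched as two retrieval signatures. This is only an illustration under stated assumptions: the review does not say what a C6-conformant retrieval writes, so the per-fact access counter below is a hypothetical placeholder, and `Memory`, `retrieve_pure`, and `retrieve_stateful` are invented names.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Memory:
    facts: tuple[str, ...]
    # Hypothetical usage metadata; the paper's actual state is richer.
    access_counts: dict[str, int] = field(default_factory=dict)


def retrieve_pure(memory: Memory, query: str) -> list[str]:
    # Pure function: the memory state is the same before and after.
    return [f for f in memory.facts if query in f]


def retrieve_stateful(memory: Memory, query: str) -> tuple[list[str], Memory]:
    # State-level operator: returns the hits together with the next state.
    hits = [f for f in memory.facts if query in f]
    counts = dict(memory.access_counts)  # copy, so the old state is untouched
    for f in hits:
        counts[f] = counts.get(f, 0) + 1
    return hits, Memory(facts=memory.facts, access_counts=counts)


m0 = Memory(facts=("likes tea", "likes hiking"))
pure_hits = retrieve_pure(m0, "tea")
hits, m1 = retrieve_stateful(m0, "tea")
```

Both calls return the same hits, but only the second produces a successor state, so the read becomes a step in the trajectory. A trigger on reads could emulate the counter update, which is the objection raised above; the sketch does not settle whether that emulation satisfies C6.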

MemState as validation: The prototype is described architecturally but lacks empirical evaluation. There are no experiments measuring latency, scalability, correctness preservation over long interaction sequences, or comparisons against baselines like Mem0 or Zep. The paper acknowledges MemState as a "feasibility sketch," but this limits the ability to judge whether the abstraction imposes acceptable overhead. The gap between conceptual contribution and empirical validation is significant.

Potential Impact

The paper targets a real and growing problem. As LLM-based agents become more prevalent in production settings (ChatGPT memory, Cursor, Claude Code), millions of users run into the limitations of append-only memory in practice. The four failure modes identified are concrete and recognizable.

For the database community: The paper's framing as a new workload is strategically positioned to attract attention from the data management community, following the precedent of stream processing becoming a recognized workload. If the community accepts this framing, it could spawn research on native engines, new query languages with write-on-read semantics, and trajectory-level benchmarks.

For the AI agents community: The correctness conditions provide a principled checklist for evaluating memory systems. Table 1's systematic comparison across database paradigms and agent memory families is a useful reference artifact.

For industry: The failure modes directly map to user complaints about existing products. If the GEM abstraction or something like it gets adopted, it could meaningfully improve agent reliability in long-horizon deployments.

However, impact depends heavily on follow-through. Without empirical validation demonstrating that GEM-conformant systems measurably outperform existing approaches, the ideas may remain aspirational.

Timeliness & Relevance

The paper is extremely timely. Long-term memory for LLM agents is an active area with several concurrent efforts (Mem0, Zep, MemGPT, EverMemOS, A-MEM, Memory-R1) published in 2024-2025. The paper synthesizes these into a coherent landscape and identifies their shared limitations. The timing is good—the field is mature enough that the failure modes are documented but young enough that foundational abstractions are not yet settled.

Strengths

1. Clear problem identification: The four failure modes are concrete, well-illustrated (Figure 1 is effective), and map directly to real products.

2. Comprehensive landscape analysis: Table 1 systematically covers five database paradigms and six agent memory families against four capabilities, revealing that no system achieves all four.

3. Clean formalization: The state tuple, operators, correctness conditions, and policy language are well-defined and composable.

4. Actionable research agenda: The three directions (native engine, trajectory benchmarks, privacy/multi-tenancy) are specific, with stated success criteria—unusual and commendable for a vision paper.

5. Retrieval-as-write insight: Elevating retrieval from a pure function to a state-modifying operator (C6) is a genuinely novel observation with implications for database semantics.

Limitations

1. No empirical evaluation: The absence of experiments is the paper's most significant weakness. The research agenda proposes a "500-turn adversarial workload" but doesn't execute it. Even preliminary results on a simple scenario would substantially strengthen the claims.

2. Structural observations lack formal proof: The impossibility arguments are informal and could be challenged. A formal treatment would make the contribution more durable.

3. LLM dependence is underexplored: The operators rely heavily on LLM calls (topic selection, conflict resolution, propagation decisions). The paper does not analyze the cost, latency, or error propagation from LLM involvement in the data path, which is a practical concern.

4. Scalability questions: The paper does not discuss how GEM scales with the number of topics, the depth of dependency chains, or the frequency of writes. Revision propagating along extension edges could be expensive.

5. Granularity of semantic units: The paper acknowledges that unit boundaries are a design decision but provides limited guidance on how to make this decision in practice, which is critical for real deployments.

6. Comparison depth: While Table 1 is useful, the comparison is qualitative. Quantitative evidence of where existing systems fail on specific scenarios would be more convincing.

Overall Assessment

This is a well-crafted vision paper that identifies a genuine abstraction gap in agent memory systems and proposes a principled formalization. The conceptual contribution is solid, the landscape analysis is thorough, and the research agenda is specific. The main limitation is the absence of empirical validation, which leaves the central claims—that GEM-conformant systems would outperform existing approaches and that no CRUD system can satisfy the conditions—as plausible but unproven. The paper is best understood as a research agenda setter rather than a validated contribution.

Rating: 6.2 / 10

Significance 7.5 · Rigor 4.5 · Novelty 7 · Clarity 8.5

Generated May 27, 2026

Comparison History (17)

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

gemini-3.1 · 5/27/2026

Paper 1 proposes a foundational paradigm shift by redefining AI agent memory as a state trajectory rather than static storage, bridging AI and data management. This conceptual leap offers broader impact across multiple disciplines compared to Paper 2, which addresses the specific, albeit important, problem of distribution shift in multi-turn dialogues. Paper 1's formalization of memory-centric data management has the potential to guide future infrastructure design for all long-running AI agents, making its long-term scientific and practical impact significantly higher.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

gemini-3.1 · 5/27/2026

Paper 1 proposes a fundamental paradigm shift by conceptualizing agent memory as a new data-management workload rather than static storage. By introducing the Governed Evolving Memory abstraction, it bridges the AI and database communities, opening a broad, novel research direction. While Paper 2 provides a highly useful benchmarking tool for existing systems, Paper 1 addresses deeper architectural limitations and offers theoretical foundations that could dictate how future long-term AI agent memories are fundamentally designed and engineered, leading to broader long-term scientific impact.

vs. TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

gemini-3.1 · 5/27/2026

Paper 2 addresses a fundamental and broad challenge in AI—long-term agent memory—proposing a paradigm shift and formalizing a new data-management workload. This foundational approach has the potential to influence a wide range of AI architectures and database systems. Paper 1, while highly timely and practical, targets a relatively niche problem (LLM-generated peer reviews), which limits its broader methodological impact compared to the architectural rethink proposed in Paper 2.

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

claude-opus-4.6 · 5/27/2026

Paper 1 introduces a foundational theoretical framework (GEM) that redefines long-term agent memory as a new data-management workload with formal correctness conditions and impossibility results showing record-level systems are insufficient. This has broader impact by establishing new abstractions that could reshape how the database and AI communities think about agent memory infrastructure. Paper 2 (AgingBench) provides valuable empirical benchmarking of agent degradation, but is more incremental—a diagnostic tool rather than a paradigm shift. Paper 1's formalization opens multiple research directions and is more likely to spawn follow-on work across database systems and AI agent communities.

vs. What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical, high-stakes issue—ethical pluralism in clinical AI—with direct implications for patient care and medical AI safety. Its introduction of a clinician-verified benchmark and attribution method offers immediate real-world utility in preventing dangerous 'deployment monocultures' in healthcare. While Paper 2 presents a strong foundational framework for AI agent memory systems, Paper 1's interdisciplinary breadth (spanning machine learning, medical ethics, and healthcare policy) and its urgent societal relevance give it a higher potential for broad and profound scientific impact.

vs. MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

gpt-5.2 · 5/27/2026

Paper 2 has higher potential impact: it introduces a broadly applicable conceptual and formal framework (GEM) for long-term agent memory, framing it as a new data-management workload with state-trajectory correctness, new operators, and correctness conditions, plus negative results about record-level systems. This targets a timely, cross-cutting bottleneck for AI agents and databases, likely influencing both systems and ML communities and spawning follow-on work (engines, benchmarks, theory). Paper 1 is strong and practical but more domain-specific (clinical guidelines) and incremental within LLM adaptation.

vs. Can LLMs Introspect? A Reality Check

gpt-5.2 · 5/27/2026

Paper 1 proposes a novel data-management abstraction (GEM) for long-term agent memory, with formal operators and correctness conditions plus a prototype (MemState), offering clear methodological structure and immediate systems implications for a timely, fast-growing agent ecosystem. Its impact could span databases, AI systems, and safety/auditing by reframing memory as a state-trajectory workload. Paper 2 is a valuable corrective that tightens evaluation rigor for LLM “introspection,” but is primarily a critique of existing paradigms with narrower direct application. Overall, Paper 1 has higher potential for foundational and cross-field impact.

vs. Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

claude-opus-4.6 · 5/27/2026

Paper 1 proposes a foundational rethinking of data management for AI agent memory, formalizing a new workload (GEM) with correctness conditions and proving limitations of existing paradigms. This has broader impact across databases, AI agents, and systems research, defining a new research area. Paper 2 makes a solid but more incremental contribution to CoT faithfulness detection using circuit tracing. While technically interesting, it addresses a narrower problem. Paper 1's vision-setting nature, formal foundations, and identification of a new data-management workload class give it higher potential for long-term cross-disciplinary impact as AI agents become prevalent.

vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

claude-opus-4.6 · 5/27/2026

Paper 1 introduces a fundamentally new conceptual framework (GEM) for agent memory as a distinct data-management workload, with formal correctness conditions and impossibility results showing existing paradigms are insufficient. This addresses a critical infrastructure need for the rapidly growing AI agent ecosystem, potentially spawning an entire research area. Paper 2, while practically useful, presents an incremental combination of existing techniques (symbolic + neural verification) for a specific application domain with modest performance numbers (72-83% detection). Paper 1's broader theoretical contributions and timeliness give it greater potential impact.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

gemini-3.1 · 5/27/2026

Paper 2 fundamentally rethinks long-term AI agent memory, shifting the paradigm from basic storage to a novel data-management workload (Governed Evolving Memory). This theoretical framework addresses recurring failure modes in agent architectures and spans multiple fields (AI, databases, data management), promising broad foundational impact. In contrast, Paper 1 offers a valuable but more narrowly focused systems optimization for on-device mobile GUI agents.

vs. GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a novel conceptual framework (GEM) for a fundamental challenge in AI agent design—long-term memory management—that cuts across virtually all AI agent applications. It formalizes a new data-management workload with theoretical guarantees, opening multiple research directions. Paper 1, while rigorous and practically useful, is a benchmark contribution limited to dental AI evaluation. Paper 2's broader applicability across AI systems, databases, and agent architectures, combined with the timeliness of the AI agent paradigm, gives it higher potential for cross-field impact and foundational influence.

vs. Credit Assignment with Resets in Language Model Reasoning

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and timely problem in LLM training—credit assignment in reinforcement learning for reasoning—with concrete algorithmic contributions (RRPO, SRPO), theoretical grounding in CPI, and empirical validation across benchmarks. It directly improves upon widely-used methods like GRPO in the rapidly growing RLVR paradigm. Paper 2 presents an interesting conceptual framework (GEM) for agent memory but is more of a vision/position paper with only a prototype validation, making its near-term scientific impact less certain. Paper 1's methodological contributions are more immediately actionable and relevant to the active LLM reasoning research community.

vs. Generating Robust Portfolios of Optimization Models using Large Language Models

gemini-3.1 · 5/27/2026

Paper 2 proposes a fundamental paradigm shift in AI agent memory, bridging AI and database systems to define a new data-management workload. This foundational rethinking has broad implications for the rapidly growing field of autonomous agents and opens multiple new research directions. In contrast, Paper 1 offers a valuable but more narrowly focused application of LLMs to optimization modeling.

vs. Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

gpt-5.2 · 5/27/2026

Paper 1 offers a more foundational, broadly applicable reframing: long-term agent memory as a new data-management workload with state-trajectory correctness, formal operators, and impossibility-style structural claims, plus a prototype (MemState). This combination of conceptual novelty, formalization, and systems implications can influence databases, agent architectures, and evaluation/auditing practices. Paper 2 is timely and practically relevant (personalization via LoRA consolidation) but is narrower in scope, with limited experimental scale (n=10) and a more incremental methodological core relative to existing continual/personalized fine-tuning work.

vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

claude-opus-4.6 · 5/27/2026

Paper 1 introduces a fundamentally new data-management paradigm (GEM) for AI agent memory, formalizing it as a distinct workload with correctness conditions and proving that existing record-level systems are insufficient. This addresses a critical infrastructure need for the rapidly growing field of long-running AI agents, with broad implications across databases, AI systems, and software engineering. Paper 2, while valuable, is a benchmark contribution for evaluating ToM in LLMs—important but more incremental and narrower in scope. Paper 1's foundational framing has potential to spawn an entire research area in memory-centric data management.

vs. The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

gpt-5.2 · 5/27/2026

Paper 1 identifies a concrete, under-addressed failure mode in retrieval-augmented generation (attribution blind spot) and proposes an empirically tested internal-signal method (CRM) spanning multiple model families, tasks, and interventions—high novelty, rigor, and immediate relevance to trustworthy LLM deployment. Its impact could extend to evaluation, safety, compliance, and interpretability across many RAG systems. Paper 2 offers a compelling conceptual reframing and prototype for agent memory as a new data-management workload, but it is more systems/vision-oriented and likely needs broader empirical validation and adoption to match Paper 1’s near-term cross-field impact.

vs. EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

gpt-5.2 · 5/27/2026

Paper 2 is more novel and field-shaping: it reframes long-term agent memory as a new data-management workload, introduces a formal abstraction (GEM) with operators and correctness conditions, and argues fundamental limitations of record-level systems. This creates a broader research agenda spanning databases, AI systems, and agent safety/auditing, with clear real-world relevance as persistent agents proliferate. While Paper 1 provides a timely and useful benchmark with solid methodology, its impact is narrower (evaluation of coding agents) and more incremental relative to existing benchmark work.