Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman

#1148 of 3355 · Artificial Intelligence
Tournament Score: 1438±48 · Win Rate: 79% (15 wins, 4 losses, 19 matches)
Rating: 7.5/10

Abstract

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Agent Memory Characterization and System Implications

1. Core Contribution

This paper provides the first systematic, systems-level characterization of agent memory workloads for LLM agents. While prior work has benchmarked agent memory systems on downstream task accuracy, this paper shifts the lens to computational cost, latency, energy, storage footprint, and scheduling tradeoffs. The contributions are fourfold: (1) a four-paradigm taxonomy classifying ten agent memory systems along construction, storage, retrieval, and mutability axes; (2) a phase-aware profiling harness that attributes cost to construction, retrieval, and generation; (3) a comprehensive empirical characterization revealing how design choices redistribute cost across write and read paths; and (4) ten concrete system recommendations for deployment.

The key insight is that accuracy benchmarks are insufficient for deployment decisions — systems with similar accuracy can differ by 47× in energy per correct answer and two orders of magnitude in per-query latency. This reframes agent memory selection from a purely ML problem to a systems engineering problem.
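To make the energy-per-correct-answer comparison concrete, here is a minimal Python sketch. It assumes the metric is total energy divided by the number of correctly answered queries; the function name and all numbers are hypothetical, chosen only to illustrate how two systems with equal accuracy can end up 47× apart, and are not measurements from the paper.

```python
def energy_per_correct_answer(total_energy_j: float, num_queries: int, accuracy: float) -> float:
    """Total energy in joules divided by the number of correct answers."""
    correct = num_queries * accuracy
    if correct <= 0:
        raise ValueError("no correct answers; the metric is undefined")
    return total_energy_j / correct


# Hypothetical systems: same accuracy, very different energy budgets.
system_a = energy_per_correct_answer(total_energy_j=6_000.0, num_queries=300, accuracy=0.50)
system_b = energy_per_correct_answer(total_energy_j=282_000.0, num_queries=300, accuracy=0.50)
ratio = system_b / system_a

print(f"A: {system_a:.1f} J/correct, B: {system_b:.1f} J/correct, ratio: {ratio:.0f}x")
```

An accuracy-only leaderboard would rank these two hypothetical systems as tied, which is the gap the review is pointing at.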

2. Methodological Rigor

The experimental methodology is generally strong. The authors evaluate ten systems across two benchmark suites (MemoryAgentBench and MemoryArena), using both remote API (GPT-4o-mini, GPT-4.1-mini) and local serving (Qwen3 model ladder on H100 GPUs via vLLM). The phase-aware profiling harness captures both API telemetry (token counts, call structure, latency) and hardware telemetry (GPU utilization, power, DRAM bandwidth), enabling fine-grained cost attribution.
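The idea of phase-aware attribution can be sketched in a few lines of Python. This is not the authors' harness: the `PhaseProfiler` class and its methods are hypothetical, it records only wall-clock time and caller-reported token counts, and it omits the hardware telemetry (GPU utilization, power, DRAM bandwidth) described above.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Accumulates wall-clock seconds and token counts per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.tokens = defaultdict(int)

    @contextmanager
    def phase(self, name: str):
        # Time everything executed inside the `with` block and charge it to `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def add_tokens(self, name: str, count: int) -> None:
        self.tokens[name] += count

    def report(self) -> dict:
        phases = sorted(set(self.seconds) | set(self.tokens))
        return {p: {"seconds": self.seconds[p], "tokens": self.tokens[p]} for p in phases}


profiler = PhaseProfiler()
with profiler.phase("construction"):
    profiler.add_tokens("construction", 1200)  # stand-in for an ingestion LLM call
with profiler.phase("retrieval"):
    pass                                       # stand-in for a memory-store lookup
with profiler.phase("generation"):
    profiler.add_tokens("generation", 150)     # stand-in for the answer LLM call

print(profiler.report())
```

Wrapping each stage of a memory pipeline this way yields a per-phase breakdown rather than a single end-to-end number, which is what makes write-path and read-path costs separable.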

Several methodological decisions merit scrutiny. The primary characterization centers on LongMemEval_S_*, which contains only five samples with 300 total queries — a relatively small evaluation corpus for drawing deployment recommendations. The authors acknowledge this limitation implicitly by also running the full MAB suite for aggregate analysis. The standardization choices (fixed chunk size, capped retrieval at 10 entries, reduced parameters for Letta) are reasonable but necessarily favor some systems' native operating points over others. The "minimal task-adaptation changes" made for dialogue-oriented systems (SimpleMem, A-Mem, Mem0, Letta) introduce a confound: it becomes harder to distinguish whether poor accuracy reflects memory system limitations or prompt adaptation inadequacy.

The energy measurements using integrated GPU power are credible for relative comparison but exclude host CPU, DRAM, and network energy, which matters more for systems with significant non-GPU work (BM25, graph traversal). The freshness-latency analysis using MemoryArena with a controlled 5-second inter-session arrival schedule is a creative experimental design that reveals genuine scheduling challenges invisible to batch evaluation.
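The scheduling problem behind the freshness-latency analysis can be illustrated with a toy queueing sketch. It assumes a single FIFO construction worker and a fixed per-session construction time, both simplifications introduced here; only the 5-second inter-session arrival interval comes from the setup described above, and the construction times are hypothetical.

```python
def construction_lag(num_sessions: int, interarrival_s: float, construct_s: float) -> list[float]:
    """Seconds between each session's arrival and the moment its memory is queryable,
    assuming one FIFO construction worker."""
    lags = []
    worker_free_at = 0.0
    for i in range(num_sessions):
        arrival = i * interarrival_s
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        worker_free_at = start + construct_s
        lags.append(worker_free_at - arrival)
    return lags


# Construction faster than the arrival interval: lag stays flat.
fast = construction_lag(num_sessions=5, interarrival_s=5.0, construct_s=2.0)
# Construction slower than the arrival interval: lag grows with every session.
slow = construction_lag(num_sessions=5, interarrival_s=5.0, construct_s=8.0)

print(fast)
print(slow)
```

When construction cannot keep up with arrivals, queries issued right after a session either wait for the backlog or read stale memory, a behavior that a batch evaluation with all construction finished up front would not expose.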

3. Potential Impact

This paper has substantial practical impact potential for three audiences:

Infrastructure operators deploying agent memory at scale gain actionable guidance: construction should be treated as background throughput work, separated from latency-sensitive QA serving; construction-LLM downscaling is a viable cost lever with algorithm-specific floors; per-user cost growth slopes (not just initial footprints) should drive system selection for long-lived agents.

Agent memory system designers receive concrete feedback on where their systems' costs concentrate. The finding that construction energy dominates the agent lifecycle (often exceeding total query-phase energy across 300 queries) and that construction is overwhelmingly prefill/embedding-dominated provides clear optimization targets. The bimodal embedding traffic pattern (batch vs. sequential) suggests architectural specialization opportunities.
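The relationship between construction energy and query volume can be worked through with a short amortization sketch. The function names and all joule values below are hypothetical and assume a simple linear cost model (one-time construction energy plus a constant energy per query); they are not figures from the paper.

```python
import math


def amortized_energy_per_query(construct_j: float, per_query_j: float, num_queries: int) -> float:
    """One-time construction energy spread over the queries served, plus per-query energy."""
    return construct_j / num_queries + per_query_j


def break_even_queries(construct_a: float, query_a: float,
                       construct_b: float, query_b: float) -> int:
    """Smallest query count at which system A (costlier to build, cheaper to query)
    uses no more total energy than system B."""
    if construct_a <= construct_b or query_a >= query_b:
        raise ValueError("expects A to cost more to build and less per query than B")
    return math.ceil((construct_a - construct_b) / (query_b - query_a))


# Hypothetical: A builds a structured store (9000 J) and answers cheaply (10 J/query);
# B does no construction and pays 40 J/query.
n_star = break_even_queries(9000.0, 10.0, 0.0, 40.0)
a_at_n = amortized_energy_per_query(9000.0, 10.0, n_star)
b_at_n = amortized_energy_per_query(0.0, 40.0, n_star)

print(n_star, a_at_n, b_at_n)
```

Under this toy model, system A's construction energy exceeds its entire query-phase energy at 300 queries (9000 J versus 3000 J), so the heavier write path only pays off for agents that serve more queries than the break-even count.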

The research community benefits from the taxonomy and profiling methodology as frameworks for evaluating future systems. The observation that no single system is simultaneously best on construction cost, serving latency, and accuracy motivates new designs.
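A multi-objective comparison of this kind can be sketched as a Pareto-dominance check. The system names and metric values below are invented for illustration, and the three axes (construction cost, serving latency, accuracy) are simply those named in the observation above.

```python
from typing import NamedTuple


class SystemPoint(NamedTuple):
    name: str
    construction_cost: float  # lower is better
    latency_s: float          # lower is better
    accuracy: float           # higher is better


def dominates(a: SystemPoint, b: SystemPoint) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.construction_cost <= b.construction_cost
                and a.latency_s <= b.latency_s
                and a.accuracy >= b.accuracy)
    strictly_better = (a.construction_cost < b.construction_cost
                       or a.latency_s < b.latency_s
                       or a.accuracy > b.accuracy)
    return no_worse and strictly_better


def pareto_frontier(points: list[SystemPoint]) -> list[SystemPoint]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical systems, not results from the paper.
systems = [
    SystemPoint("flat", 1.0, 0.2, 0.60),
    SystemPoint("extract", 50.0, 0.1, 0.62),
    SystemPoint("graph", 80.0, 0.5, 0.70),
    SystemPoint("heavy", 90.0, 0.6, 0.65),
]
frontier = [p.name for p in pareto_frontier(systems)]
print(frontier)
```

In this made-up example three systems are non-dominated, each winning on a different axis, and none is best on all three at once.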

The paper's influence extends to adjacent fields: LLM serving systems (motivating construction-aware schedulers), database systems (per-user mutable stores with heterogeneous access patterns), and edge/on-device AI (where energy per correct answer becomes critical).

4. Timeliness & Relevance

This paper is extremely timely. Agent memory systems have proliferated rapidly (Mem0, MemGPT/Letta, GraphRAG, HippoRAG v2 all appeared in 2024-2025), and production deployments are scaling. Yet the field has been guided almost exclusively by accuracy benchmarks, creating a blind spot precisely where operators most need guidance. The paper fills this gap at a moment when deployment decisions are being made.

The freshness-latency tradeoff analysis is particularly relevant as agents move from single-session to multi-session deployment. The scaling analysis (64K to 1M tokens per user) directly addresses the trajectory of real-world deployments where user histories grow continuously.

5. Strengths & Limitations

Key Strengths:

  • *Novel framing*: Treating agent memory as a systems workload rather than an accuracy optimization problem is an important conceptual contribution.
  • *Comprehensive coverage*: Ten systems spanning four paradigms, evaluated under both remote and local serving, with multiple model scales.
  • *Actionable recommendations*: The ten system recommendations are specific, justified by data, and directly applicable.
  • *Energy-normalized metrics*: Energy per correct answer (inspired by Intelligence per Watt) provides a deployment-relevant efficiency metric.
  • *Taxonomy predictive power*: The taxonomy doesn't merely classify — it predicts cost shapes, which are then validated empirically.

Notable Limitations:

  • *Scale of evaluation*: Five samples for primary characterization is thin; results may not generalize to diverse workload mixes.
  • *Single-GPU constraint*: All local experiments use a single H100, leaving multi-GPU and distributed deployment unexplored.
  • *No co-location experiments*: The paper argues construction and QA traffic conflict but doesn't measure interference directly under multiplexed workloads.
  • *Static snapshot*: The systems evaluated are rapidly evolving; characterization results may not persist through version updates.
  • *Limited quality analysis*: The paper's quality evaluation relies on LLM-Judge accuracy, which has known biases, and doesn't deeply analyze failure modes.
  • *BM25 anomaly*: BM25 achieving the highest aggregate accuracy raises questions about benchmark design — are the tasks sufficiently challenging for structured memory to demonstrate its value?

Additional Observations

    The paper's finding that BM25 achieves the highest macro-averaged accuracy while having the lowest cost challenges the field's assumption that more sophisticated memory architectures provide commensurate quality improvements. This is both a strength (honest reporting) and a limitation (suggests the benchmarks may not adequately stress the capabilities that motivate complex systems). The profiling harness, promised to be open-sourced, could become a valuable community resource if maintained.

    Rating: 7.5/10
    Significance: 8 · Rigor: 7 · Novelty: 8 · Clarity: 8.5

    Generated Jun 5, 2026

    Comparison History (19)

    vs. Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
    gemini-3.1 · 6/8/2026

    Paper 1 presents the first comprehensive systems-level characterization and taxonomy of LLM agent memory. Foundational benchmarking and profiling papers in emerging domains typically generate high citation counts and broad impact across both systems and ML communities by setting the standard for future infrastructure. While Paper 2 offers a strong algorithmic improvement for LVLM training, Paper 1 addresses a critical scaling bottleneck for agent deployment with highly practical, fleet-scale recommendations.

    vs. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo
    gpt-5.2 · 6/6/2026

    Paper 2 likely has higher scientific impact: it introduces a broadly applicable conceptual framework (intervention layers) for knowledge infusion across iterative generative models, demonstrates compositional design principles, and provides empirical evidence (large reduction in knowledge violations) in a timely safety/alignment setting. This framing can unify and guide future methods across multimodal/diffusion research and potentially beyond. Paper 1 is valuable and rigorous as a first systems characterization of agent memory with practical recommendations, but its impact is more specialized to LLM-agent infrastructure and benchmarking rather than offering a generalizable theoretical lens across model classes.

    vs. AIP: A Graph Representation for Learning and Governing Agent Skills
    claude-opus-4.6 · 6/6/2026

    Paper 1 provides the first comprehensive systems characterization of agent memory, a foundational infrastructure concern for scaling LLM agents. Its taxonomy, profiling methodology, and 10 actionable system recommendations have broad applicability across the entire agent ecosystem. Paper 2 presents a useful but narrower contribution—a graph-based skill representation with solid empirical gains on a specific benchmark. While Paper 2 is well-executed, Paper 1 addresses a more fundamental and widely relevant problem, offering reusable frameworks and insights that will likely influence system design across many agent architectures.

    vs. Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
    gemini-3.1 · 6/5/2026

    Paper 1 offers a foundational systems-level characterization and taxonomy for LLM agent memory, a critical bottleneck in scaling AI agents. Its broad applicability across various domains and its comprehensive benchmarking are likely to influence a wide range of future architectures, yielding a broader scientific impact compared to the domain-specific (albeit highly effective) distillation method for autonomous driving presented in Paper 2.

    vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development
    gemini-3.1 · 6/5/2026

    Paper 2 offers a foundational systems characterization, taxonomy, and profiling harness for LLM agent memory, an essential and rapidly evolving area of AI research. Its benchmarking and system recommendations provide broad, foundational contributions that researchers across AI and systems will build upon. Paper 1, while demonstrating strong real-world enterprise utility, focuses more narrowly on applied knowledge management for software engineering, making its core scientific impact less foundational compared to the architectural and systems-level insights of Paper 2.

    vs. Learning Adaptive Parallel Execution for Efficient Code Localization
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a foundational challenge in LLM agents (stateful long-horizon memory systems) by providing a novel taxonomy, profiling harness, and system-level recommendations. This foundational systems research has broader applicability across various agentic applications and infrastructure designs. In contrast, Paper 2 focuses on a more specific application (code localization in software development), making Paper 1's potential impact broader and more significant across the wider AI systems community.

    vs. DiG-Plan: Mitigating Early Commitment for Tool-Graph Planning via Diffusion Guidance
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher impact due to a more novel methodological contribution (diffusion-guided propose–refine planning to mitigate AR early commitment) with clear, generalizable implications for tool-use, program synthesis, and combinatorial generation. It provides strong empirical evidence (large Pass@10 coverage jump in controlled study; consistent gains on TaskBench and API-Bank) and an approach that can transfer across domains where search/exploration is a bottleneck. Paper 1 is valuable systems work, but it is primarily characterization and recommendations, and may offer less novelty and less influence on the broader field than a new planning paradigm.

    vs. Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation
    gemini-3.1 · 6/5/2026

    Paper 1 addresses the critical, broadly applicable challenge of LLM agent memory systems, offering a taxonomy, profiling harness, and system-level recommendations. This work has wide-reaching implications for scaling AI agents across numerous domains. Paper 2, while methodologically rigorous, focuses on the highly niche area of TLA+ specification generation, inherently limiting its broader scientific impact compared to the foundational systems research presented in Paper 1.

    vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a foundational systems-level challenge for the rapidly growing field of LLM agents, providing the first systematic characterization of agent memory systems. Its taxonomy, profiling framework, and actionable system recommendations have broad applicability across the entire agent ecosystem, impacting infrastructure design at scale. Paper 1, while novel in combining world models with VLA for UAV navigation, targets a narrower application domain (urban UAV navigation) and introduces a benchmark specific to that niche. Paper 2's breadth of impact across the booming LLM agent field gives it higher potential scientific impact.

    vs. PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models
    gemini-3.1 · 6/5/2026

    Paper 1 offers the first systems-level characterization and taxonomy of an emerging, critical infrastructure (agent memory). By establishing benchmarks, profiling harnesses, and foundational system recommendations, it is likely to broadly influence both future AI research and practical, fleet-scale engineering. Paper 2 presents a strong specific methodology for VLMs, but its scope is narrower compared to the foundational systems guidelines provided by Paper 1, which will guide the broader development of autonomous agents.

    vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves
    claude-opus-4.6 · 6/5/2026

    Paper 1 identifies a fundamental and surprising mechanism—self-correction failure in LLMs is a chat-template artifact, not a capability deficit—which challenges prevailing assumptions about LLM reasoning limitations. It offers a training-free, immediately deployable intervention with large effect sizes across multiple model families. This insight has broad implications for LLM agent design, RLHF training, and prompt engineering. Paper 2 provides a valuable systems characterization of agent memory but is more incremental (taxonomy, profiling, benchmarking) without a similarly transformative finding. Paper 1's causal mechanistic insight is more likely to reshape research directions.

    vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains
    gemini-3.1 · 6/5/2026

    Paper 2 addresses a critical bottleneck in the rapidly expanding field of LLM agents by providing the first system-level characterization of agent memory. By introducing a new taxonomy, profiling harness, and actionable systems recommendations, it establishes foundational knowledge that will broadly influence AI systems design. In contrast, Paper 1 offers a valuable but more narrowly focused application of DRL to pharmaceutical supply chain management, resulting in lower potential breadth and overall scientific impact compared to the foundational systems research in Paper 2.

    vs. Brick-Composer: Using MLLMs for Assembly with Diverse Bricks
    claude-opus-4.6 · 6/5/2026

    Paper 1 addresses a foundational systems-level challenge for the rapidly growing LLM agent ecosystem. Its systematic taxonomy, profiling methodology, and actionable design recommendations for agent memory have broad applicability across all agent-based AI systems, influencing infrastructure decisions at scale. Paper 2, while creative in applying MLLMs to brick assembly, targets a narrower robotics/embodied AI niche with modest results (~15% step success). Paper 1's timeliness, breadth of impact across the entire agent systems community, and practical utility for deployment give it significantly higher potential impact.

    vs. Severity-Aware Curriculum Learning with Multi-Model Response Selection for Medical Text Generation
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a fundamental infrastructure challenge for LLM agents—memory management in long-horizon tasks—which is a rapidly growing area with broad implications across AI systems. It provides the first systems-level characterization with a taxonomy, profiling harness, and actionable recommendations, making it highly impactful for the entire agent ecosystem. Paper 1, while useful, applies relatively standard techniques (curriculum learning, multi-model selection) to a narrower medical NLP task with incremental improvements on a single dataset and metric, limiting its broader impact.

    vs. A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR
    gpt-5.2 · 6/5/2026

    Paper 1 is more scientifically novel and field-shaping: it formalizes a widely used but biased RLVR estimand, provides a provable causal decomposition, and validates it via preregistered factorial experiments plus re-audits of prior results, yielding a reusable audit harness. This combination of theory + rigorous experimental design directly affects how alignment/RLHF-style results are interpreted, potentially correcting conclusions across many papers. Paper 2 is timely and practically useful, but is primarily a systems characterization/taxonomy with recommendations; impactful for engineering practice yet less likely to redefine core scientific understanding.

    vs. Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a foundational systems-level challenge for LLM agents—memory management at scale—which is broadly relevant as agents become widely deployed. Its comprehensive taxonomy, profiling framework, and actionable system recommendations have broad applicability across the growing agent ecosystem. Paper 1, while introducing a clever reward-free probe for implicit reward hacking, is narrower in scope: it tests on a single model/dataset pair and addresses a specific alignment auditing niche. Paper 2's breadth of impact, timeliness given the agent deployment wave, and practical engineering utility give it higher potential impact.

    vs. Simulate, Reason, Decide: Scientific Reasoning with LLMs for Simulation-Driven Decision Making
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher impact due to timeliness and breadth: scalable long-horizon agent memory is a near-term bottleneck across most deployed agent applications, and a first systems-level characterization with taxonomy, profiling harness, comparative evaluation across many systems, and actionable recommendations can directly influence both research and production stacks. Its methodological rigor (measurement framework + multi-system benchmarking) and broad applicability across domains and infrastructure layers suggest wider adoption. Paper 1 is novel and valuable for simulator-grounded transparency, but its impact may be narrower to simulation-driven decision workflows and depends on adoption of its schema and neuro-symbolic pipeline.

    vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
    gemini-3.1 · 6/5/2026

    Paper 2 provides a foundational systems-level characterization of LLM agent memory, a highly timely and rapidly expanding area. By introducing a taxonomy, profiling harness, and evaluating multiple systems, it offers broad applicability across AI and systems research. While Paper 1 makes a strong contribution to autonomous driving, Paper 2's insights into scalable LLM agents will likely influence a wider range of applications, architectures, and future research directions, resulting in a broader overall scientific impact.

    vs. Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher scientific impact due to broader, more immediate applicability and cross-field relevance. It provides the first systems-level characterization of agent memory, introduces a taxonomy, profiling methodology, and evaluates 10 representative systems, yielding actionable recommendations for deploying long-horizon LLM agents at scale—highly timely for industry and research. Paper 1 is novel and strong (knowledge-gap localization, user study), but its impact is narrower to interpretability/education-style assistance and depends on task-specific knowledge representations, whereas Paper 2’s insights generalize across many agent architectures and deployments.