Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads
Yasmine Omri, Ziyu Gan, Zachary Broveak, Robin Geens, Zexue He, Alex Pentland, Marian Verhelst, Tsachy Weissman
Abstract
LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.
AI Impact Assessments
(1 model)
Scientific Impact Assessment: Agent Memory Characterization and System Implications
1. Core Contribution
This paper provides the first systematic, systems-level characterization of agent memory workloads for LLM agents. While prior work has benchmarked agent memory systems on downstream task accuracy, this paper shifts the lens to computational cost, latency, energy, storage footprint, and scheduling tradeoffs. The contributions are fourfold: (1) a four-paradigm taxonomy classifying ten agent memory systems along construction, storage, retrieval, and mutability axes; (2) a phase-aware profiling harness that attributes cost to construction, retrieval, and generation; (3) a comprehensive empirical characterization revealing how design choices redistribute cost across write and read paths; and (4) ten concrete system recommendations for deployment.
The key insight is that accuracy benchmarks are insufficient for deployment decisions — systems with similar accuracy can differ by 47× in energy per correct answer and two orders of magnitude in per-query latency. This reframes agent memory selection from a purely ML problem to a systems engineering problem.
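To make the "energy per correct answer" metric concrete, here is a minimal sketch. It assumes the metric is total measured energy divided by the number of correctly answered queries, which may differ from the paper's exact definition. The function name and every number are hypothetical and are not taken from the paper.

```python
def energy_per_correct_answer(total_energy_joules: float, num_correct: int) -> float:
    """Total energy divided by the number of correct answers (assumed definition)."""
    if num_correct <= 0:
        raise ValueError("num_correct must be positive")
    return total_energy_joules / num_correct


# Hypothetical measurements for two systems with similar accuracy.
# These values are illustrative only; they are NOT results from the paper.
system_a = energy_per_correct_answer(total_energy_joules=6000.0, num_correct=150)
system_b = energy_per_correct_answer(total_energy_joules=400.0, num_correct=160)

# Ratio of the two costs: similar accuracy, very different energy per correct answer.
ratio = system_a / system_b
print(f"A: {system_a} J, B: {system_b} J, ratio: {ratio}x")
```

Normalizing by correct answers rather than by total queries penalizes a system that spends energy on answers that turn out to be wrong.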
2. Methodological Rigor
The experimental methodology is generally strong. The authors evaluate ten systems across two benchmark suites (MemoryAgentBench and MemoryArena), using both remote API (GPT-4o-mini, GPT-4.1-mini) and local serving (Qwen3 model ladder on H100 GPUs via vLLM). The phase-aware profiling harness captures both API telemetry (token counts, call structure, latency) and hardware telemetry (GPU utilization, power, DRAM bandwidth), enabling fine-grained cost attribution.
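The idea of phase-aware attribution can be sketched as follows. This is not the authors' harness: `PhaseProfiler` and its methods are hypothetical names, and the sketch records only wall-clock time and token counts, not the GPU or DRAM telemetry the paper collects.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseProfiler:
    """Hypothetical helper that accumulates cost per named phase."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.tokens = defaultdict(int)

    @contextmanager
    def phase(self, name):
        # Attribute all wall-clock time spent inside the block to `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start

    def add_tokens(self, name, count):
        # Token counts would come from API or serving telemetry in practice.
        self.tokens[name] += count

    def report(self):
        names = sorted(set(self.seconds) | set(self.tokens))
        return {
            n: {"seconds": self.seconds.get(n, 0.0), "tokens": self.tokens.get(n, 0)}
            for n in names
        }


profiler = PhaseProfiler()
with profiler.phase("construction"):
    profiler.add_tokens("construction", 1200)  # placeholder for memory-building calls
with profiler.phase("retrieval"):
    pass  # placeholder for a memory lookup
with profiler.phase("generation"):
    profiler.add_tokens("generation", 150)  # placeholder for the answer call
report = profiler.report()
print(report)
```

Keeping the phase name explicit at each call site is what lets cost be split between the write path (construction) and the read path (retrieval and generation).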
Several methodological decisions merit scrutiny. The primary characterization centers on LongMemEval_S_*, which contains only five samples with 300 total queries, a relatively small evaluation corpus for drawing deployment recommendations. The authors acknowledge this limitation implicitly by also running the full MemoryAgentBench (MAB) suite for aggregate analysis. The standardization choices (fixed chunk size, retrieval capped at 10 entries, reduced parameters for Letta) are reasonable but necessarily favor some systems' native operating points over others. The "minimal task-adaptation changes" made for dialogue-oriented systems (SimpleMem, A-Mem, Mem0, Letta) introduce a confound: it becomes harder to tell whether poor accuracy reflects limitations of the memory system or an inadequate prompt adaptation.
The energy measurements using integrated GPU power are credible for relative comparison but exclude host CPU, DRAM, and network energy, which matters more for systems with significant non-GPU work (BM25, graph traversal). The freshness-latency analysis using MemoryArena with a controlled 5-second inter-session arrival schedule is a creative experimental design that reveals genuine scheduling challenges invisible to batch evaluation.
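Energy from integrated GPU power can be illustrated with a trapezoidal sum over sampled power readings. This is a sketch under assumptions: the sampling source, the sample values, and the function name are hypothetical. As the paragraph above notes, GPU-only readings leave out host CPU, DRAM, and network energy.

```python
def integrate_power(timestamps_s, power_w):
    """Trapezoidal integration of power samples (watts) over time (seconds) -> joules."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    energy_j = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j


# Hypothetical one-second samples covering a single phase (not measured data).
timestamps = [0.0, 1.0, 2.0, 3.0]
gpu_power = [100.0, 300.0, 300.0, 100.0]
phase_energy = integrate_power(timestamps, gpu_power)
print(phase_energy)  # 700.0 joules for these illustrative samples
```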
3. Potential Impact
This paper has substantial practical impact potential for three audiences:
Infrastructure operators deploying agent memory at scale gain actionable guidance: construction should be treated as background throughput work, separated from latency-sensitive QA serving; construction-LLM downscaling is a viable cost lever with algorithm-specific floors; per-user cost growth slopes (not just initial footprints) should drive system selection for long-lived agents.
Agent memory system designers receive concrete feedback on where their systems' costs concentrate. The finding that construction energy dominates the agent lifecycle (often exceeding total query-phase energy across 300 queries) and that construction is overwhelmingly prefill/embedding-dominated provides clear optimization targets. The bimodal embedding traffic pattern (batch vs. sequential) suggests architectural specialization opportunities.
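The amortization argument can be made concrete with a simple linear cost model: lifecycle energy equals construction energy plus per-query energy times query count. The sketch below assumes that model; the function names and all figures are hypothetical and do not come from the paper.

```python
import math


def lifecycle_energy(construction_j, per_query_j, num_queries):
    """Assumed linear model: one-time construction plus a fixed cost per query."""
    return construction_j + per_query_j * num_queries


def break_even_queries(c_a, q_a, c_b, q_b):
    """Smallest query count at which system A costs no more than system B.

    Returns None if A never catches up under the linear model.
    """
    if c_a <= c_b:
        return 0
    if q_a >= q_b:
        return None
    return math.ceil((c_a - c_b) / (q_b - q_a))


# Hypothetical systems: A builds an expensive memory but answers cheaply,
# B builds almost nothing but pays more per query. Illustrative values only.
n_star = break_even_queries(c_a=5000.0, q_a=2.0, c_b=500.0, q_b=12.0)
a_at_300 = lifecycle_energy(5000.0, 2.0, 300)
b_at_300 = lifecycle_energy(500.0, 12.0, 300)
print(n_star, a_at_300, b_at_300)
```

With these made-up numbers, system A needs 450 queries to pay back its construction cost, so at 300 queries it is still the more expensive option.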
The research community benefits from the taxonomy and profiling methodology as frameworks for evaluating future systems. The observation that no single system Pareto-dominates the others across construction cost, serving latency, and accuracy motivates new designs.
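The three-way tradeoff among construction cost, serving latency, and accuracy can be examined with a Pareto-dominance check. The sketch below is illustrative: the system labels and metric values are hypothetical, and lower cost, lower latency, and higher accuracy are assumed to be better.

```python
def dominates(a, b):
    """True if a is at least as good as b on every metric and strictly better on one.

    Each tuple is (construction_cost, latency, accuracy).
    """
    no_worse = a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[0] < b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better


def pareto_front(systems):
    """Return the sorted names of systems that no other system dominates."""
    front = []
    for name, metrics in systems.items():
        dominated = any(
            dominates(other, metrics)
            for other_name, other in systems.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Hypothetical metrics: (construction cost, latency in seconds, accuracy).
systems = {
    "A": (1.0, 0.2, 0.60),
    "B": (50.0, 0.5, 0.65),
    "C": (80.0, 5.0, 0.62),
    "D": (60.0, 0.6, 0.64),
}
front = pareto_front(systems)
print(front)  # ['A', 'B']
```

Here A and B both sit on the frontier because each wins on a different metric, while C and D are dominated by B.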
The paper's influence extends to adjacent fields: LLM serving systems (motivating construction-aware schedulers), database systems (per-user mutable stores with heterogeneous access patterns), and edge/on-device AI (where energy per correct answer becomes critical).
4. Timeliness & Relevance
This paper is extremely timely. Agent memory systems have proliferated rapidly (Mem0, MemGPT/Letta, GraphRAG, HippoRAG v2 all appeared in 2024-2025), and production deployments are scaling. Yet the field has been guided almost exclusively by accuracy benchmarks, creating a blind spot precisely where operators most need guidance. The paper fills this gap at a moment when deployment decisions are being made.
The freshness-latency tradeoff analysis is particularly relevant as agents move from single-session to multi-session deployment. The scaling analysis (64K to 1M tokens per user) directly addresses the trajectory of real-world deployments where user histories grow continuously.
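The freshness side of this tradeoff can be illustrated with a toy queueing model: sessions arrive at a fixed interval (5 seconds in the paper's MemoryArena setup), and one worker builds memory for each session in arrival order. The single-worker assumption, the fixed construction time, and the function name are hypothetical simplifications, not the paper's method.

```python
def construction_lag(num_sessions, arrival_interval_s, construction_time_s):
    """Per-session delay between arrival and the moment its memory is ready.

    Assumes one FIFO construction worker and a constant construction time.
    """
    lags = []
    worker_free_at = 0.0
    for i in range(num_sessions):
        arrival = i * arrival_interval_s
        start = max(arrival, worker_free_at)  # wait if the worker is still busy
        finish = start + construction_time_s
        worker_free_at = finish
        lags.append(finish - arrival)
    return lags


# Construction faster than the arrival interval: lag stays constant.
fast = construction_lag(num_sessions=4, arrival_interval_s=5.0, construction_time_s=3.0)
# Construction slower than the arrival interval: backlog and staleness grow.
slow = construction_lag(num_sessions=4, arrival_interval_s=5.0, construction_time_s=8.0)
print(fast)  # [3.0, 3.0, 3.0, 3.0]
print(slow)  # [8.0, 11.0, 14.0, 17.0]
```

Once construction takes longer than the gap between sessions, each new session waits behind a growing backlog, an effect that batch evaluation does not expose.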
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's finding that BM25 achieves the highest macro-averaged accuracy while having the lowest cost challenges the field's assumption that more sophisticated memory architectures provide commensurate quality improvements. This is both a strength (honest reporting) and a limitation (suggests the benchmarks may not adequately stress the capabilities that motivate complex systems). The profiling harness, promised to be open-sourced, could become a valuable community resource if maintained.
Generated Jun 5, 2026
Comparison History (19)
Paper 1 presents the first comprehensive systems-level characterization and taxonomy of LLM agent memory. Foundational benchmarking and profiling papers in emerging domains typically generate high citation counts and broad impact across both systems and ML communities by setting the standard for future infrastructure. While Paper 2 offers a strong algorithmic improvement for LVLM training, Paper 1 addresses a critical scaling bottleneck for agent deployment with highly practical, fleet-scale recommendations.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable conceptual framework (intervention layers) for knowledge infusion across iterative generative models, demonstrates compositional design principles, and provides empirical evidence (large reduction in knowledge violations) in a timely safety/alignment setting. This framing can unify and guide future methods across multimodal/diffusion research and potentially beyond. Paper 1 is valuable and rigorous as a first systems characterization of agent memory with practical recommendations, but its impact is more specialized to LLM-agent infrastructure and benchmarking rather than offering a generalizable theoretical lens across model classes.
Paper 1 provides the first comprehensive systems characterization of agent memory, a foundational infrastructure concern for scaling LLM agents. Its taxonomy, profiling methodology, and 10 actionable system recommendations have broad applicability across the entire agent ecosystem. Paper 2 presents a useful but narrower contribution—a graph-based skill representation with solid empirical gains on a specific benchmark. While Paper 2 is well-executed, Paper 1 addresses a more fundamental and widely relevant problem, offering reusable frameworks and insights that will likely influence system design across many agent architectures.
Paper 1 offers a foundational systems-level characterization and taxonomy for LLM agent memory, a critical bottleneck in scaling AI agents. Its broad applicability across various domains and its comprehensive benchmarking are likely to influence a wide range of future architectures, yielding a broader scientific impact compared to the domain-specific (albeit highly effective) distillation method for autonomous driving presented in Paper 2.
Paper 2 offers a foundational systems characterization, taxonomy, and profiling harness for LLM agent memory, an essential and rapidly evolving area of AI research. Its benchmarking and system recommendations provide broad, foundational contributions that researchers across AI and systems will build upon. Paper 1, while demonstrating strong real-world enterprise utility, focuses more narrowly on applied knowledge management for software engineering, making its core scientific impact less foundational compared to the architectural and systems-level insights of Paper 2.
Paper 1 addresses a foundational challenge in LLM agents (stateful long-horizon memory systems) by providing a novel taxonomy, profiling harness, and system-level recommendations. This foundational systems research has broader applicability across various agentic applications and infrastructure designs. In contrast, Paper 2 focuses on a more specific application (code localization in software development), making Paper 1's potential impact broader and more significant across the wider AI systems community.
Paper 2 has higher likely impact due to a more novel methodological contribution (diffusion-guided propose–refine planning to mitigate AR early commitment) with clear, generalizable implications for tool-use, program synthesis, and combinatorial generation. It provides strong empirical evidence (large Pass@10 coverage jump in controlled study; consistent gains on TaskBench and API-Bank) and an approach that can transfer across domains where search/exploration is a bottleneck. Paper 1 is valuable systems work, but it is primarily characterization and recommendations, and may offer less novelty and narrower field-wide influence than a new planning paradigm.
Paper 1 addresses the critical, broadly applicable challenge of LLM agent memory systems, offering a taxonomy, profiling harness, and system-level recommendations. This work has wide-reaching implications for scaling AI agents across numerous domains. Paper 2, while methodologically rigorous, focuses on the highly niche area of TLA+ specification generation, inherently limiting its broader scientific impact compared to the foundational systems research presented in Paper 1.
Paper 2 addresses a foundational systems-level challenge for the rapidly growing field of LLM agents, providing the first systematic characterization of agent memory systems. Its taxonomy, profiling framework, and actionable system recommendations have broad applicability across the entire agent ecosystem, impacting infrastructure design at scale. Paper 1, while novel in combining world models with VLA for UAV navigation, targets a narrower application domain (urban UAV navigation) and introduces a benchmark specific to that niche. Paper 2's breadth of impact across the booming LLM agent field gives it higher potential scientific impact.
Paper 1 offers the first systems-level characterization and taxonomy of an emerging, critical infrastructure (agent memory). By establishing benchmarks, profiling harnesses, and foundational system recommendations, it is likely to broadly influence both future AI research and practical, fleet-scale engineering. Paper 2 presents a strong specific methodology for VLMs, but its scope is narrower compared to the foundational systems guidelines provided by Paper 1, which will guide the broader development of autonomous agents.
Paper 1 identifies a fundamental and surprising mechanism—self-correction failure in LLMs is a chat-template artifact, not a capability deficit—which challenges prevailing assumptions about LLM reasoning limitations. It offers a training-free, immediately deployable intervention with large effect sizes across multiple model families. This insight has broad implications for LLM agent design, RLHF training, and prompt engineering. Paper 2 provides a valuable systems characterization of agent memory but is more incremental (taxonomy, profiling, benchmarking) without a similarly transformative finding. Paper 1's causal mechanistic insight is more likely to reshape research directions.
Paper 2 addresses a critical bottleneck in the rapidly expanding field of LLM agents by providing the first system-level characterization of agent memory. By introducing a new taxonomy, profiling harness, and actionable systems recommendations, it establishes foundational knowledge that will broadly influence AI systems design. In contrast, Paper 1 offers a valuable but more narrowly focused application of DRL to pharmaceutical supply chain management, resulting in lower potential breadth and overall scientific impact compared to the foundational systems research in Paper 2.
Paper 1 addresses a foundational systems-level challenge for the rapidly growing LLM agent ecosystem. Its systematic taxonomy, profiling methodology, and actionable design recommendations for agent memory have broad applicability across all agent-based AI systems, influencing infrastructure decisions at scale. Paper 2, while creative in applying MLLMs to brick assembly, targets a narrower robotics/embodied AI niche with modest results (~15% step success). Paper 1's timeliness, breadth of impact across the entire agent systems community, and practical utility for deployment give it significantly higher potential impact.
Paper 2 addresses a fundamental infrastructure challenge for LLM agents—memory management in long-horizon tasks—which is a rapidly growing area with broad implications across AI systems. It provides the first systems-level characterization with a taxonomy, profiling harness, and actionable recommendations, making it highly impactful for the entire agent ecosystem. Paper 1, while useful, applies relatively standard techniques (curriculum learning, multi-model selection) to a narrower medical NLP task with incremental improvements on a single dataset and metric, limiting its broader impact.
Paper 1 is more scientifically novel and field-shaping: it formalizes a widely used but biased RLVR estimand, provides a provable causal decomposition, and validates it via preregistered factorial experiments plus re-audits of prior results, yielding a reusable audit harness. This combination of theory + rigorous experimental design directly affects how alignment/RLHF-style results are interpreted, potentially correcting conclusions across many papers. Paper 2 is timely and practically useful, but is primarily a systems characterization/taxonomy with recommendations; impactful for engineering practice yet less likely to redefine core scientific understanding.
Paper 2 addresses a foundational systems-level challenge for LLM agents—memory management at scale—which is broadly relevant as agents become widely deployed. Its comprehensive taxonomy, profiling framework, and actionable system recommendations have broad applicability across the growing agent ecosystem. Paper 1, while introducing a clever reward-free probe for implicit reward hacking, is narrower in scope: it tests on a single model/dataset pair and addresses a specific alignment auditing niche. Paper 2's breadth of impact, timeliness given the agent deployment wave, and practical engineering utility give it higher potential impact.
Paper 2 likely has higher impact due to timeliness and breadth: scalable long-horizon agent memory is a near-term bottleneck across most deployed agent applications, and a first systems-level characterization with taxonomy, profiling harness, comparative evaluation across many systems, and actionable recommendations can directly influence both research and production stacks. Its methodological rigor (measurement framework + multi-system benchmarking) and broad applicability across domains and infrastructure layers suggest wider adoption. Paper 1 is novel and valuable for simulator-grounded transparency, but its impact may be narrower to simulation-driven decision workflows and depends on adoption of its schema and neuro-symbolic pipeline.
Paper 2 provides a foundational systems-level characterization of LLM agent memory, a highly timely and rapidly expanding area. By introducing a taxonomy, profiling harness, and evaluating multiple systems, it offers broad applicability across AI and systems research. While Paper 1 makes a strong contribution to autonomous driving, Paper 2's insights into scalable LLM agents will likely influence a wider range of applications, architectures, and future research directions, resulting in a broader overall scientific impact.
Paper 2 likely has higher scientific impact due to broader, more immediate applicability and cross-field relevance. It provides the first systems-level characterization of agent memory, introduces a taxonomy, profiling methodology, and evaluates 10 representative systems, yielding actionable recommendations for deploying long-horizon LLM agents at scale—highly timely for industry and research. Paper 1 is novel and strong (knowledge-gap localization, user study), but its impact is narrower to interpretability/education-style assistance and depends on task-specific knowledge representations, whereas Paper 2’s insights generalize across many agent architectures and deployments.