Runzhe Wang, Huilin Lu, Shengjie Liu, Li Dong, Jason Zhu
Large Language Models (LLMs) show promise as tool-using agents but remain limited in long-horizon tasks that require remembering, organizing, and reusing knowledge. Prior memory approaches aim to resolve the situation, but mainly focus on storing factual information. Recent work on procedural memory improves task reuse, yet often reduces to replaying past successes without addressing failure cases or online scalability. We introduce a unified and automatic memory framework that integrates semantic, episodic, and procedural memory in a bi-level design combining short-term and long-term stores. A multi-agent architecture with actor, memory, and critic agents enables automatic memory generation, reward annotation, and adaptive retrieval. Long-term memory is managed through reward-based evaluation, merging, and pruning, ensuring scalability and continual improvement. Experiments across various environments show that our approach improves robustness and success on long multi-turn tasks compared to existing baselines. This work highlights the importance of comprehensive, adaptive memory for advancing LLM-based agents.
AdMem proposes a unified memory framework for LLM-based task-solving agents that integrates three cognitive memory types—semantic, episodic, and procedural—within a bi-level (short-term/long-term) architecture. The system operates through a multi-agent design comprising an actor agent (task execution), a critic agent (reward annotation and reflection generation), and a memory agent (storage and retrieval management). The key claimed novelties are: (a) procedural memory that captures both successes and failures with reward annotations, (b) a bandit-style evaluation mechanism for memory entry effectiveness using an EM-based update rule, (c) stack-based short-term context management inspired by program execution stacks, and (d) automatic memory pruning and merging based on retrieval statistics.
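The stack-based short-term context in item (c) can be pictured with a minimal sketch. The summary above does not specify the paper's actual data structures, so everything here is an assumption made for illustration: the names `Frame` and `ContextStack`, the rule that a finished subtask collapses into a one-line summary in its parent frame, and the rendering format are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """One subtask's working context (hypothetical structure)."""
    goal: str
    steps: list = field(default_factory=list)


class ContextStack:
    """Short-term context organised like a program call stack (illustrative only)."""

    def __init__(self, root_goal: str):
        self.frames = [Frame(root_goal)]

    def push(self, subgoal: str) -> None:
        # Entering a subtask opens a fresh frame, like a function call.
        self.frames.append(Frame(subgoal))

    def record(self, note: str) -> None:
        # Observations and actions are attached to the innermost frame.
        self.frames[-1].steps.append(note)

    def pop(self, summary: str) -> Frame:
        # Leaving a subtask discards its detailed steps and keeps only a
        # summary in the parent frame, like a return value.
        if len(self.frames) == 1:
            raise RuntimeError("cannot pop the root frame")
        finished = self.frames.pop()
        self.frames[-1].steps.append(f"[{finished.goal}] {summary}")
        return finished

    def render(self) -> str:
        # Text that would be placed in the actor's prompt.
        lines = []
        for depth, frame in enumerate(self.frames):
            indent = "  " * depth
            lines.append(f"{indent}goal: {frame.goal}")
            lines.extend(f"{indent}  - {step}" for step in frame.steps)
        return "\n".join(lines)


stack = ContextStack("book a trip")
stack.push("find a flight")
stack.record("searched three airlines")
stack.pop("booked the cheapest direct flight")
print(stack.render())
```

Under these assumptions the prompt size tracks the depth of the active subtask chain rather than the full interaction history, which is the usual motivation for borrowing the call-stack discipline.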
The paper addresses a genuine gap: most prior procedural memory systems (e.g., AWM, Memp) store only successful trajectories and lack mechanisms for online memory management, credit assignment, or learning from failures. The formalization as a POMDP with memory as agent state is reasonable, though not deeply exploited theoretically.
Strengths in design: The reward-based memory evaluation model is a thoughtful contribution. The binary stochastic model where reward is a disjunction of individual memory helpfulness indicators, combined with cosine similarity weighting, provides a principled way to estimate memory entry value. The EM update for parameters is a reasonable lightweight approach.
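To make the evaluation model concrete, here is a hedged sketch of a noisy-OR reward model with an incremental EM-style update. The review does not give the paper's exact parameterization, so this is one plausible reading rather than the authors' method. It assumes each retrieved entry i helps with probability s_i * theta_i, where s_i is its cosine similarity clamped to [0, 1] and theta_i is its learned helpfulness, and that the binary reward is the OR of those indicators. The names `MemoryStats`, `success_probability`, and `em_step`, and the pseudo-count prior, are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryStats:
    """Expected counts for one memory entry (hypothetical bookkeeping)."""
    helped: float = 1.0    # prior pseudo-count: entry was relevant and helped
    relevant: float = 2.0  # prior pseudo-count: entry was relevant at all

    @property
    def theta(self) -> float:
        # Estimated probability that the entry helps, given it is relevant.
        return self.helped / self.relevant


def clamp01(x: float) -> float:
    # Cosine similarity can be negative; treat it as a probability in [0, 1].
    return min(1.0, max(0.0, x))


def success_probability(stats, sims) -> float:
    """Noisy-OR: P(reward = 1) = 1 - prod_i (1 - s_i * theta_i)."""
    return 1.0 - math.prod(1.0 - clamp01(s) * m.theta for m, s in zip(stats, sims))


def em_step(stats, sims, reward: int) -> None:
    """One incremental EM-style update from a single episode's binary reward."""
    sims = [clamp01(s) for s in sims]
    thetas = [m.theta for m in stats]
    p = [s * t for s, t in zip(sims, thetas)]
    p_success = 1.0 - math.prod(1.0 - pi for pi in p)
    for i, m in enumerate(stats):
        # Probability that at least one other retrieved entry helped.
        p_others = 1.0 - math.prod(1.0 - pj for j, pj in enumerate(p) if j != i)
        rel_no_help = sims[i] * (1.0 - thetas[i])  # relevant but not helpful
        if reward:
            if p_success <= 0.0:
                continue  # success cannot be attributed to any retrieved entry
            # E-step: posterior that entry i helped, given overall success.
            q_help = p[i] / p_success
            q_rel_no_help = rel_no_help * p_others / p_success
        else:
            # Failure means no retrieved entry helped.
            q_help = 0.0
            q_rel_no_help = rel_no_help / (1.0 - p[i])
        # M-step (incremental): theta is the ratio of accumulated expected counts.
        m.helped += q_help
        m.relevant += q_help + q_rel_no_help


entries = [MemoryStats(), MemoryStats()]
em_step(entries, sims=[0.9, 0.1], reward=1)
print([round(e.theta, 3) for e in entries])
```

In this sketch a success shifts most of the credit to the entry with the higher similarity, while a failure lowers the estimate of every retrieved entry in proportion to how relevant it was. This is the bandit-style credit assignment that the review credits to the paper, shown under the stated assumptions only.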
Weaknesses in evaluation: The experimental evaluation has several notable shortcomings:
The cost analysis is minimal: the paper states that "per-step time cost is 2 times that for vanilla LLM with 3 times LLM calling" but does not systematically measure or report this overhead across domains.
The paper addresses an important practical problem: building agents that genuinely learn and improve from experience in online, multi-task settings. The integration of failure-aware procedural memory with reward-based management is a step toward more robust agent systems. If the approach generalizes, it could benefit:
However, the practical impact is tempered by the computational overhead (3x LLM calls per step), the limited demonstration of scalability, and the absence of production-scale experiments.
The paper is well-timed. LLM-based agents are rapidly moving from research demonstrations to production deployments, and memory management is indeed a critical bottleneck. The cognitive science framing (semantic/episodic/procedural taxonomy) aligns with growing interest in cognitive architectures for LLM agents (Sumers et al., 2023). The emphasis on online, lifelong learning rather than offline batch processing reflects genuine deployment needs.
The focus on the theory-practice gap in agent memory is appropriate—many conceptual frameworks exist but few have been implemented with all components functioning together.
The paper's theoretical framing is stronger than its experimental validation. The POMDP formulation and the connection to cognitive science are well-articulated, but the experiments don't fully test the theoretical claims. The improvement on PDDL (33.3% → 76.7%) is striking but unexplained. Domain-level analysis of why memory helps dramatically in some settings but not others would significantly strengthen the contribution.
The multi-agent overhead is acknowledged but not deeply studied. In production settings, the 3x LLM cost would be a significant consideration that merits more thorough analysis.
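The overhead figures quoted in this review (3x LLM calls and 2x per-step time) can be turned into a back-of-the-envelope estimate. The sketch below assumes a vanilla agent makes exactly one LLM call per step and that both multipliers stay constant over an episode. The function name `episode_cost` and the example step count and latency are hypothetical inputs, not measurements from the paper.

```python
def episode_cost(steps: int, base_latency_s: float,
                 calls_per_step: int = 3, time_factor: float = 2.0) -> dict:
    """Rough per-episode overhead under the multipliers quoted in the review."""
    vanilla_calls = steps  # assumption: one LLM call per step for the vanilla agent
    memory_calls = steps * calls_per_step
    return {
        "vanilla_calls": vanilla_calls,
        "memory_calls": memory_calls,
        "extra_calls": memory_calls - vanilla_calls,
        "vanilla_time_s": steps * base_latency_s,
        "memory_time_s": steps * base_latency_s * time_factor,
    }


# Hypothetical episode: 30 steps at 2 seconds per vanilla step.
print(episode_cost(steps=30, base_latency_s=2.0))
```

A per-domain table of measured step counts and latencies would replace the hypothetical inputs here, which is the systematic reporting this review finds missing.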
Generated Jun 8, 2026
Paper 1 addresses a fundamental bottleneck in LLM agents—long-term, adaptive memory across semantic, episodic, and procedural domains. Its core architectural advancements have broad applicability across any task-solving agent framework, significantly advancing general agentic capabilities. While Paper 2 presents a highly useful tool for automating embodied AI benchmarks, its impact is more specialized to benchmark generation and embodied AI, making Paper 1's core algorithmic contributions more broadly influential across the wider AI field.
Paper 2 likely has higher impact: it targets a core scaling bottleneck (quadratic attention) with an actionable, efficient recipe to convert existing SA models to linear-complexity SWA and recover performance via RL. This is timely given long-context demand and offers clear real-world deployment benefits (cheaper inference) and broad relevance across LLM architectures beyond math. Methodologically, it provides a concrete hypothesis (data-architecture mismatch) and an empirical demonstration that RL alters conclusions about SWA viability. Paper 1 is useful for agents, but memory frameworks are crowded and impact may be more incremental/less general.
Paper 1 addresses a highly timely and critical bottleneck in LLM agents—handling infinite context demands via delegation intelligence. Its focus on 'deep research' aligns with cutting-edge industry trends, and the commitment to releasing the harness, training data, and model weights ensures significant, immediate utility and high citation potential within the open-source community.
While Paper 2 identifies an important and specific problem with biomedical embeddings and provides a well-engineered fix, Paper 1 addresses a broader and more foundational challenge—equipping LLM-based agents with comprehensive, adaptive memory systems integrating semantic, episodic, and procedural memory. This has wider applicability across diverse AI agent tasks and environments, and the multi-agent architecture with continual learning capabilities represents a more broadly impactful contribution to the rapidly growing field of LLM-based autonomous agents. Paper 2's impact, though rigorous, is more niche.
Paper 2 proposes a foundational paradigm shift by formally bridging the gap between classical control theory (MDPs) and foundation model agents. By framing agent robustness as a sim-to-real problem, it sets a broad, unifying research agenda that can impact multiple fields (robotics, RL, and LLMs) and establish new standardized evaluation benchmarks. In contrast, Paper 1 offers a valuable but more narrowly focused architectural improvement for agent memory systems.
Paper 1 addresses a critical bottleneck in the rapidly growing field of LLM agents: long-horizon memory. Its unified memory framework has broad, cross-disciplinary applicability in AI, robotics, and automation, offering significant real-world utility. While Paper 2 presents a rigorous methodology for audio-visual event localization, its focus is much narrower, limiting its overall scientific and practical impact compared to advancing general-purpose autonomous agents.
Paper 2 likely has higher impact: adaptive, unified memory for LLM agents is a timely, broadly applicable problem with clear real-world utility (tool use, automation, long-horizon workflows) across many domains. The integrated semantic/episodic/procedural design with evaluation, pruning, and multi-agent roles suggests a general framework that can be adopted and extended by others, increasing breadth and follow-on work. Paper 1 is novel but more niche (legal benchmarks, CoT fragment parsing, HUBO/quantum-inspired optimization), with higher methodological and deployment friction and narrower applicability.
Paper 2 addresses a highly timely and critical bottleneck in modern Large Reasoning Models—inference inefficiency or 'overthinking'. Its insight that problem difficulty evolves and is encoded in step-level embeddings offers a novel, training-free solution to dynamically control reasoning depth. Given the massive compute costs associated with inference-time scaling (e.g., o1-like models), an efficient, generalizable approach tested across multiple models and benchmarks provides higher immediate real-world utility and broader impact than the relatively more saturated field of LLM agent memory frameworks proposed in Paper 1.
Paper 2 has higher likely impact: it proposes a broadly applicable agent-memory architecture (semantic/episodic/procedural, automated generation/evaluation/pruning) that can improve long-horizon performance across many tasks and domains, making it more general and reusable than a single benchmark contribution. Its real-world applicability (scalable continual memory for deployed agents) is high and timely. Paper 1 is valuable and rigorous as an evaluation benchmark for monitoring agents, but benchmarks typically have narrower cross-field impact unless they become a dominant standard; Paper 2’s method could influence many agent systems directly.
AdMem addresses a fundamental challenge in LLM-based agents—long-horizon memory management—which is a highly active and impactful research area with broad applications across AI. Its unified memory framework combining semantic, episodic, and procedural memory with a multi-agent architecture represents a novel contribution with significant potential for real-world deployment. Paper 1, while methodologically sound, proposes an incremental improvement to TOPSIS (a well-established MCDM method) with limited scope and demonstrates results only on toy examples, suggesting narrower impact within the decision science community.