AdaMEM: Test-Time Adaptive Memory for Language Agents
Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang
Abstract
A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.
AI Impact Assessments
Scientific Impact Assessment: AdaMEM: Test-Time Adaptive Memory for Language Agents
1. Core Contribution
AdaMEM addresses a genuine limitation in existing language agent memory systems: the rigidity of one-time, episode-level memory retrieval. The paper proposes a hybrid memory architecture that separates long-term trajectory storage (raw experiences indexed offline) from short-term strategy memory (concise guidance synthesized dynamically at inference time). The key insight is that agents should be able to query and update their strategic guidance *during* task execution, not just at the beginning. The framework offers two operating modes—AdaMEM-LOW (persistent strategy with selective refresh) and AdaMEM-HIGH (transient per-step strategy regeneration)—providing a controllable trade-off between token efficiency and adaptability.
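The difference between the two operating modes can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the callables `retrieve`, `synthesize`, `needs_refresh`, and `act` are hypothetical placeholders, and a fixed list of observations stands in for a real environment whose next observation would depend on the previous action.

```python
from typing import Callable, List, Optional, Sequence


def run_episode(
    observations: Sequence[str],
    mode: str,
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str]], str],
    needs_refresh: Callable[[str, str], bool],
    act: Callable[[str, str], str],
) -> List[str]:
    """Run one episode and return the chosen actions.

    mode="high": regenerate the strategy at every step (transient strategy).
    mode="low":  keep the strategy and regenerate it only when
                 needs_refresh() says so (persistent strategy).
    All callables are hypothetical stand-ins for LLM and retriever calls.
    """
    if mode not in ("low", "high"):
        raise ValueError(f"unknown mode: {mode!r}")

    strategy: Optional[str] = None
    actions: List[str] = []
    for obs in observations:
        # Short-circuiting means needs_refresh() is only consulted in
        # low mode once a strategy already exists.
        if mode == "high" or strategy is None or needs_refresh(obs, strategy):
            # Query long-term trajectory memory, then condense the retrieved
            # trajectories into a short-term strategy for the current state.
            strategy = synthesize(obs, retrieve(obs))
        actions.append(act(obs, strategy))
    return actions
```

In this sketch the token cost scales with the number of `synthesize` calls, which is why the low mode is cheaper: it pays for synthesis only on the first step and on steps where the refresh check fires.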
Additionally, STEP-MFT introduces a process-level filtering mechanism for fine-tuning: it retains only training examples where the generated strategy actually changed the agent's action on a successful trajectory, using this as a proxy for positive strategy advantage. This avoids the need for expensive rollouts or auxiliary value models.
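The filtering rule described above can be illustrated with a short sketch. It assumes that each logged step records both the action taken with the generated strategy in context and the action the policy would have taken without it. How that counterfactual action is obtained is not specified here, and every field and function name below is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StepRecord:
    """One decision step from a logged trajectory (hypothetical schema)."""
    strategy: str                 # strategy text generated at this step
    action_with_strategy: str     # action chosen with the strategy in context
    action_without_strategy: str  # action chosen without it (assumed available)
    episode_success: bool         # whether the whole trajectory succeeded


def filter_action_changing_steps(steps: List[StepRecord]) -> List[StepRecord]:
    """Keep steps from successful trajectories where the strategy changed the action.

    A changed action on a successful trajectory is used as a cheap proxy for
    positive strategy advantage; no rollouts or value model are involved.
    """
    return [
        s
        for s in steps
        if s.episode_success
        and s.action_with_strategy != s.action_without_strategy
    ]
```

Filtering on `episode_success` alone would correspond to outcome-based filtering; the extra inequality check is what makes the criterion step-level rather than trajectory-level.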
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The paper addresses a practical need: deploying agents that can self-correct during execution without expensive retraining. The non-parametric nature of the adaptation makes it immediately deployable with existing LLMs. The cross-model memory sharing capability is practically valuable—organizations could maintain shared trajectory banks across different model deployments.
The STEP-MFT technique, while simple, provides a scalable approach to training better strategy generators. The insight that outcome-based filtering can actually *hurt* performance (WebShop, Figure 5) while action-change filtering consistently helps is a useful finding for the community.
However, the impact may be bounded by several factors: (1) the approach is fundamentally limited by the quality and coverage of the trajectory bank; (2) the strategy synthesis adds computational overhead that may be prohibitive in latency-sensitive applications; (3) the "strategy inertia" failure mode (Appendix C.3) represents a fundamental limitation of prompt-based self-evaluation that isn't resolved.
4. Timeliness & Relevance
This work is well-timed. The field is actively exploring how to make language agents more adaptive and capable in long-horizon tasks, and memory mechanisms are a key research direction. The paper positions itself at the intersection of test-time compute scaling (a hot topic post-o1) and agentic memory (an emerging research area). The framing of dynamic memory as a "new scaling dimension" for inference-time compute is compelling and timely.
The distinction between inter-episode and intra-episode adaptation is important and underexplored. Most prior work focuses on learning across episodes; AdaMEM explicitly targets within-episode recovery, which is arguably more critical for real-world deployment.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
The paper's efficiency analysis is informative—showing that AdaMEM reduces latency vs. Synapse despite adding strategy synthesis overhead is a non-obvious finding. The token consumption analysis (Table 5) transparently shows the costs of higher adaptation modes. The scalability with k (Figure 4) is a strong result, showing monotonic improvement while Synapse degrades—this cleanly demonstrates the value of abstraction over raw retrieval.
The writing is generally clear, though the paper could benefit from a more explicit discussion of when static methods might suffice (e.g., short-horizon tasks or highly predictable environments).
Generated Jun 5, 2026
Comparison History (16)
AdaMEM addresses a fundamental challenge in language agent systems—test-time adaptation through dynamic memory—with broad applicability across multiple agent tasks (ALFWorld, WebShop, HotpotQA). It introduces a novel hybrid memory architecture and scaling dimension for agentic memory that could influence the design of future agent systems broadly. Paper 2 (Brick-Composer) tackles an interesting but narrower problem of brick assembly with MLLMs, achieving modest results (15% step success). While creative, its impact is more domain-specific, whereas AdaMEM's contributions to adaptive agent architectures have wider implications for the rapidly growing field of language agents.
Paper 1 introduces a novel, generalizable test-time adaptation framework (hybrid long-term trajectory memory + on-the-fly short-term strategy memory) with demonstrated performance gains across multiple agent benchmarks and an added training method (STEP-MFT). It is timely for deployable LLM agents and broadly applicable to many interactive tasks, suggesting wide downstream impact. Paper 2 provides a valuable dataset and evaluation suite for human-agent collaboration, but its scope is narrower (Map Task-derived dyadic routing) and impact depends on adoption and generalization beyond the dataset. Overall, Paper 1 is more likely to shift methods and practice.
Paper 2 has higher likely impact due to its protocol-aligned, controlled evaluation framework (BenchAgent) that directly addresses a timely, field-wide confusion: whether multi-agent workflows actually help once execution details are normalized. Its methodological rigor (standardized loader/tool access/answer contracts/cost accounting/logging + statistical guidance) and breadth across 10 benchmarks make it broadly usable by the community and influential for future agent research and claims. Paper 1 is a solid algorithmic contribution with clear gains on key agent benchmarks, but its impact is narrower and more incremental relative to the larger reproducibility/evaluation problem Paper 2 targets.
Paper 2 addresses a timely and broadly impactful issue—covert AI persuasion in public discourse—with a unique, naturally occurring dataset from an ethically controversial field experiment. Its findings have direct implications for AI governance, platform regulation, and democratic deliberation, reaching audiences across computer science, political science, communication, and policy. While Paper 1 (AdaMEM) makes solid technical contributions to agentic memory architectures, it represents an incremental advance in a crowded LLM-agent optimization space. Paper 2's novelty, societal relevance, and cross-disciplinary appeal give it higher potential impact.
Paper 1 introduces rigorous formal verification (process calculus) to the rapidly growing field of LLM agent protocols. By bridging an industry standard (MCP) with academic frameworks (SGD) and proving behavioral equivalence properties, it establishes a foundational safety standard. While Paper 2 offers solid empirical improvements to agent memory, Paper 1's theoretical contributions address critical safety and reliability bottlenecks, likely yielding longer-lasting, cross-disciplinary impact in formal methods and AI safety.
Paper 1 introduces a paradigm shift in test-time compute allocation by prioritizing the real-world consequence of errors over mere task difficulty. This addresses a critical, often-overlooked gap in AI deployment, risk management, and safety. While Paper 2's adaptive memory framework is solid and improves agent performance, it represents a more incremental architectural enhancement compared to the broad conceptual innovation and high real-world applicability of consequence-aware reasoning proposed in Paper 1.
AdaMEM introduces a novel and generalizable framework for test-time adaptive memory in language agents, addressing a fundamental limitation in agentic AI systems. It demonstrates strong empirical gains across multiple benchmarks, introduces a new scaling dimension for agentic memory, and has broad applicability across diverse agent tasks. Paper 1, while practically valuable, is primarily an evaluation study comparing AI vs. expert summaries in a narrow clinical domain (headache medicine) without introducing significant methodological innovation. Paper 2's contributions to agent architecture are more likely to influence future research across multiple AI subfields.
Paper 2 addresses dynamic test-time memory adaptation, a critical bottleneck for long-horizon agent autonomy. While Paper 1 offers an elegant solution for tool-selection efficiency, Paper 2 taps into the highly impactful area of test-time compute scaling and continuous self-evolution. Its hybrid memory architecture and novel fine-tuning strategy provide a foundational framework applicable to a wide range of complex reasoning environments, likely driving broader subsequent research.
AdaMEM introduces a broadly applicable framework for test-time adaptive memory in language agents, addressing a fundamental challenge across multiple domains (ALFWorld, WebShop, HotpotQA). Its hybrid memory architecture and STEP-MFT technique establish a new scaling dimension with wide applicability. Paper 2, while impressive in its competition results (beating GPT-5), is more narrowly focused on multi-agent game environments and competition-specific engineering. AdaMEM's contributions to adaptive memory and test-time adaptation have broader potential impact across the rapidly growing field of language agents.
Self-correction is a critical and notoriously difficult challenge in LLM research. Paper 2 addresses this fundamental issue by introducing a structured approach to error localization, offering significant performance lifts. By improving reasoning reliability, this method has broader applicability across virtually all LLM use cases compared to Paper 1's specific focus on long-horizon agentic memory, giving Paper 2 a higher potential for widespread scientific and practical impact.
Paper 1 (AdaMEM) is likely to have higher scientific impact due to a more novel, general-purpose adaptation mechanism for long-horizon LLM agents: continuous test-time behavior adaptation via hybrid long-/short-term memory without online parameter updates, plus a training method (STEP-MFT) to synthesize strategies from retrieved experience. This targets a broad capability bottleneck (agent robustness over time) with applicability across many agent settings (web, embodied, QA/search) and aligns with timely interest in scalable agent memory. Paper 2 is valuable and timely for safety, but is narrower (guardrail feedback loop) and more policy/dataset dependent.
Paper 2 (SENSEI) likely has higher scientific impact due to stronger cross-domain relevance and real-world applicability: it targets human-AI collaboration, interpretable assistance, and misconception correction—problems spanning education, decision support, HCI, and safety-critical systems. Its knowledge-gap localization introduces a more conceptually novel intervention level than action-level feedback, and includes evidence of compositional generalization plus a user study with substantial misconception-correction rates, suggesting methodological rigor and translational value. Paper 1 is valuable for LLM agents, but is more niche and benchmark-centric.
Paper 2 (AdaMEM) is likely to have higher scientific impact due to clearer conceptual novelty (test-time adaptive memory updated throughout an episode), broad relevance to general language-agent research, and easier adoption (algorithmic framework + code release) across tasks and model families. Its contributions (hybrid long/short-term memory, compute–adaptability trade-off, STEP-MFT training) generalize beyond specific modalities or proprietary systems. Paper 1 is strong engineering and benchmarking, but appears more system/product-centric with tightly integrated components, which may limit reproducibility and broader scientific uptake.
Paper 2 likely has higher impact due to broad, timely relevance to federated personalization of foundation models—an area with strong real-world demand (privacy, on-device adaptation, enterprise deployment). Its proposed shift from heuristic LoRA aggregation and repeated client optimization to learned hypernetwork initialization plus product-space aggregation is a notable methodological advance with cross-domain applicability (vision, VLMs, potentially LLMs). Paper 1 is novel for agent test-time memory, but is more niche to agentic benchmarks and lacks the same immediate deployment pull and ecosystem-level relevance as federated adaptation.
Paper 2 has higher likely impact: it targets broadly useful test-time adaptation for language agents via a novel hybrid memory (long-term trajectories + dynamic short-term strategies) and introduces STEP-MFT, with gains across multiple embodied/web/QA benchmarks—suggesting wider applicability beyond a single task type. Its focus on post-deployment adaptation without parameter updates is timely for real-world agents and relates to scaling inference-time compute. Paper 1 is valuable but more narrowly scoped to math reasoning reliability on GSM8K, with a comparatively incremental multi-agent/critic loop.
Paper 2 introduces a novel, general-purpose framework for test-time adaptive memory in language agents, addressing a fundamental challenge in AI with broad applicability across multiple domains. In contrast, Paper 1 primarily focuses on benchmarking existing LLM approaches within the specific, narrower domain of network configuration repair. Paper 2's methodological innovation and potential for widespread adoption across various agentic systems give it a higher estimated scientific impact.