AdaMEM: Test-Time Adaptive Memory for Language Agents

Yunxiang Zhang, Yiheng Li, Ali Payani, Lu Wang

Jun 4, 2026

arXiv:2606.05684v1

cs.AI (primary)

#1804 of 3355 · Artificial Intelligence

Tournament Score: 1395 ± 44

Win Rate: 56%

Rating: 6.3 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Abstract

A central challenge for language agents is utilizing past experience to adapt to dynamic test-time conditions. While recent work demonstrates the promise of agentic memory mechanisms, most systems restrict retrieval to episode initiation. Consequently, agents are forced to rely on static guidance that becomes increasingly misaligned as long-horizon tasks unfold. To address this rigidity, we propose the Adaptive Memory Agent (AdaMEM), a novel framework for agent test-time adaptation. Without updating model parameters online, AdaMEM adapts agent behavior via a hybrid memory architecture: it maintains a long-term trajectory memory of raw experiences collected offline while generating dynamic short-term strategy memory on-the-fly to guide decision-making. This mechanism enables the trade-off between token efficiency and adaptability across varying inference-time compute levels. Empirically, AdaMEM significantly outperforms static memory baselines, achieving relative gains of up to 13% on ALFWorld and 11% on WebShop, with consistent leading performance extending to agentic search on HotpotQA. To further enhance this adaptation, we develop STEP-MFT, a Step-wise Memory Fine-Tuning technique that trains the policy to synthesize high-quality strategies from retrieved experiences, yielding additional performance gains. Our work establishes a new scaling dimension for agentic memory, supporting continuous reasoning and self-evolution post-deployment in real-world environments. Our code is available at https://github.com/yunx-z/AdaMEM.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AdaMEM: Test-Time Adaptive Memory for Language Agents

1. Core Contribution

AdaMEM addresses a genuine limitation in existing language agent memory systems: the rigidity of one-time, episode-level memory retrieval. The paper proposes a hybrid memory architecture that separates long-term trajectory storage (raw experiences indexed offline) from short-term strategy memory (concise guidance synthesized dynamically at inference time). The key insight is that agents should be able to query and update their strategic guidance *during* task execution, not just at the beginning. The framework offers two operating modes—AdaMEM-LOW (persistent strategy with selective refresh) and AdaMEM-HIGH (transient per-step strategy regeneration)—providing a controllable trade-off between token efficiency and adaptability.
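The difference between the two operating modes can be illustrated with a minimal sketch. It is not the authors' implementation: the class `AdaptiveMemoryAgent`, its callables (`retrieve`, `synthesize`, `needs_refresh`, `act`), and the toy refresh rule are all hypothetical stand-ins, assumed only to show when a short-term strategy would be regenerated in a LOW-style versus a HIGH-style loop.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AdaptiveMemoryAgent:
    # Every callable here is a hypothetical stand-in, not the paper's API.
    retrieve: Callable[[str, int], List[str]]    # (query, k) -> raw trajectories
    synthesize: Callable[[str, List[str]], str]  # (observation, trajectories) -> strategy
    needs_refresh: Callable[[str, str], bool]    # (observation, strategy) -> refresh?
    act: Callable[[str, str], str]               # (observation, strategy) -> action
    mode: str = "low"                            # "low": persistent, "high": per-step
    k: int = 2                                   # retrieval budget
    strategy: Optional[str] = None               # short-term strategy memory
    refreshes: int = 0                           # number of strategy syntheses so far

    def step(self, observation: str) -> str:
        # HIGH regenerates every step; LOW keeps the strategy until a refresh is requested.
        regenerate = (
            self.mode == "high"
            or self.strategy is None
            or self.needs_refresh(observation, self.strategy)
        )
        if regenerate:
            trajectories = self.retrieve(observation, self.k)
            self.strategy = self.synthesize(observation, trajectories)
            self.refreshes += 1
        return self.act(observation, self.strategy)


def run(mode: str, observations: List[str]) -> int:
    """Run a toy episode and return how many times the strategy was synthesized."""
    bank = ["open drawer then take key", "search shelf then take mug"]  # toy long-term memory
    agent = AdaptiveMemoryAgent(
        retrieve=lambda query, k: bank[:k],
        synthesize=lambda obs, trajs: f"strategy for '{obs}' from {len(trajs)} trajectories",
        needs_refresh=lambda obs, strat: "stuck" in obs,  # toy refresh rule
        act=lambda obs, strat: f"act({obs})",
        mode=mode,
    )
    for obs in observations:
        agent.step(obs)
    return agent.refreshes
```

On the toy episode `["start", "moving", "stuck at door", "moving"]`, `run("low", ...)` synthesizes a strategy twice (the first step and the "stuck" step), while `run("high", ...)` synthesizes one at all four steps, which is the token-cost versus adaptability trade-off described above.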

Additionally, STEP-MFT introduces a process-level filtering mechanism for fine-tuning: it retains only training examples where the generated strategy actually changed the agent's action on a successful trajectory, using this as a proxy for positive strategy advantage. This avoids the need for expensive rollouts or auxiliary value models.
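A rough sketch of the action-change filter, as it is described in this assessment, might look like the following. The `Step` and `Trajectory` records and their field names are assumptions made for illustration; the paper's actual data format and the way the strategy-free action is obtained are not specified here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    # Hypothetical record of one decision step.
    observation: str
    strategy: str
    action_with_strategy: str     # action taken when the strategy is in the prompt
    action_without_strategy: str  # action the policy would take without it


@dataclass
class Trajectory:
    steps: List[Step]
    success: bool                 # sparse outcome reward for the whole episode


def action_change_filter(trajectories: List[Trajectory]) -> List[Step]:
    """Keep steps from successful trajectories where the strategy changed the action."""
    kept: List[Step] = []
    for trajectory in trajectories:
        if not trajectory.success:
            continue  # only successful episodes contribute examples
        for step in trajectory.steps:
            if step.action_with_strategy != step.action_without_strategy:
                kept.append(step)
    return kept
```

Under this reading, a step where the strategy left the action unchanged is dropped even on a successful trajectory, since it carries no evidence that the strategy helped.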

2. Methodological Rigor

Strengths:

The experimental design is reasonably thorough, with evaluation across three distinct benchmarks (ALFWorld, WebShop, HotpotQA) covering embodied, web, and search tasks. Results include standard deviations over 3 runs.

The ablation study (Table 3) cleanly isolates the contribution of each memory component, demonstrating that both long-term grounding and short-term abstraction are necessary.

The off-policy evaluation using Gemma-27B with Qwen-constructed memory banks tests cross-model generalizability—an important practical consideration.

The formal justification (Proposition 3.1) for the action-change filter is sound, though somewhat straightforward given its assumptions.

Weaknesses:

The proof of Proposition 3.1 assumes deterministic greedy decoding and uses sparse outcome rewards as a proxy for step-level correctness. These are strong assumptions that may not hold in practice (temperature 0.7 is used during inference, contradicting the deterministic assumption).

The benchmarks, while diverse, are relatively standard and somewhat saturated in the agent literature. ALFWorld and WebShop are well-studied environments that may not fully test the limits of dynamic adaptation.

The comparison baseline set is limited. The paper compares against Synapse and ReasoningBank but excludes workflow-based methods (AWM) and RL-based memory approaches (MemRL), limiting the assessment of relative improvement.

HotpotQA improvements are modest (~1 point) and within noise margins given the standard deviations.
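One way a reader could check a "within noise" claim from reported means and standard deviations is Welch's t statistic computed from summary statistics. The sketch below assumes the reported deviations are sample standard deviations over 3 runs per method; the function name `welch_t` is illustrative, and no numbers from the paper are used.

```python
import math
from typing import Tuple


def welch_t(mean_a: float, std_a: float, mean_b: float, std_b: float,
            n_a: int = 3, n_b: int = 3) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Assumes std_a and std_b are sample standard deviations and that at
    least one of them is non-zero.
    """
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df
```

For example, with placeholder values (not taken from the paper) of 51.0 ± 1.0 versus 50.0 ± 1.0 over 3 runs each, the statistic is about 1.22 with 4 degrees of freedom, below the two-sided 5% critical value of roughly 2.78, so a 1-point gap of that kind would not be distinguishable from run-to-run variation.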

3. Potential Impact

The paper addresses a practical need: deploying agents that can self-correct during execution without expensive retraining. The non-parametric nature of the adaptation makes it immediately deployable with existing LLMs. The cross-model memory sharing capability is practically valuable—organizations could maintain shared trajectory banks across different model deployments.

The STEP-MFT technique, while simple, provides a scalable approach to training better strategy generators. The insight that outcome-based filtering can actually *hurt* performance (WebShop, Figure 5) while action-change filtering consistently helps is a useful finding for the community.

However, the impact may be bounded by several factors: (1) the approach is fundamentally limited by the quality and coverage of the trajectory bank; (2) the strategy synthesis adds computational overhead that may be prohibitive in latency-sensitive applications; (3) the "strategy inertia" failure mode (Appendix C.3) represents a fundamental limitation of prompt-based self-evaluation that isn't resolved.

4. Timeliness & Relevance

This work is well-timed. The field is actively exploring how to make language agents more adaptive and capable in long-horizon tasks, and memory mechanisms are a key research direction. The paper positions itself at the intersection of test-time compute scaling (a hot topic post-o1) and agentic memory (an emerging research area). The framing of dynamic memory as a "new scaling dimension" for inference-time compute is compelling and timely.

The distinction between inter-episode and intra-episode adaptation is important and underexplored. Most prior work focuses on learning across episodes; AdaMEM explicitly targets within-episode recovery, which is arguably more critical for real-world deployment.

5. Strengths & Limitations

Key Strengths:

Clean architectural design with clear separation of concerns between storage and abstraction

Practical controllability through LOW/HIGH modes and scalable retrieval budget k

Demonstrates that dynamic strategy synthesis avoids the negative transfer problem observed with static baselines on WebShop

STEP-MFT provides a simple but principled approach to process-level credit assignment without expensive infrastructure

Code availability and reproducibility details

Notable Limitations:

Strategy inertia is acknowledged but unresolved—the agent's ability to decide *when* to refresh is the weakest link

The backbone models used are relatively small (4B, 7B); it's unclear how the approach scales with larger, more capable models that may need less guidance

The WebShop baseline uses an RL-trained model due to the base model's insufficient success rate, introducing a confound

Limited analysis of failure modes beyond the single case study

The approach requires a pre-existing bank of successful trajectories, which may be a bootstrapping challenge in novel domains

The gains on HotpotQA are marginal (about 1 point) and fall within run-to-run noise

Additional Observations:

The paper's efficiency analysis is informative—showing that AdaMEM reduces latency vs. Synapse despite adding strategy synthesis overhead is a non-obvious finding. The token consumption analysis (Table 5) transparently shows the costs of higher adaptation modes. The scalability with k (Figure 4) is a strong result, showing monotonic improvement while Synapse degrades—this cleanly demonstrates the value of abstraction over raw retrieval.
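To make the role of the retrieval budget k concrete, here is a small sketch of top-k retrieval from a trajectory bank. It assumes embedding-based cosine similarity, which this assessment does not confirm as the paper's retriever; `cosine`, `top_k`, and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k(query: Sequence[float],
          bank: List[Tuple[str, Sequence[float]]],
          k: int) -> List[str]:
    """Return the ids of the k bank entries most similar to the query."""
    ranked = sorted(bank, key=lambda item: cosine(query, item[1]), reverse=True)
    return [trajectory_id for trajectory_id, _ in ranked[:k]]
```

In a raw-retrieval baseline the k returned trajectories would be placed in the prompt directly, so prompt length grows with k; in the design described above they would instead be condensed into a short strategy before reaching the policy.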

The writing is generally clear, though the paper could benefit from a more honest discussion of when static methods might suffice (e.g., short-horizon tasks or highly predictable environments).

Rating: 6.3 / 10

Significance 6.5 · Rigor 6 · Novelty 6 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (16)

vs. Brick-Composer: Using MLLMs for Assembly with Diverse Bricks

claude-opus-4.6 · 6/6/2026

AdaMEM addresses a fundamental challenge in language agent systems—test-time adaptation through dynamic memory—with broad applicability across multiple agent tasks (ALFWorld, WebShop, HotpotQA). It introduces a novel hybrid memory architecture and scaling dimension for agentic memory that could influence the design of future agent systems broadly. Paper 2 (Brick-Composer) tackles an interesting but narrower problem of brick assembly with MLLMs, achieving modest results (15% step success). While creative, its impact is more domain-specific, whereas AdaMEM's contributions to adaptive agent architectures have wider implications for the rapidly growing field of language agents.

vs. Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

gpt-5.2 · 6/6/2026

Paper 1 introduces a novel, generalizable test-time adaptation framework (hybrid long-term trajectory memory + on-the-fly short-term strategy memory) with demonstrated performance gains across multiple agent benchmarks and an added training method (STEP-MFT). It is timely for deployable LLM agents and broadly applicable to many interactive tasks, suggesting wide downstream impact. Paper 2 provides a valuable dataset and evaluation suite for human-agent collaboration, but its scope is narrower (Map Task-derived dyadic routing) and impact depends on adoption and generalization beyond the dataset. Overall, Paper 1 is more likely to shift methods and practice.

vs. Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

gpt-5.2 · 6/6/2026

Paper 2 has higher likely impact due to its protocol-aligned, controlled evaluation framework (BenchAgent) that directly addresses a timely, field-wide confusion: whether multi-agent workflows actually help once execution details are normalized. Its methodological rigor (standardized loader/tool access/answer contracts/cost accounting/logging + statistical guidance) and breadth across 10 benchmarks make it broadly usable by the community and influential for future agent research and claims. Paper 1 is a solid algorithmic contribution with clear gains on key agent benchmarks, but its impact is narrower and more incremental relative to the larger reproducibility/evaluation problem Paper 2 targets.

vs. How Far Did They Go? The Persuasive Tactics of Covert LLM Agents in a Discontinued Field Experiment

claude-opus-4.6 · 6/6/2026

Paper 2 addresses a timely and broadly impactful issue—covert AI persuasion in public discourse—with a unique, naturally occurring dataset from an ethically controversial field experiment. Its findings have direct implications for AI governance, platform regulation, and democratic deliberation, reaching audiences across computer science, political science, communication, and policy. While Paper 1 (AdaMEM) makes solid technical contributions to agentic memory architectures, it represents an incremental advance in a crowded LLM-agent optimization space. Paper 2's novelty, societal relevance, and cross-disciplinary appeal give it higher potential impact.

vs. Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach

gemini-3.1 · 6/6/2026

Paper 1 introduces rigorous formal verification (process calculus) to the rapidly growing field of LLM agent protocols. By bridging an industry standard (MCP) with academic frameworks (SGD) and proving behavioral equivalence properties, it establishes a foundational safety standard. While Paper 2 offers solid empirical improvements to agent memory, Paper 1's theoretical contributions address critical safety and reliability bottlenecks, likely yielding longer-lasting, cross-disciplinary impact in formal methods and AI safety.

vs. Not All Errors Are Equal: Consequence-Aware Reasoning Compute Allocation

gemini-3.1 · 6/6/2026

Paper 1 introduces a paradigm shift in test-time compute allocation by prioritizing the real-world consequence of errors over mere task difficulty. This addresses a critical, often-overlooked gap in AI deployment, risk management, and safety. While Paper 2's adaptive memory framework is solid and improves agent performance, it represents a more incremental architectural enhancement compared to the broad conceptual innovation and high real-world applicability of consequence-aware reasoning proposed in Paper 1.

vs. Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

claude-opus-4.6 · 6/6/2026

AdaMEM introduces a novel and generalizable framework for test-time adaptive memory in language agents, addressing a fundamental limitation in agentic AI systems. It demonstrates strong empirical gains across multiple benchmarks, introduces a new scaling dimension for agentic memory, and has broad applicability across diverse agent tasks. Paper 1, while practically valuable, is primarily an evaluation study comparing AI vs. expert summaries in a narrow clinical domain (headache medicine) without introducing significant methodological innovation. Paper 2's contributions to agent architecture are more likely to influence future research across multiple AI subfields.

vs. ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

gemini-3.1 · 6/5/2026

Paper 2 addresses dynamic test-time memory adaptation, a critical bottleneck for long-horizon agent autonomy. While Paper 1 offers an elegant solution for tool-selection efficiency, Paper 2 taps into the highly impactful area of test-time compute scaling and continuous self-evolution. Its hybrid memory architecture and novel fine-tuning strategy provide a foundational framework applicable to a wide range of complex reasoning environments, likely driving broader subsequent research.

vs. MindGames Arena Generalization Track: In2AI Solution with Delayed Per-Step Reward Attribution

claude-opus-4.6 · 6/5/2026

AdaMEM introduces a broadly applicable framework for test-time adaptive memory in language agents, addressing a fundamental challenge across multiple domains (ALFWorld, WebShop, HotpotQA). Its hybrid memory architecture and STEP-MFT technique establish a new scaling dimension with wide applicability. Paper 2, while impressive in its competition results (beating GPT-5), is more narrowly focused on multi-agent game environments and competition-specific engineering. AdaMEM's contributions to adaptive memory and test-time adaptation have broader potential impact across the rapidly growing field of language agents.

vs. Structure Enables Effective Self-Localization of Errors in LLMs

gemini-3.1 · 6/5/2026

Self-correction is a critical and notoriously difficult challenge in LLM research. Paper 2 addresses this fundamental issue by introducing a structured approach to error localization, offering significant performance lifts. By improving reasoning reliability, this method has broader applicability across virtually all LLM use cases compared to Paper 1's specific focus on long-horizon agentic memory, giving Paper 2 a higher potential for widespread scientific and practical impact.

vs. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

gpt-5.2 · 6/5/2026

Paper 1 (AdaMEM) is likely to have higher scientific impact due to a more novel, general-purpose adaptation mechanism for long-horizon LLM agents: continuous test-time behavior adaptation via hybrid long-/short-term memory without online parameter updates, plus a training method (STEP-MFT) to synthesize strategies from retrieved experience. This targets a broad capability bottleneck (agent robustness over time) with applicability across many agent settings (web, embodied, QA/search) and aligns with timely interest in scalable agent memory. Paper 2 is valuable and timely for safety, but is narrower (guardrail feedback loop) and more policy/dataset dependent.

vs. Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization

gpt-5.2 · 6/5/2026

Paper 2 (SENSEI) likely has higher scientific impact due to stronger cross-domain relevance and real-world applicability: it targets human-AI collaboration, interpretable assistance, and misconception correction—problems spanning education, decision support, HCI, and safety-critical systems. Its knowledge-gap localization introduces a more conceptually novel intervention level than action-level feedback, and includes evidence of compositional generalization plus a user study with substantial misconception-correction rates, suggesting methodological rigor and translational value. Paper 1 is valuable for LLM agents, but is more niche and benchmark-centric.

vs. Interfaze: The Future of AI is built on Task-Specific Small Models

gpt-5.2 · 6/5/2026

Paper 2 (AdaMEM) is likely to have higher scientific impact due to clearer conceptual novelty (test-time adaptive memory updated throughout an episode), broad relevance to general language-agent research, and easier adoption (algorithmic framework + code release) across tasks and model families. Its contributions (hybrid long/short-term memory, compute–adaptability trade-off, STEP-MFT training) generalize beyond specific modalities or proprietary systems. Paper 1 is strong engineering and benchmarking, but appears more system/product-centric with tightly integrated components, which may limit reproducibility and broader scientific uptake.

vs. Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to broad, timely relevance to federated personalization of foundation models—an area with strong real-world demand (privacy, on-device adaptation, enterprise deployment). Its proposed shift from heuristic LoRA aggregation and repeated client optimization to learned hypernetwork initialization plus product-space aggregation is a notable methodological advance with cross-domain applicability (vision, VLMs, potentially LLMs). Paper 1 is novel for agent test-time memory, but is more niche to agentic benchmarks and lacks the same immediate deployment pull and ecosystem-level relevance as federated adaptation.

vs. Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

gpt-5.2 · 6/5/2026

Paper 2 has higher likely impact: it targets broadly useful test-time adaptation for language agents via a novel hybrid memory (long-term trajectories + dynamic short-term strategies) and introduces STEP-MFT, with gains across multiple embodied/web/QA benchmarks—suggesting wider applicability beyond a single task type. Its focus on post-deployment adaptation without parameter updates is timely for real-world agents and relates to scaling inference-time compute. Paper 1 is valuable but more narrowly scoped to math reasoning reliability on GSM8K, with a comparatively incremental multi-agent/critic loop.

vs. Evaluating Agentic Configuration Repair for Computer Networks

gemini-3.1 · 6/5/2026

Paper 2 introduces a novel, general-purpose framework for test-time adaptive memory in language agents, addressing a fundamental challenge in AI with broad applicability across multiple domains. In contrast, Paper 1 primarily focuses on benchmarking existing LLM approaches within the specific, narrower domain of network configuration repair. Paper 2's methodological innovation and potential for widespread adoption across various agentic systems give it a higher estimated scientific impact.