SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

Yuyang Hu, Hongjin Qian, Shuting Wang, Jiongnan Liu, Ziliang Zhao, Jiejun Tan, Zheng Liu, Zhicheng Dou

#509 of 2682 · Artificial Intelligence
Tournament Score: 1479±42
Win Rate: 73% (16 wins, 6 losses, 22 matches)
Rating: 7.2/10

Abstract

Long-horizon agentic reasoning requires large language models to act over long interaction histories containing thoughts, tool calls, observations, and partial conclusions. The challenge is not merely that these histories grow long, but that information needed for the current decision may be scattered across distant steps and only become relevant later. Existing approaches address this difficulty by truncating the interaction history, compressing it into shorter surrogates, or retrieving selected parts of it for reuse, but they do not explicitly model how access to past interaction should adapt to the agent's evolving state. We instead cast long-horizon reasoning as a problem of state-adaptive memory. To this end, we propose State-Adaptive Memory (SAM), a standalone framework that consolidates ongoing interaction into compact memory cues while preserving raw trajectory pages for intent-driven recall. These cues are not treated as replacements for history; rather, they serve as lightweight handles that allow the agent to reconstruct temporally distant information according to its current needs, without retraining the underlying backbone. We further optimize the memory module through expert-guided supervision and reinforcement learning, aligning it with trajectory-level utility. Across BrowseComp, BrowseComp-ZH, WideSearch, and HLE, SAM consistently outperforms strong baselines over diverse agent backbones. Our results suggest that explicit memory modeling provides a simple and effective foundation for long-horizon agentic reasoning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

1. Core Contribution

SAM reframes long-horizon context management as a state-adaptive memory problem rather than a simple context compression or truncation task. The key architectural insight is the cue-page decoupling: raw interaction history is preserved as external "pages" while compact "memory cues" remain in the active context as lightweight handles. At read time, the agent selects cues based on its current intent, and a dedicated memory model reconstructs decision-relevant information from the underlying pages. This is fundamentally different from rolling summaries (which irreversibly compress) or retrieval-based approaches (which retrieve pre-stored chunks without intent-conditioned reconstruction).
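The cue-page split described above can be illustrated with a toy sketch. This is not the paper's implementation: the class and method names (`CuePageMemory`, `consolidate`, `recall`) are hypothetical, and plain word overlap stands in for the learned memory model that selects cues and reconstructs information from pages.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Raw trajectory segment, kept verbatim outside the active context."""
    page_id: int
    text: str


@dataclass
class Cue:
    """Compact handle that stays in context and points at one page."""
    page_id: int
    summary: str


@dataclass
class CuePageMemory:
    pages: dict[int, Page] = field(default_factory=dict)
    cues: list[Cue] = field(default_factory=list)

    def consolidate(self, text: str, summary: str) -> Cue:
        """Write path: store the raw page, keep only a short cue in context."""
        page_id = len(self.pages)
        self.pages[page_id] = Page(page_id, text)
        cue = Cue(page_id, summary)
        self.cues.append(cue)
        return cue

    def recall(self, intent: str, top_k: int = 1) -> list[str]:
        """Read path: pick cues by intent, then read from the raw pages.

        Word overlap is an assumed stand-in for the learned memory model.
        """
        intent_words = set(intent.lower().split())
        ranked = sorted(
            self.cues,
            key=lambda c: len(intent_words & set(c.summary.lower().split())),
            reverse=True,
        )
        results = []
        for cue in ranked[:top_k]:
            page = self.pages[cue.page_id]
            # Keep page sentences that mention an intent word: detail the
            # cue itself never contained.
            hits = [s.strip() for s in page.text.split(".")
                    if intent_words & set(s.lower().split())]
            results.append(". ".join(hits))
        return results


memory = CuePageMemory()
memory.consolidate("Searched the archive. The founder was born in 1921 in Lyon.",
                   "founder biography search")
memory.consolidate("Opened the pricing page. The plan costs 40 dollars.",
                   "pricing page visit")
recalled = memory.recall("founder born year")
```

The point of the sketch is that the birth year is recoverable even though no cue mentions it. A rolling summary that dropped the year at write time could not return it later.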

The second major contribution is OAT-GRPO (Oracle-Anchored Tree GRPO), a reinforcement learning objective that addresses the credit assignment problem unique to memory modules sitting behind tool interfaces. It introduces tree-structured rollouts branching at memory-call points and an oracle-anchored recoverability reward using a committee of frontier models, densifying the otherwise sparse binary outcome signal.
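To make the "densifying" claim concrete, here is a minimal sketch of group-relative advantages computed among sibling branches at one memory-call point. It rests on assumptions not given in this review: the additive blend `outcome + beta * recoverability`, the weight `beta`, and the function name `sibling_advantages` are all hypothetical, and the paper's actual OAT-GRPO objective may differ.

```python
from statistics import mean, pstdev


def sibling_advantages(outcomes, recoverability, beta=0.5, eps=1e-6):
    """Standardize rewards within one group of sibling memory responses.

    outcomes: binary trajectory-level results (0 or 1) per sibling.
    recoverability: assumed dense scores in [0, 1] per sibling.
    """
    # Assumed reward shape: sparse outcome plus a weighted dense term.
    rewards = [o + beta * r for o, r in zip(outcomes, recoverability)]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # GRPO-style normalization: each sibling is scored against its group.
    return [(x - mu) / (sigma + eps) for x in rewards]


# Four siblings that all failed the task but differ in recoverability.
failed = [0, 0, 0, 0]
scores = [0.2, 0.8, 0.5, 0.5]
sparse = sibling_advantages(failed, scores, beta=0.0)  # outcome only
dense = sibling_advantages(failed, scores, beta=0.5)   # with dense term
```

With the outcome signal alone, every sibling in this group receives zero advantage, so the memory model gets no gradient from the failed trajectories. Adding the dense term separates the siblings and favors the one with the highest recoverability score.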

2. Methodological Rigor

Strengths in experimental design:

  • The evaluation spans four complementary benchmarks (BrowseComp, BrowseComp-ZH, WideSearch, HLE) covering English/Chinese, browsing/scientific reasoning, and narrow/broad search.
  • Two heterogeneous backbones (proprietary GLM-4.7 and open-source Qwen3.5-35B-A3B) with a single shared memory model demonstrate generalization.
  • The controlled comparison is well-designed: all context-management baselines share the same backbone, tools, and inference protocol (128K window, 64K trigger, avg@3 reporting).
  • Ablations (§4.1-4.3) systematically isolate training stages, recall mechanisms, round-bucket performance, and page-size sensitivity.
Concerns:

  • No error bars or significance tests are reported, only avg@3 means. Given the moderate sample sizes (200 questions for BrowseComp and HLE), some margins could be within noise.
  • The OAT-GRPO reward pipeline relies heavily on frontier models (GPT-5.4, GLM-4.7, DeepSeek-V4-Flash as committee; GPT-5.4 as assessor), creating a significant dependency on expensive proprietary systems that limits reproducibility and introduces potential biases.
  • The SFT stage also uses Claude-4.5-Opus and GPT-5.4 as expert memory models, meaning the pipeline fundamentally requires access to multiple frontier APIs.
  • GLM-4.7 appears as both a committee member during training and as an agent backbone during evaluation. While the authors address this (committee is not queried at test time), the shared knowledge distribution could still confer indirect advantages.
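The noise concern can be checked with a rough back-of-the-envelope calculation. The sketch below assumes the reported scores are percent accuracy over 200 independent questions and uses a normal-approximation binomial interval. It takes the HLE figures quoted later in this assessment (38.2 vs. 37.5) as inputs. The helper name `binomial_ci_halfwidth` is illustrative.

```python
import math


def binomial_ci_halfwidth(accuracy, n, z=1.96):
    """Half-width of a normal-approximation 95% interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


halfwidth = binomial_ci_halfwidth(0.382, 200)
gap = 0.382 - 0.375
print(f"95% half-width: {100 * halfwidth:.1f} points; "
      f"observed gap: {100 * gap:.1f} points")
```

Under these assumptions the interval is roughly ±6.7 points, far wider than the 0.7-point gap. Averaging over three runs reduces run-to-run noise but not question-sampling noise. A paired test on the same questions would be tighter than this unpaired estimate, so this is only an upper-level sanity check.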
3. Potential Impact

    Immediate practical impact: SAM addresses a genuine bottleneck in deploying LLM agents for complex, multi-step tasks. The modular design—memory as a standalone module that doesn't require retraining the backbone—is architecturally appealing for production systems where backbone models are frequently updated.

    Broader implications:

  • The cue-page architecture could generalize beyond web browsing to software engineering agents, embodied agents, and scientific research assistants.
  • The formalization of "decision state" (what's established, resolved, remaining) provides a useful conceptual framework even beyond this specific implementation.
  • OAT-GRPO's tree-structured credit assignment for behind-the-tool models is a methodological contribution applicable to any auxiliary module optimized within an agent loop.
Limitations on impact:

  • The benchmarks are all information-seeking tasks; transfer to code generation, embodied reasoning, or planning domains remains unvalidated.
  • The computational overhead of SAM (separate memory model inference at every consolidation and recall step) is not quantified against baselines.
4. Timeliness & Relevance

    The paper is extremely timely. Long-horizon agentic reasoning is among the most active areas in AI (2025-2026), and context management is widely recognized as a critical bottleneck. The distinction between "shortening context" and "making the right information accessible" is conceptually sharp and addresses a gap in existing approaches that primarily focus on compression or truncation. The use of very recent models (GPT-5.4, Claude-4.5-Opus, Qwen3.5) and benchmarks (BrowseComp, HLE) places this at the frontier of current research.

5. Strengths & Limitations

    Key Strengths:

    1. Clean conceptual separation between write-time consolidation and read-time reconstruction, avoiding the irreversibility problem of summary-based approaches.

    2. Strong ablation evidence: The recall mechanism ablation (Figure 2, right) convincingly shows that intent-driven recall, not consolidation alone, drives the gains. Summary-only, recency-based, and raw-content retrieval all underperform full SAM.

    3. Long-horizon behavior analysis (Figure 3) demonstrates that SAM's advantage grows with trajectory length—exactly the regime where memory matters most.

    4. Modularity: A single 9B memory model trained once works across both backbones and all four benchmarks, suggesting the learned capability is genuinely transferable.

    5. Detailed case studies (Appendix E) provide transparent analysis of both success and failure modes, with the failure case honestly demonstrating that SAM cannot rescue upstream framing errors.

    Notable Weaknesses:

    1. Frontier-model dependency: The training pipeline requires extensive access to multiple frontier APIs, making the approach expensive and partially non-reproducible.

    2. Missing cost analysis: No wall-clock time, token consumption, or API cost comparisons against baselines.

    3. Limited backbone diversity: Only two backbones tested; the community would benefit from seeing results on more diverse architectures (e.g., Llama-family, Gemini).

    4. No statistical significance testing: Margins on some benchmarks (e.g., HLE: 38.2 vs. 37.5 for summary) are small enough that significance is unclear.

    5. Page-size sensitivity: The 128K page size degrades performance on BrowseComp (Figure 3c), suggesting the method is not fully robust to this hyperparameter.

Overall Assessment

    SAM presents a well-motivated and architecturally clean solution to a real and timely problem. The cue-page decoupling is a genuine conceptual advance over flat summarization, and the empirical evidence across four benchmarks and two backbones is convincing, though statistical rigor could be improved. The OAT-GRPO training recipe is innovative but expensive. The paper is clearly written with thorough ablations and honest failure analysis. Its main limitation is the heavy reliance on frontier models during training and the lack of cost/efficiency analysis.

    Rating: 7.2/10
    Significance 7.5 · Rigor 6.8 · Novelty 7.5 · Clarity 8

    Generated May 26, 2026

    Comparison History (22)

    vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning
    claude-opus-4.6 · 5/28/2026

    SAM addresses the critical and broadly applicable problem of long-horizon agentic reasoning with a novel state-adaptive memory framework. It demonstrates results across four diverse benchmarks and multiple agent backbones, suggesting strong generalizability. The problem of managing long interaction histories is fundamental to the rapidly growing field of LLM agents. TRACER tackles multi-LLM cooperative reasoning with interesting game-theoretic foundations, but its evaluation is limited to math/QA benchmarks and its scope is narrower. SAM's framework-agnostic design and broader applicability give it higher potential impact.

    vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents
    claude-opus-4.6 · 5/28/2026

    MemCog introduces a more paradigm-shifting conceptual framework (Memory-as-Cognition vs Memory-as-Tool), proposes novel architectural components (associative link graphs, cross-dimensional navigation, proactive reasoning), and creates a new benchmark (ProactiveMemBench) for an underexplored problem (proactive memory triggering). While SAM addresses important long-horizon reasoning with solid results, MemCog's broader reconceptualization of memory in AI agents, its proactive memory paradigm, and its potential to influence conversational AI design give it higher impact potential across multiple research directions.

    vs. The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence
    gpt-5.2 · 5/27/2026

    Paper 2 has higher estimated impact due to its broader, end-to-end contribution: a large-scale MoE model family, novel agent-driven data generation with executable/verifiable trajectories, and a scalable agent-native RL/training system (Forge) that likely generalizes across many agentic tasks. Its real-world applicability (coding, search, office workflows), timeliness (agentic RL, efficient inference via sparse activation), and potential to influence both systems and training practice are high. Paper 1 is a strong, focused memory framework but narrower in scope and likely incremental relative to existing retrieval/compression work.

    vs. PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design
    claude-opus-4.6 · 5/27/2026

    PolyFusionAgent addresses a critical bottleneck in polymer discovery by combining multimodal foundation models with agentic design workflows, offering direct real-world applications in energy storage, biomedicine, and materials science. Its novelty spans multimodal representation learning across polymer modalities, property-conditioned generation, and literature-grounded reasoning—a comprehensive end-to-end framework. While SAM presents a useful memory architecture for long-horizon reasoning agents, it is more incremental in nature, improving existing agent capabilities rather than enabling a new scientific paradigm. PolyFusionAgent's cross-disciplinary impact and actionable design capabilities give it higher potential impact.

    vs. Credit Assignment with Resets in Language Model Reasoning
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses a fundamental limitation in RL-based LLM training—uniform credit assignment across tokens—with a theoretically grounded solution (CPI framework) and practical methods (RRPO/SRPO) that require no external supervision. This has broader impact because it improves the core training paradigm used across reasoning tasks, is model-agnostic, and provides provable guarantees. Paper 1 presents a useful engineering framework for memory management in long-horizon agents, but is more incremental and narrower in scope. Credit assignment improvements could fundamentally change how reasoning models are trained across the field.

    vs. TaBIIC2: Interactive Building of Ontological Taxonomies using Weighted Self-Organizing Maps
    gemini-3.1 · 5/26/2026

    Paper 2 addresses a critical bottleneck in modern AI: long-horizon reasoning and memory management in Large Language Model (LLM) agents. The proposed State-Adaptive Memory (SAM) framework has broad applicability across numerous AI domains and tasks, aligning with highly active, fast-growing research trends. In contrast, Paper 1 focuses on ontology taxonomy construction using Self-Organizing Maps, which represents an incremental tool-building contribution in a narrower, more mature knowledge engineering subfield. Therefore, Paper 2 is highly likely to generate broader interest, citations, and real-world impact.

    vs. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models
    claude-opus-4.6 · 5/26/2026

    SAM addresses the increasingly critical problem of long-horizon agentic reasoning, which is at the forefront of current LLM research. Its framework for state-adaptive memory is broadly applicable across diverse agent backbones and tasks, and it tackles a fundamental limitation (memory management over long interactions) that affects nearly all agentic AI systems. While PALoRA makes a solid contribution to the plasticity-stability dilemma in fine-tuning, it operates in a more specialized niche (knowledge injection while preserving reasoning). SAM's broader applicability to the rapidly growing agent ecosystem and its retraining-free design give it higher potential impact.

    vs. Trust but Verify: Prover-Verifier Deliberation for Selective LLM Prediction
    claude-opus-4.6 · 5/26/2026

    SAM addresses a fundamental challenge in long-horizon agentic reasoning with a novel state-adaptive memory framework, demonstrating consistent improvements across multiple benchmarks and diverse agent backbones. Its broader applicability to any LLM agent system, combined with the practical framework requiring no retraining, gives it wider potential impact. Paper 2's prover-verifier deliberation is a useful contribution to selective prediction and confidence calibration, but it operates in a narrower scope (inference-time verification) and shows limitations when verifier competence is insufficient. SAM's contributions to memory-augmented reasoning are more foundational and broadly applicable.

    vs. Evolutionary Enhanced Multi-Agent Reinforcement Learning for Cooperative Air Combat
    gemini-3.1 · 5/26/2026

    Paper 2 addresses a fundamental challenge in the rapidly growing field of Large Language Model (LLM) agents: long-horizon reasoning and memory management. Its framework for state-adaptive memory has broad applicability across various AI domains, from web navigation to complex problem-solving. In contrast, Paper 1 focuses on a highly specialized, military-specific application (air combat using MARL), which, while methodologically sound, has a narrower scope and less potential for widespread cross-disciplinary impact.

    vs. RewardHarness: Self-Evolving Agentic Post-Training
    claude-opus-4.6 · 5/26/2026

    SAM addresses a fundamental challenge in LLM-based agentic reasoning—managing long interaction histories through state-adaptive memory—with broad applicability across diverse reasoning tasks and agent backbones. Its framework-level contribution (no retraining needed, works with any backbone) and strong results across multiple benchmarks suggest wide adoption potential. RewardHarness, while innovative in its data-efficient reward modeling for image editing, targets a narrower domain. SAM's contribution to the core infrastructure of agentic AI systems gives it broader impact potential across the rapidly growing field of LLM agents.

    vs. Democratizing Large-Scale Re-Optimization with LLM-Guided Model Patches
    gemini-3.1 · 5/26/2026

    Paper 1 addresses a fundamental limitation in autonomous LLM agents (long-horizon reasoning and memory management), offering a widely applicable framework with broad implications across the rapidly advancing AI field. Paper 2, while highly practical and valuable for industrial applications, represents a more niche application of LLMs within Operations Research, limiting its foundational scientific impact compared to the architectural innovations in Paper 1.

    vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
    gemini-3.1 · 5/26/2026

    Paper 2 addresses long-horizon reasoning and memory in LLM agents, a highly active and broadly applicable area in current AI research. Its state-adaptive memory framework offers scalable improvements for agentic systems across diverse domains. While Paper 1 presents a novel safety approach for diffusion LLMs, its impact is currently limited by the narrower adoption of D-LLMs compared to standard autoregressive models used in widespread agent architectures.

    vs. Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher impact: it tackles a broadly relevant, timely bottleneck (long-horizon agent memory) across many LLM-agent settings, with clear real-world applicability (tool-using agents, web tasks) and backbone-agnostic integration without retraining. Its evaluation spans multiple agent benchmarks and uses supervised + RL optimization, suggesting solid methodological rigor and practical robustness. Paper 1 is novel and theoretically interesting (Transformer–k-means correspondence) but is more specialized to text-attributed graphs and may have narrower immediate applicability beyond graph learning.

    vs. SkillOS: Learning Skill Curation for Self-Evolving Agents
    gemini-3.1 · 5/26/2026

    SkillOS introduces a novel, self-evolving approach to agentic skill curation using RL, enabling the formation of structured meta-skills over time. While SAM addresses the critical issue of long-context memory management, SkillOS's focus on autonomous self-improvement and generalized skill evolution presents a larger conceptual leap. Its ability to create continuously improving, generalizable agents positions it to have a broader and deeper impact on the development of lifelong learning AI systems.

    vs. Agent Manufacturing: Foundation-Model Agents as First-Class Industrial Entities
    gemini-3.1 · 5/26/2026

    Paper 1 presents a concrete, empirically validated technical framework addressing a critical bottleneck in LLM agents (long-horizon memory management), demonstrating measurable performance gains across multiple benchmarks. In contrast, Paper 2 is primarily a conceptual position paper defining a new industrial paradigm. Paper 1 offers higher scientific impact through its methodological rigor, immediate practical applications in AI, and quantitative evidence of its effectiveness.

    vs. Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems
    gemini-3.1 · 5/26/2026

    Paper 1 addresses a fundamental and broadly applicable challenge in AI—long-horizon reasoning and memory in LLM agents. Its proposed state-adaptive memory framework has the potential to impact numerous domains utilizing autonomous agents. In contrast, Paper 2 focuses on a highly specific application (portfolio management and crypto trading), making its potential scientific impact narrower despite strong empirical results.

    vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning
    gpt-5.2 · 5/26/2026

    Paper 2 (SAM) likely has higher impact: it targets a broadly important, timely problem—long-horizon agentic reasoning—and proposes a general, standalone memory framework applicable across tasks, tools, and model backbones. Its state-adaptive recall concept extends beyond a specific inference procedure and could influence retrieval, memory, and agent design across fields. The methodology appears stronger and more comprehensive (expert supervision + RL, multiple benchmarks including multilingual). Paper 1 (ArborKV) is innovative and useful for Tree-of-Thoughts efficiency, but its scope is narrower and more systems-specific.

    vs. Towards Multi-Turn Dialog Systems for Industrial Asset Operations and Maintenance
    gpt-5.2 · 5/26/2026

    Paper 2 is more novel and broadly applicable: it introduces a general, state-adaptive memory framework for long-horizon agent reasoning that can plug into diverse LLM agents without retraining, addressing a widely relevant bottleneck in agentic systems. It is evaluated across multiple established benchmarks and uses expert supervision plus RL, suggesting stronger methodological rigor and wider impact across domains (web agents, tool use, long-context reasoning). Paper 1 is valuable but more domain-specific (industrial O&M) and appears to offer mainly system-engineering improvements with narrower cross-field reach.

    vs. Agentic Proving for Program Verification
    gemini-3.1 · 5/26/2026

    Paper 2 proposes a novel, generalizable framework (SAM) for long-horizon reasoning in LLM agents, a highly active and broad area of AI research. By addressing fundamental memory management challenges, its methodology can be applied across various backbones and tasks. In contrast, Paper 1 is primarily an evaluation of a proprietary model on a specific program verification benchmark, which, while valuable for formal methods, offers narrower methodological innovation and less broad impact than introducing a new architectural paradigm.

    vs. Toward Enactive Artificial Intelligence
    claude-opus-4.6 · 5/26/2026

    Paper 2 (SAM) presents a concrete, novel framework with empirical results demonstrating improvements across multiple benchmarks and agent backbones. It addresses a pressing practical problem in LLM-based agents—long-horizon reasoning with adaptive memory—offering a reusable, backbone-agnostic solution. Paper 1, while intellectually valuable, is primarily a philosophical/theoretical advocacy paper suggesting incorporation of enactive principles into AI without presenting new algorithms, implementations, or empirical results. SAM's immediate applicability, methodological rigor with benchmarks, and relevance to the rapidly growing LLM agent ecosystem give it higher near-term scientific impact.