Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

Jun 10, 2026 · arXiv:2606.11680v1

cs.AI · cs.CL · cs.LG

#545 of 3489 · Artificial Intelligence

Tournament Score: 1479 ± 48

Win Rate: 75%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Abstract

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: HORMA — Hierarchical Organize-and-Retrieve Memory Agent

1. Core Contribution

HORMA addresses a genuine and increasingly important problem: how LLM agents can maintain effective working memory over long-horizon tasks without succumbing to context window limitations. The key insight is the explicit decoupling of memory construction from memory retrieval, operationalized through a shared file-system-like hierarchical workspace. This is a meaningful architectural contribution because it resolves a credit assignment problem that plagues joint optimization approaches — when a task fails, it becomes ambiguous whether the fault lies in how memory was organized or how it was retrieved.
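To make the file-system-like workspace more concrete, the following is a minimal Python sketch of a hierarchy in which short summaries are linked to raw trajectories. It is an illustration under stated assumptions, not the paper's implementation: the names `MemoryNode`, `resolve`, `ls`, and `cat`, as well as the example paths and trajectory text, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryNode:
    """One entry in a hypothetical file-system-like memory (not the paper's schema)."""
    name: str
    summary: str                      # short description shown when listing
    raw: Optional[str] = None         # full raw trajectory, kept on leaf entries
    children: Dict[str, "MemoryNode"] = field(default_factory=dict)

    def add(self, child: "MemoryNode") -> "MemoryNode":
        self.children[child.name] = child
        return child


def resolve(root: MemoryNode, path: str) -> MemoryNode:
    """Walk a '/'-separated path from the root; raises KeyError on a bad path."""
    node = root
    for part in [p for p in path.split("/") if p]:
        node = node.children[part]
    return node


def ls(root: MemoryNode, path: str = "/") -> List[str]:
    """List child names with their summaries; no raw text is loaded."""
    node = resolve(root, path)
    return [f"{c.name}: {c.summary}" for c in node.children.values()]


def cat(root: MemoryNode, path: str) -> str:
    """Return the raw trajectory behind a summary, or the summary if none is stored."""
    node = resolve(root, path)
    return node.raw if node.raw is not None else node.summary


# Hypothetical content, invented for illustration only.
root = MemoryNode("/", "workspace root")
kitchen = root.add(MemoryNode("kitchen", "objects seen in the kitchen"))
kitchen.add(MemoryNode("step_03", "found mug in cabinet 2",
                       raw="> open cabinet 2\nYou see a mug 1."))

print(ls(root, "/kitchen"))
print(cat(root, "/kitchen/step_03"))
```

In this sketch, a retriever can browse summaries cheaply with `ls` and pay the token cost of the raw trajectory only when it calls `cat`, which is the property the assessment attributes to the hierarchical design.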

The memory construction module refines its skills iteratively through contrastive analysis (comparing failures with and without structured memory), sorting cases into exogenous ones (information lost through structuring) and endogenous ones (benefits from filtering noise). The retrieval module formulates context selection as a navigation problem, solved by a lightweight RL-trained agent that issues Bash-like commands over the file hierarchy. The evidence-grounded Jaccard reward provides direct, decoupled supervision for retrieval quality.
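As a rough illustration of how a Jaccard-style retrieval reward behaves, the sketch below computes the standard Jaccard index between the set of retrieved items and the set of annotated evidence items. The function name `jaccard_reward`, the item IDs, and the handling of the empty case are assumptions; the assessment does not give the paper's exact reward definition.

```python
from typing import Hashable, Iterable


def jaccard_reward(retrieved: Iterable[Hashable], evidence: Iterable[Hashable]) -> float:
    """Standard Jaccard index |R & E| / |R | E| over retrieved and gold evidence IDs."""
    r, e = set(retrieved), set(evidence)
    if not r and not e:
        # Assumption: retrieving nothing when nothing is needed counts as perfect.
        return 1.0
    return len(r & e) / len(r | e)


# Hypothetical item IDs, invented for illustration.
exact = jaccard_reward({"s2"}, {"s2"})                        # retrieved exactly the evidence
over = jaccard_reward({"s1", "s2", "s3", "s4"}, {"s2"})       # evidence plus three extras
partial = jaccard_reward({"s1", "s2", "s3"}, {"s2", "s3", "s4"})

print(exact, over, partial)
```

Because the union appears in the denominator, retrieving extra items lowers the score even when all evidence is covered, which matches the stated goal of selecting minimal yet sufficient context.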

2. Methodological Rigor

The experimental design is thorough, spanning three diverse benchmarks: ALFWorld (embodied interaction), LoCoMo (long conversational QA), and LongMemEval (multi-session chat). The comparison against a reasonable spread of baselines — truncation, sliding window, embedding retrieval, ReSum, Acon, HIAGENT, Fold, A-MEM, and Mem0 — provides solid grounding.

Strengths in rigor:

The ablation study (Table 3) systematically isolates contributions of skill evolution and RL-trained retrieval, using both Claude Sonnet 4.5 and Qwen 3.5 4B as retrievers.

The cross-backbone analysis (Figure 3a, Table 7) empirically validates that memory construction quality bottlenecks overall performance — a strong finding supporting the decoupled architecture.

The error attribution analysis (Figure 3b) distinguishes reasoning errors from retrieval errors, providing mechanistic insight into why HORMA outperforms embedding-based retrieval on temporal tasks.

OOD generalization is tested by training the retrieval agent only on LoCoMo and evaluating zero-shot on ALFWorld and LongMemEval.

Concerns:

The primary agent, memory manager, and retriever all use Claude Sonnet 4.5 in the main experiments, which is a very capable (and expensive) proprietary model. The practical cost of running three LLM agents per step is not thoroughly analyzed.

The evidence-grounded reward requires ground-truth evidence annotations, limiting applicability to settings where such supervision exists. The authors acknowledge this but don't quantify the impact of noisy or approximate evidence.

LoCoMo training uses only 7 conversations (1089 QA instances) — while the OOD results are impressive, the scale of RL training is modest.

Context budgets (1950/2200 tokens for ALFWorld, 10K for LoCoMo, 50K for LongMemEval) are somewhat arbitrarily chosen. Sensitivity analysis across a range of budgets would strengthen claims.

3. Potential Impact

The work has broad applicability across several domains:

Agentic systems: Any multi-step agent operating under context constraints could benefit from hierarchical memory organization — web agents, coding assistants, robotic planners.

Long-context conversation systems: The dramatic reduction in token usage (to 3-22% of the baseline), with maintained or improved performance, is practically significant for deployment cost.

Memory architecture design: The decoupling principle (construct vs. retrieve) offers a reusable design pattern beyond this specific implementation.

The file-system metaphor for memory organization is intuitive and interpretable, which may accelerate adoption. The skill evolution mechanism provides a form of continual learning without parameter updates, which is appealing for production systems.

However, the reliance on frontier LLMs for memory management (shown necessary by Figure 3a) limits accessibility. The finding that "flawed memory organization cannot be compensated for by high-quality retrieval" is important but also constraining — it means the approach fundamentally depends on expensive models for the construction phase.

4. Timeliness & Relevance

This work is highly timely. The explosion of agentic AI systems (2024-2026) has made working memory management a critical bottleneck. The paper sits at the intersection of several active research threads: RL for LLMs (DeepSeek-R1, Search-R1), memory-augmented agents (MemGPT, A-MEM, Mem0, Memory-R1), and context management (Acon, ReSum). The credit assignment argument for decoupling is theoretically well-motivated and practically relevant as the field moves toward more complex multi-step agent tasks.

5. Strengths & Limitations

Key Strengths:

The decoupling of memory construction and retrieval is well-motivated theoretically (credit assignment gap) and validated empirically.

The file-system metaphor provides natural interpretability — users can inspect memory state, understand organization, and debug failures.

Strong OOD generalization: a 4B parameter retriever trained on conversational data transfers effectively to embodied tasks, suggesting learned navigation strategies are domain-agnostic.

The Pareto efficiency analysis (Figure 2) is a compelling way to demonstrate efficiency-performance trade-offs.

Comprehensive skill examples (Appendix C) demonstrate the system produces interpretable, domain-specific memory management strategies.
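For readers unfamiliar with the Pareto efficiency analysis mentioned above, here is a small Python sketch of how a Pareto front over (token usage, task score) pairs can be computed. The function name `pareto_front` and the method labels and numbers are made up for illustration; they are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (method label, tokens used, task score)


def pareto_front(points: List[Point]) -> List[str]:
    """Return labels of methods that no other method dominates.

    A method is dominated if another one uses no more tokens and scores
    at least as high, and is strictly better on at least one of the two.
    """
    front = []
    for name, tok, score in points:
        dominated = any(
            (t2 <= tok and s2 >= score) and (t2 < tok or s2 > score)
            for n2, t2, s2 in points
            if n2 != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up numbers purely for illustration.
methods = [
    ("A", 1000.0, 0.60),
    ("B", 4000.0, 0.70),
    ("C", 5000.0, 0.65),   # dominated by B: more tokens, lower score
    ("D", 9000.0, 0.72),
]
print(pareto_front(methods))  # ['A', 'B', 'D']
```

A method on the front offers a trade-off that no competitor beats on both axes at once, which is why such a plot is a natural way to present efficiency-performance claims.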

Notable Limitations:

Computational overhead: Running three LLM agents (primary, manager, retriever) potentially increases total compute despite reducing per-step token usage. The paper reports retrieval calls (~4-5 per step) but doesn't provide wall-clock time comparisons.

Dependence on frontier models: Memory construction requires Claude Sonnet 4.5-level capability, undermining claims of scalability for resource-constrained settings.

Limited RL training scale: Training on only 1089 instances from 7 conversations, then testing OOD, raises questions about whether the RL training meaningfully shaped the policy versus simply teaching basic tool-use patterns.

Evaluation metrics: LLM-as-judge scores are the primary metric for conversational benchmarks, with F1 relegated to the appendix. The reliability of Claude evaluating Claude-generated answers deserves scrutiny.

No comparison with recent RL-based memory methods like Memory-T1 or MemRL in the main tables, despite citing them.
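On the evaluation-metrics point above, the F1 score reported alongside LLM-as-judge scores in QA benchmarks is commonly a token-overlap F1. The sketch below shows that common form in Python; it is an assumption that the paper uses this variant, and the function name `token_f1`, the whitespace tokenization, and the example strings are illustrative only.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Uses lowercased whitespace tokens; real evaluation scripts often also
    strip punctuation and articles, which this sketch does not do.
    """
    pred = prediction.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers, invented for illustration.
score = token_f1("in the kitchen", "the kitchen")
print(round(score, 3))  # 0.8
```

Unlike a judge model, this metric is deterministic and independent of the model family that generated the answer, but it penalizes correct paraphrases, which is one reason the two kinds of score are usually reported together.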

Overall Assessment

HORMA presents a well-structured and well-motivated approach to a pressing problem. The decoupling insight is the paper's strongest theoretical contribution, and the empirical results across diverse benchmarks are convincing. The token efficiency gains are substantial and practically meaningful. However, the reliance on frontier models for memory construction and the computational overhead of multi-agent orchestration temper the practical impact. The work advances the field's understanding of how to design memory systems for LLM agents and should influence future architectural decisions in this space.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Generated Jun 11, 2026

Comparison History (20)

Won vs. The Art of Interrogation: Consistency Amplifies Factuality in Spatial Reasoning

Paper 2 addresses a fundamental and widespread bottleneck in LLM agents: handling long-horizon tasks and managing context length. Its hierarchical memory approach cuts token usage sharply (by roughly 78% or more in long conversation tasks) while maintaining or improving reasoning quality across general tasks. While Paper 1 presents an innovative self-supervised RL approach for spatial reasoning, Paper 2's focus on general agentic memory and efficiency offers broader applicability and higher potential impact across various real-world AI agent deployments.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

Paper 1 addresses a fundamental challenge in LLM agents—efficient memory management for long-horizon tasks—with broad applicability across AI systems. Its hierarchical memory architecture with RL-trained navigation is novel and demonstrates strong results across multiple benchmarks with significant efficiency gains. Paper 2, while solving an important domain-specific problem in BIM compliance checking, targets a narrower audience (AEC industry) with more incremental improvements (8.6% over baselines). Paper 1's contributions to agent architectures and memory mechanisms have broader cross-field impact and greater timeliness given the rapid growth of LLM agent research.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Can AI Agents Synthesize Scientific Conclusions?

Paper 1 addresses a critical and timely problem—evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. SciConBench introduces a large-scale benchmark with a novel clean-room evaluation methodology that reveals significant limitations of frontier models and consumer-facing AI products. Its findings about data leakage inflating performance and the unreliability of deployed systems (Google AI Overview, OpenEvidence) have broad implications for AI safety, policy, and scientific practice. Paper 2 presents a solid engineering contribution for agent memory management, but its impact is more incremental and narrower in scope compared to Paper 1's foundational evaluation framework for scientific AI reliability.

claude-opus-4-6 · Jun 11, 2026

Won vs. SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

HORMA addresses a fundamental and broadly relevant problem—efficient memory management for LLM agents in long-horizon tasks—with a comprehensive hierarchical approach combining structured memory construction and RL-trained navigation-based retrieval. It demonstrates strong results across multiple benchmarks (ALFWorld, LoCoMo, LongMemEval) with significant efficiency gains (22.17% token usage). Paper 2 (SkillJuror) provides useful but narrower insights about skill organization's effect on agent behavior, with modest outcome improvements (+4.1%). HORMA's broader applicability, stronger methodological contributions, and more impactful efficiency-performance trade-offs give it higher potential impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Paper 2 likely has higher scientific impact due to greater novelty and broader applicability: it introduces a general hierarchical memory organization and RL-based navigation mechanism addressing a widely recognized bottleneck for LLM agents (long-horizon, statelessness, context cost). It is evaluated across multiple standard benchmarks with clear efficiency-performance gains, suggesting methodological rigor and timeliness for agent research. Paper 1 is impactful for a specific civil/structural engineering workflow and demonstrates strong applied value, but its domain specificity and reliance on existing agent orchestration concepts may limit cross-field impact compared to a broadly reusable agent-memory contribution.

gpt-5.2 · Jun 11, 2026

Won vs. IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

HORMA addresses a fundamental and broadly applicable challenge in LLM agents—efficient memory management for long-horizon tasks—with a novel hierarchical organization approach combining structured construction and RL-based retrieval. It demonstrates strong results across multiple benchmarks with significant efficiency gains (22% token usage). Paper 2, while innovative in creativity assessment via dialogue optimization, addresses a narrower niche at the intersection of educational assessment and AI. HORMA's contributions are more likely to influence the rapidly growing LLM agent ecosystem, giving it broader impact potential across multiple fields and applications.

claude-opus-4-6 · Jun 11, 2026

Won vs. Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

Paper 2 (HORMA) is likely to have higher scientific impact due to its broadly applicable, novel hierarchical memory + navigation retrieval mechanism for LLM agents, addressing a timely bottleneck (long-horizon, cost/latency constraints) across many domains. It reports multi-benchmark gains and strong efficiency improvements under constrained context budgets, suggesting methodological rigor and generalizability. Paper 1 is valuable and applied, but its impact is narrower (FE modeling for bridge barriers, specific toolchains) and depends more on human-in-the-loop process integration than a generally reusable algorithmic advance.

gpt-5.2 · Jun 11, 2026

Won vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

HORMA addresses a fundamental and broadly applicable challenge—efficient memory management for LLM agents in long-horizon tasks—with a novel hierarchical organization and RL-based retrieval approach. It has wider applicability across diverse agent tasks, demonstrates strong efficiency gains (22% token usage), and tackles the practical bottleneck of context scaling. While RecToM achieves impressive results on ToM benchmarks (100% on Hi-ToM), it addresses a narrower problem domain. HORMA's combination of hierarchical memory structure, RL-trained navigation, and demonstrated generalization to unseen tasks suggests broader impact across the rapidly growing LLM agent ecosystem.

claude-opus-4-6 · Jun 11, 2026

Won vs. When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Paper 2 likely has higher scientific impact due to stronger real-world applicability and broader relevance: hierarchical memory for LLM agents addresses a widely encountered bottleneck (context limits, cost/latency) across many agentic tasks. The approach is technically substantive (structured memory + RL navigation), evaluated on multiple established benchmarks with clear efficiency gains, and should transfer across domains (robotics, assistants, software agents). Paper 1 is novel and timely for safety evaluation, but is narrower in scope (specific diagnostic/dataset, limited scenarios) and its primary impact is methodological within alignment research rather than enabling capabilities broadly.

gpt-5.2 · Jun 11, 2026

Won vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

Paper 2 has higher estimated scientific impact due to a novel, generalizable method for long-horizon LLM agents (hierarchical memory plus RL-based navigation) addressing a widely recognized bottleneck (context limits, cost, latency). It demonstrates methodological rigor with multi-benchmark evaluation and clear efficiency/performance gains, and has broad applicability across agentic systems, RAG, and robotics/task automation—highly timely given rapid adoption of LLM agents. Paper 1 is important for AI governance clarity but is more domain- and jurisdiction-specific, with narrower technical spillover beyond regulatory/credit-scoring contexts.

gpt-5.2 · Jun 11, 2026
