Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He
Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.
HORMA addresses a genuine and increasingly important problem: how LLM agents can maintain effective working memory over long-horizon tasks without succumbing to context window limitations. The key insight is the explicit decoupling of memory construction from memory retrieval, operationalized through a shared file-system-like hierarchical workspace. This is a meaningful architectural contribution because it resolves a credit assignment problem that plagues joint optimization approaches — when a task fails, it becomes ambiguous whether the fault lies in how memory was organized or how it was retrieved.
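To make the file-system metaphor concrete, the sketch below models a hierarchy in which each node carries a summary and leaves may link to a raw trajectory, as the abstract describes. It is an illustration only, not the authors' implementation: the `Node` class, the `ls` and `cat` helpers, and the example paths are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One entry in a hypothetical hierarchical memory workspace."""
    summary: str                       # short summary shown while navigating
    raw: Optional[str] = None          # raw trajectory text, if this node links to one
    children: Dict[str, "Node"] = field(default_factory=dict)


def resolve(root: Node, path: str) -> Node:
    """Walk a slash-separated path from the root and return the node found there."""
    node = root
    for part in path.strip("/").split("/"):
        if part:  # skip empty parts so that "/" resolves to the root
            node = node.children[part]
    return node


def ls(root: Node, path: str = "/") -> List[str]:
    """List child names at a path, loosely analogous to a Bash `ls`."""
    return sorted(resolve(root, path).children)


def cat(root: Node, path: str) -> str:
    """Return the raw trajectory if one is linked, otherwise the summary."""
    node = resolve(root, path)
    return node.raw if node.raw is not None else node.summary


# Hypothetical example data: one task with a single recorded step.
root = Node(
    summary="all experience",
    children={
        "task1": Node(
            summary="put mug in fridge",
            children={"step1": Node(summary="opened fridge", raw="obs: fridge is open")},
        )
    },
)
```

A retrieval agent could call `ls(root, "/task1")` to see the available entries cheaply and then `cat(root, "/task1/step1")` only for the entry it needs, which is the sense in which summaries give efficient access while the detailed record stays available.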
The memory construction module refines its structuring skill iteratively through contrastive analysis, comparing task failures with and without structured memory and categorizing errors as exogenous (information loss from structuring) or endogenous (benefits from filtering noise). The retrieval module formulates context selection as a navigation problem, solved by a lightweight RL-trained agent that issues Bash-like commands over the file hierarchy. The evidence-grounded Jaccard reward provides direct, decoupled supervision for retrieval quality.
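As a rough illustration of how a Jaccard-style reward can score retrieval on its own, the sketch below computes the standard Jaccard index between the set of retrieved memory entries and a set of evidence entries. The function name, the use of string identifiers, and the handling of the empty case are assumptions made for this example; the exact reward used in the paper may differ.

```python
from typing import Set


def jaccard_reward(retrieved: Set[str], evidence: Set[str]) -> float:
    """Jaccard index |retrieved & evidence| / |retrieved | evidence|.

    Assumption: entries are identified by string IDs (hypothetical), and
    retrieving nothing when nothing is needed earns full reward.
    """
    union = retrieved | evidence
    if not union:
        return 1.0  # assumed convention for the empty/empty case
    return len(retrieved & evidence) / len(union)


# Hypothetical entry IDs: one of three distinct entries is shared.
score = jaccard_reward({"task1/step1", "task1/step2"}, {"task1/step2", "task2/step1"})
```

Because the score falls both when evidence is missed and when extra entries are retrieved, a reward of this shape pushes toward context that is minimal yet sufficient, and it depends only on what was retrieved rather than on downstream task success.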
The experimental design is thorough, spanning three diverse benchmarks: ALFWorld (embodied interaction), LoCoMo (long conversational QA), and LongMemEval (multi-session chat). The comparison against a reasonable spread of baselines — truncation, sliding window, embedding retrieval, ReSum, Acon, HIAGENT, Fold, A-MEM, and Mem0 — provides solid grounding.
The work has broad applicability across several domains.
The file-system metaphor for memory organization is intuitive and interpretable, which may accelerate adoption. The skill evolution mechanism provides a form of continual learning without parameter updates, which is appealing for production systems.
However, the reliance on frontier LLMs for memory management (shown necessary by Figure 3a) limits accessibility. The finding that "flawed memory organization cannot be compensated for by high-quality retrieval" is important but also constraining — it means the approach fundamentally depends on expensive models for the construction phase.
This work is highly timely. The explosion of agentic AI systems (2024-2026) has made working memory management a critical bottleneck. The paper sits at the intersection of several active research threads: RL for LLMs (DeepSeek-R1, Search-R1), memory-augmented agents (MemGPT, A-MEM, Mem0, Memory-R1), and context management (Acon, ReSum). The credit assignment argument for decoupling is theoretically well-motivated and practically relevant as the field moves toward more complex multi-step agent tasks.
HORMA presents a well-structured and well-motivated approach to a pressing problem. The decoupling insight is the paper's strongest theoretical contribution, and the empirical results across diverse benchmarks are convincing. The token efficiency gains are substantial and practically meaningful. However, the reliance on frontier models for memory construction and the computational overhead of multi-agent orchestration temper the practical impact. The work advances the field's understanding of how to design memory systems for LLM agents and should influence future architectural decisions in this space.
Generated Jun 11, 2026
Paper 2 addresses a fundamental and widespread bottleneck in LLM agents: handling long-horizon tasks and managing context length. Its hierarchical memory approach cuts token usage by roughly 78% or more in long conversation tasks (at most 22.17% of baseline usage) while maintaining or improving reasoning quality across general tasks. While Paper 1 presents an innovative self-supervised RL approach for spatial reasoning, Paper 2's focus on general agentic memory and efficiency offers broader applicability and higher potential impact across various real-world AI agent deployments.
Paper 1 addresses a fundamental challenge in LLM agents—efficient memory management for long-horizon tasks—with broad applicability across AI systems. Its hierarchical memory architecture with RL-trained navigation is novel and demonstrates strong results across multiple benchmarks with significant efficiency gains. Paper 2, while solving an important domain-specific problem in BIM compliance checking, targets a narrower audience (AEC industry) with more incremental improvements (8.6% over baselines). Paper 1's contributions to agent architectures and memory mechanisms have broader cross-field impact and greater timeliness given the rapid growth of LLM agent research.
Paper 1 addresses a critical and timely problem—evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. SciConBench introduces a large-scale benchmark with a novel clean-room evaluation methodology that reveals significant limitations of frontier models and consumer-facing AI products. Its findings about data leakage inflating performance and the unreliability of deployed systems (Google AI Overview, OpenEvidence) have broad implications for AI safety, policy, and scientific practice. Paper 2 presents a solid engineering contribution for agent memory management, but its impact is more incremental and narrower in scope compared to Paper 1's foundational evaluation framework for scientific AI reliability.
HORMA addresses a fundamental and broadly relevant problem, efficient memory management for LLM agents in long-horizon tasks, with a comprehensive hierarchical approach combining structured memory construction and RL-trained navigation-based retrieval. It demonstrates strong results across multiple benchmarks (ALFWorld, LoCoMo, LongMemEval) with significant efficiency gains (at most 22.17% of baseline token usage in long conversation tasks). Paper 2 (SkillJuror) provides useful but narrower insights about skill organization's effect on agent behavior, with modest outcome improvements (+4.1%). HORMA's broader applicability, stronger methodological contributions, and more impactful efficiency-performance trade-offs give it higher potential impact.
Paper 2 likely has higher scientific impact due to greater novelty and broader applicability: it introduces a general hierarchical memory organization and RL-based navigation mechanism addressing a widely recognized bottleneck for LLM agents (long-horizon, statelessness, context cost). It is evaluated across multiple standard benchmarks with clear efficiency-performance gains, suggesting methodological rigor and timeliness for agent research. Paper 1 is impactful for a specific civil/structural engineering workflow and demonstrates strong applied value, but its domain specificity and reliance on existing agent orchestration concepts may limit cross-field impact compared to a broadly reusable agent-memory contribution.
HORMA addresses a fundamental and broadly applicable challenge in LLM agents, efficient memory management for long-horizon tasks, with a novel hierarchical organization approach combining structured construction and RL-based retrieval. It demonstrates strong results across multiple benchmarks with significant efficiency gains (roughly 22% of baseline token usage in long conversation tasks). Paper 2, while innovative in creativity assessment via dialogue optimization, addresses a narrower niche at the intersection of educational assessment and AI. HORMA's contributions are more likely to influence the rapidly growing LLM agent ecosystem, giving it broader impact potential across multiple fields and applications.
Paper 2 (HORMA) is likely to have higher scientific impact due to its broadly applicable, novel hierarchical memory + navigation retrieval mechanism for LLM agents, addressing a timely bottleneck (long-horizon, cost/latency constraints) across many domains. It reports multi-benchmark gains and strong efficiency improvements under constrained context budgets, suggesting methodological rigor and generalizability. Paper 1 is valuable and applied, but its impact is narrower (FE modeling for bridge barriers, specific toolchains) and depends more on human-in-the-loop process integration than a generally reusable algorithmic advance.
HORMA addresses a fundamental and broadly applicable challenge, efficient memory management for LLM agents in long-horizon tasks, with a novel hierarchical organization and RL-based retrieval approach. It has wider applicability across diverse agent tasks, demonstrates strong efficiency gains (roughly 22% of baseline token usage in long conversation tasks), and tackles the practical bottleneck of context scaling. While RecToM achieves impressive results on ToM benchmarks (100% on Hi-ToM), it addresses a narrower problem domain. HORMA's combination of hierarchical memory structure, RL-trained navigation, and demonstrated generalization to unseen tasks suggests broader impact across the rapidly growing LLM agent ecosystem.
Paper 2 likely has higher scientific impact due to stronger real-world applicability and broader relevance: hierarchical memory for LLM agents addresses a widely encountered bottleneck (context limits, cost/latency) across many agentic tasks. The approach is technically substantive (structured memory + RL navigation), evaluated on multiple established benchmarks with clear efficiency gains, and should transfer across domains (robotics, assistants, software agents). Paper 1 is novel and timely for safety evaluation, but is narrower in scope (specific diagnostic/dataset, limited scenarios) and its primary impact is methodological within alignment research rather than enabling capabilities broadly.
Paper 2 has higher estimated scientific impact due to a novel, generalizable method for long-horizon LLM agents (hierarchical memory plus RL-based navigation) addressing a widely recognized bottleneck (context limits, cost, latency). It demonstrates methodological rigor with multi-benchmark evaluation and clear efficiency/performance gains, and has broad applicability across agentic systems, RAG, and robotics/task automation—highly timely given rapid adoption of LLM agents. Paper 1 is important for AI governance clarity but is more domain- and jurisdiction-specific, with narrower technical spillover beyond regulatory/credit-scoring contexts.