Doeon Kwon, Junho Bang
Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
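The pooled p-value quoted for the live run can be checked by hand. The sketch below assumes the exact McNemar test was two-sided and run over the 96 paired behind-wall targets, and that every pair flipped from false-visible to correct, which is what 1.000->0.000 implies. The function name is illustrative and does not come from the paper.

```python
from math import comb


def exact_mcnemar_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b = pairs wrong before and right after, c = pairs right before and
    wrong after. Under the null, each discordant pair is a fair coin flip,
    so the test reduces to a two-sided binomial test with p = 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Probability of a split at least as lopsided as (b, c) in one tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Assumed layout of the live run: all 96 targets are discordant in one direction.
p_live = exact_mcnemar_p(96, 0)
print(f"{p_live:.2e}")  # 2 * 0.5**96 = 2**-95, about 2.5e-29
```

Under these assumptions the result agrees with the reported p=2.5x10^-29, so the figure follows directly from the 96-of-96 flip rather than from any modelling choice.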
This paper investigates what geometric information spatial memory systems for language agents must store beyond text, using occlusion and visibility as the critical test case. The central argument is that "memory palace" systems, which anchor memories to 3D coordinates, gain their value not from blending spatial proximity into retrieval scores but from enabling geometric predicates, particularly visibility and occlusion checks, that text-only systems fundamentally cannot compute.
Three main results are presented: (1) the default linear blend of spatial proximity with recency/importance in memory-palace systems fails or hurts retrieval compared to a position-blind baseline, while geometry-led weighting succeeds; (2) memory recall (occlusion-blind) and visibility (a geometric predicate) must be separated, with a simple DDA ray-march fixing the gap; (3) live pre-registered confirmation on a running system validates the occlusion predicate.
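For readers unfamiliar with the ray-march in result (2), a minimal sketch of a voxel DDA visibility test follows. It assumes a unit voxel grid stored as a set of solid integer cells. The names `is_visible` and `solid` are illustrative and are not the authors' code, and the FoV-cone check that the abstract pairs with the DDA is omitted.

```python
import math


def is_visible(solid, eye, target):
    """Return True if no solid voxel lies between the eye's and target's cells.

    solid  : set of (x, y, z) integer cells that block sight (assumed layout)
    eye    : (x, y, z) float position of the viewer
    target : (x, y, z) float position of the remembered object
    """
    cell = [math.floor(c) for c in eye]
    goal = tuple(math.floor(c) for c in target)
    direction = [t - e for e, t in zip(eye, target)]  # ray parameter t in [0, 1]

    step, t_max, t_delta = [], [], []
    for e, d, c in zip(eye, direction, cell):
        if d > 0:
            step.append(1)
            t_max.append((c + 1 - e) / d)  # t at the next upper cell face
            t_delta.append(1 / d)          # t needed to cross one whole cell
        elif d < 0:
            step.append(-1)
            t_max.append((c - e) / d)      # t at the next lower cell face
            t_delta.append(-1 / d)
        else:
            step.append(0)                 # ray never crosses faces on this axis
            t_max.append(math.inf)
            t_delta.append(math.inf)

    while tuple(cell) != goal:
        axis = t_max.index(min(t_max))     # nearest face crossing comes first
        if t_max[axis] > 1.0:
            break                          # the ray has reached the target point
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        # The target's own cell is not treated as an occluder.
        if tuple(cell) != goal and tuple(cell) in solid:
            return False
    return True


# A wall filling the plane x == 2 between the viewer and a far target.
wall = {(2, y, z) for y in range(5) for z in range(5)}
eye = (0.5, 0.5, 0.5)
print(is_visible(wall, eye, (4.5, 0.5, 0.5)))  # False: the wall is in the way
print(is_visible(wall, eye, (1.5, 0.5, 0.5)))  # True: the target is in front of it
```

In this reading, recall still returns the behind-wall memory; the predicate only answers whether it can currently be seen, which keeps the two concerns separate as the paper argues.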
The paper shows commendable methodological discipline in several respects. The authors pre-register hypotheses and commit them to git before running experiments, a practice that is rare in AI research. They honestly report a prior negative result (the topic-partition study that turned out not to test spatial memory at all), which builds credibility. The confound checks (E1-E5) are well-designed: E1 isolates that the action-level gain comes from object binding rather than geometry per se, E3 demonstrates that an LLM given coordinates as text can match the DDA, and E5 stress-tests the DDA's failure modes on thin geometry.
However, several concerns limit the rigor:
The paper's conceptual contributions could influence how spatial memory systems are designed:
1. The recall/visibility separation is a clean architectural insight. Recall should be occlusion-blind (you remember what's behind a wall), while visibility is a separate perception predicate. This distinction, though intuitive once stated, is absent from most current systems.
2. The ranker-blend critique is practically useful. Demonstrating that the "memory palace" default of linearly blending spatial proximity with other signals can be net-negative is a cautionary finding for practitioners building agent memory systems.
3. The minimum-representation schema (Table 1) provides a useful taxonomy of what must be stored per query type.
However, the practical impact is bounded: the findings apply specifically to embodied agents in voxel worlds with authoritative geometry, a niche setting. The gap to real-world agents operating in noisy, reconstructed environments, where a DDA over perfect voxels does not apply, is wide and unaddressed.
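The ranker-blend critique in item 2 can be illustrated with a toy scorer. This is a sketch under stated assumptions, not the paper's ranker: the weights, the exponential proximity function, and the two example memories are all hypothetical and chosen only to show how a spatial term can be diluted in an even linear blend.

```python
import math


def proximity(query_pos, mem_pos, scale=8.0):
    """Map distance to (0, 1]; the exponential form and scale are assumptions."""
    return math.exp(-math.dist(query_pos, mem_pos) / scale)


def blended_score(mem, query_pos, weights):
    """Linear blend of recency, importance and spatial proximity."""
    w_recency, w_importance, w_spatial = weights
    return (w_recency * mem["recency"]
            + w_importance * mem["importance"]
            + w_spatial * proximity(query_pos, mem["pos"]))


def rank(memories, query_pos, weights):
    """Return memory ids ordered from best to worst score."""
    ordered = sorted(memories,
                     key=lambda m: blended_score(m, query_pos, weights),
                     reverse=True)
    return [m["id"] for m in ordered]


# Hypothetical memories: one close but stale, one distant but fresh and salient.
memories = [
    {"id": "chest_nearby", "pos": (1.0, 0.0, 0.0), "recency": 0.2, "importance": 0.3},
    {"id": "chat_far_away", "pos": (60.0, 0.0, 0.0), "recency": 0.9, "importance": 0.8},
]
query_pos = (0.0, 0.0, 0.0)

EVEN_BLEND = (1 / 3, 1 / 3, 1 / 3)  # illustrative weights, not the shipped ones
GEOMETRY_LED = (0.1, 0.1, 0.8)      # illustrative weights, not the paper's

print(rank(memories, query_pos, EVEN_BLEND))    # distant memory ranks first
print(rank(memories, query_pos, GEOMETRY_LED))  # nearby memory ranks first
```

For a spatial query such as "what is near me", the even blend lets recency and importance outvote distance, while the geometry-led weighting restores the nearby item. The paper's measured effect sizes come from its own experiments, not from this toy.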
The paper addresses a timely question as language agents increasingly operate in 3D environments and spatial grounding becomes more important. The agent memory literature (Generative Agents, MemGPT, A-MEM, Mem0, Zep) has grown rapidly, and the observation that none of these systems store spatial geometry or compute visibility predicates identifies a genuine gap. The positioning relative to render-as-recall systems (GSMem, RenderMem) and 3DSPMR is honest and well-articulated, with the claimed delta being measurement and isolation rather than method novelty.
This paper makes a conceptually clear but empirically preliminary contribution. The isolation of what spatial memory must store versus how it is read is a valid framing, and the pre-registration discipline is admirable. However, the central findings are largely confirmations of what is structurally obvious (text can't compute visibility), demonstrated in controlled synthetic environments that are far from realistic deployment. The most interesting results (E3's medium-independence finding, the blend-dilution measurement) are somewhat buried. This is honest, careful preliminary work that sets up a more impactful future study, but the impact of the present paper is limited.
Generated Jun 10, 2026
Paper 1 investigates foundational cognitive architectures for AI agents with high methodological rigor, including pre-registered experiments and robust statistical analyses. In contrast, Paper 2 is a solution tailored to a specific dataset challenge (AV2 2026 Scenario Mining Challenge), which typically has a narrower, more applied impact compared to foundational AI memory research.
Paper 2 demonstrates strong methodological rigor with pre-registered experiments and highly significant statistical results (p<10^-15) addressing a fundamental problem in AI agent architectures (spatial memory). In contrast, Paper 1 presents an exploratory study with a small sample size, limited expert agreement, and statistically non-significant findings. Therefore, Paper 2 offers more conclusive evidence and broader fundamental impact in AI systems development.
Paper 2 demonstrates higher potential scientific impact by addressing a fundamental architectural challenge in embodied language agents: spatial memory and occlusion. Its findings have broad applicability across AI, robotics, and virtual environments. Furthermore, Paper 2 exhibits exceptional methodological rigor through pre-registered experiments and precise statistical validation. While Paper 1 offers a highly valuable and practical applied framework for civil engineering, Paper 2 advances foundational AI capabilities, which typically yields a wider ripple effect across multiple computational and scientific disciplines.
Paper 2 has higher likely scientific impact: it introduces a clear, testable criterion (occlusion/visibility) for evaluating spatial memory in language agents, backed by pre-registered experiments, strong quantitative effects, and a live confirmatory run that found and fixed a real system defect. Its implications generalize across embodied AI, robotics, AR/VR, agent memory architectures, and evaluation methodology. Paper 1 is timely and practically relevant for AI regulation, but its contribution is more domain-specific (legal/credit-scoring workflows) and less likely to propagate broadly across technical fields.
Paper 2 introduces a novel framework (MCPS) combining MCTS-style evaluation with 3D ball tracking data for football pass evaluation, bridging sports analytics, multi-agent trajectory prediction, and counterfactual reasoning. It adapts methods from autonomous driving (SMART) to a new domain, releases code/checkpoints, and uses a novel public dataset. Paper 1 addresses a narrower problem (occlusion handling in language-agent memory palaces) with results that the authors themselves acknowledge as 'near-tautological,' and the confirmatory studies remain future work. Paper 2 has broader cross-domain impact, stronger methodological novelty, and more immediate practical applications.
Paper 2 addresses a broader, more impactful problem—public health policy optimization during pandemics—with wider real-world applicability. It combines hierarchical reinforcement learning with uncertainty-aware policy gradients in a novel framework integrating individual behavior and policy uncertainties, relevant across epidemiology, economics, and AI. Paper 1, while methodologically rigorous with pre-registration, addresses a niche problem in language-agent spatial memory systems (occlusion handling) with limited breadth of impact, and the authors themselves acknowledge the contribution is 'near-tautological' with key confirmatory studies still pending.
Paper 1 likely has higher impact due to broader applicability and timeliness: scalable, automated benchmarking for realistic agent tasks in state-based OS-like environments addresses a central bottleneck in deploying LLM agents. Its framework can be adopted across many tasks, models, and research groups, influencing evaluation standards and accelerating progress. Paper 2 is methodologically rigorous (pre-registration, strong statistics) and offers clear insights for spatial-memory systems, but its scope is narrower (occlusion/visibility in spatial recall) and closer to a specialized diagnostic/ablation than a general infrastructure contribution.
Paper 1 addresses LLM unlearning, a timely and broadly impactful problem in AI safety with significant real-world applications (privacy, harmful knowledge removal). It presents a principled method (NSRU) with strong theoretical grounding and comprehensive experiments on established benchmarks (TOFU, WMDP). Paper 2 tackles a niche problem in spatial memory for language agents, with contributions that are self-admittedly 'near-tautological' and remain at the pilot stage with confirmatory studies left as future work. Paper 1's broader relevance to AI safety, methodological rigor, and completeness give it substantially higher impact potential.
Paper 1 presents a highly practical, scalable solution to a complex, real-world industrial problem (mine scheduling) by effectively bridging Large Language Models with Operations Research. Demonstrating that zero-shot LLMs can achieve near-optimal results compared to computationally expensive MILP baselines while scaling linearly offers massive economic potential and establishes a strong precedent for using LLMs in complex, constrained industrial scheduling tasks.
Paper 2 addresses a critical real-world problem—medical literature summarization for clinical decision-making—with high relevance to healthcare. Its rigorous, blinded evaluation by medical experts provides valuable insights into LLM utility in medicine, offering broader societal and cross-disciplinary impact compared to Paper 1's niche focus on language agent architecture.