What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Doeon Kwon, Junho Bang

Jun 9, 2026 · arXiv:2606.10299v1

cs.AI · cs.CV · cs.MA

#3111 of 3489 · Artificial Intelligence

Tournament Score: 1267 ± 43 (scale 1050 to 1800)

Win Rate: 29%

Rating: 4.2 / 10

Significance 4 · Rigor 6 · Novelty 3.5 · Clarity 4.5

Abstract

Language-agent "memory palace" systems anchor each memory to a world coordinate, on the intuition that geometry adds something text cannot. We make that intuition testable and report three results. First, the memory-palace default of folding spatial proximity into a linear blend beside recency and importance does not help and can hurt: in a pre-registered recall experiment the shipped blend fails its own frozen test (mean Delta-Hit@5 -0.0375, Wilcoxon p=0.306), sitting at a position-blind baseline, while a geometry-led weighting wins decisively (+0.3208, p<10^-15): geometry must lead recall when the query regime is spatial. Second, memory recall and visibility must be separated: recall is occlusion-blind by design (you correctly remember the next room behind a wall), while visibility is a perception predicate over stored geometry that the live system never computed. A one-line ray-versus-voxel digital differential analyzer (DDA), re-pointed from the gaze ray the agent already casts, supplies it: text and the live FoV cone both score 0.000 on 849 behind-wall targets while cone-plus-DDA reaches 0.982 (exact McNemar p<10^-6); coordinate recall separately resolves near-duplicate locations a cosine null cannot (1.000 vs 0.533, n=150). Third, the visibility predicate is confirmed live under a git-committed pre-registration (SPMEM-OCC-LIVE-v1: eight scripted worlds, automated oracle scoring, 96 behind-wall targets, false-visible 1.000->0.000, pooled exact McNemar p=2.5x10^-29), a run that surfaced and fixed a real relay anchor defect. We concede that occlusion-needs-geometry is near-tautological; the contribution is the measurement and isolation, separating what spatial memory must store from how it is read. These pilots power a frozen confirmatory study (SPMEM-ZERO-REAL-PREREG-v1); the full human-authored multi-world study with blind raters remains future work.
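
The pooled live p-value can be sanity-checked by hand. The exact McNemar test is a binomial test on the discordant pairs only, so the sketch below needs just two counts. It assumes that a false-visible rate of 1.000 -> 0.000 on 96 targets means all 96 pairs are discordant in the same direction; the function name `exact_mcnemar` is hypothetical and is not taken from the paper's code.

```python
from math import comb


def exact_mcnemar(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: pairs where only the baseline erred.
    c: pairs where only the new method erred.
    Under the null each discordant pair is a fair coin flip, so the test
    reduces to a binomial test with n = b + c and success probability 0.5.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    # Probability of a split at least as lopsided as the observed one (one tail).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2.0 * tail)


# Assumed reading of the live run: 96 discordant pairs, all favouring cone-plus-DDA.
p_live = exact_mcnemar(96, 0)
print(f"{p_live:.2e}")  # 2.52e-29
```

Under that assumption the result is 2^-95, about 2.5x10^-29, which agrees with the figure quoted in the abstract.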

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates what geometric information spatial memory systems for language agents must store beyond text, using occlusion/visibility as the critical test case. The central argument is that "memory palace" systems (which anchor memories to 3D coordinates) gain their value not from blending spatial proximity into retrieval scores, but from enabling geometric predicates—particularly visibility/occlusion checks—that text-only systems fundamentally cannot compute.

Three main results are presented: (1) the default linear blend of spatial proximity with recency/importance in memory-palace systems fails or hurts retrieval compared to a position-blind baseline, while geometry-led weighting succeeds; (2) memory recall (occlusion-blind) and visibility (a geometric predicate) must be separated, with a simple DDA ray-march fixing the gap; (3) live pre-registered confirmation on a running system validates the occlusion predicate.
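
To make result (2) concrete, here is a minimal sketch of a ray-versus-voxel DDA visibility check layered on a field-of-view cone. It is an illustration in the style of standard grid traversal, not the authors' implementation: the names `in_fov_cone` and `line_of_sight`, the unit-voxel grid, and the set-of-solid-cells representation are all assumptions.

```python
import math


def in_fov_cone(origin, gaze, target, half_angle_deg):
    """Cone test only: is the target within half_angle_deg of the gaze direction?"""
    to_t = [t - o for o, t in zip(origin, target)]
    norm_t = math.sqrt(sum(v * v for v in to_t))
    norm_g = math.sqrt(sum(v * v for v in gaze))
    if norm_t == 0 or norm_g == 0:
        return norm_t == 0  # target at the eye counts as inside; zero gaze does not
    cos_angle = sum(a * b for a, b in zip(to_t, gaze)) / (norm_t * norm_g)
    return cos_angle >= math.cos(math.radians(half_angle_deg))


def line_of_sight(origin, target, solid):
    """Return True if no solid voxel lies strictly between origin and target.

    origin, target: (x, y, z) floats in voxel units; solid: set of (i, j, k) ints.
    The voxels containing the two endpoints are deliberately not tested.
    """
    direction = [t - o for o, t in zip(origin, target)]
    cell = [math.floor(o) for o in origin]
    end = [math.floor(t) for t in target]
    step, t_max, t_delta = [], [], []
    for o, d, c in zip(origin, direction, cell):
        if d > 0:
            step.append(1); t_max.append((c + 1 - o) / d); t_delta.append(1 / d)
        elif d < 0:
            step.append(-1); t_max.append((c - o) / d); t_delta.append(-1 / d)
        else:
            step.append(0); t_max.append(math.inf); t_delta.append(math.inf)
    while cell != end:
        axis = t_max.index(min(t_max))  # next voxel boundary the ray crosses
        if t_max[axis] > 1.0:           # ray parameter past the target: stop
            break
        cell[axis] += step[axis]
        t_max[axis] += t_delta[axis]
        if cell != end and tuple(cell) in solid:
            return False                # an occluder sits between eye and target
    return True


# Hypothetical world: a solid wall filling the x = 2 slab.
wall = {(2, j, k) for j in range(5) for k in range(5)}
eye, gaze = (0.5, 2.5, 1.5), (1.0, 0.0, 0.0)
behind_wall = (4.5, 2.5, 1.5)
print(in_fov_cone(eye, gaze, behind_wall, 45.0))   # True: the cone alone says visible
print(line_of_sight(eye, behind_wall, wall))       # False: the DDA finds the wall
```

The behind-wall target passes the cone test but fails the ray march, which is the failure pattern the paper attributes to the live FoV cone. Recall is untouched by this predicate: a memory can still be retrieved for a location that `line_of_sight` reports as occluded.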

Methodological Rigor

The paper shows commendable methodological discipline in several respects. The authors pre-register hypotheses and commit them to git before running experiments—a practice rare in AI research. They honestly report a prior negative result (the topic-partition study that turned out not to test spatial memory at all), which builds credibility. The confound checks (E1-E5) are well-designed: E1 isolates that the action-level gain is from object binding rather than geometry per se, E3 demonstrates that an LLM given coordinates as text can match the DDA, and E5 stress-tests the DDA's failure modes on thin geometry.

However, several concerns limit the rigor:

Controlled/synthetic settings: Most experiments run in controlled voxel simulations or scripted worlds with a single occluder family. The gap between these and realistic agent environments is substantial.

Partly tautological results: The authors themselves concede that "occlusion-needs-geometry is near-tautological." The Tier-A battery (Section 6.4) is "partly by construction"—geometry-necessary queries by definition cannot be answered without geometry. The measured effect sizes therefore overstate what would be observed in naturalistic settings where queries mix spatial and non-spatial needs.

Single embedding model: All results use bge-small-en-v1.5; generalization is untested.

No human raters: All evaluation is automated. For the action-level claims especially, this is a meaningful limitation the authors acknowledge.

The recall experiment: While pre-registered, the "fix" (re-tuning weights to spatial-dominant) was tested on the same regime, not as a fresh pre-registration. The disambiguation regime is artificial—location tokens stripped from text, forcing spatial resolution—which inflates the spatial signal's importance.
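
The dilution effect behind the blend result can be illustrated with a toy ranker. Everything in this sketch is hypothetical: the weights, the proximity function, and the two memories are invented for illustration and are not the paper's shipped or re-tuned values. The two memories have identical text similarity, mimicking the disambiguation regime in which location tokens are stripped from the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    name: str
    text_sim: float    # cosine similarity to the query text, in [0, 1]
    recency: float     # 1.0 = just stored, 0.0 = very old
    importance: float  # in [0, 1]
    pos: tuple         # world coordinate the memory is anchored to


def score(m, query_pos, w):
    """Linear blend of text, recency, importance and spatial proximity."""
    proximity = 1.0 / (1.0 + math.dist(m.pos, query_pos))
    return (w["text"] * m.text_sim + w["recency"] * m.recency
            + w["importance"] * m.importance + w["spatial"] * proximity)


def top1(memories, query_pos, w):
    return max(memories, key=lambda m: score(m, query_pos, w)).name


# Hypothetical weightings: spatial as a minor term vs. spatial leading.
BLEND = {"text": 0.4, "recency": 0.3, "importance": 0.2, "spatial": 0.1}
GEOMETRY_LED = {"text": 0.1, "recency": 0.05, "importance": 0.05, "spatial": 0.8}

memories = [
    Memory("near_old", text_sim=0.8, recency=0.2, importance=0.5, pos=(1, 0, 0)),
    Memory("far_recent", text_sim=0.8, recency=0.9, importance=0.5, pos=(9, 0, 0)),
]
query_pos = (0, 0, 0)

print(top1(memories, query_pos, BLEND))         # far_recent
print(top1(memories, query_pos, GEOMETRY_LED))  # near_old
```

With spatial proximity as a small term, recency decides between the two near-duplicates and the far memory wins. Only when the spatial term dominates does the nearby memory surface. This also shows why the regime matters: the toy query is purely spatial by construction, so it says nothing about mixed queries.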

Potential Impact

The paper's conceptual contributions could influence how spatial memory systems are designed:

1. The recall/visibility separation is a clean architectural insight. Recall should be occlusion-blind (you remember what's behind a wall), while visibility is a separate perception predicate. This distinction, though intuitive once stated, is absent from most current systems.

2. The ranker-blend critique is practically useful. Demonstrating that the "memory palace" default of linearly blending spatial proximity with other signals can be net-negative is a cautionary finding for practitioners building agent memory systems.

3. The minimum-representation schema (Table 1) provides a useful taxonomy of what must be stored per query type.

However, the practical impact is bounded by the fact that this applies specifically to embodied agents in voxel worlds with authoritative geometry—a niche setting. The bridge to real-world agents operating in noisy, reconstructed environments (where the neat DDA over perfect voxels doesn't apply) is wide and unaddressed.

Timeliness & Relevance

The paper addresses a timely question as language agents increasingly operate in 3D environments and spatial grounding becomes more important. The agent memory literature (Generative Agents, MemGPT, A-MEM, Mem0, Zep) has grown rapidly, and the observation that none of these systems store spatial geometry or compute visibility predicates identifies a genuine gap. The positioning relative to render-as-recall systems (GSMem, RenderMem) and 3DSPMR is honest and well-articulated, with the claimed delta being measurement and isolation rather than method novelty.

Strengths

Intellectual honesty: Reporting the negative scoping result, conceding the near-tautological nature of the core claim, narrowing claims when confound checks demand it (E1 showing object binding, not geometry), and faithfully reporting the shipped blend's failure are all exemplary.

Pre-registration discipline: Git-committed pre-registrations with explicit falsification conditions set a good standard.

Clear conceptual framework: The index-versus-ranker distinction, the recall/visibility separation, and the minimum-representation schema are well-formulated.

The E3 result (LLM with coordinates matches DDA) is a genuinely interesting finding that separates storage from medium.

Limitations & Weaknesses

Narrow scope masquerading as breadth: Despite 23 pages, this ultimately demonstrates something close to obvious (geometry is needed for geometric queries) in very controlled settings. The "near-tautological" concession is more damaging than the authors seem to appreciate.

No downstream task evaluation at scale: The action-level results are tiny pilots (24 scenarios, single-turn decisions) that don't approach realistic agent deployment.

The confirmatory study hasn't been run: The paper is essentially a collection of pilots for a study that "remains future work." The main claims rest on preliminary evidence.

Missing comparisons: No head-to-head with GSMem, RenderMem, or 3DSPMR on shared tasks.

Static geometry assumption: The authors note but don't address that real worlds change, making the occlusion predicate time-dependent.

Writing: The paper is extremely long and repetitive, with the same caveats and claims restated many times. The ratio of insight to length is low.

Overall Assessment

This paper makes a conceptually clear but empirically preliminary contribution. The isolation of what spatial memory must store versus how it is read is a valid framing, and the pre-registration discipline is admirable. However, the central findings are largely confirmations of what is structurally obvious (text can't compute visibility), demonstrated in controlled synthetic environments that are far from realistic deployment. The most interesting results (E3's medium-independence finding, the blend-dilution measurement) are somewhat buried. This is honest, careful preliminary work that sets up a more impactful future study, but the impact of the present paper is limited.

Rating: 4.2 / 10

Significance 4 · Rigor 6 · Novelty 3.5 · Clarity 4.5

Generated Jun 10, 2026

Comparison History (24)

Won vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 1 investigates foundational cognitive architectures for AI agents with high methodological rigor, including pre-registered experiments and robust statistical analyses. In contrast, Paper 2 is a solution tailored to a specific dataset challenge (AV2 2026 Scenario Mining Challenge), which typically has a narrower, more applied impact compared to foundational AI memory research.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Paper 2 demonstrates strong methodological rigor with pre-registered experiments and highly significant statistical results (p<10^-15) addressing a fundamental problem in AI agent architectures (spatial memory). In contrast, Paper 1 presents an exploratory study with a small sample size, limited expert agreement, and statistically non-significant findings. Therefore, Paper 2 offers more conclusive evidence and broader fundamental impact in AI systems development.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

Paper 2 demonstrates higher potential scientific impact by addressing a fundamental architectural challenge in embodied language agents: spatial memory and occlusion. Its findings have broad applicability across AI, robotics, and virtual environments. Furthermore, Paper 2 exhibits exceptional methodological rigor through pre-registered experiments and precise statistical validation. While Paper 1 offers a highly valuable and practical applied framework for civil engineering, Paper 2 advances foundational AI capabilities, which typically yields a wider ripple effect across multiple computational and scientific disciplines.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

Paper 2 has higher likely scientific impact: it introduces a clear, testable criterion (occlusion/visibility) for evaluating spatial memory in language agents, backed by pre-registered experiments, strong quantitative effects, and a live confirmatory run that found and fixed a real system defect. Its implications generalize across embodied AI, robotics, AR/VR, agent memory architectures, and evaluation methodology. Paper 1 is timely and practically relevant for AI regulation, but its contribution is more domain-specific (legal/credit-scoring workflows) and less likely to propagate broadly across technical fields.

gpt-5.2 · Jun 11, 2026

Lost vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 2 introduces a novel framework (MCPS) combining MCTS-style evaluation with 3D ball tracking data for football pass evaluation, bridging sports analytics, multi-agent trajectory prediction, and counterfactual reasoning. It adapts methods from autonomous driving (SMART) to a new domain, releases code/checkpoints, and uses a novel public dataset. Paper 1 addresses a narrower problem (occlusion handling in language-agent memory palaces) with results that the authors themselves acknowledge as 'near-tautological,' and the confirmatory studies remain future work. Paper 2 has broader cross-domain impact, stronger methodological novelty, and more immediate practical applications.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Neetyabhas: A Framework for Uncertainty-Aware Public Policy Optimization in Rational Agent-Based Models

Paper 2 addresses a broader, more impactful problem—public health policy optimization during pandemics—with wider real-world applicability. It combines hierarchical reinforcement learning with uncertainty-aware policy gradients in a novel framework integrating individual behavior and policy uncertainties, relevant across epidemiology, economics, and AI. Paper 1, while methodologically rigorous with pre-registration, addresses a niche problem in language-agent spatial memory systems (occlusion handling) with limited breadth of impact, and the authors themselves acknowledge the contribution is 'near-tautological' with key confirmatory studies still pending.

claude-opus-4-6 · Jun 10, 2026

Lost vs. STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

Paper 1 likely has higher impact due to broader applicability and timeliness: scalable, automated benchmarking for realistic agent tasks in state-based OS-like environments addresses a central bottleneck in deploying LLM agents. Its framework can be adopted across many tasks, models, and research groups, influencing evaluation standards and accelerating progress. Paper 2 is methodologically rigorous (pre-registration, strong statistics) and offers clear insights for spatial-memory systems, but its scope is narrower (occlusion/visibility in spatial recall) and closer to a specialized diagnostic/ablation than a general infrastructure contribution.

gpt-5.2 · Jun 10, 2026

Lost vs. Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

Paper 1 addresses LLM unlearning, a timely and broadly impactful problem in AI safety with significant real-world applications (privacy, harmful knowledge removal). It presents a principled method (NSRU) with strong theoretical grounding and comprehensive experiments on established benchmarks (TOFU, WMDP). Paper 2 tackles a niche problem in spatial memory for language agents, with contributions that are self-admittedly 'near-tautological' and remain at the pilot stage with confirmatory studies left as future work. Paper 1's broader relevance to AI safety, methodological rigor, and completeness give it substantially higher impact potential.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Sim2Schedule: A Simulator-Guided LLM Framework for Autonomous Open-Pit Mine Scheduling

Paper 1 presents a highly practical, scalable solution to a complex, real-world industrial problem (mine scheduling) by effectively bridging Large Language Models with Operations Research. Demonstrating that zero-shot LLMs can achieve near-optimal results compared to computationally expensive MILP baselines while scaling linearly offers massive economic potential and establishes a strong precedent for using LLMs in complex, constrained industrial scheduling tasks.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Ten Headache Specialists versus Artificial Intelligence for Clinical Literature Summarization: A Critical Evaluation and Comparison

Paper 2 addresses a critical real-world problem—medical literature summarization for clinical decision-making—with high relevance to healthcare. Its rigorous, blinded evaluation by medical experts provides valuable insights into LLM utility in medicine, offering broader societal and cross-disciplinary impact compared to Paper 1's niche focus on language agent architecture.

gemini-3.1-pro-preview · Jun 10, 2026
