Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao

May 27, 2026

arXiv:2605.28277v1

cs.AI (primary)

#606 of 2682 · Artificial Intelligence

Tournament Score: 1469 ± 49 (scale 1050 to 1800)

Win Rate: 76%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 6

Abstract

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning"

1. Core Contribution

MentalMap introduces a multilingual diagnostic benchmark for pure-text spatial reasoning with a two-axis design: a six-level capability staircase (L0–L5) progressing from atomic spatial fact retrieval to generative world-graph construction, crossed with four diagnostic lenses (frame of reference, reading-direction bias, reasoning effort, hallucination). The benchmark covers eight typologically diverse languages, is grounded in 100 ProcTHOR household scenes, and comprises 39 task families across 1,950 evaluation cells (~47K scored items per model). The central empirical finding is a "universal L3 cliff"—a discrete performance collapse at viewpoint reasoning that persists across 13 LLMs, all languages, and scales from 1.7B to 32B parameters. Crucially, a cross-lingual human pilot under identical conditions reproduces the same cliff, suggesting the bottleneck is inherent to the text modality rather than being LLM-specific.
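The cliff criterion quoted in the abstract (no model keeps half of its L0 accuracy at L3 once L0 accuracy exceeds 40%) can be written as a simple retention check. The sketch below is only an illustration of that reading: the class, the function names, and every score in it are hypothetical and are not taken from the paper or its evaluation harness.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LevelScores:
    model: str   # hypothetical label, not a model from the paper
    l0: float    # atomic spatial-fact accuracy, in [0, 1]
    l3: float    # viewpoint-reasoning accuracy, in [0, 1]


def l3_retention(s: LevelScores) -> float:
    """Fraction of L0 accuracy that survives at L3 (0.0 if L0 is zero)."""
    return s.l3 / s.l0 if s.l0 > 0 else 0.0


def shows_cliff(s: LevelScores,
                l0_floor: float = 0.40,
                retention_cap: float = 0.50) -> Optional[bool]:
    """Apply the abstract's criterion as read here (an assumption).

    Returns None when L0 accuracy does not exceed the floor, because the
    criterion is only stated for models above it. Otherwise returns True
    when the model keeps less than half of its L0 accuracy at L3.
    """
    if s.l0 <= l0_floor:
        return None
    return l3_retention(s) < retention_cap


# Made-up scores, for illustration only.
examples = [
    LevelScores("model-a", l0=0.80, l3=0.30),
    LevelScores("model-b", l0=0.30, l3=0.20),
]
for s in examples:
    print(s.model, round(l3_retention(s), 3), shows_cliff(s))
```

Under this reading, the "universal" claim amounts to `shows_cliff` returning `True` for every evaluated model whose L0 accuracy clears the floor.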

2. Methodological Rigor

Strengths in design: The two-axis architecture (capability × diagnostic) is well-motivated by cognitive science (recall-construction distinction from mental rotation and egocentric-allocentric studies). The separation of static comprehension (L0–L2) from active spatial reasoning (L3–L5) is principled and empirically validated by the cliff. The inclusion of both strict and partial-credit composites for L5 is methodologically sound, and the paper demonstrates they yield materially different rankings—an important meta-methodological contribution.

Evaluation breadth: Thirteen models spanning frontier closed-source, open-weight 7-10B, mid-scale 27-32B controls, a sub-cliff lower bound (1.7B), and a VLM text-only ablation provide comprehensive coverage. The ~523K total evaluation items across the panel give statistical power to the claims.

Concerns: The human pilot, while valuable, has limited sample size (N≥3 per language). The L3=0% result across all eight languages is striking but could benefit from larger participant pools and more careful cognitive controls. The paper acknowledges this limitation and pre-registers quantitative predictions for scratchpad and multimodal extensions—a commendable practice, though these remain future work. The claim that the cliff reflects "working-memory bottleneck of the text modality" is plausible but not definitively established; it's an interpretation consistent with the data rather than a proven mechanism.

Some evaluation choices warrant scrutiny: the L5 composite weights (0.05/0.05/0.20/0.55/0.15) appear somewhat arbitrary, and the dominance of edge F1 at 0.55 could skew rankings. The paper partially addresses this by reporting both strict and partial-credit metrics.
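To make the weighting concern concrete, here is a minimal sketch of a weighted composite using the five weights quoted above. Only the weights, and the fact that edge F1 carries the 0.55 weight, come from the review; the other four components are left unnamed because the review does not name them, and all component scores are invented for illustration.

```python
import math

# Weights as quoted in the review. Index 3 is the 0.55 weight, which the
# review attributes to edge F1; the other components are not named there.
L5_WEIGHTS = (0.05, 0.05, 0.20, 0.55, 0.15)


def composite(components, weights=L5_WEIGHTS):
    """Weighted sum of per-component scores, each assumed to lie in [0, 1]."""
    if len(components) != len(weights):
        raise ValueError("need exactly one score per weight")
    return sum(w * c for w, c in zip(weights, components))


def unweighted_mean(components):
    """Equal-weight baseline for comparison."""
    return sum(components) / len(components)


# Hypothetical models: A is perfect everywhere except edge F1,
# B is mediocre everywhere except edge F1.
model_a = (1.0, 1.0, 1.0, 0.2, 1.0)
model_b = (0.5, 0.5, 0.5, 0.9, 0.5)

assert math.isclose(sum(L5_WEIGHTS), 1.0)
print("weighted  :", composite(model_a), composite(model_b))
print("unweighted:", unweighted_mean(model_a), unweighted_mean(model_b))
```

With these made-up inputs the weighted composite ranks B above A (about 0.72 versus 0.56), while an equal-weight mean ranks A above B (0.84 versus 0.58). That is the kind of ranking sensitivity the concern about the 0.55 weight points to.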

3. Potential Impact

Benchmark contribution: MentalMap fills a genuine gap at the intersection of spatial reasoning, multilingual evaluation, and structured output generation. No prior benchmark combines all three (Table 1 makes this clear). The released evaluation harness, schema validators, and multilingual prompts create a reusable infrastructure.

Theoretical contribution: The static-to-active reasoning cliff, if robust, has implications for understanding LLM capabilities broadly. The finding that chain-of-thought is not universally beneficial (helping DeepSeek by +32pp but hurting Qwen2.5-7B by -16pp at L3) and actively degrades JSON output validity challenges the common assumption that CoT is a reliable reasoning booster. The script-typology recovery from performance correlations (F6) offers an interesting lens on multilingual model analysis.

Practical implications: For embodied AI and robotics, the finding that no model reliably performs viewpoint transformation from text—and that humans also fail—suggests fundamental architectural constraints on text-only spatial interfaces. This motivates multimodal and scratchpad-augmented approaches with clear design rationale.

4. Timeliness & Relevance

The paper addresses a timely question as LLMs are increasingly deployed as interfaces for embodied agents and robotics planners. The multilingual axis is especially relevant given the global deployment of LLMs beyond English. The concurrent emergence of several spatial reasoning benchmarks (SpatialText, SiT-Bench, grid-world probes) confirms this is an active area, but MentalMap's multi-dimensional design distinguishes it from competitors.

The pre-registration of falsifiable predictions for scratchpad and multimodal extensions is methodologically forward-looking and sets a good precedent for benchmark papers.

5. Strengths & Limitations

Key strengths:

Comprehensive, principled benchmark design with clear cognitive-science motivation

The L3 cliff finding is striking and consistent across an unusually large evaluation matrix

Human evaluation under identical conditions provides rare modality-level evidence

Seven distinct findings avoid the single-number leaderboard trap

Script-typology recovery from performance data is an elegant emergent result

Pre-registered quantitative predictions for future extensions

Notable weaknesses:

Closed-source models (GPT-4o, Gemini) lack L0-L2 data, creating gaps in the staircase analysis for frontier models

The human pilot is underpowered (N≥3 per language); 0% on L3 across all languages is dramatic but small-N

ProcTHOR scenes are synthetic; generalization to naturalistic scene descriptions is assumed but untested

The "working-memory bottleneck" explanation, while parsimonious, lacks mechanistic evidence (e.g., probing internal representations)

The paper is extremely dense; the seven findings, while individually interesting, fragment the narrative

Native-speaker validation of translations is reported but inter-annotator agreement is not quantified

Additional observations: The paper's length and complexity (18+ pages with extensive appendices) may limit accessibility. The finding that fine-tuning regime rather than parameter scale drives structured-output competence (F5) is practically useful but rests on a limited number of scale control points. The missing Gemma-3-27B structured-text control result (N/A in Table 3) is not explained.

Overall, this is a substantial benchmark contribution that addresses real gaps in spatial reasoning evaluation. The universal L3 cliff finding, if replicated at larger human sample sizes and across additional model families, could become a reference result in the field. The main risks are that the cliff may be partially artifactual (driven by task design choices at L3) and that the modality-bottleneck interpretation remains speculative without mechanistic evidence.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 7.5 · Clarity 6

Generated May 28, 2026

Comparison History (17)

vs. From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark (MentalMap) spanning 13 models, 8 languages, and 39 task families. The discovery of a universal 'L3 reasoning cliff' and the finding that humans show similar limitations under text-only conditions provides deep theoretical insight into reasoning constraints. Its breadth of impact spans NLP, cognitive science, and AI safety, and it motivates future multimodal research directions. Paper 1, while useful, addresses a narrower educational technology application with evaluation limited to one university's CS department.

vs. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

gpt-5.2 · 5/28/2026

Paper 2 offers a broadly reusable evaluation methodology (matched-pair protocol + ADR) that directly addresses a known pitfall in LLM reasoning assessment (metric gaming via satisfiable bias) and connects to foundational complexity phenomena (phase transition, scaling). Its representation-invariant tests via reductions (to Vertex Cover and 3D packing) increase rigor and generality, making it applicable across many reasoning benchmarks beyond SAT. Paper 1 is a strong, timely multilingual diagnostic for spatial world modeling, but its impact is more domain-specific and less methodologically generalizable than Paper 2’s evaluation framework.

vs. A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis

gemini-3.1 · 5/28/2026

Paper 1 tackles a fundamental and highly debated question in AI: whether LLMs construct true internal world models. By introducing a comprehensive benchmark and demonstrating that LLM spatial reasoning failures mirror human working memory limits, it provides profound insights bridging NLP, multimodal AI, and cognitive science. In contrast, Paper 2 offers a valuable but highly specialized technical optimization fix for Multimodal Sentiment Analysis. Paper 1's broader scope, theoretical implications for future LLM architectures, and relevance to the ongoing discourse on AI reasoning give it a substantially higher potential for widespread scientific impact.

vs. Cross-Entropy Games and Frost Training

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a substantial multilingual benchmark with a clear capability taxonomy and multiple diagnostic axes, yielding a robust, generalizable finding (a universal “L3 reasoning cliff”) corroborated by human evaluations, which strengthens methodological rigor and broad relevance to cognition, evaluation, and multilingual NLP. Its insights inform future model design (multimodal/scratchpad) and provide a reusable resource. Paper 2 is a promising training tweak for a narrower class of judge-based optimization tasks; impact depends on adoption and on how broadly “Cross-Entropy Games” applies beyond this setup.

vs. Continual Model Routing in Evolving Model Hubs

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark and a striking empirical finding (the L3 reasoning cliff) that generalizes across languages, scales, and even to humans. This reframes spatial reasoning limitations as working-memory constraints rather than architectural deficits, which has broad implications for LLM design, multimodal AI, and cognitive science. Paper 1 tackles a practical but narrower infrastructure problem (model routing in hubs). While useful, its impact is more incremental and domain-specific compared to Paper 2's foundational insights.

vs. Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel benchmark (MentalMap) addressing a fundamental question about LLM world models with a well-designed multilingual hierarchy. The discovery of a universal 'L3 reasoning cliff' is a striking empirical finding with broad implications for understanding LLM capabilities and limitations. The human comparison strengthens claims about fundamental text-based reasoning constraints. Paper 1, while methodologically rigorous, is more narrowly focused on measurement protocol sensitivity in confidence calibration—important but incremental. Paper 2's breadth (13 models, 8 languages, human baselines) and its implications for multimodal AI give it wider impact potential.

vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact due to a broadly useful, multilingual diagnostic benchmark (MentalMap) that can become a standard evaluation tool across the field, informing model development, cognition-inspired analysis, and cross-lingual NLP. Its methodological contribution (capability hierarchy, multiple diagnostic axes, structured-text control, many models, plus human comparison) supports strong, generalizable claims about a persistent spatial reasoning bottleneck. Paper 1 is practically valuable for reducing multimodal hallucinations, but its impact is more specialized to a training recipe and may be superseded faster than a widely adopted benchmark and reframing result.

vs. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to direct, deployment-critical relevance: it reframes tool-use failures as a sim-to-real POMDP gap, introduces a perturbation benchmark grounded in real GitHub issues, and proposes an RL domain-randomization recipe that measurably improves robustness and transfers to unseen runtime failures. This combines actionable methodology, public benchmark/leaderboard, and immediate applicability to production LLM agents, with potential influence across agent evaluation, RL-for-LLMs, and reliability engineering. Paper 2 is novel and rigorous diagnostically, but is more descriptive and its main bottleneck attribution (text-only working memory) may limit near-term actionable gains.

vs. ProvMind: Provenance-grounded reasoning for materials synthesis

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with broader implications across AI, cognitive science, and linguistics. Its multilingual, multi-axis diagnostic framework (MentalMap) with the discovery of a universal 'L3 reasoning cliff' that also appears in humans provides a deeply informative finding about text-based reasoning limitations. This has wider cross-field impact and relevance to the broader LLM research community compared to Paper 1's domain-specific (materials science) benchmark, despite Paper 1's solid methodological contribution to process reasoning.

vs. CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental, highly debated question in AI regarding whether LLMs build internal world models, offering deep insights into their cognitive capabilities and limitations. Its findings on the 'reasoning cliff' have broad theoretical implications across natural language processing, cognitive science, and AGI research. In contrast, Paper 2 focuses on a niche, applied problem (e-commerce disputes), which, while practically useful, has a narrower scientific scope and theoretical impact.

vs. Adaptive auditing of AI systems with anytime-valid guarantees

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel statistical framework for adaptive AI auditing with rigorous anytime-valid guarantees, addressing a fundamental and growing need across all AI systems. Its methodological contribution—formalizing dueling hypotheses via e-processes and proving asymptotic certification guarantees—is broadly applicable beyond any single domain. Paper 1, while thorough in benchmarking spatial reasoning across languages, is more incremental (another LLM benchmark) with narrower scope. Paper 2's framework has broader cross-field impact (statistics, ML safety, regulation) and higher timeliness given increasing AI deployment and auditing demands.

vs. Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

gemini-3.1 · 5/28/2026

While Paper 1 offers a practical engineering solution for LLM efficiency, Paper 2 tackles a fundamental theoretical debate in AI: whether LLMs build internal world models. By introducing a comprehensive benchmark and uncovering a universal 'reasoning cliff' akin to human working memory limits, Paper 2 provides profound cognitive and architectural insights that will likely drive foundational research in multimodal and augmented reasoning across the broader AI community.

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

gemini-3.1 · 5/28/2026

Paper 2 offers a profound theoretical contribution to a fundamental debate in AI: whether LLMs build internal world models. By establishing a universal 'L3 reasoning cliff' in spatial reasoning across languages and scales, and validating it against human baselines, it reveals inherent limitations in text-only working memory. While Paper 1 identifies a critical security vulnerability in stateful agents, Paper 2's rigorous methodological hierarchy and implications for future architectural designs give it broader foundational impact across AI, cognitive science, and NLP.

vs. VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental and heavily debated question in AI and cognitive science: whether LLMs build internal world models. Its highly rigorous methodology—featuring a six-level hierarchy, multilingual evaluation, and human baselines—provides profound theoretical insights into LLM working memory and spatial reasoning. While Paper 1 offers a valuable benchmark for practical agent development, Paper 2's focus on foundational model capabilities and its broader implications across linguistics and cognitive science give it a higher potential for deep scientific impact.

vs. Entropy-aware Masking for Masked Language Modeling

gpt-5.2 · 5/28/2026

Paper 1 introduces a large, carefully structured multilingual diagnostic benchmark (MentalMap) with a capability hierarchy and multiple diagnostic axes, plus broad evaluation across 13 LLMs and human baselines. Its finding of a consistent cross-language “reasoning cliff” reframes text-only spatial reasoning limitations and has wide implications for evaluation, cognitive modeling, multilingual NLP, and multimodal/scratchpad research. Paper 2 is a useful optimization to MLM pretraining (entropy-based masking) with solid applied gains, but it is more incremental and narrower in downstream conceptual impact.

vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

gemini-3.1 · 5/28/2026

Paper 2 offers a broadly applicable tool that democratizes AI model development for researchers across various scientific disciplines. By automating the creation of AI pipelines with an evolving knowledge system, it has the potential to accelerate scientific discovery widely. While Paper 1 provides valuable theoretical insights into LLM spatial reasoning and cognitive bottlenecks, Paper 2's direct utility and state-of-the-art performance on MLE-Bench suggest a more immediate, transformative, and widespread practical impact across multiple domains.

vs. DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental, highly debated scientific question: whether LLMs construct internal world models. By introducing a comprehensive benchmark and identifying a universal 'L3 reasoning cliff' that mirrors human cognitive limits, it provides profound theoretical insights into LLM working memory and reasoning constraints. While Paper 2 offers valuable technical improvements for inference efficiency via speculative decoding, Paper 1's discoveries will likely have a broader, paradigm-shifting impact on how researchers understand LLM capabilities and design future multimodal or augmented architectures.