Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao, Yitong Qiao, Chunlei Meng, Zhangquan Chen, Xin Cao
Abstract
Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.
AI Impact Assessments
Scientific Impact Assessment: "Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning"
1. Core Contribution
MentalMap introduces a multilingual diagnostic benchmark for pure-text spatial reasoning with a two-axis design: a six-level capability staircase (L0–L5) progressing from atomic spatial fact retrieval to generative world-graph construction, crossed with four diagnostic lenses (frame of reference, reading-direction bias, reasoning effort, hallucination). The benchmark covers eight typologically diverse languages, is grounded in 100 ProcTHOR household scenes, and comprises 39 task families across 1,950 evaluation cells (~47K scored items per model). The central empirical finding is a "universal L3 cliff"—a discrete performance collapse at viewpoint reasoning that persists across 13 LLMs, all languages, and scales from 1.7B to 32B parameters. Crucially, a cross-lingual human pilot under identical conditions reproduces the same cliff, suggesting the bottleneck is inherent to the text modality rather than being LLM-specific.
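The cliff criterion quoted in the abstract (no model keeps even half of its L0 performance at L3 once L0 accuracy exceeds 40%) can be written as a simple retention check. The sketch below is only an illustration of that criterion as worded there: the function names, the handling of models at or below the 40% floor, and the example accuracies are assumptions, not the paper's evaluation harness or reported results.

```python
from typing import Optional


def l3_retention(l0_acc: float, l3_acc: float) -> float:
    """Fraction of L0 (atomic fact) accuracy that survives at L3 (viewpoint reasoning)."""
    if l0_acc <= 0:
        raise ValueError("L0 accuracy must be positive to compute retention")
    return l3_acc / l0_acc


def shows_cliff(
    l0_acc: float,
    l3_acc: float,
    baseline_floor: float = 0.40,     # threshold quoted in the abstract
    retention_ceiling: float = 0.50,  # "even half of its L0 performance"
) -> Optional[bool]:
    """Return True/False when the criterion applies, None when L0 is too low to judge."""
    if l0_acc <= baseline_floor:
        return None  # assumption: models at or below the floor are out of scope
    return l3_retention(l0_acc, l3_acc) < retention_ceiling


# Made-up accuracies for illustration only; these are not values from the paper.
examples = {
    "hypothetical_model_a": (0.80, 0.30),
    "hypothetical_model_b": (0.30, 0.10),
}
for name, (l0, l3) in examples.items():
    print(name, shows_cliff(l0, l3))
```

Framing the cliff as a ratio rather than an absolute L3 score is what allows models with very different baselines to be compared on the same criterion.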
2. Methodological Rigor
Strengths in design: The two-axis architecture (capability × diagnostic) is well-motivated by cognitive science (recall-construction distinction from mental rotation and egocentric-allocentric studies). The separation of static comprehension (L0–L2) from active spatial reasoning (L3–L5) is principled and empirically validated by the cliff. The inclusion of both strict and partial-credit composites for L5 is methodologically sound, and the paper demonstrates they yield materially different rankings—an important meta-methodological contribution.
Evaluation breadth: Thirteen models spanning frontier closed-source, open-weight 7-10B, mid-scale 27-32B controls, a sub-cliff lower bound (1.7B), and a VLM text-only ablation provide comprehensive coverage. The ~523K total evaluation items across the panel give statistical power to the claims.
Concerns: The human pilot, while valuable, has a limited sample size (N≥3 per language). The L3=0% result across all eight languages is striking, but it would be more convincing with larger participant pools and more careful cognitive controls. The paper acknowledges this limitation and pre-registers quantitative predictions for scratchpad and multimodal extensions, a commendable practice, though these remain future work. The claim that the cliff reflects a "working-memory bottleneck of the text modality" is plausible but not definitively established; it is an interpretation consistent with the data rather than a proven mechanism.
Some evaluation choices warrant scrutiny: the L5 composite weights (0.05/0.05/0.20/0.55/0.15) appear somewhat arbitrary, and the dominance of edge F1 at 0.55 could skew rankings. The paper partially addresses this by reporting both strict and partial-credit metrics.
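To see why a 0.55 weight on edge F1 could skew rankings, consider a minimal sketch of a weighted composite using the quoted weights. Only the weights and the fact that edge F1 carries 0.55 come from the text above; the assumption that the composite is a plain weighted sum, the placement of edge F1 in the fourth slot, the other component slots, and both score vectors are hypothetical and chosen purely for illustration.

```python
import math

# Weights as quoted in the review; edge F1 is assumed to be the 0.55 entry (index 3).
L5_WEIGHTS = (0.05, 0.05, 0.20, 0.55, 0.15)
assert math.isclose(sum(L5_WEIGHTS), 1.0)


def composite(scores, weights=L5_WEIGHTS):
    """Weighted sum of per-component scores, each expected in [0, 1]."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(w * s for w, s in zip(weights, scores))


# Hypothetical score vectors, not results from the paper.
# Model A is perfect on four components but weak on edge F1.
model_a = (1.0, 1.0, 1.0, 0.4, 1.0)
# Model B is mediocre everywhere except edge F1.
model_b = (0.5, 0.5, 0.5, 0.9, 0.5)

print(round(composite(model_a), 2))  # 0.67
print(round(composite(model_b), 2))  # 0.72
```

In this toy case the model that is stronger on four of five components still ranks lower, which is the kind of sensitivity that reporting both strict and partial-credit metrics helps to expose.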
3. Potential Impact
Benchmark contribution: MentalMap fills a genuine gap at the intersection of spatial reasoning, multilingual evaluation, and structured output generation. No prior benchmark combines all three (Table 1 makes this clear). The released evaluation harness, schema validators, and multilingual prompts create a reusable infrastructure.
Theoretical contribution: The static-to-active reasoning cliff, if robust, has implications for understanding LLM capabilities broadly. The finding that chain-of-thought is not universally beneficial (helping DeepSeek by +32pp but hurting Qwen2.5-7B by -16pp at L3) and actively degrades JSON output validity challenges the common assumption that CoT is a reliable reasoning booster. The script-typology recovery from performance correlations (F6) offers an interesting lens on multilingual model analysis.
Practical implications: For embodied AI and robotics, the finding that no model reliably performs viewpoint transformation from text—and that humans also fail—suggests fundamental architectural constraints on text-only spatial interfaces. This motivates multimodal and scratchpad-augmented approaches with clear design rationale.
4. Timeliness & Relevance
The paper addresses a timely question as LLMs are increasingly deployed as interfaces for embodied agents and robotics planners. The multilingual axis is especially relevant given the global deployment of LLMs beyond English. The concurrent emergence of several spatial reasoning benchmarks (SpatialText, SiT-Bench, grid-world probes) confirms this is an active area, but MentalMap's multi-dimensional design distinguishes it from competitors.
The pre-registration of falsifiable predictions for scratchpad and multimodal extensions is methodologically forward-looking and sets a good precedent for benchmark papers.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional observations: The paper's length and complexity (18+ pages with extensive appendices) may limit accessibility. The finding that fine-tuning regime rather than parameter scale drives structured-output competence (F5) is practically useful but rests on a limited number of scale control points. The missing data for the Gemma-3-27B structured-text control (N/A in Table 3) is unexplained.
Overall, this is a substantial benchmark contribution that addresses real gaps in spatial reasoning evaluation. The universal L3 cliff finding, if replicated at larger human sample sizes and across additional model families, could become a reference result in the field. The main risks are that the cliff may be partially artifactual (driven by task design choices at L3) and that the modality-bottleneck interpretation remains speculative without mechanistic evidence.
Generated May 28, 2026
Comparison History (17)
Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark (MentalMap) spanning 13 models, 8 languages, and 39 task families. The discovery of a universal 'L3 reasoning cliff' and the finding that humans show similar limitations under text-only conditions provides deep theoretical insight into reasoning constraints. Its breadth of impact spans NLP, cognitive science, and AI safety, and it motivates future multimodal research directions. Paper 1, while useful, addresses a narrower educational technology application with evaluation limited to one university's CS department.
Paper 2 offers a broadly reusable evaluation methodology (matched-pair protocol + ADR) that directly addresses a known pitfall in LLM reasoning assessment (metric gaming via satisfiable bias) and connects to foundational complexity phenomena (phase transition, scaling). Its representation-invariant tests via reductions (to Vertex Cover and 3D packing) increase rigor and generality, making it applicable across many reasoning benchmarks beyond SAT. Paper 1 is a strong, timely multilingual diagnostic for spatial world modeling, but its impact is more domain-specific and less methodologically generalizable than Paper 2’s evaluation framework.
Paper 1 tackles a fundamental and highly debated question in AI: whether LLMs construct true internal world models. By introducing a comprehensive benchmark and demonstrating that LLM spatial reasoning failures mirror human working memory limits, it provides profound insights bridging NLP, multimodal AI, and cognitive science. In contrast, Paper 2 offers a valuable but highly specialized technical optimization fix for Multimodal Sentiment Analysis. Paper 1's broader scope, theoretical implications for future LLM architectures, and relevance to the ongoing discourse on AI reasoning give it a substantially higher potential for widespread scientific impact.
Paper 1 likely has higher impact: it introduces a substantial multilingual benchmark with a clear capability taxonomy and multiple diagnostic axes, yielding a robust, generalizable finding (a universal “L3 reasoning cliff”) corroborated by human evaluations, which strengthens methodological rigor and broad relevance to cognition, evaluation, and multilingual NLP. Its insights inform future model design (multimodal/scratchpad) and provide a reusable resource. Paper 2 is a promising training tweak for a narrower class of judge-based optimization tasks; impact depends on adoption and on how broadly “Cross-Entropy Games” applies beyond this setup.
Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with a rigorous multilingual benchmark and a striking empirical finding (the L3 reasoning cliff) that generalizes across languages, scales, and even to humans. This reframes spatial reasoning limitations as working-memory constraints rather than architectural deficits, which has broad implications for LLM design, multimodal AI, and cognitive science. Paper 1 tackles a practical but narrower infrastructure problem (model routing in hubs). While useful, its impact is more incremental and domain-specific compared to Paper 2's foundational insights.
Paper 2 introduces a novel benchmark (MentalMap) addressing a fundamental question about LLM world models with a well-designed multilingual hierarchy. The discovery of a universal 'L3 reasoning cliff' is a striking empirical finding with broad implications for understanding LLM capabilities and limitations. The human comparison strengthens claims about fundamental text-based reasoning constraints. Paper 1, while methodologically rigorous, is more narrowly focused on measurement protocol sensitivity in confidence calibration—important but incremental. Paper 2's breadth (13 models, 8 languages, human baselines) and its implications for multimodal AI give it wider impact potential.
Paper 2 likely has higher scientific impact due to a broadly useful, multilingual diagnostic benchmark (MentalMap) that can become a standard evaluation tool across the field, informing model development, cognition-inspired analysis, and cross-lingual NLP. Its methodological contribution (capability hierarchy, multiple diagnostic axes, structured-text control, many models, plus human comparison) supports strong, generalizable claims about a persistent spatial reasoning bottleneck. Paper 1 is practically valuable for reducing multimodal hallucinations, but its impact is more specialized to a training recipe and may be superseded faster than a widely adopted benchmark and reframing result.
Paper 1 likely has higher impact due to direct, deployment-critical relevance: it reframes tool-use failures as a sim-to-real POMDP gap, introduces a perturbation benchmark grounded in real GitHub issues, and proposes an RL domain-randomization recipe that measurably improves robustness and transfers to unseen runtime failures. This combines actionable methodology, public benchmark/leaderboard, and immediate applicability to production LLM agents, with potential influence across agent evaluation, RL-for-LLMs, and reliability engineering. Paper 2 is novel and rigorous diagnostically, but is more descriptive and its main bottleneck attribution (text-only working memory) may limit near-term actionable gains.
Paper 2 addresses a fundamental question about LLM cognition—whether they build internal world models—with broader implications across AI, cognitive science, and linguistics. Its multilingual, multi-axis diagnostic framework (MentalMap) with the discovery of a universal 'L3 reasoning cliff' that also appears in humans provides a deeply informative finding about text-based reasoning limitations. This has wider cross-field impact and relevance to the broader LLM research community compared to Paper 1's domain-specific (materials science) benchmark, despite Paper 1's solid methodological contribution to process reasoning.
Paper 1 addresses a fundamental, highly debated question in AI regarding whether LLMs build internal world models, offering deep insights into their cognitive capabilities and limitations. Its findings on the 'reasoning cliff' have broad theoretical implications across natural language processing, cognitive science, and AGI research. In contrast, Paper 2 focuses on a niche, applied problem (e-commerce disputes), which, while practically useful, has a narrower scientific scope and theoretical impact.
Paper 2 introduces a novel statistical framework for adaptive AI auditing with rigorous anytime-valid guarantees, addressing a fundamental and growing need across all AI systems. Its methodological contribution—formalizing dueling hypotheses via e-processes and proving asymptotic certification guarantees—is broadly applicable beyond any single domain. Paper 1, while thorough in benchmarking spatial reasoning across languages, is more incremental (another LLM benchmark) with narrower scope. Paper 2's framework has broader cross-field impact (statistics, ML safety, regulation) and higher timeliness given increasing AI deployment and auditing demands.
While Paper 1 offers a practical engineering solution for LLM efficiency, Paper 2 tackles a fundamental theoretical debate in AI: whether LLMs build internal world models. By introducing a comprehensive benchmark and uncovering a universal 'reasoning cliff' akin to human working memory limits, Paper 2 provides profound cognitive and architectural insights that will likely drive foundational research in multimodal and augmented reasoning across the broader AI community.
Paper 2 offers a profound theoretical contribution to a fundamental debate in AI: whether LLMs build internal world models. By establishing a universal 'L3 reasoning cliff' in spatial reasoning across languages and scales, and validating it against human baselines, it reveals inherent limitations in text-only working memory. While Paper 1 identifies a critical security vulnerability in stateful agents, Paper 2's rigorous methodological hierarchy and implications for future architectural designs give it broader foundational impact across AI, cognitive science, and NLP.
Paper 2 addresses a fundamental and heavily debated question in AI and cognitive science: whether LLMs build internal world models. Its highly rigorous methodology—featuring a six-level hierarchy, multilingual evaluation, and human baselines—provides profound theoretical insights into LLM working memory and spatial reasoning. While Paper 1 offers a valuable benchmark for practical agent development, Paper 2's focus on foundational model capabilities and its broader implications across linguistics and cognitive science give it a higher potential for deep scientific impact.
Paper 1 introduces a large, carefully structured multilingual diagnostic benchmark (MentalMap) with a capability hierarchy and multiple diagnostic axes, plus broad evaluation across 13 LLMs and human baselines. Its finding of a consistent cross-language “reasoning cliff” reframes text-only spatial reasoning limitations and has wide implications for evaluation, cognitive modeling, multilingual NLP, and multimodal/scratchpad research. Paper 2 is a useful optimization to MLM pretraining (entropy-based masking) with solid applied gains, but it is more incremental and narrower in downstream conceptual impact.
Paper 2 offers a broadly applicable tool that democratizes AI model development for researchers across various scientific disciplines. By automating the creation of AI pipelines with an evolving knowledge system, it has the potential to accelerate scientific discovery widely. While Paper 1 provides valuable theoretical insights into LLM spatial reasoning and cognitive bottlenecks, Paper 2's direct utility and state-of-the-art performance on MLE-Bench suggest a more immediate, transformative, and widespread practical impact across multiple domains.
Paper 1 addresses a fundamental, highly debated scientific question: whether LLMs construct internal world models. By introducing a comprehensive benchmark and identifying a universal 'L3 reasoning cliff' that mirrors human cognitive limits, it provides profound theoretical insights into LLM working memory and reasoning constraints. While Paper 2 offers valuable technical improvements for inference efficiency via speculative decoding, Paper 1's discoveries will likely have a broader, paradigm-shifting impact on how researchers understand LLM capabilities and design future multimodal or augmented architectures.