Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

Yajing Zhou, Xiangyu Kong

May 18, 2026

arXiv:2605.18194v1

cs.AI (primary), cs.CV

#1438 of 2292 · Artificial Intelligence

Tournament Score: 1381 ± 43 (gauge range 1050 to 1800)

Win Rate: 55%

Rating: 3.8 / 10

Significance 4.5 · Rigor 3 · Novelty 5 · Clarity 5.5

Abstract

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper introduces the "Observe-to-Believe" pipeline, a two-stage, training-free framework for second-order Theory of Mind (ToM) spatial reasoning in multi-modal LLMs. The key task is: given Agent A's egocentric audio-visual stream, predict what Agent B *believes* about Agent A's location, accounting for B's sensory limitations (field of view, occlusion). The framework decouples perception (Stage I: Gemini-2.5-Pro extracts geometric evidence from video) from cognitive inference (Stage II: DeepSeek-V4-Flash performs modality-aware perspective shifting). A binary "sensory mask" determines whether the reasoning follows a visual-dominant or audio-dominant pathway based on whether A falls within B's estimated visual frustum.
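The binary sensory mask described above can be illustrated with a small geometric sketch. This is not the authors' implementation, which the assessment describes as prompt-based. The 2D geometry, the 120-degree field of view, and the names `in_visual_frustum` and `select_pathway` are assumptions made for the example, and occlusion is ignored.

```python
import math

def in_visual_frustum(b_pos, b_heading_deg, a_pos, fov_deg=120.0):
    """Return True if A lies within B's horizontal field of view.

    Assumptions (not from the paper): 2D positions, heading measured
    counter-clockwise from the +x axis, no occlusion handling.
    """
    dx, dy = a_pos[0] - b_pos[0], a_pos[1] - b_pos[1]
    bearing = math.degrees(math.atan2(dy, dx))
    # Signed angular offset of A from B's heading, wrapped to [-180, 180).
    offset = (bearing - b_heading_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= fov_deg / 2.0

def select_pathway(b_pos, b_heading_deg, a_pos, fov_deg=120.0):
    """Binary sensory mask: visual-dominant if B can see A, else audio-dominant."""
    visible = in_visual_frustum(b_pos, b_heading_deg, a_pos, fov_deg)
    return "visual" if visible else "audio"

if __name__ == "__main__":
    # B stands at the origin facing +x; A is directly behind B.
    print(select_pathway((0.0, 0.0), 0.0, (-1.0, 0.0)))
```

In the sketch, an agent behind B is routed to the audio-dominant pathway because B cannot see it. This is the case where an MLLM that assumes shared omnidirectional perception would be expected to err.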

The conceptual framing around the "Cartesian Illusion" — that MLLMs implicitly assume shared omnidirectional perception — is intuitive and well-articulated. The idea of explicitly gating reasoning modalities based on another agent's inferred sensory constraints is a meaningful conceptual contribution to embodied AI.

2. Methodological Rigor

Significant concerns emerge upon close examination:

Absolute accuracy levels are low. The best-performing variant achieves 50.66% accuracy on what is fundamentally a 4-way classification task (front-left, front-right, back-left, back-right), where random chance would yield 25%. Although this exceeds the baselines (34.36% for egocentric end-to-end), the margin is modest and the absolute performance is far from reliable. The abstract cites a 42% zero-shot baseline, but the results report 34.36% for Baseline 1 and 24.42% for Baseline 2, so the origin of the 42% figure is unclear.
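To put the reported accuracies on a common footing relative to the 25% chance level, one option is a chance-corrected score that maps chance to 0 and perfect accuracy to 1. The sketch below is a reader-side calculation, not a metric used in the paper. It assumes uniform chance over four balanced classes, and the function name `chance_corrected` is hypothetical.

```python
def chance_corrected(accuracy, n_classes=4):
    """Rescale accuracy so that chance maps to 0 and perfect maps to 1.

    Assumes uniform chance (balanced classes), which the assessment
    implies but does not verify.
    """
    chance = 1.0 / n_classes
    return (accuracy - chance) / (1.0 - chance)

# Accuracies quoted in the assessment.
reported = {
    "best variant": 0.5066,
    "Baseline 1 (egocentric end-to-end)": 0.3436,
    "Baseline 2": 0.2442,
}

if __name__ == "__main__":
    for name, acc in reported.items():
        score = chance_corrected(acc)
        print(f"{name}: accuracy={acc:.2%}, chance-corrected={score:.3f}")
```

Under this rescaling the best variant lands near 0.34 and Baseline 1 near 0.12, while Baseline 2 sits at roughly chance.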

Baseline comparisons are weak. The baselines are simply direct prompting of a VLM (Gemini) without the structured two-stage decomposition. There is no comparison against other structured prompting approaches, chain-of-thought variants, or existing ToM frameworks adapted to this setting. The "baselines" are essentially ablations of the proposed approach rather than competitive alternatives.

Dataset and evaluation scope are narrow. The evaluation uses only the ego_direction subset of the SAVVY dataset. No new benchmark is created despite the abstract's claim of establishing "a foundational paradigm." The dataset was designed for first-order spatial tracking, and its repurposing for second-order ToM is creative but limited — we lack ground-truth annotations specifically designed for the ToM task being evaluated.

Stage I accuracy is concerning. The perception module achieves only 59.66% accuracy on orientation extraction, meaning roughly 40% of inputs to Stage II are erroneous. Case 4 in the qualitative analysis explicitly shows how persistent Stage I errors cascade. Yet there is no systematic analysis of how Stage I error rates affect Stage II performance.
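One simple way to frame the missing cascade analysis is the law of total probability over whether Stage I was correct. The sketch below is illustrative only. It assumes the 59.66% Stage I figure and the 50.66% end-to-end figure come from the same configuration, which the assessment does not confirm. The conditional accuracies on corrupted inputs are hypothetical values swept for illustration, not reported results.

```python
def overall_accuracy(p_stage1, acc_if_correct, acc_if_wrong):
    """Law of total probability over whether Stage I was correct."""
    return p_stage1 * acc_if_correct + (1.0 - p_stage1) * acc_if_wrong

def implied_acc_if_correct(overall, p_stage1, acc_if_wrong):
    """Solve the same identity for Stage II accuracy on correct Stage I inputs."""
    return (overall - (1.0 - p_stage1) * acc_if_wrong) / p_stage1

P_STAGE1 = 0.5966   # Stage I orientation accuracy quoted in the assessment
OVERALL = 0.5066    # best end-to-end accuracy quoted in the assessment

if __name__ == "__main__":
    # Hypothetical sweep: how well must Stage II do on clean inputs
    # for each assumed accuracy on corrupted inputs?
    for acc_if_wrong in (0.0, 0.25, 0.40):
        needed = implied_acc_if_correct(OVERALL, P_STAGE1, acc_if_wrong)
        print(f"acc_if_wrong={acc_if_wrong:.2f} -> implied acc_if_correct={needed:.3f}")
```

Reporting Stage II accuracy separately for correct and incorrect Stage I outputs would replace the swept values with measured ones and show directly how much of the error is inherited.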

Statistical rigor is absent. No confidence intervals, significance tests, or variance across runs are reported. Given the stochastic nature of LLM outputs, this is a notable gap.
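As an illustration of the kind of interval that could accompany each accuracy figure, the sketch below computes a Wilson score interval for a binomial proportion. The sample sizes are hypothetical, since the assessment does not state the test-set size. The interval also covers only sampling variability, not run-to-run variance from stochastic LLM decoding.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)
    )
    return center - half, center + half

if __name__ == "__main__":
    # Hypothetical sample sizes; the true evaluation-set size is not given.
    for n in (200, 1000, 5000):
        lo, hi = wilson_interval(round(0.5066 * n), n)
        print(f"n={n}: 95% CI for ~50.66% accuracy = [{lo:.3f}, {hi:.3f}]")
```

Whether gains on the order of one percentage point are distinguishable from noise depends heavily on the sample size, which is why reporting it alongside the intervals matters.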

3. Potential Impact

The conceptual contribution — modeling epistemic sensory constraints for recursive ToM — addresses a genuine gap. Applications in human-robot collaboration, autonomous driving, and assistive robotics (where predicting others' perceptual states matters) are plausible long-term beneficiaries. However, the current implementation is far from deployment-ready:

The 8-direction discretization is coarse for real-world applications.

The pipeline requires two large foundation models (Gemini-2.5-Pro + DeepSeek-V4-Flash) with ~60 seconds per sample, limiting practical scalability.

The framework is entirely training-free (prompt engineering only), which is both a strength (generalizability) and weakness (performance ceiling).
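The coarseness of an 8-direction discretization, noted in the first point above, can be made concrete with a short sketch. The label names, the clockwise bearing convention, and the function `discretize_bearing` are assumptions for illustration and are not taken from the paper.

```python
# Hypothetical sector labels, ordered clockwise starting from straight ahead.
LABELS_8 = ["front", "front-right", "right", "back-right",
            "back", "back-left", "left", "front-left"]

def discretize_bearing(bearing_deg, n_bins=8):
    """Map a relative bearing (0 = straight ahead, clockwise positive)
    to a sector index. Each sector spans 360 / n_bins degrees and is
    centred on its nominal direction."""
    width = 360.0 / n_bins
    return int(((bearing_deg % 360.0) + width / 2.0) // width) % n_bins

if __name__ == "__main__":
    for bearing in (0.0, 23.0, 67.0, 180.0, -90.0):
        print(bearing, LABELS_8[discretize_bearing(bearing)])
```

With eight sectors each bin is 45 degrees wide, so two targets up to 45 degrees apart (for example bearings of 23 and 67 degrees) receive the same label, and any single prediction carries up to 22.5 degrees of angular uncertainty.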

The idea of modality-gated reasoning could influence prompt engineering strategies for embodied AI systems, but the execution needs substantial strengthening.

4. Timeliness & Relevance

The paper addresses a timely intersection: MLLMs are increasingly deployed in embodied settings, and understanding their spatial reasoning limitations is important. The ToM perspective is underexplored in the MLLM literature. However, concurrent work on spatial reasoning benchmarks (SpatialBench, 3D-PC) and embodied agents (PaLM-E, RT-2) is advancing rapidly, and this work's relatively thin empirical contribution may limit its lasting influence.

5. Strengths & Limitations

Strengths:

Clear conceptual framing of the "Cartesian Illusion" and why sensory-bounded ToM matters

The two-stage architecture is modular and model-agnostic, validated across different foundation models

Qualitative analysis (Table 4, Appendix cases) provides genuine insight into failure modes

The hexagonal radar analysis (Figure 3) across visibility conditions is informative

Training-free approach ensures reproducibility and generalizability

Limitations:

Absolute performance is modest (~50% on 4-way classification)

No new dataset or benchmark contribution despite framing as "foundational"

Baselines are weak; no comparison with structured CoT alternatives or existing ToM systems

Stage I bottleneck (59.66% accuracy) is underanalyzed

No statistical significance testing

Audio contribution is ambiguous — adding audio sometimes hurts performance (Table 1), and the improvements in Table 2 are marginal (+0.76% to +1.38%)

The mathematical formalization (Equations 1-5) gives an impression of formal rigor but is largely descriptive of the prompting strategy rather than a computational model

The paper's rhetoric significantly oversells the contributions relative to the empirical evidence

Additional observations:

The writing style is heavily promotional ("profoundly fatal," "completely encompasses," "empirically proves"), which somewhat undermines scientific credibility. Several claims in the abstract (e.g., "42% zero-shot baseline," "foundational paradigm") are either inconsistently supported or overstated. The paper would benefit from a more measured presentation aligned with the actual experimental evidence.

Rating: 3.8 / 10

Significance 4.5 · Rigor 3 · Novelty 5 · Clarity 5.5

Generated May 19, 2026

Comparison History (20)

vs. Evaluating the Utility of Personal Health Records in Personalized Health AI

gemini-3.1 · 5/20/2026

Paper 1 addresses a fundamental cognitive bottleneck in AI—embodied spatial intelligence and second-order Theory of Mind. By proposing a novel mechanism to overcome the 'Cartesian Illusion' in MLLMs, it offers significant methodological innovation for Embodied AI. While Paper 2 provides highly valuable real-world evaluation of LLMs in healthcare, Paper 1 introduces a foundational algorithmic paradigm that solves a complex, theoretical reasoning gap. This fundamental contribution gives Paper 1 a broader potential scientific impact across robotics, multi-agent systems, and multimodal reasoning.

vs. Generative Auto-Bidding with Unified Modeling and Exploration

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to strong real-world applicability and demonstrated large-scale online deployment with measurable business/ROI gains, indicating immediate practical value and adoption potential. Methodologically, it integrates decision transformers, Q-guided exploration, and a safety fallback into a coherent pipeline addressing an important industrial safety–exploration tradeoff. Paper 2 is timely and novel as a benchmark/analysis of MLLM embodied ToM under perceptual bottlenecks, but its impact depends on broader community uptake and the generality of the proposed reasoning chain, with less evidence of downstream deployment or wide applicability beyond embodied AI evaluation.

vs. Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a timely and high-impact problem at the intersection of MLLMs, embodied AI, and Theory of Mind—areas attracting enormous research attention. It introduces novel concepts (Epistemic Sensory Bottleneck, Anchor-Based Embodied Spatial Decomposition CoT), provides empirical evaluations with quantitative baselines, and exposes fundamental limitations of current models. Paper 2 presents a formal framework for KG agent affordances, which is intellectually rigorous but addresses a narrower community (Semantic Web/KG), is more of a position/framework paper without empirical validation, and builds on decades-old ideas with less immediate broad impact.

vs. Not all uncertainty is alike: volatility, stochasticity, and exploration

gemini-3.1 · 5/20/2026

Paper 2 addresses a fundamental theoretical problem in both AI and cognitive science regarding exploration under distinct types of uncertainty. Its formal mathematical framework and broad cross-disciplinary implications spanning reinforcement learning, neuroscience, and computational psychiatry provide higher potential scientific impact than Paper 1, which focuses on a more specific architectural limitation in current multi-modal language models for embodied AI.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

gpt-5.2 · 5/20/2026

Paper 1 has higher potential scientific impact due to its novelty and cross-cutting relevance to embodied AI: it targets a fundamental limitation in MLLMs (epistemic, multi-modal Theory of Mind under perceptual constraints) and proposes a generalizable reasoning paradigm plus a benchmark exposing failure modes. This is timely for robotics, multi-agent systems, and AI safety/interpretability. Paper 2 is rigorous and practically useful for survey methodology, but its contribution is more domain-bounded (imputation/workflow on one disaster survey) and likely to have narrower impact outside social science/data quality communities.

vs. CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

claude-opus-4.6 · 5/19/2026

Paper 2 (CBT-Audio) addresses a critical gap in mental health AI by introducing a novel, ethically-grounded dataset for evaluating audio language models in CBT. It has broader real-world applicability in clinical mental health settings, fills an important data scarcity problem, and provides a reusable benchmark. Paper 1, while intellectually interesting in probing MLLM spatial reasoning and Theory of Mind, addresses a more niche problem with a complex framework whose practical applicability is less immediate. Paper 2's contribution of a shareable spoken CBT dataset and systematic evaluation across 10 models is more likely to catalyze follow-up research across NLP, clinical AI, and mental health communities.

vs. Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact: it addresses a high-stakes, timely real-world problem (reliable medical reasoning across heterogeneous systems) with clear application pathways (knowledge graph alignment, RAG improvements) and measurable benefits on standard metrics plus downstream validation. Its query-conditioned, direction-aware, many-to-many alignment framing is broadly applicable to other domains with asymmetric ontologies. Paper 1 is novel and relevant for embodied AI benchmarking, but appears more diagnostic/benchmarking and CoT-guided heuristics, with less immediate deployment value and potentially weaker methodological grounding given LLM-CoT dependence and limited task scope.

vs. Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments

gemini-3.1 · 5/19/2026

Paper 1 addresses a fundamental limitation in Embodied AI (spatial reasoning and Theory of Mind in multi-agent environments) by proposing a novel, modality-aware Chain-of-Thought approach. Its advancement of multi-modal large language models' capabilities to model other agents' epistemic states offers high potential for real-world robotics and complex multi-agent systems. While Paper 2 provides a valuable psychometric validation framework, Paper 1's conceptual innovation and direct contribution to the rapidly evolving field of autonomous AI agents give it a broader and more transformative potential scientific impact.

vs. ADR: An Agentic Detection System for Enterprise Agentic AI Security

claude-opus-4.6 · 5/19/2026

Paper 2 (ADR) demonstrates higher scientific impact due to its immediate real-world deployment at scale (Uber, 10 months, 7,200+ hosts), addressing a timely and critical problem of AI agent security. It introduces a practical, production-proven framework with a new benchmark (ADR-Bench), strong empirical results outperforming baselines by 2-4x in F1, and addresses the rapidly growing enterprise AI agent ecosystem. Paper 1 explores an interesting ToM problem for MLLMs but is more niche, with modest baselines (42% accuracy) and less clear practical applicability. ADR's breadth of impact across AI security, enterprise systems, and the emerging MCP ecosystem gives it broader relevance.

vs. Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a more fundamental and novel problem—Theory of Mind in multi-modal LLMs with embodied spatial reasoning—which has broader implications across Embodied AI, cognitive science, and multi-agent systems. It introduces a novel benchmark and theoretical framework (Epistemic Sensory Bottleneck, anchor-based CoT) that exposes deep limitations in current MLLMs. Paper 2, while solid, addresses a more incremental improvement in RL for open-ended generation with narrower scope (role-playing tasks). Paper 1's interdisciplinary relevance and foundational contribution to spatial cognition in AI gives it higher impact potential.

vs. Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

claude-opus-4.6 · 5/19/2026

Paper 1 provides actionable, empirically grounded design principles for compound LLM agents in adversarial sequential decision-making—a rapidly growing area. Its large-scale controlled study (3,475 episodes, 5 model families, 12 configurations) with cost accounting offers immediately useful guidance for practitioners. The identification of 'deliberation cascades' as a failure mode is novel and broadly applicable. Paper 2 addresses an interesting niche (spatial ToM in MLLMs) but targets a narrower problem with less immediate practical breadth. Paper 1's findings on context engineering vs. deliberation tradeoffs will likely influence a wider range of LLM agent deployments.

vs. New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

gemini-3.1 · 5/19/2026

Paper 2 demonstrates higher potential scientific impact due to its timeliness, breadth, and focus on a fundamental bottleneck in modern AI. While Paper 1 provides a solid theoretical contribution to zeroth-order optimization, Paper 2 tackles the highly relevant problem of spatial reasoning and Theory of Mind in Multi-Modal Large Language Models (MLLMs). By addressing the 'Cartesian Illusion' and introducing a sensory-bounded reasoning chain for Embodied AI, Paper 2 spans NLP, computer vision, robotics, and cognitive science, offering broader real-world applications in multi-agent environments compared to the specialized algorithmic improvements in Paper 1.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to its broader relevance to embodied AI, multi-agent systems, and multimodal reasoning, addressing a timely limitation (spatial/epistemic ToM under perceptual constraints) with a novel benchmark/task framing that can become a shared evaluation standard. Its applications span robotics, AR/VR, human–agent interaction, and safety-critical multi-agent coordination. Paper 1 is innovative and useful for scientific discovery workflows, but its scope is narrower (symbolic regression/equation discovery) and more incremental as an agentic reliability improvement over existing generation–fit–score loops.

vs. Actionable World Representation

claude-opus-4.6 · 5/19/2026

WorldString addresses a fundamental gap in physical world modeling by proposing a unified, differentiable architecture for modeling actionable object states from point clouds/RGB-D data. Its potential as a foundational building block for world models, digital twins, and integration with policy learning gives it broad applicability across robotics, simulation, and embodied AI. Paper 2, while addressing an interesting ToM problem in MLLMs, tackles a narrower problem with a more specialized evaluation setup. WorldString's broader scope, foundational nature, and versatility across multiple downstream applications suggest higher long-term scientific impact.

vs. ALSO: Adversarial Online Strategy Optimization for Social Agents

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to its broader relevance to embodied AI and multi-modal reasoning: it targets a fundamental limitation (“Cartesian Illusion”) and second-order Theory of Mind under perceptual constraints, proposes a task/benchmarking paradigm, and yields diagnostic findings about current MLLM failure modes. This is timely and likely to influence evaluation methodology, model design, and embodied multi-agent research. Paper 1 is novel and practical for social-agent adaptation, but is more niche (Sotopia-style dialogue simulation) and its bandit+surrogate approach is a comparatively incremental extension of online optimization ideas to LLM agents.

vs. A Global-Local Graph Attention Network for Traffic Forecasting

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a more novel and fundamental problem at the intersection of embodied AI, Theory of Mind, and multi-modal reasoning in LLMs—a rapidly growing and highly relevant research area. It introduces novel concepts (Epistemic Sensory Bottleneck, Anchor-Based Embodied Spatial Decomposition CoT) that could have broad impact across embodied AI, cognitive science, and multi-agent systems. Paper 1, while solid, addresses traffic forecasting with incremental improvements (global-local attention) in an already crowded space with many similar graph-based approaches, limiting its novelty and broader impact.

vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact because it delivers a broadly usable, scalable, and reproducible evaluation framework (simulation + task generation) for web agents, addressing a major methodological bottleneck with clear real-world relevance to e-commerce automation and agent benchmarking. Its artifacts (ShopArena/ShopGuru, tasks, validation analyses) can become community infrastructure, enabling standardized comparisons across models and labs. Paper 1 is novel and timely for embodied ToM in MLLMs, but appears more niche and potentially more sensitive to prompt/CoT-driven gains, with narrower immediate applicability than an evaluation platform that can be widely adopted.

vs. Responsible Agentic AI Requires Explicit Provenance

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical, field-wide challenge (accountability and safety in agentic AI) by proposing a structural framework applicable across domains. Its focus on AI governance and real-world deployment safety gives it a much broader interdisciplinary and societal impact compared to Paper 1, which focuses on a specific, albeit rigorous, technical sub-problem in spatial reasoning and Embodied AI.

vs. ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader cross-field relevance (embodied AI, multimodal reasoning, theory of mind, human-agent interaction, robotics), strong timeliness given current focus on MLLM grounding, and a generally applicable evaluation paradigm for perceptual/epistemic bottlenecks. Its task and framework target a foundational limitation (second-order ToM under sensory constraints) with implications beyond a single domain. Paper 1 is methodologically solid and impactful for cheminformatics, but its applications and audience are narrower compared to the wide-reaching embodied spatial/epistemic reasoning agenda in Paper 2.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and interdisciplinary challenge at the intersection of Multi-Modal LLMs, Embodied AI, and cognitive science (Theory of Mind). Its focus on solving the 'Cartesian Illusion' for multi-agent spatial reasoning promises broader real-world applications in robotics and AI compared to Paper 2, which offers specialized, albeit rigorous, algorithmic improvements for classical planning.