OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah

May 25, 2026

arXiv:2605.26322v1

cs.AI (primary)

#1414 of 2682 · Artificial Intelligence

Tournament Score: 1403±42 (axis range 1050 to 1800)

Win Rate: 42%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Abstract

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: OmniToM

1. Core Contribution

OmniToM introduces a paradigm shift in how Theory of Mind (ToM) is evaluated in LLMs: instead of scoring models on whether they produce correct final answers to social-reasoning questions (endpoint QA), it requires models to explicitly extract and label the underlying multi-actor belief structures that support those answers. The benchmark is structured in two stages: (1) Belief Extraction, where models must identify what each actor in a narrative knows, believes, or infers, and (2) Belief Labeling, where each belief proposition is annotated along seven dimensions (Order, Truth Status, Knowledge Access, Representation, Content Type, Mental Source, Context) grounded in the ATOMS taxonomy from developmental psychology. Built on 895 stories from ToMBench with 22,343 labeled belief propositions, OmniToM makes the usually-hidden mental-state reasoning process directly observable and auditable.
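
One way to picture a labeled belief proposition is as a small record with one field per schema dimension. The sketch below is illustrative only: the seven dimension names are taken from the description above, but the class name, field types, and every example value (including the Sally-Anne style story) are assumptions and not the benchmark's actual label sets or data.

```python
from dataclasses import dataclass, asdict

# Dimension names follow the assessment text; everything else here
# (types, example values) is a hypothetical illustration.
DIMENSIONS = (
    "order", "truth_status", "knowledge_access", "representation",
    "content_type", "mental_source", "context",
)


@dataclass(frozen=True)
class BeliefProposition:
    actor: str             # who holds the belief
    statement: str         # minimal statement of what the actor takes to be true
    order: int             # recursive order (1 = belief about the world)
    truth_status: str
    knowledge_access: str
    representation: str    # e.g. stated in the story vs. inferred
    content_type: str
    mental_source: str
    context: str

    def labels(self) -> dict:
        """Return only the seven schema labels, keyed by dimension name."""
        record = asdict(self)
        return {dim: record[dim] for dim in DIMENSIONS}


# Hypothetical example in the style of a classic false-belief story.
example = BeliefProposition(
    actor="Sally",
    statement="The marble is in the basket.",
    order=1,
    truth_status="false",
    knowledge_access="did_not_observe_change",
    representation="inferred",
    content_type="location",
    mental_source="perception",
    context="object_transfer",
)

print(example.labels())
```

Keeping the actor and statement separate from the seven labels mirrors the two stages: Stage 1 would produce the first two fields, Stage 2 the rest.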

The central finding — an "actor-specific belief-tracking bottleneck" where models struggle specifically with Knowledge Access and Representation dimensions — is the kind of diagnostic insight that endpoint QA cannot provide. This is a genuinely useful contribution: it pinpoints that models fail not at parsing stories or identifying content, but at determining who knows what and whether beliefs are stated or inferred.

2. Methodological Rigor

Strengths: The two-stage evaluation design is well-motivated and cleanly formalized. The MatchCount-based semantic alignment for Stage 1 is a thoughtful solution to the open-ended extraction problem, handling compound and paraphrased beliefs. The human-calibrated, LLM-assisted annotation pipeline is transparent: 1K+ person-hours for calibration, three-annotator verification with majority voting, and explicit model selection based on calibration performance. The TELeR prompt taxonomy provides systematic prompt engineering across construction and evaluation.
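
To make the Stage 1 scoring concrete, the sketch below shows how an F1 score could be derived from match counts once a semantic judge has aligned predicted and gold beliefs. It is a minimal illustration under stated assumptions, not the paper's MatchCount definition: the function name, the use of separate matched counts on the predicted and gold sides (to allow one compound belief to cover several gold beliefs), and the example numbers are all hypothetical.

```python
def extraction_f1(matched_pred: int, n_pred: int,
                  matched_gold: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from alignment counts.

    matched_pred: predicted beliefs the judge aligned to at least one gold belief
    matched_gold: gold beliefs covered by at least one predicted belief
    (Assumed bookkeeping; the benchmark's exact counting rule may differ.)
    """
    precision = matched_pred / n_pred if n_pred else 0.0
    recall = matched_gold / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy numbers for illustration only: 6 of 8 predictions match, covering 6 of 10 gold beliefs.
p, r, f1 = extraction_f1(matched_pred=6, n_pred=8, matched_gold=6, n_gold=10)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.3f}")
```

Because the counts come from a judge rather than exact string matching, any noise in the judge's alignment decisions carries straight through to precision, recall, and F1.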

Concerns: The semantic judge (GPT-5) achieves only 72.03% agreement with human alignment decisions — the authors rightly flag this as moderate and recommend treating Stage 1 F₁ as approximate. This is a meaningful limitation, especially since the benchmark's novelty lies precisely in evaluating extraction quality. Stage 2 relies on Claude-Sonnet-4.5 as the annotation model, meaning the gold labels are themselves LLM-generated (albeit human-calibrated). While the 93.62% calibration accuracy is strong, this creates a potential circularity: models evaluated on LLM-generated labels may share systematic biases with the annotation model. The paper acknowledges but does not deeply probe this risk. Additionally, the zero-shot evaluation uses Level 3 prompts while construction uses Level 4 — the performance gap between these (shown in Table 9) is substantial, meaning some of the "bottleneck" may reflect prompt sensitivity rather than fundamental capability limitations.
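
A raw agreement rate such as the one quoted for the semantic judge can be computed as below. The sketch also adds Cohen's kappa, a standard chance-corrected agreement statistic that the assessment does not mention; it is included here only to show why raw agreement alone can look better than it is. The decision lists are made-up toy data, not results from the paper.

```python
from collections import Counter


def percent_agreement(a: list, b: list) -> float:
    """Fraction of items on which two raters give the same decision."""
    if len(a) != len(b) or not a:
        raise ValueError("rater lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    p_expected = sum(counts_a[l] * counts_b[l]
                     for l in set(counts_a) | set(counts_b)) / (n * n)
    if p_expected == 1.0:  # both raters always give the same single label
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Toy match / no-match decisions (1 = "aligned"), for illustration only.
judge = [1, 1, 0, 1, 0, 0, 1, 1]
human = [1, 0, 0, 1, 0, 1, 1, 1]
print(percent_agreement(judge, human), round(cohens_kappa(judge, human), 3))
```

In this toy case the raters agree on 75% of items, yet kappa is well below that, since much of the agreement would be expected by chance when both raters say "aligned" most of the time.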

3. Potential Impact

The most significant impact is conceptual: OmniToM demonstrates that endpoint correctness is insufficient for evaluating social reasoning and provides a concrete alternative. This reframing could influence how future ToM benchmarks are designed across NLP and cognitive AI. The seven-dimensional schema offers a reusable annotation framework that could be applied to other narrative understanding tasks. The identification of Knowledge Access and Representation as specific failure modes provides actionable targets for model improvement — researchers working on belief tracking, information flow modeling, or multi-agent reasoning now have specific dimensions to optimize against.

However, the practical impact may be bounded. The benchmark is text-only, story-based, and limited to short narratives from seven ToMBench categories. Real-world social reasoning involves multimodal cues, interactive dynamics, and open-ended contexts that OmniToM cannot capture. The finding that models struggle with information distribution across actors is not entirely surprising given prior work on perspective-taking failures, though the dimensional decomposition adds granularity.

4. Timeliness & Relevance

This work is highly timely. As LLMs are increasingly deployed in social contexts (customer service, therapy bots, educational agents, negotiation), understanding whether they actually model mental states versus exploit surface patterns is critical. The paper directly responds to growing concerns that ToM benchmarks may reward shortcut strategies rather than genuine reasoning. The benchmark fills a clear gap in the evaluation landscape — Table 1 effectively shows that no prior benchmark explicitly evaluates all seven dimensions OmniToM covers.

5. Strengths & Limitations

Key Strengths:

Principled decomposition: The ATOMS-grounded schema translates cognitive science theory into operational evaluation dimensions, providing interpretability beyond aggregate scores.

Diagnostic specificity: The per-dimension, per-order analysis (Fig. 5) enables precise failure localization — Order 1 beliefs with Knowledge Access labels are the hardest, which is genuinely informative.

Comprehensive evaluation: Nine models spanning open and closed-source, 8B to frontier scale, providing broad coverage.

Transparency: The extensive appendix (annotation examples, prompt templates, reliability statistics) supports reproducibility.
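
The per-dimension, per-order breakdown described under diagnostic specificity amounts to grouping label accuracy by (order, dimension). A minimal sketch follows, assuming that gold and predicted labels are aligned one-to-one, as they would be when the beliefs are given and only the labels are predicted; the function name, dictionary layout, and label values are hypothetical.

```python
from collections import defaultdict


def accuracy_by_order_and_dimension(gold: list[dict], pred: list[dict],
                                    dimensions: tuple[str, ...]) -> dict:
    """Label accuracy for each (gold order, dimension) pair."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must be aligned one-to-one")
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        for dim in dimensions:
            key = (g["order"], dim)        # group by the gold belief's order
            totals[key] += 1
            hits[key] += int(p.get(dim) == g[dim])
    return {key: hits[key] / totals[key] for key in totals}


# Toy labels for illustration only.
dims = ("knowledge_access", "representation")
gold = [
    {"order": 1, "knowledge_access": "a", "representation": "x"},
    {"order": 1, "knowledge_access": "b", "representation": "y"},
    {"order": 2, "knowledge_access": "a", "representation": "x"},
]
pred = [
    {"knowledge_access": "a", "representation": "x"},
    {"knowledge_access": "a", "representation": "y"},
    {"knowledge_access": "a", "representation": "y"},
]
scores = accuracy_by_order_and_dimension(gold, pred, dims)
for key in sorted(scores):
    print(key, scores[key])
```

A table of this shape is what lets a reader point to a single weak cell rather than to one aggregate score.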

Notable Limitations:

Gold standard quality: The 72.03% judge agreement and LLM-generated gold labels introduce measurement uncertainty that may obscure true model differences, particularly for close comparisons.

Limited ecological validity: Short, self-contained stories with clear belief dynamics don't capture the ambiguity and complexity of real social reasoning.

No training signal: The benchmark is evaluation-only; it doesn't provide mechanisms for improving the identified bottleneck.

Inherited biases: Building on ToMBench inherits its topical and cultural scope limitations.

Missing baselines: No comparison with chain-of-thought prompting, fine-tuned models, or specialized belief-tracking architectures (like SymbolicToM's tracker applied to OmniToM's schema).

Stage coupling: Stage 1 and Stage 2 are evaluated independently, but the interaction between extraction errors and labeling errors is not analyzed (e.g., do models that extract fewer beliefs label them more accurately?).

Overall Assessment

OmniToM makes a meaningful conceptual and methodological contribution by shifting ToM evaluation from endpoint answers to explicit belief-structure modeling. The seven-dimensional schema is well-designed and the diagnostic findings about Knowledge Access and Representation bottlenecks are actionable. However, the moderate reliability of the semantic judge, LLM-generated gold standards, and limited narrative complexity temper the benchmark's authority. This is a solid step toward process-level evaluation of social reasoning, though its ultimate impact depends on community adoption and extension to richer settings.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Generated May 27, 2026

Comparison History (26)

vs. SkillGrad: Optimizing Agent Skills Like Gradient Descent

gemini-3.1 · 5/28/2026

Paper 1 addresses a fundamental cognitive capability in LLMs (Theory of Mind) by shifting evaluation from outcome-based QA to explicit belief modeling. This provides deeper insights into model reasoning and representation bottlenecks. Its rigorous benchmark approach has broad implications for AI safety, alignment, and cognitive science, likely driving more fundamental research than the practical, albeit innovative, agent optimization framework presented in Paper 2.

vs. GONDOR to the Rescue: Satisficing Planning with Low Memory

claude-opus-4.6 · 5/28/2026

OmniToM addresses a fundamental gap in evaluating Theory of Mind in LLMs—a highly active and rapidly growing research area. By introducing explicit belief modeling with a structured annotation framework (22,343 labeled propositions), it provides a novel diagnostic benchmark that goes beyond end-point QA, revealing specific bottlenecks in current LLMs. This has broad impact across AI safety, cognitive science, and NLP. GONDOR, while a solid contribution to memory-efficient search, represents an incremental improvement to GBFS in a more mature, narrower subfield with comparatively limited cross-disciplinary reach.

vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

gpt-5.2 · 5/28/2026

Paper 2 (CIVIC) likely has higher scientific impact due to strong real-world applicability and timeliness: inference-time memory/latency is a major deployment bottleneck for VLMs. Its end-to-end, hardware-aware sequence compactness (including KV-cache and contiguous memory access) addresses a practical gap where prior token-pruning methods underdeliver wall-clock gains. The method appears rigorously evaluated on a modern VLM (Qwen3-VL) with efficiency and accuracy benchmarks, and could generalize across architectures, benefiting multimodal systems broadly. Paper 1 is novel and valuable as an evaluation benchmark, but impact is narrower (ToM probing) and more indirect on deployment.

vs. MIRA: A Bilingual Benchmark for Medical Information Response Audit

claude-opus-4.6 · 5/28/2026

MIRA addresses a critical real-world safety concern—health information equity across language, literacy, and register—with direct implications for LLM deployment in healthcare. The discovery of 'Differential Information Dilution' is novel and actionable, with a demonstrated mitigation strategy. Its bilingual design and practical focus on vulnerable populations (low health literacy users) give it broader societal impact. While OmniToM contributes meaningfully to Theory of Mind evaluation methodology, it is more narrowly focused on cognitive benchmarking without immediate real-world applications. MIRA's timeliness regarding AI safety in healthcare amplifies its potential impact.

vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical, highly practical gap in the rapidly expanding field of LLM agents by isolating the impact of the system 'harness' from the base model. This insight fundamentally shifts how agent capabilities should be evaluated and reported. While Paper 2 offers a valuable cognitive benchmark for Theory of Mind, Paper 1 has broader real-world applicability for building, diagnosing, and deploying reliable autonomous systems, giving it a higher potential for immediate and widespread scientific and engineering impact.

vs. Do Agents Think Deeper? A Mechanistic Investigation of Layer-Wise Dynamics in Sequential Planning

gpt-5.2 · 5/28/2026

Paper 2 has higher potential impact due to greater novelty and breadth: it offers mechanistic, layer-wise causal evidence (probes, layer-skipping interventions, effective-depth metrics) about how depth is allocated in agentic multi-turn settings—highly timely given the shift toward autonomous agents. The findings can influence model architecture, training, interpretability, and agent design across domains. Paper 1 is a solid benchmark with clear applications in evaluating ToM, but benchmarks are more incremental and narrower in cross-field impact than mechanistic insights that may generalize to many tasks and model families.

vs. The Shape of Overthinking: Backtracking Bursts in Long Reasoning Traces

claude-opus-4.6 · 5/28/2026

OmniToM introduces a novel benchmark with a structured evaluation framework for Theory of Mind in LLMs, addressing a fundamental gap in how mental-state reasoning is assessed. It provides reusable infrastructure (22,343 labeled propositions, seven-dimensional schema) that can drive substantial follow-up research across cognitive science, NLP, and AI safety. Paper 1 offers useful empirical insights on backtracking dynamics in reasoning traces, but its scope is narrower—focused on a specific model behavior pattern with incremental practical utility (early-exit filtering). Paper 2's broader applicability and foundational contribution to ToM evaluation give it higher impact potential.

vs. An LLM-Based Assistance System for Intuitive and Flexible Capability-Based Planning

gemini-3.1 · 5/28/2026

Paper 1 offers higher scientific impact because it addresses a fundamental challenge in evaluating Theory of Mind (ToM) in LLMs. By shifting from simplistic end-point QA to explicit, multi-dimensional belief modeling, OmniToM provides a robust framework that impacts general AI development, cognitive modeling, and human-AI interaction. In contrast, Paper 2 presents a valuable but narrower application of LLMs to industrial SMT planning. Furthermore, Paper 1 demonstrates greater methodological rigor with a large-scale dataset (22k+ propositions), whereas Paper 2 relies on a very small set of 23 test cases, limiting its broader scientific significance.

vs. The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

claude-opus-4.6 · 5/28/2026

Paper 1 provides a rigorous statistical re-evaluation of a high-profile benchmark (GSM-Symbolic) that made sweeping claims about LLM reasoning. By identifying concrete statistical flaws—shifted integer distributions, lack of proper random effects modeling—and showing that blanket conclusions about LLM reasoning are premature, it directly challenges influential findings and establishes better methodological standards for LLM evaluation. This has broad impact across the entire LLM benchmarking community. Paper 2 introduces a valuable but more niche benchmark for Theory of Mind evaluation, with narrower scope of influence.

vs. VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

gpt-5.2 · 5/28/2026

Paper 2 (OmniToM) likely has higher scientific impact due to broader cross-field relevance and timeliness: explicit belief-structure evaluation targets a central limitation of LLM social reasoning and agent interaction, with applications in HCI, education, safety, and multi-agent systems. Its two-stage framework and rich seven-dimensional labeling provide a methodological advance over endpoint QA, enabling more diagnostic analyses. Paper 1 (VeriTrip) is valuable and rigorous for web-based planning agents, but is more domain-specific (travel) and overlaps with an already crowded space of retrieval/grounding benchmarks, potentially narrowing breadth of impact.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it introduces a unified framework (Calibrated Interactive RL) addressing a central, broadly relevant obstacle for deployed multi-turn dialogue—compounding distribution shift—with theoretical characterization and empirically validated mitigation via simulator alignment. This is timely for RLHF/agentic LLMs and has direct real-world applications in robust conversational systems, plus breadth across RL, simulation-to-real, and dialogue. Paper 1 is novel and valuable as an interpretability/diagnostic benchmark for ToM, but benchmarks typically yield narrower immediate impact than methods that improve interactive agent performance.

vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

gemini-3.1 · 5/27/2026

Paper 1 addresses a fundamental cognitive capability (Theory of Mind) in LLMs, introducing a novel, granular benchmark that moves beyond simple question-answering to evaluate internal belief representations. This methodological innovation will likely have a broad impact on LLM evaluation and cognitive AI. Paper 2, while providing a valuable empirical critique of a specific Agent-to-Agent platform's incentive and validation flaws, is more narrowly focused on the operational shortcomings of existing systems rather than advancing fundamental AI capabilities or evaluation methods.

vs. Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

gpt-5.2 · 5/27/2026

Paper 2 (Claw-Anything) likely has higher scientific impact due to its strong real-world applicability and timeliness: always-on assistants with broad digital access are a central near-term deployment target, and the benchmark captures long-horizon, multi-service, multi-device, GUI+CLI interaction plus proactive behavior under noisy/conflicting context. It also contributes a scalable automated environment-generation pipeline and demonstrates measurable training gains, increasing methodological utility beyond evaluation. Paper 1 is novel and valuable for interpretability of ToM in LLMs, but its applications are more indirect and narrower than the systems-level agent setting in Paper 2.

vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

gemini-3.1 · 5/27/2026

Paper 2 tackles a fundamental cognitive capability (Theory of Mind) in LLMs, offering a novel methodological shift from end-point QA to explicit belief modeling. This has broad implications across NLP, cognitive science, and human-AI interaction. While Paper 1 provides a valuable systemic fix for production benchmarking artifacts, its scientific scope is narrower, focusing primarily on Python-specific engineering bottlenecks rather than foundational model reasoning capabilities.

vs. Advancing Creative Physical Intelligence in Large Multimodal Models

gpt-5.2 · 5/27/2026

Paper 2 has higher estimated impact due to stronger novelty and broader applicability: it introduces a visually grounded benchmark for creative, physically constrained tool use (a key gap for multimodal agents) and pairs it with a concrete training recipe (affordance-grounded alignment via DPO + KB supervision) showing measurable improvements and reduced hallucination. This combination of evaluation + intervention is timely for robotics/embodied AI and multimodal agents, with potential real-world translation. Paper 1 is rigorous and valuable for social reasoning diagnostics, but is more niche (ToM evaluation) and primarily benchmark-focused without an associated capability-improving method.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to its timely, broadly relevant identification of a structural vulnerability in RLHF—currently the dominant alignment paradigm. It introduces a clear threat model (alignment tampering), demonstrates amplification across multiple real-world failure modes (propaganda, bias, brand promotion, goal-seeking), and highlights unresolved mitigation trade-offs, making it immediately actionable for both academia and industry. Paper 1 is a solid, methodical benchmark contribution, but its impact is narrower (ToM evaluation) and primarily advances measurement rather than exposing a system-level risk affecting most deployed alignment pipelines.

vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

claude-opus-4.6 · 5/27/2026

NeurIPS (Paper 2) has higher potential scientific impact due to its concrete methodological innovations (SRST and SG-MoE) that solve a well-defined performance-fidelity trade-off in fMRI decoding, achieving state-of-the-art results with dramatically improved efficiency (10 vs. 600 epochs). Its contributions are immediately applicable to neuroscience and brain-computer interfaces, with demonstrated scalability and generalizability. Paper 1, while valuable as a benchmark for ToM evaluation in LLMs, primarily offers a diagnostic tool rather than a novel solution, and its impact depends on community adoption. Paper 2's paradigm shift—treating anatomy as signal rather than noise—has broader transformative potential.

vs. Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

gemini-3.1 · 5/27/2026

Paper 1 addresses a fundamental limitation in evaluating Theory of Mind (ToM) in LLMs by shifting the paradigm from simple end-point QA to explicit belief representation tracking. This offers profound theoretical insights into LLM reasoning capabilities and cognitive modeling. While Paper 2 presents a highly practical and relevant RAG framework for semi-structured data, Paper 1's focus on deep cognitive evaluation has broader implications for understanding and developing AGI, granting it higher potential for long-term scientific impact across AI and cognitive science.

vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

claude-opus-4.6 · 5/27/2026

OmniToM addresses a fundamental limitation in evaluating Theory of Mind in LLMs—a core capability for AI systems—by introducing a rigorous benchmark with explicit belief modeling. This has broad impact across cognitive science, NLP, and AI safety. Paper 1, while providing a valuable empirical study of A2A networks revealing important design flaws, is more narrowly focused on characterizing a specific platform (EvoMap). Paper 2's benchmark methodology, multi-dimensional evaluation schema, and identification of systematic LLM limitations are more likely to drive widespread follow-up research and methodological advances.

vs. Credit Assignment with Resets in Language Model Reasoning

gemini-3.1 · 5/27/2026

Paper 1 addresses a fundamental algorithmic bottleneck in LLM post-training—credit assignment in multi-step reasoning. By proposing novel RL methods (RRPO and SRPO) that improve over standard techniques like GRPO, it has broad implications for developing next-generation reasoning models. Paper 2 introduces a valuable but narrower evaluation benchmark for Theory of Mind. Foundational training methodologies generally yield higher and broader scientific impact than domain-specific benchmarks.