Chao Lei, Guang Hu, Meng Yang, Yanbei Jiang, Nir Lipovetzky
Theory of Mind (ToM) reasoning requires inferring agents' beliefs from partial and asymmetric observations, which remains an open challenge for LLMs. Existing prompting-based approaches improve ToM reasoning through observable-event filtering or temporal belief chains, without explicitly modeling nested beliefs. We introduce RecToM, an inference-time framework for ToM reasoning that models nested beliefs via recursive perspective construction. RecToM constructs each character perspective from the preceding character perspective along the character chain specified by the question, reducing higher-order belief questions to actual-world questions within the final constructed perspective. We further provide a KD45 analysis showing that RecToM's perspective construction induces a well-formed belief modality beyond simple event filtering. Experiments on ToM benchmarks, including Hi-ToM, Big-ToM, and FanToM, across multiple LLM backbones show that RecToM consistently outperforms recent advanced approaches, achieving state-of-the-art performance. Notably, RecToM reaches 100% accuracy on Hi-ToM with GPT-5.4 and Qwen3.5, a benchmark requiring higher-order ToM reasoning.
RecToM introduces an inference-time framework for Theory of Mind (ToM) reasoning that explicitly models nested beliefs through recursive perspective construction. The key insight is straightforward but well-executed: rather than filtering events or constructing temporal chains (as in SIMTOM and TIMETOM), RecToM builds a symbolic state-event sequence and then recursively constructs character-specific perspectives by (1) identifying observable states/events, (2) completing unobserved states via belief persistence, and (3) chaining perspectives along the character chain specified by higher-order questions. This reduces K-th order belief questions to zero-order (factual) questions within the innermost character's constructed perspective.
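The recursive chaining described above can be illustrated with a toy example. The following is a minimal sketch, not the authors' implementation: RecToM performs these steps by prompting an LLM, whereas this sketch uses plain Python over a hand-written story. The names `Event`, `build_perspective`, `final_state`, `answer`, and `STORY` are hypothetical, and belief persistence is approximated by simply letting the last observed location stand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical persistent event: `obj` ends up at `location`."""
    obj: str
    location: str
    witnesses: frozenset  # characters who observed the event


def build_perspective(events, character):
    """Keep only the events that `character` observed.

    Unobserved changes are dropped, so the last observed location of an
    object persists (a simplified stand-in for belief persistence).
    """
    return [e for e in events if character in e.witnesses]


def final_state(events):
    """Accumulate events in order into a state: object -> location."""
    state = {}
    for e in events:
        state[e.obj] = e.location
    return state


def answer(events, chain, obj):
    """Answer "where does chain[0] think chain[1] thinks ... `obj` is?".

    Each perspective is built from the preceding one, so the nested
    question becomes a factual lookup in the final perspective.
    """
    perspective = list(events)
    for character in chain:
        perspective = build_perspective(perspective, character)
    return final_state(perspective).get(obj)


# Toy Sally-Anne story: Sally leaves before the marble is moved.
STORY = [
    Event("marble", "basket", frozenset({"Sally", "Anne"})),
    Event("marble", "box", frozenset({"Anne"})),
]

print(answer(STORY, [], "marble"))                 # box (actual world)
print(answer(STORY, ["Sally"], "marble"))          # basket (first order)
print(answer(STORY, ["Anne", "Sally"], "marble"))  # basket (second order)
```

An empty chain corresponds to a zero-order (factual) question, and each added character wraps one more perspective around it, which is why the same lookup serves every belief order.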
The approach is grounded in a clean formalism: events are classified as persistent (modifying ontic facts) or transient (communications that update beliefs), states are accumulated deterministically, and perspectives are constructed through partial observation and completion. The KD45 modal logic analysis provides formal justification that the constructed belief modality is well-formed.
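For reference, the standard KD45 axioms for a belief operator are listed below, followed by an informal statement of the reduction. The notation is an assumption made here for illustration and may differ from the paper's: $B_a$ denotes "character $a$ believes", $P_a(\cdot)$ denotes the perspective constructed for $a$, and $w$ is the actual world.

```latex
\begin{align*}
\textbf{K:}\quad & B_a(\varphi \rightarrow \psi) \rightarrow (B_a\varphi \rightarrow B_a\psi)
  && \text{(closure under implication)} \\
\textbf{D:}\quad & B_a\varphi \rightarrow \neg B_a\neg\varphi
  && \text{(consistency)} \\
\textbf{4:}\quad & B_a\varphi \rightarrow B_a B_a\varphi
  && \text{(positive introspection)} \\
\textbf{5:}\quad & \neg B_a\varphi \rightarrow B_a\neg B_a\varphi
  && \text{(negative introspection)}
\end{align*}

% Informal reading of the reduction for a character chain a_1, ..., a_K:
\[
  w \models B_{a_1} B_{a_2} \cdots B_{a_K}\,\varphi
  \quad\text{is evaluated as}\quad
  P_{a_K}\bigl(\cdots P_{a_2}(P_{a_1}(w))\cdots\bigr) \models \varphi .
\]
```

Unlike the knowledge logic S5, KD45 omits the truth axiom $B_a\varphi \rightarrow \varphi$, so beliefs may be false, which is what false-belief questions require.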
Direct impact on ToM reasoning: RecToM offers a principled and effective framework for nested belief modeling that clearly advances the state of the art on established benchmarks. The 100% accuracy on Hi-ToM (up to 4th-order beliefs) is a notable result, effectively "solving" this benchmark for capable LLMs.
The paper addresses a timely topic. ToM reasoning is increasingly recognized as a bottleneck for LLM deployment in social and multi-agent settings. Recent work (SIMTOM, TIMETOM, DEL-ToM, PercepToM) demonstrates active community interest. RecToM's contribution of explicit nested belief modeling fills a genuine gap: prior methods either handle only first-order beliefs well or use ad-hoc extensions for higher orders. The use of very recent LLMs (GPT-5.4, Qwen3.5) ensures relevance to the current model landscape.
The paper is well-written with an effective running example (Figure 2) that makes the recursive construction intuitive. The prompt templates in the appendix support reproducibility. The distinction between persistent and transient events, while simple, is practically effective and could be adopted by other ToM frameworks.
The achievement of 100% on Hi-ToM, while impressive, also raises the question of benchmark saturation, that is, whether Hi-ToM remains a useful discriminative benchmark going forward.
Generated Jun 11, 2026
Paper 1 addresses a critical bottleneck in LLM deployment (the quadratic scaling of self-attention) by offering a novel RL-based training paradigm to make efficient sliding-window attention viable for rigorous tasks like math reasoning. This structural improvement has broad implications for foundational model training and efficient inference. In contrast, Paper 2 proposes an inference-time prompting strategy tailored to a specific cognitive domain (Theory of Mind), which, while valuable, has a narrower potential scientific and practical impact.
Paper 2 (IntElicit) has higher likely scientific impact due to broader real-world applicability and cross-field relevance: it contributes a general framework for interactive creativity elicitation/assessment spanning education, psychometrics, HCI, and AI alignment, backed by a human-subject study and explicit mechanisms against reward hacking. Paper 1 (RecToM) is novel and rigorous within LLM Theory-of-Mind prompting, but its impact is narrower (benchmark-centric ToM reasoning) and primarily advances inference-time prompting rather than introducing a widely deployable evaluation/intervention paradigm.
Paper 2 addresses Theory of Mind, a fundamental cognitive capability crucial for multi-agent systems and human-AI alignment. By introducing a recursive framework with formal logical grounding (KD45 analysis) and achieving state-of-the-art results across multiple benchmarks, it offers broader theoretical and practical implications. In contrast, Paper 1 focuses on a more specific architectural detail regarding agent skill organization, which, while useful, is less likely to drive foundational shifts in AI reasoning.
Paper 2 is likely higher impact: it introduces a broadly applicable, scalable alternative to explanation-based interpretability—training behavior forecasters from self-generated data—to predict model stability and counterfactual sensitivity. This is timely for trustworthy LRMs, reduces reliance on human labels, and offers clear real-world utility (deployment monitoring, uncertainty, robustness) across many domains and model types. While Paper 1 is novel and rigorous for Theory of Mind prompting with strong benchmark gains, its applicability is narrower and more task-specific than Paper 2’s general framework for forecasting model behavior.
SciConBench addresses a critical problem—evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. It introduces a large-scale benchmark (9.11K questions), a clean-room evaluation harness to prevent data leakage, and audits consumer-facing systems, revealing significant shortcomings. This has broad impact across AI safety, scientific integrity, and public health policy. Paper 2, while technically sound with its recursive ToM framework, addresses a narrower problem in LLM reasoning with incremental improvements on existing benchmarks. Paper 1's methodological contributions and real-world implications give it greater potential impact.
HORMA addresses a fundamental and broadly applicable challenge—efficient memory management for LLM agents in long-horizon tasks—with a novel hierarchical organization and RL-based retrieval approach. It has wider applicability across diverse agent tasks, demonstrates strong efficiency gains (22% token usage), and tackles the practical bottleneck of context scaling. While RecToM achieves impressive results on ToM benchmarks (100% on Hi-ToM), it addresses a narrower problem domain. HORMA's combination of hierarchical memory structure, RL-trained navigation, and demonstrated generalization to unseen tasks suggests broader impact across the rapidly growing LLM agent ecosystem.
Paper 2 (INFRAMIND) likely has higher scientific impact because it tackles an under-addressed, timely bottleneck—real-world deployment of multi-agent LLM systems under shared, congested infrastructure—yielding large latency/SLO gains with competitive or improved accuracy. Its infrastructure-aware, end-to-end RL formulation spans planning, routing, and scheduling, making it broadly applicable across serving stacks and agentic pipelines, with immediate industry relevance and cross-field impact (systems, RL, LLM orchestration). Paper 1 is novel and rigorous for ToM prompting, but its applications are narrower and more benchmark-centric.
Paper 1 addresses a critical bottleneck in AI reasoning—verifying research-level mathematical proofs. By introducing a strict step-level verification framework, it mitigates 'context poisoning' and lays the foundation for automated proof-review systems. This has transformative potential for advancing AI capabilities in frontier mathematics and formal logic. While Paper 2 presents a strong theoretical framework for Theory of Mind, the saturation of ToM benchmarks limits its broader methodological impact compared to solving complex, open-ended mathematical verification.
Paper 1 introduces RecToM, a novel framework addressing a fundamental AI challenge (Theory of Mind reasoning) with strong theoretical grounding (KD45 modal logic analysis) and demonstrates state-of-the-art results including 100% accuracy on a challenging benchmark. It offers broader impact across cognitive science, AI alignment, and multi-agent systems. Paper 2 provides interesting empirical observations about coding agents' metaprogramming strategies on esoteric languages, but its scope is narrower, findings are more descriptive than prescriptive, and the practical implications are more limited. Paper 1's methodological contribution and theoretical depth give it greater scientific impact potential.
Paper 2 addresses a fundamental challenge in AI—Theory of Mind reasoning in LLMs—with broader implications across cognitive science, NLP, and AI safety. Its formal grounding in modal logic (KD45), state-of-the-art results across multiple benchmarks (including 100% on Hi-ToM), and applicability across multiple LLM backbones give it wider scientific reach. Paper 1, while practically useful for mining operations, addresses a narrower domain-specific scheduling problem and primarily demonstrates that LLMs can approximate existing MILP solutions, representing more incremental engineering than fundamental scientific advancement.