Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dangjoo Kim
Abstract
Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.
AI Impact Assessments
Scientific Impact Assessment: Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
1. Core Contribution
This paper introduces Imaginative Perception Tokens (IPT), intermediate visual representations that externalize what a VLM would perceive under an unobserved spatial configuration—a novel viewpoint, a ground-level view from a top-down map, or a unified bird's-eye view from multiple partial observations. The key conceptual distinction from prior work (Visual Sketchpad, MVoT, Mirage, Mull-Tokens) is that IPTs represent *missing* spatial structure rather than refining *visible* structure. The paper formalizes three tasks—Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC)—each requiring reasoning about spatial information not directly observable in the input. Training datasets of ~20K examples each are constructed with ground-truth intermediate imaginations and final answers.
The approach builds on BAGEL, a unified decoder-only transformer supporting interleaved text-image generation, repurposing its generative capacity for spatial reasoning intermediates rather than open-ended image generation. A noteworthy finding is that IPT-trained models improve even in answer-only inference mode (no image generated at test time), suggesting the imagination supervision shapes better internal spatial representations.
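To make the contrast between label-only and IPT supervision concrete, the following is a minimal sketch of how a training target might be assembled under each regime. It is not the authors' data format: `SpatialExample`, `build_target`, the field names, the file names, and the ordering of the imagined view before the answer are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# (modality, content) pair; modality is "image" or "text". Hypothetical type.
Segment = Tuple[str, str]


@dataclass
class SpatialExample:
    """One training example (hypothetical schema, not the paper's)."""
    task: str                          # "PET", "PT", or "MVC"
    observed_views: List[str]          # paths to the input images
    question: str
    answer: str
    imagination: Optional[str] = None  # path to the ground-truth imagined view


def build_target(example: SpatialExample, mode: str) -> List[Segment]:
    """Return the ordered segments the model is supervised to produce."""
    if mode == "label_only":
        # Supervise the final answer only.
        return [("text", example.answer)]
    if mode == "ipt":
        # Assumed ordering: imagined view first, then the answer.
        if example.imagination is None:
            raise ValueError("IPT supervision needs a ground-truth imagination")
        return [("image", example.imagination), ("text", example.answer)]
    raise ValueError(f"unknown mode: {mode}")


# Illustrative values only; none of these come from the paper's datasets.
example = SpatialExample(
    task="MVC",
    observed_views=["view_0.png", "view_1.png"],
    question="How many chairs are in the room?",
    answer="3",
    imagination="bev_map.png",
)
label_only_target = build_target(example, "label_only")
ipt_target = build_target(example, "ipt")
```

Under this reading, answer-only inference corresponds to asking an IPT-trained model for the text segment alone, which is the setting in which the review reports that gains persist.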
2. Methodological Rigor
Strengths in experimental design:
Concerns:
3. Potential Impact
Direct impact: The conceptual framing of "imaginative perception" as a distinct category of intermediate reasoning—separate from refining visible structure—is valuable. It provides a principled criterion for when visual intermediates should be employed: when required spatial structure is absent from input.
Dataset contribution: The three benchmarks with ground-truth imaginations fill a genuine gap. No prior dataset pairs spatial reasoning questions with ground-truth intermediate views representing unobserved configurations.
Broader implications: The finding that spatial reasoning degrades under textual chain-of-thought has implications for the broader reasoning community, suggesting that modality-appropriate reasoning representations matter. This could influence how future models handle geometric/spatial vs. logical/verbal reasoning differently.
Limitations on impact: The approach is tightly coupled to BAGEL's unified architecture. Replication requires access to models that natively support interleaved generation, limiting immediate adoption. The sim-to-real transfer shows promise but real-world performance gaps remain significant.
4. Timeliness & Relevance
The paper is highly timely. Spatial reasoning is an acknowledged weakness of current VLMs, and there is growing interest in multimodal chain-of-thought (OpenAI's o3/o4-mini image reasoning, ThinkMorph). The emergence of unified models (BAGEL, Chameleon, Janus) that support interleaved generation makes this approach technically feasible now in a way it wasn't previously. The paper's exploration of discrete tokens (Appendix Section 11) provides useful negative evidence that motivated the continuous latent approach, adding historical context.
5. Strengths & Limitations
Key Strengths:
1. Clean conceptual contribution: The distinction between imaginative perception (missing structure) and perceptual refinement (visible structure) is well-articulated and scientifically useful.
2. Strong negative result: Text CoT degrading spatial performance is counterintuitive and important for the field.
3. Answer-only inference benefit: That imagination supervision helps even without generating images at test time is the paper's most compelling finding, suggesting genuine representational improvement.
4. Comprehensive ablations: Resolution, modality, inference mode, and cross-domain transfer are all carefully studied.
Notable Weaknesses:
1. Imagination quality bottleneck: The 36-point gap between generated and GT imaginations on PT (Table 4) suggests the approach is fundamentally limited by generation quality for harder tasks.
2. Inconsistent gains across tasks: IPT underperforms label-only on in-domain PET (96.8% vs. 97.5%) and in-domain PT (49.0% vs. 65.7%), with clear benefits only on MVC (+3.4%). The narrative of consistent improvement is somewhat overstated.
3. Task-specific imagination formats: Each task requires a different imagination target (novel viewpoint, sideview, BEV map), limiting scalability to a general-purpose spatial reasoner.
4. Limited real-world evaluation: Real-world benchmarks are small (332 for PT, 200 for MessyTable), and some configurations show no clear IPT advantage over label-only.
5. Reproducibility concerns: BAGEL is a recent model with limited community adoption; the paper promises code release but hasn't done so yet.
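The figures in weaknesses 2 and 4 can be checked with a few lines of arithmetic. The sketch below computes the IPT versus label-only gaps from the accuracies quoted above, and a rough worst-case 95% margin of error for the two real-world benchmark sizes. The margin uses the normal approximation at p = 0.5, which is an assumption of this sketch and not an analysis taken from the paper; the function and variable names are likewise illustrative.

```python
import math

# Accuracies (percent) quoted in weakness 2, as (IPT, label-only).
IN_DOMAIN = {"PET": (96.8, 97.5), "PT": (49.0, 65.7)}

# Real-world benchmark sizes quoted in weakness 4.
BENCHMARK_SIZES = {"PT real-world": 332, "MessyTable": 200}


def gap(ipt: float, label_only: float) -> float:
    """IPT minus label-only in percentage points (negative: IPT is worse)."""
    return round(ipt - label_only, 1)


def worst_case_margin(n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in points, for an accuracy measured
    on n examples, using the normal approximation at the worst case p = 0.5."""
    return round(100 * z * math.sqrt(0.25 / n), 1)


gaps = {task: gap(*scores) for task, scores in IN_DOMAIN.items()}
margins = {name: worst_case_margin(n) for name, n in BENCHMARK_SIZES.items()}

print(gaps)     # {'PET': -0.7, 'PT': -16.7}
print(margins)  # {'PT real-world': 5.4, 'MessyTable': 6.9}
```

On these assumptions, differences of a few points on the real-world benchmarks fall within sampling noise, which is consistent with the observation that some configurations show no clear IPT advantage.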
Overall Assessment
This is a well-motivated paper that introduces a conceptually clean idea—supervising VLMs with intermediate visual representations of unobserved spatial structure—and provides supporting datasets and experiments. The strongest contribution is the conceptual framing and the finding that imagination supervision can improve internal representations without requiring image generation at inference. However, the empirical story is mixed: gains are inconsistent across tasks, the approach is bottlenecked by generation quality, and the comparison framework (fine-tuned vs. zero-shot baselines) complicates interpretation. The paper advances understanding of spatial reasoning in VLMs but falls short of demonstrating a robustly practical method.
Generated Jun 3, 2026
Comparison History (22)
Paper 1 introduces a novel concept (Imaginative Perception Tokens) that addresses a fundamental limitation of VLMs in spatial reasoning—a critical bottleneck for embodied AI and robotics. The finding that visual intermediate representations outperform textual chain-of-thought for spatial tasks reveals an important modality mismatch insight with broad implications. Paper 2's interpretable architecture is valuable but faces steep adoption barriers given Transformer dominance, and interpretability via prototypes has precedent in vision models. Paper 1's approach is more immediately actionable and addresses a timelier problem in the rapidly evolving multimodal AI landscape.
Paper 1 is more likely to have higher scientific impact because it introduces a novel learning signal (Imaginative Perception Tokens) for unobserved spatial reasoning in VLMs, along with new tasks and ~20K-example datasets—assets that can drive broad follow-on research and benchmarking. The approach is methodologically anchored in measurable gains and yields interpretable intermediate representations, with relevance to robotics, embodied AI, and multimodal reasoning. Paper 2 is valuable engineering infrastructure, but frameworks are easier to supersede and may have narrower long-term scientific novelty compared to a new supervision paradigm plus benchmarks.
Paper 1 introduces a novel, empirically validated method for spatial reasoning in Vision Language Models, a rapidly growing and highly influential field. Its introduction of Imaginative Perception Tokens and new datasets provides immediate utility and advances the state-of-the-art beyond textual chain-of-thought. Paper 2 is a position paper proposing an agenda for MILP robustness; while practically important, it lacks the concrete empirical breakthroughs and broad, immediate applicability of Paper 1 in the fast-paced AI community.
Paper 2 demonstrates higher potential scientific impact due to its broad applicability in the rapidly expanding field of Multimodal Large Language Models. By addressing a critical bottleneck in AI—spatial reasoning with unobservable information—the introduction of Imaginative Perception Tokens (IPT) has extensive implications for embodied AI, robotics, and general multimodal reasoning. While Paper 1 offers a valuable contribution to scientific data compression, its impact is largely confined to High-Performance Computing (HPC) domains. Paper 2's dataset creation, novel token formulation, and demonstrated improvements over textual Chain-of-Thought position it as a foundational advancement with wider cross-disciplinary relevance.
Paper 1 introduces a more novel conceptual contribution—Imaginative Perception Tokens that externalize spatial reasoning as intermediate perceptual representations rather than text, revealing a fundamental modality mismatch in spatial reasoning via language. This has broader theoretical implications across VLM research, cognitive science connections, and multiple spatial reasoning tasks. Paper 2's latent reasoning distillation for mobile agents is impactful but more application-specific and incremental (efficiency gains via reasoning compression). Paper 1's finding that textual CoT degrades spatial performance challenges prevailing assumptions and could redirect research on multimodal reasoning more broadly.
Paper 2 introduces a novel evaluation framework and benchmark (AgentCL) for continual learning in language agents. Benchmarks and rigorous evaluation frameworks typically have a broader and longer-lasting scientific impact by setting new standards for a rapidly growing field, whereas Paper 1 proposes a more specific methodological improvement for spatial reasoning in vision-language models.
Paper 2 addresses a highly timely and critical issue in the rapidly growing field of Large Reasoning Models (test-time compute). By identifying and quantifying 'harmful overthinking'—where models arrive at the correct answer but subsequently degrade their own response—it exposes a fundamental flaw in current scaling paradigms. This has broader implications across all language and multimodal reasoning tasks compared to Paper 1, which, while offering a novel approach to spatial reasoning in VLMs, represents a more domain-specific architectural improvement.
Paper 2 proposes a fundamental methodological advancement by introducing Imaginative Perception Tokens (IPT) to address a core limitation in Vision Language Models: spatial reasoning. By demonstrating that perceptual intermediate representations outperform textual Chain-of-Thought for spatial tasks, it offers deep theoretical and architectural insights for the field of multimodal AI and embodied robotics. While Paper 1 provides a highly useful, resource-efficient evaluation tool, its contribution is primarily applied engineering, whereas Paper 2 shifts the paradigm on how VLMs process and reason about unobserved visual information.
Paper 1 introduces a novel concept (Imaginative Perception Tokens) that addresses a fundamental limitation of VLMs in spatial reasoning—a timely and broadly impactful problem given the rapid adoption of multimodal AI. The finding that spatial reasoning suffers from modality mismatch when forced through language is a significant conceptual insight with broad implications for VLM architecture design. Paper 2 makes solid contributions to safe RL with formal guarantees, but operates in a more established niche. Paper 1's relevance to the rapidly growing multimodal AI field and its potential to influence how future VLMs handle spatial reasoning gives it higher impact potential.
Paper 2 likely has higher impact: it introduces a large-scale, real-world benchmark (millions of trade instances) enabling evaluation of personalized decision modeling grounded in behavioral traces, directly addressing a timely gap where simulated users can mislead. Its applications span decision support, economics/finance, HCI, personalization, and trustworthy AI, giving broad cross-field relevance. The methodology leverages objective public records and provides multiple evaluation interfaces exposing failure modes, supporting rigorous, reproducible comparisons. Paper 1 is novel and valuable for spatial reasoning in VLMs, but its impact may be narrower and dataset scale/results appear more incremental.
Paper 2 likely has higher impact due to broader applicability across multimodal AI tasks and domains (robotics, navigation, AR/VR, embodied agents), plus timely relevance to improving spatial reasoning in VLMs. IPT is a generally reusable training signal and comes with new tasks/datasets that can catalyze follow-on work. Paper 1 is innovative and rigorous within structure-based drug design, but its impact is narrower (specialized benchmarks, domain constraints) and may depend more on downstream wet-lab validation and deployment hurdles.
Paper 1 introduces a novel and principled approach (Imaginative Perception Tokens) that addresses a fundamental limitation of VLMs—spatial reasoning about unobserved viewpoints. It provides new tasks, datasets, and demonstrates consistent improvements, with the key insight that spatial reasoning suffers from modality mismatch when forced through language. This has broad implications for multimodal AI architecture design. Paper 2 addresses an interesting security concern about reasoning trace extraction but is more narrowly focused on an adversarial prompting technique with less fundamental scientific contribution and more limited generalizability.
Paper 2 introduces a novel methodological advancement (Imaginative Perception Tokens) that addresses a fundamental limitation in Multimodal Large Language Models regarding spatial reasoning. Demonstrating that visual intermediate representations outperform textual Chain-of-Thought provides critical architectural insights for a highly active field. While Paper 1 offers a timely dataset for forensics and human-agent interactions, its relatively small scale (37 hours) likely limits its foundational, long-term scientific impact compared to the algorithmic innovations presented in Paper 2.
Paper 1 introduces a novel conceptual framework (Imaginative Perception Tokens) that addresses a fundamental limitation of VLMs—spatial reasoning about unobserved configurations—with broad implications across multimodal AI. The finding that forcing spatial computation through language (chain-of-thought) degrades performance reveals an important modality mismatch insight. Paper 2 is a solid engineering contribution to autonomous driving testing but is more incremental, combining known concepts (LLM agents, Pareto optimization, evolutionary methods) in a domain-specific application. Paper 1's broader applicability across vision-language tasks and its principled insight about representational modality give it higher potential impact.
Paper 1 introduces a more novel and broadly applicable concept—Imaginative Perception Tokens that externalize spatial reasoning in VLMs through intermediate perceptual representations rather than language. This addresses a fundamental limitation (spatial reasoning via language creates modality mismatch) with a principled, generalizable approach. Paper 2 offers a well-executed but more domain-specific contribution (chest X-ray report generation) with set-distance rewards. While Paper 2 shows strong empirical gains, Paper 1's insight about modality-appropriate intermediate representations for spatial reasoning has broader implications across computer vision, robotics, and embodied AI.
Paper 2 addresses a fundamental cognitive limitation in Vision Language Models—spatial reasoning and imaginative perception—which has profound implications for embodied AI, robotics, and advanced multimodal reasoning. Its introduction of Imaginative Perception Tokens offers a novel methodological paradigm. In contrast, Paper 1 presents a highly practical but primarily engineering-focused optimization for token cost reduction. While timely, Paper 1's approach is less likely to drive foundational theoretical advancements across diverse AI domains compared to the architectural and representational innovations proposed in Paper 2.
Paper 2 likely has higher impact: it targets a broad, timely problem (reliability/safety of LLM multi-agent systems) with clear real-world relevance under emerging regulation, and provides an actionable protocol plus open-source tooling and a benchmark (POIROT + BLAME), enabling rapid adoption and follow-on work across domains. Its empirical claims include statistical significance and scaling analyses, suggesting stronger methodological rigor. Paper 1 is novel for spatial reasoning supervision in VLMs and contributes datasets, but its impact is narrower to multimodal spatial tasks and dependent on a specific training paradigm/backbone.
Paper 1 has significantly higher potential impact due to its timeliness and broad applicability in AI. Enhancing spatial reasoning in Vision-Language Models addresses a critical bottleneck in modern multimodal AI, with immediate real-world applications in robotics, navigation, and AR. Introducing 'Imaginative Perception Tokens' is a highly novel, empirically validated approach. In contrast, Paper 2 offers a niche theoretical advancement in formal defeasible logic, which, while methodologically rigorous, has a much narrower scope and fewer immediate practical applications across different fields.
Paper 2 likely has higher impact: it introduces a novel training signal (Imaginative Perception Tokens) that targets a broadly recognized limitation in VLMs (spatial reasoning under partial observability), with clear methodological contributions (new tasks + ~20K datasets with ground-truth intermediate representations) and practical relevance to robotics, navigation, AR/VR, and embodied AI. The idea may generalize across multimodal models and suggests a new paradigm for intermediate supervision beyond language CoT. Paper 1 is valuable for agent reliability/auditing but is more evaluation-focused and narrower in downstream scope.
Paper 2 likely has higher scientific impact because it delivers an open, comprehensive benchmark for single-cell multi-omics modality translation—a rapidly growing, high-stakes area with direct biomedical applications. Benchmarks often become community infrastructure, shaping evaluation standards, enabling fair comparison, and accelerating method development across labs. It also studies underexplored but practically critical factors (feature quality/selection, few-shot), increasing usefulness and rigor. Paper 1 is novel and timely for multimodal AI, but its impact may be narrower (specific VLM spatial reasoning setting and datasets) and more dependent on adoption within a fast-moving model landscape.