Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou

Jun 4, 2026 · arXiv:2606.06076v1

cs.AI · cs.CV

#1853 of 3622 · Artificial Intelligence

Tournament Score: 1396 ± 44 (scale: 1050 to 1800)

Win Rate: 50%

Rating: 5.8 / 10

Significance 5.5 · Rigor 6 · Novelty 6 · Clarity 7

Abstract

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MGSD – Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

1. Core Contribution

The paper identifies and formalizes a "perception–reasoning modality gap" in visual spatial planning: VLMs must both recover latent state structures from pixels (perception) and reason over those structures to produce valid action plans (reasoning). Because these two failure modes are entangled, standard supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) cannot separate them effectively.

MGSD proposes a two-stage training framework: (1) cold-start perception-oriented SFT that trains the model to extract planning-relevant state variables (coordinates, topology, affordances) from images, and (2) symbolic-guided on-policy self-distillation (OPSD) where a frozen text-only teacher, conditioned on privileged symbolic states and reference plans, provides dense token-level supervision on the visual student's own generated rollouts. Critically, symbolic information is used only during training — inference remains purely visual.

The key conceptual insight is treating symbolic state representations as "privileged information" for a teacher in a cross-modal distillation setup. This is a clean formulation that connects privileged learning (à la Learning Using Privileged Information) with on-policy distillation in the VLM context.
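As a rough illustration of the privileged-information setup described above, the sketch below builds the two contexts that would score the same student rollout prefix: the student sees only the image, while the teacher sees the symbolic state and reference plan. All names (`PlanningExample`, `student_context`, `teacher_context`), the prompt wording, the file name, and the FrozenLake-style grid encoding are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class PlanningExample:
    image_path: str       # rendered observation: the only input at inference
    symbolic_state: str   # privileged text state: used in training only
    reference_plan: str   # privileged reference plan: used in training only


def student_context(ex: PlanningExample, rollout_prefix: str) -> dict:
    """Visual student: image plus its own partial rollout, no symbols."""
    return {
        "image": ex.image_path,
        "text": "Plan a path to the goal.\n" + rollout_prefix,
    }


def teacher_context(ex: PlanningExample, rollout_prefix: str) -> dict:
    """Text-only teacher: privileged symbols plus the same student prefix."""
    return {
        "image": None,
        "text": (
            f"State:\n{ex.symbolic_state}\n"
            f"Reference plan: {ex.reference_plan}\n" + rollout_prefix
        ),
    }


example = PlanningExample(
    image_path="frozenlake_0001.png",      # hypothetical file name
    symbolic_state="S F F\nF H F\nF F G",  # S=start, F=frozen, H=hole, G=goal
    reference_plan="RIGHT RIGHT DOWN DOWN",
)
prefix = "RIGHT"  # tokens the student has generated so far

student_in = student_context(example, prefix)
teacher_in = teacher_context(example, prefix)
```

Both contexts end with the same prefix, so the teacher's next-token distribution can be compared with the student's at each position of the student's own rollout, while the symbolic fields never reach the student.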

2. Methodological Rigor

Strengths in methodology:

The two-stage design is well-motivated: cold-start SFT ensures student rollouts are grounded enough for OPSD to be effective, addressing a known bootstrapping problem in on-policy methods.

The reverse-KL objective (Eq. 2) is a principled choice for mode-seeking behavior, encouraging the student to concentrate probability mass where the teacher assigns high likelihood.

The diagnostic framework (State F1, Plan on GT, E2E Acc.) is a valuable contribution for decomposing failures, providing causal evidence that improvements come from both perception and reasoning.
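To make the diagnostic decomposition above concrete, here is a minimal sketch of a set-based State F1 and a simple failure triage. It assumes states are represented as sets of `(label, row, col)` facts and that each example carries a "Plan on GT" flag and an "E2E" correctness flag; the paper's exact metric definitions may differ, and `state_f1` and `triage` are hypothetical names.

```python
def state_f1(predicted, gold):
    """F1 between predicted and ground-truth state facts (set overlap)."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0
    true_pos = len(predicted & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def triage(e2e_correct, plan_on_gt_correct, f1):
    """Attribute an end-to-end failure to perception, reasoning, or both.

    Assumed rule: imperfect state recovery counts as a perception error;
    failing to plan even from the ground-truth state counts as a
    reasoning error.
    """
    if e2e_correct:
        return "ok"
    perception_bad = f1 < 1.0
    reasoning_bad = not plan_on_gt_correct
    if perception_bad and reasoning_bad:
        return "both"
    if perception_bad:
        return "perception"
    if reasoning_bad:
        return "reasoning"
    return "other"


# Illustrative 4x4 grid facts (made up for this example).
gold = {("start", 0, 0), ("goal", 3, 3), ("hole", 1, 1), ("hole", 2, 3)}
pred = {("start", 0, 0), ("goal", 3, 3), ("hole", 1, 2)}

f1 = state_f1(pred, gold)  # precision 2/3, recall 2/4
label = triage(e2e_correct=False, plan_on_gt_correct=True, f1=f1)
```

In this example the model plans correctly when given the true state but misreads one hole and misses another, so the failure is attributed to perception.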

Concerns:

The environments are relatively simple gridworld tasks (FrozenLake, Maze, MiniBehaviour) with small state spaces (up to 8×8 grids). While these are standard benchmarks, the visual complexity is limited — 256×256 rendered images with clear geometric structures. It remains unclear how MGSD would perform on more visually complex or ambiguous environments.

The training data (18K examples) is procedurally generated with deterministic symbolic annotations, which is a favorable setting. The paper acknowledges this limitation but doesn't explore robustness to noisy or approximate symbolic states.

The ablation is conducted only on the 4B model. Repeating it on the 8B backbone would strengthen the claims about individual design choices.

The paper uses only one rollout per prompt during OPSD. The sensitivity to this choice and the effect of multiple rollouts are unexplored.

Token weighting is fixed to uniform (w_t = 1); the paper suggests emphasizing planning-critical tokens but does not investigate it.
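The reverse-KL objective and the token weights w_t discussed in this section can be sketched as follows. This is a minimal pure-Python illustration, assuming the loss is a weighted mean over rollout positions of KL(student ‖ teacher); the normalization by the sum of weights and all function names are assumptions, and the paper's Eq. 2 may differ in detail.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher): the expectation is taken under the student."""
    return sum(
        p * (math.log(p) - math.log(max(q, eps)))
        for p, q in zip(student_probs, teacher_probs)
        if p > 0.0
    )


def weighted_reverse_kl(student_logits, teacher_logits, weights=None):
    """Weighted mean of per-token reverse KL; weights default to w_t = 1."""
    if weights is None:
        weights = [1.0] * len(student_logits)
    total = sum(
        w * reverse_kl(softmax(s), softmax(t))
        for s, t, w in zip(student_logits, teacher_logits, weights)
    )
    return total / sum(weights)


# Mode-seeking behavior: a student that commits to one of two teacher
# modes pays only ln 2 under reverse KL.
one_mode = reverse_kl([1.0, 0.0], [0.5, 0.5])

# Two rollout positions: the student matches the teacher at the first
# position and disagrees at the second.
student = [[2.0, 0.0], [0.0, 1.0]]
teacher = [[2.0, 0.0], [1.0, 0.0]]
uniform_loss = weighted_reverse_kl(student, teacher)
focused_loss = weighted_reverse_kl(student, teacher, weights=[0.0, 1.0])
```

Under forward KL the same one-mode student would incur an infinite penalty for the mode it ignores, which is why reverse KL is described as mode-seeking. Passing non-uniform `weights` is one way the unexplored emphasis on planning-critical tokens could be tried.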

3. Potential Impact

Direct applications: The framework is applicable to any setting where paired visual-symbolic training data exists — robotics simulators, game environments, warehouse logistics, and structured planning domains. The principle of using symbolic state as privileged supervision is broadly transferable.

Broader influence: The perception-reasoning decomposition and the diagnostic framework could influence how the community evaluates and improves VLM planning capabilities. The idea of cross-modal privileged distillation (symbolic teacher → visual student) could extend to other modality gaps (e.g., language→audio, structured data→unstructured).

Limitations on impact: The reliance on paired visual-symbolic data limits applicability to environments where such correspondence is naturally available. Open-world visual planning — the arguably more impactful setting — remains out of scope. The discrete action space and short horizons further limit generalization claims.

4. Timeliness & Relevance

The paper addresses a timely problem. VLMs are increasingly deployed for agentic tasks, yet their spatial reasoning and planning capabilities lag behind their language understanding. The gap between visual and symbolic planning is well-documented but under-addressed. The paper also positions itself well relative to concurrent work: RLVR methods (which struggle with sparse rewards in planning), visual chain-of-thought approaches (which add inference-time complexity), and standard distillation (which doesn't handle modality gaps).

The use of on-policy distillation from symbolic teachers is a natural but underexplored idea in this space, making the contribution relevant and timely.

5. Strengths & Limitations

Key Strengths:

Clean problem formulation with the perception-reasoning modality gap concept

Strong empirical results: +19.3% and +18.4% macro average improvements on 4B and 8B backbones, competitive with much larger proprietary models

Thorough diagnostic analysis that causally decomposes improvements into perception and reasoning components

The framework is architecturally simple — no new modules at inference time

Comprehensive baselines including state-of-the-art proprietary models (GPT-5, Gemini-3-Flash)

Notable Weaknesses:

Limited visual complexity — gridworld environments with synthetic renderings don't stress real visual perception

The "self-distillation" framing is slightly misleading: the teacher is initialized from the same base model but receives fundamentally different (privileged) inputs, making it more of a cross-modal distillation than true self-distillation

No investigation of scalability to longer horizons, continuous action spaces, or partial observability

The gap to symbolic upper bounds on MiniBehaviour (13.4 points for 4B) suggests the method's effectiveness varies significantly by environment complexity

Reproducibility depends on specific Qwen3-VL models and procedural environment generation; the contribution is more in the training recipe than in reusable artifacts

Additional observations:

The comparison with Gemini-3-Flash (67.0% avg) and GPT-5 (41.3%) puts MGSD-8B (35.6%) in perspective — while impressive relative to the base model, there remains substantial room for improvement

The paper would benefit from analyzing failure cases more systematically — when does MGSD fail, and are these primarily perception or reasoning failures?

The cold-start SFT stage implicitly teaches the model to output symbolic-like state descriptions from images, which may be a form of chain-of-thought that contributes independently to performance
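The gaps implied by the first observation above can be computed directly from the averages quoted in this assessment. The short sketch below only restates those three numbers and subtracts them; the dictionary name and layout are illustrative.

```python
# Average scores (%) as quoted in this assessment.
avg_score = {"Gemini-3-Flash": 67.0, "GPT-5": 41.3, "MGSD-8B": 35.6}

# Gap in percentage points between each proprietary model and MGSD-8B.
gaps = {
    name: round(score - avg_score["MGSD-8B"], 1)
    for name, score in avg_score.items()
    if name != "MGSD-8B"
}
print(gaps)  # {'Gemini-3-Flash': 31.4, 'GPT-5': 5.7}
```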

Generated Jun 5, 2026

Comparison History (20)

Won vs. STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

Paper 2 likely has higher scientific impact: it proposes a novel, generalizable training framework (modality-gap-aware self-distillation) that directly advances core capabilities—visual state abstraction and multi-step planning—showing large gains across model scales with clear diagnostics. This is timely for embodied/robotic agents and multimodal reasoning, with broader applicability across vision-language planning tasks. Paper 1 is valuable infrastructure for realistic agent evaluation, but its contribution is narrower (benchmarking methodology and a 40-task suite) and may have more limited cross-field impact compared to a technique that can improve model performance broadly.

gpt-5.2 · Jun 10, 2026

Lost vs. SearchSwarm: Towards Delegation Intelligence in Agentic LLMs for Long-Horizon Deep Research

Paper 1 offers higher potential scientific impact by addressing a critical bottleneck in modern AI: handling long-horizon tasks despite finite context windows. By introducing a novel method to synthesize training data for 'delegation intelligence' and releasing the harness, data, and model weights, it provides highly foundational tools for the booming open-source AI agent community. While Paper 2 presents a rigorous approach to visual spatial planning, Paper 1's contributions to autonomous multi-agent workflows and deep research have broader cross-disciplinary applicability and align more closely with current transformative AI trends.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. Collaborative Human-Agent Protocol (CHAP)

Paper 2 presents a novel technical framework (MGSD) with rigorous experimental validation showing significant quantitative improvements (18-19% gains) on visual spatial planning benchmarks. It addresses a fundamental challenge in vision-language models with a well-motivated approach (modality-gap-aware self-distillation) that has broad applicability across embodied AI and robotics. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration—while practically relevant, it is more of an engineering/standards contribution than a scientific advancement, lacking empirical validation of its claims and representing incremental infrastructure work rather than novel scientific insight.

claude-opus-4-6 · Jun 9, 2026

Lost vs. Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

Paper 1 has higher potential impact due to its novelty and timeliness in AI safety: it proposes and operationalizes an upstream, mechanistic precursor (PRIME) to reward hacking, with predictive probes, activation-level interventions, evaluator-shift generalization, and links to out-of-domain misalignment—making it broadly relevant to alignment, interpretability, and RL evaluation. While Paper 2 offers a strong, application-ready training framework for visual planning with solid empirical gains, its impact is more incremental and domain-specific, with narrower cross-field implications than early-warning signals for reward hacking and misalignment.

gpt-5.2 · Jun 9, 2026

Lost vs. Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Trace2Skill demonstrates broader impact across multiple domains (office workflows, math reasoning, vision QA), shows impressive cross-model transferability (e.g., 57.65pp improvement), and addresses the fundamental scalability problem of skill creation for LLM agents—a rapidly growing field. Its portable, parameter-free skill reuse mechanism has wide practical applicability. Paper 2, while methodologically sound with its modality-gap-aware distillation for visual spatial planning, addresses a narrower problem domain with more incremental improvements (18-19%), limiting its breadth of impact.

claude-opus-4-6 · Jun 6, 2026

Won vs. Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

Paper 2 (MGSD) likely has higher scientific impact due to broader applicability and timeliness: improving visual spatial planning in large vision-language models affects robotics, embodied AI, and general multimodal reasoning. The modality-gap framing plus a concrete two-stage self-distillation recipe (cold-start grounding + privileged symbolic teacher with on-policy distillation) is a reusable methodological contribution across tasks and model scales, with sizable empirical gains and diagnostics. Paper 1 is novel and valuable for drug discovery, but its impact is more domain-specific and benchmark-dependent, with narrower cross-field transfer.

gpt-5.2 · Jun 6, 2026

Lost vs. ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

Paper 2 addresses inference-time reasoning and cross-problem learning without fine-tuning, directly advancing the highly impactful field of test-time compute for LLMs. Its framework is broadly applicable across any reasoning task, offering a highly scalable and timely solution. While Paper 1 presents a solid contribution to visual spatial planning, Paper 2's potential to improve general reasoning in LLMs without gradient updates grants it wider relevance and greater potential real-world impact across various domains.

gemini-3.1-pro-preview · Jun 5, 2026

Won vs. Synapse: Federated Tool Routing via Typed Compendium Artifacts

Paper 2 addresses a well-defined, broadly relevant problem (visual spatial planning in VLMs) with a clear, reproducible methodology and strong empirical results (18-19% improvements). It targets a fundamental bottleneck in multimodal AI—bridging perception and reasoning—with wide applicability across robotics, embodied AI, and planning domains. Paper 1, while intellectually interesting in proposing typed federated artifacts for tool routing, addresses a narrower niche (federated tool routing across heterogeneous LLMs), relies heavily on novel but untested abstractions, and its practical impact is less immediately clear. Paper 2's open-source code and concrete benchmarks also enhance reproducibility and adoption.

claude-opus-4-6 · Jun 5, 2026

Won vs. Evaluation of LLMs for Mathematical Formalization in Lean

Paper 2 has higher potential impact due to a novel, generalizable training framework (modality-gap-aware self-distillation) that targets a key limitation in vision-language models: visual spatial planning. It introduces a principled two-stage method, reports substantial gains across model scales with ablations/diagnostics, and offers broad applicability to robotics, embodied AI, and visual decision-making. Paper 1 is useful and timely as a benchmarking/cost-efficiency evaluation of LLMs for Lean, but is primarily comparative rather than methodological innovation, with narrower cross-field impact.

gpt-5.2 · Jun 5, 2026

Won vs. When AI Says It Feels

Paper 1 addresses a fundamental bottleneck in embodied AI and robotics (visual spatial planning) with a rigorous, novel self-distillation framework. Its quantifiable improvements on established benchmarks demonstrate clear real-world utility. In contrast, while Paper 2 explores an intriguing aspect of LLM behavior (expressing feelings), its practical application is less clear, and the noted degradation in truthfulness limits its immediate beneficial impact. Paper 1 offers more reliable and broadly applicable technical advancements.