Haocheng Luo, Jiahui Liu, Ruicheng Zhang, Zhizhou Zhong, Jiaqi Huang, Zunnan Xu, Quan Shi, Jun Zhou
While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.
The paper identifies and formalizes a "perception–reasoning modality gap" in visual spatial planning: VLMs must both recover latent state structures from pixels (perception) and reason over those structures to produce valid action plans (reasoning). These are entangled failure modes that standard SFT and RLVR cannot disentangle effectively.
MGSD is a two-stage training framework: (1) cold-start, perception-oriented SFT that trains the model to extract planning-relevant state variables (coordinates, topology, affordances) from images, and (2) symbolic-guided on-policy self-distillation (OPSD), in which a frozen text-only teacher, conditioned on privileged symbolic states and reference plans, provides dense token-level supervision on the visual student's own generated rollouts. Critically, symbolic information is used only during training; inference remains purely visual.
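The token-level supervision in the second stage can be illustrated with a minimal sketch. It is not the paper's implementation: the choice of reverse KL (student relative to teacher), the function names `token_reverse_kl` and `opsd_loss`, the four-action vocabulary, and all logit values are assumptions made for illustration. The summary above only states that a frozen teacher supplies dense token-level signals on the student's own rollouts.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at a single token position.

    Assumption: reverse KL is used here because it is a common choice for
    on-policy distillation; the source does not name the divergence.
    """
    s = log_softmax(student_logits)
    t = log_softmax(teacher_logits)
    return sum(math.exp(si) * (si - ti) for si, ti in zip(s, t))


def opsd_loss(student_steps, teacher_steps):
    """Mean per-token divergence over one rollout sampled from the student.

    student_steps: per-token logits from the image-conditioned student.
    teacher_steps: per-token logits from the frozen teacher, scored on the
                   same student-generated prefix but conditioned on the
                   symbolic state (hypothetical inputs, not real model output).
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("rollouts must be non-empty and of equal length")
    kls = [token_reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


# Toy vocabulary of four actions: up, down, left, right (made-up numbers).
student_steps = [[2.0, 0.5, 0.1, 0.1], [0.2, 0.2, 1.5, 0.3]]
teacher_steps = [[3.0, 0.1, 0.1, 0.1], [0.1, 2.5, 0.2, 0.1]]

loss = opsd_loss(student_steps, teacher_steps)
print(f"mean per-token reverse KL: {loss:.4f}")
```

In a real training loop the same quantity would be computed on model logits with automatic differentiation, with gradients flowing only into the student because the teacher is frozen. The second position in the toy data, where the teacher prefers a different action than the student, contributes most of the loss, which is the kind of dense per-token signal that a sparse end-of-episode reward does not provide.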
The key conceptual insight is treating symbolic state representations as "privileged information" for a teacher in a cross-modal distillation setup. This is a clean formulation that connects privileged learning (à la Learning Using Privileged Information) with on-policy distillation in the VLM context.
Direct applications: The framework is applicable to any setting where paired visual-symbolic training data exists — robotics simulators, game environments, warehouse logistics, and structured planning domains. The principle of using symbolic state as privileged supervision is broadly transferable.
Broader influence: The perception-reasoning decomposition and the diagnostic framework could influence how the community evaluates and improves VLM planning capabilities. The idea of cross-modal privileged distillation (symbolic teacher → visual student) could extend to other modality gaps (e.g., language→audio, structured data→unstructured).
Limitations on impact: The reliance on paired visual-symbolic data limits applicability to environments where such correspondence is naturally available. Open-world visual planning, arguably the more impactful setting, remains out of scope. The discrete action space and short horizons further limit generalization claims.
The paper addresses a timely problem. VLMs are increasingly deployed for agentic tasks, yet their spatial reasoning and planning capabilities lag behind their language understanding. The gap between visual and symbolic planning is well-documented but under-addressed. The paper also positions itself well relative to concurrent work: RLVR methods (which struggle with sparse rewards in planning), visual chain-of-thought approaches (which add inference-time complexity), and standard distillation (which does not handle modality gaps).
The use of on-policy distillation from symbolic teachers is a natural but underexplored idea in this space, making the contribution relevant and timely.
Generated Jun 5, 2026
Paper 2 likely has higher scientific impact: it proposes a novel, generalizable training framework (modality-gap-aware self-distillation) that directly advances core capabilities—visual state abstraction and multi-step planning—showing large gains across model scales with clear diagnostics. This is timely for embodied/robotic agents and multimodal reasoning, with broader applicability across vision-language planning tasks. Paper 1 is valuable infrastructure for realistic agent evaluation, but its contribution is narrower (benchmarking methodology and a 40-task suite) and may have more limited cross-field impact compared to a technique that can improve model performance broadly.
Paper 1 offers higher potential scientific impact by addressing a critical bottleneck in modern AI: handling long-horizon tasks despite finite context windows. By introducing a novel method to synthesize training data for 'delegation intelligence' and releasing the harness, data, and model weights, it provides highly foundational tools for the booming open-source AI agent community. While Paper 2 presents a rigorous approach to visual spatial planning, Paper 1's contributions to autonomous multi-agent workflows and deep research have broader cross-disciplinary applicability and align more closely with current transformative AI trends.
Paper 2 presents a novel technical framework (MGSD) with rigorous experimental validation showing significant quantitative improvements (18-19% gains) on visual spatial planning benchmarks. It addresses a fundamental challenge in vision-language models with a well-motivated approach (modality-gap-aware self-distillation) that has broad applicability across embodied AI and robotics. Paper 1 (CHAP) proposes a protocol specification for human-agent collaboration—while practically relevant, it is more of an engineering/standards contribution than a scientific advancement, lacking empirical validation of its claims and representing incremental infrastructure work rather than novel scientific insight.
Paper 1 has higher potential impact due to its novelty and timeliness in AI safety: it proposes and operationalizes an upstream, mechanistic precursor (PRIME) to reward hacking, with predictive probes, activation-level interventions, evaluator-shift generalization, and links to out-of-domain misalignment—making it broadly relevant to alignment, interpretability, and RL evaluation. While Paper 2 offers a strong, application-ready training framework for visual planning with solid empirical gains, its impact is more incremental and domain-specific, with narrower cross-field implications than early-warning signals for reward hacking and misalignment.
Trace2Skill demonstrates broader impact across multiple domains (office workflows, math reasoning, vision QA), shows impressive cross-model transferability (e.g., 57.65pp improvement), and addresses the fundamental scalability problem of skill creation for LLM agents—a rapidly growing field. Its portable, parameter-free skill reuse mechanism has wide practical applicability. Paper 2, while methodologically sound with its modality-gap-aware distillation for visual spatial planning, addresses a narrower problem domain with more incremental improvements (18-19%), limiting its breadth of impact.
Paper 2 (MGSD) likely has higher scientific impact due to broader applicability and timeliness: improving visual spatial planning in large vision-language models affects robotics, embodied AI, and general multimodal reasoning. The modality-gap framing plus a concrete two-stage self-distillation recipe (cold-start grounding + privileged symbolic teacher with on-policy distillation) is a reusable methodological contribution across tasks and model scales, with sizable empirical gains and diagnostics. Paper 1 is novel and valuable for drug discovery, but its impact is more domain-specific and benchmark-dependent, with narrower cross-field transfer.
Paper 2 addresses inference-time reasoning and cross-problem learning without fine-tuning, directly advancing the highly impactful field of test-time compute for LLMs. Its framework is broadly applicable across any reasoning task, offering a highly scalable and timely solution. While Paper 1 presents a solid contribution to visual spatial planning, Paper 2's potential to improve general reasoning in LLMs without gradient updates grants it wider relevance and greater potential real-world impact across various domains.
Paper 2 addresses a well-defined, broadly relevant problem (visual spatial planning in VLMs) with a clear, reproducible methodology and strong empirical results (18-19% improvements). It targets a fundamental bottleneck in multimodal AI—bridging perception and reasoning—with wide applicability across robotics, embodied AI, and planning domains. Paper 1, while intellectually interesting in proposing typed federated artifacts for tool routing, addresses a narrower niche (federated tool routing across heterogeneous LLMs), relies heavily on novel but untested abstractions, and its practical impact is less immediately clear. Paper 2's open-source code and concrete benchmarks also enhance reproducibility and adoption.
Paper 2 has higher potential impact due to a novel, generalizable training framework (modality-gap-aware self-distillation) that targets a key limitation in vision-language models: visual spatial planning. It introduces a principled two-stage method, reports substantial gains across model scales with ablations/diagnostics, and offers broad applicability to robotics, embodied AI, and visual decision-making. Paper 1 is useful and timely as a benchmarking/cost-efficiency evaluation of LLMs for Lean, but is primarily comparative rather than methodological innovation, with narrower cross-field impact.
Paper 1 addresses a fundamental bottleneck in embodied AI and robotics (visual spatial planning) with a rigorous, novel self-distillation framework. Its quantifiable improvements on established benchmarks demonstrate clear real-world utility. In contrast, while Paper 2 explores an intriguing aspect of LLM behavior (expressing feelings), its practical application is less clear, and the noted degradation in truthfulness limits its immediate beneficial impact. Paper 1 offers more reliable and broadly applicable technical advancements.