Planning with the Views via Scene Self-Exploration
Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas
Abstract
Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning. It requires (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space.
AI Impact Assessments
Scientific Impact Assessment: "Planning with the Views via Scene Self-Exploration"
1. Core Contribution
This paper introduces view planning as a formal capability for VLMs — the ability to predict how camera movements change the rendered view and to compose such predictions across multiple turns to localize a target viewpoint in real 3D scenes. The work makes three interlinked contributions:
1. ViewSuite benchmark: A 165K-instance benchmark built on ~300 real ScanNet scenes with three diagnostic tasks, Path-to-View (P2V), View-to-Path (V2P), and Interactive View Planning, that decompose view planning into understanding and composition.
2. Diagnostic finding (the planning gap): 13 frontier VLMs achieve 50–70% on single-step view-action understanding but collapse to roughly 21% or lower on multi-turn Interactive View Planning (IVP), with Gemini 3.1 Pro at 21.4% and GPT-5.4 Pro at 18.5%. This reveals that local spatial knowledge does not compose into planning capability.
3. Iterative self-exploration with view graph distillation: A training framework that alternates RL exploration with supervised fine-tuning derived from a view graph — a structure that converts *all* exploration trajectories (including failures) into valid view-planning demonstrations. This lifts Qwen2.5-VL-7B from 2.5% to 47.8% on IVP, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%).
The central insight — that failed trajectories still encode valid view transitions and can be aggregated into a reusable graph structure for task reformulation — is both elegant and practically effective. This goes beyond standard Hindsight Experience Replay by distilling a persistent, growing graph into diverse supervised tasks rather than relabeling individual episodes.
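The idea of pooling failed trajectories into a reusable graph can be illustrated with a minimal sketch. It is not the paper's implementation: the view identifiers (`v0`, `v1`, ...), the action names, and both helper functions are hypothetical, and views are assumed to be already discretized into hashable IDs. The sketch merges every observed (view, action, next view) step into one directed graph and then reads a valid demonstration off it with breadth-first search, even though neither source trajectory reached the demonstrated target on its own.

```python
from collections import defaultdict, deque

def build_view_graph(trajectories):
    """Merge (view, action, next_view) steps from all trajectories,
    successful or not, into one directed graph. Hypothetical helper."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph

def shortest_action_path(graph, start, target):
    """Breadth-first search over the view graph.
    Returns the list of actions from start to target, or None if unreachable."""
    if start == target:
        return []
    parent = {start: None}  # view -> (previous view, action taken)
    queue = deque([start])
    while queue:
        view = queue.popleft()
        for action, next_view in graph.get(view, {}).items():
            if next_view in parent:
                continue
            parent[next_view] = (view, action)
            if next_view == target:
                # Walk back to the start to recover the action sequence.
                path, node = [], next_view
                while parent[node] is not None:
                    previous, taken = parent[node]
                    path.append(taken)
                    node = previous
                return path[::-1]
            queue.append(next_view)
    return None

# Two toy trajectories; assume both missed their assigned targets.
trajectories = [
    [("v0", "forward", "v1"), ("v1", "turn_left", "v2")],
    [("v1", "turn_right", "v3"), ("v3", "forward", "v4")],
]
graph = build_view_graph(trajectories)

# A demonstration that stitches steps from both trajectories together.
demo = shortest_action_path(graph, "v0", "v4")
print(demo)  # ['forward', 'turn_right', 'forward']
```

Each (start, target, action path) triple recovered this way could serve as a supervised example, which is the sense in which a persistent graph yields more than per-episode relabeling: the path from `v0` to `v4` exists in neither trajectory alone.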
2. Methodological Rigor
Benchmark design is thorough: the three tasks systematically probe different aspects of view planning; the success threshold is calibrated against human judgments (Appendix A.3) with a full precision-recall analysis; the unified view distance metric provides principled difficulty stratification. The distractor generation for MCQ tasks uses pixel-uniqueness filters to avoid trivially distinguishable options.
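Calibrating a success threshold against human judgments amounts to sweeping candidate thresholds and checking precision and recall at each one. The sketch below shows that computation under stated assumptions: the distance values and human labels are made-up toy data, the function name is hypothetical, and the paper's actual metric and chosen threshold (Appendix A.3) are not reproduced here.

```python
def precision_recall(distances, human_match, threshold):
    """Treat a view as a predicted match when its distance to the target
    is at most `threshold`, then score against human match labels."""
    predicted = [d <= threshold for d in distances]
    tp = sum(1 for p, h in zip(predicted, human_match) if p and h)
    fp = sum(1 for p, h in zip(predicted, human_match) if p and not h)
    fn = sum(1 for p, h in zip(predicted, human_match) if not p and h)
    # Convention: report 1.0 when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Toy data (illustrative only): view distance and a human "same view?" label.
distances = [0.1, 0.2, 0.3, 0.5, 0.8]
human_match = [True, True, False, True, False]

for threshold in (0.25, 0.5):
    p, r = precision_recall(distances, human_match, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data the tighter threshold gives perfect precision but misses one human-accepted view, while the looser one recovers it at the cost of a false positive, which is the trade-off a full precision-recall analysis makes visible.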
Evaluation is comprehensive: 13 frontier VLMs (7 proprietary, 6 open-weight) are evaluated. Important confounds are addressed: rendering quality (point cloud vs. 3DGS), turn budget effects (10/20/30 turns), evaluation protocol ablations (No-Snap, No-Submit), and sample-level factor analysis (12 geometric/visual factors with Spearman correlations). The finding that higher-fidelity rendering yields only marginal IVP gains, and inconsistent changes on P2V/V2P, is particularly informative.
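For readers unfamiliar with the sample-level factor analysis, Spearman correlation is the Pearson correlation of rank-transformed values, which makes it suitable for relating a continuous factor to a binary success outcome. The following is a self-contained sketch with average ranks for ties; the factor values and outcomes are invented toy data, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data (illustrative only): a per-sample factor and a 0/1 success flag.
view_distance = [1, 2, 3, 4, 5]
success = [1, 1, 0, 1, 0]
rho = spearman(view_distance, success)
print(round(rho, 3))  # -0.577
```

A negative coefficient here would indicate that success becomes less likely as the factor grows; repeating the computation per factor gives the kind of ranking the review refers to.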
Training comparisons are well-controlled: Direct PPO, Direct GRPO with filtering, and Success-Only Bootstrapping all use identical environments and rewards. The ablation showing Random-graph (13.0%) vs. on-policy graph (47.8%) isolates the contribution of model-collected exploration. Iteration ablations (12.0% → 27.9% → 47.8%) demonstrate progressive improvement.
Potential weaknesses: The framework is validated on only two base models (Qwen2.5-VL-7B and Qwen3-VL-8B), with notably different outcomes (47.8% vs. 32.5%), raising questions about backbone sensitivity. The discrete 12-action interface makes the problem tractable but is a simplification that limits ecological validity. Scene-level filtering removes scenes with poor top-down views, which may bias the difficulty distribution.
3. Potential Impact
Immediate applications: The view planning capability is directly relevant to robotic inspection, autonomous surveillance, next-best-view planning in 3D reconstruction, and AR/VR navigation assistance. The framework's ability to bootstrap spatial reasoning from self-exploration without demonstrations is valuable for domains where expert trajectories are unavailable.
Broader influence: The view graph distillation concept — converting any exploration data into structured supervision — could generalize beyond view planning to other sparse-reward agentic settings. The transferability results (Table 5) showing 8–12 point gains on P2V/V2P and ~10 points on external MindCube suggest that interactive view planning builds generalizable spatial priors, which is a significant finding for the spatial reasoning community.
For the VLM community: The diagnostic framework cleanly separates "understanding" from "composition," providing a template for probing other planning capabilities. The planning gap finding itself is impactful — it challenges the assumption that local spatial knowledge automatically supports multi-step spatial reasoning.
4. Timeliness & Relevance
This work arrives at the intersection of two active research fronts: (1) spatial/3D reasoning in VLMs, where benchmarks like MindCube, VSI-Bench, and SpatialRGPT have recently appeared, and (2) agentic RL for LLMs/VLMs, following DeepSeek-R1, VAGEN, and RAGEN. The paper extends both: it moves view reasoning from passive QA to active multi-turn planning, and it addresses the sparse-reward bootstrapping problem that limits RL in agentic settings. The 6-DoF viewpoint control in real scenes fills a genuine gap in the benchmark landscape (Table 1).
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
This is a well-executed paper that identifies an important capability gap, provides rigorous diagnostics, and offers a creative solution with strong empirical results. The view graph distillation framework represents a meaningful methodological contribution to agentic VLM training. The benchmark fills a clear gap and is likely to see adoption. The main limitations — restricted environments and limited backbone diversity — are acknowledged and represent natural extensions rather than fundamental flaws.
Generated May 29, 2026
Comparison History (14)
Paper 2 introduces a novel capability ('view planning') for VLMs with broader implications for embodied AI, robotics, and 3D reasoning. Its self-exploration framework with view graph distillation is methodologically innovative, addressing a fundamental limitation of frontier VLMs. The dramatic improvement (2.5% to 47.8%) surpassing GPT-5.4 Pro and Gemini 3.1 Pro is striking. Paper 1, while technically solid, addresses a narrower safety-steering problem for diffusion models with incremental improvements over existing safety methods. Paper 2's potential to impact embodied AI, navigation, and spatial reasoning gives it broader cross-field impact.
Paper 2 likely has higher impact due to a more novel technical contribution (ViewSuite benchmark + self-exploration with view-graph distillation), stronger methodological depth (multi-model evaluation, clear capability diagnosis, large performance gains), and broader applicability to robotics, embodied AI, AR/VR, and 3D scene understanding. Its timeliness aligns with rapid progress in VLMs and agentic planning. Paper 1 is important for research integrity and conference processes, but is primarily an empirical/diagnostic study with narrower cross-field impact and less generalizable algorithmic innovation.
Paper 2 likely has higher scientific impact due to stronger novelty (explicitly targeting multi-turn view planning and exposing a clear capability gap), broader cross-field relevance (VLMs, embodied AI, robotics, 3D navigation, RL/distillation), and a compelling methodological contribution (ViewSuite benchmark + self-exploration with view-graph distillation to address sparse-reward planning). The reported gains are large and tied to an important frontier problem—agentic, spatially grounded planning—making it timely and widely applicable. Paper 1 is valuable, but more incremental (interactive refinement + LLM metric) and narrower to ASR.
Paper 2 likely has higher scientific impact due to broader real-world applicability and cross-field relevance: it introduces a large, stateful, failure-injected tool benchmark aligned with commercial automation, a central near-term need. Its MCP-based, seed-driven deterministic-yet-diverse design supports rigorous, reproducible evaluation and diagnostic analysis across many agent paradigms, making it a shared testbed for the community. Paper 1 is novel and strong methodologically, but its impact is narrower (3D view planning in ScanNet-like settings) and more specialized to embodied/vision-language planning.
Paper 1 is more novel scientifically: it introduces a new capability target (multi-turn 3D view planning for VLMs), a benchmark (ViewSuite on real ScanNet scenes), diagnoses a systematic compositional planning gap, and proposes a general training framework (self-exploration + view-graph distillation) with large performance gains that likely transfer to embodied/robotic perception, navigation, and active vision. Paper 2 is impactful for systems engineering and near-term deployment costs, but its core contribution is an architectural serving layer plus a caching policy; breadth and long-term scientific generality are narrower than Paper 1’s advances in 3D reasoning and learning.
Paper 2 introduces a significant paradigm shift from 'Memory-as-Tool' to 'Memory-as-Cognition' for conversational agents. By deeply integrating memory access into the reasoning process and proposing a new benchmark for proactive memory, it addresses a fundamental bottleneck in LLM agents. This has broader and more immediate applicability across the rapidly expanding field of autonomous agents compared to the domain-specific 3D view planning for VLMs in Paper 1.
Paper 2 introduces a novel capability ('view planning') for VLMs in 3D spatial reasoning, proposing both a benchmark (ViewSuite) and an innovative self-exploration framework with view graph distillation. It addresses a fundamental limitation of frontier VLMs in compositional spatial planning, with broad implications for embodied AI, robotics, and navigation. The dramatic improvement (2.5% to 47.8%) surpassing GPT and Gemini Pro models is striking. Paper 1, while solid, addresses a more incremental application of VLMs to time-series anomaly detection, a narrower domain with less transformative potential.
Paper 2 addresses a fundamental challenge in Vision-Language Models (embodied planning and spatial reasoning) with an innovative self-exploration framework. Its advancements in foundational AI capabilities offer broad, cross-disciplinary applications in robotics and autonomous systems. In contrast, while Paper 1 presents a highly valuable clinical application, it is a narrower adaptation of existing transformer techniques for a specific medical time-series problem, giving Paper 2 a wider potential scientific impact.
Paper 1 introduces a novel capability (view planning for VLMs), proposes a comprehensive benchmark (ViewSuite), and develops an effective framework (self-exploration with view graph distillation) that dramatically improves performance from 2.5% to 47.8%, surpassing top frontier models. It addresses a concrete, actionable problem in embodied AI with clear real-world applications (robotics, navigation). Paper 2 provides interesting mechanistic analysis of depth utilization in agentic LLMs but is primarily observational/analytical without proposing methods that improve performance, limiting its direct practical impact.
Paper 2 likely has higher impact due to a clearer methodological contribution (self-exploration + view-graph distillation) that substantially improves performance and can generalize to embodied/3D agent settings. It introduces a concrete capability definition (view planning), a 3D benchmark (ViewSuite), and a training framework addressing sparse-reward limitations—useful beyond the specific task (robotics, AR/VR, navigation, interactive perception). Paper 1’s benchmark is valuable for evaluation in web-based planning, but benchmarks alone often have narrower impact unless they become a dominant standard.
Paper 1 addresses the fundamental and increasingly urgent problem of tracing the provenance of AI-generated content through a novel steganographic heredity framework. This has broad societal impact on trust, misinformation, and intellectual property across many domains. Its interdisciplinary approach combining evolutionary biology concepts with information security is highly innovative. Paper 2, while technically strong in advancing VLM spatial reasoning, addresses a narrower problem in embodied AI. Paper 1's relevance to AI governance, content authenticity, and the growing synthetic media ecosystem gives it substantially broader and more timely impact potential.
Paper 1 addresses a critical and timely problem—privacy risks in multi-agent LLM deployments—that has broad implications for AI safety policy and real-world deployment. The finding that social context amplifies privacy violations (from ~20% to ~45%) and that leakage is socially contagious reveals a systematic blind spot in current safety evaluation paradigms. This has immediate relevance to industry practices, regulation, and the rapidly growing agentic AI ecosystem. Paper 2 makes a solid contribution to embodied AI and VLM spatial reasoning, but its scope is narrower and more domain-specific. Paper 1's findings are more likely to influence safety standards across the entire field.
Paper 1 likely has higher impact: it introduces a concrete new capability target (multi-turn 3D view planning), a benchmark (ViewSuite) grounded in real ScanNet scenes, and a general training framework (self-exploration + view-graph distillation) that addresses sparse-reward RL and compositional failures. Its results show a large capability jump and cross-model diagnostics, with implications for embodied AI, robotics, AR/VR, and 3D understanding—broader than multi-hop retrieval agents. Paper 2 is timely and useful, but more domain-specific and closer to incremental agent-training improvements.
Paper 1 addresses a fundamental bottleneck in Embodied AI (3D spatial planning and reasoning) with an innovative self-exploration and view-graph distillation method. Its potential real-world applications in robotics and autonomous navigation give it broader scientific impact compared to Paper 2, which relies heavily on human-engineered, domain-specific expert rules to improve LLM poker performance.