SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects
Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng
Abstract
Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.
AI Impact Assessments
Scientific Impact Assessment: SceneCode
1. Core Contribution
SceneCode reframes indoor scene synthesis as programmatic world generation, where a natural language prompt is compiled into executable Blender Python programs that produce part-decomposed, articulated, simulation-ready 3D assets. The central insight is that code serves as a natural representation for interactable scenes because it can unify geometry, part hierarchy, material assignment, articulation metadata, and simulation packaging in a single editable artifact. This contrasts with prior approaches that either retrieve static meshes from curated libraries (Holodeck, LayoutVLM, HSM) or generate opaque meshes via image-to-3D pipelines (SceneSmith's non-articulated branch).
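To make the "single editable artifact" idea concrete, the following is a minimal sketch of how a part hierarchy and its articulation metadata could live in one small program and be serialized to SDF. It is not SceneCode's code: the `Joint` and `ArticulatedAsset` classes, the `to_sdf` helper, the SDF version string, and the cabinet values are all hypothetical, and geometry, inertia, and collision data are omitted.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Joint:
    """Hypothetical articulation record connecting two named parts."""
    name: str
    parent: str
    child: str
    joint_type: str  # e.g. "revolute" for a hinge, "prismatic" for a drawer
    axis: tuple      # joint axis as (x, y, z)
    lower: float     # lower joint limit
    upper: float     # upper joint limit


@dataclass
class ArticulatedAsset:
    """Hypothetical container: part names plus the joints between them."""
    name: str
    parts: list = field(default_factory=list)
    joints: list = field(default_factory=list)


def to_sdf(asset, sdf_version="1.7"):
    """Serialize links and joints only; a real export would also need
    visual/collision geometry and inertial data for every link."""
    root = ET.Element("sdf", version=sdf_version)
    model = ET.SubElement(root, "model", name=asset.name)
    for part in asset.parts:
        ET.SubElement(model, "link", name=part)
    for j in asset.joints:
        joint = ET.SubElement(model, "joint", name=j.name, type=j.joint_type)
        ET.SubElement(joint, "parent").text = j.parent
        ET.SubElement(joint, "child").text = j.child
        axis = ET.SubElement(joint, "axis")
        ET.SubElement(axis, "xyz").text = " ".join(str(v) for v in j.axis)
        limit = ET.SubElement(axis, "limit")
        ET.SubElement(limit, "lower").text = str(j.lower)
        ET.SubElement(limit, "upper").text = str(j.upper)
    return ET.tostring(root, encoding="unicode")


# Illustrative asset: a cabinet body with one hinged door.
cabinet = ArticulatedAsset(
    name="cabinet",
    parts=["body", "door"],
    joints=[Joint("door_hinge", "body", "door", "revolute", (0, 0, 1), 0.0, 1.57)],
)
sdf_text = to_sdf(cabinet)
print(sdf_text)
```

Editing the asset, for example adding a drawer with a prismatic joint, then means changing the program and re-exporting rather than modifying a mesh by hand.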
The key novelty is the five-strategy routing system paired with an execution-guided repair-and-refine loop: each object request is dispatched to one of five specialized VLM-based code-generation strategies (wall art, static furniture, simple manipulands, structured manipulands, and articulated objects), each with tailored geometric construction priors. This routing addresses a practical problem: a single universal VLM prompt produces unreliable geometry across diverse indoor object categories. The persistent scene-state registry, which links object requests, programs, rendered geometry, and simulation assets, is a useful architectural choice for traceability and local editability.
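The routing and repair steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the strategy names read "simple/structured manipulands" as two strategies, `AssetRequest` is reduced to two fields, `generate` stands in for a VLM call, and plain `exec` stands in for running the program inside Blender.

```python
from dataclasses import dataclass

# Hypothetical strategy identifiers based on the five categories named above.
STRATEGIES = (
    "wall_art",
    "static_furniture",
    "simple_manipuland",
    "structured_manipuland",
    "articulated_object",
)


@dataclass
class AssetRequest:
    """Hypothetical, reduced stand-in for a per-object request."""
    name: str
    category: str


def route(request):
    """Pick a code-generation strategy; here the category is the strategy."""
    if request.category not in STRATEGIES:
        raise ValueError(f"no strategy for category {request.category!r}")
    return request.category


def repair_loop(generate, request, max_attempts=3):
    """Generate a program, run it, and feed any error back to the generator.

    `generate(request, strategy, error)` is an assumed callable that returns
    program source as a string; `error` is None on the first attempt.
    Returns the first source that executes cleanly and the attempt count.
    """
    strategy = route(request)
    error = None
    for attempt in range(1, max_attempts + 1):
        source = generate(request, strategy, error)
        try:
            # Stand-in for Blender execution. Untrusted generated code
            # should run in an isolated process, not in-process like this.
            exec(compile(source, "<asset_program>", "exec"), {})
            return source, attempt
        except Exception as exc:
            error = f"{type(exc).__name__}: {exc}"
    raise RuntimeError(f"{request.name}: still failing after {max_attempts} attempts ({error})")


def stub_generate(request, strategy, error):
    """Toy generator: emits a broken program first, a fixed one after feedback."""
    if error is None:
        return "parts = undefined_name"
    return "parts = ['body', 'door']"


source, attempts = repair_loop(stub_generate, AssetRequest("cabinet", "articulated_object"))
print(attempts, source)
```

This sketch covers only the repair half, where a program that fails to execute is regenerated with the error message. The refine half, which per the description improves programs that already run, would need an additional quality check on the rendered result.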
2. Methodological Rigor
The evaluation is multi-faceted but has notable gaps. The paper evaluates on 30 prompts from SceneEval-100 across six room categories, measuring both scene-level metrics (10 SceneEval metrics) and object-level metrics (6 mesh/material metrics), supplemented by a user study and MuJoCo demonstrations.
Strengths in evaluation design:
Weaknesses:
3. Potential Impact
The paper addresses a genuine bottleneck in embodied AI: the dependence on curated articulated asset libraries limits the diversity and customizability of simulation environments. SceneCode's ability to generate novel articulated objects on demand (e.g., a cabinet with a specific number of drawers and glass doors) is valuable for:
However, the practical impact is tempered by the high computational cost and the acknowledged visual quality gap relative to retrieval/image-to-3D approaches. The primitive-based construction produces clean geometry but may lack the visual fidelity needed for visual policy training.
4. Timeliness & Relevance
This work is timely on multiple fronts:
The formulation of scenes as executable programs aligns with the broader trend of "code as representation" in AI (Code as Policies, ProgPrompt), extending it to 3D world generation.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Additional Observations
The extensive appendix (prompt designs, code listings, cost statistics) supports reproducibility, though the reliance on proprietary VLMs (likely GPT-4/Gemini) introduces a dependency on commercial APIs. The paper would benefit from testing with open-source VLMs. The claim of "executable world programs" is conceptually compelling, but the execution is closer to a well-engineered pipeline than to a fundamentally new algorithmic contribution.
Generated May 20, 2026
Comparison History (24)
Paper 2 addresses a critical gap in AI safety evaluation—LLM alignment in conflict contexts—that has immediate real-world consequences for journalism, humanitarian work, and public discourse in fragile societies. It proposes the first evaluation framework for this domain, which could influence alignment benchmarking standards across the industry. While Paper 1 (SceneCode) is technically impressive for embodied AI and scene synthesis, Paper 2's breadth of societal impact, timeliness given rapid global LLM deployment, and cross-disciplinary relevance (AI safety, conflict studies, policy) give it higher potential scientific and societal impact.
Paper 1 addresses a critical and highly active area of research: LLM safety, alignment, and jailbreaking. By providing a principled understanding of refusal suppression in latent space and demonstrating state-of-the-art attack success across many models, it exposes fundamental vulnerabilities in current AI systems. While Paper 2 offers a valuable framework for embodied AI simulation, the broader implications, urgency, and cross-disciplinary relevance of securing foundational models give Paper 1 a higher potential for immediate and widespread scientific impact.
GRAM introduces a fundamental new framework for neural reasoning by making recursive reasoning models probabilistic, enabling multi-trajectory computation with theoretical grounding in variational inference. This addresses core questions about how neural systems should implement extended computation, with broad applicability across reasoning, generation, and constraint satisfaction. While SceneCode is a strong engineering contribution for embodied AI scene synthesis, GRAM's conceptual innovation in combining recursive latent reasoning with generative modeling has broader potential impact across multiple fields of AI research, offering a new paradigm for inference-time scaling and probabilistic reasoning.
GRAM introduces a fundamental architectural innovation for neural reasoning—turning deterministic recursive reasoning into probabilistic multi-trajectory computation. This addresses a core challenge in AI (how to implement extended computation in neural systems) with broad theoretical and practical implications across reasoning, generation, and inference-time scaling. SceneCode is a strong engineering contribution for indoor scene synthesis but is more application-specific. GRAM's framework-level contribution to reasoning architectures has wider potential impact across multiple fields and research directions.
While Paper 1 offers an innovative approach to scene synthesis critical for embodied AI and robotics, Paper 2 targets the automation of scientific discovery itself. By developing an iterative, multi-agent autonomous research system with self-healing execution and human-AI collaboration, Paper 2 has the potential for a massive multiplier effect across all computational disciplines, offering significantly broader scientific impact.
SceneCode introduces a novel framework bridging indoor scene synthesis with executable, editable programmatic representations—addressing a concrete gap in embodied AI, robotics, and simulation. Its contributions span multiple fields (scene generation, articulated object synthesis, robot interaction) and offer immediately usable outputs (simulation-ready assets). Paper 1 provides useful empirical insights for multi-model LLM scheduling but is primarily a benchmarking/profiling study offering guidelines rather than a new method or system, limiting its transformative impact.
Paper 2 likely has higher impact due to stronger real-world applicability and broader cross-field relevance: it enables editable, executable scene generation with articulated assets directly usable in simulators (SDF), benefiting robotics, embodied AI, graphics, and simulation. The programmatic representation and execution-guided repair loop are novel and practically enabling, with clear downstream evaluation in robot interaction. Paper 1 is timely and methodologically interesting for trustworthy claim verification, but its impact is more domain-specific (NLP/argumentation) and may face adoption friction due to dataset/task specificity and dependence on argument generation quality.
Paper 1 presents a concrete, executable framework with rigorous evaluation in simulation and downstream robotics. It addresses a critical bottleneck in embodied AI (dynamic, articulated scene generation). Paper 2 is a vision paper offering a conceptual framework for agent trust; while highly relevant, Paper 1's tangible implementation, novel programmatic approach, and immediate applicability to physical simulation and robotics yield a higher potential for measurable scientific impact.
Paper 2 introduces a fundamental theoretical framework (Generalized Turing Test) for comparing intelligence across arbitrary agents, which has broader cross-disciplinary impact spanning AI theory, cognitive science, and evaluation methodology. Its dataset- and task-agnostic nature addresses a foundational problem in AI—how to compare intelligent systems without relying on specific benchmarks. This could reshape how the field thinks about intelligence evaluation and training objectives. Paper 1, while technically solid and practically useful for embodied AI and robotics, represents an incremental engineering advance in scene synthesis with narrower applicability.
Paper 1 addresses a critical bottleneck in embodied AI and robotics by enabling the generation of articulated, interactive scenes through executable code. Its novel approach yields strong, actionable results. In contrast, Paper 2 explores an important topic in autonomous vehicles but finds no statistically significant quantitative improvements, limiting its immediate practical impact despite its value as an empirical benchmark.
Paper 2 (GRAM) introduces a broadly applicable probabilistic extension to recursive reasoning models, enabling multi-trajectory latent computation, hypothesis diversity, and inference-time scaling—ideas likely to transfer across many ML domains (reasoning, planning, constraint solving, and generative modeling). Its methodological framing (latent-variable model with variational inference) is general and aligns with current interest in scalable test-time compute and structured reasoning. Paper 1 is innovative and impactful for embodied AI/robotics simulation, but its impact is more domain-specific (indoor scene/asset generation toolchain) and depends on ecosystem adoption.
Paper 1 is more methodologically and technically innovative: it reframes indoor scene synthesis as executable program generation with execution-guided repair, yielding editable, articulated, simulation-ready assets—directly valuable for robotics, embodied AI, simulation, and graphics. This offers clear real-world applicability and broader cross-field impact than Paper 2, which primarily contributes a benchmark (important but typically narrower impact) for evaluating LLM-to-knowledge-graph mapping. Paper 1’s integration of planning, code synthesis, validation, and downstream robot interaction evaluation suggests higher potential for enabling new capabilities and follow-on research.
Paper 1 introduces a more novel paradigm—compiling natural-language prompts into executable world programs with validated, editable, articulated assets—addressing a key bottleneck in simulation and robotics (controllability, interactability, provenance). Its applications span embodied AI, robot learning, simulation, and 3D content creation, giving broad cross-field impact and strong timeliness amid agentic/programmatic generation trends. The methodology appears more substantial (planning/designer/critic, multiple code-gen strategies, execution-guided repair, export to SDF, downstream robot evaluation). Paper 2 is useful and timely, but is a more incremental fusion improving AUC on a specific dataset.
Paper 2 likely has higher impact due to a more novel representation shift (from static meshes to executable, editable world programs) with clear downstream utility for robotics, embodied AI, simulation, and content creation. Its approach enables on-demand articulated asset generation and traceable scene editing, broadening applicability across multiple fields and practical pipelines (Blender→SDF/physics). Paper 1 is timely and methodologically solid for RL post-training, but its contribution is a relatively incremental weighting scheme within rubric-based RLVR, with narrower cross-domain impact.
Paper 1 reveals a counterintuitive and broadly important finding: embodied LLM agents perform better with noisier observations, challenging fundamental assumptions about perception quality in AI systems. This insight—that measured performance reflects interactions between perceptual errors and reasoning failures—has deep implications for how the entire community evaluates and designs LLM-based agents. Paper 2, while technically solid, is more incremental in its contribution to scene synthesis pipelines. Paper 1's finding is more surprising, generalizable across domains, and likely to influence evaluation methodology and system design philosophy broadly.
SceneCode introduces a novel framework for programmatic indoor scene synthesis with articulated objects, addressing a clear gap in embodied AI and robotics simulation. It has broader impact across multiple fields (embodied AI, robotics, simulation, computer graphics), offers a constructive contribution with practical applications, and presents a comprehensive evaluation including human judgment and downstream robot interaction. Paper 1, while interesting, presents a negative result with limited statistical significance (p=0.71) based on reanalysis of existing data from a narrow domain (offensive cybersecurity CTF), offering a hypothesis rather than a validated mechanism.
Paper 1 addresses the fundamental privacy-cost-capability tension in LLM agents with a novel constant-context skill learning framework that moves procedural knowledge from prompts into model weights. It demonstrates strong results across three diverse benchmarks with multiple model backbones, showing 2-7x token reduction while matching or exceeding state-of-the-art. The approach has broad applicability to any recurring agent workflow and addresses timely concerns about privacy and efficiency. Paper 2, while solid, is more narrowly focused on indoor scene synthesis with articulated objects, serving a more specialized community in embodied AI and robotics simulation.
SceneCode presents a novel, complete framework for generating executable, physically interactable indoor scenes from natural language—bridging scene synthesis, code generation, articulated object creation, and robot simulation. Its contributions span embodied AI, robotics, and computer graphics with concrete technical innovations (programmatic world generation, execution-guided repair loops, SDF export). Paper 2 is a systematic survey/audit of LLM trading agents that identifies reproducibility gaps but offers no new methods or systems. While valuable, surveys typically have less transformative impact than novel frameworks enabling new capabilities across multiple fields.
Paper 1 addresses a critical and timely security vulnerability in multimodal AI agents—hallucination-driven unauthorized actions—which has broad implications for AI safety, security, and deployment trust. It introduces a novel formal framework (hallucination-to-action conversion) and a principled architectural solution (evidence-carrying agents) with rigorous adversarial evaluation. As autonomous AI agents become widely deployed, this work addresses a fundamental authorization problem with cross-domain relevance. Paper 2, while solid, is more incremental in the scene synthesis space and has narrower impact primarily within embodied AI and simulation communities.
SceneCode addresses a fundamental gap in embodied AI and robotics by enabling programmatic generation of physically interactable indoor scenes with articulated objects from natural language, bridging scene synthesis, simulation, and robot interaction. Its breadth of impact spans embodied AI, robotics, computer graphics, and simulation. Paper 2 identifies an important but narrower problem (library drift in LLM skill libraries) with a focused fix validated on a single coding benchmark. While rigorous, its scope is more limited to the self-evolving agent community, whereas SceneCode's multi-domain applicability and novel formulation suggest broader and more lasting impact.