SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng

May 19, 2026

arXiv:2605.19587v1

cs.AI (primary)

#952 of 2292 · Artificial Intelligence

Tournament Score: 1432 ± 44 (scale 1050 to 1800)

Win Rate: 58%

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SceneCode

1. Core Contribution

SceneCode reframes indoor scene synthesis as programmatic world generation, where a natural language prompt is compiled into executable Blender Python programs that produce part-decomposed, articulated, simulation-ready 3D assets. The central insight is that code serves as a natural representation for interactable scenes because it can unify geometry, part hierarchy, material assignment, articulation metadata, and simulation packaging in a single editable artifact. This contrasts with prior approaches that either retrieve static meshes from curated libraries (Holodeck, LayoutVLM, HSM) or generate opaque meshes via image-to-3D pipelines (SceneSmith's non-articulated branch).
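To make "articulation metadata" concrete, the following is a minimal sketch of a simulator-loadable description for a single hinged cabinet door, written as SDF using only Python's standard library. The model, link, and joint names and the joint limits are hypothetical, and the function is an illustration, not the exporter described in the paper.

```python
import xml.etree.ElementTree as ET


def door_sdf(model_name: str, lower: float, upper: float) -> str:
    """Build an SDF string with a body link, a door link, and one revolute joint."""
    sdf = ET.Element("sdf", version="1.7")
    model = ET.SubElement(sdf, "model", name=model_name)
    ET.SubElement(model, "link", name="body")
    ET.SubElement(model, "link", name="door")

    # The joint ties the door to the body and records its motion range.
    joint = ET.SubElement(model, "joint", name="door_hinge", type="revolute")
    ET.SubElement(joint, "parent").text = "body"
    ET.SubElement(joint, "child").text = "door"
    axis = ET.SubElement(joint, "axis")
    ET.SubElement(axis, "xyz").text = "0 0 1"  # hinge rotates about the z axis
    limit = ET.SubElement(axis, "limit")
    ET.SubElement(limit, "lower").text = str(lower)  # radians
    ET.SubElement(limit, "upper").text = str(upper)  # radians
    return ET.tostring(sdf, encoding="unicode")


# Hypothetical values: a door that opens about 90 degrees.
xml_text = door_sdf("cabinet", 0.0, 1.57)
```

A real asset would also carry visual and collision geometry, inertial properties, and poses for each link; they are omitted here to keep the joint structure visible.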

The key novelty is the five-strategy routing system paired with an execution-guided repair-and-refine loop: each object request is dispatched to one of five specialized VLM-based code-generation strategies (wall art, static furniture, simple manipulands, structured manipulands, and articulated objects), each with tailored geometric construction priors. This routing addresses the practical reality that a single universal prompt to a VLM produces unreliable geometry across diverse indoor object categories. The persistent scene-state registry that links object requests, programs, rendered geometry, and simulation assets is a useful architectural choice for traceability and local editability.
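The routing and repair steps described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the request fields (`joints`, `mounted_on_wall`, `graspable`, `parts`), the rule-based `route` function, and the `fake_generate` stub are all hypothetical, and Python's `exec` stands in for running a generated script inside Blender.

```python
from typing import Callable


def route(request: dict) -> str:
    """Pick one of five strategies from hypothetical request fields."""
    if request.get("joints"):
        return "articulated_object"
    if request.get("mounted_on_wall"):
        return "wall_art"
    if request.get("graspable"):
        multi_part = request.get("parts", 1) > 1
        return "structured_manipuland" if multi_part else "simple_manipuland"
    return "static_furniture"


def repair_loop(generate: Callable[[str, str], str], strategy: str,
                max_attempts: int = 3):
    """Run generated code; feed any error message back into the next attempt."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        program = generate(strategy, feedback)
        try:
            scope: dict = {}
            exec(program, scope)  # stand-in for executing the script in Blender
            return program, attempt
        except Exception as err:
            feedback = f"{type(err).__name__}: {err}"
    raise RuntimeError(f"no valid program after {max_attempts} attempts: {feedback}")


def fake_generate(strategy: str, feedback: str) -> str:
    # Stub standing in for a VLM call: the first draft references an
    # undefined name, and the draft produced after feedback fixes it.
    return "parts = ['body', 'door']" if feedback else "parts = [body, 'door']"


strategy = route({"name": "cabinet", "joints": ["door_hinge"]})
program, attempts = repair_loop(fake_generate, strategy)
```

The point of the design is that execution errors become input to the next generation attempt instead of ending the run, while the route determines which construction priors the generator sees.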

2. Methodological Rigor

The evaluation is multi-faceted but has notable gaps. The paper evaluates on 30 prompts from SceneEval-100 across six room categories, measuring both scene-level metrics (10 SceneEval metrics) and object-level metrics (6 mesh/material metrics), supplemented by a user study and MuJoCo demonstrations.

Strengths in evaluation design:

The paired-difference user study design (Δ = SceneCode − Baseline within each trial) is statistically appropriate, controlling for rater identity and prompt sampling.

Object-level comparison against SAM 3D Objects on the same requests with the same reference images is fair.

The 95% confidence intervals are reported throughout.
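For readers unfamiliar with the paired-difference design, here is a minimal sketch of how a mean difference and a t-based 95% confidence interval can be computed from per-trial ratings. The ratings are made up for illustration and are not data from the paper, and the paper may compute its intervals differently (for example by bootstrapping).

```python
import math
import statistics


def paired_ci(ours, baseline, t_crit):
    """Mean paired difference and a t-based confidence interval."""
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean = statistics.mean(diffs)
    sem = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean
    return mean, (mean - t_crit * sem, mean + t_crit * sem)


# Made-up ratings for six trials; each position is the same rater and prompt.
ours = [4, 5, 3, 4, 4, 5]
base = [3, 5, 3, 3, 4, 4]

# Two-sided 95% critical value of Student's t with 5 degrees of freedom.
T_CRIT_DF5 = 2.571

mean, (lo, hi) = paired_ci(ours, base, T_CRIT_DF5)
```

Differencing within each trial removes rater-level and prompt-level offsets, so the interval reflects only the variability of the per-trial gap. With few trials the interval is wide and can include zero even when the mean difference is positive.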

Weaknesses:

The user study involves only 9 participants split into groups of 3, which is statistically underpowered for drawing robust conclusions. The confidence intervals on ΔPF reflect this (e.g., [-0.05, 0.34] for SceneSmith).

The robot interaction evaluation (Section 4.4) is purely qualitative—three MuJoCo demonstrations with no quantitative metrics on joint accuracy, motion range correctness, or task success rates. This is a significant gap given that simulation-ready articulation is a headline claim.

No ablation study on the routing mechanism, ObjectPlan verification, or the execution-guided validation loop. The appendix shows one qualitative comparison (Figure 6) of generic vs. route-specific prompts, but no systematic quantitative ablation.

SceneCode does not lead on several important metrics (OOR, OAR, SUP, ACC, OPC), and the paper's explanations for these gaps—while reasonable—are not experimentally validated.

The computational cost is substantial: average ~$21.73 per scene and ~7.5 hours wall-clock time, with a maximum of ~17 hours. This raises scalability concerns that are acknowledged but not addressed.
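As a rough illustration of the scalability concern, the reported per-scene averages can be extrapolated to a dataset. The sketch below assumes cost and time scale linearly and that scenes can be generated fully in parallel; both are simplifying assumptions, and the dataset size and worker count are hypothetical.

```python
import math

COST_PER_SCENE_USD = 21.73  # average cost per scene reported in the review
HOURS_PER_SCENE = 7.5       # average wall-clock time per scene reported in the review


def dataset_budget(num_scenes: int, parallel_workers: int = 1):
    """Return (total USD, wall-clock hours), assuming perfectly parallel workers."""
    total_usd = num_scenes * COST_PER_SCENE_USD
    batches = math.ceil(num_scenes / parallel_workers)
    wall_hours = batches * HOURS_PER_SCENE
    return total_usd, wall_hours


# Hypothetical target: 1000 scenes generated by 50 parallel workers.
usd, hours = dataset_budget(1000, parallel_workers=50)
```

Under these assumptions, parallelism shortens the wall-clock time but leaves the API spend unchanged, which is why the per-scene cost is the harder constraint.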

3. Potential Impact

The paper addresses a genuine bottleneck in embodied AI: the dependence on curated articulated asset libraries limits the diversity and customizability of simulation environments. SceneCode's ability to generate novel articulated objects on demand (e.g., a cabinet with a specific number of drawers and glass doors) is valuable for:

Robotic manipulation research: Generating diverse training environments with interactable objects without manual asset creation.

Embodied AI benchmarking: Producing controllable scene variations for systematic evaluation.

Sim-to-real transfer: Creating simulation-ready scenes with physically meaningful articulation.

However, the practical impact is tempered by the high computational cost and the acknowledged visual quality gap relative to retrieval/image-to-3D approaches. The primitive-based construction produces clean geometry but may lack the visual fidelity needed for visual policy training.

4. Timeliness & Relevance

This work is timely on multiple fronts:

The rise of VLM-based code generation (GPT-4, Gemini) makes programmatic 3D asset synthesis increasingly feasible.

The embodied AI community's growing need for scalable, diverse simulation environments creates clear demand.

The shift toward simulation-ready scene generation (SceneSmith, HSM, ProcTHOR) positions this work within an active research trajectory.

The formulation of scenes as executable programs aligns with the broader trend of "code as representation" in AI (Code as Policies, ProgPrompt), extending it to 3D world generation.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated formulation: scenes as executable programs with intrinsic articulation.

Comprehensive pipeline design with thoughtful engineering (five routing strategies, plan verification, execution-guided repair).

Clean separation of concerns between room-level planning and object-level code generation.

Demonstrates clear advantages in mesh quality metrics (zero non-manifold edges, ~4.4× fewer UV islands) and semantic fidelity (CNT, ATR).

The persistent scene-state registry enables local editability, which is practically valuable.
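One way to picture a registry that supports local edits is a keyed store in which each object carries its request, its program, and its derived artifacts. The sketch below is hypothetical: the class names, fields, and file paths are illustrative assumptions, not the paper's data model.

```python
from dataclasses import dataclass, field


@dataclass
class ObjectRecord:
    request: str         # natural-language asset request
    program: str         # source of the generating script
    mesh_path: str = ""  # rendered geometry, empty until built
    sdf_path: str = ""   # simulation asset, empty until exported
    version: int = 1


@dataclass
class SceneRegistry:
    records: dict = field(default_factory=dict)

    def register(self, object_id: str, request: str, program: str) -> None:
        self.records[object_id] = ObjectRecord(request, program)

    def edit(self, object_id: str, new_program: str) -> ObjectRecord:
        """Replace one object's program and mark its derived artifacts stale."""
        rec = self.records[object_id]
        rec.program = new_program
        rec.mesh_path = ""
        rec.sdf_path = ""
        rec.version += 1
        return rec


registry = SceneRegistry()
registry.register("cabinet", "oak cabinet with two drawers", "make_cabinet(drawers=2)")
registry.register("lamp", "floor lamp", "make_lamp()")
registry.records["lamp"].sdf_path = "lamp.sdf"

# Editing the cabinet leaves the lamp's record and artifacts untouched.
registry.edit("cabinet", "make_cabinet(drawers=3)")
```

Because each record is keyed by object, only the edited object needs to be rebuilt and re-exported, which is the sense in which edits are local.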

Notable Weaknesses:

Underpowered user study (n=3 per group).

No quantitative evaluation of articulation quality (joint accuracy, motion range correctness).

Missing ablation studies for key design decisions.

High computational cost (~$22/scene, ~7.5 hours average) limits practical scalability.

Visual realism trails SceneSmith (Δ_Realism = -0.34), acknowledged but not addressed.

The 30-prompt evaluation set, while spanning six categories, is relatively small for establishing generalization.

Additional Observations

The extensive appendix (prompt designs, code listings, cost statistics) supports reproducibility, though the reliance on proprietary VLMs (likely GPT-4/Gemini) introduces a dependency on commercial APIs. The paper would benefit from testing with open-source VLMs. The claim of "executable world programs" is compelling conceptually but the execution is more of a well-engineered pipeline than a fundamentally new algorithmic contribution.

Rating: 6.2 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7

Generated May 20, 2026

Comparison History (24)

vs. Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a critical gap in AI safety evaluation—LLM alignment in conflict contexts—that has immediate real-world consequences for journalism, humanitarian work, and public discourse in fragile societies. It proposes the first evaluation framework for this domain, which could influence alignment benchmarking standards across the industry. While Paper 1 (SceneCode) is technically impressive for embodied AI and scene synthesis, Paper 2's breadth of societal impact, timeliness given rapid global LLM deployment, and cross-disciplinary relevance (AI safety, conflict studies, policy) give it higher potential scientific and societal impact.

vs. Latent-space Attacks for Refusal Evasion in Language Models

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical and highly active area of research: LLM safety, alignment, and jailbreaking. By providing a principled understanding of refusal suppression in latent space and demonstrating state-of-the-art attack success across many models, it exposes fundamental vulnerabilities in current AI systems. While Paper 2 offers a valuable framework for embodied AI simulation, the broader implications, urgency, and cross-disciplinary relevance of securing foundational models give Paper 1 a higher potential for immediate and widespread scientific impact.

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamental new framework for neural reasoning by making recursive reasoning models probabilistic, enabling multi-trajectory computation with theoretical grounding in variational inference. This addresses core questions about how neural systems should implement extended computation, with broad applicability across reasoning, generation, and constraint satisfaction. While SceneCode is a strong engineering contribution for embodied AI scene synthesis, GRAM's conceptual innovation in combining recursive latent reasoning with generative modeling has broader potential impact across multiple fields of AI research, offering a new paradigm for inference-time scaling and probabilistic reasoning.

vs. Generative Recursive Reasoning

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamental architectural innovation for neural reasoning—turning deterministic recursive reasoning into probabilistic multi-trajectory computation. This addresses a core challenge in AI (how to implement extended computation in neural systems) with broad theoretical and practical implications across reasoning, generation, and inference-time scaling. SceneCode is a strong engineering contribution for indoor scene synthesis but is more application-specific. GRAM's framework-level contribution to reasoning architectures has wider potential impact across multiple fields and research directions.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

gemini-3.1 · 5/20/2026

While Paper 1 offers an innovative approach to scene synthesis critical for embodied AI and robotics, Paper 2 targets the automation of scientific discovery itself. By developing an iterative, multi-agent autonomous research system with self-healing execution and human-AI collaboration, Paper 2 has the potential for a massive multiplier effect across all computational disciplines, offering significantly broader scientific impact.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

claude-opus-4.6 · 5/20/2026

SceneCode introduces a novel framework bridging indoor scene synthesis with executable, editable programmatic representations—addressing a concrete gap in embodied AI, robotics, and simulation. Its contributions span multiple fields (scene generation, articulated object synthesis, robot interaction) and offer immediately usable outputs (simulation-ready assets). Paper 1 provides useful empirical insights for multi-model LLM scheduling but is primarily a benchmarking/profiling study offering guidelines rather than a new method or system, limiting its transformative impact.

vs. Neurosymbolic Learning for Inference-Time Argumentation

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact due to stronger real-world applicability and broader cross-field relevance: it enables editable, executable scene generation with articulated assets directly usable in simulators (SDF), benefiting robotics, embodied AI, graphics, and simulation. The programmatic representation and execution-guided repair loop are novel and practically enabling, with clear downstream evaluation in robot interaction. Paper 1 is timely and methodologically interesting for trustworthy claim verification, but its impact is more domain-specific (NLP/argumentation) and may face adoption friction due to dataset/task specificity and dependence on argument generation quality.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

gemini-3.1 · 5/20/2026

Paper 1 presents a concrete, executable framework with rigorous evaluation in simulation and downstream robotics. It addresses a critical bottleneck in embodied AI (dynamic, articulated scene generation). Paper 2 is a vision paper offering a conceptual framework for agent trust; while highly relevant, Paper 1's tangible implementation, novel programmatic approach, and immediate applicability to physical simulation and robotics yield a higher potential for measurable scientific impact.

vs. The Generalized Turing Test: A Foundation for Comparing Intelligence

claude-opus-4.6 · 5/20/2026

Paper 2 introduces a fundamental theoretical framework (Generalized Turing Test) for comparing intelligence across arbitrary agents, which has broader cross-disciplinary impact spanning AI theory, cognitive science, and evaluation methodology. Its dataset- and task-agnostic nature addresses a foundational problem in AI—how to compare intelligent systems without relying on specific benchmarks. This could reshape how the field thinks about intelligence evaluation and training objectives. Paper 1, while technically solid and practically useful for embodied AI and robotics, represents an incremental engineering advance in scene synthesis with narrower applicability.

vs. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical bottleneck in embodied AI and robotics by enabling the generation of articulated, interactive scenes through executable code. Its novel approach yields strong, actionable results. In contrast, Paper 2 explores an important topic in autonomous vehicles but finds no statistically significant quantitative improvements, limiting its immediate practical impact despite its value as an empirical benchmark.

vs. Generative Recursive Reasoning

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) introduces a broadly applicable probabilistic extension to recursive reasoning models, enabling multi-trajectory latent computation, hypothesis diversity, and inference-time scaling—ideas likely to transfer across many ML domains (reasoning, planning, constraint solving, and generative modeling). Its methodological framing (latent-variable model with variational inference) is general and aligns with current interest in scalable test-time compute and structured reasoning. Paper 1 is innovative and impactful for embodied AI/robotics simulation, but its impact is more domain-specific (indoor scene/asset generation toolchain) and depends on ecosystem adoption.

vs. BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

gpt-5.2 · 5/20/2026

Paper 1 is more methodologically and technically innovative: it reframes indoor scene synthesis as executable program generation with execution-guided repair, yielding editable, articulated, simulation-ready assets—directly valuable for robotics, embodied AI, simulation, and graphics. This offers clear real-world applicability and broader cross-field impact than Paper 2, which primarily contributes a benchmark (important but typically narrower impact) for evaluating LLM-to-knowledge-graph mapping. Paper 1’s integration of planning, code synthesis, validation, and downstream robot interaction evaluation suggests higher potential for enabling new capabilities and follow-on research.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

gpt-5.2 · 5/20/2026

Paper 1 introduces a more novel paradigm—compiling natural-language prompts into executable world programs with validated, editable, articulated assets—addressing a key bottleneck in simulation and robotics (controllability, interactability, provenance). Its applications span embodied AI, robot learning, simulation, and 3D content creation, giving broad cross-field impact and strong timeliness amid agentic/programmatic generation trends. The methodology appears more substantial (planning/designer/critic, multiple code-gen strategies, execution-guided repair, export to SDF, downstream robot evaluation). Paper 2 is useful and timely, but is a more incremental fusion improving AUC on a specific dataset.

vs. Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact due to a more novel representation shift (from static meshes to executable, editable world programs) with clear downstream utility for robotics, embodied AI, simulation, and content creation. Its approach enables on-demand articulated asset generation and traceable scene editing, broadening applicability across multiple fields and practical pipelines (Blender→SDF/physics). Paper 1 is timely and methodologically solid for RL post-training, but its contribution is a relatively incremental weighting scheme within rubric-based RLVR, with narrower cross-domain impact.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

claude-opus-4.6 · 5/20/2026

Paper 1 reveals a counterintuitive and broadly important finding: embodied LLM agents perform better with noisier observations, challenging fundamental assumptions about perception quality in AI systems. This insight—that measured performance reflects interactions between perceptual errors and reasoning failures—has deep implications for how the entire community evaluates and designs LLM-based agents. Paper 2, while technically solid, is more incremental in its contribution to scene synthesis pipelines. Paper 1's finding is more surprising, generalizable across domains, and likely to influence evaluation methodology and system design philosophy broadly.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

claude-opus-4.6 · 5/20/2026

SceneCode introduces a novel framework for programmatic indoor scene synthesis with articulated objects, addressing a clear gap in embodied AI and robotics simulation. It has broader impact across multiple fields (embodied AI, robotics, simulation, computer graphics), offers a constructive contribution with practical applications, and presents a comprehensive evaluation including human judgment and downstream robot interaction. Paper 1, while interesting, presents a negative result with limited statistical significance (p=0.71) based on reanalysis of existing data from a narrow domain (offensive cybersecurity CTF), offering a hypothesis rather than a validated mechanism.

vs. From History to State: Constant-Context Skill Learning for LLM Agents

claude-opus-4.6 · 5/20/2026

Paper 1 addresses the fundamental privacy-cost-capability tension in LLM agents with a novel constant-context skill learning framework that moves procedural knowledge from prompts into model weights. It demonstrates strong results across three diverse benchmarks with multiple model backbones, showing 2-7x token reduction while matching or exceeding state-of-the-art. The approach has broad applicability to any recurring agent workflow and addresses timely concerns about privacy and efficiency. Paper 2, while solid, is more narrowly focused on indoor scene synthesis with articulated objects, serving a more specialized community in embodied AI and robotics simulation.

vs. Agentic Trading: When LLM Agents Meet Financial Markets

claude-opus-4.6 · 5/20/2026

SceneCode presents a novel, complete framework for generating executable, physically interactable indoor scenes from natural language—bridging scene synthesis, code generation, articulated object creation, and robot simulation. Its contributions span embodied AI, robotics, and computer graphics with concrete technical innovations (programmatic world generation, execution-guided repair loops, SDF export). Paper 2 is a systematic survey/audit of LLM trading agents that identifies reproducibility gaps but offers no new methods or systems. While valuable, surveys typically have less transformative impact than novel frameworks enabling new capabilities across multiple fields.

vs. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a critical and timely security vulnerability in multimodal AI agents—hallucination-driven unauthorized actions—which has broad implications for AI safety, security, and deployment trust. It introduces a novel formal framework (hallucination-to-action conversion) and a principled architectural solution (evidence-carrying agents) with rigorous adversarial evaluation. As autonomous AI agents become widely deployed, this work addresses a fundamental authorization problem with cross-domain relevance. Paper 2, while solid, is more incremental in the scene synthesis space and has narrower impact primarily within embodied AI and simulation communities.

vs. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

claude-opus-4.6 · 5/20/2026

SceneCode addresses a fundamental gap in embodied AI and robotics by enabling programmatic generation of physically interactable indoor scenes with articulated objects from natural language, bridging scene synthesis, simulation, and robot interaction. Its breadth of impact spans embodied AI, robotics, computer graphics, and simulation. Paper 2 identifies an important but narrower problem (library drift in LLM skill libraries) with a focused fix validated on a single coding benchmark. While rigorous, its scope is more limited to the self-evolving agent community, whereas SceneCode's multi-domain applicability and novel formulation suggest broader and more lasting impact.