PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

Qiran Zhang, Yuheng Wang, Runde Yang, Lin Wu, Jingru Fan, Shu Yao, Jie Zhang, Tianle Zhou

May 19, 2026

arXiv:2605.19382v1

cs.AI (primary)

#1337 of 2292 · Artificial Intelligence

Tournament Score: 1393 ± 42 (scale: 1050 to 1800)

Win Rate: 42%

Rating: 6.5 / 10

Significance: 6.5

Rigor: 7

Novelty: 5.5

Clarity: 7

Abstract

Programmatic video generation through code offers geometric precision and temporal coherence beyond pixel-level diffusion models, yet rigorously evaluating whether language models can produce spatially correct animated outputs remains an open problem. We introduce PRISM, a large-scale benchmark of 10,372 human-calibrated instruction-code pairs (20 times larger than prior programmatic video generation benchmarks), grounded in real-world knowledge visualization scenarios across English and Chinese and spanning 437 subject categories. We further propose a funnel-style evaluation framework with four complementary metrics: Code-Level Reliability for executability, Spatial Reasoning for layout correctness over full animation sequences, and Prompt-Aware Dynamic Visual Complexity (PADVC) and Temporal Density (TD) for diagnosing dynamic expression and temporal activity. Systematic evaluation of seven mainstream LLMs reveals a striking Execution-Spatial Gap: the average drop from execution success rate to spatial pass rate is approximately 41%, showing that runnable code does not necessarily yield spatially coherent visual output. These findings show that programmatic video generation evaluation should go beyond executability. PRISM provides a principled benchmark for advancing spatially coherent code generation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

1. Core Contribution

PRISM introduces a large-scale benchmark of 10,372 human-calibrated instruction-code pairs for evaluating LLMs on programmatic video generation via Manim, a Python-based mathematical animation framework. The benchmark is bilingual (English/Chinese), spans 437 subject categories, and is approximately 20× larger than prior programmatic video generation benchmarks. The paper's central conceptual contribution is the identification and quantification of the Execution-Spatial Gap: the finding that code executability does not imply spatial correctness, with an average ~41% drop from execution success rate to spatial pass rate across seven evaluated LLMs.

The paper also proposes a four-metric "funnel-style" evaluation framework: Code-Level Reliability (executability), Spatial Reasoning (layout correctness across animation frames), Prompt-Aware Dynamic Visual Complexity (PADVC), and Temporal Density (TD). This multi-layered evaluation moves beyond binary pass/fail code execution metrics toward nuanced assessment of visual output quality.
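The funnel idea and the Execution-Spatial Gap can be illustrated with a minimal sketch. The record layout (`SampleResult`) and the toy counts are hypothetical and not taken from the paper; the abstract also does not say whether the ~41% drop is measured in percentage points or relative to the execution rate, so the sketch reports both.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Outcome of one generated script (hypothetical record layout)."""
    executed: bool      # the script rendered without error
    spatial_pass: bool  # the rendered output passed the layout checks


def funnel_rates(results):
    """Return execution rate, spatial pass rate, and the gap between them.

    A sample only counts as a spatial pass if it also executed, so each
    stage of the funnel filters the output of the previous one.
    """
    total = len(results)
    if total == 0:
        raise ValueError("no results to score")
    executed = sum(r.executed for r in results)
    passed = sum(r.executed and r.spatial_pass for r in results)
    exec_rate = executed / total
    spatial_rate = passed / total
    gap = exec_rate - spatial_rate
    return {
        "exec_rate": exec_rate,
        "spatial_rate": spatial_rate,
        "gap_points": gap,  # absolute drop
        "gap_relative": gap / exec_rate if executed else 0.0,
    }


# Toy data (made up): 9 of 10 scripts run, but only 5 have a clean layout.
results = (
    [SampleResult(True, True)] * 5
    + [SampleResult(True, False)] * 4
    + [SampleResult(False, False)] * 1
)
rates = funnel_rates(results)
print(rates)
```

Reporting both forms of the gap avoids ambiguity when comparing models whose execution rates differ widely.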

2. Methodological Rigor

Data curation: The hierarchical pipeline—deterministic hard filtering via low-level rendering-engine parsing, boundary calibration with 3,000 expert-annotated examples, and human spot-check review—is methodologically sound. The decision to avoid VLM-based quality assessment is well-justified given known VLM weaknesses in fine-grained spatial reasoning. Starting from 30,000+ raw candidates and filtering to 10,372 demonstrates substantial quality control effort.

Evaluation metrics: The spatial reasoning metric is carefully designed with three mutually exclusive failure modes (overlap, out-of-bounds, leakage) and includes a hierarchy-aware parser with false-positive suppression. PADVC and TD are mathematically formulated with centered scores fitted to reference distributions, which is reasonable for detecting both under- and over-production of dynamics.
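As a rough illustration of how three mutually exclusive failure modes might be assigned from bounding boxes, consider the sketch below. It is not the paper's parser. The `Box` type, the fixed priority order (out-of-bounds, then leakage, then overlap), and the reading of "leakage" as a child element extending outside its parent container are all assumptions made for this example. The frame size only approximates Manim's default frame.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in frame coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def inside(inner: Box, outer: Box) -> bool:
    """True when inner lies entirely within outer."""
    return (inner.x_min >= outer.x_min and inner.y_min >= outer.y_min
            and inner.x_max <= outer.x_max and inner.y_max <= outer.y_max)


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share positive area."""
    return (a.x_min < b.x_max and b.x_min < a.x_max
            and a.y_min < b.y_max and b.y_min < a.y_max)


def classify_frame(frame: Box, elements: Dict[str, Box],
                   parents: Dict[str, str]) -> Optional[str]:
    """Return one failure label for a frame, or None if the layout is clean.

    Checking in a fixed order (an assumption) keeps the labels mutually
    exclusive. `parents` maps a child element name to its container name.
    """
    for box in elements.values():
        if not inside(box, frame):
            return "out_of_bounds"
    for child, parent in parents.items():
        if not inside(elements[child], elements[parent]):
            return "leakage"
    names = sorted(elements)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Hierarchy-aware suppression: a child sitting inside its own
            # parent is expected and is not reported as an overlap.
            if parents.get(a) == b or parents.get(b) == a:
                continue
            if overlaps(elements[a], elements[b]):
                return "overlap"
    return None


frame = Box(-7.1, -4.0, 7.1, 4.0)  # approximate default Manim frame
clean = {
    "title": Box(-3, 2.5, 3, 3.5),
    "panel": Box(-4, -3, 4, 2),
    "label": Box(-1, -1, 1, 0),
}
print(classify_frame(frame, clean, {"label": "panel"}))
```

A full evaluator would run a check like this on every frame of the animation, since a layout that is clean at the start can fail once elements move.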

Experimental concerns: The evaluation uses only a 20% random subset rather than the full benchmark, which is somewhat unusual for a benchmark paper and raises questions about statistical power and representativeness. The bootstrap confidence interval for the Execution-Spatial Gap is reported, which is good practice. Temperature is set to 0.7 with single-attempt evaluation, which is standard but limits assessment of model variability.
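A percentile bootstrap is one common way to obtain such an interval, and it also shows how interval width depends on the size of the evaluated subset. The sketch below is an assumption about the general technique, not the paper's procedure; the function name, resample count, and toy data are hypothetical.

```python
import random


def bootstrap_gap_ci(executed, spatial_pass, n_resamples=2000,
                     alpha=0.05, seed=0):
    """Percentile bootstrap CI for (execution rate - spatial pass rate).

    `executed` and `spatial_pass` are equal-length lists of booleans,
    one entry per evaluated sample.
    """
    if len(executed) != len(spatial_pass) or not executed:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(executed)
    gaps = []
    for _ in range(n_resamples):
        # Resample whole samples with replacement so that the pairing of
        # the two outcomes for each sample is preserved.
        idx = [rng.randrange(n) for _ in range(n)]
        exec_rate = sum(executed[i] for i in idx) / n
        pass_rate = sum(executed[i] and spatial_pass[i] for i in idx) / n
        gaps.append(exec_rate - pass_rate)
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (made up): 200 samples, 180 execute, 100 of those pass layout.
executed = [True] * 180 + [False] * 20
spatial_pass = [True] * 100 + [False] * 100
lower, upper = bootstrap_gap_ci(executed, spatial_pass)
print(f"95% CI for the gap: [{lower:.3f}, {upper:.3f}]")
```

Rerunning with a larger sample list narrows the interval, which is the practical cost of evaluating on a subset rather than the full benchmark.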

One methodological weakness is that the PADVC metric involves several design choices (the concave exponent p=0.7, the specific formulation of PVD, the Gaussian centering) that appear somewhat ad hoc. While reasonable, the sensitivity of results to these choices is not thoroughly explored. The paper also acknowledges that temporal ordering is not explicitly evaluated since sequential script execution generally preserves it—this is a reasonable simplification but limits the benchmark's temporal reasoning assessment.
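The kind of sensitivity check this paragraph calls for can be sketched briefly. The code below does not reproduce the paper's PADVC or PVD formulas, which are not given here. It assumes a generic Gaussian-centered score applied after a concave power transform, and sweeps the exponent to see whether a model ranking changes. All names and numbers are hypothetical.

```python
import math
import statistics


def centered_score(value, reference, p=0.7):
    """Score in [0, 1] that peaks when value**p equals the reference mean.

    Both too little and too much dynamics are penalized, because the
    score falls off on either side of the reference center. Inputs are
    assumed to be non-negative.
    """
    transformed = [r ** p for r in reference]
    mu = statistics.fmean(transformed)
    sigma = statistics.stdev(transformed)
    z = (value ** p - mu) / sigma
    return math.exp(-0.5 * z * z)


def ranking(model_values, reference, p):
    """Model names ordered from best to worst centered score."""
    scores = {m: centered_score(v, reference, p)
              for m, v in model_values.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy numbers (made up): raw complexity of reference animations and of
# three hypothetical models.
reference = [8.0, 10.0, 12.0, 9.0, 11.0]
model_values = {"model_a": 10.5, "model_b": 4.0, "model_c": 25.0}

rankings = {p: ranking(model_values, reference, p) for p in (0.5, 0.7, 0.9)}
for p, order in rankings.items():
    print(p, order)
```

If the ordering is stable across a plausible range of exponents, the specific choice of p matters less for the conclusions; if it flips, the choice needs justification.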

3. Potential Impact

Direct impact on programmatic generation: PRISM fills a genuine gap. Prior benchmarks for Manim/programmatic video generation were small (hundreds of samples) and primarily evaluated executability. The 20× scale increase and spatial evaluation framework provide a much-needed standardized testbed.

Broader implications for LLM evaluation: The Execution-Spatial Gap finding is the paper's most impactful insight—it demonstrates that code correctness metrics are insufficient proxies for output quality in generative settings. This lesson extends beyond Manim to any domain where LLMs generate code that produces visual or spatial outputs (CAD, web design, SVG generation, robotics).

Training data resource: The 10,372 curated instruction-code pairs could serve as high-quality training data for fine-tuning models on programmatic video generation, though the paper does not explore this.

Limitations on broader applicability: The benchmark is tightly coupled to Manim despite claims of generalizability. The educational knowledge visualization domain, while practical, is somewhat niche. The bilingual (English/Chinese) scope is useful but limited.

4. Timeliness & Relevance

The paper addresses a timely need. With the explosion of LLM-based code generation and growing interest in automated content creation, the gap between code that runs and code that produces correct visual output is increasingly important. The educational video generation use case is particularly relevant given the proliferation of AI-assisted educational tools.

The benchmark evaluates very recent models (GPT-5.4, Gemini 3.1 Pro Preview, Claude Sonnet 4.6, Kimi K2), which makes the findings immediately relevant to the current state of the art. However, this also means the benchmark's longevity depends on how quickly models improve on these tasks.

5. Strengths & Limitations

Key Strengths:

Scale and quality: 10,372 human-calibrated samples represent a significant investment and genuine improvement over prior work.

Execution-Spatial Gap: A compelling and well-quantified finding with practical implications for how we evaluate code generation quality.

Deterministic evaluation: Using low-level rendering-engine parsing rather than VLM-based judging provides reproducible, objective spatial assessments.

Comprehensive model comparison: Seven models across open/closed-source, with thoughtful analysis of failure modes and ablations (thinking mode, text expansion).

Practical insights: The finding that thinking modes don't reliably improve spatial quality, and that excessive dynamic intensity degrades spatial correctness, are actionable for model developers.

Notable Weaknesses:

Manim specificity: Despite generalizability claims, the benchmark, metrics, and curation pipeline are deeply tied to Manim. Transfer to other programmatic generation frameworks is unproven.

Evaluation on 20% subset: Using only ~2,000 samples per language for the main evaluation somewhat undermines the "large-scale" claim.

PADVC/TD complexity: These metrics involve numerous design choices and hyperparameters whose sensitivity is not fully analyzed. Their diagnostic value versus simpler alternatives is unclear.

Limited temporal evaluation: The paper essentially concedes that temporal ordering is not meaningfully tested, weakening the "spatial-temporal" framing.

No fine-tuning experiments: The paper doesn't demonstrate that training on PRISM data improves model performance, which would strengthen the contribution.

Single-attempt evaluation: No exploration of self-repair, iterative refinement, or multi-attempt strategies that are common in practice.

Summary

PRISM makes a solid empirical contribution by providing a substantially larger benchmark for programmatic video generation and introducing the Execution-Spatial Gap concept, which is both well-supported and practically important. The evaluation framework is more rigorous than prior work, though the diagnostic metrics (PADVC, TD) add complexity without proportional insight. The paper's primary limitation is its tight coupling to Manim and the educational domain, which constrains generalizability. Nevertheless, the core finding—that runnable code ≠ spatially correct output—is valuable for the broader code generation community.

Rating: 6.5 / 10

Significance 6.5 · Rigor 7 · Novelty 5.5 · Clarity 7

Generated May 20, 2026

Comparison History (19)

vs. OSCToM: RL-Guided Adversarial Generation for High-Order Theory of Mind

claude-opus-4.6 · 5/21/2026

PRISM introduces a large-scale, well-structured benchmark (10,372 pairs, 437 categories) addressing a timely gap in programmatic video generation evaluation. Its novel funnel-style evaluation framework and the discovery of the 'Execution-Spatial Gap' provide actionable insights with broad applicability across code generation, video synthesis, and spatial reasoning. Paper 1 (OSCToM) makes solid contributions to Theory of Mind benchmarking but addresses a narrower problem domain. PRISM's scale, bilingual coverage, and relevance to the rapidly growing field of code-generating LLMs give it broader potential impact.

vs. AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

gpt-5.2 · 5/21/2026

Paper 2 (PRISM) is likely to have higher scientific impact because it delivers a large, reusable benchmark (10,372 pairs, bilingual) with concrete, automatable metrics and a clear empirical finding (Execution–Spatial Gap) that can drive model and method development. Its applications span programmatic video generation, spatial-temporal reasoning, code generation, and multimodal evaluation, making it broadly useful and timely. Paper 1 offers valuable evaluation taxonomies and insights, but is positioned as a protocol demonstration with a small-scale run and no benchmark release, limiting immediate adoption and downstream impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a more fundamental and broadly applicable problem—systematic diagnosis of LLM agent failures at scale—which impacts the entire LLM agent ecosystem. It introduces a novel formalization (corpus-level trace diagnostics) and demonstrates concrete downstream improvements (30.4pp gains). Paper 1, while valuable as a benchmark for programmatic video generation, serves a narrower community. Paper 2's methodology (multi-agent diagnostic system) has wider applicability across any LLM agent deployment, making it more likely to influence research and practice broadly.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gemini-3.1 · 5/21/2026

Paper 1 addresses a critical, timely issue in frontier AI evaluation by advocating for open-world, real-world tasks to supplement automated benchmarks. Its focus on long-horizon capabilities and safety has profound implications for AI policy, safety, and broad real-world applications. While Paper 2 offers a rigorous benchmark for a specific domain (programmatic video generation), Paper 1's conceptual shift in how we evaluate state-of-the-art AI systems ensures a much broader and more significant impact across the entire AI ecosystem.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gemini-3.1 · 5/21/2026

Paper 2 addresses a fundamental and critical challenge in AI: the limitations of static benchmarks for evaluating frontier models. By proposing a framework for open-world, long-horizon evaluations, it has a broader potential impact across the entire field of AI development and safety compared to Paper 1, which focuses on a more specific niche of programmatic video generation.

vs. Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

gpt-5.2 · 5/20/2026

Paper 2 (PRISM) likely has higher impact because it introduces a large, multilingual benchmark and a multi-metric evaluation framework that can become shared infrastructure for the community, enabling standardized comparison and driving progress across program synthesis, vision-language, and temporal/spatial reasoning. Its dataset scale and the identified “Execution–Spatial Gap” provide actionable, broadly relevant insights. Paper 1 offers useful controlled experiments and negative findings about LLM agent behavior in code optimization, but its scope is narrower and more diagnostic than enabling, which may limit cross-field uptake.

vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

gemini-3.1 · 5/20/2026

Paper 1 addresses a fundamental and highly timely problem in AI safety and agentic systems (trust calibration and human oversight) by introducing a rigorous mathematical framework based on Preferential Bayesian Optimization. Its formalized approach to progressive autonomy has broader applicability and higher potential impact across various domains of autonomous AI compared to Paper 2, which, while valuable, is restricted to benchmarking programmatic video generation.

vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving

gemini-3.1 · 5/20/2026

Paper 2 presents a highly counterintuitive finding—that higher observation fidelity degrades embodied LLM problem-solving, and moderate noise actually improves success by breaking repetitive action loops. This fundamentally challenges existing assumptions in the rapidly growing field of Embodied AI, likely prompting significant conceptual shifts and follow-up research. While Paper 1 provides a valuable benchmark for a specific multimodal niche (programmatic video generation), Paper 2's insights into the opaque decision processes and failure modes of LLM agents have broader, paradigm-shifting implications for robotics and autonomous agent design.

vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

gpt-5.2 · 5/20/2026

Paper 1 likely has higher broad scientific impact: it introduces a large, multilingual benchmark (10,372 pairs) and a multi-metric evaluation framework for programmatic spatiotemporal reasoning, uncovering a general “execution–spatial” failure mode relevant to code generation, vision-language, and evaluation research. Its outputs are readily reusable by the community and can standardize comparisons across models. Paper 2 is novel and timely for AI safety, but jailbreak methods are narrower in positive scientific utility and can face dissemination/ethical constraints, limiting adoption and cross-field impact despite methodological sophistication.

vs. Divergence-Suppressing Couplings for Rectified Flow

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact: it introduces a principled, broadly applicable modification to Rectified Flow grounded in field properties (divergence), with clear computational advantages (offline correction, no deployment overhead) and demonstrated gains on synthetic and image-generation tasks. This targets a timely core generative-modeling paradigm and could influence diffusion/flow training practices across domains. Paper 1 is valuable infrastructure (a large benchmark + metrics) but is narrower (programmatic video/code evaluation) and its impact depends on community adoption and alignment with emerging generation paradigms.

vs. Dynamics of collective creativity in AI art competitions

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamentally interdisciplinary question about collective creativity dynamics in human-AI systems, with broad implications for cultural evolution, computational creativity, and social science. Its empirical analysis of 130,882 images reveals paradoxical behavioral patterns (users prefer remixing less novel works despite novel parents producing more liked outputs), offering genuinely novel theoretical insights. Paper 1, while methodologically solid with a useful benchmark, addresses a narrower technical problem (programmatic video generation evaluation) with more limited cross-disciplinary appeal. Paper 2's findings about human-AI co-creation dynamics are timely and relevant to a much wider audience.

vs. LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

gemini-3.1 · 5/20/2026

While Paper 1 introduces a valuable benchmark for programmatic video generation, Paper 2 provides fundamental theoretical insights into LLM reasoning limitations. By uncovering structural failure modes, working memory constraints, and the phenomenon of computational abandonment at specific matrix dimensions, Paper 2 offers profound implications for understanding and improving the cognitive architecture of large language models across multiple domains.

vs. Learning to Learn from Multimodal Experience

claude-opus-4.6 · 5/20/2026

Paper 1 proposes a novel paradigm shift—making memory design itself learnable for multimodal experience-driven agents—addressing a fundamental limitation in AI agent design. This has broader impact across multiple fields (robotics, embodied AI, multimodal reasoning) and introduces a conceptually deeper contribution. Paper 2, while solid as a benchmark contribution (PRISM), is more incremental and narrowly focused on programmatic video generation evaluation. Benchmarks have impact but are typically surpassed quickly, whereas new learning paradigms can shape research directions for years.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 (PRISM) likely has higher scientific impact due to broader applicability and timeliness: it introduces a large, bilingual benchmark plus a multi-metric evaluation framework that exposes a general failure mode (Execution–Spatial Gap) relevant to LLM code generation, program synthesis, vision-language, graphics, and evaluation research. Its dataset scale and diagnostics can become a community standard for spatial-temporal reasoning in programmatic video/animation. Paper 1 is novel in causal memory selection and provides a benchmark, but its impact is more niche to long-horizon agent memory systems and may be harder to generalize across tasks/models.

vs. EXG: Self-Evolving Agents with Experience Graphs

gpt-5.2 · 5/20/2026

Paper 2 (EXG) likely has higher impact: it proposes a general, reusable framework for self-evolving LLM agents via structured experience graphs, addressing a timely and widely relevant problem (continual improvement during deployment). Its plug-and-play design and demonstrated performance–efficiency gains across multiple agent tasks suggest broad applicability across reasoning, coding, and autonomous systems. Paper 1 (PRISM) is a strong benchmark contribution, but its impact is narrower (programmatic video/code spatial-temporal evaluation) and mainly advances evaluation methodology rather than a broadly deployable capability.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

claude-opus-4.6 · 5/20/2026

PRISM introduces a large-scale benchmark (10,372 samples) addressing an emerging and foundational challenge at the intersection of LLMs and programmatic video generation. Its novel evaluation framework revealing the 'Execution-Spatial Gap' provides actionable insights for a rapidly growing field. The benchmark's scale, multilingual design, and broad category coverage give it wider utility across communities (NLP, computer vision, code generation). Paper 1, while solid, offers incremental improvement (2.1% AUC gain) in a narrower deepfake detection subfield using relatively straightforward fusion of existing components.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact because it introduces a large, multilingual benchmark and evaluation framework (PRISM) that can become shared infrastructure for a broad community (LLMs, code generation, video/vision, spatial-temporal reasoning). The proposed metrics and the identified “Execution-Spatial Gap” provide a reusable diagnostic that can shape future model and training research. Paper 1 offers valuable empirical insights for multi-model LLM serving, but is more systems-specific and primarily informs scheduler design rather than creating a broadly adopted benchmark standard.

vs. Interference-Aware Multi-Task Unlearning

gpt-5.2 · 5/20/2026

Paper 2 likely has higher scientific impact because it introduces a large, multilingual benchmark plus a multi-metric evaluation framework that can become a community standard for assessing programmatic spatial-temporal reasoning—broadly useful across LLM code generation, vision-language, graphics, and evaluation research. The “Execution-Spatial Gap” is a clear, general diagnostic finding that can redirect research priorities. Paper 1 is novel and relevant for multi-task unlearning, but its scope is narrower (specific to unlearning in shared-backbone multi-task models) and may impact a smaller set of downstream applications.

vs. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

gemini-3.1 · 5/20/2026

Paper 2 offers a highly novel, cross-disciplinary approach by applying robotics control theory to LLM safety, shifting the paradigm from static output filtering to dynamic interaction trajectory control. Its application in highly sensitive, real-world domains like autism therapy and school de-escalation demonstrates exceptional potential for broad societal and scientific impact. While Paper 1 provides a valuable large-scale benchmark for programmatic video generation, Paper 2 tackles a more urgent, generalized bottleneck in AI safety with rigorous, real-world validation and broader interdisciplinary relevance.