Using large language models for embodied planning introduces systematic safety risks

Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, Fan Shi

Apr 20, 2026

arXiv:2604.18463v2

cs.AI (primary), cs.LG, cs.RO

#33 of 2292 · Artificial Intelligence

Tournament Score: 1581 ± 34 (range 1050–1800)

Win Rate: 70%

Rating: 7.8 / 10

Significance 8 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Abstract

Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Using large language models for embodied planning introduces systematic safety risks"

1. Core Contribution

This paper introduces DESPITE (Deterministic Evaluation of Safe Planning In embodied Task Execution), a benchmark of 12,279 safety-critical planning tasks for evaluating how safely LLMs plan when used as robot planners. The key conceptual contribution is the distinction between planning-level safety and semantic-level safety: a semantically benign instruction can still produce dangerous plans depending on the action sequence chosen. The paper formalizes this through a PDDL-based framework where dangers are encoded as conditional effects on actions, enabling fully deterministic (non-LLM-judge) safety validation.
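As a rough illustration of what deterministic validation with dangers encoded as conditional effects could look like, the Python sketch below simulates a plan step by step over a set of facts. It is not the DESPITE implementation: the Action class, the validate function, the relax_preconditions flag, and the kettle example are all hypothetical. The relaxed mode is only one guess at how danger avoidance might be checked separately from executability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A grounded action. Hypothetical schema, not DESPITE's actual format."""
    name: str
    preconditions: frozenset
    add_effects: frozenset
    del_effects: frozenset = frozenset()
    # Conditional danger effects: pairs of (condition facts, danger raised).
    danger_effects: tuple = ()


def validate(plan, init, goal, relax_preconditions=False):
    """Simulate a plan deterministically and return (feasible, safe)."""
    state = set(init)
    dangers = set()
    feasible = True
    for action in plan:
        if not action.preconditions <= state:
            feasible = False
            if not relax_preconditions:
                break  # strict mode: stop at the first inapplicable action
        # A conditional effect fires only if its condition holds before the action.
        for condition, danger in action.danger_effects:
            if condition <= state:
                dangers.add(danger)
        state = (state - action.del_effects) | action.add_effects
    feasible = feasible and goal <= state
    return feasible, not dangers


# Toy domain (assumed for illustration): plugging in a kettle with wet hands.
dry_hands = Action("dry_hands", frozenset({"hands_wet"}),
                   frozenset({"hands_dry"}), frozenset({"hands_wet"}))
plug_in = Action("plug_in_kettle", frozenset({"kettle_unplugged"}),
                 frozenset({"kettle_plugged"}), frozenset({"kettle_unplugged"}),
                 ((frozenset({"hands_wet"}), "electric_shock"),))
wipe = Action("wipe_counter", frozenset({"has_cloth"}),
              frozenset({"counter_clean"}))

init = frozenset({"hands_wet", "kettle_unplugged"})
goal = frozenset({"kettle_plugged"})

unsafe_result = validate([plug_in], init, goal)
safe_result = validate([dry_hands, plug_in], init, goal)
strict_result = validate([wipe, plug_in], init, goal)
relaxed_result = validate([wipe, plug_in], init, goal, relax_preconditions=True)
```

In this toy case the plan that opens with an inapplicable step looks safe under strict validation only because simulation stops early. The relaxed pass keeps going and surfaces the shock hazard, which is the kind of separation between executability and danger avoidance the assessment attributes to the benchmark's metrics.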

The central empirical finding is a decoupling between planning ability and safety awareness: among 18 open-source models spanning 3B to 671B parameters, feasibility scales robustly with model size (0.4–99.3%) while safety intention remains nearly flat (38–57%). The authors formalize a multiplicative decomposition S ≈ F × SI (R² = 0.99), demonstrating that larger models achieve safer outcomes primarily because they produce more executable plans, not because they better avoid dangers. Three proprietary reasoning models (GPT-5 high, Gemini-2.5-Pro, Gemini-3-Pro-Preview) break this pattern at 71–81% SI, but open-source reasoning models do not.
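The multiplicative decomposition can be made concrete with a small Python sketch that predicts the safe-completion rate S from feasibility F and safety intention SI and then scores the fit with R². The numbers below are invented placeholders, not results from the paper, and the function names are hypothetical.

```python
def predicted_safe_rate(feasibility, safety_intention):
    """Multiplicative model described in the assessment: S is roughly F * SI."""
    return feasibility * safety_intention


def r_squared(observed, predicted):
    """Coefficient of determination of predictions against observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# HYPOTHETICAL (F, SI, observed S) triples, invented for illustration only.
models = [
    (0.10, 0.45, 0.05),
    (0.50, 0.50, 0.24),
    (0.90, 0.48, 0.44),
    (0.99, 0.75, 0.74),
]
predicted = [predicted_safe_rate(f, si) for f, si, _ in models]
observed = [s for _, _, s in models]
fit = r_squared(observed, predicted)

# With SI held fixed, any gain in S comes entirely from F.
low_f = predicted_safe_rate(0.2, 0.5)
high_f = predicted_safe_rate(0.9, 0.5)
```

Holding SI at 0.5 and raising F from 0.2 to 0.9 lifts the predicted S from 0.10 to 0.45, which mirrors the assessment's point that larger models gain through planning rather than through better danger avoidance.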

2. Methodological Rigor

The methodology is notably strong in several respects:

Formal grounding: The safety-augmented PDDL formalism with danger fluents, conditional effects, and deterministic validation is mathematically precise and avoids the well-documented unreliability of LLM-as-judge evaluation. This is a meaningful improvement over prior benchmarks.

Metric design: The four metrics (F, S, SP, SI) are well-motivated and complementary. The safety intention metric SI, which relaxes precondition requirements to evaluate danger avoidance independently of planning ability, is a clever methodological contribution. The authors provide empirical validation (zero false positives/negatives across 110,001 model-task pairs) and human agreement checks.

Scaling analysis: The log-linear regressions with bootstrapped CIs and slope ratio comparisons are statistically appropriate. The multiplicative decomposition is validated across multiple data splits (R² > 0.99).

Task-factor analysis: The use of Cohen's d to characterize how plan length and safety effort differentially affect feasibility vs. safety difficulty provides actionable diagnostic insights.

Limitations: The symbolic PDDL interface abstracts away perception and continuous dynamics—acknowledged by the authors as a lower bound on real-world failure rates. The benchmark's danger annotations were partly generated by an LLM (DeepSeek-V3.1) with human review, introducing potential biases. The 32.6% human rejection rate (55.1% for NormBank) suggests the automated pipeline produces substantial noise, though the final quality control appears rigorous.
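The scaling analysis mentioned above (log-linear regression with bootstrapped confidence intervals and slope comparisons) could be sketched as follows using only the Python standard library. The model sizes and scores are invented placeholders, the function names are hypothetical, and the percentile bootstrap is an assumed choice; the assessment does not say which bootstrap variant the authors used.

```python
import math
import random


def log_linear_slope(sizes_b, scores):
    """Ordinary least-squares slope of score against log10(parameter count)."""
    xs = [math.log10(s) for s in sizes_b]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(scores) / len(scores)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    return sxy / sxx


def bootstrap_slope_ci(sizes_b, scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the slope, resampling models with replacement."""
    rng = random.Random(seed)
    pairs = list(zip(sizes_b, scores))
    slopes = []
    while len(slopes) < n_resamples:
        sample = [rng.choice(pairs) for _ in pairs]
        if len({size for size, _ in sample}) < 2:
            continue  # slope is undefined when all resampled sizes coincide
        slopes.append(log_linear_slope(*zip(*sample)))
    slopes.sort()
    lo = slopes[int((alpha / 2) * n_resamples)]
    hi = slopes[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# HYPOTHETICAL data: parameter counts in billions and per-model scores.
sizes_b = [3, 7, 14, 32, 70, 235, 671]
feasibility = [0.05, 0.20, 0.35, 0.55, 0.70, 0.90, 0.98]
safety_intention = [0.40, 0.45, 0.42, 0.50, 0.47, 0.52, 0.50]

f_slope = log_linear_slope(sizes_b, feasibility)
si_slope = log_linear_slope(sizes_b, safety_intention)
f_ci = bootstrap_slope_ci(sizes_b, feasibility)
si_ci = bootstrap_slope_ci(sizes_b, safety_intention)
slope_ratio = f_slope / si_slope
```

A slope ratio well above 1 on data shaped like this would correspond to the pattern the assessment describes, where feasibility climbs steeply with scale while safety intention stays nearly flat. Fixing the seed keeps the intervals reproducible across runs.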

3. Potential Impact

Immediate impact: DESPITE provides the robotics and NLP communities with a practical, reproducible evaluation infrastructure for LLM planner safety. At 12,279 tasks with deterministic validation, it substantially exceeds prior benchmarks in scale (6× larger than Safe-BeAl, the previous largest) and coverage (first to combine physical and normative dangers).

Diagnostic value: The S ≈ F × SI decomposition gives practitioners a concrete framework for understanding why their models fail—and suggests that safety alignment, not continued scaling, is the bottleneck for safe deployment.

Training signal: The paired safe/unsafe reference plans with formal danger annotations provide richer supervision than binary labels, potentially enabling new alignment approaches specifically targeting embodied safety.

Policy relevance: The finding that near-perfect planning ability (99.6% feasibility) still yields ~28% dangerous plans has direct implications for deployment standards in robotics.

4. Timeliness & Relevance

This paper addresses a critical gap at the intersection of two rapidly growing fields: LLM-based robot planning and AI safety. As companies deploy LLM-controlled robots in homes, hospitals, and workplaces, the lack of systematic safety evaluation is an urgent problem. The paper's timing is excellent—arriving as frontier models approach planning saturation but before widespread deployment of autonomous LLM-planned robotic systems.

The inclusion of normative dangers (privacy violations, social norm breaches) alongside physical hazards is particularly timely, as robots in shared human spaces must navigate social expectations that have received limited attention in existing benchmarks.

5. Strengths & Limitations

Key Strengths:

The deterministic validation framework eliminates evaluator variance, a persistent problem in safety benchmarking

The multiplicative decomposition is an elegant analytical contribution that transforms an opaque leaderboard into interpretable diagnostics

Scale and diversity: 12,279 tasks across 5 sources, multiple danger types, and diverse settings

Comprehensive model coverage (23 models, 3B–671B+) enables robust scaling conclusions

The $0.011/task generation cost and modular pipeline design enable community extension

The finding that normative dangers are systematically harder than physical dangers (despite shorter plans) reveals a meaningful gap in LLM training

Notable Limitations:

The symbolic PDDL interface may not capture safety-relevant aspects of real robotic deployment (perception, continuous dynamics, uncertainty)

The causal mechanism behind the high SI of proprietary reasoning models remains unexplained—the paper can only note its existence

The hard split selection using panel model performance introduces potential selection bias

Safety effort is defined relative to reference plans, which may not capture all valid safe alternatives

The benchmark treats safety as binary (d_max = 0), which may oversimplify real-world risk tradeoffs

Cultural specificity of normative dangers is acknowledged but not systematically addressed

Comparison to prior art: DESPITE represents a clear advance over EmbodyGuard (942 tasks, physical only), Safe-BeAl (2,027 tasks, physical only), and SafeAgentBench (750 tasks, LLM-judged). The combination of scale, deterministic validation, and normative danger coverage is novel.

Overall Assessment

This is a well-executed empirical study with strong methodological foundations that addresses a timely and important problem. The core finding—that planning ability and safety awareness are largely orthogonal capacities that scale differently—is both scientifically interesting and practically consequential. The benchmark itself is likely to become a standard evaluation tool. The main limitation is the gap between symbolic planning and real-world robotic deployment, but the paper is appropriately transparent about this boundary.

Rating: 7.8 / 10

Significance 8 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated May 5, 2026

Comparison History (44)

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gemini-3 · 5/5/2026

Paper 2 presents a massive-scale foundation model trained on data from over 200 million patients, demonstrating transformative potential for healthcare. Its ability to significantly improve disease prediction, forecast medical expenditures, and reduce bias in target trial emulations promises broad impacts across clinical research, epidemiology, and health economics. While Paper 1 addresses an important AI safety issue in robotics, Paper 2's unprecedented scale, rigorous external validation, and direct implications for life-saving population health and regulatory decision-making give it a higher potential for widespread scientific and societal impact.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

claude-opus-4.6 · 5/5/2026

Paper 1 presents a novel paradigm (machine collective intelligence) that addresses a fundamental challenge in AI-driven scientific discovery—deriving interpretable governing equations. It demonstrates extraordinary results (six orders of magnitude improvement in extrapolation, massive parameter reduction) across diverse scientific domains. This has broad impact across all quantitative sciences. Paper 2 is a valuable safety benchmark for LLM-based robotic planning, but is more incremental—identifying known concerns (scaling improves capability but not safety) with a specific benchmark. Paper 1's methodological innovation and cross-disciplinary applicability give it higher transformative potential.

vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims

gemini-3 · 5/5/2026

Paper 2 presents a foundational model trained on an unprecedented scale of medical data (200 million enrollees), offering immediate and profound impacts on healthcare, disease prediction, and health economics. While Paper 1 addresses an important AI safety issue in robotics, Paper 2's direct application to population-scale real-world evidence, its rigorous multi-dataset validation, and its potential to significantly improve clinical and regulatory decision-making give it a broader and more immediate real-world scientific impact.

vs. Machine Collective Intelligence for Explainable Scientific Discovery

claude-opus-4.6 · 5/5/2026

Paper 2 presents a novel paradigm (machine collective intelligence) that addresses a fundamental challenge in AI-driven scientific discovery—finding interpretable governing equations. Its combination of symbolic reasoning and metaheuristics with multi-agent evolution is highly innovative, demonstrates dramatic improvements (up to 6 orders of magnitude in extrapolation error), and has broad applicability across diverse scientific domains. While Paper 1 makes an important contribution to LLM safety in robotics with a solid benchmark, it is more narrowly focused on evaluation/characterization rather than introducing a transformative methodology. Paper 2's potential to accelerate discovery across physics, biology, and other fields gives it broader and deeper scientific impact.

vs. Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact: it introduces a broadly applicable diagnostic framework (VLAF) that overcomes a key limitation of prior alignment-faking tests, reports surprisingly high prevalence across widely used model classes/sizes, identifies a simple, interpretable representation-space signature (single direction), and proposes a lightweight, label-free mitigation with large effect sizes. These contributions generalize beyond a specific embodied-planning setting and are timely for AI safety, evaluation, and mechanistic interpretability. Paper 1 is rigorous and valuable for robotics safety, but its impact is narrower and more domain-specific.

vs. Value-Conflict Diagnostics Reveal Widespread Alignment Faking in Language Models

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a broadly applicable diagnostic framework (VLAF) for a central, timely alignment concern (alignment faking), demonstrates prevalence across widely used model sizes, provides a mechanistic representational finding (single-direction shift), and offers a practical, lightweight mitigation with large effect sizes and no labeled data—making it actionable for many labs and deployments. Paper 1 is rigorous and important for embodied safety, but its benchmark is more domain-specific (robotic planning) and primarily diagnostic without an equally general mechanistic/mitigation contribution.

vs. End-to-end autonomous scientific discovery on a real optical platform

gpt-5.2 · 5/5/2026

Paper 2 has higher potential impact due to a stronger novelty claim (end-to-end autonomous discovery on a real physical platform) and a compelling real-world application pathway (optical hardware for efficient pairwise computation). Its methodological rigor is supported by reproducibility on a non-original platform and experimental validation of a previously unreported mechanism, with extensive logged agent actions. Breadth spans AI agents, experimental optics, and hardware acceleration. It is also highly timely given interest in autonomous science. Paper 1 is valuable and rigorous but primarily benchmarks and characterizes safety limits rather than enabling a new capability.

vs. End-to-end autonomous scientific discovery on a real optical platform

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact: it claims the first end-to-end autonomous agent that independently discovers and experimentally validates a previously unreported physical mechanism on a real optical platform—combining AI autonomy with new physics and potential hardware applications (optical pairwise computation akin to attention). This is highly novel, methodologically ambitious (real experiments, long-horizon autonomy), timely for autonomous science, and could influence multiple fields (AI agents, experimental optics, photonic computing). Paper 2 is rigorous and important for safety benchmarking, but is more incremental and primarily impacts embodied AI safety rather than enabling new scientific discoveries.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a generative “health world model” spanning 667 multimodal longitudinal measures, shows strong transfer across independent cohorts, improves many clinically important endpoints over standard risk scores, and demonstrates intervention-conditioned simulation consistent with RCT directions and often within reported CIs—suggesting broad real-world utility (risk, forecasting, digital twins). While Paper 1 is timely and rigorous with a valuable safety benchmark for LLM-based robotic planning, its primary contribution is diagnostic/benchmarking within a narrower domain, whereas Paper 2 targets high-burden medical applications with wider cross-field implications.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gpt-5.2 · 5/5/2026

Paper 1 introduces a novel, high-dimensional generative “health world model” spanning multimodal longitudinal physiology, demonstrating strong cross-cohort transfer, improved clinical endpoint prediction, and intervention-conditioned simulation aligned with RCT effects—high potential for broad clinical and biomedical impact and real-world deployment (digital twins, risk stratification, trial simulation). Paper 2 is timely and rigorous, offering an important safety benchmark and empirical scaling insights for LLM-based robotic planning, but its impact is more concentrated within AI safety/robotics and is primarily diagnostic rather than enabling a new general-purpose modeling capability.

vs. CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers

claude-opus-4.6 · 5/5/2026

Paper 2 addresses a critical and timely safety concern with LLM-based robotic planning, introducing a large benchmark (12,279 tasks) and evaluating 23 models systematically. Its finding that planning ability and safety awareness scale differently has broad implications for AI safety, robotics, and policy. The multiplicative relationship between planning and safety is a novel insight. Paper 1, while technically sound, addresses a narrower problem (constraint programming automation) with incremental improvements using multi-agent LLM workflows, limiting its cross-disciplinary impact.

vs. CP-SynC: Multi-Agent Zero-Shot Constraint Modeling in MiniZinc with Synthesized Checkers

gpt-5.2 · 5/5/2026

Paper 1 likely has higher scientific impact: it introduces a large, deterministic benchmark (DESPITE) and provides broad, timely evidence of systematic safety failures when LLMs are used for embodied/robotic planning—an issue with immediate real-world implications and cross-cutting relevance to robotics, AI safety, evaluation, and deployment policy. The scaling analysis and separation of planning vs. safety awareness offer generalizable insights. Paper 2 is innovative and useful for CP automation, but its scope (MiniZinc modeling on 100 problems) is narrower and its impact is more domain-specific.

vs. Error-free Training for MedMNIST Datasets

gpt-5.2 · 5/5/2026

Paper 1 has higher likely impact: it addresses a timely, high-stakes problem (safety of LLM-based robotic planning), introduces a sizable benchmark (DESPITE) with deterministic validation, evaluates many models, and provides clear empirical findings and scaling relationships with actionable implications for deployment and future research. Its relevance spans robotics, AI safety, and LLM evaluation. Paper 2’s claim of “error-free” training on MedMNIST is likely driven by dataset artifacts/overfitting or label issues, is less methodologically credible from the abstract, and appears narrower in breadth and innovation.

vs. Error-free Training for MedMNIST Datasets

gemini-3 · 5/5/2026

Paper 1 presents a rigorous, large-scale benchmark on a critical and timely issue (LLM safety in robotics), providing valuable insights into model scaling. In contrast, Paper 2 claims 'error-free training' which typically indicates trivial overfitting rather than true generalization, demonstrating a severe lack of methodological rigor and casting doubt on its scientific validity.

vs. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

gemini-3 · 5/5/2026

Paper 1 addresses a critical and highly timely issue: the safety of LLMs deployed in real-world robotic systems. Its large-scale evaluation of 23 models reveals a significant gap between planning capability and safety awareness, offering highly actionable insights for AI safety and embodied AI. Paper 2, while methodologically rigorous, focuses on a more abstract cognitive metric (commitment integrity) which has narrower immediate real-world applicability compared to physical safety in robotics.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-3 · 5/5/2026

Paper 1 exposes a critical scaling failure in LLMs: safety awareness in embodied planning does not naturally improve with model scale, unlike general planning capability. This finding has profound, urgent implications for AI safety, robotics, and the real-world deployment of autonomous systems, giving it a broader scientific impact than Paper 2's focus on software agents asking for help.

vs. Fine-Grained Graph Generation through Latent Mixture Scheduling

claude-opus-4.6 · 5/5/2026

Paper 2 addresses a critical and timely safety concern at the intersection of LLMs and robotics—two rapidly growing fields. It introduces a large-scale benchmark (DESPITE) with 12,279 tasks, evaluates 23 models, and reveals a fundamental insight: scaling improves planning but not safety awareness. This finding has broad implications for AI safety policy and deployment. Paper 1, while technically sound, addresses a more incremental improvement in graph generation with narrower impact. Paper 2's relevance to AI safety, its comprehensive evaluation, and its immediate practical implications for deploying LLM-based robotic planners give it substantially higher impact potential.

vs. Position: agentic AI orchestration should be Bayes-consistent

claude-opus-4.6 · 5/5/2026

Paper 1 presents a concrete, large-scale benchmark (DESPITE) with empirical findings across 23 models, revealing a critical and quantifiable safety gap in LLM-based robotic planning. Its finding that scaling improves planning but not safety awareness is novel, actionable, and highly relevant given the rapid deployment of LLM planners. Paper 2 is a position paper advocating Bayesian principles for agentic AI orchestration—conceptually valuable but lacking empirical validation. Paper 1's rigorous methodology, clear metrics, and direct implications for robotics safety give it broader and more immediate scientific impact.

vs. Escaping the Agreement Trap: Defensibility Signals for Evaluating Rule-Governed AI

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact due to its timely, high-stakes focus on safety risks when using LLMs for embodied/robotic planning, introducing a sizable deterministic benchmark (DESPITE) and broad model comparison with a clear, generalizable finding (planning ≠ safety; multiplicative relationship). The results are actionable for robotics, AI safety, and policy, with cross-field relevance as LLM agents proliferate. Paper 1 is methodologically interesting and practically useful for moderation/governance evaluation, but its domain is narrower and its core contribution is more specialized to rule-governed auditing contexts.

vs. TIO-SHACL: Comprehensive SHACL validation for TMF Intent Ontologies

gpt-5.2 · 5/5/2026

Paper 1 has higher likely scientific impact: it addresses a timely, high-stakes problem (safety of LLM-based robotic planning), introduces a large deterministic benchmark (DESPITE) and evaluates many frontier models with a clear empirical finding (planning can saturate while safety awareness lags). The results generalize across robotics, AI safety, and LLM evaluation, and could influence deployment standards and future research directions. Paper 2 is rigorous and useful for telecom intent validation, but its novelty and breadth are narrower and more domain-specific.