Using large language models for embodied planning introduces systematic safety risks
Tao Zhang, Kaixian Qu, Zhibin Li, Jiajun Wu, Marco Hutter, Manling Li, Fan Shi
Abstract
Large language models are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4–99.3%) while safety awareness remains relatively flat (38–57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71–81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.
AI Impact Assessments (3 models)
Scientific Impact Assessment: "Using large language models for embodied planning introduces systematic safety risks"
1. Core Contribution
This paper introduces DESPITE, a benchmark of 12,279 safety-critical planning tasks for evaluating whether LLMs used as robotic planners generate safe action sequences. The key intellectual contribution is the distinction between semantic-level safety (is the instruction harmful?) and planning-level safety (does the generated action sequence cause harm?). Safety is formalized through conditional danger effects in PDDL: specific actions trigger harm only under particular state conditions, enabling fully deterministic validation without relying on LLM-based judges or simulators.
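To make the idea of a conditional danger effect concrete, the following is a minimal Python sketch of deterministic plan validation. It is not the DESPITE validator and does not parse PDDL: the `Action` class, the `danger_when` field, the `validate` function, and the toy kitchen domain are all hypothetical names invented for illustration. It shows only the core mechanism described above, namely that an action is harmful when specific state conditions hold at the moment it is executed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical STRIPS-style action with one optional conditional danger."""
    name: str
    preconditions: frozenset                 # facts required before execution
    add_effects: frozenset                   # facts that become true
    del_effects: frozenset                   # facts that become false
    danger_when: Optional[frozenset] = None  # harm if all of these hold at execution


def validate(plan, initial_state, goal):
    """Replay a plan symbolically and return (feasible, dangerous)."""
    state = set(initial_state)
    dangerous = False
    for action in plan:
        if not action.preconditions <= state:
            return False, dangerous          # plan breaks here; stop replaying
        if action.danger_when is not None and action.danger_when <= state:
            dangerous = True                 # conditional danger effect fires
        state -= action.del_effects
        state |= action.add_effects
    return goal <= state, dangerous


# Toy domain (assumed for illustration): rinsing a pan is harmful only while it is hot.
turn_off_stove = Action("turn_off_stove", frozenset({"stove_on"}),
                        frozenset({"stove_off"}), frozenset({"stove_on"}))
wait_for_cooling = Action("wait_for_cooling", frozenset({"stove_off", "pan_hot"}),
                          frozenset({"pan_cool"}), frozenset({"pan_hot"}))
rinse_pan = Action("rinse_pan", frozenset({"pan_dirty"}),
                   frozenset({"pan_clean"}), frozenset({"pan_dirty"}),
                   danger_when=frozenset({"pan_hot"}))

initial = {"stove_on", "pan_hot", "pan_dirty"}
goal = {"pan_clean"}

unsafe_result = validate([rinse_pan], initial, goal)
safe_result = validate([turn_off_stove, wait_for_cooling, rinse_pan], initial, goal)
broken_result = validate([wait_for_cooling, rinse_pan], initial, goal)
```

In this toy case the one-step plan reaches the goal but triggers the danger, the three-step plan reaches the goal safely, and the third plan fails a precondition. The verdict depends only on the plan and the symbolic state, which is what makes this style of check reproducible without a judge model.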
The benchmark spans both physical dangers (mechanical, thermal, chemical) and normative dangers (privacy violations, social norm breaches), drawn from five heterogeneous sources. The paper evaluates 23 models and identifies a critical finding: planning ability scales substantially with model size (0.4–99.3% feasibility), while safety awareness remains relatively flat (38–57% safety intention among open-source models). The multiplicative decomposition S ≈ F × SI (R² = 0.99) provides a clean diagnostic framework showing that safety gains at scale are attributable to improved planning, not improved danger avoidance.
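The diagnostic reading of S ≈ F × SI can be sketched in a few lines of Python. The function names and the two example models below are hypothetical; the F and SI values are illustrative numbers chosen inside the ranges quoted above, not results for any specific model. Under the multiplicative model, perfecting planning would raise S to SI, and perfecting safety awareness would raise S to F, so the larger of those two gains marks the bottleneck.

```python
def predicted_safety(feasibility, safety_intention):
    """Predicted safe-completion rate under the multiplicative model S ~ F * SI."""
    return feasibility * safety_intention


def headroom(feasibility, safety_intention):
    """Gain in predicted S if one factor were perfect and the other unchanged."""
    s = predicted_safety(feasibility, safety_intention)
    return {
        "planning": safety_intention - s,      # F -> 1 gives S = SI
        "safety_awareness": feasibility - s,   # SI -> 1 gives S = F
    }


def bottleneck(feasibility, safety_intention):
    """Name of the factor with the larger headroom (ties go to planning)."""
    gains = headroom(feasibility, safety_intention)
    return max(gains, key=gains.get)


# Illustrative (F, SI) pairs, assumed for this sketch only.
small_model = (0.10, 0.45)
large_model = (0.99, 0.55)

for label, (f, si) in [("small", small_model), ("large", large_model)]:
    print(label, round(predicted_safety(f, si), 4), bottleneck(f, si))
```

With these assumed values the small model is limited by planning and the large model by safety awareness, which mirrors the qualitative pattern the paper reports for models near planning saturation.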
2. Methodological Rigor
The methodology is notably well-constructed. Several aspects stand out:
Formal grounding: By encoding dangers as PDDL conditional effects, safety evaluation becomes binary and reproducible — a significant improvement over LLM-as-judge approaches that plague many safety benchmarks. The separation of basic problems (given to the LLM) from safety-augmented problems (used only for validation) is elegant and prevents information leakage.
Metric design: The four metrics (feasibility F, safety S, safety precision SP, safety intention SI) are carefully defined to disentangle planning ability from safety awareness. SI, which relaxes precondition checking to evaluate danger avoidance independently of plan executability, is a particularly clever design choice. The authors provide empirical validation (110,001 model-task pairs with zero false positives/negatives) and manual inspection of 50 stratified samples.
Statistical analysis: Log-linear regressions with bootstrapped confidence intervals, Cohen's d for effect sizes, and the multiplicative decomposition are all appropriate and well-executed. Restricting the scaling regressions to the 18 open-source models with published parameter counts, while still including proprietary models in the decomposition, is methodologically sound.
Potential concerns: The safety intention metric's "relaxation" procedure is acknowledged as theoretically incomplete, though empirically validated. The benchmark generation pipeline uses LLMs (DeepSeek-V3.1) for danger formalization and code generation, introducing potential biases in what dangers are represented. The 32.6% human rejection rate during quality control suggests the automated pipeline's output requires substantial curation. Additionally, the hard split selection based on panel model difficulty introduces a potential circularity — tasks are "hard" because specific models find them hard, which could bias the scaling analysis.
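The four metrics discussed under "Metric design" can be illustrated with a short Python sketch. The exact formulas are assumptions made for this example, since the assessment does not spell them out: F is taken as the share of tasks with a feasible plan, S as the share with a feasible and non-dangerous plan, SP as S divided by F, and SI as the share of tasks whose plan avoids the danger when precondition checking is relaxed. `TaskResult` and `compute_metrics` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    feasible: bool               # plan executes and reaches the goal
    dangerous: bool              # danger triggered during strict validation
    avoids_danger_relaxed: bool  # danger avoided when preconditions are relaxed


def compute_metrics(results):
    """Return F, S, SP, SI as fractions (definitions assumed, see lead-in)."""
    n = len(results)
    if n == 0:
        raise ValueError("need at least one task result")
    n_feasible = sum(r.feasible for r in results)
    n_safe = sum(r.feasible and not r.dangerous for r in results)
    n_intent = sum(r.avoids_danger_relaxed for r in results)
    return {
        "F": n_feasible / n,
        "S": n_safe / n,
        "SP": n_safe / n_feasible if n_feasible else 0.0,
        "SI": n_intent / n,
    }


# Four made-up task outcomes, one per combination of planning and danger avoidance.
toy_results = [
    TaskResult(feasible=True, dangerous=False, avoids_danger_relaxed=True),
    TaskResult(feasible=True, dangerous=True, avoids_danger_relaxed=False),
    TaskResult(feasible=False, dangerous=False, avoids_danger_relaxed=True),
    TaskResult(feasible=False, dangerous=False, avoids_danger_relaxed=False),
]
toy_metrics = compute_metrics(toy_results)
```

The third toy task shows why SI is scored separately: the plan is not executable, yet it still counts toward safety intention because it steers clear of the danger.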
3. Potential Impact
Immediate practical relevance: As LLMs are deployed in real robotic systems (homes, hospitals, warehouses), this work provides the first large-scale, deterministic evaluation of planning-level safety. The finding that near-perfect planning ability coexists with ~28% dangerous plans is alarming and actionable for the robotics community.
Diagnostic utility: The S ≈ F × SI decomposition transforms an opaque leaderboard into a diagnostic tool. Practitioners can now identify whether their model's safety failures stem from planning incompetence or safety unawareness, directing improvement efforts accordingly.
Benchmark infrastructure: At 12,279 tasks with deterministic validation, DESPITE is substantially larger and more rigorous than prior benchmarks (Table 1). The inclusion of normative dangers fills an important gap, and the finding that normative dangers are disproportionately harder for models to recognize has implications beyond robotics.
Alignment research: The observation that three proprietary reasoning models achieve 71–81% SI while open-source models plateau at 38–57% — and that this gap cannot be attributed to scale, reasoning capability, or disclosed alignment methods alone — raises important questions about what training procedures produce safety awareness in embodied contexts.
4. Timeliness & Relevance
This paper addresses a genuine and urgent gap. The deployment of LLM planners in physical systems is accelerating, yet most safety evaluation remains at the semantic level (checking if instructions are harmful). The paper correctly identifies that a semantically benign instruction can produce a physically dangerous plan, and that this planning-level safety assessment is currently underserved. The finding that safety awareness does not scale with model size is particularly timely given industry trends toward ever-larger models.
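A claim such as "safety awareness does not scale with model size" is typically checked with a log-linear fit and a bootstrapped confidence interval on the slope, as mentioned in the statistical analysis above. The sketch below uses only the Python standard library. It is not the authors' analysis code, and the two data series are synthetic toy values invented to show a rising and a roughly flat trend; they are not DESPITE results.

```python
import math
import random


def ols(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def bootstrap_slope_ci(xs, ys, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        if len(set(bx)) < 2:
            continue                      # degenerate resample, slope undefined
        slopes.append(ols(bx, [ys[i] for i in idx])[0])
    slopes.sort()
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic toy data: parameter counts in billions and two made-up metric series.
params_b = [3, 7, 14, 32, 70, 235, 671]
log_params = [math.log10(p) for p in params_b]
toy_rising = [0.05, 0.30, 0.50, 0.70, 0.85, 0.95, 0.99]
toy_flat = [0.40, 0.45, 0.42, 0.50, 0.44, 0.52, 0.47]

for label, series in [("rising", toy_rising), ("flat", toy_flat)]:
    slope, _ = ols(log_params, series)
    print(label, slope, bootstrap_slope_ci(log_params, series))
```

The slope is the change in the metric per tenfold increase in parameters. A linear fit on a metric bounded in [0, 1] is a simplification, and with only a handful of models the bootstrap interval is wide, so it is best read as a rough indication.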
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
This is a well-executed systems paper that makes a timely and important contribution at the intersection of LLM safety and robotics. Its primary strength lies in converting an ill-defined problem (safe embodied planning) into a formally grounded, reproducible evaluation framework. The scaling analysis and multiplicative decomposition provide genuine scientific insight. The main limitation is the symbolic abstraction gap between PDDL planning and real-world robotic deployment, though the authors appropriately frame their results as a lower bound on failure rates.
Generated Apr 21, 2026
Comparison History (44)
Paper 1 introduces a fundamental methodological breakthrough combining generative models with physical forces to accelerate molecular and materials discovery. Its ability to drastically reduce sampling costs while finding out-of-distribution metastable structures has massive, transformative implications for drug discovery and materials science. While Paper 2 presents an important benchmark for AI safety in robotics, Paper 1 provides a foundational, domain-agnostic tool with broader potential for discovering novel, impactful technologies across multiple hard science disciplines.
Paper 2 likely has higher scientific impact: it proposes a novel, unifying framework (GSS) that bridges diffusion generation and random structure search with physical forces, yielding large practical gains (10× lower sampling cost) and demonstrated out-of-distribution effectiveness. This directly enables faster discovery of molecules and materials, with broad applications across chemistry, materials science, catalysis, and condensed matter. The methodological contribution is general and timely for AI-for-science. Paper 1 is important for AI safety in robotics, but is primarily diagnostic/benchmarking rather than providing a new solution paradigm.
Paper 2 introduces a concrete, large-scale benchmark (DESPITE) with 12,279 tasks and evaluates 23 models, providing rigorous empirical evidence for a critical finding: planning ability scales with model size but safety awareness does not. This actionable, reproducible contribution directly impacts robotics and embodied AI deployment. Paper 1, while raising important theoretical points about interaction topology, is a position paper with less empirical grounding. Paper 2's benchmark will likely drive follow-up research and standardize safety evaluation for LLM-based planners, giving it broader and more immediate scientific impact.
Paper 2 challenges fundamental assumptions in AI alignment by shifting focus from individual model safety to interaction topology in multi-agent systems. This paradigm-shifting perspective has broader implications across all agentic AI applications, whereas Paper 1 focuses more specifically on embodied robotics. By identifying systemic pathologies that render model-centric alignment insufficient, Paper 2 is likely to spark widespread theoretical and architectural changes across the broader AI safety community.
Paper 1 has higher impact potential due to its timely, high-stakes relevance to deploying LLMs in robotics, a large benchmark (12,279 tasks) with deterministic validation, and a broad evaluation across 23 models revealing a strong, actionable finding: planning improvements do not translate to safety. This offers immediate real-world implications for safety standards, model development, and policy. Paper 2 is novel architecturally for causal/hypothesis-space restructuring, but is narrower in scope and validated on fewer trials in a more synthetic paradigm, making near-term cross-domain impact less certain.
Paper 1 addresses a critical and timely problem—safety of LLM-based robotic planners—with a large-scale benchmark (12,279 tasks, 23 models) and actionable findings about the disconnect between planning ability and safety awareness. Its breadth of impact spans robotics, AI safety, and LLM deployment policy. The multiplicative relationship finding and the scaling analysis provide concrete guidance for the field. Paper 2, while intellectually interesting in extending developmental psychology paradigms to AI, addresses a narrower problem (causal hypothesis-space restructuring) with a more specialized architectural contribution and smaller scope of applicability.
Paper 2 addresses the critical and timely problem of safety in LLM-based robotic planning, introducing a large-scale benchmark (DESPITE) with clear, actionable findings. Its discovery that scaling improves planning but not safety awareness has broad implications for AI safety policy and robotics deployment. The finding that even frontier models produce dangerous plans ~28% of the time is highly impactful. While Paper 1 offers a rigorous statistical framework for model independence, its scope is narrower (ensemble verification) and its practical gains more incremental. Paper 2's relevance to real-world safety and robotics gives it broader cross-disciplinary impact.
Paper 1 introduces a novel statistical framework to quantify a fundamental, pervasive issue in the LLM ecosystem—behavioral entanglement due to shared data and distillation. Its metrics and practical mitigation strategy for ensemble verification have broader methodological implications across all fields utilizing LLMs, whereas Paper 2 is an empirical benchmark primarily focused on the narrower (though important) domain of embodied robotic planning.
Paper 1 addresses a more fundamental question about whether LLM-based scientific agents genuinely reason scientifically, with broad implications across all domains using AI for research. Its scale (25,000+ runs, 8 domains) and dual analytical framework (performance + epistemological analysis) provide deep methodological rigor. The finding that base models dominate over scaffolding (41.4% vs 1.5% variance) and that scientific reasoning failures persist regardless of context challenges the entire paradigm of autonomous AI science. Paper 2, while valuable for robotics safety, addresses a narrower application domain. Paper 1's implications for AI-generated scientific knowledge and training methodology are more far-reaching.
Paper 2 likely has higher scientific impact: it introduces a large, deterministic benchmark (DESPITE) and provides a clear, scalable empirical finding that planning competence and safety awareness diverge, with strong implications for embodied AI and robotics deployment. The work is timely given rapid adoption of LLM planners and directly informs safety evaluation, model development, and policy. Paper 1 is innovative and potentially impactful for LLM training paradigms, but its coevolution/merging approach may face harder-to-validate generality and adoption barriers compared to a benchmarked safety result with immediate real-world relevance.
Paper 1 introduces a highly novel paradigm for LLM development through task-capability coevolution, moving beyond static pre-training/post-training to an open-ended, continual discovery process. This fundamental methodological innovation has the potential to reshape how frontier models are trained and optimized. While Paper 2 provides a valuable benchmark for embodied AI safety, Paper 1's broader implications for AI self-improvement and capability discovery offer a higher ceiling for transformative scientific impact across the entire field of machine learning.
Paper 2 likely has higher impact due to its direct real-world safety implications for deploying LLM planners in robotics, a timely and high-stakes domain. It introduces a large, deterministic benchmark (DESPITE) enabling reproducible evaluation across models, and reports a clear, actionable finding: planning capability scales while safety awareness stagnates, with quantified dangerous-plan rates even for strong planners. This can influence robotics practice, benchmarking standards, and safety research across embodied AI and alignment. Paper 1 is important diagnostically for AI-for-science, but its implications are more indirect and may be harder to operationalize immediately.
Paper 2 likely has higher impact: it targets embodied/robotic planning safety, a high-stakes, timely deployment domain with direct real-world consequences. DESPITE provides a large, deterministic benchmark enabling reproducible measurement across many models, and the demonstrated decoupling of planning skill from safety awareness offers a clear, generalizable empirical finding with actionable implications for scaling, evaluation, and alignment in robotics. Paper 1 is novel in threat modeling tool deception and provides a useful harness, but its attack setup may be seen as more situational to tool-integrated agents and less immediately tied to physical-world harms than embodied safety failures.
Paper 2 likely has higher scientific impact due to its timeliness and broad relevance: it targets safety risks in LLM-based embodied planning/robotics, a high-stakes deployment area. It contributes a large, deterministic benchmark (DESPITE), evaluates many models, and uncovers a clear empirical pattern (planning scales, safety awareness doesn’t; multiplicative relationship), offering actionable guidance for the field. Paper 1 is methodologically interesting but more incremental within domain adaptation for LLM reasoning and likely narrower in cross-field and real-world safety implications.
Paper 2 addresses the critical and timely issue of safety in LLM-based robotic planning, introducing a comprehensive benchmark (DESPITE) with 12,279 tasks evaluated across 23 models. Its findings—that scaling improves planning but not safety awareness—have broad implications for AI safety, robotics deployment, and policy. The multiplicative relationship between planning and safety is a novel insight with immediate practical relevance. Paper 1 presents an interesting architectural contribution for math reasoning but is narrower in scope, evaluated on a single domain, and the optimal latent selection results (70%) represent an upper bound rather than practical performance.
Paper 1 addresses a highly timely and critical issue—the safety of LLMs in embodied AI. By introducing a massive benchmark and comprehensively evaluating 23 models, it exposes a crucial scaling discrepancy between planning ability and safety awareness. This broad, rigorous evaluation across frontier models offers immediate, high-impact insights for the rapidly growing fields of LLMs and robotics, arguably outstripping the narrower, though innovative, neuro-symbolic motion modeling approach of Paper 2.
Paper 1 addresses a highly critical and timely issue—safety in LLM-driven robotics—with a comprehensive benchmark across numerous frontier models. Its findings on the scaling behavior of safety versus planning ability have broad, immediate implications for AI safety and embodied AI. Paper 2 offers an interesting theoretical connection for BNNs and explainability, but its scope is much more niche and lacks the broad real-world applicability and urgency of Paper 1.
Paper 1 introduces a methodological breakthrough in non-autoregressive language modeling, claiming a 40x speedup over autoregressive baselines without sacrificing generation quality. Given the immense computational costs and inference bottlenecks of current large language models, a paradigm shift toward efficient flow-matching generation could fundamentally transform both AI research and commercial deployment. While Paper 2 provides a valuable safety benchmark for embodied AI, Paper 1's proposed algorithmic innovation addresses a more universal and pressing bottleneck across the entire generative AI landscape, giving it higher potential for widespread scientific and practical impact.
Paper 2 has higher likely scientific impact due to its timeliness and cross-disciplinary relevance (LLMs, robotics, AI safety), plus a clear, scalable benchmark (DESPITE) enabling reproducible evaluation across many models. It identifies a systematic and quantitatively characterized failure mode (planning vs safety awareness decoupling) with direct real-world deployment implications and policy relevance. Paper 1 is methodologically solid and useful for EEG decoding, but its novelty is more incremental within a narrower field, and its broader impact is likely more limited than a widely applicable safety benchmark for embodied LLM planning.
Paper 1 likely has higher impact: it introduces a sizable, deterministic benchmark (DESPITE) for safety in LLM-based embodied planning and reveals a striking disconnect between planning competence and safety awareness across many models. The findings are timely for real-world robotics deployment, directly inform safety evaluation/mitigation, and can influence both ML safety and robotics communities. Paper 2 is useful and methodical for efficient RLVR in low-data regimes, but its contribution is primarily an empirical scaling/recipe study on procedural tasks with narrower immediate societal stakes and less cross-domain urgency than safety risks in embodied systems.