SADP: Subgoal-Aware Diffusion Policy for Explainable Robots Learned from Foundation Model Generated Demonstrations
Site Hu, Takato Horii
Abstract
Explainable robots require not only successful task execution but also the ability to expose their internal decision-making process in a user-friendly manner. However, most imitation learning methods are trained solely on task-level demonstrations, without explicitly modeling subgoal structure or execution progress. This limitation is further exacerbated by the scarcity of subgoal-level supervision in standard robot learning datasets, which restricts the development of robots that can convey the subtasks they are executing during long-horizon manipulation. To address this issue, this paper proposes Subgoal-Aware Diffusion Policy (SADP), a framework that leverages foundation models to autonomously generate subgoal-annotated demonstrations and trains diffusion policies on these datasets. SADP structures policy execution around human-interpretable subgoals by conditioning action generation on both task-level and subgoal-level descriptions. A lightweight auxiliary head further predicts subgoal completion states, allowing the robot to expose its current execution stage and monitor subgoal progression. Experiments in RLBench simulations and real-world evaluations on a UR5e robot demonstrate that SADP achieves higher task success rates than strong task-conditioned diffusion baselines, while providing subgoal-level execution signals for monitoring progress and diagnosing failures. These results highlight that built-in, rather than post-hoc, interpretability can coexist with high task performance.
AI Impact Assessments
Scientific Impact Assessment: SADP: Subgoal-Aware Diffusion Policy for Explainable Robots
1. Core Contribution
SADP introduces a framework that bridges explainability and task performance in robot manipulation by embedding human-interpretable subgoal structure directly into a diffusion policy. The key insight is that natural-language subgoals—generated automatically by foundation models—should serve as both the structural backbone for policy execution and the explanation mechanism for users. The framework has two main components: (1) an automated data generation pipeline that produces demonstrations annotated with subgoal descriptions and completion labels using LLMs/VLMs, and (2) a diffusion policy conditioned on both task-level and subgoal-level descriptions, augmented with a binary completion prediction head.
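To make the role of the completion prediction head concrete, the following is a minimal sketch of how a subgoal-conditioned policy could be stepped through an ordered list of subgoals, advancing whenever the predicted completion probability crosses a threshold. It is an illustration under stated assumptions, not the authors' implementation: `PolicyOutput`, `run_episode`, `toy_demo`, the 0.5 threshold, and the example task and subgoal strings are all hypothetical, and the learned diffusion policy is replaced by a toy stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class PolicyOutput:
    """Hypothetical container for one policy call."""
    action: Sequence[float]   # action proposed for the current step
    completion_prob: float    # auxiliary head: P(current subgoal is complete)


# Stand-in type for the learned policy:
# (observation, task text, subgoal text) -> PolicyOutput
Policy = Callable[[Sequence[float], str, str], PolicyOutput]


def run_episode(
    policy: Policy,
    get_obs: Callable[[], Sequence[float]],
    apply_action: Callable[[Sequence[float]], None],
    task: str,
    subgoals: List[str],
    threshold: float = 0.5,   # assumed value, not taken from the paper
    max_steps: int = 200,
) -> Tuple[List[Tuple[int, str]], bool]:
    """Run subgoals in order, advancing when the completion head fires.

    Returns a log of (step, active subgoal) pairs, which is the signal a user
    could read to see what the robot is doing, and a flag that is True only
    if every subgoal was reported complete.
    """
    log: List[Tuple[int, str]] = []
    idx = 0
    for step in range(max_steps):
        if idx >= len(subgoals):
            return log, True
        out = policy(get_obs(), task, subgoals[idx])
        log.append((step, subgoals[idx]))
        apply_action(out.action)
        if out.completion_prob >= threshold:
            idx += 1
    return log, idx >= len(subgoals)


def toy_demo() -> Tuple[List[Tuple[int, str]], bool]:
    """Toy environment: a step counter; each subgoal 'takes' three steps."""
    state = {"t": 0}

    def get_obs() -> Sequence[float]:
        return [float(state["t"])]

    def apply_action(action: Sequence[float]) -> None:
        state["t"] += 1

    def toy_policy(obs: Sequence[float], task: str, subgoal: str) -> PolicyOutput:
        # Report completion on every third step; a real head would be learned.
        done = int(obs[0]) % 3 == 2
        return PolicyOutput(action=[0.0], completion_prob=0.9 if done else 0.1)

    return run_episode(
        toy_policy, get_obs, apply_action,
        task="put the block in the drawer",
        subgoals=["open the drawer", "pick up the block",
                  "place the block in the drawer"],
    )
```

Calling `toy_demo()` returns the per-step subgoal log together with a success flag. If the step budget runs out first, the last logged subgoal indicates where execution stalled, which is the kind of failure diagnosis the assessment refers to.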
The contribution is primarily integrative rather than fundamentally novel in any single dimension. The data generation pipeline extends prior work (TARAD/Hu et al.), the diffusion policy builds on DP3, and the use of foundation models for task decomposition is well-established. The novelty lies in how these components are combined to create an interpretable-by-design policy that doesn't sacrifice performance.
2. Methodological Rigor
Strengths:
Weaknesses:
3. Potential Impact
The paper addresses a genuine gap: most imitation learning policies are opaque during execution, and adding interpretability post-hoc may not faithfully reflect the decision process. The idea of building interpretability into the policy architecture is conceptually appealing and relevant to HRI applications.
However, the practical impact may be limited by several factors:
4. Timeliness & Relevance
The paper is timely in several respects: (1) diffusion policies are rapidly gaining traction in robot learning; (2) foundation models are increasingly used for robot data generation; (3) explainability in robotics is receiving growing attention as robots are deployed in human-facing applications. The intersection of these three trends is relatively underexplored, giving SADP some novelty in positioning.
However, the paper appears somewhat disconnected from the rapidly advancing VLA model landscape (OpenVLA, π0, CoT-VLA), which is beginning to address similar issues through chain-of-thought reasoning and hierarchical execution within much larger architectures. SADP's lightweight approach may be practical but could quickly be superseded.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
SADP presents a reasonable engineering contribution that combines several existing components (diffusion policies, foundation model data generation, subgoal decomposition) in a coherent framework. The interpretable-by-design philosophy is well-motivated, but the execution falls short of the paper's ambitions. The quantitative evidence for performance improvement is weak, and the explainability claims lack empirical validation through user studies. The paper would benefit substantially from a user study, more rigorous statistical analysis, and evaluation on more complex tasks.
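As one illustration of what more rigorous statistical reporting could look like, success rates from a fixed number of trials can be given with a Wilson score interval rather than as bare percentages. The sketch below is a generic example, not an analysis from the paper: the function name `wilson_interval` and the trial counts are hypothetical and chosen only for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in [0, trials]")
    p = successes / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Hypothetical counts, NOT results from the paper: 18/25 versus 21/25 successes.
baseline_ci = wilson_interval(18, 25)
proposed_ci = wilson_interval(21, 25)
intervals_overlap = proposed_ci[0] < baseline_ci[1]
```

With these made-up counts the two intervals overlap, which shows why a gap of a few successes over a small number of trials is weak evidence on its own.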
Generated May 19, 2026
Comparison History (22)
Paper 2 (SADP) has higher estimated impact due to a more novel and timely combination of diffusion policies, foundation-model-generated subgoal annotations, and built-in interpretability—addressing a broadly relevant bottleneck in long-horizon robot learning (data scarcity + explainability). Its approach is likely transferable across many manipulation platforms and tasks, impacting robotics, imitation learning, HRI, and trustworthy AI. Paper 1 is strong and rigorous for humanoid whole-body control and modular planning, but its contributions are more domain-specific (humanoids/EE-root interface) and may have narrower cross-field adoption.
Paper 1 (SADP) addresses a more fundamental challenge in robot learning—combining explainability with high task performance through subgoal-aware diffusion policies. It introduces a novel framework leveraging foundation models for automatic subgoal annotation, addresses the scarcity of subgoal-level supervision, and demonstrates that built-in interpretability can coexist with strong performance. This has broader impact across imitation learning, explainable AI, and long-horizon manipulation. Paper 2 (NORM-Nav) is solid but more incremental, integrating LLM-parsed constraints into existing costmap planners for socially-aware navigation, a narrower contribution with less methodological novelty.
Paper 2 (SADP) addresses a broader and more timely research gap at the intersection of explainable AI and robot learning, leveraging foundation models for subgoal-aware policy learning. This combines multiple high-impact trends (diffusion policies, foundation models, explainability) in a novel way with broader applicability across manipulation tasks. Paper 1 (Mono-Hydra++) is a strong systems paper with solid engineering contributions for monocular scene graph construction, but is more incremental—combining existing components (DINOv3, VIO, volumetric fusion) into a pipeline. Paper 2's core insight that built-in interpretability can coexist with high performance has wider implications for trustworthy robotics.
Paper 2 (XDiffuser) offers a more fundamental and broadly applicable contribution by introducing extrinsic graph-based search to guide diffusion planning, addressing a core limitation of compositional diffusion models for long-horizon tasks. Its ability to handle unseen combinatorial tasks (multi-agent coordination, TSP-style reasoning) at test time via classical algorithms demonstrates greater generality and novelty. Paper 1 (SADP) makes a solid contribution to explainable robotics but is more incremental, combining existing components (foundation models, diffusion policies) for subgoal annotation. Paper 2's broader applicability across planning domains suggests higher cross-field impact.
Paper 1 integrates highly influential current trends—foundation models and diffusion policies—to address the critical challenge of explainability in robotics. Its approach to autonomously generating subgoal-level supervision offers a scalable solution to dataset limitations, potentially impacting multiple areas of imitation learning and human-robot interaction. Paper 2 presents a practical and effective method for 3D scene graph generation, but its scope is comparatively narrower and relies on more established paradigms, making Paper 1's methodological innovations more likely to spark widespread follow-up research.
Paper 2 (SADP) has higher estimated impact due to combining three timely directions—diffusion policies, foundation-model-generated data, and built-in interpretability via subgoals—into a broadly applicable framework for long-horizon manipulation. It addresses a major bottleneck (lack of subgoal supervision) with a scalable data-generation pipeline and yields user-facing benefits (progress monitoring/diagnosis) alongside performance gains, increasing real-world deployment relevance. GuidedVLA is novel in attention-head specialization, but relies on manual auxiliary signals and may generalize less broadly across tasks/domains than a subgoal-centric, explainability-oriented policy design.
Paper 1 addresses a critical bottleneck in developing generalist robot foundation models: multi-task scaling and avoiding task-specific overfitting during RL fine-tuning. By proposing a framework for cross-task feature representation, it directly contributes to the highly impactful pursuit of general-purpose Vision-Language-Action (VLA) models. While Paper 2 offers valuable contributions to explainability and long-horizon manipulation, Paper 1's focus on cross-task scaling principles has broader methodological implications for the foundational architecture and training paradigms of large-scale robotic models.
SADP addresses the broadly important intersection of explainability and robot learning, proposing a novel framework that integrates foundation models for subgoal-annotated demonstration generation with diffusion policies. Its contribution—showing that built-in interpretability can coexist with high performance—has broader implications across robotics, HRI, and trustworthy AI. While SEDualVLN achieves SOTA on VLN-CE benchmarks, it is more incremental, combining known paradigms (dual-system, spatial mapping) in a specific navigation domain. SADP's methodological novelty (subgoal-aware diffusion, foundation model-generated supervision) and real-world validation suggest wider cross-field impact.
Paper 2 (DeMiAn) has higher potential impact because it addresses a fundamental scaling bottleneck in robot learning—extracting more signal from existing data without collecting new demonstrations. Its approach is validated at significantly larger scale (1M+ clips, 50K videos), introduces a generalizable multi-aspect annotation framework applicable across different policy architectures (VLA and world-action models), and demonstrates improvements on composite and OOD tasks. Paper 1 (SADP) makes a solid contribution on explainability via subgoal conditioning, but its scope is narrower and experiments are smaller-scale. DeMiAn's positioning as a practical scaling lever gives it broader applicability across the field.
Paper 2 demonstrates higher potential scientific impact due to its cross-disciplinary bridging between robotics, cognitive science, and clinical neuroscience. It provides a novel theoretical insight—that a reactive robotics model without planning captures human cognitive failures better than planning models for impaired populations—suggesting deep structural parallels between robotic and biological systems. This has broad implications for understanding cognition, clinical assessment tools, and embodied AI theory. Paper 1, while solid engineering work combining foundation models with diffusion policies for explainability, represents a more incremental contribution within robot learning.
Paper 1 introduces a novel approach to combining foundation models with diffusion policies for robotic manipulation, addressing the critical challenges of explainability and long-horizon task execution. Its methodological innovation in autonomously generating subgoal-annotated data and improving task success rates has broader implications for scalable, interpretable robot learning compared to Paper 2's specific application of VLMs for emotion recognition in human-robot collaboration.
Paper 2 addresses a critical bottleneck in robotic imitation learning—data scarcity for subgoals—by leveraging foundation models. Its combination of diffusion policies and explainability for long-horizon tasks aligns with major trends in embodied AI and has broader, more impactful applications compared to Paper 1's narrow focus on specific mechanical actions (toppling) in tabletop planning.
Paper 1 addresses a broader and more timely challenge at the intersection of foundation models, diffusion policies, and explainable AI for robotics—all rapidly growing areas. Its contribution of integrating subgoal-level interpretability directly into policy learning is novel and broadly applicable across robot learning. Paper 2 makes a solid contribution to AV safety validation but addresses a narrower domain. Paper 1's framework connecting foundation models to structured robot behavior with built-in explainability has higher potential to influence multiple research communities and inspire follow-up work.
SADP introduces a novel framework combining foundation models, subgoal-aware conditioning, and diffusion policies to achieve both high performance and built-in explainability—a relatively underexplored intersection. It addresses the important challenge of interpretable robot decision-making with a principled approach (subgoal structure from foundation models), validated in both simulation and real-world settings. While DexJoCo provides a valuable benchmark for dexterous manipulation, benchmarks typically have narrower impact unless they become widely adopted. SADP's contributions span explainable AI, imitation learning, and foundation model integration, giving it broader cross-field relevance and higher novelty.
Paper 2 has higher potential impact due to its novelty in combining foundation-model–generated subgoal annotations with diffusion policies to achieve built-in interpretability for long-horizon manipulation, a timely and fast-growing area in robotics/AI. It offers broad applicability across imitation learning, human-robot interaction, and reliable deployment via progress monitoring/diagnostics, and aligns with current interest in foundation models and explainability. Paper 1 is solid and rigorous but more specialized (planar rigid-motion reconstruction/NBV), with narrower cross-field reach and less general real-world adoption potential.
Paper 2 likely has higher impact due to its strong novelty (constraint-based representation enabling one-shot generalization in contact-rich, multi-stage tasks), clear real-world relevance, and extensive real-robot validation across seven tasks with high success. Its methodological framing around environmental constraints is broadly applicable (robotics, manipulation, LfD, control, compliant interaction) and addresses a central bottleneck—generalization under contact and unmodeled dynamics. Paper 1 is timely and valuable for explainability and diffusion policies, but depends on foundation-model-generated subgoal labels and is more incremental within imitation/diffusion-policy trends.
Paper 2 has higher potential scientific impact due to its integration of cutting-edge foundation models with diffusion policies to solve the critical challenges of long-horizon robotic manipulation and explainability. By automating the generation of subgoal annotations and creating built-in interpretable policies, it addresses major data scarcity bottlenecks in imitation learning. Its approach has broader applicability across embodied AI and human-robot interaction, whereas Paper 1 focuses on a more specialized, albeit important, problem of pose estimation and symmetry.
Paper 2 likely has higher scientific impact due to its novelty in combining foundation-model-generated subgoal annotations with diffusion policies to achieve built-in interpretability, addressing a timely and widely relevant problem in robot learning (long-horizon manipulation, monitoring, explainability). Its applicability spans many robots and tasks beyond a specific platform, and it evaluates in both simulation and real hardware. Paper 1 is valuable engineering for microgravity manipulation but is more domain-specific (Astrobee/ISS) and thus likely narrower in breadth and downstream adoption.
Paper 2 addresses fundamental challenges in robot learning—explainability, long-horizon manipulation, and data scarcity—by leveraging foundation models to enhance diffusion policies. Its approach to built-in interpretability has broader applicability across imitation learning and human-robot interaction, whereas Paper 1 focuses on a more specialized, albeit novel, domain of deformable object manipulation and clay sculpting.
Paper 2 likely has higher impact due to stronger timeliness (diffusion policies, foundation models, interpretability), broader cross-field relevance (robot learning, HCI/explainability, LLM-generated data, long-horizon manipulation), and clearer real-world applicability via both simulation and UR5e results. Its core idea—using foundation models to generate subgoal-annotated demos and training subgoal-aware diffusion policies—could generalize across many tasks and platforms. Paper 1 is novel and rigorous with polynomial-time optimal planning, but it targets a narrower tabletop block setting and may have more limited transfer beyond tightly packed uniform grids.