Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
Dillon Z. Chen, Till Hofmann, Toryn Q. Klassen, Sheila A. McIlraith
Abstract
We tackle the challenge of building embodied AI agents that can reliably solve long-horizon planning problems. Imitation learning from demonstrations has shown itself to be effective in training robots to solve a diversity of complex tasks requiring fine motor control and manipulation over low-level (LL), continuous environments. Yet, it remains a difficult endeavour to generate long-horizon plans from imitation learning alone. In contrast, high-level (HL), symbolic abstractions facilitate efficient and interpretable long-horizon planning. We propose to combine the strengths of LL imitation learning for manipulation and control, and HL symbolic abstractions for long-horizon planning. We realise this idea via bilevel policies of the form (πhl, πll), consisting of a neural policy πll learned from LL demonstrations, and an HL symbolic policy πhl that is constructed from symbolic abstractions of the LL demonstrations combined with inductive generalisation. We implement these ideas in the BISON system. Experiments on extended MetaWorld benchmarks demonstrate that BISON generalises to long horizons and problems with greater numbers of objects than those solved by VLA and end-to-end methods, and is more time and memory efficient in training and inference. Notably, when ignoring LL execution, BISON's HL policies can solve HL problems with 10,000 relevant objects in under a minute. Project page: https://dillonzchen.github.io/bison
AI Impact Assessments
Scientific Impact Assessment: Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
1. Core Contribution
BISON introduces a framework for learning bilevel policies (πhl, πll) that decompose long-horizon planning into symbolic high-level (HL) reasoning and neural low-level (LL) execution. The key novelty lies in how the HL policy is constructed: rather than using a symbolic planner at test time (as in traditional TAMP), BISON extracts first-order condition-action rules from abstracted demonstrations via goal regression and inductive generalization. This yields compact, interpretable policies that generalize to arbitrary numbers of objects without search at inference time. The LL policy is a lightweight GNN (<33K parameters) conditioned on HL actions.
The two-fold novelty is well-articulated: (1) replacing search-based HL planning with inductively generalized symbolic policies learned from demonstrations, and (2) using first-order condition-action rules that support open-world reasoning and scale to arbitrary object counts—a meaningful departure from existing bilevel planning approaches that rely on problem-specific plan computation.
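To make the idea of a search-free HL policy built from first-order condition-action rules more concrete, here is a minimal Python sketch. It is not BISON's implementation: the predicates (`on_table`, `holding`, `in_box`, `hand_empty`), the two rules, the toy transition model, and the helper names (`hl_policy`, `apply_action`, `rollout`) are all hypothetical, chosen only to show how rules with variables can be matched against a state and executed without a planner.

```python
from itertools import product

# A state is a set of ground atoms such as ("on_table", "a").
# Each rule is (variables, condition atoms, action schema). These two rules
# are hypothetical examples, not rules learned by BISON.
RULES = [
    (("?x",), [("holding", "?x")], ("place_in_box", "?x")),
    (("?x",), [("on_table", "?x"), ("hand_empty",)], ("pick", "?x")),
]


def substitute(atom, binding):
    """Replace variables in an atom with the objects they are bound to."""
    return tuple(binding.get(term, term) for term in atom)


def hl_policy(state, objects):
    """Return the first ground action whose rule conditions hold, else None."""
    for variables, conditions, action in RULES:
        for values in product(objects, repeat=len(variables)):
            binding = dict(zip(variables, values))
            if all(substitute(c, binding) in state for c in conditions):
                return substitute(action, binding)
    return None


def apply_action(state, action):
    """Toy deterministic HL transition model (an assumption of this sketch)."""
    name, obj = action
    state = set(state)
    if name == "pick":
        state -= {("on_table", obj), ("hand_empty",)}
        state.add(("holding", obj))
    elif name == "place_in_box":
        state.discard(("holding", obj))
        state |= {("in_box", obj), ("hand_empty",)}
    return state


def rollout(objects):
    """Follow the rule-based policy until no rule fires; no search involved."""
    state = {("on_table", o) for o in objects} | {("hand_empty",)}
    plan = []
    while (action := hl_policy(state, objects)) is not None:
        plan.append(action)
        state = apply_action(state, action)
    return plan, state


plan, final_state = rollout(["a", "b"])
print(plan)
# [('pick', 'a'), ('place_in_box', 'a'), ('pick', 'b'), ('place_in_box', 'b')]
```

Because the rules quantify over objects rather than naming them, the same two rules apply unchanged to 2 or 50 objects, and each decision is a rule match rather than a search over plans. This is only an illustration of why such policies can scale with object count, under the toy assumptions above.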
2. Methodological Rigor
Theoretical foundations are solid. The paper formalizes the Nondeterministic Downward Refinement Property (NDRP), provides a formal definition of goal regression in the nondeterministic setting, and proves Theorem 1 establishing conditions (C-bounded goal independence) under which the learned HL policies generalize to arbitrary problem sizes. The proof construction via equivalence classes under object renaming is elegant and the exponential bound in Equation (8) is honestly characterized.
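For readers unfamiliar with goal regression, the following is a minimal sketch of the classical, deterministic STRIPS form of the operation. This is a simplification and an assumption of this note: the paper defines regression for the nondeterministic setting, which is not reproduced here. The function name `regress` and the example atoms are hypothetical.

```python
def regress(goal, pre, add, delete):
    """Classical STRIPS regression of a goal (set of atoms) through one action.

    Returns the subgoal that must hold before the action so that `goal` holds
    after it, or None if the action deletes part of the goal or achieves
    none of it.
    """
    goal, pre, add, delete = map(frozenset, (goal, pre, add, delete))
    if (goal & delete) or not (goal & add):
        return None
    return (goal - add) | pre


# Hypothetical action: place object "a" in the box.
subgoal = regress(
    goal={("in_box", "a")},
    pre={("holding", "a")},
    add={("in_box", "a"), ("hand_empty",)},
    delete={("holding", "a")},
)
print(subgoal)  # frozenset({('holding', 'a')})
```

Regressing the goal backwards through a demonstrated plan in this way yields, at each step, the conditions under which the demonstrated action is useful, which is the kind of information a condition-action rule needs.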
Experimental design covers 21,600 episodes across 8 environments, 9 methods (including 2 VLAs), and 3 seeds, which is reasonably comprehensive. The environments extend MetaWorld with uncertainty attributes (exogenous, endogenous, state uncertainty), which is appropriate for testing robustness. However, several concerns arise:
3. Potential Impact
Scalability of HL reasoning is the standout result: solving problems with 10,000 objects in seconds (vs. LAMA timing out at hundreds) demonstrates that learned symbolic policies can dramatically outperform classical planners for certain problem structures. This has implications for logistics, warehouse robotics, and large-scale resource allocation where symbolic planning is the bottleneck.
Interpretability is a genuine advantage. The learned condition-action rules (Appendix C.4) are human-readable and can be verified, which matters for safety-critical deployment. The automatic LLM-generated interpretations add a practical dimension.
Bridging communities: The paper meaningfully connects symbolic AI planning, graph neural networks, and imitation learning. The bilevel policy framework could serve as a template for future systems that need both scalable reasoning and fine motor control.
However, the practical impact is tempered by the strong assumptions: object-centric state representations, known domain theories, and deterministic labelling functions are not readily available in unstructured real-world settings.
4. Timeliness & Relevance
The paper is highly timely. As VLAs and foundation models struggle with compositional, long-horizon reasoning (as the paper's own experiments confirm), there is renewed interest in structured, neuro-symbolic approaches. The paper positions itself well against the current trend of scaling up end-to-end models, offering a complementary path that leverages domain structure.
The comparison against SmolVLA (a recent 2025 model) and the use of MetaWorld+ (2025) benchmarks demonstrate engagement with current state-of-the-art. The connection to ongoing work on learning symbolic abstractions from vision-language models further enhances relevance.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Summary
BISON presents a clean, theoretically grounded approach to long-horizon planning that achieves impressive scalability in the HL reasoning component and demonstrates meaningful advantages over both end-to-end and traditional planning baselines. The main contribution—replacing search-based symbolic planning with learned, generalizable first-order policies—is novel and impactful for the bilevel planning community. However, the strong assumptions on inputs, the simplistic LL policy, and the limited benchmark complexity constrain the near-term practical impact.
Generated May 18, 2026
Comparison History (22)
Paper 2 likely has higher scientific impact due to a more novel systems-level contribution (bilevel policies combining learned low-level control with symbolic high-level planning), stronger real-world applicability to embodied robotics, and broader cross-field relevance (robot learning, planning, symbolic-neural integration). Its claims suggest scalability to very large object counts and improved efficiency, which is timely for long-horizon agent research. Paper 1 is rigorous and useful diagnostically, but its impact is narrower (evaluation/analysis of LLM math failures) and less directly enabling for downstream capabilities.
Paper 1 addresses a fundamental challenge in embodied AI—long-horizon planning—with a novel bilevel approach combining symbolic reasoning and imitation learning. Its demonstrated scalability (10,000 objects) and generalization capabilities represent significant advances with broad applications in robotics and AI planning. Paper 2, while addressing a meaningful gap in mental health AI by introducing audio modality for CBT distress estimation, is more niche in scope, presents an evaluation benchmark rather than a methodological breakthrough, and has a smaller dataset (1,802 turns). Paper 1's broader applicability across robotics, planning, and AI gives it higher potential impact.
Paper 1 addresses a critical, timely bottleneck in the widespread deployment of agentic AI: accountability and safety. By formalizing explicit provenance and responsibility, it offers foundational contributions that span machine learning, systems engineering, and AI policy. While Paper 2 presents strong empirical advances in embodied AI, Paper 1 has a broader potential impact across multiple disciplines and the entire AI ecosystem.
Paper 1 addresses a timely and practically important question about compound LLM agent design in adversarial settings, providing actionable design principles backed by rigorous empirical evaluation across multiple models and configurations. The identification of 'deliberation cascades' as a destructive pattern and the finding that programmatic infrastructure outperforms deeper reasoning offers immediately useful guidance for the rapidly growing LLM agent community. Paper 2 presents solid work on bilevel planning but combines relatively established ideas (symbolic planning + imitation learning) in a more incremental fashion. Paper 1's broader relevance to the booming LLM agent ecosystem gives it higher potential impact.
Paper 1 addresses a critical and ubiquitous problem in LLM agents (agent drift/loops) using a highly novel, inference-time mechanistic interpretability approach. By applying activation steering dynamically, it offers a scalable, retraining-free solution to improve long-horizon reasoning. While Paper 2 presents a strong neurosymbolic framework for embodied AI, Paper 1's method is likely to see broader and more rapid adoption across the rapidly growing field of autonomous software engineering and general LLM agents.
SaaS-Bench addresses a critical gap in evaluating computer-use agents on realistic professional workflows using real SaaS systems. With 23 deployable systems across 6 domains and 106 tasks, it provides a much-needed benchmark revealing that even state-of-the-art agents complete fewer than 4% of tasks. This has broad impact across the AI agent community, informing future research directions. Paper 2 (BISON) presents a solid bilevel planning approach but is more incremental, combining known ideas (symbolic planning + imitation learning) in a relatively narrow robotics setting. SaaS-Bench's timeliness amid the CUA boom gives it higher impact potential.
Paper 1 offers higher potential scientific impact by addressing a critical bottleneck in Embodied AI: long-horizon planning. By elegantly combining low-level continuous imitation learning with high-level discrete symbolic abstractions, it achieves massive scalability (handling up to 10,000 objects). This neuro-symbolic approach bridges a fundamental gap between deep learning and classical planning. While Paper 2 presents a timely method for LLM social agents, Paper 1's methodology has broader real-world applications in robotics and automation, representing a more significant architectural leap with rigorous empirical improvements over state-of-the-art end-to-end models.
Paper 1 addresses a broader and more timely challenge in embodied AI—combining symbolic reasoning with neural imitation learning for long-horizon planning. Its bilevel policy framework (BISON) demonstrates strong scalability (10,000 objects) and generalization, with practical robotics applications. It bridges symbolic AI and deep learning, impacting multiple communities (robotics, planning, ML). Paper 2 makes a solid but more incremental contribution to a well-studied scheduling problem (RCPSP), reformulating it via Petri nets with A* search. While rigorous, its impact is narrower, primarily within operations research and combinatorial optimization.
Paper 2 has higher estimated impact due to stronger cross-field breadth and clearer real-world applicability: it unifies symbolic planning with learned low-level control for long-horizon embodied tasks, addressing a core robotics bottleneck (scalable planning and generalization). The bilevel/symbolic world-model approach is timely and can transfer to robotics, simulation, and hybrid neuro-symbolic AI, with compelling scalability claims (e.g., 10k objects). Paper 1 advances LLM test-time compute via multi-agent memory/RL, but is more incremental within a fast-moving niche and less directly tied to physical deployment.
Paper 1 likely has higher impact: it advances a principled hybrid paradigm (neural low-level imitation + symbolic high-level planning) with clear scalability claims (long horizons, many objects) and broad relevance to robotics, planning, and neuro-symbolic AI. Its methodological contribution (bilevel policy construction from demonstrations plus inductive generalization) is more general than a task-specific protocol. Paper 2 is timely and practically useful for LLM agents, but evidence is confined to a single benchmark/attacker setting and relies on prompt-memory evolution, which may generalize less and be more sensitive to evaluation artifacts.
Paper 2 proposes a highly scalable neuro-symbolic framework (BISON) that effectively bridges the gap between low-level robotic control and high-level long-horizon planning. While Paper 1 introduces a valuable benchmark for evaluating LLM scientific reasoning, Paper 2 offers a concrete methodological breakthrough in Embodied AI that fundamentally solves scaling issues in complex environments (handling up to 10,000 objects), offering immense real-world application potential in robotics and automation.
Paper 2 (BISON) presents a novel architectural contribution—bilevel policies combining symbolic reasoning with neural imitation learning—that addresses a fundamental challenge in embodied AI (long-horizon planning). It demonstrates strong scalability (10,000 objects) and efficiency gains over VLA methods, with broader methodological impact across robotics and AI planning. Paper 1 (π-Bench) introduces a useful benchmark for proactive agents but is more incremental, primarily evaluating existing LLM capabilities rather than proposing new methods. Benchmarks typically have narrower impact unless they become widely adopted standards.
Paper 1 proposes a broad, paradigm-shifting framework (metacognitive AI) that addresses fundamental issues like accuracy, security, and efficiency across various AI domains. While Paper 2 offers a strong, specific technical solution for embodied AI, Paper 1's conceptual novelty, cross-disciplinary inspiration, and provided software framework give it a higher potential for widespread, foundational impact across the entire field of artificial intelligence.
Paper 2 (BISON) presents a concrete, implemented system that combines symbolic planning with neural imitation learning for long-horizon embodied AI tasks, demonstrating impressive scalability (10,000 objects). It addresses a fundamental challenge in robotics/embodied AI with a novel bilevel architecture and provides reproducible results with a project page. Paper 1 provides an interesting empirical evaluation of LLMs for goal recognition but is primarily an evaluation study without a new method, offering diagnostic insights rather than a new capability. Paper 2's methodological contribution and practical applicability give it broader and deeper potential impact.
Paper 1 offers a highly scalable and practical neurosymbolic approach to a major bottleneck in embodied AI (long-horizon planning). Its empirical demonstration of scaling to 10,000 objects and improvements in time/memory efficiency over state-of-the-art end-to-end methods suggest broad and immediate utility in robotics. While Paper 2 provides an innovative formalization for AI ethics, Paper 1's concrete methodological advancements and direct impact on foundational capabilities in autonomous agents give it higher potential for widespread scientific and real-world impact.
SimPersona introduces a novel framework addressing a significant gap in e-commerce AI—learning discrete buyer personas from raw clickstreams rather than hand-crafted prompts. Its evaluation on 8.37M buyers across 42 live storefronts demonstrates strong real-world applicability and scale. The combination of VQ-VAE with LLM persona tokens is innovative, and the open-source data pipeline increases reproducibility and adoption potential. While Paper 2 (BISON) makes solid contributions to long-horizon planning via bilevel policies, it addresses a more incremental advance in robotics planning benchmarks. SimPersona's broader commercial applicability and novel personalization methodology give it higher potential impact.
Paper 2 addresses a fundamental and highly active challenge in embodied AI: long-horizon planning. By effectively combining low-level neural policies with high-level symbolic reasoning, it offers a scalable neurosymbolic approach that significantly outperforms existing end-to-end methods. This methodological advancement has broad applicability across robotics and AI. While Paper 1 is highly valuable for industrial chemistry, Paper 2's fundamental contributions to general-purpose AI planning suggest a broader and potentially more transformative scientific impact across multiple disciplines.
Paper 2 addresses a fundamental bottleneck in embodied AI—long-horizon planning—by proposing a neuro-symbolic architecture that elegantly bridges low-level continuous control with high-level symbolic reasoning. This foundational approach offers robust generalization and massive scalability (up to 10,000 objects), providing a more substantial methodological innovation than Paper 1's prompt-based reflection framework, which relies on existing LLM capabilities.
Paper 2 likely has higher scientific impact due to a concrete, technically novel system (bilevel neural+symbolic policies) addressing a central, timely challenge in embodied AI: long-horizon planning with generalization and efficiency. It presents an implementable method with empirical validation on established benchmarks and clear performance/scaling claims, enabling direct downstream use in robotics and planning. Paper 1 is conceptually important for evaluation science and could influence benchmarking practice, but its impact may be slower and less immediately adoptable, with harder-to-measure methodological uptake compared to an end-to-end demonstrated algorithmic advance.
Paper 1 has higher likely scientific impact: it proposes a broadly applicable hybrid framework (bilevel policies combining neural low-level control with symbolic high-level planning) with strong scalability claims and evaluation on established robotics benchmarks, suggesting methodological rigor and relevance to long-horizon embodied AI. Its approach can transfer across many manipulation/planning domains, impacting robotics, planning, and representation learning. Paper 2 targets an important application, but evidence is more preliminary (small proof-of-concept corpus, no direct comparison to RAG baselines yet), making near-term scientific impact less certain despite good timeliness.