Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning
Michael Aichmüller, Simon Ståhlberg, Martin Funkquist, Hector Geffner
Abstract
Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can "jump" over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.
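The abstracted novelty check described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the atom representation, the `type_of` mapping, the `<type>` placeholder syntax, and the class name `AbstractNoveltyTable` are all assumptions made for the example, and the exception for goal atoms mentioned later in the review is left out.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical atom representation: (predicate, arguments).
Atom = Tuple[str, Tuple[str, ...]]


def abstractions(atom: Atom, type_of: Dict[str, str]) -> List[Atom]:
    """One abstracted form per argument position: keep that argument,
    replace every other argument with its type."""
    pred, args = atom
    if not args:
        return [atom]  # assumption: nullary atoms are left unchanged
    forms = []
    for keep in range(len(args)):
        abstracted = tuple(
            arg if i == keep else f"<{type_of[arg]}>"
            for i, arg in enumerate(args)
        )
        forms.append((pred, abstracted))
    return forms


class AbstractNoveltyTable:
    """Remembers which abstracted atoms have been seen during one search."""

    def __init__(self, type_of: Dict[str, str]) -> None:
        self.type_of = type_of
        self.seen: Set[Atom] = set()

    def is_novel(self, state_atoms: Iterable[Atom]) -> bool:
        """A state is novel if any abstracted form of any of its atoms is new."""
        novel = False
        for atom in state_atoms:
            for form in abstractions(atom, self.type_of):
                if form not in self.seen:
                    self.seen.add(form)
                    novel = True
        return novel


type_of = {"a": "block", "b": "block", "c": "block", "d": "block"}
table = AbstractNoveltyTable(type_of)

r1 = table.is_novel([("on", ("a", "b"))])  # new: on(a,<block>), on(<block>,b)
r2 = table.is_novel([("on", ("c", "d"))])  # new: on(c,<block>), on(<block>,d)
r3 = table.is_novel([("on", ("a", "d"))])  # both abstracted forms already seen

# For comparison, a plain IW(1)-style check on fully instantiated atoms
# would still treat on(a,d) as new.
ground_seen = {("on", ("a", "b")), ("on", ("c", "d"))}
plain_novel = ("on", ("a", "d")) not in ground_seen
```

In this toy run the third state is pruned by the abstracted check even though its ground atom has never occurred, which shows both the compression and the extra pruning that the abstraction introduces.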
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper addresses two fundamental scaling bottlenecks in IW-based generalized planning policies. First, it introduces Aggregated-Delta (AD) encoding, a holistic representation of the entire IW(1) lookahead search tree within a single relational graph. Rather than independently encoding each successor state with its full description, AD represents successors only through their relational differences (added/deleted atoms) relative to the current state, anchored to explicit state-object nodes. This enables all Q-value computations in a single R-GNN forward pass, reducing VRAM from >24GB to ≤4GB. Second, it proposes Abstracted IW(1) (AIW), which modifies the novelty check by replacing all but one argument in each ground atom with its type, shifting the scaling of novelty search from linear in the number of atoms to linear in the number of objects. Together, these enable learning-based generalized planning to scale to instances with hundreds or thousands of objects, far beyond prior work.
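The delta idea behind AD encoding can be illustrated with plain set differences. The sketch below is an assumption-laden toy, not the paper's encoder: the `add_`/`del_` relation prefixes, the state-node identifiers `s0`, `s1`, ..., and the function names are hypothetical, and the tree-structure and depth information mentioned in the ablations is omitted.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical atom representation: (predicate, arguments).
Atom = Tuple[str, Tuple[str, ...]]


def delta(current: FrozenSet[Atom], successor: FrozenSet[Atom]):
    """Atoms added and deleted when moving from current to successor."""
    return successor - current, current - successor


def encode_lookahead(
    current: FrozenSet[Atom], successors: List[FrozenSet[Atom]]
) -> List[Atom]:
    """Build one relation list for the whole lookahead: the current state's
    atoms once, then only the deltas of each successor, each delta tagged
    with a node id for the successor state it belongs to."""
    relations: List[Atom] = sorted(current)
    for idx, succ in enumerate(successors):
        node = f"s{idx}"  # hypothetical state-object node
        added, deleted = delta(current, succ)
        relations += [(f"add_{p}", (node,) + args) for p, args in sorted(added)]
        relations += [(f"del_{p}", (node,) + args) for p, args in sorted(deleted)]
    return relations


current = frozenset({
    ("on", ("a", "b")), ("clear", ("a",)), ("ontable", ("b",)),
})
successor = frozenset({
    ("ontable", ("a",)), ("clear", ("a",)), ("clear", ("b",)), ("ontable", ("b",)),
})

encoded = encode_lookahead(current, [successor])
full_size = len(current) + len(successor)  # size if both states were listed in full
```

Here the joint encoding holds 6 relations instead of the 7 atoms needed to list both states in full; the gap widens as more successors share most of their atoms with the current state.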
Methodological Rigor
The experimental design is thorough and well-structured. The authors evaluate on two complementary benchmarks: the IPC 2023 Learning Track (which includes extreme scaling) and a set of domains known to require beyond-C² expressivity. The ablation study is commendably comprehensive—decomposing contributions of AD encoding, AIW, tree-structure atoms, depth nodes, and goal-atom abstraction exceptions. The paper also tests meaningful interpolations between encoding schemes (Internal, Internal-Delta) to isolate where gains originate.
Training on small instances (up to 20 objects in Blocksworld) and testing on vastly larger ones (488 blocks) with strictly greedy, deterministic policies is a demanding evaluation protocol that genuinely tests structural generalization rather than search compensation. The use of 5 seeds per method with explicit model selection criteria adds statistical credibility, though reporting variance or confidence intervals would strengthen claims further.
One methodological concern is the model selection procedure: evaluating the "best 5 checkpoints among all runs" and reporting only the best outcome introduces an optimistic bias. While this practice is common in the planning literature, it makes it difficult to distinguish expected performance from best-case performance.
Potential Impact
The results are striking: AIW-AD achieves 668/900 (74%) coverage on the IPC 2023 benchmark, substantially surpassing LAMA (586/900, 65%), Distincter (589/900, 65%), and all prior learning-based methods. Critically, this is achieved without exponential-time search at test time: AIW runs in time linear in the number of objects and is invoked only a bounded number of times. This represents a meaningful advance in the competitiveness of learned generalized policies versus classical planners.
The approach has broad implications for:
1. Generalized planning: Demonstrating that learned policies can outperform strong classical planners on standardized benchmarks at scale.
2. GNN expressivity: Showing that architectural limitations (C² fragment) can be partially circumvented through structured lookahead rather than more expressive (and harder to train) architectures.
3. Relational RL: The AD encoding principle—representing transitions through relational deltas rather than full state descriptions—could transfer to other relational reinforcement learning settings.
The insight that width-based lookahead reduces the *expressive burden* on the learned model, not just search cost, is theoretically interesting and practically valuable. This reframes lookahead not merely as computational aid but as an architectural complement.
Timeliness & Relevance
The paper is highly timely. The IPC 2023 Learning Track established hyperscaling benchmarks that exposed the limitations of prior approaches. The planning community has been actively seeking methods that bridge the gap between learned policies and classical planners at scale. This work directly addresses the bottleneck of scaling GNN-based policies to realistic problem sizes—a pressing need as generalized planning matures from proof-of-concept to practical applicability.
The tension between GNN expressivity limitations and planning requirements (beyond-C² domains) is a well-documented open problem. The paper's demonstration that AIW lookaheads provide a practical workaround is directly relevant to ongoing debates about whether more expressive architectures or smarter input representations are the path forward.
Strengths
1. Dramatic efficiency gains: Memory reduction from >24GB to ≤4GB for IW-policy training, enabling practical scaling.
2. Principled abstraction: AIW's type-based abstraction is elegantly motivated and preserves meaningful subgoal structure despite aggressive pruning, with the exception for goal atoms being a smart design choice.
3. Comprehensive evaluation: 21 domains, multiple ablations, multiple baselines including classical planners and competing learning approaches.
4. Exceeding LAMA without exponential search: This is a significant milestone for the field.
5. Strong theoretical grounding: The connection between IW theory, novelty-based pruning, and expressivity requirements is well-articulated.
Limitations
1. Complete failures on Sokoban and Floortile: These domains remain unsolved, and the paper's honest discussion attributes the failures to fundamental limitations: PSPACE-completeness for Sokoban and weak HER (Hindsight Experience Replay) training signals for Floortile.
2. Childsnack scaling wall: The paper acknowledges that discriminative selection from enumerated options cannot handle tens of millions of applicable actions, suggesting a fundamental architectural limitation.
3. Loss of completeness guarantees: AIW sacrifices IW(1)'s completeness for width-1 problems. While empirically this seems acceptable, the theoretical implications are underexplored.
4. Model selection bias: Reporting only the best outcome among the best 5 checkpoints across all runs conflates method quality with selection variance.
5. Limited comparison with transformer-based approaches: The paper mentions supervised transformer methods [Rossetti et al., 2024] but doesn't include them in experiments.
6. Auxiliary ranking loss: The depth-ordering loss feels somewhat ad hoc, and its necessity/contribution isn't fully ablated (only shown indirectly through depth-node ablation).
Additional Observations
The paper's framing of lookahead as reducing "expressive burden" rather than just computational cost is its most intellectually distinctive contribution. The observation that intermediate states handled by lookahead remove the need for the network to make distinctions requiring beyond-C² features is a valuable conceptual insight that could guide future decisions about the trade-off between architectural expressivity and search.
The scaling analysis in the appendix (Figure 1, Table 8) effectively contextualizes the challenge: Rovers instances reach 294,553 atoms, and Childsnack has branching factors in the tens of millions. That the method handles most of these cases is impressive.
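A rough count shows why abstracting atoms moves the scaling from atoms to objects. The derivation below assumes a single predicate p of arity k whose arguments all range over n objects of one type; this is a simplification for illustration, not a figure from the paper. The value n = 488 is taken from the largest Blocksworld test size mentioned above.

```latex
\[
\underbrace{n^{k}}_{\text{ground atoms of } p}
\quad\text{versus}\quad
\underbrace{k \cdot n}_{\text{abstracted forms of } p}
\]
\[
k = 2,\; n = 488:\qquad 488^{2} = 238{,}144
\quad\text{versus}\quad 2 \cdot 488 = 976
\]
```

Each abstracted form keeps one argument (n choices) at one of k positions, so the novelty table for p grows linearly in n rather than with the number of ground atoms.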
Generated May 19, 2026
Comparison History (26)
Paper 2 likely has higher scientific impact due to a more substantive algorithmic advance with broad relevance to planning, search, and relational learning: a holistic lookahead encoding that changes the computational profile of IW policies plus an abstraction scheme that improves scaling on large-object benchmarks. It reports state-of-the-art results on IPC 2023 and surpasses a strong classical baseline (LAMA), suggesting strong rigor and real-world relevance for automated planning. Paper 1 is practical and timely for LMM GUI agents but is a narrower, inference-time engineering improvement.
Paper 1 presents a highly impactful application of LLM agents and RL to computer-aided design (CAD) generation. By addressing critical bottlenecks in reasoning chains and geometric constraints for advanced manufacturing, it bridges generative AI with complex industrial workflows. Its novel self-correcting, dual-track memory framework operates without requiring large-scale annotated data, offering immense real-world value. While Paper 2 provides excellent algorithmic advancements for classical planning, Paper 1 has broader potential economic and cross-disciplinary impact by accelerating automated industrial design and manufacturing.
Paper 2 likely has higher impact: it introduces concrete algorithmic innovations (holistic tree encoding + Abstracted IW(1)) that significantly improve scalability and achieves state-of-the-art results on a major, timely benchmark (IPC 2023), surpassing a strong classical baseline (LAMA). The methods are rigorous and broadly relevant across planning, search, and relational learning, with clear real-world applications in robotics and automation. Paper 1 is novel and valuable for faithful, uncertainty-aware claim verification, but its empirical gains appear more incremental and its impact may be narrower and more dependent on evolving LLM evaluation norms.
Paper 1 addresses a fundamental problem in AI planning—learning generalizable policies—with substantial methodological innovations (holistic encoding, abstracted IW(1)) that achieve state-of-the-art results surpassing established classical planners like LAMA on competitive benchmarks (IPC 2023). It advances core AI/planning theory with demonstrated scalability improvements. Paper 2 applies existing conformal prediction techniques to AI agent evaluation, which is useful but more incremental and narrower in scope. Paper 1's contributions to generalized planning have broader theoretical significance and potential to influence multiple research areas in AI.
Paper 2 presents concrete methodological innovations (holistic search tree encoding and Abstracted IW) that yield new state-of-the-art performance on a major benchmark (IPC 2023), overcoming significant scalability bottlenecks in generalized planning. In contrast, while Paper 1 explores a highly relevant topic (temporal grounding for AVs), its quantitative results show no statistically significant improvements, making its immediate impact more limited to qualitative insights and benchmarking.
Paper 1 offers a more foundational, methodologically grounded advance: a scalable, holistic encoding for IW lookahead plus a principled abstraction of novelty checks, validated on IPC 2023 hyperscaling benchmarks and outperforming strong baselines (including LAMA). This targets a core bottleneck in generalized planning and has broad relevance to planning, search, and neuro-symbolic ML. Paper 2 shows impressive speedups, but its impact may be narrower and potentially limited by reliance on solution enumeration (feasibility/scalability) and an LLM-to-constraint translation step that may be harder to reproduce and rigorously analyze across diverse CP domains.
Paper 1 presents concrete algorithmic innovations (holistic encoding, Abstracted IW(1)) with empirical validation showing state-of-the-art results surpassing established planners like LAMA on competitive benchmarks. It advances generalized planning with measurable improvements. Paper 2 is a position paper arguing for a three-layer safety architecture for LLM agents—while timely and relevant, it lacks empirical validation, presents no implemented system, and primarily sketches a conceptual framework with open problems. Paper 1's methodological rigor, novel technical contributions, and demonstrated results give it higher concrete scientific impact.
Paper 1 makes substantial, rigorously evaluated contributions to generalized planning—a core AI problem—demonstrating state-of-the-art results on competitive benchmarks (IPC 2023) and surpassing established classical planners like LAMA. It introduces novel theoretical ideas (Abstracted IW(1), holistic encoding) with strong empirical validation. Paper 2 presents a practical engineering system (NeuSymMS) for LLM memory management but lacks empirical evaluation, offers limited novelty beyond combining known components (CLIPS, triple stores, LLM extraction), and reads more as a system description than a scientific contribution with measurable advances.
Paper 1 likely has higher impact due to timeliness and broader cross-field relevance: it targets LLM-driven household/robotic agents under real deployment constraints (privacy, local compute, long-context limits) and contributes a new evaluation suite (FullHome) plus a model-agnostic framework that substantially improves compact open-weight models. Its practical implications span robotics, embodied AI, HRI, and efficient LLM prompting. Paper 2 is methodologically strong and advances generalized planning with scalable IW policies, but its impact is more specialized to classical planning/IPC-style benchmarks with narrower immediate real-world adoption.
Paper 1 proposes concrete algorithmic advances (holistic lookahead encoding + Abstracted IW(1)) that improve scalability and achieve new SOTA on IPC 2023, with clear real-world relevance to planning/robotics and strong cross-field links to GNNs and search. Its methodological contribution is substantive and likely to influence both learning-for-planning and classical planning communities. Paper 2 is valuable and timely infrastructure for LLM/formalization evaluation, but benchmarks tend to have narrower direct scientific impact unless they become a dominant standard; its innovation is more in dataset/pipeline than in new mathematical or algorithmic methods.
Paper 1 addresses the widely relevant topic of GenAI's heterogeneous productivity effects through a rigorous RCT, introducing the novel concept of AI Interaction Competence (AIC) as a key moderator. Its findings have broad implications across education, management, and policy, given the massive adoption of LLMs across industries. The actionable insight that scaffolding interventions can reduce inequality in AI-mediated performance has immediate real-world applications. Paper 2, while technically strong and achieving SOTA in generalized planning, addresses a narrower AI planning community. Paper 1's timeliness and cross-disciplinary relevance give it greater potential impact.
Paper 1 presents concrete algorithmic innovations (holistic encoding, Abstracted IW(1)) with demonstrated state-of-the-art results surpassing established planners like LAMA on competitive benchmarks. It advances generalized planning with novel, rigorous methodological contributions. Paper 2 is a survey of LLM-based multi-agent systems that synthesizes existing work under a new framework (LIFE) but introduces no new methods or experiments. While timely, surveys generally have less direct scientific impact than papers introducing novel techniques with empirical validation showing clear advances over prior art.
Paper 1 addresses generalized planning with novel contributions (holistic encoding, Abstracted IW(1)) that achieve state-of-the-art results surpassing classical planners on competitive benchmarks (IPC 2023). It advances fundamental AI planning methodology with broad applicability across diverse domains. Paper 2 applies standard RL techniques (shallow MLPs, experience replay) to a specific card game, offering incremental insights with limited generalizability. Paper 1 demonstrates greater novelty, methodological rigor, and broader impact across the planning and learning communities.
Paper 2 addresses a highly timely and critical issue: the longitudinal safety of memory-equipped LLM agents. Given the rapid deployment of autonomous agents, understanding how accumulated memory introduces temporal contamination has immense real-world implications and broad relevance across AI safety, alignment, and systems. While Paper 1 offers strong methodological advancements in classical planning, Paper 2's focus on LLM safety vulnerabilities guarantees a broader, more immediate scientific and societal impact.
Paper 1 addresses a fundamental challenge in generalized planning with novel contributions (holistic encoding, abstracted IW(1)) that achieve state-of-the-art results surpassing established planners like LAMA on competitive benchmarks (IPC 2023). It combines theoretical depth (C² logic fragment, relational abstraction) with strong empirical validation. Paper 2 presents an incremental improvement to firefly algorithm-based clustering with limited evaluation scope (comparison only to K-Means) and narrower applicability. Paper 1's methodological rigor, broader impact across AI planning, and demonstrated scalability give it significantly higher scientific impact potential.
Paper 2 addresses a critical and highly timely safety issue (memory laundering) in memory-augmented LLM agents. Given the widespread adoption and real-world deployment of LLMs, identifying and mitigating hidden state contamination has broad, high-stakes implications across the AI community. While Paper 1 presents strong methodological advancements and achieves state-of-the-art results, its impact is largely confined to the narrower, more specialized subfield of classical planning.
Paper 2 demonstrates higher scientific impact through rigorous methodology and proven state-of-the-art results on a recognized benchmark (IPC 2023), significantly surpassing established baselines like LAMA. It introduces novel, scalable encoding and abstraction techniques for classical planning. In contrast, while Paper 1 targets a highly relevant real-world application (legal AI) and introduces an interesting graph-constrained approach, it lacks methodological rigor, testing on only a 51-case proof-of-concept and explicitly omitting baseline comparisons. Paper 2's proven empirical success gives it a stronger foundation for broad scientific influence.
Paper 1 offers higher potential scientific impact by providing a theoretically grounded framework using the Graph Information Bottleneck to replace heuristic methods in MARL. By offering formal proofs for topology learning and capacity allocation, it addresses a fundamental bottleneck in multi-agent communication. While Paper 2 presents impressive empirical scaling and SOTA performance in classical planning, Paper 1's rigorous theoretical contributions to reinforcement learning are likely to spur broader methodological adoption across highly impactful fields like autonomous robotics, swarm intelligence, and distributed systems.
Paper 2 addresses the highly influential and rapidly growing field of physical world models. By proposing a foundational, differentiable architecture (WorldString) for actionable object representation, it bridges computer vision, robotics, and policy learning. While Paper 1 offers strong, rigorous improvements in classical planning, Paper 2's alignment with the broader trend of generalized world models gives it a much higher ceiling for interdisciplinary scientific impact and real-world application.
Paper 1 has higher estimated scientific impact due to broader cross-field relevance and real-world applicability: it targets clinical decision-making, where intervention-aware trajectory modeling can directly affect patient outcomes and health-system policy. Its unified framework integrating forecasting, counterfactual estimation, and policy evaluation while explicitly addressing treatment/confounding/observation bias is timely and methodologically consequential, potentially shaping evaluation standards and deployment practices. Paper 2 is technically strong with clear novelty and strong benchmarks in planning, but its impact is more field-specific (AI planning) and less immediately societally transformative.