Learning to Search and Searching to Learn for Generalization in Planning
Michael Aichmüller, Yannik Hesse, Hector Geffner
Abstract
Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as to solve problems from scratch. We propose a self-improving learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via -learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.
AI Impact Assessments
(1 models)Scientific Impact Assessment
1. Core Contribution
The paper introduces GSP (Generalized Search for Planning), a self-improving learning framework that replaces the real-time (agent-centered) search typical of RL with best-first search (WA*) for exploration in classical planning domains. The key insight is a virtuous cycle: a relational GNN-based Q-function guides WA* search to solve training instances, and the resulting search trajectories (including goal paths, dead-ends, and intermediate states) provide supervision to improve the Q-function via Q-learning. This improved heuristic then guides more efficient searches on progressively harder instances.
The core novelty lies in bridging two paradigms—best-first search from classical planning and Q-learning from RL—in a way that enables combinatorial generalization across problem sizes. While prior work has combined search with learning (e.g., DeepCubeA, Arfaee et al., Orseau & Lelis), these approaches operated on fixed-size state spaces. GSP generalizes across instances with varying numbers of objects, initial states, and goals, enabled by the relational GNN representation and the domain-instance separation of PDDL.
2. Methodological Rigor
The methodology is well-grounded. The paper clearly distinguishes its approach from related paradigms (LRTA*, RTDP, AlphaZero, supervised heuristic learning) and provides algorithmic pseudocode. Several design choices are well-motivated:
The experimental evaluation is comprehensive and multi-faceted, covering the 2023 IPC learning track (10 domains), combinatorial puzzles (Sokoban, The Witness, 24-Puzzle), and PushWorld. The paper evaluates both greedy policy execution and search-guided execution, includes ablations of all three training components, provides training dynamics visualizations, and compares against a broad set of baselines (LiftedHER, LAMA, WL-f, Horčík et al., Distincter, AlphaZero, DeepCubeA, Orseau & Lelis methods). The use of 5 random seeds with mean/std reporting in ablations adds reliability.
One methodological concern: model selection relies on validation performance with ties broken by multiple criteria. The appendix reveals that for the LiftedHER comparison benchmarks, test instances were used for checkpoint selection, which weakens the generalization claims for that specific comparison, though the authors are transparent about this.
3. Potential Impact
Immediate impact: The framework provides a practical method for learning domain-specific heuristics that can solve planning instances far beyond training distribution. The Blocksworld result (training on ≤30 blocks, solving 488 blocks without search) is a striking demonstration of size generalization. This has implications for:
Broader implications: The self-improving loop paradigm could influence how the community thinks about exploration in model-based RL more generally. The clear demonstration that real-time search is fundamentally inferior to best-first search for planning-like problems (supported by the AlphaZero baseline's failure) is an important empirical finding.
4. Timeliness & Relevance
The work is highly timely. Combinatorial generalization is an active frontier in deep learning, and the 2023 IPC learning track specifically targeted this challenge. The paper directly addresses current bottlenecks: (1) sparse-reward exploration in structured domains, (2) the gap between supervised heuristic learning and self-improving RL, and (3) the limited scalability of RL-based general policy learning. The comparison with concurrent/recent work (Ståhlberg & Geffner 2026, Chen 2025, Horčík et al. 2025, Bai et al. 2025) positions the paper well within an active research landscape.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing of the three levels of RL generalization (states → states+goals → states+goals+size) provides a useful conceptual taxonomy. The insight that joint relational encoding of state and actions enables strong local action ranking but poor global calibration has implications for future architecture design. The comparison against AlphaZero, while preliminary, provides useful negative evidence about MCTS for non-adversarial planning.
Generated May 26, 2026
Comparison History (29)
Paper 1 offers a methodologically grounded integration of classic best-first search (WA*/A*) with learned relational heuristics via Q-learning, demonstrating striking zero-shot combinatorial generalization (e.g., 30→488 blocks) and broad relevance to planning, RL, and neuro-symbolic methods. Its evaluation on standard puzzles and IPC benchmarks increases rigor and reproducibility. Paper 2 targets a timely self-improvement theme with attractive applications, but the claims (large gains, cross-domain harness+weights updates) are harder to assess without strong controls and may be more system/benchmark-dependent, reducing expected durable scientific impact.
Paper 1 addresses a critical safety and reliability flaw in Retrieval-Augmented LLMs, a widely deployed technology. Its discovery of the 'monitoring-control gap' challenges existing evaluation protocols and has immense real-world implications for AI safety. While Paper 2 offers impressive methodological advancements in RL and classical planning, Paper 1's focus on LLM safety guarantees a broader and more urgent scientific and societal impact.
Paper 2 likely has higher scientific impact: it proposes a concrete, broadly applicable self-improving framework combining weighted A* search with learned relational heuristics, demonstrates strong empirical results and striking zero-shot scaling (e.g., Blocksworld 488 blocks), and targets a longstanding, timely challenge (combinatorial generalization in planning/RL). The methodology is action-oriented and transferable across planning, RL, and GNN-based reasoning, with clear real-world relevance to automated planning and decision-making. Paper 1 is insightful diagnostically for LLM evaluation, but is narrower and primarily reframes measurement rather than enabling new capabilities.
Paper 1 addresses a critical bottleneck in the emerging field of autonomous AI research: verifiability and hallucination. By introducing a framework that ensures traceable evidence and eliminates fabricated citations, it has transformative potential across all scientific disciplines, accelerating reliable AI-driven discovery. While Paper 2 presents significant algorithmic advancements in DRL and classical planning with strong zero-shot generalization, its impact is largely confined to the AI planning community, whereas Paper 1 has far-reaching, cross-disciplinary implications.
Paper 1 presents a novel self-improving framework combining classical planning search (WA*) with learned GNN heuristics, demonstrating remarkable zero-shot combinatorial generalization (e.g., training on 30 blocks, solving 488). This addresses a fundamental challenge in AI—combinatorial generalization—with a principled approach bridging classical planning and deep RL. Paper 2 offers useful engineering improvements to test-time scaling for LLM reasoning via stochastic backtracking, but is more incremental, building on existing PRM-guided search. Paper 1's broader theoretical contribution, cross-domain applicability, and striking generalization results suggest higher long-term scientific impact.
Paper 2 presents a novel self-improving framework combining weighted A* search with relational GNNs that demonstrates remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances). It addresses a fundamental challenge in AI (combinatorial generalization) with broad implications across planning, reinforcement learning, and graph neural networks. Paper 1 is an engineering contribution applying multi-agent LLM architecture to a clinical benchmark with incremental improvements (+3.72pp), and its findings are somewhat narrow—focused on a specific benchmark with grader-dependent variability concerns. Paper 2's methodological novelty and cross-domain applicability give it substantially higher impact potential.
Paper 1 presents a novel self-improving search-and-learning framework combining WA* search with relational GNNs, demonstrating remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances). It addresses a fundamental challenge in AI—combinatorial generalization—with rigorous methodology and strong empirical results across multiple domains including competition benchmarks. Paper 2 proposes a harness for time series tasks but is more of an engineering/framework contribution with limited evaluation (single Kaggle case study) and less methodological novelty. Paper 1's contributions to planning, RL, and generalization are more broadly impactful and scientifically rigorous.
Paper 1 presents a novel self-improving loop coupling WA* search with a learned relational GNN value heuristic updated via Q-learning, achieving striking zero-shot combinatorial generalization (e.g., 30→488 blocks) across diverse planning benchmarks—an advance likely to influence both planning and RL research and applications. Paper 2 is timely and useful infrastructure, but primarily contributes an evaluation benchmark; its impact is narrower (agent assessment) and less algorithmically transformative. Overall, Paper 1 combines methodological innovation, broad cross-field relevance, and strong empirical generalization claims.
Paper 2 presents a novel and elegant self-improving framework combining classical search (WA*) with learned heuristics via GNNs, demonstrating remarkable zero-shot generalization (e.g., training on 30 blocks, solving 488). This addresses a fundamental challenge in AI—combinatorial generalization—with clear methodological rigor and broad implications for planning, RL, and reasoning. Paper 1, while practically relevant, is more incremental (extending JT-Safe-V1) and primarily an engineering contribution focused on safety benchmarks and cost reduction, with less fundamental scientific novelty.
While Paper 1 presents impressive generalization results in reinforcement learning and planning, Paper 2 addresses a critical, highly timely bottleneck in modern AI: KV-cache memory limits during long-context LLM inference. By introducing a mathematically rigorous, universal vector quantization method that significantly outperforms existing scalar approaches, Paper 2 offers immediate and immense practical value for deploying efficient, large-scale language models, giving it a higher potential for broad scientific and industry impact.
Paper 1 has higher potential impact due to a broadly relevant methodological contribution: a self-improving loop combining weighted A* search with a relational GNN heuristic trained via Q-learning, addressing combinatorial generalization in planning/RL. The reported zero-shot scaling (e.g., Blocksworld to 488 blocks) suggests a significant advance with applicability across AI planning, RL, and search. Paper 2 is timely and practically important for sustainable data-center dispatch, but is more domain-specific and appears to offer incremental (3–5%) gains on standard test systems, limiting breadth compared to Paper 1.
While Paper 1 addresses a highly timely topic in LLM agents, Paper 2 tackles a fundamental, long-standing challenge in Deep RL: combinatorial generalization in sparse-reward domains. Its proposed self-improving search framework demonstrates profound zero-shot generalization capabilities (e.g., scaling from 30 to 488 blocks without search), which offers broader, paradigm-shifting implications for classical planning, reasoning, and autonomous systems.
Paper 2 is more likely to have higher impact: it introduces a self-improving loop combining weighted A* search with a learned relational GNN heuristic trained via Q-learning, showing striking zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks) across multiple planning benchmarks. This targets a central, timely RL/planning challenge with clear downstream applications (automated planning, robotics, optimization) and broad relevance across DRL, classical planning, and graph learning. Paper 1 is methodologically careful and valuable for interpretability, but its main contribution is a negative result with narrower immediate application scope.
Paper 2 addresses a fundamental challenge in AI—combinatorial generalization in planning—with a clean, well-motivated approach combining weighted A* search with relational GNNs in a self-improving loop. It demonstrates impressive zero-shot generalization (e.g., training on 30-block instances, solving 488-block ones), validated across multiple benchmarks including international competition problems. The methodology is rigorous with clear theoretical grounding and broad applicability. Paper 1 presents engineering patterns for multi-agent LLM systems, but its contributions are more operational/architectural than scientifically foundational, with limited empirical validation beyond a single case study.
Paper 1 presents a novel self-improving framework combining classical planning search (WA*) with relational GNNs, demonstrating remarkable zero-shot combinatorial generalization (e.g., training on <30 blocks, solving 488-block instances). This addresses a fundamental challenge in AI—combinatorial generalization—with a principled, broadly applicable method. Paper 2 offers valuable diagnostic insights into LLM tool-use via the 'knowing-doing gap,' but is more incremental and narrower in scope. Paper 1's methodological contribution (bridging classical planning and deep learning) has broader potential impact across planning, RL, and combinatorial optimization.
Paper 1 addresses a fundamental challenge in AI—combinatorial generalization in planning—with a novel self-improving framework combining weighted A* search with relational GNNs. It demonstrates remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances) across multiple benchmarks including IPC 2023. The theoretical contribution of unifying search and learning is broadly impactful across AI planning, RL, and combinatorial optimization. Paper 2 addresses a practical engineering concern (LLM personalization via LoRA) with a small-scale study (n=10) and limited novelty beyond applying existing techniques (LoRA fine-tuning) in a specific deployment scenario.
Paper 1 tackles a fundamental and persistent challenge in Deep Reinforcement Learning (combinatorial generalization) with a novel self-improving search and learning framework. Its demonstration of extreme zero-shot generalization in complex domains suggests highly impactful methodological advancements for RL and planning. While Paper 2 addresses a timely practical issue in LLM deployment, Paper 1 offers deeper algorithmic contributions with broader potential to advance the foundational capabilities of autonomous AI systems.
Paper 2 has higher likely impact: it introduces a broadly applicable self-improving loop combining weighted A* search with learned relational GNN heuristics, demonstrated across diverse planning benchmarks and with striking zero-shot combinatorial generalization (e.g., 30→488 blocks). This bridges classical planning and DRL in a way that can influence multiple communities (planning, RL, graph learning) and has clear downstream utility for scalable problem solving. Paper 1 is timely for RLHF robustness but is more incremental (reweighting via context-dependent rationality using an LLM-judge) and narrower in scope.
Paper 2 addresses the fundamental challenge of combinatorial generalization in planning/RL with a self-improving framework combining search and learning. Its strong zero-shot generalization results (e.g., training on 30 blocks, solving 488) demonstrate significant practical impact. The approach bridges classical AI planning and deep RL, has broad applicability across domains, and addresses a core challenge in AI. Paper 1, while technically rigorous in formalizing combinatorial counting via WFOMC, targets a narrower audience and problem class with more incremental contributions to an established field.
Paper 2 likely has higher scientific impact: it proposes a self-improving loop tightly integrating classical best-first search (WA*) with learned relational GNN heuristics updated via Q-learning, enabling strong zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks). This bridges planning and RL, offering broad relevance across DRL, automated planning, and programmatic reasoning, and targets a long-standing core challenge (generalization under sparse rewards). Paper 1 is timely and useful for multi-agent LLM coordination and efficiency, but its impact may be narrower and more incremental relative to the larger cross-field advance and generalization result in Paper 2.