Learning to Search and Searching to Learn for Generalization in Planning

Michael Aichmüller, Yannik Hesse, Hector Geffner

May 25, 2026

arXiv:2605.25720v1 PDF

cs.AI(primary)

#161of 2682·Artificial Intelligence

#161 of 2682 · Artificial Intelligence

Tournament Score

1532±43

10501800

86%

Win Rate

Wins

Losses

Matches

Rating

7.5/ 10

Significance7.5

Rigor7.5

Novelty7

Clarity8.5

Tournament Score

1532±43

10501800

86%

Win Rate

Wins

Losses

Matches

Rating

7.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Combinatorial generalization remains a central challenge in Deep Reinforcement Learning (DRL). Classical planning provides a simple yet challenging setting to study this problem through explicit relational descriptions, without requiring learning from perception. In sparse-reward domains, standard RL exploration via real-time search is ineffective, and learning-based planning methods often rely on expert demonstrations, hindsight relabeling, or random walks from the goal state. In contrast, planners rely on best-first search methods such as $\mathrm{A}^\star$ to solve problems from scratch. We propose a self-improving $\mathrm{WA}^\star$ learning framework in combination with a value heuristic represented by a Relational Graph Neural Network: the heuristic guides search, and the resulting search data updates the heuristic via $Q$ -learning. This loop yields heuristics that can function as general policies and solve new instances even without search, where DRL otherwise fails, as we show on puzzles such as Sokoban, PushWorld, The Witness, and the 2023 International Planning Competition benchmarks. Notably, we demonstrate strong zero-shot generalization: For example, heuristics trained on Blocksworld instances with fewer than 30 blocks successfully solve instances with 488 blocks without search.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

1. Core Contribution

The paper introduces GSP (Generalized Search for Planning), a self-improving learning framework that replaces the real-time (agent-centered) search typical of RL with best-first search (WA*) for exploration in classical planning domains. The key insight is a virtuous cycle: a relational GNN-based Q-function guides WA* search to solve training instances, and the resulting search trajectories (including goal paths, dead-ends, and intermediate states) provide supervision to improve the Q-function via Q-learning. This improved heuristic then guides more efficient searches on progressively harder instances.

The core novelty lies in bridging two paradigms—best-first search from classical planning and Q-learning from RL—in a way that enables combinatorial generalization across problem sizes. While prior work has combined search with learning (e.g., DeepCubeA, Arfaee et al., Orseau & Lelis), these approaches operated on fixed-size state spaces. GSP generalizes across instances with varying numbers of objects, initial states, and goals, enabled by the relational GNN representation and the domain-instance separation of PDDL.

2. Methodological Rigor

The methodology is well-grounded. The paper clearly distinguishes its approach from related paradigms (LRTA*, RTDP, AlphaZero, supervised heuristic learning) and provides algorithmic pseudocode. Several design choices are well-motivated:

Search-derived lower bounds: Goal-path returns provide lower bounds on optimal return, used via a max operation with bootstrap targets to prevent regression—a simple but effective stabilization mechanism.

Dead-end supervision: Explicit penalty signals for dead-end states, important for domains like The Witness and Sokoban.

Instance selection strategy: Priority-based sampling across unsolved/solved/satisficed pools to focus compute on maximally informative instances.

The experimental evaluation is comprehensive and multi-faceted, covering the 2023 IPC learning track (10 domains), combinatorial puzzles (Sokoban, The Witness, 24-Puzzle), and PushWorld. The paper evaluates both greedy policy execution and search-guided execution, includes ablations of all three training components, provides training dynamics visualizations, and compares against a broad set of baselines (LiftedHER, LAMA, WL-f, Horčík et al., Distincter, AlphaZero, DeepCubeA, Orseau & Lelis methods). The use of 5 random seeds with mean/std reporting in ablations adds reliability.

One methodological concern: model selection relies on validation performance with ties broken by multiple criteria. The appendix reveals that for the LiftedHER comparison benchmarks, test instances were used for checkpoint selection, which weakens the generalization claims for that specific comparison, though the authors are transparent about this.

3. Potential Impact

Immediate impact: The framework provides a practical method for learning domain-specific heuristics that can solve planning instances far beyond training distribution. The Blocksworld result (training on ≤30 blocks, solving 488 blocks without search) is a striking demonstration of size generalization. This has implications for:

Automated planning: Learning competitive heuristics without requiring optimal solutions for supervision (unlike supervised approaches) or expert demonstrations.

Goal-conditioned RL: Demonstrating that best-first search can serve as a superior exploration mechanism when models are known, addressing the sparse-reward exploration problem without HER or reverse walks.

GNN-based reasoning: Providing further evidence that relational GNNs can support combinatorial generalization in structured domains.

Broader implications: The self-improving loop paradigm could influence how the community thinks about exploration in model-based RL more generally. The clear demonstration that real-time search is fundamentally inferior to best-first search for planning-like problems (supported by the AlphaZero baseline's failure) is an important empirical finding.

4. Timeliness & Relevance

The work is highly timely. Combinatorial generalization is an active frontier in deep learning, and the 2023 IPC learning track specifically targeted this challenge. The paper directly addresses current bottlenecks: (1) sparse-reward exploration in structured domains, (2) the gap between supervised heuristic learning and self-improving RL, and (3) the limited scalability of RL-based general policy learning. The comparison with concurrent/recent work (Ståhlberg & Geffner 2026, Chen 2025, Horčík et al. 2025, Bai et al. 2025) positions the paper well within an active research landscape.

5. Strengths & Limitations

Key Strengths:

Uniformity: Same architecture and hyperparameters across all domains—no domain-specific tuning.

Self-improving without external supervision: Unlike supervised heuristic learning, no optimal solutions needed; unlike HER-based methods, no goal relabeling assumptions.

Strong size generalization: The Blocksworld 30→488 block result is remarkable and goes well beyond prior demonstrations.

Dual-use heuristic: The learned Q-function works both as a greedy policy and as a search heuristic, providing flexibility at test time.

Open-source code and data.

Notable Limitations:

24-Puzzle failure: GSP completely fails on the sliding tile puzzle, where the training instances are too hard for initial uninformed search. This reveals a bootstrapping weakness—the method requires at least some training instances to be solvable from scratch.

Local vs. global calibration: The authors honestly acknowledge that the Q-function works better as a local action ranker than as a globally calibrated heuristic, explaining why greedy execution sometimes outperforms WA* at test time—a somewhat paradoxical finding.

1-WL expressivity ceiling: Performance on Rovers and Satellite is inherently limited by GNN expressivity, a known limitation shared with all GNN-based approaches.

Computational cost: 12-hour training runs with 5 parallel workers, though results suggest earlier convergence in many domains.

No theoretical guarantees: No convergence or completeness guarantees for the learning loop, unlike LRTA*/RTDP which have convergence results.

Additional Observations

The paper's framing of the three levels of RL generalization (states → states+goals → states+goals+size) provides a useful conceptual taxonomy. The insight that joint relational encoding of state and actions enables strong local action ranking but poor global calibration has implications for future architecture design. The comparison against AlphaZero, while preliminary, provides useful negative evidence about MCTS for non-adversarial planning.

Rating:7.5/ 10

Significance 7.5Rigor 7.5Novelty 7Clarity 8.5

Generated May 26, 2026

Comparison History (29)

vs. SIA: Self Improving AI with Harness & Weight Updates

gpt-5.25/27/2026

Paper 1 offers a methodologically grounded integration of classic best-first search (WA*/A*) with learned relational heuristics via Q-learning, demonstrating striking zero-shot combinatorial generalization (e.g., 30→488 blocks) and broad relevance to planning, RL, and neuro-symbolic methods. Its evaluation on standard puzzles and IPC benchmarks increases rigor and reproducibility. Paper 2 targets a timely self-improvement theme with attractive applications, but the claims (large gains, cross-domain harness+weights updates) are harder to assess without strong controls and may be more system/benchmark-dependent, reducing expected durable scientific impact.

vs. Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

gemini-3.15/27/2026

Paper 1 addresses a critical safety and reliability flaw in Retrieval-Augmented LLMs, a widely deployed technology. Its discovery of the 'monitoring-control gap' challenges existing evaluation protocols and has immense real-world implications for AI safety. While Paper 2 offers impressive methodological advancements in RL and classical planning, Paper 1's focus on LLM safety guarantees a broader and more urgent scientific and societal impact.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

gpt-5.25/27/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, broadly applicable self-improving framework combining weighted A* search with learned relational heuristics, demonstrates strong empirical results and striking zero-shot scaling (e.g., Blocksworld 488 blocks), and targets a longstanding, timely challenge (combinatorial generalization in planning/RL). The methodology is action-oriented and transferable across planning, RL, and GNN-based reasoning, with clear real-world relevance to automated planning and decision-making. Paper 1 is insightful diagnostically for LLM evaluation, but is narrower and primarily reframes measurement rather than enabling new capabilities.

vs. ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

gemini-3.15/27/2026

Paper 1 addresses a critical bottleneck in the emerging field of autonomous AI research: verifiability and hallucination. By introducing a framework that ensures traceable evidence and eliminates fabricated citations, it has transformative potential across all scientific disciplines, accelerating reliable AI-driven discovery. While Paper 2 presents significant algorithmic advancements in DRL and classical planning with strong zero-shot generalization, its impact is largely confined to the AI planning community, whereas Paper 1 has far-reaching, cross-disciplinary implications.

vs. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

claude-opus-4.65/26/2026

Paper 1 presents a novel self-improving framework combining classical planning search (WA*) with learned GNN heuristics, demonstrating remarkable zero-shot combinatorial generalization (e.g., training on 30 blocks, solving 488). This addresses a fundamental challenge in AI—combinatorial generalization—with a principled approach bridging classical planning and deep RL. Paper 2 offers useful engineering improvements to test-time scaling for LLM reasoning via stochastic backtracking, but is more incremental, building on existing PRM-guided search. Paper 1's broader theoretical contribution, cross-domain applicability, and striking generalization results suggest higher long-term scientific impact.

vs. MDIA: A Multi-Agent Diagnostic Intelligence Pipeline on HealthBench Professional

claude-opus-4.65/26/2026

Paper 2 presents a novel self-improving framework combining weighted A* search with relational GNNs that demonstrates remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances). It addresses a fundamental challenge in AI (combinatorial generalization) with broad implications across planning, reinforcement learning, and graph neural networks. Paper 1 is an engineering contribution applying multi-agent LLM architecture to a clinical benchmark with incremental improvements (+3.72pp), and its findings are somewhat narrow—focused on a specific benchmark with grader-dependent variability concerns. Paper 2's methodological novelty and cross-domain applicability give it substantially higher impact potential.

vs. AION: Next-Generation Tasks and Practical Harness for Time Series

claude-opus-4.65/26/2026

Paper 1 presents a novel self-improving search-and-learning framework combining WA* search with relational GNNs, demonstrating remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances). It addresses a fundamental challenge in AI—combinatorial generalization—with rigorous methodology and strong empirical results across multiple domains including competition benchmarks. Paper 2 proposes a harness for time series tasks but is more of an engineering/framework contribution with limited evaluation (single Kaggle case study) and less methodological novelty. Paper 1's contributions to planning, RL, and generalization are more broadly impactful and scientifically rigorous.

vs. EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

gpt-5.25/26/2026

Paper 1 presents a novel self-improving loop coupling WA* search with a learned relational GNN value heuristic updated via Q-learning, achieving striking zero-shot combinatorial generalization (e.g., 30→488 blocks) across diverse planning benchmarks—an advance likely to influence both planning and RL research and applications. Paper 2 is timely and useful infrastructure, but primarily contributes an evaluation benchmark; its impact is narrower (agent assessment) and less algorithmically transformative. Overall, Paper 1 combines methodological innovation, broad cross-field relevance, and strong empirical generalization claims.

vs. JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

claude-opus-4.65/26/2026

Paper 2 presents a novel and elegant self-improving framework combining classical search (WA*) with learned heuristics via GNNs, demonstrating remarkable zero-shot generalization (e.g., training on 30 blocks, solving 488). This addresses a fundamental challenge in AI—combinatorial generalization—with clear methodological rigor and broad implications for planning, RL, and reasoning. Paper 1, while practically relevant, is more incremental (extending JT-Safe-V1) and primarily an engineering contribution focused on safety benchmarks and cost reduction, with less fundamental scientific novelty.

vs. FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

gemini-3.15/26/2026

While Paper 1 presents impressive generalization results in reinforcement learning and planning, Paper 2 addresses a critical, highly timely bottleneck in modern AI: KV-cache memory limits during long-context LLM inference. By introducing a mathematically rigorous, universal vector quantization method that significantly outperforms existing scalar approaches, Paper 2 offers immediate and immense practical value for deploying efficient, large-scale language models, giving it a higher potential for broad scientific and industry impact.

vs. From Accounting to Coordination: A Virtual Water-Aware Electricity-Computation-Water Nexus Framework for Data Center Dispatch

gpt-5.25/26/2026

Paper 1 has higher potential impact due to a broadly relevant methodological contribution: a self-improving loop combining weighted A* search with a relational GNN heuristic trained via Q-learning, addressing combinatorial generalization in planning/RL. The reported zero-shot scaling (e.g., Blocksworld to 488 blocks) suggests a significant advance with applicability across AI planning, RL, and search. Paper 2 is timely and practically important for sustainable data-center dispatch, but is more domain-specific and appears to offer incremental (3–5%) gains on standard test systems, limiting breadth compared to Paper 1.

vs. Test-Time Deep Thinking to Explore Implicit Rules

gemini-3.15/26/2026

While Paper 1 addresses a highly timely topic in LLM agents, Paper 2 tackles a fundamental, long-standing challenge in Deep RL: combinatorial generalization in sparse-reward domains. Its proposed self-improving search framework demonstrates profound zero-shot generalization capabilities (e.g., scaling from 30 to 488 blocks without search), which offers broader, paradigm-shifting implications for classical planning, reasoning, and autonomous systems.

vs. Representation Without Control: Testing the Realization Effect in Language Models

gpt-5.25/26/2026

Paper 2 is more likely to have higher impact: it introduces a self-improving loop combining weighted A* search with a learned relational GNN heuristic trained via Q-learning, showing striking zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks) across multiple planning benchmarks. This targets a central, timely RL/planning challenge with clear downstream applications (automated planning, robotics, optimization) and broad relevance across DRL, classical planning, and graph learning. Paper 1 is methodologically careful and valuable for interpretability, but its main contribution is a negative result with narrower immediate application scope.

vs. PRIMA: Operational Patterns for Resilient Multi-Agent Research with Verifiable Identity and Convergent Feedback

claude-opus-4.65/26/2026

Paper 2 addresses a fundamental challenge in AI—combinatorial generalization in planning—with a clean, well-motivated approach combining weighted A* search with relational GNNs in a self-improving loop. It demonstrates impressive zero-shot generalization (e.g., training on 30-block instances, solving 488-block ones), validated across multiple benchmarks including international competition problems. The methodology is rigorous with clear theoretical grounding and broad applicability. Paper 1 presents engineering patterns for multi-agent LLM systems, but its contributions are more operational/architectural than scientifically foundational, with limited empirical validation beyond a single case study.

vs. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

claude-opus-4.65/26/2026

Paper 1 presents a novel self-improving framework combining classical planning search (WA*) with relational GNNs, demonstrating remarkable zero-shot combinatorial generalization (e.g., training on <30 blocks, solving 488-block instances). This addresses a fundamental challenge in AI—combinatorial generalization—with a principled, broadly applicable method. Paper 2 offers valuable diagnostic insights into LLM tool-use via the 'knowing-doing gap,' but is more incremental and narrower in scope. Paper 1's methodological contribution (bridging classical planning and deep learning) has broader potential impact across planning, RL, and combinatorial optimization.

vs. Beyond Inference-Only Deployment: Comparing Weight-Based Consolidation Against Cascading Compaction

claude-opus-4.65/26/2026

Paper 1 addresses a fundamental challenge in AI—combinatorial generalization in planning—with a novel self-improving framework combining weighted A* search with relational GNNs. It demonstrates remarkable zero-shot generalization (e.g., training on <30 blocks, solving 488-block instances) across multiple benchmarks including IPC 2023. The theoretical contribution of unifying search and learning is broadly impactful across AI planning, RL, and combinatorial optimization. Paper 2 addresses a practical engineering concern (LLM personalization via LoRA) with a small-scale study (n=10) and limited novelty beyond applying existing techniques (LoRA fine-tuning) in a specific deployment scenario.

vs. Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

gemini-3.15/26/2026

Paper 1 tackles a fundamental and persistent challenge in Deep Reinforcement Learning (combinatorial generalization) with a novel self-improving search and learning framework. Its demonstration of extreme zero-shot generalization in complex domains suggests highly impactful methodological advancements for RL and planning. While Paper 2 addresses a timely practical issue in LLM deployment, Paper 1 offers deeper algorithmic contributions with broader potential to advance the foundational capabilities of autonomous AI systems.

vs. Mitigating Cognitive Bias in RLHF by Altering Rationality

gpt-5.25/26/2026

Paper 2 has higher likely impact: it introduces a broadly applicable self-improving loop combining weighted A* search with learned relational GNN heuristics, demonstrated across diverse planning benchmarks and with striking zero-shot combinatorial generalization (e.g., 30→488 blocks). This bridges classical planning and DRL in a way that can influence multiple communities (planning, RL, graph learning) and has clear downstream utility for scalable problem solving. Paper 1 is timely for RLHF robustness but is more incremental (reweighting via context-dependent rationality using an LLM-judge) and narrower in scope.

vs. Solving Combinatorial Counting Problems with Weighted First-Order Model Counting

claude-opus-4.65/26/2026

Paper 2 addresses the fundamental challenge of combinatorial generalization in planning/RL with a self-improving framework combining search and learning. Its strong zero-shot generalization results (e.g., training on 30 blocks, solving 488) demonstrate significant practical impact. The approach bridges classical AI planning and deep RL, has broad applicability across domains, and addresses a core challenge in AI. Paper 1, while technically rigorous in formalizing combinatorial counting via WFOMC, targets a narrower audience and problem class with more incremental contributions to an established field.

vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

gpt-5.25/26/2026

Paper 2 likely has higher scientific impact: it proposes a self-improving loop tightly integrating classical best-first search (WA*) with learned relational GNN heuristics updated via Q-learning, enabling strong zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks). This bridges planning and RL, offering broad relevance across DRL, automated planning, and programmatic reasoning, and targets a long-standing core challenge (generalization under sparse rewards). Paper 1 is timely and useful for multi-agent LLM coordination and efficiency, but its impact may be narrower and more incremental relative to the larger cross-field advance and generalization result in Paper 2.