LLM-Evolved Pattern Generators for Optimal Classical Planning

Windy Phung, Dominik Drexler, Arnaud Lequen, Jendrik Seipp

#1265 of 3355 · Artificial Intelligence
Share
Tournament Score
1432±43
10501800
67%
Win Rate
14
Wins
7
Losses
21
Matches
Rating
6.5/ 10
Significance
Rigor
Novelty
Clarity

Abstract

Learned heuristics have recently become a competitive alternative to traditional domain-independent heuristics for satisficing planning. Existing approaches, however, focus on improving search guidance rather than guaranteeing admissibility, which makes them unsuitable for optimal classical planning. We present the first method for learning domain-dependent heuristics that are admissible by design and thus preserve the optimality guarantees of A* search. Instead of learning a direct mapping from states to heuristic values, we learn to construct abstractions that induce admissible heuristics. We use an LLM-driven evolutionary program-synthesis framework to obtain, for each domain, a program that produces a pattern collection for any task in that domain, and we combine the resulting patterns admissibly via saturated cost partitioning. Empirically, the learned programs encode interpretable domain-specific insights, run with negligible overhead at test time and yield heuristics that match the coverage of state-of-the-art domain-independent baselines on several domains while evaluating each state substantially faster.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: LLM-Evolved Pattern Generators for Optimal Classical Planning

1. Core Contribution

This paper addresses a genuine gap at the intersection of learned heuristics and optimal classical planning. While prior work on learning heuristics has focused on satisficing planning (sacrificing admissibility), this paper presents the first method for learning domain-dependent heuristics that are admissible by construction. The key insight is elegant: rather than learning a direct state-to-value mapping (which is difficult to make admissible), the authors learn to construct *abstractions* — specifically, pattern collections for pattern database (PDB) heuristics — that inherently produce admissible estimates. The admissibility guarantee comes from the mathematical properties of projections and saturated cost partitioning (SCP), not from any learned constraint.

The approach uses OpenEvolve (an open-source AlphaEvolve variant) to synthesize, per domain, a Python program that maps any task description to a pattern collection. This framing — learning a *generator program* rather than a fixed collection — is a meaningful architectural choice that enables generalization across task sizes within a domain.

2. Methodological Rigor

The methodology is sound in its theoretical foundations. The admissibility guarantee is airtight: projections yield admissible heuristics, and SCP preserves admissibility when combining them. The evolutionary loop is well-defined with a clear scoring function combining validity, expansions, and search time.

However, several aspects weaken the experimental rigor:

  • Scale of evaluation: Only 7 domains with 30 test tasks each (210 total) are evaluated. This is relatively small for drawing broad conclusions.
  • Training/test separation: The training set (IPC 2023 Learning track) and test set (Autoscale) are from different benchmarks, which is good practice, but the domains are well-known planning benchmarks that LLMs have likely seen extensively during pretraining. The authors acknowledge this concern but defer obfuscation experiments to future work.
  • LLM choice: Only Kimi K2.5 is tested; sensitivity to the LLM backbone is unexplored.
  • Evolutionary hyperparameters: The configuration (3 islands, 100 iterations) is modest and fixed across domains; no sensitivity analysis is provided.
  • Statistical significance: No variance analysis or confidence intervals are reported. With evolutionary stochasticity and only 100 iterations, understanding variance across runs would strengthen claims.
  • 3. Potential Impact

    The work opens a genuinely new research direction: learning admissible abstractions rather than learning heuristic values directly. This sidesteps the fundamental difficulty of enforcing admissibility on learned value functions while still capturing domain-specific structure.

    Practical implications: The generated patterns run with negligible overhead at test time (the program is a simple Python function), unlike SYS-SCP which can spend 100+ seconds on pattern generation per task. This makes the approach attractive for settings where the same domain is solved repeatedly (e.g., industrial planning applications).

    Broader implications: The idea of using LLMs to synthesize *structural components* of well-understood algorithmic frameworks (rather than end-to-end solutions) is a promising paradigm. It preserves formal guarantees while leveraging LLMs' pattern recognition capabilities. This template could extend to other abstraction types (Cartesian abstractions, merge-and-shrink strategies) and potentially other combinatorial optimization contexts.

    4. Timeliness & Relevance

    The paper is highly timely, sitting at the confluence of three active trends: (1) LLM-driven program synthesis (AlphaEvolve/FunSearch), (2) learning for planning, and (3) optimal classical planning. The observation that existing learned heuristics sacrifice admissibility is a well-known limitation that the community has struggled with. The paper directly addresses this bottleneck.

    The use of OpenEvolve (open-source) enhances reproducibility and accessibility relative to proprietary systems.

    5. Strengths & Limitations

    Key Strengths:

  • Admissibility by design: The most important contribution — a clean architectural choice that guarantees optimality without post-hoc correction.
  • Interpretability: The synthesized generators are readable Python programs (shown in the appendix), encoding domain-specific insights that can be inspected and understood by human planners.
  • Per-state efficiency: hevo evaluates states substantially faster than systematic baselines on most tasks, compensating for weaker per-state guidance with throughput — a practical and underappreciated dimension.
  • Domain-level learning: Unlike task-level pattern generators that must rediscover structure for each instance, domain-level generators amortize this cost.
  • Clean experimental design: The random ablation (h_rand) convincingly demonstrates that the evolutionary process finds meaningful structure, not random atoms.
  • Notable Weaknesses:

  • Inconsistent dominance: hevo clearly fails on Miconic (4/30 vs. 20/30 for SYS-3) and underperforms on Floortile. The aggregate coverage (61 vs. 75 for SYS-3) is substantially lower.
  • Limited domain coverage: 7 domains is narrow; many planning domains with different structural properties are untested.
  • Potential LLM memorization: Planning domains like Blocksworld and Transport are extensively documented in LLM training data. The extent to which the LLM leverages prior knowledge of these specific domains versus genuine structural reasoning is unclear.
  • Weak guidance quality: Expansion counts show hevo provides weaker guidance than SYS-3 and SYS-SCP on most tasks, winning primarily through faster evaluation. This suggests the patterns are less informative, raising questions about scalability to harder instances where guidance quality dominates.
  • No comparison with other learned admissible approaches: The concurrent work by Futuhi & Sturtevant (2026) on neural admissible heuristics is mentioned but not compared experimentally.
  • Single evolutionary run: Without multiple runs, it's impossible to assess the reliability of the approach.
  • 6. Additional Observations

    The appendix containing full generator code is valuable for reproducibility and reveals interesting domain-specific strategies (e.g., the Childsnack generator's handling of gluten allergies, the Transport generator's per-package delivery patterns). This transparency is commendable.

    The scoring function design — combining validity, expansions, and time — introduces an implicit bias toward faster-to-evaluate patterns rather than more informative ones. This explains both the strength (per-state speed) and weakness (guidance quality) of the approach. Alternative scoring functions could shift this balance.

    The constraint of at most 20 patterns with 5×10⁶ states each is quite restrictive and may limit the approach's ability to find competitive patterns in domains like Miconic where many small patterns are more effective.

    Rating:6.5/ 10
    Significance 7Rigor 5.5Novelty 7.5Clarity 7.5

    Generated Jun 2, 2026

    Comparison History (21)

    vs. Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models
    gpt-5.26/3/2026

    Paper 2 likely has higher impact: it introduces the first admissible-by-design, learned domain-dependent heuristics for optimal classical planning, preserving A* optimality—an important methodological advance with clear guarantees. The LLM-evolved program synthesis of pattern generators plus saturated cost partitioning is innovative, interpretable, and offers practical speedups with negligible test-time overhead, making it relevant to planning, search, and automated reasoning. Paper 1 is useful for instruction-following robustness, but its gains are more incremental and may be absorbed by broader alignment/prompting advances.

    vs. InfoMem: Training Long-Context Memory Agents with Answer-Conditioned Information Gain
    gpt-5.26/3/2026

    Paper 2 likely has higher impact due to timeliness and breadth: long-context LLM memory is a central, rapidly growing area with wide applicability across QA, agents, RAG, and document understanding. The proposed answer-conditioned information-gain reward is a generally reusable training signal that could influence RL formulations for memory, summarization, and tool-using agents. Paper 1 is novel and rigorous within optimal classical planning (admissible learned heuristics via synthesized abstractions), but the field is narrower and domain-dependent learning may limit general uptake compared to broadly deployable LLM training methodology.

    vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces
    gpt-5.26/3/2026

    Paper 2 is more novel methodologically: it uses LLM-guided evolutionary program synthesis to generate admissible, domain-dependent abstractions for optimal planning—addressing a key gap (learning while preserving A* optimality). Its rigor is strengthened by formal admissibility via abstractions and saturated cost partitioning, and its potential impact spans automated planning, program synthesis, and LLM-based tool generation with clear real-world relevance (robotics, logistics). Paper 1 is a valuable benchmark, but its impact is narrower (evaluation dataset) and may be more incremental relative to existing personalization/behavior benchmarks.

    vs. Coupling Language Models with Physics-based Simulation for Synthesis of Inorganic Materials
    gemini-3.16/2/2026

    Paper 2 demonstrates higher potential scientific impact due to its interdisciplinary approach, bridging AI, physics, and materials science. While Paper 1 advances optimal classical planning within computer science, Paper 2 addresses a critical real-world bottleneck—synthesis planning for novel inorganic materials. By coupling LLMs with physics-based simulations, it paves the way for practical advancements in discovering and manufacturing new materials, which has broad, transformative applications in energy, electronics, and manufacturing.

    vs. POIROT: Interrogating Agents for Failure Detection in Multi-Agent Systems
    gpt-5.26/2/2026

    Paper 2 has higher potential impact: it introduces a first-of-its-kind approach to learn domain-dependent heuristics that are admissible by design, directly advancing optimal classical planning with preserved A* guarantees—an enduring, broadly used formal setting. The method is technically rigorous (program synthesis + admissible combination via saturated cost partitioning), yields interpretable artifacts, and promises practical speedups with negligible test-time overhead. Its contribution generalizes across planning domains and connects LLMs to principled algorithm design. Paper 1 is timely for LLM safety, but its self-auditing premise may face robustness and external-validity limits in safety-critical oversight.

    vs. Self-Healing Agentic Orchestrators for Reliable Tool-Augmented Large Language Model Systems
    gpt-5.26/2/2026

    Paper 2 is more scientifically novel: it proposes admissible-by-design, domain-dependent heuristics for optimal classical planning via LLM-guided evolutionary program synthesis of abstraction/pattern generators, preserving A* optimality—an advance over typical learned (inadmissible) heuristics. Its impact spans planning, heuristic search, program synthesis, and LLM-assisted algorithm design, with strong potential for broad reuse in optimal planning pipelines. Paper 1 is timely and practically useful for LLM systems reliability, but its contributions are more incremental (orchestration/recovery engineering) and may be narrower and faster-moving with the tooling ecosystem.

    vs. A Mathematical Conflict Framework for Contextual Data Modulation
    claude-opus-4.66/2/2026

    Paper 2 presents a concrete, novel contribution—the first method for learning admissible domain-dependent heuristics for optimal classical planning using LLM-driven evolutionary program synthesis. It combines multiple established techniques (A* search, pattern databases, saturated cost partitioning, LLM-based program synthesis) in an innovative way, with strong empirical results matching state-of-the-art baselines. Paper 1 proposes an abstract mathematical framework for conflict in data modulation but lacks concrete algorithms, experiments, or demonstrated applications, limiting its immediate and verifiable impact.

    vs. Advanced Mathematics Learning Behavior Prediction and Academic Early Warning Model Based on Multimodal Data Analysis
    gpt-5.26/2/2026

    Paper 1 is more novel and broadly impactful: it introduces an admissible-by-construction learning framework for optimal classical planning, combining LLM-driven program synthesis with evolutionary search to generate abstractions/pattern databases, preserving A* optimality—an important methodological advance in AI planning. Its approach is timely (LLMs for algorithm design), potentially transferable to other heuristic search and verification-sensitive settings, and claims interpretable artifacts plus runtime benefits. Paper 2 addresses an important applied education problem, but the methods (knowledge graphs, GATs, temporal models) are relatively standard and likely less generalizable scientifically beyond educational analytics.

    vs. RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network
    claude-opus-4.66/2/2026

    Paper 2 presents a fundamentally novel approach—combining LLM-driven evolutionary program synthesis with classical AI planning to learn admissible heuristics, which is a first-of-its-kind contribution. It bridges two major AI paradigms (LLMs and formal planning), provides theoretical guarantees (admissibility), and produces interpretable results. Paper 1 offers incremental improvements to radiology report generation with modest quantitative gains. Paper 2's methodological innovation, cross-field impact (AI planning, program synthesis, LLMs), and the significance of solving optimal planning with learned heuristics give it substantially higher scientific impact potential.

    vs. TraceGraph: Shared Decision Landscapes for Diagnosing and Improving Agent Trajectories
    gemini-3.16/2/2026

    Paper 2 addresses a critical bottleneck in the highly active field of LLM agent evaluation. By providing a novel, interpretable framework for analyzing agent trajectories and demonstrating tangible improvements on major benchmarks like SWE-bench, it offers broader applicability and higher timeliness than Paper 1. While Paper 1 presents a neat application of LLMs to classical planning, its impact is likely confined to a narrower, more specialized subfield, whereas Paper 2 appeals to the rapidly growing community working on autonomous AI agents.

    vs. Repair Before Veto: Repair-Augmented Constraint Learning for Contextual Decisions
    gemini-3.16/2/2026

    Paper 1 represents a significant methodological breakthrough by bridging modern generative AI (LLMs) with classical symbolic AI. It presents the first method for learning domain-dependent heuristics that guarantee admissibility in optimal classical planning, solving a major limitation of prior neural-heuristic approaches. By using LLMs to synthesize interpretable abstraction programs rather than black-box mappings, it offers high methodological rigor, safety (optimality guarantees), and broad implications for hybrid neuro-symbolic systems. Paper 2 is practically valuable but addresses a more specific niche in constraint learning and recourse.

    vs. MCP-Persona: Benchmarking LLM Agents on Real-World Personal Applications via Environment Simulation
    gpt-5.26/2/2026

    Paper 1 is more scientifically novel and methodologically substantive: it proposes the first admissible-by-design learned, domain-dependent heuristics for optimal classical planning by learning abstraction/pattern generators via LLM-guided evolutionary program synthesis, with principled admissible combination (saturated cost partitioning) and measurable efficiency gains. This advances core planning theory/practice and could influence heuristic design, program synthesis for search, and trustworthy ML-for-planning. Paper 2 is timely and useful as an evaluation benchmark for personalized tool-using agents, but benchmarks typically yield narrower scientific novelty and longer-term impact unless they become a dominant standard.

    vs. Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
    claude-opus-4.66/2/2026

    Paper 1 presents a novel and technically rigorous contribution at the intersection of LLMs and classical AI planning, introducing the first method for learning admissible domain-dependent heuristics via LLM-driven program synthesis. This addresses a fundamental gap in optimal planning with strong theoretical guarantees. Its methodological innovation—combining LLM-evolved pattern generators with saturated cost partitioning—offers a new paradigm with clear practical impact. Paper 2, while interesting, primarily applies existing psychological frameworks to LLMs without introducing fundamentally new methods, and its simulation claims face validity concerns regarding whether LLMs truly internalize values versus pattern-matching.

    vs. Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers
    gpt-5.26/2/2026

    Paper 2 has higher estimated impact: it introduces a principled, first-of-its-kind approach to learning domain-dependent heuristics that are admissible by design, preserving optimality guarantees—an enduring, foundational contribution to classical planning. The method combines LLM-driven program synthesis with evolutionary search and integrates cleanly with established optimal planners via abstractions and saturated cost partitioning, suggesting methodological rigor and broad reuse. Its contributions generalize across planning domains and can influence automated reasoning, verification, and hybrid neuro-symbolic AI. Paper 1 is timely and useful but more application-specific to diffusion-model safety and may be superseded quickly.

    vs. SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
    gpt-5.26/2/2026

    Paper 1 has higher potential impact due to a more novel methodological contribution: the first admissible-by-design learned domain-dependent heuristics for optimal classical planning, combining LLM-driven program synthesis with evolutionary search and saturated cost partitioning. This advances a core algorithmic guarantee (optimality) and can influence planning, heuristic search, and neurosymbolic methods broadly. Paper 2 provides a valuable benchmark, but benchmarks tend to have impact mainly within the evaluation ecosystem and may age faster as simulators/tasks become superseded. Paper 1’s technique is more general and foundational.

    vs. AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents
    gpt-5.26/2/2026

    Paper 2 is more novel and potentially higher impact: it introduces the first learned, domain-dependent heuristics that are admissible by construction for optimal planning, combining LLM-driven program synthesis with established admissible composition (saturated cost partitioning). This bridges modern LLM synthesis with classical guarantees, enabling real-world deployment where optimality matters (robotics, logistics, verification) while improving speed and maintaining coverage. Paper 1 is valuable and timely as an evaluation framework for continual learning in agents, but benchmarks/probing tools typically have narrower downstream impact than a new guarantee-preserving method that can directly improve core planning performance.

    vs. Anchorless Diversification for Parallel LLM Ideation
    claude-opus-4.66/2/2026

    Paper 1 introduces a fundamentally novel approach—using LLM-driven evolutionary program synthesis to learn admissible heuristics for optimal classical planning—bridging AI planning and LLMs in a way not previously done. It guarantees optimality (admissibility by design), offers interpretable insights, and matches state-of-the-art coverage with faster evaluation. Paper 2, while useful, addresses a more incremental problem (diversifying LLM idea generation) with narrower methodological novelty and impact scope. Paper 1's contribution to automated planning theory and practice gives it substantially broader and deeper scientific significance.

    vs. eMoT: evolving Memory-of-Thought via Symbolic Anchoring and Memory Corrosion
    gemini-3.16/2/2026

    Paper 2 addresses a critical and widely relevant problem in AI today: improving the reliability and multi-step reasoning capabilities of Large Language Models. By combining neural inference with symbolic computation and dynamic memory, its approach has broad applicability across natural language processing and general AI problem-solving. In contrast, Paper 1 offers strong methodological rigor and important guarantees, but its focus on optimal classical planning is more niche, resulting in a narrower potential impact.

    vs. Does Compression Preserve Uncertainty? A Unified Benchmark for Quantized and Sparse LLMs via Conformal Prediction
    gemini-3.16/2/2026

    Paper 2 addresses a highly timely and critical issue in the widespread deployment of Large Language Models: the impact of compression on uncertainty estimation. Because LLM compression is ubiquitous for practical deployment, evaluating its effect on safety-critical uncertainty metrics gives this paper a broader and more immediate real-world impact across AI domains compared to Paper 1, which focuses on the relatively narrower field of classical planning.

    vs. EVA-Net: Subject-Independent EEG Motor Decoding with Video-Derived Motor Priors
    gpt-5.26/2/2026

    Paper 2 likely has higher scientific impact due to strong real-world applicability (subject-independent EEG decoding is a key bottleneck for practical BCIs), broad cross-field relevance (neuroscience, ML, multimodal learning, healthcare), and timeliness (multimodal priors, distillation). It proposes a generally reusable paradigm—video-derived semantic/motor priors with no inference overhead—and reports clear gains on public datasets. Paper 1 is novel and rigorous for optimal planning, but is more niche and domain-dependent, limiting breadth and immediate application compared to BCI advances.