LLM-Evolved Domain-Independent Heuristics for Symbolic AI Planning

Elliot Gestrin, Jendrik Seipp

May 28, 2026

arXiv:2605.29649v1

cs.AI (primary)

#184 of 2821 · Artificial Intelligence

Tournament Score

1529 ± 47 (scale 1050 to 1800)

Win Rate: 86%

Rating: 7.8/10

Significance 8 · Rigor 7 · Novelty 8 · Clarity 8.5

Abstract

Heuristic search is the dominant paradigm in symbolic AI planning, and the strongest heuristics are the result of decades of work by planning researchers. Recent work has shown that large language models (LLMs) can design heuristics for individual planning domains, but no LLM-generated heuristic has so far worked on arbitrary planning tasks. In this paper, we use evolutionary search to produce the first LLM-generated domain-independent heuristics that exceed the hand-engineered state of the art. We let an LLM mutate parent heuristics written in C++, store candidates in a MAP-Elites archive keyed on informedness and speed and calculate fitness scores by blending coverage with solving time. To place the evolved programs in context, we additionally benchmark a broad set of hand-engineered heuristics on their informedness-speed tradeoff, which to our knowledge has not been done before. On unseen testing domains, our best evolved heuristic solves more tasks than even the strongest baseline, with our full heuristic suite spanning the Pareto frontier of said tradeoff. We also find that seeding evolution from the trivial blind heuristic outperforms seeding from the strong FF heuristic, even when the resulting program is itself an FF variant, and that LLM reasoning effort affects how often candidates compile much more than the quality of those that do. Because the evolved programs are plain C++, they slot into existing planners as drop-in replacements and inherit the soundness and completeness guarantees of the underlying search.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper demonstrates that LLM-driven evolutionary search can produce domain-independent heuristics for classical AI planning that surpass decades of hand-engineered alternatives. The key distinction from prior work is the shift from domain-specific to domain-independent heuristics—a qualitative leap. Prior LLM-generated heuristics worked only within individual planning domains; this work produces C++ heuristics that generalize across arbitrary PDDL planning tasks and integrate as drop-in replacements into the Scorpion planner.

The technical approach uses MAP-Elites (a quality-diversity evolutionary algorithm) with LLM-driven mutations over C++ code, keyed on two behavioral dimensions: heuristic informedness and evaluation speed. A fitness function blending coverage with an agile (time-decay) score provides a smoother gradient than coverage alone. The best evolved heuristic solves 368/720 test tasks versus 352 for the strongest baseline (`add`), a meaningful margin given that marginal tasks at high coverage are disproportionately difficult.
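The archive update and the blended fitness described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the 300-second limit, the 0.5 blend weight, the four-bin grid, and the assumption that both behavioral descriptors are normalized to [0, 1] are all hypothetical. The time-decay term follows the familiar IPC agile-track shape (full credit up to 1 second, logarithmic decay to zero at the limit); whether the paper uses exactly this form is not stated here.

```python
import math

TIME_LIMIT = 300.0  # seconds; assumed per-task limit, not taken from the paper


def agile_score(seconds, solved, limit=TIME_LIMIT):
    """IPC-agile-style score: 1 up to 1 s, log-decaying to 0 at the limit."""
    if not solved or seconds >= limit:
        return 0.0
    if seconds <= 1.0:
        return 1.0
    return 1.0 - math.log(seconds) / math.log(limit)


def fitness(results, weight=0.5):
    """Blend coverage with the mean agile score.

    `results` is a list of (solved, seconds) pairs; `weight` is a
    hypothetical knob, not a value reported in the paper.
    """
    n = len(results)
    coverage = sum(1 for solved, _ in results if solved) / n
    agile = sum(agile_score(t, solved) for solved, t in results) / n
    return weight * coverage + (1.0 - weight) * agile


def cell(informedness, speed, bins=4):
    """Map two descriptors (assumed to lie in [0, 1]) to a grid cell."""
    def to_bin(x):
        return min(bins - 1, max(0, int(x * bins)))
    return to_bin(informedness), to_bin(speed)


def insert(archive, candidate):
    """MAP-Elites rule: keep a candidate if its cell is empty or it is fitter."""
    key = cell(candidate["informedness"], candidate["speed"])
    incumbent = archive.get(key)
    if incumbent is None or candidate["fitness"] > incumbent["fitness"]:
        archive[key] = candidate
        return True
    return False


# Usage with made-up candidates (names and numbers are illustrative only).
archive = {}
first = {"name": "cand_a", "informedness": 0.2, "speed": 0.9, "fitness": 0.40}
worse = {"name": "cand_b", "informedness": 0.2, "speed": 0.9, "fitness": 0.30}
other = {"name": "cand_c", "informedness": 0.8, "speed": 0.1, "fitness": 0.10}
accepted = [insert(archive, c) for c in (first, worse, other)]
```

Because each cell keeps only its own best program, a slow but informed candidate such as `cand_c` survives even though its fitness is lower than that of `cand_a`, which is what lets the archive cover different points of the informedness-speed tradeoff.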

Methodological Rigor

Strengths in experimental design:

Clean train/test separation: training on Autoscale benchmarks, testing on 2023 IPC Learning Track with overlapping domains removed.

Comprehensive baseline comparison against 19 established heuristics spanning all major families (landmarks, delete-relaxation, abstractions, operator-counting, causal graph).

The informedness-speed tradeoff analysis (Figure 1) is a novel and illuminating contribution in its own right—the authors note this systematic comparison hasn't been done before.

Controlled ablations over seed choice (blind vs. FF) and reasoning effort (none/low/medium).
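To make the informedness-speed tradeoff concrete, the sketch below computes a Pareto frontier over two axes. It assumes both axes are scaled so that higher is better; the heuristic names and values are made up for illustration and are not measurements from the paper.

```python
def pareto_front(points):
    """Return names of points not dominated on (informedness, speed).

    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one. Both axes are maximised.
    """
    front = []
    for name, inf, spd in points:
        dominated = any(
            (inf2 >= inf and spd2 >= spd) and (inf2 > inf or spd2 > spd)
            for _, inf2, spd2 in points
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical heuristics: (name, informedness, speed), both in [0, 1].
heuristics = [
    ("h_fast", 0.10, 0.95),
    ("h_mid", 0.55, 0.60),
    ("h_slow", 0.90, 0.15),
    ("h_dominated", 0.50, 0.40),
]
front = pareto_front(heuristics)
```

Here `h_dominated` drops out because `h_mid` is better on both axes, while the other three each represent a different compromise and stay on the frontier.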

Concerns:

Only 3 runs per configuration, acknowledged as a limitation. With high variance (blind-seeded scores spanning 0.528–0.606), statistical significance is uncertain.

The mid-experiment API credit exhaustion affecting medium-reasoning runs complicates interpretation of reasoning-effort effects.

All comparisons use greedy best-first search, which disadvantages admissible baselines (lmcut, merge-and-shrink) designed for optimal search. While acknowledged, this makes the "beating state of the art" claim somewhat qualified.

The 16-task improvement (368 vs. 352) is modest in absolute terms, though the authors argue marginal tasks are harder.
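For scale, the coverage figures quoted above work out as follows. The snippet only restates the reported counts (368, 352, and 720) as percentages; it adds no new data.

```python
TOTAL_TASKS = 720
evolved, baseline = 368, 352

gap = evolved - baseline                        # absolute difference in solved tasks
evolved_pct = 100 * evolved / TOTAL_TASKS       # share of test tasks solved by the evolved heuristic
baseline_pct = 100 * baseline / TOTAL_TASKS     # share solved by the strongest baseline
gap_points = evolved_pct - baseline_pct         # difference in percentage points
relative_gain_pct = 100 * gap / baseline        # gain relative to the baseline's count

print(f"{evolved_pct:.1f}% vs {baseline_pct:.1f}% "
      f"(+{gap} tasks, +{gap_points:.1f} points, +{relative_gain_pct:.1f}% relative)")
```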

Potential Impact

Immediate impact on AI planning: The evolved heuristics are practical artifacts—deterministic C++ programs with soundness/completeness guarantees inherited from the search framework. This contrasts favorably with LLM-as-planner approaches that suffer from hallucination and lack formal guarantees. The DTG-based heuristics (evolved-blind-none-3, evolved-blind-medium-conf) appear genuinely novel, using strategies distinct from established heuristic families.
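The drop-in property can be illustrated with a toy greedy best-first search in which the heuristic is just a function argument. This is a simplified sketch on a made-up integer domain, not the Scorpion planner's API: the heuristic only orders the open list, so any plan returned consists of applicable actions regardless of which heuristic is plugged in, and on a finite state space with duplicate detection the search terminates. It assumes the heuristic returns finite values for every state.

```python
import heapq
from itertools import count


def gbfs(initial, is_goal, successors, heuristic):
    """Greedy best-first search: always expand the open state with lowest h."""
    tie = count()  # unique tie-breaker so the heap never compares states or plans
    open_list = [(heuristic(initial), next(tie), initial, [])]
    closed = set()
    while open_list:
        _, _, state, plan = heapq.heappop(open_list)
        if state in closed:
            continue
        closed.add(state)
        if is_goal(state):
            return plan
        for action, nxt in successors(state):
            if nxt not in closed:
                entry = (heuristic(nxt), next(tie), nxt, plan + [action])
                heapq.heappush(open_list, entry)
    return None  # open list exhausted: no plan exists


# Toy domain (hypothetical): an integer counter bounded to [LOW, HIGH].
LOW, HIGH, GOAL = 0, 6, 5


def successors(state):
    if state < HIGH:
        yield ("inc", state + 1)
    if state > LOW:
        yield ("dec", state - 1)


def apply_plan(state, plan):
    """Replay a plan to check that it really reaches the claimed state."""
    for action in plan:
        state += 1 if action == "inc" else -1
    return state


def blind(state):
    return 0


def goal_distance(state):
    return abs(GOAL - state)


plan_blind = gbfs(0, lambda s: s == GOAL, successors, blind)
plan_informed = gbfs(0, lambda s: s == GOAL, successors, goal_distance)
no_plan = gbfs(0, lambda s: s == 99, successors, blind)
```

Swapping `blind` for `goal_distance` changes only the expansion order. Both calls return a plan that `apply_plan` confirms, and the unreachable goal yields `None` with either heuristic.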

Broader methodological impact: The finding that seeding from the trivial blind heuristic outperforms seeding from the strong FF heuristic is a counterintuitive and important insight for the growing field of LLM-guided program synthesis. It suggests that diversity in the evolutionary population matters more than starting from a strong point—a lesson applicable beyond planning.

For the AlphaEvolve/FunSearch paradigm: This paper extends LLM-driven code evolution to a problem class (domain-independent planning) that is fundamentally harder than prior targets. The success validates that this paradigm can tackle problems requiring deep algorithmic reasoning, not just numerical optimization.

Reproducibility concern: Reliance on cloud-hosted LLMs whose behavior may change limits exact reproducibility, though the evolved artifacts themselves are fully deterministic.

Timeliness & Relevance

This work sits at the confluence of two hot trends: LLM-driven code generation (AlphaEvolve, FunSearch) and the application of foundation models to classical AI problems. The planning community has long sought automated heuristic design, and this represents perhaps the most successful instance to date. The timing is excellent—it builds on very recent infrastructure (OpenEvolve, 2025 reasoning models) and addresses a recognized bottleneck (manual heuristic engineering taking years per breakthrough).

Strengths

1. First domain-independent LLM-evolved heuristics beating hand-engineered baselines—a genuine milestone.

2. Practical artifacts: C++ programs that inherit formal guarantees, unlike neural or LLM-at-inference approaches.

3. Novel heuristic strategies discovered: The DTG-distance heuristics with CG weighting and landmark levels appear to be genuinely new algorithmic ideas.

4. Informedness-speed tradeoff analysis provides a new lens for comparing heuristics.

5. Counterintuitive finding about blind seeding outperforming FF seeding has broad implications for evolutionary program synthesis.

6. Thorough appendix with full source code, prompts, and a detailed experimental setup, which aids reproducibility of the framework (if not of the exact LLM outputs).

Limitations

1. Restricted to typed STRIPS—doesn't address ADL, numeric, or temporal planning.

2. Statistical power: 3 runs per configuration with high variance.

3. Greedy search only: Admissible baselines are disadvantaged; no evaluation in optimal planning settings.

4. Cost and scalability: ~$16–33 per run plus cluster time; unclear how this scales to longer evolution budgets or richer planning formalisms.

5. LLM memorization: The best heuristic (evolved-blind-medium-2) is essentially an FF variant, suggesting the LLM's training data—not creative invention—drives the strongest result. The truly novel DTG heuristics don't achieve the highest coverage.

6. No analysis of plan quality (cost optimality), only coverage and time.

Additional Observations

The paper's most intellectually interesting contribution may be the DTG-based heuristics, which represent genuinely novel algorithmic ideas that could inspire new hand-engineered heuristics. However, the headline result (368 vs. 352 coverage) rests on the FF variant, where the LLM is arguably recombining known techniques rather than inventing new ones. This tension between "beating baselines via recombination" and "discovering new algorithmic ideas" deserves further exploration.

The MAP-Elites framework with informedness-speed axes is well-motivated and could become a standard evaluation framework for planning heuristics beyond this paper.

Rating: 7.8/10

Significance 8 · Rigor 7 · Novelty 8 · Clarity 8.5

Generated May 29, 2026

Comparison History (21)

vs. The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

claude-opus-4.6 · 5/29/2026

Paper 1 demonstrates a breakthrough in AI planning by using LLMs with evolutionary search to produce domain-independent heuristics that exceed decades of hand-engineered state-of-the-art. This has broad impact across AI planning, combinatorial optimization, and automated algorithm design. The results are practically deployable as drop-in replacements in existing planners. Paper 2 identifies an important failure mode (unfaithful capitulation) in reasoning models, which is valuable for AI safety, but is more diagnostic/observational in nature with narrower scope. Paper 1's methodological innovation and demonstrated superiority over established baselines suggest higher long-term scientific impact.

vs. Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

gemini-3.1 · 5/29/2026

Paper 1 presents a fundamental algorithmic breakthrough by generating the first LLM-evolved domain-independent heuristics that outperform decades of human-engineered state-of-the-art heuristics in symbolic AI planning. This advancement has broad, high-impact applications in optimization, robotics, and logistics. While Paper 2 provides valuable insights into meta-science and literature search evaluation, Paper 1's contribution significantly advances core AI capabilities and algorithmic discovery, offering wider and more profound cross-disciplinary impact.

vs. SAAS: Self-Aware Reinforcement Learning for Over-Search Mitigation in Agentic Search

claude-opus-4.6 · 5/29/2026

Paper 2 presents the first LLM-generated domain-independent planning heuristics that exceed decades of hand-engineered state-of-the-art, representing a fundamental breakthrough in AI planning. This has broader impact across symbolic AI, automated algorithm design, and program synthesis. The results are drop-in replacements with formal guarantees, enabling immediate practical adoption. Paper 1 addresses an important but more incremental optimization problem (reducing over-search in agentic LLM systems), which is valuable but narrower in scope and more likely to be superseded as agentic architectures evolve.

vs. Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it reports the first LLM-generated domain-independent planning heuristics surpassing hand-engineered state of the art, with immediate practical applicability as drop-in C++ replacements across existing planners. The evolutionary+MAP-Elites framework is broadly reusable for program synthesis beyond planning, and the results are timely given current interest in LLM-based code generation and automated algorithm design. Paper 1 is novel and theoretically strong (causal nested bandits + PAC-Bayes certification), but its impact may be narrower and contingent on adoption in specialized causal RL settings.

vs. From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact due to stronger novelty (first LLM-evolved domain-independent heuristics surpassing hand-engineered SOTA), broad applicability across symbolic planning domains, and immediate real-world usability as drop-in C++ heuristics with preserved soundness/completeness. Its methodology (evolutionary search + MAP-Elites + systematic informedness–speed benchmarking) is rigorous and general. Paper 1 is innovative and useful for privacy-preserving mobility data, but is more domain-specific (trajectory synthesis) and depends on learned generative realism and dataset-specific evaluations, potentially limiting breadth compared to planning heuristics that can affect many AI systems.

vs. mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

claude-opus-4.6 · 5/29/2026

Paper 1 presents a fundamentally novel contribution: the first LLM-generated domain-independent heuristics that exceed decades of hand-engineered state-of-the-art in AI planning. It combines LLM-driven code evolution with MAP-Elites, demonstrates strong empirical results on unseen domains, and provides drop-in replacements for existing planners. This advances core AI methodology with broad implications. Paper 2 describes a useful but incremental engineering tool—an MCP server wrapping existing knowledge graphs for natural language access—with limited novelty beyond integration work.

vs. Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

gpt-5.2 · 5/29/2026

Paper 1 likely has higher impact: it claims the first LLM-generated domain-independent planning heuristics that surpass hand-engineered state of the art, a clear novelty with strong methodological framing (evolution + MAP-Elites, explicit informedness/speed Pareto analysis, unseen-domain testing). Results are directly deployable (drop-in C++ heuristics) and could influence both symbolic planning and LLM-for-code research. Paper 2 is timely and useful for KG consistency in QA, but resembles an incremental pipeline refinement (post-extraction correction/canonicalization) with narrower, application-specific impact and less clearly demonstrated step-change over prior neuro-symbolic KG construction work.

vs. Formalizing Mathematics at Scale

gemini-3.1 · 5/29/2026

Paper 1 represents a massive leap in the formalization of mathematics, providing a highly sought-after resource (a large-scale, machine-checked library in Lean 4) and a scalable framework. This directly impacts the development of reasoning AI models and the broader mathematical community. While Paper 2 is innovative in symbolic planning, Paper 1's contribution has broader implications for automated reasoning, AI training data, and the future of mathematical research.

vs. LACUNA: Safe Agents as Recursive Program Holes

gemini-3.1 · 5/29/2026

Paper 1 represents a significant breakthrough by using LLMs to automatically discover domain-independent heuristics that exceed decades of hand-engineered state-of-the-art methods in symbolic AI planning. This bridges modern generative AI with traditional symbolic AI in a highly impactful way. While Paper 2 offers a valuable programming model for agent safety and expressiveness, Paper 1's demonstrated ability to automatically discover superior algorithms for NP-hard problems has broader implications for automated scientific discovery and combinatorial optimization.

vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

claude-opus-4.6 · 5/29/2026

Paper 2 demonstrates a breakthrough by producing the first LLM-generated domain-independent planning heuristics that exceed decades of hand-engineered state-of-the-art, with broad implications for AI planning, automated algorithm design, and LLM-guided program synthesis. The results are practically deployable as drop-in replacements in existing planners. Paper 1 makes a valuable diagnostic contribution about search agent evaluation and introduces a temporal benchmark, but its impact is more incremental—primarily methodological critique and a benchmark that will require continuous updating. Paper 2's fusion of LLMs with evolutionary search to surpass human-expert-designed heuristics represents a more fundamental and broadly applicable advance.

vs. DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it targets a broadly relevant, timely bottleneck in LLM reasoning—reducing reliance on stronger teachers and curated data—using an RL framework that could generalize across many tasks and domains. If robust, it can influence training pipelines for frontier and open models and connect to RL, self-correction, and scalable alignment research. Paper 1 is highly novel and rigorous within symbolic planning and valuable for that community, but its field breadth and downstream adoption potential are narrower than a general RL-for-reasoning method in LLMs.

vs. Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

claude-opus-4.6 · 5/29/2026

Paper 2 presents a constructive advance—the first LLM-generated domain-independent planning heuristics that exceed decades of hand-engineered baselines. This is a significant milestone bridging LLMs and symbolic AI, with immediate practical applicability (drop-in C++ replacements) and broad impact across AI planning. Paper 1 raises an important cautionary finding about CoT distillation degrading reasoning quality despite improved accuracy, which is valuable for the LLM evaluation community. However, Paper 2's contribution is more actionable and generalizable, opening a new research paradigm (LLM-driven algorithm evolution) with clear benchmarks showing state-of-the-art results.

vs. Temporal Stability and Few-Shot Prompting in Math Task Assessment

gpt-5.2 · 5/29/2026

Paper 2 is more novel and broadly impactful: it claims the first LLM-evolved domain-independent planning heuristics surpassing hand-engineered state of the art, with a rigorous evolutionary/MAP-Elites methodology and extensive benchmarking on unseen domains. The result is immediately deployable (drop-in C++ heuristics) and relevant to core symbolic planning and automated reasoning, with potential spillover to program synthesis and hybrid neuro-symbolic systems. Paper 1 is timely and useful for AI-in-education evaluation, but its contributions are more incremental (tool/version/prompt stability study) and narrower in scope and generality.

vs. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

claude-opus-4.6 · 5/29/2026

Paper 1 demonstrates mechanistic interpretability at production scale for the first time, revealing interpretable features (including safety-relevant ones like deception and power-seeking) in a frontier AI model. This has enormous implications for AI safety, alignment, and understanding neural networks—fields of rapidly growing importance. Its breadth of impact spans interpretability, safety, multimodal understanding, and steering model behavior. While Paper 2 is a strong contribution to AI planning, its impact is narrower. Paper 1 opens a new paradigm for understanding large language models at scale, with far-reaching consequences across AI research.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gpt-5.2 · 5/29/2026

Paper 2 likely has higher impact: it introduces a novel, concrete method (LLM-guided evolutionary program synthesis with MAP-Elites) that produces the first domain-independent, LLM-generated planning heuristics surpassing hand-engineered state of the art. This is a clear, measurable advance with immediate real-world applicability as drop-in C++ replacements in existing planners, and it bridges LLMs, program synthesis, evolutionary search, and symbolic planning—broadening cross-field relevance. Paper 1 offers valuable insights into CoT compression dynamics, but is more incremental/diagnostic and its applications are less directly actionable.

vs. Differentiable Belief-based Opponent Shaping

gemini-3.1 · 5/29/2026

Paper 2 achieves a major milestone by using LLMs to evolve domain-independent heuristics that surpass decades of hand-engineered state-of-the-art in symbolic planning. Its approach bridges LLMs, evolutionary search, and symbolic AI, offering high practical utility as drop-in C++ replacements. This broad applicability across diverse planning domains provides significantly higher potential real-world and scientific impact compared to Paper 1, which focuses on the narrower, albeit interesting, subfield of opponent shaping in multi-agent hidden-role games.

vs. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

gpt-5.2 · 5/29/2026

Paper 1 likely has higher impact: it claims the first LLM-generated domain-independent planning heuristics surpassing hand-engineered state of the art, with drop-in C++ artifacts and broad relevance to heuristic search, automated planning, program synthesis, and LLM+evolution methods. The approach appears methodologically structured (MAP-Elites archive, informedness–speed tradeoff benchmarking, unseen domains) and yields immediately usable improvements. Paper 2 is timely and important for AI safety, but is explicitly preliminary (small n, limited model/SAE coverage, hackathon build), narrowing methodological rigor and generalizability despite interesting auditing ideas.

vs. Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

gemini-3.1 · 5/29/2026

Paper 1 represents a major breakthrough by automatically generating domain-independent heuristics that outperform decades of hand-engineered state-of-the-art methods in symbolic AI. This demonstrates a novel and highly impactful use of LLMs for automated algorithm design. Paper 2, while practically useful for engineering agentic systems, is primarily a benchmark study evaluating existing token-optimization formats against JSON, offering incremental engineering insights rather than fundamental scientific advancements.

vs. TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact: it reports the first LLM-generated domain-independent planning heuristics surpassing hand-engineered state of the art, a concrete, broadly applicable advance with immediate drop-in use in existing planners and preserved theoretical guarantees. The evolutionary+MAP-Elites methodology and extensive informedness–speed benchmarking are rigorous and likely reproducible, with relevance beyond planning (program synthesis, automated algorithm design). Paper 1 is timely and interesting for multi-LLM RL, but its impact may be narrower and more benchmark/protocol-dependent, and claims of convergence may rely on restrictive action abstractions.

vs. Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact due to a clearer methodological innovation (LLM-guided evolutionary search yielding the first domain-independent LLM-generated planning heuristics surpassing hand-engineered SOTA), strong, generalizable technical contribution, and immediate drop-in applicability across many symbolic planners and domains. It also advances evaluation practice by benchmarking informedness–speed tradeoffs. Paper 1 is timely and useful as a descriptive registry analysis and workflow exploration, but its novelty and cross-field technical contribution are more limited, with results constrained by trial record quality and small labeled sample size.