Property-Guided LLM Program Synthesis for Planning

Augusto B. Corrêa, André G. Pereira, Jendrik Seipp

May 15, 2026

Frozen v1 — this version was superseded on arXiv. Stats below reflect the state at freeze time and will not change.View latest (v2) →

#562of 2292·Artificial Intelligence

#562 of 2292 · Artificial Intelligence

Tournament Score

1464±46

10501800

71%

Win Rate

Wins

Losses

Matches

Rating

7.5/ 10

Significance7.5

Rigor7

Novelty7.5

Clarity8.5

Tournament Score

1464±46

10501800

71%

Win Rate

Wins

Losses

Matches

Rating

7.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

1. Core Contribution

The paper introduces property-guided LLM program synthesis, bridging counterexample-guided inductive synthesis (CEGIS) from formal methods with LLM-based program generation. Instead of scoring candidate programs with numeric fitness values (e.g., coverage, test-pass rate), the method checks whether a candidate satisfies a formally defined property—here, the *direct* property for heuristic functions in classical planning. When violated, the system provides the LLM with a concrete counterexample (the failing state, its heuristic value, and successor values), enabling targeted repair rather than blind resampling.

The specific application targets synthesis of direct heuristic functions for PDDL planning domains—heuristics that guarantee hill climbing reaches a goal without backtracking. The "direct" property (a relaxation of DDA from Dold & Helmert, 2024) is well-chosen: violations are local, concrete, and actionable, making them ideal feedback for an LLM repair loop.

2. Methodological Rigor

The experimental design is generally sound:

Fair comparison: The authors re-run the sample-and-select baseline (Corrêa et al., 2025) with the same LLM (Gemini 3.1 Pro), same benchmark (IPC 2023 Learning Track, 10 domains, 900 test tasks), and same planning framework (Pyperplan).

Out-of-distribution evaluation: Test tasks are significantly larger than training tasks, testing generalization rigorously.

Multiple runs: Three independent runs with reported standard deviations.

Ablation study: The coverage-feedback ablation (Cov.) isolates the contribution of the direct property from the repair loop structure itself.

However, there are methodological caveats:

Only one LLM (Gemini 3.1 Pro) is used, though the authors justify this as sufficient for their research question.

The validation timeout introduces a subtle issue: if validation times out before finding a counterexample, the heuristic is assumed direct on that task. This could allow non-direct heuristics to pass, though the near-perfect test-time generalization suggests this is not a major problem in practice.

The comparison between HC (their method) and GBFS (baselines) conflates algorithm choice with heuristic quality. The authors partially address this by running baselines under HC as well, showing dramatic degradation, which validates that the property—not the algorithm—is the key differentiator.

The standard deviations for some domains (e.g., Rovers: 42.7±13.6, Satellite: 55.7±14.5) are substantial, suggesting sensitivity to the LLM's stochastic generation.

3. Potential Impact

Within AI planning: This work demonstrates that LLM-synthesized heuristics can solve planning tasks via pure hill climbing (no search), which is remarkable. The 623.3 tasks solved out of 900 via HC substantially exceeds the 573 solved by sample-and-select with GBFS, while using ~7.4× fewer LLM calls and ~1,150× less evaluation compute. This efficiency gain is practically significant for scaling LLM-based planning.

Broader program synthesis: The general framework—using formal property verification with counterexample feedback for LLM synthesis—is domain-agnostic in principle. The paper's contribution is demonstrating that this CEGIS+LLM paradigm works in practice with substantial efficiency gains. This could influence how LLM-based synthesis is approached in optimization, algorithm design (à la FunSearch/AlphaEvolve), and software verification.

Limitations on breadth: The approach requires a formally checkable property with actionable counterexamples. The authors acknowledge this, but it remains a significant constraint. Many synthesis problems lack such clean property specifications.

4. Timeliness & Relevance

The paper addresses a current bottleneck in LLM-based program synthesis: the high cost of generate-and-evaluate approaches. As LLMs are increasingly used for algorithm discovery (FunSearch, AlphaEvolve, EoH), the question of how to provide structured feedback beyond numeric scores is timely. The CEGIS paradigm is well-established in formal methods but underexplored with LLMs—this paper makes a compelling case for the combination.

The planning application is also timely given the intense recent interest in LLMs for planning, and the growing recognition that LLMs struggle with planning tasks directly but can serve as program synthesizers for planning components.

5. Strengths & Limitations

Key Strengths:

Elegant problem formulation: The direct property is a natural synthesis target—it provides both a correctness guarantee (HC completeness) and natural counterexamples.

Dramatic efficiency gains: 7.4× fewer LLM calls and 1,150× less evaluation compute are practically meaningful numbers.

Strong generalization: The near-perfect transfer of the direct property from small training tasks to large OOD test tasks is the paper's most striking empirical finding—only 1 out of 2,700 evaluations failed due to property violation.

Informative qualitative analysis: The Blocksworld and Satellite case studies provide genuine insight into what the LLM learns through the repair loop (e.g., dependency graph construction, sequential reasoning about satellite operations).

Clean ablation: The coverage-feedback ablation convincingly demonstrates that the property, not just iterative feedback, drives improvement.

Key Limitations:

Single LLM: Generalizability across models is untested.

Domain specificity: The direct property is specific to planning heuristics. While the paper argues the general framework applies "whenever a problem admits a verifiable property," no other applications are demonstrated.

Scalability of validation: DFS-based validation can be exponential in the worst case, mitigated only by timeouts—a pragmatic but theoretically unsatisfying solution.

No formal guarantees on test tasks: The method provides no certificate that the heuristic is direct on unseen tasks; generalization is purely empirical.

Limited baseline comparison: Only one prior LLM method is compared (Corrêa et al., 2025). Comparisons with learning-based heuristic approaches (GNNs, description logic features) would strengthen the evaluation.

6. Additional Observations

The paper is well-written with clear exposition of the planning background and the direct property. The inclusion of full prompts (Appendix D) aids reproducibility. The connection between the theoretical hardness of synthesizing direct heuristics (EXPSPACE-hard per task) and the practical success of domain-general synthesis via LLMs is intellectually interesting—the LLM effectively amortizes this complexity through generalization.

The work can be seen as a step toward "verified LLM synthesis," an emerging research direction at the intersection of LLMs and formal methods. The practical success here should encourage further exploration of property-guided approaches in other synthesis domains.

Rating:7.5/ 10

Significance 7.5Rigor 7Novelty 7.5Clarity 8.5

Generated May 18, 2026

Comparison History (17)

vs. Learning Quantifiable Visual Explanations Without Ground-Truth

gemini-3.15/19/2026

Paper 2 integrates Large Language Models with formal verification, addressing major bottlenecks (cost and reliability) in LLM-driven program synthesis. By using concrete counterexamples instead of simple scores, it drastically improves efficiency and correctness. Given the exploding interest in LLMs for code generation and automated planning, this methodology has higher potential for broad, immediate impact across AI and software engineering compared to Paper 1's XAI evaluation metric.

vs. Responsible Agentic AI Requires Explicit Provenance

claude-opus-4.65/19/2026

Paper 1 presents a concrete, novel methodology (property-guided LLM program synthesis with counterexample feedback) with rigorous empirical evaluation showing significant improvements (7x fewer generations, orders of magnitude less computation). It addresses a practical bottleneck in LLM-based program synthesis with a generalizable approach. Paper 2 is a position paper proposing a conceptual framework for AI provenance/responsibility with only preliminary experiments. While timely, it lacks the methodological depth and empirical validation of Paper 1, and its impact depends on adoption of a framework rather than demonstrating concrete technical advances.

vs. CVEvolve: Autonomous Algorithm Discovery for Unstructured Scientific Data Processing

gpt-5.25/18/2026

Paper 2 has higher likely scientific impact due to a more clearly novel methodological contribution: replacing scalar rewards with formal property checking plus counterexample-guided feedback, yielding large efficiency gains and stronger programs. This is methodologically rigorous (verifiable properties, early stopping, OOD tests) and broadly applicable to program synthesis, verification, planning, and agentic coding whenever specs/properties exist. It is timely given growing concern over LLM inference/evaluation cost. Paper 1 is valuable for scientific workflows, but appears more systems/engineering-focused and domain-demonstration-heavy with potentially less generalizable methodological novelty.

vs. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

gpt-5.25/18/2026

Paper 2 presents a more technically novel and broadly reusable methodology: integrating formal property checking and counterexample-guided feedback into LLM program synthesis, yielding large efficiency gains and improved solution quality on multiple planning domains with OOD tests. It is timely (LLM tooling + verification), has clear real-world applications (planning, synthesis, reliable AI), and offers methodological rigor via formal properties and measurable compute reductions. Paper 1 is valuable meta-science with a useful dataset/tool, but its impact is more observational and likely narrower, with less direct algorithmic or practical leverage.

vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents

claude-opus-4.65/18/2026

Paper 1 introduces a principled methodology—property-guided LLM program synthesis with counterexample feedback—that generalizes broadly beyond planning to any domain with verifiable properties. It demonstrates significant efficiency gains (7x fewer generations, orders of magnitude less computation) with strong theoretical grounding in formal verification. Paper 2 addresses an important but narrower e-commerce personalization problem. While practically valuable, its contributions are more domain-specific. Paper 1's framework has broader cross-field applicability (program synthesis, formal methods, AI planning) and introduces a paradigm shift in how LLMs are guided during code generation.

vs. Joint Consistency: A Unified Test-Time Aggregation Framework via Energy Minimization

claude-opus-4.65/18/2026

Paper 1 introduces a principled unified framework (Joint Consistency) for test-time aggregation that subsumes existing methods as special cases, grounded in Ising-type energy minimization with theoretical guarantees. Its breadth of impact is wider—applicable to any LLM reasoning task using test-time compute scaling, a highly active research area. Paper 2 presents a valuable but more narrowly scoped contribution focused on property-guided synthesis for PDDL planning. While methodologically sound, its applicability is limited to domains with formally verifiable properties, reducing its broader impact compared to Paper 1's general framework.

vs. Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

gemini-3.15/18/2026

Paper 1 addresses a highly active research area by introducing a novel, stateful continuous learning framework for LLM agents without requiring weight updates. Its architecture, integrating RL-style updates into agent knowledge networks, offers a broadly applicable solution to the limitations of stateless agents. The impressive state-of-the-art results on rigorous competitive programming benchmarks suggest wider immediate applicability and higher potential impact across AI and software engineering communities compared to the more specialized, though methodologically rigorous, planning focus of Paper 2.

vs. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

gpt-5.25/18/2026

Paper 2 has higher potential impact due to a more novel, generalizable methodological contribution: replacing scalar rewards with formally specified properties and counterexample-guided feedback for LLM program synthesis. This is rigorous (formal property checking, early stopping, concrete counterexamples), cost-reducing, and broadly applicable to planning, synthesis, verification, and agent tooling wherever verifiable properties exist. Results span 10 domains with OOD tests and large compute savings, aligning with timely interest in reliable, efficient LLM reasoning. Paper 1 is promising but narrower (single benchmark/attacker) and less general/rigorous.

vs. SLASH the Sink: Sharpening Structural Attention Inside LLMs

claude-opus-4.65/18/2026

Paper 2 addresses a fundamental mechanism inside LLMs—how they process structural/graph information—and provides both theoretical insight (attention sink dilution as a representation bottleneck) and a training-free, generalizable solution (SLASH). This has broader impact across many fields involving graph-structured data (molecules, knowledge graphs, social networks). Paper 1, while methodologically sound and practically useful, addresses a more specialized intersection of LLM program synthesis and PDDL planning. Paper 2's mechanistic insights into LLM internals and its wide applicability give it higher potential for cross-disciplinary impact.

vs. Sign-Separated Finite-Time Error Analysis of Q-Learning

claude-opus-4.65/18/2026

Paper 2 introduces a broadly applicable methodology—property-guided LLM program synthesis with counterexample feedback—that bridges formal methods and LLM-based code generation, with demonstrated practical impact (7x fewer generations, orders of magnitude less computation). It addresses a timely problem in the rapidly growing LLM/AI planning space with clear real-world applications. Paper 1, while technically rigorous, provides incremental theoretical refinements to Q-learning error analysis with narrower audience appeal and less immediate practical impact.

vs. ALSO: Adversarial Online Strategy Optimization for Social Agents

gemini-3.15/18/2026

Paper 2 addresses a fundamental limitation in LLM program synthesis by replacing scalar scores with formal property verification and counterexample-guided repair. This approach significantly improves efficiency and correctness. Its methodology has broad and highly impactful applications across automated planning, software engineering, and general code generation, whereas Paper 1 focuses on the relatively more niche area of social simulation agents.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

claude-opus-4.65/18/2026

Paper 1 introduces a novel and practical methodology—property-guided LLM program synthesis with counterexample feedback—that significantly reduces computational costs (7x fewer programs, orders of magnitude less evaluation) while improving quality. This approach is broadly applicable beyond planning to any domain with verifiable properties, offering a general paradigm shift in LLM-based program synthesis. Paper 2, while valuable for legal AI, addresses a narrower domain (tax law) and confirms relatively expected findings (contamination inflates performance, neuro-symbolic approaches are more robust). Paper 1's methodological contribution has broader cross-field impact and greater novelty.

vs. Imperfect World Models are Exploitable

claude-opus-4.65/18/2026

Paper 2 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal definitions, impossibility results, and safe planning horizons. This has broad implications across RL, AI safety, and model-based planning. Paper 1 presents a useful engineering contribution (property-guided LLM synthesis for planning heuristics) with strong empirical results, but its scope is narrower. Paper 2's theoretical framework is more likely to influence multiple research directions and has higher relevance given growing concerns about AI safety and reliable world models.

vs. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

gemini-3.15/18/2026

Paper 2 addresses a critical and universal bottleneck in modern AI: the massive computational cost of configuring and training LLMs. By introducing an agentic framework that extrapolates from cheap to expensive experiments, supported by a massive multi-fidelity environment (>1M GPU hours), it offers immense practical and economic value. While Paper 1 provides a strong methodological advance in formal program synthesis, Paper 2's potential to democratize and optimize foundational LLM research gives it a broader and more immediate scientific impact across the entire field.

vs. When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration

gpt-5.25/18/2026

Paper 2 has higher potential impact due to a more broadly reusable methodological contribution: integrating formal, counterexample-guided property checking with LLM program synthesis. This can substantially reduce inference/evaluation cost and improve correctness, and it naturally transfers to many domains where verifiable properties exist (planning, synthesis, verification, robotics, compilers). It is timely (LLM cost/correctness) and offers clear real-world applications. Paper 1 is valuable and rigorous, but its main contribution is diagnostic/empirical guidance for context injection in multi-agent design exploration, likely narrower in downstream uptake.

vs. X-SYNTH: Beyond Retrieval -- Enterprise Context Synthesis from Observed Human Attention

gpt-5.25/18/2026

Paper 2 has higher potential impact due to a more generally applicable and methodologically rigorous contribution: integrating formal, property-based verification with counterexample-guided feedback for LLM program synthesis. This bridges LLMs with formal methods, reduces compute substantially, and is likely transferable across domains where properties are specifiable (planning, synthesis, verification, robotics). Paper 1 shows strong applied gains but appears more domain- and data-instrumentation-dependent (enterprise attention traces) and may raise generalization/privacy concerns, potentially limiting breadth and reproducibility.

vs. AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

claude-opus-4.65/18/2026

Paper 1 introduces a fundamentally new paradigm—property-guided LLM program synthesis with counterexample feedback—that is broadly applicable whenever verifiable properties exist. It demonstrates dramatic efficiency gains (7x fewer programs, orders of magnitude less computation) with strong empirical results. Paper 2 proposes an incremental improvement to RLVR training (AGPO), addressing boundary shrinkage with asymmetric updates. While solid and industrially validated, it represents a more incremental contribution within existing RL-for-LLM frameworks. Paper 1's novel synthesis paradigm has broader cross-field applicability (planning, verification, program synthesis) and deeper methodological innovation.