Property-Guided LLM Program Synthesis for Planning
Augusto B. Corrêa, André G. Pereira, Jendrik Seipp
Abstract
LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
AI Impact Assessments
(1 models)Scientific Impact Assessment
1. Core Contribution
The paper introduces property-guided LLM program synthesis, bridging counterexample-guided inductive synthesis (CEGIS) from formal methods with LLM-based program generation. Instead of scoring candidate programs with numeric fitness values (e.g., coverage, test-pass rate), the method checks whether a candidate satisfies a formally defined property—here, the *direct* property for heuristic functions in classical planning. When violated, the system provides the LLM with a concrete counterexample (the failing state, its heuristic value, and successor values), enabling targeted repair rather than blind resampling.
The specific application targets synthesis of direct heuristic functions for PDDL planning domains—heuristics that guarantee hill climbing reaches a goal without backtracking. The "direct" property (a relaxation of DDA from Dold & Helmert, 2024) is well-chosen: violations are local, concrete, and actionable, making them ideal feedback for an LLM repair loop.
2. Methodological Rigor
The experimental design is generally sound:
However, there are methodological caveats:
3. Potential Impact
Within AI planning: This work demonstrates that LLM-synthesized heuristics can solve planning tasks via pure hill climbing (no search), which is remarkable. The 623.3 tasks solved out of 900 via HC substantially exceeds the 573 solved by sample-and-select with GBFS, while using ~7.4× fewer LLM calls and ~1,150× less evaluation compute. This efficiency gain is practically significant for scaling LLM-based planning.
Broader program synthesis: The general framework—using formal property verification with counterexample feedback for LLM synthesis—is domain-agnostic in principle. The paper's contribution is demonstrating that this CEGIS+LLM paradigm works in practice with substantial efficiency gains. This could influence how LLM-based synthesis is approached in optimization, algorithm design (à la FunSearch/AlphaEvolve), and software verification.
Limitations on breadth: The approach requires a formally checkable property with actionable counterexamples. The authors acknowledge this, but it remains a significant constraint. Many synthesis problems lack such clean property specifications.
4. Timeliness & Relevance
The paper addresses a current bottleneck in LLM-based program synthesis: the high cost of generate-and-evaluate approaches. As LLMs are increasingly used for algorithm discovery (FunSearch, AlphaEvolve, EoH), the question of how to provide structured feedback beyond numeric scores is timely. The CEGIS paradigm is well-established in formal methods but underexplored with LLMs—this paper makes a compelling case for the combination.
The planning application is also timely given the intense recent interest in LLMs for planning, and the growing recognition that LLMs struggle with planning tasks directly but can serve as program synthesizers for planning components.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
6. Additional Observations
The paper is well-written with clear exposition of the planning background and the direct property. The inclusion of full prompts (Appendix D) aids reproducibility. The connection between the theoretical hardness of synthesizing direct heuristics (EXPSPACE-hard per task) and the practical success of domain-general synthesis via LLMs is intellectually interesting—the LLM effectively amortizes this complexity through generalization.
The work can be seen as a step toward "verified LLM synthesis," an emerging research direction at the intersection of LLMs and formal methods. The practical success here should encourage further exploration of property-guided approaches in other synthesis domains.
Generated May 18, 2026
Comparison History (17)
Paper 2 integrates Large Language Models with formal verification, addressing major bottlenecks (cost and reliability) in LLM-driven program synthesis. By using concrete counterexamples instead of simple scores, it drastically improves efficiency and correctness. Given the exploding interest in LLMs for code generation and automated planning, this methodology has higher potential for broad, immediate impact across AI and software engineering compared to Paper 1's XAI evaluation metric.
Paper 1 presents a concrete, novel methodology (property-guided LLM program synthesis with counterexample feedback) with rigorous empirical evaluation showing significant improvements (7x fewer generations, orders of magnitude less computation). It addresses a practical bottleneck in LLM-based program synthesis with a generalizable approach. Paper 2 is a position paper proposing a conceptual framework for AI provenance/responsibility with only preliminary experiments. While timely, it lacks the methodological depth and empirical validation of Paper 1, and its impact depends on adoption of a framework rather than demonstrating concrete technical advances.
Paper 2 has higher likely scientific impact due to a more clearly novel methodological contribution: replacing scalar rewards with formal property checking plus counterexample-guided feedback, yielding large efficiency gains and stronger programs. This is methodologically rigorous (verifiable properties, early stopping, OOD tests) and broadly applicable to program synthesis, verification, planning, and agentic coding whenever specs/properties exist. It is timely given growing concern over LLM inference/evaluation cost. Paper 1 is valuable for scientific workflows, but appears more systems/engineering-focused and domain-demonstration-heavy with potentially less generalizable methodological novelty.
Paper 2 presents a more technically novel and broadly reusable methodology: integrating formal property checking and counterexample-guided feedback into LLM program synthesis, yielding large efficiency gains and improved solution quality on multiple planning domains with OOD tests. It is timely (LLM tooling + verification), has clear real-world applications (planning, synthesis, reliable AI), and offers methodological rigor via formal properties and measurable compute reductions. Paper 1 is valuable meta-science with a useful dataset/tool, but its impact is more observational and likely narrower, with less direct algorithmic or practical leverage.
Paper 1 introduces a principled methodology—property-guided LLM program synthesis with counterexample feedback—that generalizes broadly beyond planning to any domain with verifiable properties. It demonstrates significant efficiency gains (7x fewer generations, orders of magnitude less computation) with strong theoretical grounding in formal verification. Paper 2 addresses an important but narrower e-commerce personalization problem. While practically valuable, its contributions are more domain-specific. Paper 1's framework has broader cross-field applicability (program synthesis, formal methods, AI planning) and introduces a paradigm shift in how LLMs are guided during code generation.
Paper 1 introduces a principled unified framework (Joint Consistency) for test-time aggregation that subsumes existing methods as special cases, grounded in Ising-type energy minimization with theoretical guarantees. Its breadth of impact is wider—applicable to any LLM reasoning task using test-time compute scaling, a highly active research area. Paper 2 presents a valuable but more narrowly scoped contribution focused on property-guided synthesis for PDDL planning. While methodologically sound, its applicability is limited to domains with formally verifiable properties, reducing its broader impact compared to Paper 1's general framework.
Paper 1 addresses a highly active research area by introducing a novel, stateful continuous learning framework for LLM agents without requiring weight updates. Its architecture, integrating RL-style updates into agent knowledge networks, offers a broadly applicable solution to the limitations of stateless agents. The impressive state-of-the-art results on rigorous competitive programming benchmarks suggest wider immediate applicability and higher potential impact across AI and software engineering communities compared to the more specialized, though methodologically rigorous, planning focus of Paper 2.
Paper 2 has higher potential impact due to a more novel, generalizable methodological contribution: replacing scalar rewards with formally specified properties and counterexample-guided feedback for LLM program synthesis. This is rigorous (formal property checking, early stopping, concrete counterexamples), cost-reducing, and broadly applicable to planning, synthesis, verification, and agent tooling wherever verifiable properties exist. Results span 10 domains with OOD tests and large compute savings, aligning with timely interest in reliable, efficient LLM reasoning. Paper 1 is promising but narrower (single benchmark/attacker) and less general/rigorous.
Paper 2 addresses a fundamental mechanism inside LLMs—how they process structural/graph information—and provides both theoretical insight (attention sink dilution as a representation bottleneck) and a training-free, generalizable solution (SLASH). This has broader impact across many fields involving graph-structured data (molecules, knowledge graphs, social networks). Paper 1, while methodologically sound and practically useful, addresses a more specialized intersection of LLM program synthesis and PDDL planning. Paper 2's mechanistic insights into LLM internals and its wide applicability give it higher potential for cross-disciplinary impact.
Paper 2 introduces a broadly applicable methodology—property-guided LLM program synthesis with counterexample feedback—that bridges formal methods and LLM-based code generation, with demonstrated practical impact (7x fewer generations, orders of magnitude less computation). It addresses a timely problem in the rapidly growing LLM/AI planning space with clear real-world applications. Paper 1, while technically rigorous, provides incremental theoretical refinements to Q-learning error analysis with narrower audience appeal and less immediate practical impact.
Paper 2 addresses a fundamental limitation in LLM program synthesis by replacing scalar scores with formal property verification and counterexample-guided repair. This approach significantly improves efficiency and correctness. Its methodology has broad and highly impactful applications across automated planning, software engineering, and general code generation, whereas Paper 1 focuses on the relatively more niche area of social simulation agents.
Paper 1 introduces a novel and practical methodology—property-guided LLM program synthesis with counterexample feedback—that significantly reduces computational costs (7x fewer programs, orders of magnitude less evaluation) while improving quality. This approach is broadly applicable beyond planning to any domain with verifiable properties, offering a general paradigm shift in LLM-based program synthesis. Paper 2, while valuable for legal AI, addresses a narrower domain (tax law) and confirms relatively expected findings (contamination inflates performance, neuro-symbolic approaches are more robust). Paper 1's methodological contribution has broader cross-field impact and greater novelty.
Paper 2 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal definitions, impossibility results, and safe planning horizons. This has broad implications across RL, AI safety, and model-based planning. Paper 1 presents a useful engineering contribution (property-guided LLM synthesis for planning heuristics) with strong empirical results, but its scope is narrower. Paper 2's theoretical framework is more likely to influence multiple research directions and has higher relevance given growing concerns about AI safety and reliable world models.
Paper 2 addresses a critical and universal bottleneck in modern AI: the massive computational cost of configuring and training LLMs. By introducing an agentic framework that extrapolates from cheap to expensive experiments, supported by a massive multi-fidelity environment (>1M GPU hours), it offers immense practical and economic value. While Paper 1 provides a strong methodological advance in formal program synthesis, Paper 2's potential to democratize and optimize foundational LLM research gives it a broader and more immediate scientific impact across the entire field.
Paper 2 has higher potential impact due to a more broadly reusable methodological contribution: integrating formal, counterexample-guided property checking with LLM program synthesis. This can substantially reduce inference/evaluation cost and improve correctness, and it naturally transfers to many domains where verifiable properties exist (planning, synthesis, verification, robotics, compilers). It is timely (LLM cost/correctness) and offers clear real-world applications. Paper 1 is valuable and rigorous, but its main contribution is diagnostic/empirical guidance for context injection in multi-agent design exploration, likely narrower in downstream uptake.
Paper 2 has higher potential impact due to a more generally applicable and methodologically rigorous contribution: integrating formal, property-based verification with counterexample-guided feedback for LLM program synthesis. This bridges LLMs with formal methods, reduces compute substantially, and is likely transferable across domains where properties are specifiable (planning, synthesis, verification, robotics). Paper 1 shows strong applied gains but appears more domain- and data-instrumentation-dependent (enterprise attention traces) and may raise generalization/privacy concerns, potentially limiting breadth and reproducibility.
Paper 1 introduces a fundamentally new paradigm—property-guided LLM program synthesis with counterexample feedback—that is broadly applicable whenever verifiable properties exist. It demonstrates dramatic efficiency gains (7x fewer programs, orders of magnitude less computation) with strong empirical results. Paper 2 proposes an incremental improvement to RLVR training (AGPO), addressing boundary shrinkage with asymmetric updates. While solid and industrially validated, it represents a more incremental contribution within existing RL-for-LLM frameworks. Paper 1's novel synthesis paradigm has broader cross-field applicability (planning, verification, program synthesis) and deeper methodological innovation.