Property-Guided LLM Program Synthesis for Planning
André G. Pereira, Augusto B. Corrêa, Jendrik Seipp
Abstract
LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.
AI Impact Assessments
(1 models)Scientific Impact Assessment: Property-Guided LLM Program Synthesis for Planning
1. Core Contribution
The paper proposes replacing numeric score-based feedback in LLM program synthesis with property-guided counterexample feedback, instantiated as a CEGIS-style repair loop. When a candidate heuristic violates the targeted "direct" property—meaning some state on a strictly-improving path from the initial state lacks any strictly-improving successor—the system provides the LLM with the exact failing state, its heuristic value, and all successor states with their values. This concrete, actionable feedback replaces the vague signal of a numeric score.
The specific application is synthesizing direct heuristic functions for PDDL planning domains. A direct heuristic guarantees that hill climbing (HC) reaches a goal without backtracking. The key insight is that the direct property is both formally meaningful (it guarantees HC completeness) and practically useful as a synthesis target (violations yield local, interpretable counterexamples).
2. Methodological Rigor
Strengths in experimental design:
Potential concerns:
3. Potential Impact
Within AI planning: This work demonstrates that LLM-synthesized heuristics can make hill climbing viable for classical planning—a striking result since HC is typically dismissed as incomplete. The 623.3 tasks solved by HC alone versus 573 by GBFS with sample-and-select is noteworthy.
Within LLM program synthesis broadly: The key transferable insight is that property-guided feedback with concrete counterexamples can dramatically reduce synthesis cost (7.4× fewer LLM calls, ~1,150× less evaluation compute). This principle extends beyond planning to any domain where candidate programs can be checked against formal properties—constraint satisfaction, optimization, verification, etc.
Practical impact: The reduction in evaluation cost (from ~206 CPU-hours to ~11 minutes per domain) makes the approach feasible for resource-constrained settings. The synthesized heuristics generalize remarkably well: only 1 out of 2,700 test evaluations fails due to a property violation.
4. Timeliness & Relevance
This paper addresses two timely concerns: (1) the high cost of LLM-based program synthesis at scale, and (2) the lack of formal guarantees in LLM-generated code. CEGIS is well-established in formal methods but underutilized with LLMs—this paper bridges that gap convincingly. The timing is excellent given the rapid adoption of LLMs for code generation and the growing interest in LLMs for planning (multiple ICAPS 2026 and NeurIPS 2025 papers in the references).
5. Strengths & Limitations
Key strengths:
Notable limitations:
Additional Observations
The paper makes an important conceptual contribution by reframing heuristic synthesis as property satisfaction rather than optimization. This shift from "find a good heuristic" to "find a heuristic satisfying this formal property" is theoretically cleaner and practically more efficient. The connection to CEGIS is well-motivated and properly credited.
The generalization results are perhaps underappreciated: synthesizing from tasks with ≤10 blocks and testing on tasks with up to 485 blocks (Blocksworld) with near-zero property violations suggests the LLM captures genuine structural properties of the domains, not just fitting to training instances.
The paper would benefit from experiments on domains beyond the IPC 2023 Learning Track and from an analysis of failure modes when the approach does not converge within the budget.
Generated May 19, 2026
Comparison History (16)
Paper 1 offers a clearer methodological innovation: replacing scalar reward/test scores with formally defined properties plus counterexample-guided feedback, yielding strong efficiency gains and more reliable synthesis. This is timely for LLM-based code generation and bridges LLMs with formal methods/planning, with broader cross-field impact (verification, synthesis, RL/planning, agentic coding) and concrete, measurable benefits. Paper 2 targets an important direction (adaptive multimodal memory), but the abstract is higher-level and closer to an incremental reframing of learnable memory mechanisms without clear technical specificity or guarantees, making impact harder to assess and potentially less novel.
Paper 2 integrates formal property checking with LLM program synthesis, providing a rigorous, counterexample-guided feedback loop. This neurosymbolic approach addresses fundamental inefficiencies in current LLM reasoning and search methods, offering broader, cross-disciplinary impact in software engineering, formal verification, and automated planning compared to Paper 1's domain-specific improvements in Text-to-Image alignment.
Paper 1 uncovers a counterintuitive 'capability paradox' in LLM multi-agent security, demonstrating that stronger components can degrade system safety. This fundamental insight, backed by rigorous large-scale analysis, will broadly impact the design and alignment of autonomous AI systems. While Paper 2 offers a strong methodological improvement for program synthesis, Paper 1 addresses a critical, timely vulnerability in AI safety with wider implications across multiple fields.
Paper 2 bridges formal verification and LLM program synthesis, introducing a neuro-symbolic approach to AI reasoning. By replacing basic scalar feedback with concrete counterexamples for property violations, it tackles fundamental limitations in AI reliability and planning. While Paper 1 offers valuable, timely cost savings for LLM evaluation, Paper 2's methodology advances the broader frontier of verifiable AI agents and reliable code generation, promising deeper transformative impact across AI safety, planning, and software engineering.
MathAtlas introduces a large-scale, novel benchmark (~52k items from 103 graduate textbooks) addressing a significant gap in autoformalization research—graduate-level mathematics. Its dependency graph (~178k relations) is a first-of-its-kind contribution enabling new research directions. The benchmark reveals fundamental limitations of current models (2.6-16.7% correctness), setting a long-term challenge for the community. Paper 2 presents a useful but more incremental contribution combining counterexample-guided refinement with LLM synthesis for planning heuristics, with narrower scope and applicability. MathAtlas has broader impact across AI for mathematics, formal verification, and theorem proving communities.
Paper 1 bridges formal methods and LLM program synthesis, offering a highly rigorous approach that drastically reduces computational costs and improves performance. By using formal property checking and concrete counterexamples, it addresses a major bottleneck in LLM code generation. Paper 2 is an interesting exploration of interpretability, but Paper 1's integration of symbolic verification with neural generation has broader implications and more immediate real-world utility in reliable system design.
Paper 1 offers a fundamental methodological advancement by integrating formal property verification and counterexample-guided feedback into LLM program synthesis. This significantly reduces computational costs and improves program quality, offering broad impact across AI planning, software engineering, and formal methods. In contrast, while Paper 2 provides a rigorous multimodal benchmark, its focus on the highly specific vertical domain of Chinese gaming short-videos limits its broader scientific applicability and long-term theoretical impact compared to Paper 1's algorithmic innovations.
Paper 1 is more novel and likely higher-impact: it reframes LLM program synthesis around formally specified, checkable properties with counterexample-guided feedback and early termination—bringing strong ideas from formal methods/CEGIS into LLM-driven synthesis. This improves methodological rigor and generality (any domain with verifiable properties), and offers clear, scalable cost reductions. Its contributions are broadly relevant across planning, verification, and program synthesis. Paper 2 is useful and timely for CO solver synthesis, but the memory-hierarchy/tree-search design is a more incremental systems advance with narrower theoretical grounding and potentially more domain-specific impact.
Paper 2 addresses the highly relevant and rapidly growing field of test-time compute scaling for LLMs. Its parallel reasoning framework using Bradley-Terry aggregation offers a broadly applicable method to improve reasoning performance across domains without relying on ground-truth verifiers. While Paper 1 presents an innovative formal property-guided approach, its applicability is constrained to domains where formal properties can be strictly defined. Paper 2's broader scope, timeliness, and strong empirical results on general LLM reasoning give it a higher potential for widespread scientific impact.
Paper 1 presents a more novel and broadly impactful framework that combines LLM agents with mathematical optimization solvers (SCIP) to automate the complex MIP research loop, including constraint discovery, code generation, debugging, and benchmarking. It addresses a fundamental bottleneck in solver development with demonstrated results on MIPLIB 2017. Paper 2 contributes a useful property-guided synthesis approach for planning heuristics, but its scope is narrower (PDDL planning domains) and the core idea of counterexample-guided refinement is more incremental. Paper 1's potential to transform how optimization solvers are developed gives it broader cross-field impact.
Paper 1 addresses a critical and underexplored problem—the sim-to-real gap in tool-use LLM agents—with both a comprehensive benchmark (22 perturbation types grounded in real failures) and a novel training recipe (ToolRL-DR) showing strong generalization. It evaluates 21 models including frontier systems, provides actionable insights about failure modes, and demonstrates that domain-randomized RL transfers to unseen perturbation types. Its breadth of impact is wider, spanning LLM deployment reliability, RL for agents, and benchmarking methodology. Paper 2 is a solid contribution on property-guided synthesis but is narrower in scope, focused on PDDL planning heuristics with more limited generalizability.
Paper 2 offers higher scientific impact by bridging Large Language Models with formal verification. While Paper 1 provides a highly practical systems solution for MoE efficiency, Paper 2 introduces a fundamental paradigm shift in program synthesis. By replacing scalar reward signals with formal properties and concrete counterexamples, it advances neuro-symbolic AI, code generation, and planning. This methodology addresses core limitations in current LLM trial-and-error generation and has broad applicability across any domain admitting verifiable properties.
Paper 2 has higher likely scientific impact: it introduces a broadly applicable, formally grounded synthesis paradigm (property checking with counterexample-guided feedback) that improves efficiency and solution quality, with clear methodological rigor and strong empirical evaluation across 10 planning domains including OOD tests. The approach generalizes beyond planning to any setting with verifiable properties, linking LLM synthesis to formal methods and program verification. Paper 1 is timely and useful as a critique plus a verifier-in-the-loop baseline in chess, but its impact is narrower (chess/benchmark interpretation) and more incremental relative to existing verification-assisted generation ideas.
Paper 1 introduces a novel paradigm—property-guided LLM program synthesis with counterexample feedback—that has broad applicability beyond planning domains. It addresses fundamental limitations of score-based LLM synthesis (high inference costs, lack of explanatory feedback) with a principled, formally grounded approach yielding dramatic efficiency improvements (7x fewer programs, orders of magnitude less computation). Paper 2, while solving a practical industrial problem (fuzzy music search), is more narrowly scoped as a domain-specific engineering contribution with incremental methodological novelty (adapting existing sparse retrieval with subword tokenization). Paper 1's framework generalizes to any domain with verifiable properties, giving it broader scientific impact.
Paper 2 likely has higher impact: it tackles a harder and more broadly relevant problem (online world-model/dynamics discovery under prior misalignment) with applications to robotics, games, and agents learning in novel environments. The approach (conflict-driven refinement + class-aware exploration) addresses key failure modes in executable model learning and is timely for foundation-model agents. Paper 1 is methodologically strong and practical for planning/program synthesis, but its applicability depends on having verifiable formal properties, making its impact narrower despite clear efficiency gains.
Paper 2 likely has higher scientific impact: it introduces a broadly applicable paradigm (property-guided synthesis with counterexample feedback and early stopping) that can generalize beyond planning to program synthesis, verification, and agentic systems, with strong computational efficiency gains and clear methodological grounding in formal properties. Its applications span many domains where verifiable properties exist, making it timely for LLM-based automation. Paper 1 is innovative within radiology report generation but is narrower in scope and impact, and relies on domain-specific resources (RadGraph) with more limited cross-field transfer.