Property-Guided LLM Program Synthesis for Planning

André G. Pereira, Augusto B. Corrêa, Jendrik Seipp

May 15, 2026

arXiv:2605.16142v2 PDF

v1v2

cs.AI(primary)cs.LG

#563of 2292·Artificial Intelligence

#563 of 2292 · Artificial Intelligence

Tournament Score

1464±46

10501800

63%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance7.5

Rigor7.5

Novelty8

Clarity8.5

Tournament Score

1464±46

10501800

63%

Win Rate

Wins

Losses

Matches

Rating

7.8/ 10

Significance

Rigor

Novelty

Clarity

Abstract

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: Property-Guided LLM Program Synthesis for Planning

1. Core Contribution

The paper proposes replacing numeric score-based feedback in LLM program synthesis with property-guided counterexample feedback, instantiated as a CEGIS-style repair loop. When a candidate heuristic violates the targeted "direct" property—meaning some state on a strictly-improving path from the initial state lacks any strictly-improving successor—the system provides the LLM with the exact failing state, its heuristic value, and all successor states with their values. This concrete, actionable feedback replaces the vague signal of a numeric score.

The specific application is synthesizing direct heuristic functions for PDDL planning domains. A direct heuristic guarantees that hill climbing (HC) reaches a goal without backtracking. The key insight is that the direct property is both formally meaningful (it guarantees HC completeness) and practically useful as a synthesis target (violations yield local, interpretable counterexamples).

2. Methodological Rigor

Strengths in experimental design:

The paper uses 10 IPC 2023 Learning Track domains with a clear out-of-distribution test set (900 tasks), ensuring the evaluation is meaningful.

Three independent runs with reported standard deviations provide statistical grounding.

The comparison is fair: the same LLM (Gemini 3.1 Pro) and same initial prompt as the baseline (Corrêa et al. 2024) are used.

The ablation study (Cov.) replacing property feedback with numeric coverage feedback isolates the contribution of the direct property specifically.

Potential concerns:

Validation uses a per-task time limit of 30 seconds; if DFS doesn't find a counterexample in time, the heuristic is assumed direct. This is a pragmatic choice but means "verified direct" is approximate.

Only one LLM is tested. While the authors justify this (the question is about the synthesis paradigm, not LLM comparison), the generalizability across models remains unknown.

The repair loop budget of 10 iterations is modest. Two domains (Sokoban, Floortile) hit or nearly hit this limit, suggesting the approach may struggle on harder domains.

Standard deviations are sometimes large (e.g., Rovers: 42.7±13.6, Satellite: 55.7±14.5), indicating instability in some domains.

3. Potential Impact

Within AI planning: This work demonstrates that LLM-synthesized heuristics can make hill climbing viable for classical planning—a striking result since HC is typically dismissed as incomplete. The 623.3 tasks solved by HC alone versus 573 by GBFS with sample-and-select is noteworthy.

Within LLM program synthesis broadly: The key transferable insight is that property-guided feedback with concrete counterexamples can dramatically reduce synthesis cost (7.4× fewer LLM calls, ~1,150× less evaluation compute). This principle extends beyond planning to any domain where candidate programs can be checked against formal properties—constraint satisfaction, optimization, verification, etc.

Practical impact: The reduction in evaluation cost (from ~206 CPU-hours to ~11 minutes per domain) makes the approach feasible for resource-constrained settings. The synthesized heuristics generalize remarkably well: only 1 out of 2,700 test evaluations fails due to a property violation.

4. Timeliness & Relevance

This paper addresses two timely concerns: (1) the high cost of LLM-based program synthesis at scale, and (2) the lack of formal guarantees in LLM-generated code. CEGIS is well-established in formal methods but underutilized with LLMs—this paper bridges that gap convincingly. The timing is excellent given the rapid adoption of LLMs for code generation and the growing interest in LLMs for planning (multiple ICAPS 2026 and NeurIPS 2025 papers in the references).

5. Strengths & Limitations

Key strengths:

Elegant problem formulation: The direct property is perfectly suited for CEGIS—violations are local, interpretable, and actionable. The choice of property is perhaps the paper's most important contribution.

Dramatic efficiency gains: The ~1,150× reduction in evaluation cost is not incremental; it fundamentally changes the resource profile of LLM-based synthesis.

Strong generalization: Near-perfect directness on out-of-distribution test tasks from small training sets is impressive and practically valuable.

Qualitative analysis: The Blocksworld and Satellite case studies provide genuine insight into what the LLM learns through counterexample-guided repair.

Clean ablation: The Cov. ablation convincingly shows that the property feedback (not just the repair loop structure) drives the improvement.

Notable limitations:

Domain-specificity of the property: The direct property is specific to planning with hill climbing. The paper's broader claim—that property-guided synthesis works "whenever a problem admits a verifiable property"—is supported only by this single instantiation.

Incomplete verification: The DFS-based validation with time limits means the property is checked approximately, not proven. The authors acknowledge this and suggest future work combining with theorem provers.

Scalability of counterexamples: In domains with large state descriptions (the Blocksworld example prompt shows 35+ atoms), counterexamples consume significant context. Whether this scales to more complex domains is unclear.

Limited search capability: HC solves 623/900 tasks. For the remaining ~277, the approach offers no fallback. A hybrid with GBFS would be a natural extension.

Single LLM evaluation: Results with other models (open-source, smaller, reasoning-specialized) would strengthen the claims.

Additional Observations

The paper makes an important conceptual contribution by reframing heuristic synthesis as property satisfaction rather than optimization. This shift from "find a good heuristic" to "find a heuristic satisfying this formal property" is theoretically cleaner and practically more efficient. The connection to CEGIS is well-motivated and properly credited.

The generalization results are perhaps underappreciated: synthesizing from tasks with ≤10 blocks and testing on tasks with up to 485 blocks (Blocksworld) with near-zero property violations suggests the LLM captures genuine structural properties of the domains, not just fitting to training instances.

The paper would benefit from experiments on domains beyond the IPC 2023 Learning Track and from an analysis of failure modes when the approach does not converge within the budget.

Rating:7.8/ 10

Significance 7.5Rigor 7.5Novelty 8Clarity 8.5

Generated May 19, 2026

Comparison History (16)

vs. Learning to Learn from Multimodal Experience

gpt-5.25/19/2026

Paper 1 offers a clearer methodological innovation: replacing scalar reward/test scores with formally defined properties plus counterexample-guided feedback, yielding strong efficiency gains and more reliable synthesis. This is timely for LLM-based code generation and bridges LLMs with formal methods/planning, with broader cross-field impact (verification, synthesis, RL/planning, agentic coding) and concrete, measurable benefits. Paper 2 targets an important direction (adaptive multimodal memory), but the abstract is higher-level and closer to an incremental reframing of learnable memory mechanisms without clear technical specificity or guarantees, making impact harder to assess and potentially less novel.

vs. AutoRubric-T2I: Robust Rule-Based Reward Model for Text-to-Image Alignment

gemini-3.15/19/2026

Paper 2 integrates formal property checking with LLM program synthesis, providing a rigorous, counterexample-guided feedback loop. This neurosymbolic approach addresses fundamental inefficiencies in current LLM reasoning and search methods, offering broader, cross-disciplinary impact in software engineering, formal verification, and automated planning compared to Paper 1's domain-specific improvements in Text-to-Image alignment.

vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

gemini-3.15/19/2026

Paper 1 uncovers a counterintuitive 'capability paradox' in LLM multi-agent security, demonstrating that stronger components can degrade system safety. This fundamental insight, backed by rigorous large-scale analysis, will broadly impact the design and alignment of autonomous AI systems. While Paper 2 offers a strong methodological improvement for program synthesis, Paper 1 addresses a critical, timely vulnerability in AI safety with wider implications across multiple fields.

vs. Active Testing of Large Language Models via Approximate Neyman Allocation

gemini-3.15/19/2026

Paper 2 bridges formal verification and LLM program synthesis, introducing a neuro-symbolic approach to AI reasoning. By replacing basic scalar feedback with concrete counterexamples for property violations, it tackles fundamental limitations in AI reliability and planning. While Paper 1 offers valuable, timely cost savings for LLM evaluation, Paper 2's methodology advances the broader frontier of verifiable AI agents and reliable code generation, promising deeper transformative impact across AI safety, planning, and software engineering.

vs. MathAtlas: A Benchmark for Autoformalization in the Wild

claude-opus-4.65/19/2026

MathAtlas introduces a large-scale, novel benchmark (~52k items from 103 graduate textbooks) addressing a significant gap in autoformalization research—graduate-level mathematics. Its dependency graph (~178k relations) is a first-of-its-kind contribution enabling new research directions. The benchmark reveals fundamental limitations of current models (2.6-16.7% correctness), setting a long-term challenge for the community. Paper 2 presents a useful but more incremental contribution combining counterexample-guided refinement with LLM synthesis for planning heuristics, with narrower scope and applicability. MathAtlas has broader impact across AI for mathematics, formal verification, and theorem proving communities.

vs. Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas

gemini-3.15/19/2026

Paper 1 bridges formal methods and LLM program synthesis, offering a highly rigorous approach that drastically reduces computational costs and improves performance. By using formal property checking and concrete counterexamples, it addresses a major bottleneck in LLM code generation. Paper 2 is an interesting exploration of interpretability, but Paper 1's integration of symbolic verification with neural generation has broader implications and more immediate real-world utility in reliable system design.

vs. SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

gemini-3.15/19/2026

Paper 1 offers a fundamental methodological advancement by integrating formal property verification and counterexample-guided feedback into LLM program synthesis. This significantly reduces computational costs and improves program quality, offering broad impact across AI planning, software engineering, and formal methods. In contrast, while Paper 2 provides a rigorous multimodal benchmark, its focus on the highly specific vertical domain of Chinese gaming short-videos limits its broader scientific applicability and long-term theoretical impact compared to Paper 1's algorithmic innovations.

vs. Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

gpt-5.25/19/2026

Paper 1 is more novel and likely higher-impact: it reframes LLM program synthesis around formally specified, checkable properties with counterexample-guided feedback and early termination—bringing strong ideas from formal methods/CEGIS into LLM-driven synthesis. This improves methodological rigor and generality (any domain with verifiable properties), and offers clear, scalable cost reductions. Its contributions are broadly relevant across planning, verification, and program synthesis. Paper 2 is useful and timely for CO solver synthesis, but the memory-hierarchy/tree-search design is a more incremental systems advance with narrower theoretical grounding and potentially more domain-specific impact.

vs. OpenDeepThink: Parallel Reasoning via Bradley--Terry Aggregation

gemini-3.15/19/2026

Paper 2 addresses the highly relevant and rapidly growing field of test-time compute scaling for LLMs. Its parallel reasoning framework using Bradley-Terry aggregation offers a broadly applicable method to improve reasoning performance across domains without relying on ground-truth verifiers. While Paper 1 presents an innovative formal property-guided approach, its applicability is constrained to domains where formal properties can be strictly defined. Paper 2's broader scope, timeliness, and strong empirical results on general LLM reasoning give it a higher potential for widespread scientific impact.

vs. Agentic MIP Research: Accelerated Constraint Handler Generation

claude-opus-4.65/19/2026

Paper 1 presents a more novel and broadly impactful framework that combines LLM agents with mathematical optimization solvers (SCIP) to automate the complex MIP research loop, including constraint discovery, code generation, debugging, and benchmarking. It addresses a fundamental bottleneck in solver development with demonstrated results on MIPLIB 2017. Paper 2 contributes a useful property-guided synthesis approach for planning heuristics, but its scope is narrower (PDDL planning domains) and the core idea of counterexample-guided refinement is more incremental. Paper 1's potential to transform how optimization solvers are developed gives it broader cross-field impact.

vs. When Simulation Lies: A Sim-to-Real Benchmark and Domain-Randomized RL Recipe for Tool-Use Agents

claude-opus-4.65/19/2026

Paper 1 addresses a critical and underexplored problem—the sim-to-real gap in tool-use LLM agents—with both a comprehensive benchmark (22 perturbation types grounded in real failures) and a novel training recipe (ToolRL-DR) showing strong generalization. It evaluates 21 models including frontier systems, provides actionable insights about failure modes, and demonstrates that domain-randomized RL transfers to unseen perturbation types. Its breadth of impact is wider, spanning LLM deployment reliability, RL for agents, and benchmarking methodology. Paper 2 is a solid contribution on property-guided synthesis but is narrower in scope, focused on PDDL planning heuristics with more limited generalizability.

vs. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

gemini-3.15/19/2026

Paper 2 offers higher scientific impact by bridging Large Language Models with formal verification. While Paper 1 provides a highly practical systems solution for MoE efficiency, Paper 2 introduces a fundamental paradigm shift in program synthesis. By replacing scalar reward signals with formal properties and concrete counterexamples, it advances neuro-symbolic AI, code generation, and planning. This methodology addresses core limitations in current LLM trial-and-error generation and has broad applicability across any domain admitting verifiable properties.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gpt-5.25/19/2026

Paper 2 has higher likely scientific impact: it introduces a broadly applicable, formally grounded synthesis paradigm (property checking with counterexample-guided feedback) that improves efficiency and solution quality, with clear methodological rigor and strong empirical evaluation across 10 planning domains including OOD tests. The approach generalizes beyond planning to any setting with verifiable properties, linking LLM synthesis to formal methods and program verification. Paper 1 is timely and useful as a critique plus a verifier-in-the-loop baseline in chess, but its impact is narrower (chess/benchmark interpretation) and more incremental relative to existing verification-assisted generation ideas.

vs. Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

claude-opus-4.65/19/2026

Paper 1 introduces a novel paradigm—property-guided LLM program synthesis with counterexample feedback—that has broad applicability beyond planning domains. It addresses fundamental limitations of score-based LLM synthesis (high inference costs, lack of explanatory feedback) with a principled, formally grounded approach yielding dramatic efficiency improvements (7x fewer programs, orders of magnitude less computation). Paper 2, while solving a practical industrial problem (fuzzy music search), is more narrowly scoped as a domain-specific engineering contribution with incremental methodological novelty (adapting existing sparse retrieval with subword tokenization). Paper 1's framework generalizes to any domain with verifiable properties, giving it broader scientific impact.

vs. Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

gpt-5.25/19/2026

Paper 2 likely has higher impact: it tackles a harder and more broadly relevant problem (online world-model/dynamics discovery under prior misalignment) with applications to robotics, games, and agents learning in novel environments. The approach (conflict-driven refinement + class-aware exploration) addresses key failure modes in executable model learning and is timely for foundation-model agents. Paper 1 is methodologically strong and practical for planning/program synthesis, but its applicability depends on having verifiable formal properties, making its impact narrower despite clear efficiency gains.

vs. AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

gpt-5.25/19/2026

Paper 2 likely has higher scientific impact: it introduces a broadly applicable paradigm (property-guided synthesis with counterexample feedback and early stopping) that can generalize beyond planning to program synthesis, verification, and agentic systems, with strong computational efficiency gains and clear methodological grounding in formal properties. Its applications span many domains where verifiable properties exist, making it timely for LLM-based automation. Paper 1 is innovative within radiology report generation but is narrower in scope and impact, and relies on domain-specific resources (RadGraph) with more limited cross-field transfer.