A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

Youssef Abdelkader, Humbert Fiorino, Damien Pellier

Jun 9, 2026 · arXiv:2606.10489v1

cs.AI

#3346 of 3489 · Artificial Intelligence

Tournament Score: 1200±47 (scale: 1050 to 1800)

Rating: 2.5/10

Significance 2.5 · Rigor 2.5 · Novelty 2 · Clarity 3

Abstract

Automated Planning is a subfield of Artificial Intelligence (AI) where the main objective is to generate a sequence of actions, known as a plan, that reaches a goal state from an initial state. A planning problem is defined by a set of objects, an initial state and a desired goal state. The objective is to compute a plan that leads from the initial state to the goal state. Programs that generate plans are called planners. In this paper, we present a complementary study of PlanGPT, a state-of-the-art LLM released last year. We redid some experiments to verify whether planning with LLMs is pertinent and worthwhile. We also checked whether the plan coverage results reported in the official PlanGPT paper were correct, and we performed a more comprehensive study of PlanGPT's performance using two metrics: Plan Cost and Plan Generation Time. The results of PlanGPT were compared to those produced by a traditional planner for the same plans and same metrics. We discovered that PlanGPT is no better than a Greedy search strategy.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper presents a replication and extension study of PlanGPT, a GPT-2-inspired LLM built from scratch for automated planning. The authors re-examine PlanGPT's plan coverage results on IPC benchmark problems and extend the evaluation to include two additional metrics: plan cost (number of actions) and plan generation time. They compare PlanGPT's performance against FastDownward using A* and Greedy (FF) search strategies across 7 of 8 supported domains. The central finding is that PlanGPT performs roughly on par with or worse than a Greedy search heuristic, leading the authors to conclude that PlanGPT is "no better than a Greedy search strategy."

The core contribution is modest: it is primarily a replication study with the addition of two standard metrics that were absent from the original PlanGPT paper. No new methods, models, architectures, or theoretical insights are proposed.

2. Methodological Rigor

The methodological rigor has several notable weaknesses:

Experimental design: The comparison framework is reasonable in principle — comparing an LLM-based planner against a classical planner using standard metrics — but execution details are lacking. The authors exclude the Visitall domain without sufficient justification. They use different benchmark sources for different domains (Downward Benchmarks for 5 domains, IPC-2002 for 2 others) due to "unrecognized predicates," which introduces inconsistency.

Statistical analysis: There is no statistical testing whatsoever. Results are presented as raw IPC scores and visual graphs without confidence intervals, significance tests, or variance analysis. The claim that PlanGPT is "no better than Greedy" is based solely on aggregate IPC score comparisons (5.02 vs. 5.36 for cost; 4.95 vs. 5.82 for time), which are close enough to warrant more rigorous statistical treatment.

Terminology issues: The paper conflates terminology in concerning ways. A* and Greedy are referred to as "heuristics" rather than search strategies, and FastDownward is called an "optimal planner" when it is a planning system that can be configured for optimal or satisficing search. The FF heuristic within the Greedy strategy is never explicitly named or described.

Reproducibility: The authors provide a GitLab repository and detailed procedural instructions, which is commendable. However, the description reads more like a user manual than a scientific methodology section, spending excessive space on file directory structures and shell script execution steps.

Missing controls: Training time for PlanGPT is acknowledged but not factored into the comparison. The authors note this would make PlanGPT's performance "much worse" but do not quantify it. PlanGPT was run with default parameters (one plan per problem, no sampling), which may not represent its best possible performance.
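The statistical-analysis concern in this section can be made concrete with a small sketch. It assumes the common IPC-style quality score (reference cost divided by the planner's plan cost, 0 for an unsolved problem); the paper's exact scoring may differ. It then applies a paired sign-flip permutation test to per-problem scores. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import random
from typing import Optional, Sequence


def ipc_quality_score(cost: Optional[float], reference_cost: float) -> float:
    """IPC-style quality score for one problem (assumed definition).

    Returns reference_cost / cost, or 0.0 when the planner found no plan.
    """
    if cost is None:
        return 0.0
    return reference_cost / cost


def paired_permutation_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Two-sided sign-flip permutation test on the mean paired difference.

    Under the null hypothesis the two planners are exchangeable on each
    problem, so the sign of every per-problem difference can be flipped.
    Returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Invented plan costs for illustration only; None marks an unsolved problem.
    reference = [10, 12, 8, 15, 20]
    llm_costs = [12, None, 8, 18, 25]
    greedy_costs = [11, 14, 9, 15, 22]
    llm = [ipc_quality_score(c, r) for c, r in zip(llm_costs, reference)]
    greedy = [ipc_quality_score(c, r) for c, r in zip(greedy_costs, reference)]
    print("aggregate scores:", sum(llm), sum(greedy))
    print("p-value:", paired_permutation_test(llm, greedy))
```

Reporting such a p-value, or a bootstrap confidence interval on the score difference, alongside the aggregate IPC scores would show whether the gap between the two planners could plausibly be due to chance.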

3. Potential Impact

The potential impact is limited. The finding that LLM-based planners underperform classical planners is not surprising and has been demonstrated or suggested by multiple prior works (Valmeekam et al., 2022, 2024). The paper reinforces an existing narrative rather than breaking new ground.

The contribution of adding plan cost and time metrics to PlanGPT's evaluation is useful but incremental. These are standard metrics in the planning community, and their absence from the original PlanGPT paper was an obvious gap. The paper essentially performs the evaluation that should have been included in the original work.

The observation that PlanGPT requires 8 separate models (one per domain) and still cannot match Greedy search is a worthwhile point for the community, but it is more of a remark than a finding requiring extensive experimentation.

4. Timeliness & Relevance

The paper addresses a timely question — whether LLMs can serve as effective planners — which is actively debated in the AI planning community. The growing interest in applying LLMs to structured reasoning tasks makes this investigation relevant. However, the paper arrives at a conclusion that is largely expected given the existing literature, particularly Valmeekam et al. (2022, 2024), diminishing its novelty.

5. Strengths & Limitations

Strengths:

Addresses a genuine gap in the original PlanGPT evaluation by adding cost and time metrics

Provides reproducibility materials (code, scripts, data)

Tests across 7 domains from IPC benchmarks

The domain-by-domain analysis reveals interesting variance (PlanGPT excels in Blocksworld/Floortile but struggles in Logistics/Driverlog)

Limitations:

No statistical analysis of results

Writing quality is below typical conference standards, with informal language, grammatical errors, and organizational issues (e.g., extensive procedural descriptions that belong in supplementary material)

The paper presents only 4 of presumably 14 graphs (2 domains × 2 metrics shown), making it hard to assess the full picture

The conclusion is noncommittal ("can't exactly confirm nor deny our hypothesis") despite the abstract making a stronger claim

Missing comparison with other LLM-based planners (Plansformer, etc.) that PlanGPT was compared against in the original paper

No analysis of failure modes — why does PlanGPT fail on certain problems?

The paper doesn't examine what properties of problems or domains predict PlanGPT success/failure

Limited novelty — this is essentially a replication with two additional standard metrics

Several citation issues (empty references marked with [?])

Additional Observations

The paper reads more as a technical report or course project than a research contribution. The extensive space devoted to explaining PDDL basics and directory structures could have been used for deeper analysis. A more impactful version of this work would analyze *why* PlanGPT performs differently across domains, examine scaling behavior, or propose improvements based on the diagnostic findings.

Rating: 2.5/10

Significance 2.5 · Rigor 2.5 · Novelty 2 · Clarity 3

Generated Jun 10, 2026

Comparison History (26)

Lost vs. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

Paper 2 introduces a novel, practical system (projectmem) addressing a real and growing problem—statelessness in AI coding agents—with a concrete architectural contribution (Memory-as-Governance, event-sourced memory layer via MCP). It has broader applicability to the rapidly expanding AI-assisted development ecosystem. Paper 1 is primarily a replication/complementary study confirming known limitations of LLMs for planning, offering limited novelty beyond verifying prior results. While both have evaluation limitations (Paper 2 uses only a self-study), Paper 2's innovation, open-source tooling, and timeliness give it higher impact potential.

claude-opus-4-6·Jun 11, 2026

Won vs. Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Paper 2, despite being modest in scope, provides concrete empirical evaluation of LLM-based planning (PlanGPT), offering reproducible benchmarks and a clear finding that PlanGPT performs no better than greedy search. This has immediate practical value for the AI planning community by tempering hype around LLMs for planning tasks. Paper 1 proposes a speculative theoretical framework ('Soul Computing') with grandiose claims about AI consciousness but lacks empirical grounding, conflates philosophical concepts with engineering, and its core contributions are primarily definitional rather than scientifically testable, limiting its real impact.

claude-opus-4-6·Jun 10, 2026

Lost vs. Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

Paper 2 proposes highly novel co-evolutionary mechanisms for LLM-driven strategy evolution, advancing the frontier of multi-agent adversarial games. Its framework demonstrates significant algorithmic innovation, robust methodological rigor (including ablation studies), and proven real-world applicability by winning a recognized competition. In contrast, Paper 1 is primarily a reproduction and evaluation study of an existing model, offering valuable critical analysis but lacking the broad methodological innovations and transformative potential of Paper 2.

gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

Paper 1 has higher potential impact due to a novel, timely problem (memory-augmented LLMs amplifying sycophancy), a new benchmark (MIST) spanning multiple high-stakes domains, broad applicability to deployed assistants with persistent memory, and actionable mitigations. Its methodology appears more systematically designed (multiple memory systems, model families, error analysis). Paper 2 is primarily a replication/comparison study with limited novelty and narrower scope (planning vs a baseline), though useful for validation; its broader scientific and cross-domain impact is likely smaller.

gpt-5.2·Jun 10, 2026

Lost vs. Unsupervised Electrofacies Classification and Porosity Characterization in the Offshore Keta Basin Using Wireline Logs

Paper 2 has higher likely impact because it delivers a practical, reproducible workflow for subsurface characterization in data-scarce settings, with clear real-world applications in frontier basin evaluation and potential transferability to other basins. Its methodology (unsupervised clustering with quantitative validation) is standard but reasonably rigorous and directly usable by geoscience practitioners. Paper 1 is mainly a replication/benchmarking study of an existing LLM planner; while timely and valuable for verification, its novelty and cross-domain applicability are more limited and the main finding (LLM no better than greedy) is primarily corrective rather than enabling.

gpt-5.2·Jun 10, 2026

Won vs. Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

Paper 2 critically evaluates state-of-the-art LLM capabilities in automated planning, challenging existing claims and establishing rigorous baselines against traditional planners. This has broad implications for AI research by testing the actual reasoning capabilities of LLMs. In contrast, Paper 1 is a small-scale survey (n=72) with limited methodological novelty and generalizability.

gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. Moonshine: An Autonomous Mathematical Research Agent Centered on Conjecture Generation

Paper 2 has higher potential impact: it proposes an autonomous conjecture-generation research agent and introduces a novel, cross-domain conjecture (Neural Jacobian Conjecture) linking classical algebraic geometry/topology ideas to neural network injectivity. It reports nontrivial progress with independent proofs in a specific case, suggesting real methodological capability. The work is timely (AI-for-math discovery), potentially broadly influential across mathematics and ML theory, and could seed follow-on research. Paper 1 is mainly a replication/benchmarking study with narrower scope and less novel contribution.

gpt-5.2·Jun 10, 2026

Lost vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 1 is more novel and methodologically substantial: it integrates an MCTS-like counterfactual evaluation framework with a learned multi-agent world model using rare public 3D ball-tracking data, enabling distributional pass value attribution and releasing code/checkpoints. It has clear real-world application in sports analytics and broader relevance to counterfactual reasoning, model-based RL, and generative trajectory modeling. Paper 2 is mainly a replication/benchmarking study of an existing LLM planner, with limited innovation and narrower impact, despite being timely and useful for validation.

gpt-5.2·Jun 10, 2026

Lost vs. Bellman-Taylor Score Decoding for Markov Decision Processes with State-Dependent Feasible Action Sets

Paper 1 introduces a novel RL framework (Bellman–Taylor score decoding) addressing a broadly important and under-served setting: MDPs with implicitly defined, state-dependent feasible action sets. It offers a principled latent-space formulation, avoids differentiating through decoders, provides a theoretical performance guarantee with a clear error decomposition, and demonstrates gains on queueing network control—high real-world relevance in operations/research and engineering. Paper 2 is mainly a replication/benchmarking study of PlanGPT with limited methodological innovation and narrower impact, though timely.

gpt-5.2·Jun 10, 2026

Lost vs. A Reliable Fault Diagnosis Method Based on Belief Rule Base Consider Robustness Analysis

Paper 2 proposes a novel fault diagnosis method with robustness analysis for belief rule base models, addressing practical industrial needs with validated experiments on real systems. It offers methodological contributions (robustness constraint strategies) with broad applicability across equipment maintenance domains. Paper 1 is primarily a replication/verification study of PlanGPT with limited novel contributions—confirming that an LLM-based planner underperforms traditional planners is useful but incremental. Paper 2 introduces new techniques with wider potential impact in reliability engineering and industrial applications.

claude-opus-4-6·Jun 10, 2026
