SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval
Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang
Abstract
LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.
AI Impact Assessments
(3 models)
Scientific Impact Assessment: SGA-MCTS
1. Core Contribution
SGA-MCTS introduces a two-phase framework that decouples deliberative planning from reactive execution in LLM-based agents. The key insight is that complex multi-step reasoning trajectories can be decomposed into reusable State-Goal-Action (SGA) atoms—de-lexicalized primitives where concrete entities are replaced by typed symbolic slots. Offline, MCTS explores the solution space and distills high-quality trajectories into these atoms. Online, a hybrid symbolic-semantic retrieval mechanism fetches relevant atoms as "soft reasoning hints" that guide a frozen LLM's generation without any parameter updates.
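To make the de-lexicalization idea concrete, here is a minimal sketch of how a concrete step might be abstracted into an SGA atom and later re-grounded. It is not the paper's implementation: the `SGAAtom` class, the `delexicalize` and `reground` helpers, the `<TYPE>` slot syntax, and the hotel example are all hypothetical, and the sketch assumes the entity-to-type mapping is already known and that each slot type occurs at most once per atom.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SGAAtom:
    """Hypothetical container for one State-Goal-Action atom."""
    state: str
    goal: str
    action: str


def delexicalize(text: str, entities: dict[str, str]) -> str:
    """Replace each concrete entity with a typed slot such as <CITY>."""
    # Longest entities first, so a short entity never overwrites
    # part of a longer one.
    for value in sorted(entities, key=len, reverse=True):
        text = text.replace(value, f"<{entities[value]}>")
    return text


def reground(template: str, bindings: dict[str, str]) -> str:
    """Fill typed slots with entities from the current context."""
    for slot, value in bindings.items():
        template = template.replace(f"<{slot}>", value)
    return template


# Offline: abstract a concrete (assumed) trajectory step into an atom.
entities = {"Paris": "CITY", "2024-05-01": "DATE"}
atom = SGAAtom(
    state=delexicalize("user is in Paris", entities),
    goal=delexicalize("book a hotel in Paris for 2024-05-01", entities),
    action=delexicalize(
        "search_hotels(city='Paris', date='2024-05-01')", entities
    ),
)

# Online: re-ground the stored action into a new context as a hint.
hint = reground(atom.action, {"CITY": "Tokyo", "DATE": "2025-01-10"})
print(atom.action)
print(hint)
```

The stored atom no longer mentions Paris or a specific date, which is what lets the same step be reused in a different context.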
The paper frames this as addressing the latency-generalization trade-off: inference-time search (Tree of Thoughts, LATS) is expensive at test time, while supervised fine-tuning fixes behavior in the model weights and generalizes poorly beyond the training distribution. SGA-MCTS attempts to amortize the search cost into an offline phase and deliver the benefits at inference time through non-parametric retrieval.
2. Methodological Rigor
Strengths in formulation: The problem is cleanly formulated as a Goal-Conditioned MDP with a structured state representation. The gated reward function (Eq. 1) that zeroes out failed trajectories is a reasonable design choice. The hybrid retrieval score (Eq. 3) combining semantic cosine similarity with symbolic feasibility checking is well-motivated—pure semantic retrieval ignores execution preconditions, while pure symbolic matching misses intent alignment.
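The two design choices mentioned above can be illustrated with a short sketch. The exact forms of Eq. 1 and Eq. 3 are not reproduced in this assessment, so everything below is an assumption: the reward gate is modeled as multiplying a quality score by a success indicator, and the hybrid score as cosine similarity hard-gated by a slot-availability check. The function names, the toy two-dimensional embeddings, and the atom store are hypothetical.

```python
import math


def gated_reward(succeeded: bool, quality: float) -> float:
    """Assumed gate: a failed trajectory gets zero reward."""
    return quality if succeeded else 0.0


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, returning 0.0 for a zero-length vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(query_vec, atom_vec, required_slots, available_slots):
    """Semantic similarity, zeroed when preconditions cannot be met."""
    feasible = required_slots <= available_slots  # subset check
    return cosine(query_vec, atom_vec) if feasible else 0.0


def retrieve(query_vec, available_slots, store, k=2):
    """Return the names of the top-k feasible atoms by hybrid score."""
    scored = [
        (hybrid_score(query_vec, vec, req, available_slots), name)
        for name, vec, req in store
    ]
    scored = [(s, n) for s, n in scored if s > 0.0]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Toy store: (atom name, embedding, slot types the atom needs).
store = [
    ("search_hotels", [1.0, 0.0], {"CITY", "DATE"}),
    ("book_flight", [0.9, 0.1], {"CITY", "DATE", "PASSPORT"}),
    ("check_weather", [0.6, 0.8], {"CITY"}),
]
print(retrieve([1.0, 0.0], {"CITY", "DATE"}, store))
```

In this toy store, `book_flight` is semantically close to the query but is excluded because the context has no `PASSPORT` slot, which is the failure mode that pure semantic retrieval would miss.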
Concerns about experimental rigor:
3. Potential Impact
The "reasoning as retrieval" paradigm is conceptually appealing and practically relevant. If validated more thoroughly, it could:
However, the practical impact may be limited by the acknowledged dependency on offline MCTS quality and the cold-start problem. The framework also requires maintaining and querying an experience store, adding operational complexity.
4. Timeliness & Relevance
The paper addresses a genuinely pressing issue: how to give LLM agents deep planning capabilities without prohibitive inference costs. The System 1/System 2 framing is timely given the community's interest in test-time compute scaling. The focus on training-free approaches is relevant for practitioners who cannot afford fine-tuning large models for every new domain.
The emphasis on open-weight models matching proprietary systems aligns with the democratization trend in AI. The 76% token reduction claim, if reproducible, addresses real cost concerns in production deployments.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
SGA-MCTS presents a well-structured framework with a compelling high-level idea—amortizing search cost through atomic experience retrieval. The de-lexicalization approach and the OOD generalization analysis are genuine contributions. However, the experimental evaluation is limited in scope and baseline coverage, and several claims (matching GPT-5, System 2 at System 1 cost) are overstated relative to the evidence provided. The paper would benefit from broader comparisons, wall-clock time analysis, and evaluation on more diverse benchmarks.
Generated Apr 17, 2026
Comparison History (49)
Paper 1 (VPR) addresses a fundamental challenge in RL for LLMs—credit assignment in long-horizon reasoning—with a principled framework backed by both theoretical analysis and empirical validation across multiple settings. It demonstrates transfer to general reasoning benchmarks, suggesting broad applicability. Paper 2 (SGA-MCTS) presents an interesting retrieval-based planning approach but makes extraordinary claims (matching GPT-5 without fine-tuning) that raise credibility concerns, and its non-parametric retrieval paradigm may have scalability limitations in truly novel domains. VPR's contribution to process reward design is more foundational and likely to influence future RL-based reasoning research broadly.
Paper 2 uncovers a fundamental structural property of LLM representations—that temporal drift is geometrically orthogonal to correctness and uncertainty. This discovery challenges existing assumptions about hallucination detection and offers deep mechanistic insights, paving the way for fundamentally new approaches to model interpretability and knowledge editing. While Paper 1 offers a strong, practical engineering solution for LLM planning efficiency, Paper 2 provides a novel theoretical framework with broader, foundational implications for our understanding of AI systems.
Paper 2 likely has higher scientific impact due to a clearer, broadly applicable capability (train-once, few-shot cross-domain OOD detection) directly tied to safety-critical deployment. Its information-geometric features from diffusion score trajectories are a novel, principled diagnostic that can transfer across unrelated domains with minimal unlabeled ID data, suggesting wide adoption across ML robustness, medical, autonomy, and monitoring. The claims are measurable (AUROC across 12 benchmarks, large sample-efficiency gains) and methodologically grounded. Paper 1 is promising for LLM planning, but hinges on strong comparative claims and may be narrower and more system-dependent.
Paper 2 (OptimusKG) likely has higher scientific impact due to its immediate real-world utility as a large, openly distributable biomedical infrastructure resource enabling broad downstream work (ML on graphs, LLM retrieval, hypothesis generation) across many life-science domains. Its contribution is timely for biomedical AI, and it includes validation via literature-backed evidence checks and provenance/schema constraints that support rigor and reuse. Paper 1 is innovative in LLM planning/retrieval, but its impact may be narrower (agent planning benchmarks) and more sensitive to rapidly changing LLM baselines and competing approaches.
OptimusKG likely has higher scientific impact due to its immediate, broad real-world utility: a large, schema-constrained, multimodal biomedical knowledge graph that can serve as shared infrastructure for many downstream tasks (drug discovery, hypothesis generation, KG-ML, LLM retrieval, data integration). Its methodological contribution (harmonized LPG with rich properties/provenance plus evidence validation) is concrete and reusable across the life sciences, and timely given rapid adoption of KG+LLM workflows. SGA-MCTS is novel in LLM planning/retrieval, but impact may be narrower and more contingent on benchmark generality and reproducibility of claimed GPT-5-level performance.
Paper 2 addresses a critical bottleneck in LLM agents by bridging the gap between slow System 2 planning and fast System 1 execution. By amortizing expensive MCTS search costs into an offline retrieval-based framework, it claims to match next-generation SOTA performance at low latency. This offers immense practical utility for real-world autonomous systems. While Paper 1 provides rigorous improvements to inference-time decoding, Paper 2's paradigm shift in decoupling planning from execution offers broader application potential and highly transformative impact for scalable AI agents.
Paper 2 proposes a highly innovative paradigm shift by casting complex LLM planning as non-parametric retrieval of de-lexicalized atoms. By decoupling expensive MCTS planning from online execution, it elegantly solves the latency-generalization trade-off, enabling System 2 reasoning at System 1 speeds. This approach has broad applicability to real-time autonomous agents and offers significant real-world impact compared to Paper 1's narrower focus on decoding algorithms.
Paper 2 offers a more novel, scalable algorithmic contribution: amortizing MCTS into reusable de-lexicalized State-Goal-Action atoms with training-free retrieval, potentially enabling real-time planning for frozen open-weight models. This has direct practical applications (latency, cost, deployability) and broad relevance to planning, retrieval, and agentic LLMs. Paper 1 is valuable and timely as a cross-domain diagnostic benchmark with solid evaluation/annotation methodology, but benchmarks/analysis typically yield narrower impact than a generalizable planning framework if results hold widely.
SGA-MCTS presents a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a concrete solution—decoupling planning into offline MCTS exploration and online retrieval of abstracted SGA atoms. It demonstrates that frozen open-weight models can match SOTA performance (GPT-5) without fine-tuning, offering immediate practical impact. While Paper 2 provides valuable diagnostic analysis of long-horizon failures, it is primarily a benchmark/analysis contribution rather than a solution. Paper 1's methodological innovation (non-parametric retrieval for planning, de-lexicalized primitives) has broader potential to influence future agent architectures and real-world deployment.
Paper 2 addresses a critical bottleneck in LLM agents: the trade-off between inference-time search latency and generalization. By decoupling planning from execution using retrieval-augmented MCTS, it enables System 2 reasoning at System 1 speeds. This has massive potential for real-world applications across autonomous systems. While Paper 1 provides excellent fundamental insights into mechanistic interpretability, Paper 2's broad applicability, methodological innovation in non-parametric planning, and solution to a highly timely problem give it higher potential for widespread impact.
Paper 2 has broader, more general impact: it proposes a domain-agnostic framework (planning as non-parametric retrieval of de-lexicalized State-Goal-Action atoms) that could apply across many LLM-agent tasks, with strong real-world implications for lowering latency/cost while improving multi-step reliability. If empirically validated as claimed (matching top proprietary models without fine-tuning), it is timely and potentially transformative for scalable autonomous planning. Paper 1 is rigorous and valuable but more domain-specific (EO) and primarily advances benchmarking/environment infrastructure rather than a widely transferable planning paradigm.
SGA-MCTS introduces a novel paradigm that decouples planning from execution via non-parametric retrieval of abstracted atomic experiences, addressing a fundamental trade-off in LLM planning. Its claim of enabling frozen open-weight models to match frontier systems like GPT-5 without fine-tuning has broad implications across AI planning, reasoning, and autonomous agents. Paper 2, while valuable for medical AI safety, is primarily an auditing/benchmarking study with more incremental contributions (identifying known grounding limitations, standard fine-tuning). Paper 1's methodological novelty and cross-domain applicability suggest higher potential impact.
Paper 2 is more novel in reframing LLM planning as training-free, non-parametric “atomic experience” retrieval distilled from offline MCTS, addressing a central latency–generalization bottleneck. Its applications (agentic decision-making, robotics, tools, web automation) are broad and timely, and the decoupling of expensive search from fast inference could influence many systems beyond any single domain. Paper 1 is important and rigorous for medical VQA trustworthiness auditing, but its impact is narrower (specific to medical VLM grounding/prompting failures) and more diagnostic than paradigm-shifting.
SGA-MCTS presents a concrete, novel framework with empirical results demonstrating that frozen open-weight models can match GPT-5-level performance on complex planning benchmarks without fine-tuning. This addresses a critical practical trade-off (latency vs. generalization) in LLM planning with a creative solution combining MCTS, de-lexicalized atomic experiences, and retrieval-augmented generation. The claimed achievement of System 2 depth at System 1 speed has immediate practical impact. Paper 2, while valuable as a survey formalizing graph world models, primarily organizes existing work rather than introducing new methods, limiting its direct scientific contribution.
Paper 2 proposes a highly innovative methodological breakthrough that addresses a critical bottleneck in LLM deployment: achieving deep System 2 reasoning at System 1 inference speeds. Its approach of decoupling planning from execution via training-free retrieval has immediate, broad real-world applicability in autonomous agents. While Paper 1 provides a valuable taxonomy for an emerging field, Paper 2's algorithmic contribution and potential to match state-of-the-art performance without fine-tuning promises a more disruptive and immediate scientific impact across the AI community.
SGA-MCTS presents a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a creative non-parametric retrieval approach. It claims to enable frozen open-weight models to match GPT-5 performance without fine-tuning, which if validated would have enormous practical impact. The methodological innovation of distilling MCTS trajectories into reusable de-lexicalized SGA atoms is technically novel. Paper 2 provides valuable insights on bias in LLM-as-a-Judge evaluation, but its scope is narrower—documenting a bias rather than solving a core capability problem. Paper 1's breadth of potential applications and architectural contribution gives it higher impact potential.
Paper 1 tackles a foundational bottleneck in LLM agents—achieving System 2 reasoning depth at System 1 inference speeds. By decoupling planning from execution using MCTS and symbolic retrieval, it claims to match state-of-the-art performance without fine-tuning. This has massive, cross-disciplinary implications for scalable autonomous systems. While Paper 2 offers significant improvements in diffusion model alignment and diversity, the potential breadth, timeliness, and transformative real-world applicability of real-time, high-fidelity LLM planning give Paper 1 a higher ceiling for scientific and practical impact.
Paper 2 addresses the critical and broader challenge of efficient System 2 reasoning in LLMs. By decoupling MCTS planning from execution via reusable, de-lexicalized symbolic atoms, it offers a highly novel approach to generalizing complex decision-making without fine-tuning. While Paper 1 provides a valuable optimization for KV cache memory, Paper 2's potential to enable real-time, scalable autonomous planning at System 1 speeds represents a more fundamental architectural leap with wider theoretical impact.
SGA-MCTS introduces a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a creative non-parametric retrieval approach combining MCTS, de-lexicalized primitives, and hybrid retrieval. It demonstrates strong empirical results matching SOTA systems without fine-tuning, has broader applicability across planning domains, and offers a practical solution enabling System 2 reasoning at System 1 speeds. Paper 1, while valuable as a benchmark contribution, has narrower impact as an evaluation suite rather than a methodological advance, and benchmarks inherently have less transformative potential than novel frameworks.
SGA-MCTS presents a more broadly impactful paradigm by addressing a fundamental challenge in LLM planning—bridging the gap between expensive search and fast inference—applicable across diverse domains. Its training-free approach enabling frozen open-weight models to match GPT-5 performance is a striking claim with wide implications. The framework's novelty in decomposing MCTS trajectories into reusable de-lexicalized atoms and the hybrid retrieval mechanism represents a more generalizable contribution. DocSeeker, while solid, addresses the narrower domain of long document understanding with more incremental improvements to existing MLLM pipelines.