SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval

Xin Xie, Dongyun Xue, Wuguannan Yao, Mingxiao Feng, Wengang Zhou, Xiang Qi, Houqiang Li, Peng Zhang

Apr 16, 2026

arXiv:2604.14712v1

cs.AI (primary)

#131 of 2292 · Artificial Intelligence

Tournament Score: 1534±27 (scale 1050 to 1800)

Win Rate: 67%

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 6 · Clarity 7

Abstract

LLM-powered systems require complex multi-step decision-making abilities to solve real-world tasks, yet current planning approaches face a trade-off between the high latency of inference-time search and the limited generalization of supervised fine-tuning. To address this limitation, we introduce SGA-MCTS, a framework that casts LLM planning as non-parametric retrieval. Offline, we leverage Monte Carlo Tree Search (MCTS) to explore the solution space and distill high-fidelity trajectories into State-Goal-Action (SGA) atoms. These atoms are de-lexicalized primitives that abstract concrete entities into symbolic slots, preserving reusable causal logic while discarding domain-specific noise. Online, a retrieval-augmented agent employs a hybrid symbolic-semantic mechanism to fetch relevant SGAs and re-ground them into the current context as soft reasoning hints. Empirical results on complex benchmarks demonstrate that this paradigm enables frozen, open-weights models to match the performance of SOTA systems (e.g., GPT-5) without task-specific fine-tuning. By effectively amortizing the heavy computational cost of search, SGA-MCTS achieves System 2 reasoning depth at System 1 inference speeds, rendering autonomous planning both scalable and real-time feasible.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: SGA-MCTS

1. Core Contribution

SGA-MCTS introduces a two-phase framework that decouples deliberative planning from reactive execution in LLM-based agents. The key insight is that complex multi-step reasoning trajectories can be decomposed into reusable State-Goal-Action (SGA) atoms—de-lexicalized primitives where concrete entities are replaced by typed symbolic slots. Offline, MCTS explores the solution space and distills high-quality trajectories into these atoms. Online, a hybrid symbolic-semantic retrieval mechanism fetches relevant atoms as "soft reasoning hints" that guide a frozen LLM's generation without any parameter updates.

The paper frames this as addressing the latency-generalization tradeoff: inference-time search (Tree of Thoughts, LATS) is expensive at test time, while supervised fine-tuning suffers from parametric rigidity. SGA-MCTS attempts to amortize the search cost into an offline phase and deliver the benefits at inference time through non-parametric retrieval.
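To make the de-lexicalization idea concrete, the following is a minimal sketch of abstracting a concrete step into typed slots and re-grounding it in a new context. It is an illustration only: the paper is described as using LLM-based de-lexicalization, whereas this sketch substitutes plain string replacement, and the names SGAAtom, delexicalize, reground, the slot types, and the get_weather call are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SGAAtom:
    """One de-lexicalized State-Goal-Action record (field names are illustrative)."""
    state: str
    goal: str
    action: str


def delexicalize(text: str, entities: dict[str, str]) -> str:
    """Swap each concrete entity for its typed slot, e.g. 'Paris' -> '<CITY>'.

    Assumption: simple string replacement stands in for the LLM-based step.
    """
    # Longest entities first, so a longer entity is not clobbered by a substring.
    for value in sorted(entities, key=len, reverse=True):
        text = text.replace(value, f"<{entities[value]}>")
    return text


def reground(template: str, bindings: dict[str, str]) -> str:
    """Fill typed slots from the current context; unknown slots stay symbolic."""
    return re.sub(
        r"<([A-Z_]+)>",
        lambda m: bindings.get(m.group(1), m.group(0)),
        template,
    )


# Offline: abstract a concrete step into a reusable atom.
entities = {"Paris": "CITY", "2024-05-01": "DATE"}
atom = SGAAtom(
    state=delexicalize("user asked about Paris on 2024-05-01", entities),
    goal=delexicalize("find the forecast for Paris", entities),
    action=delexicalize("get_weather(city='Paris', date='2024-05-01')", entities),
)

# Online: re-ground the same atom for a different query.
hint = reground(atom.action, {"CITY": "Tokyo", "DATE": "2025-01-15"})
print(atom.action)
print(hint)
```

In this sketch the stored atom carries only slot names, so the same action template can be reused for any city and date; the re-grounded string would then be offered to the frozen model as a hint rather than executed directly.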

2. Methodological Rigor

Strengths in formulation: The problem is cleanly formulated as a Goal-Conditioned MDP with a structured state representation. The gated reward function (Eq. 1) that zeroes out failed trajectories is a reasonable design choice. The hybrid retrieval score (Eq. 3) combining semantic cosine similarity with symbolic feasibility checking is well-motivated—pure semantic retrieval ignores execution preconditions, while pure symbolic matching misses intent alignment.
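The two design choices praised above can be sketched in a few lines. The exact forms of Eq. 1 and Eq. 3 are not reproduced in this assessment, so everything here is an assumption: the gate is modeled as a hard success indicator, the symbolic check as a set-inclusion test on required slots, and the combination as a product. The function names and the slot sets are hypothetical.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def gated_reward(success: bool, quality: float) -> float:
    """Assumed gate: a failed trajectory earns zero regardless of its quality score."""
    return quality if success else 0.0


def hybrid_score(
    query_vec: list[float],
    atom_vec: list[float],
    available_slots: set[str],
    required_slots: set[str],
) -> float:
    """Assumed combination: semantic similarity, zeroed when preconditions fail."""
    # Symbolic feasibility: every slot the atom needs must be bindable right now.
    feasible = required_slots <= available_slots
    return cosine(query_vec, atom_vec) if feasible else 0.0


# A semantically perfect match is still rejected if a required slot is missing.
ok = hybrid_score([1.0, 0.0], [1.0, 0.0], {"CITY", "DATE"}, {"CITY"})
blocked = hybrid_score([1.0, 0.0], [1.0, 0.0], {"CITY"}, {"CITY", "DATE"})
print(ok, blocked)
```

The product form makes the point from the paragraph above explicit: semantic similarity alone would rank the blocked atom first, while the feasibility gate alone would say nothing about intent alignment.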

Concerns about experimental rigor:

The evaluation covers only three benchmarks, and the dataset splits are somewhat unconventional (e.g., using G2 for offline and G3 for online in StableToolBench, 50/50 random split for ToolHop). The potential for information leakage, particularly on ToolHop where tools may overlap between splits, is not rigorously addressed beyond the Tool Familiarity Score analysis.

The comparison to GPT-5 is notable but problematic: GPT-5 is evaluated in "thinking mode" while open models use "non-thinking mode"—this is not an apples-to-apples comparison. The claim of "matching GPT-5" is somewhat misleading given these different evaluation conditions.

Standard deviations are reported (commendable), but the variance on StableToolBench is notably high (e.g., ±4.70 for 14B), suggesting instability.

The baseline comparison is limited: only ReAct (zero-shot) and LangMem are compared. Missing are comparisons against other memory-augmented agents (Reflexion, ExpeL), other retrieval-augmented planning methods, or fine-tuned baselines that would contextualize the contribution more fully.

The paper compares against "ReAct-Thinking" in Table 2, but this baseline appears nowhere in the main results (Table 1), which makes the two tables inconsistent.

3. Potential Impact

The "reasoning as retrieval" paradigm is conceptually appealing and practically relevant. If validated more thoroughly, it could:

Reduce deployment costs for LLM agents by enabling smaller models to perform competitively through curated experience retrieval rather than model scaling.

Enable modular knowledge management where atomic experiences can be added, removed, or updated without retraining.

Provide a template for amortized reasoning that could extend beyond tool use to other multi-step decision domains (e.g., code generation, scientific workflows).

However, the practical impact may be limited by the acknowledged dependency on offline MCTS quality and the cold-start problem. The framework also requires maintaining and querying an experience store, adding operational complexity.

4. Timeliness & Relevance

The paper addresses a genuinely pressing issue: how to give LLM agents deep planning capabilities without prohibitive inference costs. The System 1/System 2 framing is timely given the community's interest in test-time compute scaling. The focus on training-free approaches is relevant for practitioners who cannot afford fine-tuning large models for every new domain.

The emphasis on open-weight models matching proprietary systems aligns with the democratization trend in AI. The 76% token reduction claim, if reproducible, addresses real cost concerns in production deployments.

5. Strengths & Limitations

Key Strengths:

Clear architectural design: The two-phase pipeline is intuitive and well-illustrated (Figure 1).

De-lexicalization is a genuinely useful idea: Abstracting entities into typed slots for experience reuse is simple but effective, and the ablation in Figure 2 confirms its value over raw text retrieval.

Compelling OOD analysis: The Tool Familiarity Score (Eq. 5) and the analysis showing that SGA's advantage grows as tool familiarity decreases (Figure 3) are the paper's strongest empirical contribution, demonstrating genuine generalization rather than memorization.

Efficiency narrative is well-supported: The 6.9× compression ratio and the "deep-but-narrow" MCTS topology analysis (Table 3) provide concrete evidence of computational amortization.

Notable Weaknesses:

Limited baseline comparison: The absence of Reflexion, ExpeL, or any fine-tuning baseline weakens the empirical claims significantly.

Scalability questions unanswered: How does the framework perform as the tool space grows to hundreds or thousands of tools? The current benchmarks appear relatively contained.

The MCTS offline phase is expensive and underspecified: The paper doesn't report wall-clock times for the offline phase, making the "amortization" argument incomplete. How many GPU-hours does offline discovery require?

Reproducibility concerns: While hyperparameters are listed, the SGA extraction relies on LLM-based de-lexicalization, which may be sensitive to prompt engineering and model choice.

The GPT-5 comparison is premature and potentially misleading: As of the purported submission date (April 2026), using GPT-5 as a reference point without detailed methodology comparison raises credibility questions.

No statistical significance tests beyond standard deviation reporting.

The paper's novelty is incremental in some dimensions: MCTS for LLM planning has been explored (LATS); retrieval-augmented agents exist (ExpeL, Reflexion); de-lexicalization is common in dialogue systems. The combination is the contribution, but each component individually is well-established.

Overall Assessment

SGA-MCTS presents a well-structured framework with a compelling high-level idea—amortizing search cost through atomic experience retrieval. The de-lexicalization approach and the OOD generalization analysis are genuine contributions. However, the experimental evaluation is limited in scope and baseline coverage, and several claims (matching GPT-5, System 2 at System 1 cost) are overstated relative to the evidence provided. The paper would benefit from broader comparisons, wall-clock time analysis, and evaluation on more diverse benchmarks.

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 6 · Clarity 7

Generated Apr 17, 2026

Comparison History (49)

vs. Verifiable Process Rewards for Agentic Reasoning

claude-opus-4.6 · 5/16/2026

Paper 1 (VPR) addresses a fundamental challenge in RL for LLMs—credit assignment in long-horizon reasoning—with a principled framework backed by both theoretical analysis and empirical validation across multiple settings. It demonstrates transfer to general reasoning benchmarks, suggesting broad applicability. Paper 2 (SGA-MCTS) presents an interesting retrieval-based planning approach but makes extraordinary claims (matching GPT-5 without fine-tuning) that raise credibility concerns, and its non-parametric retrieval paradigm may have scalability limitations in truly novel domains. VPR's contribution to process reward design is more foundational and likely to influence future RL-based reasoning research broadly.

vs. The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

gemini-3.1 · 5/16/2026

Paper 2 uncovers a fundamental structural property of LLM representations—that temporal drift is geometrically orthogonal to correctness and uncertainty. This discovery challenges existing assumptions about hallucination detection and offers deep mechanistic insights, paving the way for fundamentally new approaches to model interpretability and knowledge editing. While Paper 1 offers a strong, practical engineering solution for LLM planning efficiency, Paper 2 provides a novel theoretical framework with broader, foundational implications for our understanding of AI systems.

vs. Geometry over Density: Few-Shot Cross-Domain OOD Detection

gpt-5.2 · 5/7/2026

Paper 2 likely has higher scientific impact due to a clearer, broadly applicable capability (train-once, few-shot cross-domain OOD detection) directly tied to safety-critical deployment. Its information-geometric features from diffusion score trajectories are a novel, principled diagnostic that can transfer across unrelated domains with minimal unlabeled ID data, suggesting wide adoption across ML robustness, medical, autonomy, and monitoring. The claims are measurable (AUROC across 12 benchmarks, large sample-efficiency gains) and methodologically grounded. Paper 1 is promising for LLM planning, but hinges on strong comparative claims and may be narrower and more system-dependent.

vs. OptimusKG: Unifying biomedical knowledge in a modern multimodal graph

gpt-5.2 · 5/5/2026

Paper 2 (OptimusKG) likely has higher scientific impact due to its immediate real-world utility as a large, openly distributable biomedical infrastructure resource enabling broad downstream work (ML on graphs, LLM retrieval, hypothesis generation) across many life-science domains. Its contribution is timely for biomedical AI, and it includes validation via literature-backed evidence checks and provenance/schema constraints that support rigor and reuse. Paper 1 is innovative in LLM planning/retrieval, but its impact may be narrower (agent planning benchmarks) and more sensitive to rapidly changing LLM baselines and competing approaches.

vs. OptimusKG: Unifying biomedical knowledge in a modern multimodal graph

gpt-5.2 · 5/5/2026

OptimusKG likely has higher scientific impact due to its immediate, broad real-world utility: a large, schema-constrained, multimodal biomedical knowledge graph that can serve as shared infrastructure for many downstream tasks (drug discovery, hypothesis generation, KG-ML, LLM retrieval, data integration). Its methodological contribution (harmonized LPG with rich properties/provenance plus evidence validation) is concrete and reusable across the life sciences, and timely given rapid adoption of KG+LLM workflows. SGA-MCTS is novel in LLM planning/retrieval, but impact may be narrower and more contingent on benchmark generality and reproducibility of claimed GPT-5-level performance.

vs. The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

gemini-3 · 5/5/2026

Paper 2 addresses a critical bottleneck in LLM agents by bridging the gap between slow System 2 planning and fast System 1 execution. By amortizing expensive MCTS search costs into an offline retrieval-based framework, it claims to match next-generation SOTA performance at low latency. This offers immense practical utility for real-world autonomous systems. While Paper 1 provides rigorous improvements to inference-time decoding, Paper 2's paradigm shift in decoupling planning from execution offers broader application potential and highly transformative impact for scalable AI agents.

vs. The Model Knows, the Decoder Finds: Future Value Guided Particle Power Sampling

gemini-3 · 5/5/2026

Paper 2 proposes a highly innovative paradigm shift by casting complex LLM planning as non-parametric retrieval of de-lexicalized atoms. By decoupling expensive MCTS planning from online execution, it elegantly solves the latency-generalization trade-off, enabling System 2 reasoning at System 1 speeds. This approach has broad applicability to real-time autonomous agents and offers significant real-world impact compared to Paper 1's narrower focus on decoding algorithms.

vs. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

gpt-5.2 · 5/5/2026

Paper 2 offers a more novel, scalable algorithmic contribution: amortizing MCTS into reusable de-lexicalized State-Goal-Action atoms with training-free retrieval, potentially enabling real-time planning for frozen open-weight models. This has direct practical applications (latency, cost, deployability) and broad relevance to planning, retrieval, and agentic LLMs. Paper 1 is valuable and timely as a cross-domain diagnostic benchmark with solid evaluation/annotation methodology, but benchmarks/analysis typically yield narrower impact than a generalizable planning framework if results hold widely.

vs. The Long-Horizon Task Mirage? Diagnosing Where and Why Agentic Systems Break

claude-opus-4.6 · 5/5/2026

SGA-MCTS presents a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a concrete solution—decoupling planning into offline MCTS exploration and online retrieval of abstracted SGA atoms. It demonstrates that frozen open-weight models can match SOTA performance (GPT-5) without fine-tuning, offering immediate practical impact. While Paper 2 provides valuable diagnostic analysis of long-horizon failures, it is primarily a benchmark/analysis contribution rather than a solution. Paper 1's methodological innovation (non-parametric retrieval for planning, de-lexicalized primitives) has broader potential to influence future agent architectures and real-world deployment.

vs. Arithmetic in the Wild: Llama uses Base-10 Addition to Reason About Cyclic Concepts

gemini-3 · 5/5/2026

Paper 2 addresses a critical bottleneck in LLM agents: the trade-off between inference-time search latency and generalization. By decoupling planning from execution using retrieval-augmented MCTS, it enables System 2 reasoning at System 1 speeds. This has massive potential for real-world applications across autonomous systems. While Paper 1 provides excellent fundamental insights into mechanistic interpretability, Paper 2's broad applicability, methodological innovation in non-parametric planning, and solution to a highly timely problem give it higher potential for widespread impact.

vs. EO-Gym: A Multimodal, Interactive Environment for Earth Observation Agents

gpt-5.2 · 5/5/2026

Paper 2 has broader, more general impact: it proposes a domain-agnostic framework (planning as non-parametric retrieval of de-lexicalized State-Goal-Action atoms) that could apply across many LLM-agent tasks, with strong real-world implications for lowering latency/cost while improving multi-step reliability. If empirically validated as claimed (matching top proprietary models without fine-tuning), it is timely and potentially transformative for scalable autonomous planning. Paper 1 is rigorous and valuable but more domain-specific (EO) and primarily advances benchmarking/environment infrastructure rather than a widely transferable planning paradigm.

vs. Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

claude-opus-4.6 · 5/1/2026

SGA-MCTS introduces a novel paradigm that decouples planning from execution via non-parametric retrieval of abstracted atomic experiences, addressing a fundamental trade-off in LLM planning. Its claim of enabling frozen open-weight models to match frontier systems like GPT-5 without fine-tuning has broad implications across AI planning, reasoning, and autonomous agents. Paper 2, while valuable for medical AI safety, is primarily an auditing/benchmarking study with more incremental contributions (identifying known grounding limitations, standard fine-tuning). Paper 1's methodological novelty and cross-domain applicability suggest higher potential impact.

vs. Auditing Frontier Vision-Language Models for Trustworthy Medical VQA: Grounding Failures, Format Collapse, and Domain Adaptation

gpt-5.2 · 5/1/2026

Paper 2 is more novel in reframing LLM planning as training-free, non-parametric “atomic experience” retrieval distilled from offline MCTS, addressing a central latency–generalization bottleneck. Its applications (agentic decision-making, robotics, tools, web automation) are broad and timely, and the decoupling of expensive search from fast inference could influence many systems beyond any single domain. Paper 1 is important and rigorous for medical VQA trustworthiness auditing, but its impact is narrower (specific to medical VLM grounding/prompting failures) and more diagnostic than paradigm-shifting.

vs. Graph World Models: Concepts, Taxonomy, and Future Directions

claude-opus-4.6 · 5/1/2026

SGA-MCTS presents a concrete, novel framework with empirical results demonstrating that frozen open-weight models can match GPT-5-level performance on complex planning benchmarks without fine-tuning. This addresses a critical practical trade-off (latency vs. generalization) in LLM planning with a creative solution combining MCTS, de-lexicalized atomic experiences, and retrieval-augmented generation. The claimed achievement of System 2 depth at System 1 speed has immediate practical impact. Paper 2, while valuable as a survey formalizing graph world models, primarily organizes existing work rather than introducing new methods, limiting its direct scientific contribution.

vs. Graph World Models: Concepts, Taxonomy, and Future Directions

gemini-3 · 5/1/2026

Paper 2 proposes a highly innovative methodological breakthrough that addresses a critical bottleneck in LLM deployment: achieving deep System 2 reasoning at System 1 inference speeds. Its approach of decoupling planning from execution via training-free retrieval has immediate, broad real-world applicability in autonomous agents. While Paper 1 provides a valuable taxonomy for an emerging field, Paper 2's algorithmic contribution and potential to match state-of-the-art performance without fine-tuning promises a more disruptive and immediate scientific impact across the AI community.

vs. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

claude-opus-4.6 · 4/17/2026

SGA-MCTS presents a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a creative non-parametric retrieval approach. It claims to enable frozen open-weight models to match GPT-5 performance without fine-tuning, which if validated would have enormous practical impact. The methodological innovation of distilling MCTS trajectories into reusable de-lexicalized SGA atoms is technically novel. Paper 2 provides valuable insights on bias in LLM-as-a-Judge evaluation, but its scope is narrower—documenting a bias rather than solving a core capability problem. Paper 1's breadth of potential applications and architectural contribution gives it higher impact potential.

vs. FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling

gemini-3 · 4/17/2026

Paper 1 tackles a foundational bottleneck in LLM agents—achieving System 2 reasoning depth at System 1 inference speeds. By decoupling planning from execution using MCTS and symbolic retrieval, it claims to match state-of-the-art performance without fine-tuning. This has massive, cross-disciplinary implications for scalable autonomous systems. While Paper 2 offers significant improvements in diffusion model alignment and diversity, the potential breadth, timeliness, and transformative real-world applicability of real-time, high-fidelity LLM planning give Paper 1 a higher ceiling for scientific and practical impact.

vs. HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference

gemini-3 · 4/17/2026

Paper 2 addresses the critical and broader challenge of efficient System 2 reasoning in LLMs. By decoupling MCTS planning from execution via reusable, de-lexicalized symbolic atoms, it offers a highly novel approach to generalizing complex decision-making without fine-tuning. While Paper 1 provides a valuable optimization for KV cache memory, Paper 2's potential to enable real-time, scalable autonomous planning at System 1 speeds represents a more fundamental architectural leap with wider theoretical impact.

vs. Aligning Agents via Planning: A Benchmark for Trajectory-Level Reward Modeling

claude-opus-4.6 · 4/17/2026

SGA-MCTS introduces a novel framework that addresses a fundamental trade-off in LLM planning (latency vs. generalization) with a creative non-parametric retrieval approach combining MCTS, de-lexicalized primitives, and hybrid retrieval. It demonstrates strong empirical results matching SOTA systems without fine-tuning, has broader applicability across planning domains, and offers a practical solution enabling System 2 reasoning at System 1 speeds. Paper 1, while valuable as a benchmark contribution, has narrower impact as an evaluation suite rather than a methodological advance, and benchmarks inherently have less transformative potential than novel frameworks.

vs. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

claude-opus-4.6 · 4/17/2026

SGA-MCTS presents a more broadly impactful paradigm by addressing a fundamental challenge in LLM planning—bridging the gap between expensive search and fast inference—applicable across diverse domains. Its training-free approach enabling frozen open-weight models to match GPT-5 performance is a striking claim with wide implications. The framework's novelty in decomposing MCTS trajectories into reusable de-lexicalized atoms and the hybrid retrieval mechanism represents a more generalizable contribution. DocSeeker, while solid, addresses the narrower domain of long document understanding with more incremental improvements to existing MLLM pipelines.