Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

#76 of 2292 · Artificial Intelligence
Tournament Score: 1551±46
Win Rate: 76% (25 wins, 8 losses, 33 matches)
Rating: 7.8/10

Abstract

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a methodology for extracting formal search trees from LLM chain-of-thought reasoning traces and uses computational cognitive models to characterize how those trees relate to actual move decisions. The central finding is striking: while LLMs generate reasoning traces that superficially resemble deep tree search, their move choices are best explained by a myopic model that considers only immediate consequences (depth-1 nodes), ignoring deeper lookahead entirely. This is validated through both correlational (cognitive model fitting) and causal (CoT pruning) analyses.

The key insight is a dissociation between what LLMs *write* and what they *act on*. LLMs produce the surface structure of planning without actually leveraging the deep search they generate. This contrasts directly with human planning, where depth of search is the primary driver of expertise (Van Opheusden et al., 2023).
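To make the depth terminology concrete, the following is a minimal sketch of how an extracted search tree might be represented and summarized. The `TreeNode` class, the move labels, and the definitions of depth (deepest level reached) and breadth (number of depth-1 candidate moves) are illustrative assumptions, not the paper's actual data format or metrics.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    """One position mentioned in a reasoning trace (hypothetical format)."""
    move: str
    children: List["TreeNode"] = field(default_factory=list)


def tree_depth(nodes: List[TreeNode]) -> int:
    """Deepest level reached; depth-1 nodes are the root's candidate moves."""
    if not nodes:
        return 0
    return 1 + max(tree_depth(n.children) for n in nodes)


def tree_breadth(nodes: List[TreeNode]) -> int:
    """Assumed breadth measure: number of candidate moves at depth 1."""
    return len(nodes)


# Three candidate moves; only the first is followed two plies further.
candidates = [
    TreeNode("m1", [TreeNode("reply1", [TreeNode("m1_followup")])]),
    TreeNode("m2"),
    TreeNode("m3"),
]

print(tree_depth(candidates))    # 3
print(tree_breadth(candidates))  # 3
```

Under these assumed definitions, a trace can reach depth 3 on paper while the decision still depends only on the three depth-1 nodes, which is the dissociation described above.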

Methodological Rigor

The paper demonstrates strong methodological rigor across multiple complementary analyses:

1. Search tree extraction: The authors use GPT-5 as a judge with DSPy-optimized prompts and validate the extracted trees against human annotations. This is a creative solution to the challenge of parsing unstructured reasoning traces.

2. Computational modeling: The authors adapt an established cognitive model (Van Opheusden et al., 2023) with four variants (full-tree, myopic, discount, no-tree) that systematically vary the value backup rule. The discount model's convergence to γ≈0 across all models is particularly compelling evidence for myopia. A model recovery analysis (Appendix D.2) confirms the fitting procedure can distinguish between competing models, addressing a key validity concern.

3. Causal intervention: The CoT pruning experiment is methodologically strong. Removing depth-1 paragraphs changes the chosen move in 32% of cases, while adding them back reduces the change rate to 4.1%. Adding deeper paragraphs yields only a marginal further reduction (to 3.7%), indistinguishable from controls. This establishes causality beyond the correlational cognitive modeling.

4. Natural experiment: The GPT-OSS-120B medium vs. high reasoning effort comparison provides a within-architecture control for the relationship between search effort and performance.
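The role of the value backup rule in item 2 can be illustrated with a small sketch. It is not the authors' model: the blending rule below, where a node's value is `(1 - gamma) * heuristic + gamma * minimax_of_children`, is an assumed parameterization chosen so that `gamma = 0` behaves like a myopic model (only depth-1 heuristics matter) and `gamma = 1` behaves like a full-tree minimax backup. The `Node` class, `backup`, `choose_move`, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    heuristic: float  # static evaluation from the deciding player's view
    children: List["Node"] = field(default_factory=list)


def backup(node: Node, gamma: float, maximizing: bool) -> float:
    """Blend a node's own heuristic with the minimax value of its subtree."""
    if not node.children:
        return node.heuristic
    child_vals = [backup(c, gamma, not maximizing) for c in node.children]
    subtree = max(child_vals) if maximizing else min(child_vals)
    return (1 - gamma) * node.heuristic + gamma * subtree


def choose_move(depth1_nodes: List[Node], gamma: float) -> int:
    """Index of the best depth-1 node; the opponent moves next, so minimize."""
    vals = [backup(n, gamma, maximizing=False) for n in depth1_nodes]
    return max(range(len(vals)), key=vals.__getitem__)


# Move 0 looks good immediately but loses after the opponent's reply;
# move 1 looks modest immediately but holds up under lookahead.
moves = [
    Node(0.6, [Node(-1.0)]),
    Node(0.2, [Node(0.3)]),
]

print(choose_move(moves, gamma=0.0))  # 0: myopic choice ignores the reply
print(choose_move(moves, gamma=1.0))  # 1: full backup uses the deep node
```

In this toy setup, a fitted `gamma` near 0 would mean the deep nodes are present in the tree but have no influence on the selected move, which is the pattern the review describes for the discount model.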

However, some limitations deserve note. The extraction relies on an LLM judge (GPT-5), introducing potential extraction errors that could bias results. While validated against human annotations, the validation set details are sparse. The cognitive model uses a fixed heuristic function from prior human work—different feature representations might better capture LLM decision-making. The causal intervention was conducted on only one model (Qwen3-Next-80B-Thinking), limiting generalizability of that specific finding.

Potential Impact

Interpretability and alignment: The finding that reasoning traces don't faithfully reflect decision processes has direct implications for scalable oversight. If deep reasoning in CoT is decorative rather than functional, monitoring approaches that assume trace faithfulness will fail. This connects to growing concerns about unfaithful chain-of-thought (Turpin et al., 2023; Lanham et al., 2023).

Training methodology: The paper suggests that simply scaling test-time compute or encouraging longer reasoning traces may be insufficient. The bottleneck isn't generating deeper search but acting on it—suggesting training signals that explicitly reward value backup from deep lookahead may be needed. This is actionable guidance for reasoning model developers.

Cognitive science bridge: The direct comparison with human planning data (Van Opheusden et al., 2023) using the same computational framework is valuable. It quantifies a precise mechanistic difference: humans improve through depth, LLMs through breadth. This advances understanding of what "reasoning" means in LLMs versus biological systems.

Methodology transfer: The search tree extraction framework could generalize to other strategic domains (chess, negotiation, multi-step planning tasks), providing a reusable toolkit for studying LLM deliberation.

Timeliness & Relevance

This paper is exceptionally timely. Reasoning models (o1, R1, etc.) are being rapidly deployed, and understanding whether their extended reasoning is genuinely functional is a pressing question. The paper directly addresses the ongoing debate about whether LLMs can plan (Kambhampati et al., 2024; Valmeekam et al., 2025) but reframes it productively: rather than asking *whether* LLMs plan, it asks *how* they plan and whether that planning is effective. The focus on reasoning trace faithfulness also connects to active safety research on scalable oversight.

Strengths

  • Novel analytical framework: Bridging cognitive science computational modeling with LLM reasoning analysis is creative and well-executed.
  • Converging evidence: Three complementary approaches (correlational modeling, discount parameter estimation, causal intervention) all point to the same conclusion.
  • Direct human comparison: Using the same formal framework as prior human work enables precise, apples-to-apples comparison.
  • Scale: 27 models, 1404 games, 9696 reasoning traces provide substantial statistical power.
  • Actionable implications: Specific guidance for training (reward deep value backup) and oversight (don't trust trace depth).

Limitations

  • Single domain: All findings are from four-in-a-row. Generalization to math, coding, or other reasoning domains where "depth" has different structure remains unverified.
  • Extraction fidelity: Dependence on an LLM judge for tree extraction introduces a potential confound—extraction quality may vary with reasoning clarity.
  • Causal intervention on one model: The pruning study uses only Qwen3-Next-80B-Thinking.
  • Proprietary model exclusion: 13 of 27 models had inaccessible traces, potentially introducing selection bias in the analyzed population.
  • Alternative explanations: The paper acknowledges but doesn't fully test whether myopia might be *adaptive* given model uncertainty at depth, which would change the interpretation from "failure" to "rational strategy."

Overall Assessment

    This is a well-crafted paper that makes a clear, important, and well-supported claim about a phenomenon of broad interest. The combination of cognitive modeling and causal intervention is methodologically sophisticated. The main finding—that LLMs generate deep search but don't use it—is both surprising and consequential for how we think about reasoning models, their oversight, and their improvement.

    Rating: 7.8/10
    Significance 8 · Rigor 7.5 · Novelty 8 · Clarity 8.5

    Generated May 11, 2026

    Comparison History (33)

    vs. Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents
    claude-opus-4.6 · 5/19/2026

    Paper 1 introduces a novel, generalizable framework for understanding LLM planning by extracting search trees from reasoning traces, revealing a fundamental dissociation between LLM and human planning (myopic vs. deep search). This provides deep mechanistic insight into how reasoning models actually work, with broad implications for improving LLM reasoning capabilities. Paper 2 identifies an important but more incremental safety concern (temporal memory contamination) in memory-equipped agents. While practically relevant, Paper 1's methodological innovation and fundamental insights into LLM cognition have broader scientific impact across AI, cognitive science, and alignment research.

    vs. Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms
    gemini-3.1 · 5/16/2026

    Paper 1 offers fundamental insights into the mechanisms of LLM reasoning, addressing the highly debated topic of whether models genuinely plan. By revealing that LLMs are myopic and do not utilize deep lookahead despite generating deep traces, it fundamentally challenges current assumptions about chain-of-thought and provides a concrete framework for future research in LLM cognition. While Paper 2 offers a highly practical safety application, Paper 1's deep analytical approach will likely have a broader foundational impact on how the field understands and develops reasoning models.

    vs. Quantifying and Understanding Uncertainty in Large Reasoning Models
    claude-opus-4.6 · 5/11/2026

    Paper 1 introduces a novel, interpretable framework for understanding LLM planning by extracting search trees from reasoning traces, revealing the striking finding that LLMs exhibit myopic planning despite generating deep deliberation. This provides actionable insights for improving reasoning models and bridges cognitive science with AI. Paper 2 addresses uncertainty quantification with conformal prediction—a more incremental contribution combining existing tools (CP, Shapley values). Paper 1's findings are more surprising, broadly applicable across strategic domains, and directly relevant to the timely question of whether reasoning models truly reason.

    vs. Multi-Environment POMDPs with Finite-Horizon Objectives
    gemini-3.1 · 5/11/2026

    Paper 2 has higher potential impact due to its extreme timeliness and relevance to the rapidly growing field of LLM reasoning. While Paper 1 provides solid theoretical and algorithmic contributions to MEPOMDPs, it represents a more incremental advance in a mature subfield. Paper 2 introduces a novel framework for interpreting LLM chain-of-thought as search trees, revealing critical insights about the myopic nature of LLM planning compared to humans. This addresses a major open question in AI, offering broad applications for improving foundation models and understanding AI cognition.

    vs. Temporal Smoothness Doubly Robust Learning for Debiased Knowledge Tracing
    gemini-3.1 · 5/11/2026

    Paper 2 investigates the fundamental reasoning and planning capabilities of Large Language Models (LLMs), a highly active and globally impactful area of AI research. By demonstrating that LLMs engage in myopic rather than deep planning compared to humans, it provides critical insights for AI alignment and development. While Paper 1 offers a rigorous methodological contribution to Knowledge Tracing, Paper 2's focus on LLM cognition guarantees a significantly broader impact across AI, cognitive science, and safety domains, making it highly timely and relevant.

    vs. ARMOR: An Agentic Framework for Reaction Feasibility Prediction via Adaptive Utility-aware Multi-tool Reasoning
    gpt-5.2 · 5/11/2026

    Paper 1 introduces a broadly generalizable methodology to extract and quantitatively model “search trees” from LLM reasoning traces, coupled with causal pruning interventions, yielding a surprising and theoretically important claim (myopic decision policy despite deep trace expansions). This advances mechanistic interpretability and evaluation of LLM planning with relevance across domains beyond games (agentic planning, alignment, CoT faithfulness). Paper 2 is a solid applied multi-tool agent framework with clear utility in reaction feasibility prediction, but its novelty is more incremental within tool-selection/agent orchestration and its impact is narrower to computational chemistry workflows.

    vs. Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference
    claude-opus-4.6 · 5/11/2026

    Paper 1 presents a novel and rigorous framework for understanding LLM planning behavior, a timely topic given the rapid deployment of reasoning models. Its key finding—that LLMs exhibit myopic planning despite generating deep search traces—has broad implications for AI alignment, model interpretability, and improving reasoning capabilities. The methodology (extracting search trees, fitting computational models, causal interventions) is rigorous and generalizable. Paper 2, while addressing an important problem in knowledge engineering, is more narrowly scoped to a specific industrial application with incremental methodological contributions combining existing paradigms.

    vs. Online Goal Recognition using Path Signature and Dynamic Time Warping
    gemini-3.1 · 5/11/2026

    Paper 1 addresses a highly timely and critical question in AI: whether LLMs genuinely plan during Chain-of-Thought reasoning. Its novel methodology for extracting and quantifying search trees provides deep insights into LLM behavior, revealing myopic planning. This has broad implications for AI alignment, cognitive science, and future LLM development. Paper 2 offers a solid technical improvement for goal recognition, but its impact is relatively confined to specific subfields of planning and trajectory analysis, whereas Paper 1 influences the broader, rapidly evolving landscape of foundation models.

    vs. Squeeze Evolve: Unified Multi-Model Orchestration for Verifier-Free Evolution
    claude-opus-4.6 · 5/11/2026

    Paper 1 introduces a novel analytical framework for understanding LLM reasoning by extracting search trees from chain-of-thought traces, revealing that LLMs exhibit myopic planning despite generating deep deliberation. This provides fundamental insights into how LLMs reason versus humans, with broad implications for AI interpretability and alignment. The methodological innovation (fitting computational models to extracted search trees) is generalizable across domains. Paper 2 is a solid engineering contribution optimizing multi-model orchestration for cost efficiency, but is more incremental and narrower in scientific scope, primarily improving existing evolutionary inference pipelines.

    vs. Alternating Target-Path Planning for Scalable Multi-Agent Coordination
    claude-opus-4.6 · 5/11/2026

    Paper 1 introduces a novel analytical framework for understanding LLM reasoning by extracting search trees from chain-of-thought traces, revealing that LLMs exhibit myopic planning despite generating deep reasoning. This has broad impact across AI interpretability, alignment, and cognitive science, addressing a fundamental and timely question about how reasoning models actually work. Paper 2 makes a solid engineering contribution to multi-agent pathfinding scalability, but addresses a narrower problem with incremental methodological advances. Paper 1's findings about the dissociation between LLM deliberation and decision-making have wider implications for the rapidly growing field of LLM reasoning.

    vs. SAVOIR: Learning Social Savoir-Faire via Shapley-based Reward Attribution
    claude-opus-4.6 · 5/11/2026

    Paper 1 introduces a novel, generalizable framework for understanding LLM planning by extracting search trees from reasoning traces, revealing a fundamental insight—LLMs exhibit myopic planning despite generating deep reasoning chains. This finding has broad implications for understanding and improving reasoning models across many domains. The methodology combining computational modeling with causal interventions is rigorous and the contrast with human cognition is scientifically deep. Paper 2, while technically strong with its Shapley-based reward attribution for social dialogue, addresses a narrower problem (credit assignment in social RL) with more incremental contributions to an existing benchmark.

    vs. When Does Critique Improve AI-Assisted Theoretical Physics? SCALAR: Structured Critic--Actor Loop for Agentic Reasoning
    gemini-3.1 · 5/11/2026

    Paper 1 addresses a fundamental question about LLM reasoning, revealing that despite generating deep Chain-of-Thought traces, LLMs are fundamentally myopic and rely on shallow search. This challenges core assumptions about AI planning and has broad implications for AI development across all domains. Paper 2, while valuable for AI-assisted physics, offers more incremental insights regarding multi-agent prompt engineering and feedback loops, making its overall scientific impact comparatively narrower.

    vs. AdaTKG: Adaptive Memory for Temporal Knowledge Graph Reasoning
    gpt-5.2 · 5/11/2026

    Paper 1 is more novel and timely: it introduces a general framework to extract and model search-tree structure from LLM chain-of-thought, and adds causal pruning interventions to link trace structure to decisions. The findings (myopic action despite deep-looking traces) directly impact interpretability, evaluation, and alignment of reasoning LLMs, with broad relevance across AI safety, cognitive science, and decision-making. Paper 2 is a solid incremental advance in temporal KG modeling with clear applications, but the adaptive EMA memory is relatively modest and its impact is likely narrower to the TKG subfield.

    vs. Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback
    claude-opus-4.6 · 5/11/2026

    Paper 1 offers a novel analytical framework for understanding LLM reasoning by extracting and quantifying search trees from CoT traces, revealing a fundamental insight—LLMs exhibit myopic planning despite generating deep reasoning traces. This finding has broad implications for understanding and improving LLM reasoning across domains, supported by rigorous methodology including computational modeling and causal interventions. Paper 2 presents an engineering contribution (reasoning graphs for persistent memory) that, while practical, is narrower in scope, lacks reported empirical results on benchmarks, and addresses an incremental improvement in agent architecture rather than a fundamental scientific insight.

    vs. LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning
    gemini-3.1 · 5/11/2026

    Paper 1 addresses a fundamental and highly timely question regarding the true nature of LLM reasoning and planning within Chain-of-Thought processes. By revealing that LLMs engage in myopic planning rather than deep lookahead, it provides critical insights into model cognition that will broadly influence future LLM architecture and alignment research. Paper 2, while offering a strong methodological approach for on-device GUI agents, has a narrower, more application-focused impact compared to the foundational cognitive insights of Paper 1.

    vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures
    claude-opus-4.6 · 5/11/2026

    Paper 2 introduces a novel, generalizable methodology for understanding LLM reasoning by extracting search trees from CoT traces, revealing a fundamental insight about LLM planning (myopic despite deep traces). This has broad implications across AI interpretability, cognitive science, and alignment research. While Paper 1 (IatroBench) addresses an important and timely safety concern about identity-contingent withholding, its impact is more narrowly focused on AI safety policy and medical applications. Paper 2's methodological contribution and its fundamental insight about the gap between LLM deliberation and action have wider scientific reach and are more likely to influence multiple research directions.

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    gemini-3.1 · 5/11/2026

    While Paper 1 provides valuable insights into LLM reasoning limitations, Paper 2 offers a broader and more enduring scientific impact. By unifying diffusion models and physical forces for molecular and crystal structure search, it significantly accelerates materials discovery and drug design. Its tenfold efficiency improvement and out-of-distribution generalization directly translate to real-world applications in chemistry, physics, and medicine, offering tangible advancements beyond the rapidly shifting landscape of LLM diagnostics.

    vs. Machine Collective Intelligence for Explainable Scientific Discovery
    gemini-3.1 · 5/11/2026

    Paper 1 addresses a fundamental bottleneck in all empirical sciences: deriving explainable governing equations from data. By achieving up to a million-fold reduction in extrapolation error and condensing massive neural networks into highly interpretable parameters, its methodology has immense potential to catalyze autonomous scientific discoveries across physics, chemistry, and biology. Paper 2, while highly relevant to AI interpretability and LLM reasoning limitations, has a narrower scope restricted to natural language processing and cognitive modeling, giving Paper 1 a substantially broader and more transformative scientific impact.

    vs. Foundation Models to Unlock Real-World Evidence from Nationwide Medical Claims
    claude-opus-4.6 · 5/11/2026

    ReClaim represents a major foundation model contribution trained on 43.8 billion medical events from 200M+ patients, demonstrating substantial improvements across 1,000+ prediction tasks, expenditure forecasting, and causal inference. Its direct applications to regulatory decision-making, disease surveillance, and healthcare economics give it enormous real-world impact potential. Paper 2, while offering valuable insights into LLM reasoning mechanisms (myopic planning), is more narrowly focused on understanding existing model behavior in a game domain. Paper 1's methodological scale, practical utility, and breadth of healthcare applications position it for higher scientific impact.

    vs. A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics
    gpt-5.2 · 5/11/2026

    Paper 1 has higher potential impact due to a broad, unifying theoretical contribution linking Bayesian inference, thermodynamics, and game theory with formal equivalence results (free-energy minima ↔ approximate Nash equilibria) plus new constructs (free-energy Harsanyi dividend) and cross-domain validation. If correct, it could reshape multi-agent modeling across neuroscience, biology, and AI. Paper 2 is timely and useful for LLM interpretability, but is narrower (one task domain) and primarily diagnostic rather than a foundational unification; its methods may generalize, yet the conceptual scope and cross-field reach are smaller.