Reasoning Structure of Large Language Models

Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer

Jun 2, 2026

arXiv:2606.03883v1

cs.AI (primary), cs.LG

#557 of 3355 · Artificial Intelligence

Tournament Score

1478 ± 42 (scale: 1050 to 1800)

Win Rate: 70%

Rating: 6.8 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Reasoning Structure of Large Language Models"

1. Core Contribution

This paper addresses a genuine gap in LRM evaluation: standard metrics (accuracy, token count) collapse rich reasoning behavior into single numbers, hiding fundamental structural differences. The authors propose three interconnected contributions: (1) a scalable benchmark of 21 grid-based logic puzzles at four difficulty levels, (2) a pipeline that converts free-form textual reasoning traces into directed acyclic graphs (DAGs) of verifiable claims and dependencies, and (3) a reasoning-flow efficiency metric η based on absorbing Markov chains that quantifies how concentrated a model's logical flow is relative to the minimal claim set needed for a solution.

The key insight — modeling reasoning as an absorbing Markov chain over a claim graph and measuring structural entropy — is genuinely novel. By computing the fundamental matrix of the chain, the authors quantify how "logical mass" distributes across the reasoning graph, distinguishing focused deduction from diffuse exploration in a principled, information-theoretic manner.
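To make the absorbing-chain idea concrete, here is a minimal sketch; it is not the paper's implementation. It assumes a toy claim DAG, a walker that starts at the solution claim and steps uniformly at random to one of the claims it depends on, and premises (claims with no dependencies) as absorbing states. For transient claims the computed values equal the start row of the fundamental matrix N = (I - Q)^(-1); for premises they are absorption probabilities. The normalized entropy at the end is only an illustrative concentration score, since this review does not reproduce the paper's exact definition of η. The names `deps`, `expected_visits`, and `normalized_entropy` are hypothetical.

```python
import math
from graphlib import TopologicalSorter

# Hypothetical toy claim graph: each claim maps to the claims it depends on.
deps = {
    "solution": ["c1", "c2"],
    "c1": ["given_a"],
    "c2": ["given_a", "given_b"],
    "given_a": [],  # premises have no dependencies, so they are absorbing
    "given_b": [],
}

def expected_visits(deps, start):
    """Expected number of visits to each claim for a walk begun at `start`.

    Assumes uniform transitions over a claim's dependencies. Because the
    graph is acyclic, propagating mass in topological order gives the same
    numbers as inverting (I - Q).
    """
    # static_order() lists dependencies first; reverse it so that every
    # claim is processed before the claims it depends on.
    order = list(TopologicalSorter(deps).static_order())[::-1]
    visits = {node: 0.0 for node in deps}
    visits[start] = 1.0
    for node in order:
        children = deps[node]
        for child in children:
            visits[child] += visits[node] / len(children)
    return visits

def normalized_entropy(weights):
    """Shannon entropy of the normalized weights, scaled into [0, 1]."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))

visits = expected_visits(deps, "solution")
spread = normalized_entropy(list(visits.values()))
print(visits)
print(f"normalized entropy of visit mass: {spread:.3f}")
```

In this toy graph `given_a` is reached through both intermediate claims, so it collects more visit mass than `given_b`. A lower normalized entropy would indicate flow concentrated on fewer claims.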

2. Methodological Rigor

Strengths in design: The pipeline is carefully modular: deterministic extractors handle high-precision claims, LLM-based extractors handle recall, and a screening step filters artifacts. Claims are verified against an executable puzzle environment, grounding the analysis in deterministic truth rather than subjective judgment. The separation of extraction roles (GPT-5.2 for claim extraction, GPT-5-mini for rule extraction) mitigates self-evaluation bias.
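As a rough illustration of what checking claims against an executable environment can look like, the sketch below verifies cell-level claims against a solved grid. Everything in it is an assumption made for illustration: the `Claim` format, the cell labels, the 2x2 `SOLUTION` grid, and the choice to verify by direct comparison with a known solution. The paper's actual claim types and verification rules are not described in this review.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    """Hypothetical claim: 'the cell at (row, col) holds `value`'."""
    row: int
    col: int
    value: str  # assumed labels, e.g. "tent" or "empty"

# Hypothetical solved 2x2 grid standing in for the puzzle environment.
SOLUTION = [
    ["tent", "empty"],
    ["empty", "empty"],
]

def verify(claim, solution):
    """Return True only if the claim is in bounds and matches the solution."""
    in_bounds = (
        0 <= claim.row < len(solution)
        and 0 <= claim.col < len(solution[0])
    )
    return in_bounds and solution[claim.row][claim.col] == claim.value

claims = [
    Claim(0, 0, "tent"),   # correct
    Claim(1, 1, "tent"),   # wrong value
    Claim(5, 0, "empty"),  # out of bounds, e.g. an extraction artifact
]
verdicts = {claim: verify(claim, SOLUTION) for claim in claims}
for claim, ok in verdicts.items():
    print(claim, "->", "verified" if ok else "rejected")
```

The point of the design is that the verdict comes from deterministic code rather than from another model's judgment.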

Stability analysis: The authors provide thorough robustness checks. A six-extractor ablation shows η varies by only 1.9% across extractors with no self-bias. Same-trace repeated extraction yields high Jaccard overlap (0.79–0.98). A sensitivity analysis of the Markov chain assumptions (7 alternative configurations) shows strong rank correlation (ρ ≥ 0.778). A perturbation analysis shows a coefficient of variation (CV) below 5% for 6×6+ graphs.
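Two of the stability statistics named above are simple to state in code. The sketch below uses the standard definitions of Jaccard overlap and coefficient of variation on made-up inputs; the claim labels and η values are hypothetical, and using the sample standard deviation for CV is an assumption, since the review does not say which variant the authors use.

```python
import statistics

def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two collections of claims."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty extractions agree fully
    return len(a & b) / len(a | b)

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical claim sets from two extraction runs on the same trace.
run_1 = {"A", "B", "C", "D"}
run_2 = {"A", "B", "C", "E"}
overlap = jaccard(run_1, run_2)  # 3 shared claims out of 5 distinct ones

# Hypothetical eta values under repeated perturbation of one graph.
eta_samples = [0.50, 0.52, 0.48]
cv = coefficient_of_variation(eta_samples)

print(f"Jaccard overlap: {overlap:.2f}")
print(f"CV: {cv:.1%}")
```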

Concerns: The reliance on LLM-based extraction is acknowledged but remains a limitation. Manual inspection of 200 rule applications finds 75.5% fully correct — a 24.5% error rate that could be problematic, though the strict criterion (single missing premise = full error) makes this conservative. The analysis is concentrated on solved Tents instances (n=85), which limits generalizability claims. The pipeline requires puzzle-specific claim and rule type definitions, reducing out-of-the-box scalability to new domains.

3. Potential Impact

Diagnostic evaluation: The framework shifts evaluation from "did the model get it right?" to "how did the model reason?" — a paradigm with broad applicability. The finding that token count is uncorrelated with η (r=-0.05, p=0.64) while verification overhead grows linearly with tokens (r=0.53) is practically important for compute allocation.
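For readers who want to reproduce this kind of check on their own traces, a Pearson correlation can be computed as sketched below. The arrays are toy values chosen so that one relation is perfectly linear and the other is uncorrelated; they are not data from the paper, and `pearson_r` is a hypothetical helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)

# Toy values only (assumption): token counts per trace, a verification
# overhead that grows with tokens, and an eta that does not track tokens.
tokens = [1000, 2000, 3000, 4000]
overhead = [10, 20, 30, 40]
eta = [0.6, 0.4, 0.4, 0.6]

r_overhead = pearson_r(tokens, overhead)  # close to 1 for this toy data
r_eta = pearson_r(tokens, eta)            # close to 0 for this toy data
print(f"r(tokens, overhead) = {r_overhead:.2f}")
print(f"r(tokens, eta) = {r_eta:.2f}")
```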

Training signal potential: The authors correctly note that η could serve as auxiliary feedback in RLVR-style training, rewarding solution-focused reasoning. If extraction becomes reliable and low-latency, this could influence how reasoning models are trained.

Cross-domain extensibility: The structural layer (graph construction, Markov chain, η) is domain-agnostic; only claim verification is puzzle-specific. The paper suggests extensions to mathematical reasoning (symbolic verification) and code generation (unit tests), though these remain speculative.

Limitations on impact breadth: The benchmark evaluates only open-source models for structural analysis (closed-source models don't expose traces). The puzzle domain, while well-controlled, may not capture reasoning patterns in more naturalistic settings like mathematical proof or scientific reasoning.

4. Timeliness & Relevance

The paper arrives at a critical moment: reasoning models (DeepSeek-R1, o1, etc.) are proliferating, but evaluation remains primitive. The community increasingly recognizes that accuracy alone is insufficient — recent work on overthinking, underthinking, and diversity collapse all point to the same gap this paper addresses structurally. The finding that hardest puzzles remain largely unsolved despite massive token budgets (all models ≤5.7% on "Human hard") adds to growing evidence that simply scaling test-time compute is insufficient.

5. Strengths & Limitations

Key strengths:

The η metric fills a genuine void — it is the only tested metric that correlates positively with accuracy while remaining uncorrelated with token count (Table 2)

Executable environment provides ground-truth verification, unlike many trace-analysis approaches

The absorbing Markov chain formulation is elegant and theoretically motivated

Comprehensive robustness analysis across extractors, Markov chain variants, and perturbations

The "some redundancy may be beneficial" finding (restatements correlating with higher η) provides nuanced insight into reasoning behavior

Notable limitations:

Graph extraction relies heavily on LLMs, creating a dependency on current-generation models

Structural analysis limited to 85 solved Tents instances; broader puzzle-family analysis is incomplete

Each puzzle family requires custom claim/rule definitions — substantial engineering effort for new domains

The paper evaluates only four models, limiting the generalizability of comparative findings

η is well-defined only for solved instances; the failed-trace analysis (Table 10, n=10) is preliminary

The correlation analyses, while informative, are observational — causal claims about reasoning quality remain speculative

Additional observations: The benchmark itself (21 puzzles × 4 difficulties × 5 instances) is a useful contribution but modest in scale. The paper's strongest contribution is conceptual — establishing that reasoning topology is measurable and informative — rather than delivering a turnkey evaluation tool. The extensive appendix (28 pages of prompts and specifications) underscores the engineering complexity, which may limit adoption.

Rating: 6.8 / 10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated Jun 3, 2026

Comparison History (23)

vs. ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

claude-opus-4.6 · 6/6/2026

Paper 2 introduces a fundamentally new way to analyze and measure LLM reasoning through structural graph-based representations, addressing a critical gap in evaluation methodology. This has broad impact across the entire LLM research community, as better evaluation tools influence model development, training strategies, and architectural decisions. The reasoning efficiency metric and structural analysis framework are generalizable and could become standard diagnostic tools. Paper 1, while technically strong, represents an incremental advance in agentic system design with a narrower scope of impact focused on agent configuration.

vs. Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

gemini-3.1 · 6/6/2026

Paper 1 introduces a novel, structured approach to analyzing the internal reasoning topology of Large Language Models, addressing a critical gap in current LLM evaluation which relies heavily on opaque token counts and final accuracy. By converting reasoning traces into verifiable graphs, it offers a foundational interpretability and diagnostic tool that has broad implications across AI safety, cognitive modeling, and model development, giving it a broader potential scientific impact than the specialized optimization tasks in Paper 2.

vs. Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

gemini-3.1 · 6/6/2026

Paper 2 addresses a highly timely and critical challenge in the dominant field of large language models: evaluating and understanding reasoning structures beyond simple token counts or accuracy. By converting reasoning traces into verifiable graphs, it introduces a novel, scalable, and broadly applicable methodology for AI evaluation. While Paper 1 offers a strong technical improvement for Federated Learning, Paper 2's focus on LLM reasoning evaluation has a wider potential impact across the broader AI community.

vs. Prototype Transformer: Towards Language Model Architectures Interpretable by Design

gemini-3.1 · 6/5/2026

Paper 2 introduces a fundamental architectural shift, offering linear-cost attention and inherent interpretability. This addresses critical scaling, opacity, and trust issues in AI, potentially revolutionizing how future foundational models are designed. While Paper 1 provides a valuable evaluation framework for existing models, Paper 2's architectural innovation has broader implications for the core design, efficiency, and safety of next-generation AI systems across all domains.

vs. Online Skill Learning for Web Agents via State-Grounded Dynamic Retrieval

claude-opus-4.6 · 6/5/2026

Paper 2 introduces a novel framework for analyzing the *structure* of reasoning in LLMs, moving beyond surface metrics (accuracy, token count) to graph-based topological analysis. This addresses a fundamental gap in understanding LLM reasoning and has broad applicability across all reasoning-capable models and tasks. Paper 1, while solid engineering work showing incremental improvements in web agent skill retrieval, is more narrowly scoped to web automation. Paper 2's methodological contribution—converting reasoning traces into verifiable graphs with efficiency metrics—offers a new analytical paradigm with wider cross-field impact and greater potential to influence future evaluation standards.

vs. AutoLab: Can Frontier Models Solve Long-Horizon Auto Research and Engineering Tasks?

claude-opus-4.6 · 6/5/2026

AutoLab addresses a critical gap in evaluating frontier AI models on long-horizon iterative tasks, a fundamental capability for autonomous scientific research. With 36 expert-curated tasks across diverse domains, evaluation of 17 state-of-the-art models, and fully open-sourced artifacts, it provides substantial infrastructure for the community. Its finding that persistence and iterative refinement matter more than initial quality is actionable and timely. Paper 2 offers useful structural analysis of reasoning traces but is narrower in scope, focusing on logic puzzles and reasoning graph topology, with more incremental contributions to LLM evaluation methodology.

vs. AgentCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

gemini-3.1 · 6/3/2026

Paper 2 addresses the critical and highly timely challenge of evaluating Large Reasoning Models (LRMs). By transforming opaque reasoning traces into verifiable, measurable topological graphs, it offers a novel and rigorous methodology to analyze test-time compute. This structural approach to assessing reasoning efficiency has broader potential impact on understanding and improving state-of-the-art LLMs compared to Paper 1's focus on continual learning benchmarks for agents.

vs. From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework

gemini-3.1 · 6/3/2026

Paper 1 introduces a quantitative, graph-based methodology for evaluating LLM reasoning, addressing a critical bottleneck in AI research. Its rigorous approach to measuring reasoning efficiency offers foundational scientific value for computer science and AI safety. In contrast, Paper 2 focuses on legal, policy, and insurance frameworks (risk transfer and claim reconstruction). While highly valuable for industry and governance, Paper 1 presents stronger potential to directly influence core scientific research and technical model development.

vs. Property-Guided LLM Program Synthesis for Planning

claude-opus-4.6 · 6/3/2026

Paper 1 presents a concrete, novel methodology (property-guided LLM synthesis with counterexample feedback) that demonstrates significant practical improvements—7x fewer program generations and orders of magnitude less computation—with direct applicability to planning and potentially other program synthesis domains. It introduces a verifiable, formally grounded approach that bridges LLM synthesis and formal methods. Paper 2 contributes useful evaluation methodology for reasoning models but is more diagnostic/analytical in nature, with narrower immediate practical impact. Paper 1's combination of methodological novelty, strong empirical results, and broad applicability gives it higher potential impact.

vs. AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

claude-opus-4.6 · 6/3/2026

Paper 2 presents a novel framework combining agentic RL with LLMs for automatic heuristic design, demonstrating practical results across eight diverse NP-hard optimization domains with a compact 4B-parameter model matching larger models. It addresses a broadly applicable problem (combinatorial optimization) with clear real-world applications, introduces a novel training paradigm (agentic RL for AHD), and shows strong generalization to held-out tasks. Paper 1 offers valuable diagnostic tools for analyzing reasoning structures but is more narrowly focused on LRM evaluation methodology with less immediate transformative potential.

vs. TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

claude-opus-4.6 · 6/3/2026

Paper 2 (TRON) introduces a practical, scalable training infrastructure for visual reasoning RL with 520 environments, demonstrating consistent improvements across multiple models and benchmarks. It addresses a fundamental bottleneck (static datasets for RL post-training) with an unbounded online generation approach. Paper 1 offers valuable diagnostic tools for analyzing reasoning structures but is primarily an evaluation/analysis contribution. TRON has broader impact potential: it enables new training paradigms, supports curriculum learning, and provides a reusable substrate for the rapidly growing multimodal reasoning community, making it more likely to influence future research directions.

vs. RelGT-AC: A Relational Graph Transformer for Autocomplete Tasks in Relational Databases

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a general framework and benchmark for analyzing LLM reasoning via verifiable reasoning graphs and a new efficiency metric. This could influence evaluation practices across many LLM applications (alignment, safety, interpretability, model selection), beyond a single domain. Paper 1 is solid and practical for relational ML, but its contributions are more incremental (masking, unified head, TF-IDF) and its impact is narrower to RelBench-style autocomplete in relational databases.

vs. EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management

gemini-3.1 · 6/3/2026

Paper 1 offers a foundational methodological innovation by converting unstructured LLM reasoning into measurable, topological graphs. While Paper 2 presents a highly effective applied agent system, Paper 1 addresses a fundamental scientific gap in evaluating and interpreting the 'black box' of Large Reasoning Models. As the field shifts toward complex reasoning models (like OpenAI's o1), verifiable evaluation frameworks and efficiency metrics will have a broader, longer-lasting impact across AI research than specific agent architectures, which tend to be superseded rapidly.

vs. StepFinder: A Temporal Semantic Framework for Failure Attribution in Multi-Agent Systems

gpt-5.2 · 6/3/2026

Paper 2 has higher potential impact because it introduces a more general, field-spanning framework: converting LLM reasoning traces into verifiable dependency graphs and defining topology-based metrics (including reasoning efficiency). This is broadly applicable to evaluation, interpretability, scaling analysis, and failure diagnosis across many reasoning tasks and model classes, beyond multi-agent settings. Its benchmark+measurement approach is timely and likely to become a reusable evaluation primitive. Paper 1 is rigorous and practically useful, but is more domain-specific (failure attribution in multi-agent trajectories) and thus narrower in cross-field influence.

vs. SafeSteer: Localized On-Policy Distillation for Efficient Safety Alignment

gemini-3.1 · 6/3/2026

Paper 2 introduces a fundamental methodological innovation by transforming unstructured reasoning traces into measurable, verifiable graphs. As large reasoning models become increasingly prominent, this novel evaluation framework offers a crucial tool for diagnosing and understanding logical flow beyond simple accuracy metrics, promising broader conceptual impact and long-term relevance across AI research compared to the specific alignment technique in Paper 1.

vs. Towards Non-Monotonic Entailment in Propositional Defeasible Standpoint Logic

gemini-3.1 · 6/3/2026

Paper 1 addresses the highly timely and impactful field of large language models by proposing a novel, structured framework to evaluate reasoning. Its practical tools for diagnosing LLM behaviors have immediate, broad applications across AI and NLP. In contrast, Paper 2 offers theoretical advancements in a much narrower, niche subfield of formal logic and symbolic AI, limiting its overall scientific and real-world impact.

vs. The Shadow Price of Reasoning: Economic Perspective on Optimal Budget Allocation for LLMs

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a fundamentally new way to analyze and measure LLM reasoning through structured reasoning graphs and efficiency metrics, addressing a significant gap in how we evaluate reasoning models. This provides a foundational diagnostic framework applicable across all reasoning models and tasks. Paper 1, while practically valuable with its economic optimization framework for budget allocation (CLEAR), addresses a more specific deployment optimization problem. Paper 2's contribution of converting reasoning traces into verifiable graph structures has broader methodological impact, enabling new research directions in understanding, comparing, and improving reasoning across the field.

vs. ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

claude-opus-4.6 · 6/3/2026

ReSkill introduces a novel framework that addresses a fundamental challenge in agentic RL—co-evolving reusable skills with policy optimization. It combines multiple technical innovations (assertion-driven skill creation, within-group rollout sampling, Thompson Sampling with adaptive discounting) and demonstrates broad empirical gains across domains, especially on unseen tasks. Paper 2 offers a useful diagnostic tool for analyzing reasoning structures in LRMs, but its scope is narrower (benchmark + metric for logic puzzles). ReSkill's potential for real-world impact in LLM agent systems and its methodological depth give it higher estimated impact.

vs. AGENTCL: Toward Rigorous Evaluation of Continual Learning in Language Agents

gemini-3.1 · 6/3/2026

Paper 1 introduces a highly novel and timely methodology for evaluating Large Reasoning Models by converting unstructured reasoning traces into verifiable graphs. Given the recent surge in reasoning-focused LLMs, moving beyond superficial metrics like accuracy to analyze the topological structure and efficiency of reasoning addresses a critical bottleneck in the field. While Paper 2 presents a valuable benchmark for continual learning in agents, Paper 1's approach has broader foundational implications for understanding and diagnosing the core cognitive mechanics of modern AI models.

vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to a large-scale, real-world benchmark (millions of transactions) enabling rigorous, reproducible evaluation of personalized decision modeling—an area with immediate applications in recommender systems, decision support, fintech, and human-AI interaction. Its use of observed behavioral traces addresses a timely gap (simulation vs. human behavior divergence) and offers breadth across ML, economics/markets, and behavioral modeling. Paper 1 is innovative for reasoning-structure evaluation, but its scope is narrower (logic puzzles/trace graphs) and may see slower real-world adoption compared to a widely usable dataset and evaluation framework.