Reasoning Structure of Large Language Models
Frédéric Berdoz, Luca A. Lanzendörfer, Fabian Farestam, Roger Wattenhofer
Abstract
Large reasoning models (LRMs) are often evaluated using metrics such as final-answer accuracy or token count. However, identical scores on these metrics can hide fundamentally different reasoning structures. To address this limitation, we introduce a scalable LRM benchmark of logic puzzles and a pipeline that converts unstructured traces into verifiable reasoning graphs of claims and dependencies. This turns reasoning into a structured, measurable object whose topology can be quantitatively analyzed. Building on this, we define a reasoning efficiency metric that quantifies how concentrated the model's logical flow is. Our analysis on open-source reasoning models shows that structural measurements separate behaviors that token count and accuracy conflate, providing a practical tool for diagnosing failure modes and comparing how reasoning scales with puzzle difficulty.
AI Impact Assessments (1 model)
Scientific Impact Assessment: "Reasoning Structure of Large Language Models"
1. Core Contribution
This paper addresses a genuine gap in LRM evaluation: standard metrics (accuracy, token count) collapse rich reasoning behavior into single numbers, hiding fundamental structural differences. The authors propose three interconnected contributions: (1) a scalable benchmark of 21 grid-based logic puzzles at four difficulty levels, (2) a pipeline that converts free-form textual reasoning traces into directed acyclic graphs (DAGs) of verifiable claims and dependencies, and (3) a reasoning-flow efficiency metric η based on absorbing Markov chains that quantifies how concentrated a model's logical flow is relative to the minimal claim set needed for a solution.
The key insight — modeling reasoning as an absorbing Markov chain over a claim graph and measuring structural entropy — is genuinely novel. By computing the fundamental matrix of the chain, the authors quantify how "logical mass" distributes across the reasoning graph, distinguishing focused deduction from diffuse exploration in a principled, information-theoretic manner.
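To make the Markov-chain idea concrete, the following is a minimal sketch on a toy claim graph. It is not the paper's definition of η: the graph, the uniform transition rule, and the normalized-entropy summary are all assumptions chosen for illustration. On a DAG, propagating visit mass forward in topological order gives the same values as the source row of the fundamental matrix N = (I - Q)^{-1}, so no matrix inversion is needed here.

```python
import math

# Hypothetical toy claim graph (not from the paper): an edge u -> v means
# claim v was derived using claim u. "solved" is the absorbing state.
EDGES = {
    "start": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["solved"],
    "d": ["solved"],
    "solved": [],
}
TOPO_ORDER = ["start", "a", "b", "c", "d", "solved"]


def expected_visits(edges, order, source):
    """Expected visits per node for a random walk from `source`.

    Assumption: the walk picks an outgoing edge uniformly at random.
    For a DAG this equals the source row of N = (I - Q)^{-1}; the value
    at the absorbing node is the absorption probability.
    """
    visits = {node: 0.0 for node in order}
    visits[source] = 1.0
    for node in order:  # topological order: predecessors are final first
        successors = edges[node]
        for nxt in successors:
            visits[nxt] += visits[node] / len(successors)
    return visits


def normalized_entropy(weights):
    """Shannon entropy of the normalized weights, scaled to [0, 1]."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(probs))


visits = expected_visits(EDGES, TOPO_ORDER, "start")
# Only transient (non-absorbing) claims contribute to the spread measure.
transient = [visits[n] for n in TOPO_ORDER if EDGES[n]]
spread = normalized_entropy(transient)
print(visits)
print(round(spread, 3))
```

In this sketch, a value of `spread` near 1 means visit mass is distributed evenly across claims (diffuse exploration), while a value near 0 means it concentrates on a few claims (focused deduction). How the paper maps such a quantity to η relative to the minimal claim set is not reproduced here.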
2. Methodological Rigor
Strengths in design: The pipeline is carefully modular: deterministic extractors handle high-precision claims, LLM-based extractors handle recall, and a screening step filters artifacts. Claims are verified against an executable puzzle environment, grounding the analysis in deterministic truth rather than subjective judgment. The separation of extraction roles (GPT-5.2 for claim extraction, GPT-5-mini for rule extraction) mitigates self-evaluation bias.
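The extract-then-verify idea can be sketched as follows. Everything in this example is hypothetical: the claim wording, the regular expression, the grid encoding, and the function names are invented for illustration and do not come from the paper's pipeline, which also uses LLM-based extractors and a screening step that are not shown.

```python
import re

# Hypothetical solved 3x3 Tents-style grid: "T" = tree, "A" = tent, "." = empty.
SOLUTION = ["TA.", "...", ".AT"]

# Hypothetical claim pattern, e.g. "tent at (0,1)" or "empty at (1,1)".
CLAIM_RE = re.compile(r"(tent|empty) at \((\d+),\s*(\d+)\)")
SYMBOL = {"tent": "A", "empty": "."}


def extract_claims(trace):
    """Deterministic extractor: pull (kind, row, col) claims out of free text."""
    return [(kind, int(r), int(c)) for kind, r, c in CLAIM_RE.findall(trace)]


def verify(claim):
    """Check one claim against the known solution grid."""
    kind, row, col = claim
    return SOLUTION[row][col] == SYMBOL[kind]


trace = "So there must be a tent at (0,1). Then empty at (1,1), and a tent at (2,2)."
claims = extract_claims(trace)
results = {claim: verify(claim) for claim in claims}
print(results)
```

The point of the design is that each claim's truth value comes from the environment rather than from a judge model, so a false claim such as the last one in the toy trace is flagged deterministically.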
Stability analysis: The authors provide thorough robustness checks. A six-extractor ablation shows that η varies by only 1.9% across extractors, with no evidence of self-bias. Repeated extraction from the same trace yields high Jaccard overlap (0.79–0.98). A sensitivity analysis over the Markov chain assumptions (7 alternative configurations) shows strong rank correlation (ρ ≥ 0.778). A perturbation analysis shows a coefficient of variation (CV) below 5% for graphs from 6×6 and larger puzzles.
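For readers unfamiliar with two of the stability statistics mentioned here, the sketch below computes a Jaccard overlap between two claim sets and a coefficient of variation over repeated metric values. The claim strings and the η values are made-up illustrations, and the use of the population standard deviation is an assumption; the paper's exact estimator is not stated in this review.

```python
import statistics


def jaccard(a, b):
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two collections of claims."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty extractions are treated as identical
    return len(a & b) / len(a | b)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed estimator)."""
    return statistics.pstdev(values) / statistics.mean(values)


# Hypothetical claim sets from two extraction runs over the same trace.
run_1 = {"tent(1,2)", "tree(0,2)", "empty(3,3)", "tent(4,0)"}
run_2 = {"tent(1,2)", "tree(0,2)", "empty(3,3)", "empty(2,2)"}
overlap = jaccard(run_1, run_2)

# Hypothetical η values from repeated perturbed runs on one graph.
eta_values = [0.50, 0.52, 0.48]
cv = coefficient_of_variation(eta_values)

print(overlap, round(cv, 4))
```

Here the two runs share three of five distinct claims, so the overlap is 0.6, which would fall below the 0.79–0.98 range reported above.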
Concerns: The reliance on LLM-based extraction is acknowledged but remains a limitation. Manual inspection of 200 rule applications finds 75.5% fully correct — a 24.5% error rate that could be problematic, though the strict criterion (single missing premise = full error) makes this conservative. The analysis is concentrated on solved Tents instances (n=85), which limits generalizability claims. The pipeline requires puzzle-specific claim and rule type definitions, reducing out-of-the-box scalability to new domains.
3. Potential Impact
Diagnostic evaluation: The framework shifts evaluation from "did the model get it right?" to "how did the model reason?", a paradigm with broad applicability. The finding that token count is uncorrelated with η (r=-0.05, p=0.64), while verification overhead is moderately positively correlated with token count (r=0.53), is practically important for compute allocation.
Training signal potential: The authors correctly note that η could serve as auxiliary feedback in RLVR-style training, rewarding solution-focused reasoning. If extraction becomes reliable and low-latency, this could influence how reasoning models are trained.
Cross-domain extensibility: The structural layer (graph construction, Markov chain, η) is domain-agnostic; only claim verification is puzzle-specific. The paper suggests extensions to mathematical reasoning (symbolic verification) and code generation (unit tests), though these remain speculative.
Limitations on impact breadth: The benchmark evaluates only open-source models for structural analysis (closed-source models don't expose traces). The puzzle domain, while well-controlled, may not capture reasoning patterns in more naturalistic settings like mathematical proof or scientific reasoning.
4. Timeliness & Relevance
The paper arrives at a critical moment: reasoning models (DeepSeek-R1, o1, etc.) are proliferating, but evaluation remains primitive. The community increasingly recognizes that accuracy alone is insufficient; recent work on overthinking, underthinking, and diversity collapse all points to the same gap that this paper addresses structurally. The finding that the hardest puzzles remain largely unsolved despite massive token budgets (all models ≤5.7% on "Human hard") adds to growing evidence that simply scaling test-time compute is insufficient.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Additional observations: The benchmark itself (21 puzzles × 4 difficulties × 5 instances) is a useful contribution but modest in scale. The paper's strongest contribution is conceptual — establishing that reasoning topology is measurable and informative — rather than delivering a turnkey evaluation tool. The extensive appendix (28 pages of prompts and specifications) underscores the engineering complexity, which may limit adoption.
Generated Jun 3, 2026
Comparison History (23)
Paper 2 introduces a fundamentally new way to analyze and measure LLM reasoning through structural graph-based representations, addressing a critical gap in evaluation methodology. This has broad impact across the entire LLM research community, as better evaluation tools influence model development, training strategies, and architectural decisions. The reasoning efficiency metric and structural analysis framework are generalizable and could become standard diagnostic tools. Paper 1, while technically strong, represents an incremental advance in agentic system design with a narrower scope of impact focused on agent configuration.
Paper 1 introduces a novel, structured approach to analyzing the internal reasoning topology of Large Language Models, addressing a critical gap in current LLM evaluation which relies heavily on opaque token counts and final accuracy. By converting reasoning traces into verifiable graphs, it offers a foundational interpretability and diagnostic tool that has broad implications across AI safety, cognitive modeling, and model development, giving it a broader potential scientific impact than the specialized optimization tasks in Paper 2.
Paper 2 addresses a highly timely and critical challenge in the dominant field of large language models: evaluating and understanding reasoning structures beyond simple token counts or accuracy. By converting reasoning traces into verifiable graphs, it introduces a novel, scalable, and broadly applicable methodology for AI evaluation. While Paper 1 offers a strong technical improvement for Federated Learning, Paper 2's focus on LLM reasoning evaluation has a wider potential impact across the broader AI community.
Paper 2 introduces a fundamental architectural shift, offering linear-cost attention and inherent interpretability. This addresses critical scaling, opacity, and trust issues in AI, potentially revolutionizing how future foundational models are designed. While Paper 1 provides a valuable evaluation framework for existing models, Paper 2's architectural innovation has broader implications for the core design, efficiency, and safety of next-generation AI systems across all domains.
Paper 2 introduces a novel framework for analyzing the *structure* of reasoning in LLMs, moving beyond surface metrics (accuracy, token count) to graph-based topological analysis. This addresses a fundamental gap in understanding LLM reasoning and has broad applicability across all reasoning-capable models and tasks. Paper 1, while solid engineering work showing incremental improvements in web agent skill retrieval, is more narrowly scoped to web automation. Paper 2's methodological contribution—converting reasoning traces into verifiable graphs with efficiency metrics—offers a new analytical paradigm with wider cross-field impact and greater potential to influence future evaluation standards.
AutoLab addresses a critical gap in evaluating frontier AI models on long-horizon iterative tasks, a fundamental capability for autonomous scientific research. With 36 expert-curated tasks across diverse domains, evaluation of 17 state-of-the-art models, and fully open-sourced artifacts, it provides substantial infrastructure for the community. Its finding that persistence and iterative refinement matter more than initial quality is actionable and timely. Paper 2 offers useful structural analysis of reasoning traces but is narrower in scope, focusing on logic puzzles and reasoning graph topology, with more incremental contributions to LLM evaluation methodology.
Paper 2 addresses the critical and highly timely challenge of evaluating Large Reasoning Models (LRMs). By transforming opaque reasoning traces into verifiable, measurable topological graphs, it offers a novel and rigorous methodology to analyze test-time compute. This structural approach to assessing reasoning efficiency has broader potential impact on understanding and improving state-of-the-art LLMs compared to Paper 1's focus on continual learning benchmarks for agents.
Paper 1 introduces a quantitative, graph-based methodology for evaluating LLM reasoning, addressing a critical bottleneck in AI research. Its rigorous approach to measuring reasoning efficiency offers foundational scientific value for computer science and AI safety. In contrast, Paper 2 focuses on legal, policy, and insurance frameworks (risk transfer and claim reconstruction). While highly valuable for industry and governance, Paper 1 presents stronger potential to directly influence core scientific research and technical model development.
Paper 1 presents a concrete, novel methodology (property-guided LLM synthesis with counterexample feedback) that demonstrates significant practical improvements—7x fewer program generations and orders of magnitude less computation—with direct applicability to planning and potentially other program synthesis domains. It introduces a verifiable, formally grounded approach that bridges LLM synthesis and formal methods. Paper 2 contributes useful evaluation methodology for reasoning models but is more diagnostic/analytical in nature, with narrower immediate practical impact. Paper 1's combination of methodological novelty, strong empirical results, and broad applicability gives it higher potential impact.
Paper 2 presents a novel framework combining agentic RL with LLMs for automatic heuristic design, demonstrating practical results across eight diverse NP-hard optimization domains with a compact 4B-parameter model matching larger models. It addresses a broadly applicable problem (combinatorial optimization) with clear real-world applications, introduces a novel training paradigm (agentic RL for AHD), and shows strong generalization to held-out tasks. Paper 1 offers valuable diagnostic tools for analyzing reasoning structures but is more narrowly focused on LRM evaluation methodology with less immediate transformative potential.
Paper 2 (TRON) introduces a practical, scalable training infrastructure for visual reasoning RL with 520 environments, demonstrating consistent improvements across multiple models and benchmarks. It addresses a fundamental bottleneck (static datasets for RL post-training) with an unbounded online generation approach. Paper 1 offers valuable diagnostic tools for analyzing reasoning structures but is primarily an evaluation/analysis contribution. TRON has broader impact potential: it enables new training paradigms, supports curriculum learning, and provides a reusable substrate for the rapidly growing multimodal reasoning community, making it more likely to influence future research directions.
Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a general framework and benchmark for analyzing LLM reasoning via verifiable reasoning graphs and a new efficiency metric. This could influence evaluation practices across many LLM applications (alignment, safety, interpretability, model selection), beyond a single domain. Paper 1 is solid and practical for relational ML, but its contributions are more incremental (masking, unified head, TF-IDF) and its impact is narrower to RelBench-style autocomplete in relational databases.
Paper 1 offers a foundational methodological innovation by converting unstructured LLM reasoning into measurable, topological graphs. While Paper 2 presents a highly effective applied agent system, Paper 1 addresses a fundamental scientific gap in evaluating and interpreting the 'black box' of Large Reasoning Models. As the field shifts toward complex reasoning models (like OpenAI's o1), verifiable evaluation frameworks and efficiency metrics will have a broader, longer-lasting impact across AI research than specific agent architectures, which tend to be superseded rapidly.
Paper 2 has higher potential impact because it introduces a more general, field-spanning framework: converting LLM reasoning traces into verifiable dependency graphs and defining topology-based metrics (including reasoning efficiency). This is broadly applicable to evaluation, interpretability, scaling analysis, and failure diagnosis across many reasoning tasks and model classes, beyond multi-agent settings. Its benchmark+measurement approach is timely and likely to become a reusable evaluation primitive. Paper 1 is rigorous and practically useful, but is more domain-specific (failure attribution in multi-agent trajectories) and thus narrower in cross-field influence.
Paper 2 introduces a fundamental methodological innovation by transforming unstructured reasoning traces into measurable, verifiable graphs. As large reasoning models become increasingly prominent, this novel evaluation framework offers a crucial tool for diagnosing and understanding logical flow beyond simple accuracy metrics, promising broader conceptual impact and long-term relevance across AI research compared to the specific alignment technique in Paper 1.
Paper 1 addresses the highly timely and impactful field of large language models by proposing a novel, structured framework to evaluate reasoning. Its practical tools for diagnosing LLM behaviors have immediate, broad applications across AI and NLP. In contrast, Paper 2 offers theoretical advancements in a much narrower, niche subfield of formal logic and symbolic AI, limiting its overall scientific and real-world impact.
Paper 2 introduces a fundamentally new way to analyze and measure LLM reasoning through structured reasoning graphs and efficiency metrics, addressing a significant gap in how we evaluate reasoning models. This provides a foundational diagnostic framework applicable across all reasoning models and tasks. Paper 1, while practically valuable with its economic optimization framework for budget allocation (CLEAR), addresses a more specific deployment optimization problem. Paper 2's contribution of converting reasoning traces into verifiable graph structures has broader methodological impact, enabling new research directions in understanding, comparing, and improving reasoning across the field.
ReSkill introduces a novel framework that addresses a fundamental challenge in agentic RL—co-evolving reusable skills with policy optimization. It combines multiple technical innovations (assertion-driven skill creation, within-group rollout sampling, Thompson Sampling with adaptive discounting) and demonstrates broad empirical gains across domains, especially on unseen tasks. Paper 2 offers a useful diagnostic tool for analyzing reasoning structures in LRMs, but its scope is narrower (benchmark + metric for logic puzzles). ReSkill's potential for real-world impact in LLM agent systems and its methodological depth give it higher estimated impact.
Paper 1 introduces a highly novel and timely methodology for evaluating Large Reasoning Models by converting unstructured reasoning traces into verifiable graphs. Given the recent surge in reasoning-focused LLMs, moving beyond superficial metrics like accuracy to analyze the topological structure and efficiency of reasoning addresses a critical bottleneck in the field. While Paper 2 presents a valuable benchmark for continual learning in agents, Paper 1's approach has broader foundational implications for understanding and diagnosing the core cognitive mechanics of modern AI models.
Paper 2 likely has higher impact due to a large-scale, real-world benchmark (millions of transactions) enabling rigorous, reproducible evaluation of personalized decision modeling—an area with immediate applications in recommender systems, decision support, fintech, and human-AI interaction. Its use of observed behavioral traces addresses a timely gap (simulation vs. human behavior divergence) and offers breadth across ML, economics/markets, and behavioral modeling. Paper 1 is innovative for reasoning-structure evaluation, but its scope is narrower (logic puzzles/trace graphs) and may see slower real-world adoption compared to a widely usable dataset and evaluation framework.