GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory

Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta

#2658 of 3355 · Artificial Intelligence
Tournament Score: 1323±43
Win Rate: 32% (8 wins, 17 losses, 25 matches)
Rating: 4.5 / 10

Abstract

Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that correct algorithm, wrong execution errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GTBench

1. Core Contribution

GTBench introduces a 63-problem benchmark for evaluating LLMs on graph theory tasks, organized into three difficulty tiers: undergraduate definitions (Group 1, 31 problems), algorithm tracing (Group 2, 21 problems), and graduate-level proof construction (Group 3, 11 problems). The benchmark is "curriculum-grounded," meaning problems follow the standard pedagogical progression of graph theory education, sourced from Diestel's textbook and a UPC course problem set. Five frontier LLMs are evaluated under zero-shot and chain-of-thought prompting, with a hybrid evaluation protocol combining exact-match, LLM-as-judge, and human expert evaluation.

The paper fills a genuine gap: graph-theoretic reasoning has been underrepresented in LLM benchmarks compared to algebraic, numerical, and competition-style mathematics. The curriculum-grounded framing — explicitly asking whether LLMs can serve as "mathematical research assistants" — is a practical and timely angle.

2. Methodological Rigor

Strengths in methodology:

  • The three-tier structure with distinct evaluation protocols per tier is well-motivated. Using exact-match and LLM-as-judge scoring for Groups 1-2 and adding human expert evaluation for Group 3 reflects awareness that proof assessment demands nuanced judgment.
  • The failure mode taxonomy (Types A-D) provides interpretable diagnostics beyond raw accuracy.
  • Using three independent human evaluators with inter-rater agreement analysis (Cohen's κ) adds credibility to the Group 3 results.
  • Temperature set to 0 with three repeated runs per condition enables consistency analysis, yielding the interesting finding that 95-99% of errors are deterministically reproduced.
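The consistency analysis in the last bullet can be pictured with a short sketch. The paper's exact definition of a "deterministically reproduced" error is not given here, so the rule below is an assumption: an error counts as reproduced when every run is wrong and all runs give the same wrong answer. The function name, the answer format, and the toy data are hypothetical.

```python
from typing import Dict, List


def reproduced_error_rate(runs: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Share of erroneous problems whose wrong answer is identical in every run.

    Assumed definition (not confirmed by the paper): a problem is "erroneous"
    if at least one run is wrong, and "reproduced" if all runs are wrong and
    agree on the same wrong answer.
    """
    erroneous = 0
    reproduced = 0
    for pid, answers in runs.items():
        wrong = [a for a in answers if a != gold[pid]]
        if not wrong:
            continue  # every run was correct, nothing to count
        erroneous += 1
        if len(wrong) == len(answers) and len(set(answers)) == 1:
            reproduced += 1
    return reproduced / erroneous if erroneous else 0.0


# Toy data invented for illustration; not taken from GTBench.
gold = {"p1": "4", "p2": "yes", "p3": "7"}
runs = {
    "p1": ["4", "4", "4"],     # correct in all three runs
    "p2": ["no", "no", "no"],  # same wrong answer every time
    "p3": ["7", "6", "7"],     # sporadic error in one run
}
rate = reproduced_error_rate(runs, gold)
print(f"reproduced error rate: {rate:.2f}")
```

Under this definition, a rate near 1 would indicate stable misconceptions rather than sampling noise, which is the reading the review gives to the 95-99% figure.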
Weaknesses in methodology:

  • The benchmark is quite small: 63 problems total, with only 11 in the most interesting tier (Group 3). This limits statistical power and generalizability. Confidence intervals are not reported, and with 11 problems, a single problem flip changes accuracy by ~9 percentage points.
  • The filtering procedure is not fully transparent — we know 31 problems survived from Sources A and B for Group 1, but the initial pool size and rejection rates are not disclosed.
  • The paper uses GPT-4o as the LLM judge to avoid "self-serving bias" from GPT-5, but does not validate this choice empirically or discuss whether GPT-4o might systematically disadvantage or advantage specific models.
  • The 0/0.5/1 scoring rubric for Group 3 is acknowledged as coarse by the authors themselves. Given only 11 problems, this coarseness significantly limits discriminative power.
  • Only zero-shot and basic CoT prompting are tested. No few-shot, retrieval-augmented, or multi-turn interactive settings are explored, which weakens the paper's support for the "research assistant" framing.
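To make the sample-size concern concrete, the sketch below computes the accuracy step for an 11-problem tier and a 95% Wilson score interval for a hypothetical score of 9 out of 11. Both the 9/11 figure and the binary-outcome assumption are illustrative only: Group 3 uses a 0/0.5/1 rubric, so this shows the scale of the uncertainty, not the paper's actual intervals.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


n_problems = 11                      # Group 3 size reported in the review
step = 100 / n_problems              # accuracy change from flipping one problem
low, high = wilson_interval(9, n_problems)  # hypothetical 9 of 11 correct

print(f"one problem = {step:.1f} percentage points")
print(f"95% Wilson interval for 9/11: {low:.2f} to {high:.2f}")
```

An interval this wide would overlap for most pairs of models, which is why the absence of reported confidence intervals matters for the Group 3 rankings.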
3. Potential Impact

The paper has moderate practical impact. It provides a useful, if small, resource for the LLM evaluation community focused on mathematical reasoning. The finding that errors are deterministically reproduced rather than stochastic is genuinely important for trust and governance discussions: it means LLMs do not merely fail occasionally but hold systematic misconceptions that could consistently mislead students.

The curriculum-grounded framing could influence how educational institutions think about deploying LLMs as study tools, though the benchmark would need to be substantially larger to support policy decisions. The failure mode taxonomy, while not novel in concept, provides useful diagnostic categories specific to graph theory.

However, the benchmark's narrow domain focus (graph theory only), small size, and lack of formal verification infrastructure limit broader adoption. Unlike benchmarks such as MATH (12,500 problems) or GSM8K (8,500 problems), GTBench's 63 problems are insufficient for fine-tuning, capability tracking over time, or robust statistical comparisons.

4. Timeliness & Relevance

The paper addresses a timely concern: the increasing use of LLMs as educational and research tools without adequate understanding of their domain-specific reliability. The choice of graph theory is well-justified as it requires relational/structural reasoning distinct from algebraic manipulation. The inclusion of GPT-5 and Claude Sonnet 4.6 (with 2026 release dates, suggesting these are very recent or possibly speculative model designations) positions the work at the frontier of model evaluation.

However, the rapid pace of model development means benchmark results become stale quickly. The paper's value will depend on whether the benchmark itself (rather than the specific results) is adopted by the community.

5. Strengths & Limitations

Key Strengths:

  • First systematic benchmark specifically targeting graph-theoretic reasoning across difficulty levels
  • The consistency analysis revealing deterministic failures is a genuinely novel and concerning finding
  • Detailed failure mode analysis with topic-level granularity (Tables 5 and 7) provides actionable diagnostics
  • The human vs. LLM-judge comparison on Group 3 provides empirical evidence on automated evaluation limitations, finding that κ between human and LLM judge is only moderate (0.48-0.64)
  • The finding that CoT prompting can degrade performance on definitional/combinatorial tasks is counterintuitive and valuable
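The agreement statistic cited above, Cohen's κ, compares observed agreement between two raters with the agreement expected by chance from each rater's label frequencies. A minimal sketch follows, treating the 0/0.5/1 rubric scores as unordered categories. Whether the paper used unweighted or weighted κ is not stated here, so the unweighted form is an assumption, and the two score lists are invented for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores for 11 proofs; not data from the paper.
human = [1.0, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 1.0, 1.0, 0.0, 0.5]
judge = [1.0, 1.0, 1.0, 0.0, 1.0, 0.5, 0.0, 1.0, 0.5, 0.0, 0.5]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.3f}")
```

With only 11 items, a single changed label moves κ noticeably, so the reported ranges should be read with the same caution as the accuracy figures.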
Key Limitations:

  • Very small benchmark size (63 problems, 11 at the graduate level) severely limits statistical conclusions
  • No confidence intervals, significance tests, or effect sizes reported
  • The paper claims to evaluate LLMs as "mathematical research assistants" but only tests passive question-answering, not interactive assistance, verification, or iterative refinement
  • Some model names appear speculative or futuristic (GPT-5, Claude Sonnet 4.6 with Feb. 2026 release), raising questions about the paper's timeline
  • Data and code are not yet available ("upon acceptance"), limiting reproducibility assessment
  • The paper does not discuss potential data contamination — problems from Diestel and CMU course solutions are widely available online and likely in pretraining data
  • No comparison to formal verification approaches (Lean, Isabelle) that could serve as ground truth for proof correctness
Notable Omission: The contamination issue is critical. If models were trained on Diestel solutions or CMU homework answers, performance on Groups 1 and 3 may reflect memorization rather than reasoning. This is not addressed.

Summary

GTBench makes a reasonable contribution as a domain-specific benchmark for an underexplored area of LLM evaluation. Its strongest contributions are the failure mode taxonomy, the consistency analysis, and the human-vs-LLM judge comparison. However, the benchmark's small size, lack of contamination analysis, and limited evaluation methodology (only zero-shot and basic CoT) constrain its scientific impact. It reads more as a preliminary study than a definitive benchmark contribution.

Rating: 4.5 / 10
Significance 4.5 · Rigor 4 · Novelty 4.5 · Clarity 6.5

    Generated Jun 3, 2026

    Comparison History (25)

    vs. Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory
    gemini-3.1 · 6/5/2026

    Paper 2 introduces a novel benchmark for evaluating LLMs in mathematical reasoning, a highly active and rapidly evolving field. Benchmarks typically drive significant subsequent research and gather high citations, impacting both AI and mathematics. In contrast, Paper 1 presents a valuable but narrower application of predictive maintenance and fatigue assessment for circular factories, which likely has a more limited audience and targeted real-world application.

    vs. VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark
    claude-opus-4.6 · 6/5/2026

    VAMPS addresses a broader and more novel research question—whether models can benefit from constructing and reasoning over self-generated visualizations—which is a fundamental gap in multimodal AI evaluation. Its larger dataset (1,168 problems), bilingual design, and counterintuitive finding (analytical solving outperforms visual solving even when plotting is natural) offer actionable insights for the multimodal AI community. GTBench, while rigorous, is narrower in scope (63 problems, single subdomain of graph theory) and its findings about performance degradation with difficulty are less surprising. VAMPS has broader applicability across engineering and scientific workflows.

    vs. Benchmarking at the Edge of Comprehension
    gemini-3.1 · 6/5/2026

    Paper 1 addresses a fundamental, existential problem in AI research: how to evaluate models that surpass human comprehension. Its novel adversarial framework has broad applicability across all domains of AI benchmarking. In contrast, Paper 2 presents a valuable but narrower, domain-specific benchmark for graph theory. Consequently, Paper 1 promises a much broader and deeper scientific impact on the future methodology of AI evaluation.

    vs. Cascading Hallucination in Agentic RAG: The CHARM Framework for Detection and Mitigation
    gpt-5.2 · 6/5/2026

    Paper 2 is likely to have higher impact: it targets a broad, timely reliability problem in widely deployed agentic RAG systems, introduces a formalized failure mode (cascading hallucination) plus a taxonomy and a modular mitigation framework, and evaluates across multiple established benchmarks with quantitative gains and ablations. Its applications span search/QA, enterprise copilots, and safety/governance, giving cross-field relevance. Paper 1 is novel and rigorous within math/graph-theory LLM evaluation, but its scope (63 problems, one domain) is narrower and its direct real-world uptake is likely smaller.

    vs. From Control Boundary to Insurance Claim: Reconstructing AI-Mediated Losses Through the CER Framework
    gemini-3.1 · 6/3/2026

    Paper 1 addresses a critical and highly timely challenge in enterprise AI adoption: liability, risk transfer, and forensics for agentic AI systems. By bridging AI governance, cybersecurity, and insurance law, it offers a novel framework with massive real-world economic and legal implications. While Paper 2 presents a rigorous benchmark for LLMs in graph theory, benchmarks tend to have transient impact and narrower scope compared to foundational frameworks solving systemic industry bottlenecks like AI-mediated financial loss.

    vs. EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact because it introduces a reusable benchmark and evaluation protocol for LLM mathematical reasoning, a timely and broadly relevant problem across AI, education, and scientific governance. Benchmarks often become community standards that drive follow-on research, model development, and policy. Its methodological focus on difficulty stratification, mixed human/LLM judging, and disagreement analysis increases rigor and utility. Paper 1 is a solid applied ML contribution with clear real-world value in energy forecasting, but its novelty is more incremental (model + calibration gains) and its impact is narrower to the energy/time-series domain.

    vs. EvoDS: Self-Evolving Autonomous Data Science Agent with Skill Learning and Context Management
    gemini-3.1 · 6/3/2026

    Paper 2 presents a novel, self-evolving agent architecture with broad applications across automated data science. It combines strong methodological innovation (Autonomous Skill Acquisition and Adaptive Context Compression) with rigorous theoretical proofs and significant empirical improvements over state-of-the-art models. In contrast, Paper 1, while highly relevant to AI evaluation, is limited to a domain-specific benchmark (graph theory). Paper 2's capacity to autonomously learn skills and manage context addresses fundamental bottlenecks in agentic AI, offering wider cross-disciplinary utility and higher potential for long-term scientific impact.

    vs. Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact due to stronger methodological rigor and broader, reusable contribution: a curriculum-grounded benchmark with verified sources, multi-model evaluation, difficulty stratification, and human+LLM judging with agreement analysis. Benchmarks tend to become shared infrastructure for the community, influencing model development, evaluation standards, and governance across math/CS/AI. Paper 1 targets an important application (misinformation) but appears more system-specific and harder to generalize, with impact depending on deployment and educational efficacy beyond the presented evaluations.

    vs. Decomposing how prompting steers behavior
    gemini-3.1 · 6/3/2026

    Paper 1 offers a novel, mechanistic framework for understanding how prompting alters internal representations in foundation models. This fundamental research in mechanistic interpretability has broad implications for AI alignment, capability steering, and general LLM architecture. In contrast, Paper 2 proposes a domain-specific benchmark for graph theory which, while useful for evaluation, provides less foundational innovation and has a narrower scope of impact.

    vs. Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning
    gemini-3.1 · 6/3/2026

    Paper 1 offers a highly rigorous, deterministically graded benchmark that solves the prevalent 'LLM-as-judge' circularity issue by using verified expert reasoning traces. Its focus on high-value, real-world financial tasks reveals a massive performance gap (<16% accuracy) for frontier models, providing a clear and impactful target for future agentic AI research. Paper 2, while interesting for mathematical education, relies on partially flawed LLM judges and evaluates hypothetical future models (e.g., GPT-5), making its current scientific applicability and methodological rigor comparatively lower.

    vs. SkillPyramid: A Hierarchical Skill Consolidation Framework for Self-Evolving Agents
    gpt-5.2 · 6/3/2026

    Paper 1 has higher potential scientific impact due to proposing a generalizable, novel framework for hierarchical skill consolidation and self-evolution in agents, with sizable empirical gains across multiple interactive environments and backbone models. If robust, it can directly influence agent architectures, continual learning, tool/skill libraries, and real-world task automation. Paper 2 is timely and valuable as a benchmark for LLM mathematical assistance, but its impact is narrower (graph theory + evaluation) and primarily diagnostic rather than enabling new capabilities. Overall, Paper 1 offers broader cross-domain applicability and stronger downstream application potential.

    vs. Acting with AI: An Interaction-Based Framework for Agentic Tort Liability
    gemini-3.1 · 6/3/2026

    Paper 1 addresses a foundational and highly timely societal challenge—legal liability for agentic AI—proposing a novel framework that bridges common law and AI governance. Its potential to influence future AI legislation and regulatory standards gives it broader, longer-lasting interdisciplinary impact compared to Paper 2, which introduces a niche, domain-specific benchmark for graph theory that is likely to be quickly superseded as models evolve.

    vs. A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting
    gpt-5.2 · 6/3/2026

    Paper 1 likely has higher impact: it introduces a new, reusable benchmark with clear real-world relevance (evaluating LLM reliability for mathematical education/research), includes human expert evaluation, and can influence governance and model development across math-reasoning tasks. Its curriculum-grounded design and analysis of judge–human disagreement address timely evaluation challenges broadly applicable beyond graph theory. Paper 2 is a well-scoped negative result with methodological rigor, but its narrow setting (small Pythia models, specific injection method) limits breadth and immediate applications, though it is still valuable for mechanistic interpretability research.

    vs. TriEval: A Resource-Efficient Pipeline for LLM Bias, Toxicity, and Truthfulness Assessment
    claude-opus-4.6 · 6/3/2026

    GTBench offers higher scientific impact due to its more rigorous methodology, novel curriculum-grounded evaluation framework for mathematical reasoning, and deeper insights into LLM failure modes. It addresses a less-explored area (mathematical research assistance) with a structured difficulty hierarchy, revealing important findings about human-LLM judge disagreement. The benchmark has broader implications for AI governance in education and research. TriEval, while useful for resource-constrained evaluation, is more incremental—combining existing evaluation dimensions (bias, toxicity, truthfulness) into a lightweight pipeline without fundamentally advancing evaluation methodology.

    vs. Visual Graph Scaffolds for Structural Reasoning in Large Language Models
    gpt-5.2 · 6/3/2026

    Paper 2 likely has higher impact: it introduces a reusable, curriculum-grounded benchmark with rigorous multi-tier evaluation (including human expert assessment and judge-disagreement analysis), enabling standardized measurement and governance-relevant auditing of LLM mathematical reasoning. Its applications span model evaluation, education, safety/policy, and research tooling, with broad community utility and timeliness as benchmarks drive progress. Paper 1 is novel in proposing visual graph scaffolds for reasoning, but its impact may be narrower and depends on practical adoption of visual guidance interfaces and training pipelines.

    vs. MedCUA-Bench: A Screenshot-Only Benchmark for Clinical Computer-Use Agents
    gpt-5.2 · 6/3/2026

    Paper 2 has higher potential impact due to timeliness and real-world applicability: validating computer-use agents in clinical GUIs directly affects healthcare safety, workflow automation, and regulatory evaluation. Its interactive, screenshot-only design plus deterministic checking and explicit safety dimensions improves methodological rigor and reproducibility for a high-stakes domain. The benchmark spans multiple medical domains and evaluates many agents, enabling broad adoption across ML, HCI, and clinical informatics. Paper 1 is valuable for AI-in-math evaluation, but its narrower domain (graph theory) and less direct deployment pathway likely limit cross-field and societal impact.

    vs. AURA: Action-Gated Memory for Robot Policies at Constant VRAM
    claude-opus-4.6 · 6/3/2026

    AURA introduces a genuinely novel architectural contribution—action-gated memory for embodied AI on edge hardware—addressing a fundamental and growing problem (deploying VLAs on resource-constrained robots). It offers a creative reframing of memory management with constant VRAM, a learned gating mechanism trained on action-error signals, and strong empirical results showing dramatic write reductions without performance loss. GTBench, while competent, is primarily a benchmark/evaluation paper for LLM graph theory reasoning—a narrower contribution in a space already crowded with LLM benchmarks. AURA's cross-disciplinary impact (robotics, edge computing, memory-efficient inference) and practical applicability give it higher potential impact.

    vs. Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection
    gpt-5.2 · 6/3/2026

    Paper 2 has higher estimated impact: it proposes a novel, generalizable RL method (VEPO) addressing a clear failure mode in multimodal RL (entropy-based credit assignment missing vision-sensitive low-entropy tokens), with demonstrated gains across model scales and supporting ablations—suggesting methodological rigor and broad applicability to vision-language reasoning and training pipelines. Paper 1 is valuable as a curriculum-grounded benchmark in graph theory, but its impact is narrower (evaluation-focused, domain-specific) and less likely to reshape methods across fields compared to a training framework that could affect many multimodal systems.

    vs. SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems
    gpt-5.2 · 6/3/2026

    Paper 1 (SAGE) likely has higher scientific impact due to greater novelty (compute-matched social vs self evolution with controlled conditions), broader applicability (multi-agent learning, AI alignment/safety, collective intelligence, benchmarking), and timeliness as agent ecosystems become common. Its cross-arena evaluation and ablation on sharing modalities (raw logs vs abstractions) adds methodological rigor and actionable insights for building scalable agent systems. Paper 2 is a solid, useful benchmark for LLM graph-theory assistance, but its impact is narrower (single domain, modest benchmark size) and more incremental relative to existing reasoning benchmarks.

    vs. DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
    claude-opus-4.6 · 6/3/2026

    DeskCraft addresses a broader and more impactful gap in AI evaluation—benchmarking desktop agents on realistic professional workflows with human-in-the-loop collaboration. It covers a wider range of applications (design, video, audio, 3D), evaluates 18 agents on 538 tasks, and formalizes novel interaction protocols. Its open-source commitment and relevance to the rapidly growing GUI agent field give it higher potential impact. GTBench, while rigorous, is narrowly focused on graph theory reasoning, a smaller niche with more limited cross-field applicability. DeskCraft's timeliness and practical relevance to agentic AI systems give it the edge.