GTBench: A Curriculum-Grounded Benchmark for Evaluating LLMs as Mathematical Research Assistants in Graph Theory
Noujoud Nader, Ibrahem Aljabea, Patrick Diehl, Deepti Gupta
Abstract
Large language models (LLMs) are increasingly used as self-study assistants in technical disciplines, yet their reliability as mathematical reasoning assistants remains poorly understood. We introduce GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, comprising 63 problems organized into three groups of increasing difficulty: undergraduate definitions and basic properties (Group 1), algorithm tracing and structural reasoning (Group 2), and graduate-level proof construction (Group 3). Problems are sourced from verified academic materials including Diestel's Graph Theory. We evaluate five frontier models -- GPT-5, Claude Sonnet 4.6, Gemini 2.5 Flash-Lite, Llama 3.3 70B, and Mistral Large 3 -- under zero-shot and chain-of-thought prompting, using exact-match and LLM-as-judge evaluation for Groups 1 and 2, and a hybrid human expert and LLM-as-judge protocol for Group 3. Our results reveal a pronounced performance hierarchy: GPT-5 approaches ceiling on Group 1 (95.8% zero-shot) and maintains meaningful accuracy on graduate proofs (82%), while all other models degrade substantially with difficulty, with Llama achieving 0% under human evaluation on Group 3 zero-shot. Failure mode analysis shows that "correct algorithm, wrong execution" errors dominate Groups 1 and 2, while Group 3 additionally surfaces incomplete-reasoning failures and reveals systematic disagreement between human evaluators and the automated judge, particularly on verbose or near-complete proofs (kappa = 0.48-0.83 across human pairs). GTBench provides the first curriculum-grounded evaluation framework for graph-theoretic reasoning in LLMs, with direct implications for the governance of AI tools in mathematical education and scientific research.
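The abstract reports kappa values between pairs of human evaluators but does not say which variant was used. As a rough illustration only, the sketch below assumes Cohen's kappa on binary correct/incorrect verdicts. The function name `cohen_kappa` and the verdict lists `human_1` and `human_2` are hypothetical and are not taken from the paper.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = proof accepted, 0 = rejected) on 11 proofs.
human_1 = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1]
human_2 = [1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1]
print(f"kappa = {cohen_kappa(human_1, human_2):.2f}")
```

With these made-up verdicts, two disagreements out of 11 give a kappa of about 0.61, which shows how a handful of contested proofs can move the statistic substantially on a set this small.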
AI Impact Assessments
Scientific Impact Assessment: GTBench
1. Core Contribution
GTBench introduces a 63-problem benchmark for evaluating LLMs on graph theory tasks, organized into three difficulty tiers: undergraduate definitions (Group 1, 31 problems), algorithm tracing (Group 2, 21 problems), and graduate-level proof construction (Group 3, 11 problems). The benchmark is "curriculum-grounded," meaning problems follow the standard pedagogical progression of graph theory education, sourced from Diestel's textbook and a UPC course problem set. Five frontier LLMs are evaluated under zero-shot and chain-of-thought prompting, with a hybrid evaluation protocol combining exact-match, LLM-as-judge, and human expert evaluation.
The paper fills a genuine gap: graph-theoretic reasoning has been underrepresented in LLM benchmarks compared to algebraic, numerical, and competition-style mathematics. The curriculum-grounded framing — explicitly asking whether LLMs can serve as "mathematical research assistants" — is a practical and timely angle.
2. Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper has moderate practical impact. It provides a useful, if small, resource for the LLM evaluation community focused on mathematical reasoning. The finding that errors are deterministically reproduced (not stochastic) is genuinely important for trust and governance discussions — it means LLMs don't just occasionally fail but hold systematic misconceptions that could mislead students consistently.
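One way to separate systematic errors from stochastic ones is to rerun each problem several times and check whether the same wrong answer recurs. The sketch below is an assumption about how such a check could look, not the paper's procedure. The function `consistency_report`, the problem ids, and the answers are all hypothetical.

```python
def consistency_report(runs):
    """Classify each problem by how its errors behave across repeated runs.

    runs maps a problem id to a list of (answer, is_correct) tuples,
    one tuple per repeated run of the same prompt.
    """
    report = {}
    for pid, attempts in runs.items():
        wrong = [answer for answer, ok in attempts if not ok]
        if not wrong:
            report[pid] = "always correct"
        elif len(wrong) < len(attempts):
            report[pid] = "intermittent"   # fails on some runs only
        elif len(set(wrong)) == 1:
            report[pid] = "systematic"     # same wrong answer on every run
        else:
            report[pid] = "varied errors"  # always wrong, but differently
    return report


# Hypothetical repeated runs for three made-up problems.
example_runs = {
    "bfs_order": [("A,B,D,C", False), ("A,B,D,C", False), ("A,B,D,C", False)],
    "chromatic_number": [("3", True), ("4", False), ("3", True)],
    "degree_sum": [("14", True), ("14", True), ("14", True)],
}
for pid, label in consistency_report(example_runs).items():
    print(pid, "->", label)
```

A problem labeled "systematic" here corresponds to the concern raised above: a student who asks again would receive the same incorrect answer.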
The curriculum-grounded framing could influence how educational institutions think about deploying LLMs as study tools, though the benchmark would need to be substantially larger to support policy decisions. The failure mode taxonomy, while not novel in concept, provides useful diagnostic categories specific to graph theory.
However, the benchmark's narrow domain focus (graph theory only), small size, and lack of formal verification infrastructure limit broader adoption. Unlike benchmarks such as MATH (12,500 problems) or GSM8K (about 8,500 problems), GTBench has only 63 problems, too few for fine-tuning, capability tracking over time, or robust statistical comparisons between models.
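The sample-size concern can be made concrete with a confidence interval on a single accuracy figure. The sketch below uses the Wilson score interval, which is one common choice and is not something the paper is stated to report. The function `wilson_interval` and the count of 9 correct out of 11 are illustrative assumptions; 11 is the stated size of Group 3, but per-problem counts are not given here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical score: 9 of 11 graduate-level proofs judged correct.
low, high = wilson_interval(9, 11)
print(f"accuracy 9/11 = {9 / 11:.1%}, 95% interval = [{low:.1%}, {high:.1%}]")
```

Under these assumptions the interval runs from roughly 52% to 95%, so two models whose Group 3 scores differ by a few problems cannot be reliably ranked.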
4. Timeliness & Relevance
The paper addresses a timely concern: the increasing use of LLMs as educational and research tools without adequate understanding of their domain-specific reliability. The choice of graph theory is well-justified as it requires relational/structural reasoning distinct from algebraic manipulation. The inclusion of GPT-5 and Claude Sonnet 4.6 (with 2026 release dates, suggesting these are very recent or possibly speculative model designations) positions the work at the frontier of model evaluation.
However, the rapid pace of model development means benchmark results become stale quickly. The paper's value will depend on whether the benchmark itself (rather than the specific results) is adopted by the community.
5. Strengths & Limitations
Key Strengths:
Key Limitations:
Notable Omission: The contamination issue is critical. If models were trained on Diestel solutions or CMU homework answers, performance on Groups 1 and 3 may reflect memorization rather than reasoning. The paper does not address this.
Summary
GTBench makes a reasonable contribution as a domain-specific benchmark for an underexplored area of LLM evaluation. Its strongest contributions are the failure mode taxonomy, the consistency analysis, and the human-vs-LLM judge comparison. However, the benchmark's small size, lack of contamination analysis, and limited evaluation methodology (only zero-shot and basic CoT) constrain its scientific impact. It reads more as a preliminary study than a definitive benchmark contribution.
Generated Jun 3, 2026
Comparison History (25)
Paper 2 introduces a novel benchmark for evaluating LLMs in mathematical reasoning, a highly active and rapidly evolving field. Benchmarks typically drive significant subsequent research and gather high citations, impacting both AI and mathematics. In contrast, Paper 1 presents a valuable but narrower application of predictive maintenance and fatigue assessment for circular factories, which likely has a more limited audience and targeted real-world application.
VAMPS addresses a broader and more novel research question—whether models can benefit from constructing and reasoning over self-generated visualizations—which is a fundamental gap in multimodal AI evaluation. Its larger dataset (1,168 problems), bilingual design, and counterintuitive finding (analytical solving outperforms visual solving even when plotting is natural) offer actionable insights for the multimodal AI community. GTBench, while rigorous, is narrower in scope (63 problems, single subdomain of graph theory) and its findings about performance degradation with difficulty are less surprising. VAMPS has broader applicability across engineering and scientific workflows.
Paper 1 addresses a fundamental, existential problem in AI research: how to evaluate models that surpass human comprehension. Its novel adversarial framework has broad applicability across all domains of AI benchmarking. In contrast, Paper 2 presents a valuable but narrower, domain-specific benchmark for graph theory. Consequently, Paper 1 promises a much broader and deeper scientific impact on the future methodology of AI evaluation.
Paper 2 is likely to have higher impact: it targets a broad, timely reliability problem in widely deployed agentic RAG systems, introduces a formalized failure mode (cascading hallucination) plus a taxonomy and a modular mitigation framework, and evaluates across multiple established benchmarks with quantitative gains and ablations. Its applications span search/QA, enterprise copilots, and safety/governance, giving cross-field relevance. Paper 1 is novel and rigorous within math/graph-theory LLM evaluation, but its scope (63 problems, one domain) is narrower and its direct real-world uptake is likely smaller.
Paper 1 addresses a critical and highly timely challenge in enterprise AI adoption: liability, risk transfer, and forensics for agentic AI systems. By bridging AI governance, cybersecurity, and insurance law, it offers a novel framework with massive real-world economic and legal implications. While Paper 2 presents a rigorous benchmark for LLMs in graph theory, benchmarks tend to have transient impact and narrower scope compared to foundational frameworks solving systemic industry bottlenecks like AI-mediated financial loss.
Paper 2 likely has higher impact because it introduces a reusable benchmark and evaluation protocol for LLM mathematical reasoning, a timely and broadly relevant problem across AI, education, and scientific governance. Benchmarks often become community standards that drive follow-on research, model development, and policy. Its methodological focus on difficulty stratification, mixed human/LLM judging, and disagreement analysis increases rigor and utility. Paper 1 is a solid applied ML contribution with clear real-world value in energy forecasting, but its novelty is more incremental (model + calibration gains) and its impact is narrower to the energy/time-series domain.
Paper 2 presents a novel, self-evolving agent architecture with broad applications across automated data science. It combines strong methodological innovation (Autonomous Skill Acquisition and Adaptive Context Compression) with rigorous theoretical proofs and significant empirical improvements over state-of-the-art models. In contrast, Paper 1, while highly relevant to AI evaluation, is limited to a domain-specific benchmark (graph theory). Paper 2's capacity to autonomously learn skills and manage context addresses fundamental bottlenecks in agentic AI, offering wider cross-disciplinary utility and higher potential for long-term scientific impact.
Paper 2 likely has higher impact due to stronger methodological rigor and broader, reusable contribution: a curriculum-grounded benchmark with verified sources, multi-model evaluation, difficulty stratification, and human+LLM judging with agreement analysis. Benchmarks tend to become shared infrastructure for the community, influencing model development, evaluation standards, and governance across math/CS/AI. Paper 1 targets an important application (misinformation) but appears more system-specific and harder to generalize, with impact depending on deployment and educational efficacy beyond the presented evaluations.
Paper 1 offers a novel, mechanistic framework for understanding how prompting alters internal representations in foundation models. This fundamental research in mechanistic interpretability has broad implications for AI alignment, capability steering, and general LLM architecture. In contrast, Paper 2 proposes a domain-specific benchmark for graph theory which, while useful for evaluation, provides less foundational innovation and has a narrower scope of impact.
Paper 1 offers a highly rigorous, deterministically graded benchmark that solves the prevalent 'LLM-as-judge' circularity issue by using verified expert reasoning traces. Its focus on high-value, real-world financial tasks reveals a massive performance gap (<16% accuracy) for frontier models, providing a clear and impactful target for future agentic AI research. Paper 2, while interesting for mathematical education, relies on partially flawed LLM judges and evaluates hypothetical future models (e.g., GPT-5), making its current scientific applicability and methodological rigor comparatively lower.
Paper 1 has higher potential scientific impact due to proposing a generalizable, novel framework for hierarchical skill consolidation and self-evolution in agents, with sizable empirical gains across multiple interactive environments and backbone models. If robust, it can directly influence agent architectures, continual learning, tool/skill libraries, and real-world task automation. Paper 2 is timely and valuable as a benchmark for LLM mathematical assistance, but its impact is narrower (graph theory + evaluation) and primarily diagnostic rather than enabling new capabilities. Overall, Paper 1 offers broader cross-domain applicability and stronger downstream application potential.
Paper 1 addresses a foundational and highly timely societal challenge—legal liability for agentic AI—proposing a novel framework that bridges common law and AI governance. Its potential to influence future AI legislation and regulatory standards gives it broader, longer-lasting interdisciplinary impact compared to Paper 2, which introduces a niche, domain-specific benchmark for graph theory that is likely to be quickly superseded as models evolve.
Paper 1 likely has higher impact: it introduces a new, reusable benchmark with clear real-world relevance (evaluating LLM reliability for mathematical education/research), includes human expert evaluation, and can influence governance and model development across math-reasoning tasks. Its curriculum-grounded design and analysis of judge–human disagreement address timely evaluation challenges broadly applicable beyond graph theory. Paper 2 is a well-scoped negative result with methodological rigor, but its narrow setting (small Pythia models, specific injection method) limits breadth and immediate applications, though it is still valuable for mechanistic interpretability research.
GTBench offers higher scientific impact due to its more rigorous methodology, novel curriculum-grounded evaluation framework for mathematical reasoning, and deeper insights into LLM failure modes. It addresses a less-explored area (mathematical research assistance) with a structured difficulty hierarchy, revealing important findings about human-LLM judge disagreement. The benchmark has broader implications for AI governance in education and research. TriEval, while useful for resource-constrained evaluation, is more incremental—combining existing evaluation dimensions (bias, toxicity, truthfulness) into a lightweight pipeline without fundamentally advancing evaluation methodology.
Paper 2 likely has higher impact: it introduces a reusable, curriculum-grounded benchmark with rigorous multi-tier evaluation (including human expert assessment and judge-disagreement analysis), enabling standardized measurement and governance-relevant auditing of LLM mathematical reasoning. Its applications span model evaluation, education, safety/policy, and research tooling, with broad community utility and timeliness as benchmarks drive progress. Paper 1 is novel in proposing visual graph scaffolds for reasoning, but its impact may be narrower and depends on practical adoption of visual guidance interfaces and training pipelines.
Paper 2 has higher potential impact due to timeliness and real-world applicability: validating computer-use agents in clinical GUIs directly affects healthcare safety, workflow automation, and regulatory evaluation. Its interactive, screenshot-only design plus deterministic checking and explicit safety dimensions improves methodological rigor and reproducibility for a high-stakes domain. The benchmark spans multiple medical domains and evaluates many agents, enabling broad adoption across ML, HCI, and clinical informatics. Paper 1 is valuable for AI-in-math evaluation, but its narrower domain (graph theory) and less direct deployment pathway likely limit cross-field and societal impact.
AURA introduces a genuinely novel architectural contribution—action-gated memory for embodied AI on edge hardware—addressing a fundamental and growing problem (deploying VLAs on resource-constrained robots). It offers a creative reframing of memory management with constant VRAM, a learned gating mechanism trained on action-error signals, and strong empirical results showing dramatic write reductions without performance loss. GTBench, while competent, is primarily a benchmark/evaluation paper for LLM graph theory reasoning—a narrower contribution in a space already crowded with LLM benchmarks. AURA's cross-disciplinary impact (robotics, edge computing, memory-efficient inference) and practical applicability give it higher potential impact.
Paper 2 has higher estimated impact: it proposes a novel, generalizable RL method (VEPO) addressing a clear failure mode in multimodal RL (entropy-based credit assignment missing vision-sensitive low-entropy tokens), with demonstrated gains across model scales and supporting ablations—suggesting methodological rigor and broad applicability to vision-language reasoning and training pipelines. Paper 1 is valuable as a curriculum-grounded benchmark in graph theory, but its impact is narrower (evaluation-focused, domain-specific) and less likely to reshape methods across fields compared to a training framework that could affect many multimodal systems.
Paper 1 (SAGE) likely has higher scientific impact due to greater novelty (compute-matched social vs self evolution with controlled conditions), broader applicability (multi-agent learning, AI alignment/safety, collective intelligence, benchmarking), and timeliness as agent ecosystems become common. Its cross-arena evaluation and ablation on sharing modalities (raw logs vs abstractions) adds methodological rigor and actionable insights for building scalable agent systems. Paper 2 is a solid, useful benchmark for LLM graph-theory assistance, but its impact is narrower (single domain, modest benchmark size) and more incremental relative to existing reasoning benchmarks.
DeskCraft addresses a broader and more impactful gap in AI evaluation—benchmarking desktop agents on realistic professional workflows with human-in-the-loop collaboration. It covers a wider range of applications (design, video, audio, 3D), evaluates 18 agents on 538 tasks, and formalizes novel interaction protocols. Its open-source commitment and relevance to the rapidly growing GUI agent field give it higher potential impact. GTBench, while rigorous, is narrowly focused on graph theory reasoning, a smaller niche with more limited cross-field applicability. DeskCraft's timeliness and practical relevance to agentic AI systems give it the edge.