Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Tahreem Yasir, Wenbo Li, Sam Gilson, Sutapa Dey Tithi, Xiaoyi Tian, Tiffany Barnes

May 15, 2026

arXiv:2605.16207v1

cs.AI (primary), cs.CL

#1587 of 2292 · Artificial Intelligence

Tournament Score: 1361 ± 36 (axis range 1050 to 1800)

Win Rate: 39%

Rating: 6.5 / 10

Significance 7, Rigor 6, Novelty 7, Clarity 7

Abstract

Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution-feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper introduces a benchmark for evaluating LLM tutoring agents' ability to perform three-way diagnostic classification of student solutions in propositional logic: optimal, valid-alternative, and incorrect. The key insight is that effective tutoring requires distinguishing between these categories—a distinction central to intelligent tutoring systems (ITS) but previously untested for LLMs. Using knowledge-graph (KG)-derived ground truth that enumerates all valid inference paths for 32 proof problems, the authors evaluate 10,836 solution-feedback pairs across seven LLMs and three feedback conditions (Peer, Teacher, Judge).
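The three-way labeling described above can be illustrated with a minimal sketch. It assumes a toy knowledge graph in which nodes are proof states and edges are valid inference steps, and it treats a step as optimal when it moves one step closer to the conclusion along a shortest path. The graph, the state names, and the functions `distances_to_goal` and `classify_step` are hypothetical; the paper's actual KG construction and labeling rules may differ.

```python
from collections import deque

# Hypothetical toy knowledge graph: each node is a proof state and each edge
# is a valid inference step. Names are illustrative, not taken from the paper.
KG = {
    "start": ["A", "B"],
    "A": ["goal"],
    "B": ["C"],
    "C": ["goal"],
    "goal": [],
}

def distances_to_goal(kg, goal):
    """BFS over reversed edges: fewest steps from each state to the goal."""
    reverse = {node: [] for node in kg}
    for src, dsts in kg.items():
        for dst in dsts:
            reverse[dst].append(src)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        node = queue.popleft()
        for prev in reverse[node]:
            if prev not in dist:
                dist[prev] = dist[node] + 1
                queue.append(prev)
    return dist

def classify_step(kg, dist, current, proposed):
    """Label a proposed step as optimal, valid_alternative, or incorrect."""
    # Not an edge in the KG, or a state that cannot reach the goal: incorrect.
    if proposed not in kg.get(current, []) or proposed not in dist:
        return "incorrect"
    # Exactly one step closer to the goal: the step lies on a shortest path.
    if dist[proposed] == dist[current] - 1:
        return "optimal"
    # A valid inference that still reaches the goal, but by a longer route.
    return "valid_alternative"

DIST = distances_to_goal(KG, "goal")
print(classify_step(KG, DIST, "start", "A"))
print(classify_step(KG, DIST, "start", "B"))
print(classify_step(KG, DIST, "start", "goal"))
```

In this toy graph, moving from `start` to `B` is a valid inference that reaches the conclusion in three steps instead of two, which is the kind of case a binary correct/incorrect scheme cannot represent.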

The main finding is a striking asymmetry: LLMs achieve near-ceiling performance (~94-99% F1) on optimal steps but systematically fail on valid-alternative solutions (0-76% F1) and incorrect solutions (4-55% F1). Two pedagogically distinct failure modes are identified—over-rejection (penalizing valid reasoning that diverges from expert paths) and over-validation (accepting incorrect solutions)—which map directly onto the "assistance dilemma" from learning science. The paper further shows that accurate diagnosis does not reliably translate into pedagogically sound feedback.
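To make the reported metrics concrete, the sketch below computes per-class F1 along with over-rejection (OR) and over-validation (OV) rates on a small made-up label set. The OR and OV definitions here are assumptions chosen for illustration (share of valid-alternative items judged incorrect, and share of incorrect items judged optimal or valid); the paper's exact formulas are not reproduced in this assessment, and the data are not from the benchmark.

```python
def per_class_f1(gold, pred, label):
    """F1 for one class, treating that class as the positive label."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def over_rejection_rate(gold, pred):
    """Assumed definition: share of valid-alternative items judged incorrect."""
    valid = [p for g, p in zip(gold, pred) if g == "valid_alternative"]
    return sum(p == "incorrect" for p in valid) / len(valid) if valid else 0.0

def over_validation_rate(gold, pred):
    """Assumed definition: share of incorrect items judged optimal or valid."""
    wrong = [p for g, p in zip(gold, pred) if g == "incorrect"]
    return sum(p != "incorrect" for p in wrong) / len(wrong) if wrong else 0.0

# Made-up labels for illustration only.
gold = ["optimal", "optimal", "valid_alternative",
        "valid_alternative", "incorrect", "incorrect"]
pred = ["optimal", "optimal", "incorrect",
        "valid_alternative", "optimal", "incorrect"]

print(per_class_f1(gold, pred, "optimal"))
print(over_rejection_rate(gold, pred))
print(over_validation_rate(gold, pred))
```

Reporting OR and OV separately from F1 keeps the two failure directions distinguishable: a model can post the same valid-alternative F1 while erring toward rejection or toward acceptance.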

Methodological Rigor

The experimental design is generally well-constructed. The use of knowledge graphs to provide exhaustive ground truth for all valid inference paths is a principled choice that enables evaluation beyond binary correctness. The three feedback conditions (Peer, Teacher, Judge) with varying information access systematically isolate the role of solution context. The analysis of variance showing model selection explains >95% of diagnostic variance (η² > 0.95) while feedback conditions contribute negligibly (η² < 0.01) is a compelling finding.
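For readers unfamiliar with the effect-size measure cited above, η² for a factor is the between-group sum of squares divided by the total sum of squares. The sketch below computes it for a balanced toy design, grouping the same scores once by model and once by feedback condition. The scores are invented to mimic the reported pattern and are not the paper's data; the function name `eta_squared` is hypothetical, and the authors' actual ANOVA setup may differ.

```python
def eta_squared(groups):
    """Eta squared: between-group sum of squares over total sum of squares.

    groups: list of lists of scores, one inner list per level of the factor.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((x - grand_mean) ** 2 for x in all_scores)
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    return ss_between / ss_total if ss_total else 0.0

# Invented scores: rows are three hypothetical models, columns are three
# hypothetical feedback conditions.
by_model = [
    [0.90, 0.91, 0.89],
    [0.50, 0.51, 0.49],
    [0.10, 0.11, 0.09],
]
# The same scores regrouped by condition (transpose of the table above).
by_condition = [list(col) for col in zip(*by_model)]

eta_model = eta_squared(by_model)
eta_condition = eta_squared(by_condition)
print(eta_model, eta_condition)
```

When scores vary widely between models but barely move across conditions, nearly all variance is attributed to the model factor, which is the shape of the result the assessment describes.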

However, several methodological concerns deserve attention:

1. Simulated students rather than real students: While justified (logs lack reasoning traces), LLM-simulated student solutions may not capture authentic student error patterns. The authors acknowledge this, but it limits ecological validity.

2. Zero-shot prompting only: The authors note they didn't observe significant improvement with few-shot prompting during calibration, but systematic ablations are absent. Fine-tuning is deferred to future work.

3. Human evaluation scale: Only 100 pairs per condition on a coarse 3-point scale limits the statistical power for RQ3. The connection between diagnostic accuracy and feedback quality—arguably the most novel claim—rests on relatively thin evidence.

4. Missing KG statistics: The paper contains placeholder values ("[X] to [Y]") for distance-to-conclusion statistics, suggesting incomplete preparation.

5. Annotator independence: Both annotators are co-authors, which creates potential bias despite the blinding procedures described.

Potential Impact

The paper's framing around the assistance dilemma provides a theoretically grounded lens that bridges learning science and NLP communities. The practical recommendation—that hybrid architectures should delegate diagnostic classification to KG-grounded systems while leveraging LLMs for dialogue—is actionable for ITS designers.

The benchmark itself (10,836 pairs, publicly available) could serve as a useful resource for the educational AI community, though its restriction to propositional logic limits breadth. The three-way classification framework could inspire analogous evaluations in other structured reasoning domains (e.g., mathematics, programming).

The finding that model selection dominates over information access has implications beyond tutoring: it suggests that simply providing better context to LLMs does not overcome fundamental limitations in constrained reasoning domains, challenging the assumption that retrieval-augmented or context-enriched approaches suffice.

Timeliness & Relevance

This work is highly timely. The deployment of LLM-based tutoring tools (e.g., Khanmigo, various GPT-based tutors) is accelerating without adequate evaluation of their diagnostic capabilities. The paper addresses a genuine gap: most evaluations use binary correctness, obscuring the pedagogically critical middle ground where students produce valid but suboptimal reasoning. The assistance dilemma framing connects to decades of learning science research, grounding the evaluation in established theory.

The comparison across seven contemporary models (including GPT-o3, DeepSeek-R1, Qwen-3-32B) provides a useful snapshot of current capabilities, though this will inevitably become dated.

Strengths

1. Novel evaluation framework: The three-way diagnostic classification grounded in exhaustive KG enumeration is a genuine methodological contribution that enables evaluation impossible under binary schemes.

2. Clear pedagogical framing: OR and OV map directly to the assistance dilemma, making the computational findings immediately interpretable for education researchers.

3. Scale and systematicity: 10,836 pairs across seven models and three conditions provide substantial statistical power for the automated analyses.

4. Model-level patterns: The discovery that models cluster into distinct failure profiles (e.g., Gemini/DeepSeek as over-validators vs. LLaMA as over-rejector) is informative for system design.

5. Practical recommendation: The hybrid architecture proposal is concrete and motivated by the empirical findings.

Limitations

1. Domain specificity: Propositional logic is an ideal testbed because exhaustive KG enumeration is feasible, but this very property makes it unrepresentative of most educational domains. Generalizability is unclear.

2. No mitigation mechanism: The paper diagnoses failures without proposing solutions. The hybrid architecture recommendation, while reasonable, is not implemented or tested.

3. Incomplete manuscript: Placeholder values in the KG statistics section and some rough edges suggest the paper may not be fully polished.

4. Limited multi-turn analysis: Single-step evaluation doesn't capture how feedback errors compound in realistic tutoring dialogues.

5. Confound between simulation and evaluation: Using the same models as both student simulators and feedback agents could introduce systematic biases, though the authors acknowledge this design choice enables model-level analysis.

Overall Assessment

This paper makes a meaningful contribution by introducing a pedagogically motivated evaluation framework that reveals important limitations of LLM tutoring agents. The three-way classification approach and the OR/OV metrics are well-designed and could influence future evaluation methodology. The main findings—that LLMs excel at confirming correct solutions but fail where adaptive tutoring matters most—are clearly demonstrated and practically relevant. The work is somewhat limited by domain specificity, the absence of proposed solutions, and relatively thin human evaluation, but it represents a solid empirical contribution to the intersection of NLP and education.

Rating: 6.5 / 10

Significance 7, Rigor 6, Novelty 7, Clarity 7

Generated May 18, 2026

Comparison History (28)

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to its broad relevance and timeliness: it rigorously benchmarks LLM tutoring agents with large-scale, KG-grounded ground truth, revealing systematic failure modes that directly affect real-world educational deployments. Its conclusions motivate hybrid architectures and inform both AI-in-education and LLM evaluation/safety communities. Paper 1 is technically novel and useful for improving RL fine-tuning of diffusion multimodal models, but its impact is more specialized (generative modeling/RL) and likely incremental within a fast-moving area. Paper 2’s diagnostic insights generalize across models and settings.

vs. Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

gemini-3.1 · 5/19/2026

Paper 1 exposes a fundamental reasoning limitation in LLMs regarding suboptimal and incorrect solutions, which is critical for AI-assisted education. Its rigorous benchmarking provides broad scientific insights into LLM diagnostic failures, advocating for a shift toward neuro-symbolic hybrid architectures. While Paper 2 offers excellent practical improvements for enterprise SRE workflows, it represents an applied systems engineering advancement rather than a fundamental scientific discovery about core AI capabilities and limitations.

vs. LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

claude-opus-4.6 · 5/19/2026

Paper 2 addresses the timely and broadly impactful question of LLM reliability in educational tutoring, a topic with massive real-world deployment potential. Its rigorous benchmark methodology (10,836 pairs, 7 models, ground-truth evaluation) reveals fundamental architectural limitations of LLMs in diagnostic feedback, which has implications across AI-in-education, HCI, and LLM evaluation research. Paper 1, while technically sound, addresses a narrower domain (degradation model selection in reliability engineering) with a more specialized audience. Paper 2's findings about LLM limitations will likely influence a broader research community and practical deployments.

vs. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

gemini-3.1 · 5/19/2026

Paper 1 focuses on automating scientific discovery via symbolic regression, directly enabling breakthroughs across multiple scientific disciplines. While Paper 2 provides valuable insights into LLM limitations in education, Paper 1 introduces a novel, rigorous framework for generating and refining scientific knowledge, offering broader transformative potential across the hard sciences.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

claude-opus-4.6 · 5/19/2026

Paper 2 introduces a novel methodology for improving LLM scientific reasoning through logicality-enriched training, with broader applicability across scientific domains, validated experiments on multiple backbone LLMs, and released code/data. While Paper 1 provides valuable diagnostic insights about LLM tutoring limitations in a specific domain (propositional logic), its contributions are primarily evaluative rather than constructive. Paper 2's systematic framework for assessing and improving logical faithfulness in scientific reasoning addresses a more fundamental and widely applicable problem, with greater potential to influence future LLM training paradigms across disciplines.

vs. POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel framework (POST) addressing a clearly identified gap in multivariate time series anomaly detection—spatial over-generalization—with a new adversarial learning paradigm, a synthetic benchmark with channel-wise annotations, and state-of-the-art results. It offers methodological innovation, reproducible artifacts, and broad applicability across domains using time series data. Paper 2 provides valuable empirical insights on LLM tutoring limitations but is more diagnostic/evaluative in nature, narrower in scope (propositional logic tutoring), and proposes no new method, limiting its transformative potential.

vs. Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

gemini-3.1 · 5/19/2026

Paper 2 demonstrates higher potential scientific impact by introducing an agentic AI system for automated hypothesis generation in nanomedicine. Accelerating scientific discovery and bridging fragmented literature in a high-stakes medical field has a massive transformative ceiling. While Paper 1 provides a highly rigorous and valuable critique of LLM tutors in education, Paper 2's contribution to the frontier of AI-driven scientific research offers broader, life-saving applications across the biomedical sciences.

vs. Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel, comprehensive benchmark for evaluating deep research agents on realistic enterprise tasks with innovative methodological contributions (cognitive traps, conjunctive VRS scoring, dual-layer evaluation). It addresses a timely gap as DRAs are rapidly deployed in industry, evaluates three frontier systems, and provides distinctive failure mode analysis. While Paper 2 offers valuable findings about LLM tutoring limitations in a specific domain (propositional logic), its scope is narrower and its conclusions (hybrid architectures needed) are somewhat expected. Paper 1's broader applicability across enterprise AI deployment and its benchmark contribution give it higher impact potential.

vs. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

claude-opus-4.6 · 5/19/2026

Paper 2 has higher potential scientific impact because it addresses the rapidly growing deployment of LLMs as educational tutors, a high-stakes real-world application. It identifies a critical and actionable failure mode—systematic over-rejection of valid-but-suboptimal solutions and over-validation of incorrect ones—with clear architectural implications. The finding that diagnostic accuracy doesn't translate to pedagogically effective feedback is novel and practically important, directly informing hybrid system design. Paper 1, while thorough as a benchmark, primarily documents LLM limitations on specialized spatial/temporal reasoning tasks with narrower applicability beyond the QSTR community.

vs. RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

claude-opus-4.6 · 5/19/2026

Paper 2 has higher potential impact because it addresses a timely, broadly relevant question about LLM limitations in educational AI—a rapidly growing deployment area. Its rigorous benchmark (10,836 pairs, 7 models, ground-truth evaluation) reveals systematic architectural failures in LLM feedback, providing actionable insights for hybrid ITS design. Paper 1, while technically interesting, reports modest improvements (cosine similarity 0.181 vs 0.139) on a niche EEG-to-text task with limited practical applicability. Paper 2's findings generalize across models and have immediate implications for responsible LLM deployment in education.

vs. Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

gemini-3.1 · 5/19/2026

Paper 2 presents a novel methodological framework (MEMOIR) for LLM solver synthesis applied to combinatorial optimization, an area with massive real-world and economic implications. Its introduction of a two-level memory hierarchy for cross-branch knowledge transfer demonstrates strong methodological innovation and yields significant performance and consistency improvements. In contrast, Paper 1 offers a valuable but narrower benchmarking study on LLM limitations in logic tutoring, making Paper 2's potential breadth of impact and real-world applicability substantially higher.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact: it evaluates a core, widely deployed use case (LLM tutoring) with a large, structured benchmark and reveals systematic failure modes exactly where pedagogical adaptivity depends on nuanced diagnosis. Its implications generalize beyond propositional logic to educational AI/ITS design, motivating hybrid architectures with grounded diagnosis plus LLM dialogue—relevant across edtech, human-AI interaction, and evaluation methodology. Paper 2 is strong and reproducible, but its domain (chess) is narrower and its verifier-in-the-loop message is more incremental relative to existing tool-augmented/verified generation work.

vs. ALSO: Adversarial Online Strategy Optimization for Social Agents

gemini-3.1 · 5/18/2026

Paper 2 introduces a novel methodological framework combining adversarial bandits and neural surrogates for online strategy optimization in non-stationary environments. This theoretical advancement has broad applicability across multi-agent systems, reinforcement learning, and AI social agents. In contrast, while Paper 1 provides a rigorous and valuable empirical evaluation, its impact is largely confined to the specific domain of intelligent tutoring systems and educational technology.

vs. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a more broadly applicable and timely problem—monitoring AI agents for safety at scale—which is central to the rapidly growing AI safety/alignment field. Its ensemble monitoring approach offers a practical, generalizable methodology applicable across many domains. Paper 2 provides valuable diagnostic insights about LLM tutoring limitations but is narrower in scope (propositional logic tutoring) and its conclusions (use hybrid architectures) are less novel. Paper 1's findings about diversity vs. scale and fine-tuning advantages have wider implications for AI deployment safety, a topic of increasing urgency.

vs. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

gemini-3.1 · 5/18/2026

Paper 1 offers a fundamental architectural advancement for LLM agents by introducing goal-oriented reasoning and backward chaining for RAG systems. This addresses a critical bottleneck in modern AI—long-horizon reasoning and multi-hop retrieval—making its applications vast across virtually all conversational AI domains. While Paper 2 provides a highly rigorous and valuable benchmark for educational technology, Paper 1 introduces a novel algorithmic framework (Natural Language Logic) that improves foundational AI capabilities, giving it higher potential for widespread scientific impact and adoption across multiple fields.

vs. $π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

claude-opus-4.6 · 5/18/2026

Paper 2 identifies a fundamental architectural limitation of LLMs in tutoring—systematic failure to distinguish suboptimal from incorrect solutions—with rigorous methodology (10,836 pairs, ground truth from knowledge graphs, 7 models). This finding has immediate implications for the rapidly growing LLM-in-education space, providing actionable guidance (hybrid architectures). Paper 1 introduces a useful benchmark for proactive agents but addresses a narrower problem. Paper 2's findings are more generalizable, methodologically stronger, and more timely given widespread LLM deployment in education.

vs. Beyond Partner Diversity: An Influence-Based Team Steering Framework for Zero-Shot Human-Machine Teaming

claude-opus-4.6 · 5/18/2026

Paper 2 introduces a novel framework (IBTS) for zero-shot human-machine teaming that addresses a fundamental scalability challenge in HMT, validated with both simulated and real human studies (including a 30-subject study). It offers broader cross-domain applicability and advances the increasingly important field of human-AI collaboration. Paper 1, while rigorous in benchmarking LLM tutoring limitations, is more diagnostic than constructive—it identifies problems rather than solving them—and is narrower in scope (propositional logic tutoring). Paper 2's methodological contributions and real-world validation give it stronger potential for lasting impact.

vs. PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

gpt-5.2 · 5/18/2026

Paper 1 offers a more novel technical contribution: it identifies a new task regime (precision-sensitive GUI control), releases a large-scale benchmark with dense pixel-level supervision, and proposes an agent (PAGER) with topology-aware planning plus precision-aligned RL that yields large empirical gains. This combination of dataset + method + demonstrated SOTA is likely to drive follow-on work across GUI agents, embodied/interactive VLMs, and verification-aware RL. Paper 2 is timely and valuable as a diagnostic evaluation, but is primarily an assessment with a domain-limited benchmark and less direct algorithmic advancement.

vs. Reasoning Compression with Mixed-Policy Distillation

gemini-3.1 · 5/18/2026

Paper 2 addresses a fundamental bottleneck in modern LLM deployment—inference-time cost and token efficiency for reasoning models. Its novel Mixed-Policy Distillation approach offers broad, cross-domain impact for AI scaling and deployment. Paper 1 is a valuable empirical study, but its scope is limited to the specific domain of educational technology and tutoring systems, making Paper 2 significantly more impactful for the broader AI research community.

vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

gemini-3.1 · 5/18/2026

Paper 2 addresses a highly timely and critical issue—the efficacy of LLMs as educational tutors. By demonstrating fundamental flaws in how LLMs handle suboptimal and incorrect student solutions, it provides crucial insights that impact the rapidly growing intersection of AI and EdTech. Paper 1 offers a valuable framework for web agents in e-commerce, but its scope is more niche compared to the broader societal and cross-disciplinary implications of evaluating and improving AI-driven educational tools.