LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Shradha Agarwal, Deepak Rajbhar, Tariq J
Abstract
We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.
AI Impact Assessments
Scientific Impact Assessment: LinAlg-Bench
1. Core Contribution
LinAlg-Bench introduces a controlled diagnostic benchmark that isolates *computational depth* from *mathematical knowledge* by holding the algorithm fixed (e.g., cofactor expansion) while scaling only matrix dimension (3×3 → 4×4 → 5×5). This is a genuinely clever experimental design—unlike benchmarks where difficulty and novelty are confounded (e.g., GSM8K, MATH), LinAlg-Bench enables attribution of failure to specific computational bottlenecks. The central empirical finding—a "fabrication-to-abandonment threshold" at 4×4 scale where failure modes categorically shift from execution errors to computational abandonment—is novel and well-documented across 6,600 model outputs.
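As a rough illustration of the dimensional gradient, the sketch below runs the same cofactor (Laplace) expansion on one matrix each at 3×3, 4×4, and 5×5 and counts the full-length products the expansion enumerates. The matrices and the helper names (`minor`, `det_cofactor`, `leaf_products`) are hypothetical and are not taken from the benchmark. It is only meant to show that the algorithm stays fixed while the bookkeeping grows factorially.

```python
from typing import Dict, List

Matrix = List[List[int]]

def minor(m: Matrix, row: int, col: int) -> Matrix:
    """Return m with the given row and column removed."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(m) if i != row]

def det_cofactor(m: Matrix) -> int:
    """Determinant by cofactor expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # (-1)^(0 + j): the alternating sign a solver has to track at every level.
        sign = -1 if j % 2 else 1
        total += sign * m[0][j] * det_cofactor(minor(m, 0, j))
    return total

def leaf_products(n: int) -> int:
    """Number of full-length products the expansion enumerates, which is n!."""
    return 1 if n <= 1 else n * leaf_products(n - 1)

# Hypothetical integer matrices, one per dimension (not benchmark problems).
examples: Dict[int, Matrix] = {
    3: [[2, 0, 1], [1, 3, 2], [1, 1, 2]],
    4: [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]],
    5: [[2, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 3, 0, 0],
        [0, 0, 0, 1, 0], [0, 0, 0, 0, 2]],
}

for n, m in examples.items():
    print(n, det_cofactor(m), leaf_products(n))
```

The product count goes from 6 to 24 to 120 across the three sizes, so a single sign slip has many more places to occur at 5×5 than at 3×3 even though no new mathematical knowledge is required.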
The paper also introduces a forensic error taxonomy (10 primary tags with subtypes) and a three-stage automated classification pipeline, enabling analysis beyond binary accuracy. The identification of "constraint-aware confabulation" (fabricated eigenvalues that satisfy trace/Frobenius norm constraints) and "tool roleplay collapse" (simulating invocation of unavailable tools) are genuinely new observations about LLM failure modes.
2. Methodological Rigor
Strengths: The experimental design is unusually disciplined for an LLM evaluation paper. All 660 problems are SymPy-verified. The complete enumeration of 6,600 outputs at temperature 0 eliminates sampling variability concerns. The five-level cognitive taxonomy (Reading → Arithmetic → Sequential → Recursive → Compositional) provides clean isolation of computational demands.
The forced Gaussian elimination ablation is particularly well-designed: by showing that enforcing the algorithmically efficient O(n³) strategy does not recover accuracy for collapsed models, the paper convincingly separates strategy selection from execution capability. The survival curve analysis (Figure 6.1, Panel B) showing collapse at steps 3-4 when fractional dependencies are introduced is compelling mechanistic evidence.
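To make the "fractional dependencies" point concrete, here is a minimal sketch of a determinant computed by Gaussian elimination with exact rational arithmetic. It also reports the elimination step at which a non-integer entry first appears. The function name `det_gauss`, the step-counting convention, and the two test matrices are assumptions made for illustration; this is not the paper's ablation code, and the step at which fractions appear depends entirely on the matrix.

```python
from fractions import Fraction

def det_gauss(m):
    """Determinant via Gaussian elimination over exact rationals.

    Returns (determinant, first_fraction_step). first_fraction_step is the
    1-based elimination step after which some entry is no longer an integer,
    or None if every intermediate entry stays integral.
    """
    a = [[Fraction(x) for x in row] for row in m]
    n = len(a)
    sign = 1
    first_fraction_step = None
    for k in range(n):
        # Pick the first row at or below k with a nonzero entry in column k.
        pivot = next((r for r in range(k, n) if a[r][k] != 0), None)
        if pivot is None:
            return Fraction(0), first_fraction_step  # singular matrix
        if pivot != k:
            a[k], a[pivot] = a[pivot], a[k]
            sign = -sign  # a row swap flips the determinant's sign
        for r in range(k + 1, n):
            factor = a[r][k] / a[k][k]
            for c in range(k, n):
                a[r][c] -= factor * a[k][c]
        if first_fraction_step is None and any(
            x.denominator != 1 for row in a for x in row
        ):
            first_fraction_step = k + 1
    det = Fraction(sign)
    for k in range(n):
        det *= a[k][k]  # product of pivots of the triangular form
    return det, first_fraction_step

# Hypothetical matrices (not benchmark problems).
tridiagonal = [[2, 1, 0], [1, 2, 1], [0, 1, 2]]
block_4x4 = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
print(det_gauss(tridiagonal))
print(det_gauss(block_4x4))
```

Elimination needs on the order of n³ operations rather than n! products, but every later row then depends on exact fractions carried forward from earlier pivots. That is the kind of dependency the survival curve analysis points to.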
Weaknesses: The forensic pipeline relies on an LLM-as-judge approach (Gemini 3.1 Pro Preview), which introduces circularity—using an LLM to diagnose LLM failures. While the 92.6% agreement with 593 human-labeled responses is reasonable, several low-frequency tags achieve very poor agreement (VARIABLE_ENTANGLEMENT: 0% at 5×5, ALGEBRAIC_PRECEDENCE: 0% at 4×4). The authors acknowledge this and exclude these from distributional analysis, but it limits the taxonomic completeness claims.
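The gap between overall and per-tag agreement can be sketched as follows. The label pairs are invented toy data and the `agreement` helper is hypothetical. Only the tag names come from the text above, and the numbers are not the paper's. The sketch shows how a high aggregate rate can coexist with 0% agreement on a rare tag.

```python
from collections import defaultdict

def agreement(pairs):
    """Compute judge/human agreement overall and per human-assigned tag.

    pairs: iterable of (human_tag, judge_tag).
    Returns (overall_rate, per_tag), where per_tag maps each human tag
    to (agreement_rate, number_of_examples).
    """
    pairs = list(pairs)
    overall = sum(h == j for h, j in pairs) / len(pairs)
    counts = defaultdict(lambda: [0, 0])  # tag -> [matches, total]
    for human, judge in pairs:
        counts[human][1] += 1
        counts[human][0] += int(human == judge)
    per_tag = {tag: (hits / total, total) for tag, (hits, total) in counts.items()}
    return overall, per_tag

# Invented toy labels: 10 responses, one rare tag the judge never recovers.
toy_pairs = (
    [("Complete_Collapse", "Complete_Collapse")] * 6
    + [("Ungrounded_Guess", "Ungrounded_Guess")] * 3
    + [("VARIABLE_ENTANGLEMENT", "Complete_Collapse")] * 1
)

overall, per_tag = agreement(toy_pairs)
print(f"overall agreement: {overall:.1%}")
for tag, (rate, n) in per_tag.items():
    print(f"{tag}: {rate:.0%} (n={n})")
```

Reporting the per-tag sample size next to each rate makes it clear when a 0% figure rests on only a handful of examples.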
The constraint-aware confabulation analysis (n=20 Ungrounded_Guess cases) is too small for strong claims. The 45% trace-matching rate and 85% Frobenius norm bound satisfaction are intriguing but statistically fragile. The benchmark also covers only integer-entry matrices, which simplifies the computational landscape and may not generalize to practical linear algebra with floating-point entries.
Single-run evaluation at temperature 0, while ensuring reproducibility, may not capture the full distribution of model behaviors. Self-consistency sampling could shift threshold locations.
3. Potential Impact
Evaluation methodology: The paper demonstrates that *failure mode analysis* can be more informative than accuracy leaderboards. The dimensional gradient design—holding the algorithm constant while scaling complexity—is a transferable experimental paradigm applicable to other mathematical domains (polynomial arithmetic, symbolic integration, combinatorics).
Understanding LLM limitations: The working memory account, while not mechanistically proven, generates testable hypotheses for the interpretability community. The prediction that sign-tracking failures involve late-layer parity circuits, while Complete_Collapse involves suppression of those circuits, is specific enough for activation patching experiments.
Practical implications: The finding that tool roleplay emerges as a collapse response (not a capability) has implications for AI safety—models hallucinating tool invocation when overwhelmed could be dangerous in deployment. The constraint-aware confabulation finding warns that surface-level verification (e.g., checking eigenvalue sums against traces) is insufficient for detecting fabricated outputs.
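A small sketch of why surface-level verification falls short: the fabricated eigenvalue list below matches the trace and respects a Frobenius-norm bound, yet fails a direct check against the characteristic polynomial. The matrix, the candidate lists, and the helper names are hypothetical. The bound used here is Schur's inequality (the sum of squared eigenvalue magnitudes is at most the squared Frobenius norm), which is an assumption about the form of the check and may differ from the paper's exact criterion. Candidates are assumed to be real integers.

```python
def det(m):
    """Determinant by cofactor expansion along the first row."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j in range(len(m)):
        sub = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1 if j % 2 else 1) * m[0][j] * det(sub)
    return total

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

def frobenius_sq(m):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(x * x for row in m for x in row)

def passes_surface_checks(m, eigs):
    """Trace match plus Schur bound; necessary but not sufficient."""
    return sum(eigs) == trace(m) and sum(e * e for e in eigs) <= frobenius_sq(m)

def is_eigenvalue(m, lam):
    """Direct check: lam is an eigenvalue iff det(m - lam * I) == 0."""
    n = len(m)
    shifted = [[m[i][j] - (lam if i == j else 0) for j in range(n)] for i in range(n)]
    return det(shifted) == 0

# Hypothetical matrix with true eigenvalues 1, 3, 3.
A = [[2, 1, 0], [1, 2, 0], [0, 0, 3]]
true_eigs = [1, 3, 3]
fabricated = [2, 2, 3]  # invented "answer" that satisfies both constraints

print(passes_surface_checks(A, true_eigs), all(is_eigenvalue(A, e) for e in true_eigs))
print(passes_surface_checks(A, fabricated), all(is_eigenvalue(A, e) for e in fabricated))
```

The fabricated list passes the surface checks but 2 is not a root of the characteristic polynomial. The direct check in this sketch does not verify multiplicities, so a full verifier would need more than this.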
Limitations of impact scope: Linear algebra is a narrow domain. The authors acknowledge uncertainty about whether the fabrication-to-abandonment threshold generalizes to other mathematical fields, physics, or chemistry. The benchmark tests only unaided computation; in practice, models with tool access could delegate these calculations and would likely avoid most of the failures documented here.
4. Timeliness & Relevance
The paper addresses a current bottleneck: understanding *why* LLMs fail at mathematical reasoning, not just *how often*. As reasoning-optimized models (o1, DeepSeek-R1 derivatives) proliferate, understanding their failure modes becomes critical. The observation that even Tier 1 models universally collapse at 5×5 eigenvalues—despite mastering the underlying sub-tasks—is directly relevant to ongoing debates about whether LLMs genuinely reason or approximate reasoning patterns.
The compositionality gap evidence (models solving determinants but failing eigenvalues that depend on determinants) connects to a growing literature on compositional reasoning limitations.
5. Strengths & Limitations
Key strengths: (1) Exceptional experimental control—same algorithm, same models, only dimension varies; (2) Comprehensive forensic analysis going far beyond accuracy reporting; (3) The ablation study is the paper's strongest contribution, cleanly separating strategy from execution; (4) Full data and pipeline release enables replication; (5) The tier classification system and model-specific collapse personas are richly detailed.
Notable limitations: (1) Domain narrowness—linear algebra only; (2) Integer-only matrices are artificial; (3) LLM-as-judge circularity; (4) Some model versions may have changed during the 4-month evaluation window; (5) The "working memory" framing is metaphorical—transformers don't have working memory in any cognitive science sense, and the paper acknowledges it defers mechanistic validation; (6) The paper is extremely long with extensive appendices but could benefit from tighter exposition in the main text; (7) Some model names appear futuristic (GPT-5.2, Claude-4.5-Sonnet, Gemini-3.0-Pro), raising questions about the temporal claims.
Overall Assessment
LinAlg-Bench is a well-executed diagnostic benchmark that reveals genuine structural insights about LLM mathematical reasoning failures. Its primary contribution—the fabrication-to-abandonment threshold and the ablation showing strategy enforcement doesn't help—is substantive and falsifiable. However, the narrow domain scope, small sample sizes for some key claims, and the metaphorical working memory framing limit the generalizability of conclusions. The paper's impact will depend heavily on whether the threshold phenomenon replicates in other mathematical domains.
Generated May 19, 2026
Comparison History (19)
Paper 1 identifies a highly specific, generalizable structural failure mode (the 4x4 threshold) that fundamentally informs our understanding of LLM working memory and reasoning limits. Its rigorous forensic pipeline and concrete findings offer broader implications for cognitive modeling of LLMs and architecture design compared to Paper 2's domain-specific findings on prior knowledge in code optimization.
Paper 1 introduces a novel diagnostic benchmark with a rigorous forensic error taxonomy that reveals fundamental structural failure modes in LLM mathematical reasoning, including the discovery of a near-universal behavioral threshold and novel hallucination patterns. These findings have broad implications across AI safety, model architecture design, and understanding of LLM cognitive limitations. Paper 2, while practically useful, is more incremental—applying an existing LLM to PHR-based question answering with predictable improvements from added context. Paper 1's methodological innovation (forensic error pipeline) and theoretical contributions (working memory limits, fabrication transitions) offer deeper scientific insights with wider cross-field relevance.
While Paper 1 introduces a valuable benchmark for programmatic video generation, Paper 2 provides fundamental theoretical insights into LLM reasoning limitations. By uncovering structural failure modes, working memory constraints, and the phenomenon of computational abandonment at specific matrix dimensions, Paper 2 offers profound implications for understanding and improving the cognitive architecture of large language models across multiple domains.
Paper 2 has higher potential impact because it proposes a general, actionable framework (closed-loop self-correction) that can improve LLM reliability across many domains, with control-theoretic metrics and measurable gains. Its ideas (detector/controller/judge, convergence/oscillation analysis) are broadly transferable and timely for deployment-facing safety and performance. Paper 1 is a strong, rigorous diagnostic benchmark with insightful failure taxonomy, but its direct applicability is narrower (structured linear algebra) and primarily observational rather than offering a general corrective method.
Paper 2 presents a completed empirical study with concrete, novel findings about LLM failure modes in mathematical reasoning—a timely topic given the rapid deployment of LLMs. Its discovery of a universal behavioral threshold and structured hallucination taxonomy provides immediately actionable insights for the AI community. The forensic methodology and public release of all data enhance reproducibility and impact. Paper 1 is a thesis proposal outlining future work on uncertainty in knowledge graphs—while important, it lacks completed results and addresses a more niche audience. Paper 2's relevance to the booming LLM field gives it broader and more immediate impact potential.
Paper 1 provides fundamental insights into the mathematical reasoning limitations and structural failure modes of LLMs, discovering a critical scale-emergent threshold (at 4x4 matrices). This contributes broadly to our understanding of LLM working memory and hallucinations. While Paper 2 presents a rigorous and useful framework for e-commerce agents, Paper 1's findings have a broader impact on general AI capabilities, model evaluation, and the fundamental science of large language models.
Paper 2 proposes a paradigm shift in world modeling by using coding agents to generate executable simulation code rather than relying on latent video generation. This explicitly solves the physical inconsistency problem plaguing current video-based world models. Its approach directly enables robust applications in embodied AI, robotics, and autonomous driving by ensuring strict physical constraints. While Paper 1 provides a highly rigorous and insightful diagnostic benchmark for LLM mathematical reasoning limits, Paper 2 introduces a constructive, scalable framework with broader and more immediate transformative implications across multiple AI disciplines.
Paper 2 has higher likely scientific impact due to stronger real-world applicability and scale: it proposes a deployable framework for grounding LLM agents in heterogeneous user behavior, validated on 8.37M buyers across 42 live storefronts with a concrete metric (conversion-rate alignment). The approach (learning discrete personas via behavior-aware VQ-VAE and exposing them as persona tokens) is novel and broadly relevant to personalization, agent alignment, simulation, and recommender systems. Paper 1 is timely and methodologically solid, but as a diagnostic benchmark its impact is more specialized and less directly deployable.
While Paper 1 provides valuable behavioral insights into LLM mathematical limits, Paper 2 introduces a novel, probe-free methodological approach to causal intervention and hidden-state steering. Techniques that fundamentally enhance the controllability, interpretability, and alignment of generative models typically demonstrate broader applicability and higher potential impact across the AI field than domain-specific diagnostic benchmarks.
Paper 1 offers fundamental theoretical insights into the reasoning limitations and working memory constraints of LLMs, identifying a near-universal behavioral threshold where models transition from execution errors to computational abandonment. This deepens our understanding of LLM failure modes and hallucinations, impacting a broader range of foundational AI research compared to Paper 2, which, while methodologically rigorous, focuses on the more specialized application domain of mobile GUI agents.
LinAlg-Bench offers higher scientific impact due to its novel forensic methodology for understanding LLM failure modes in mathematical reasoning, revealing a universal structural threshold (the fabrication-to-abandonment transition at 4x4 scale) that generalizes across architectures. This finding about working memory limits has broad implications for understanding fundamental LLM capabilities. The benchmark, error taxonomy, and publicly released pipeline provide lasting community infrastructure. Paper 2, while practically useful, addresses a narrower domain (compound LLM agents in adversarial POMDPs) with findings that, though valuable for practitioners, are more incremental and context-specific.
Paper 2 likely has higher scientific impact due to a more novel systems-level contribution (bilevel policies combining learned low-level control with symbolic high-level planning), stronger real-world applicability to embodied robotics, and broader cross-field relevance (robot learning, planning, symbolic-neural integration). Its claims suggest scalability to very large object counts and improved efficiency, which is timely for long-horizon agent research. Paper 1 is rigorous and useful diagnostically, but its impact is narrower (evaluation/analysis of LLM math failures) and less directly enabling for downstream capabilities.
Paper 2 addresses a more novel and practically urgent problem—safety in memory-augmented LLM agents—that has broader implications for deployed AI systems. The concept of 'memory laundering' and sub-threshold toxicity propagation through compressed state is a genuinely new failure mode with immediate relevance to AI safety engineering. It introduces actionable insights (sanitize before compression, not after) and a new metric (SPG). Paper 1, while methodologically thorough, is primarily a diagnostic benchmark for a known limitation (LLMs struggling with multi-step math at scale), offering less surprising conclusions and narrower impact beyond the benchmarking community.
Paper 1 has higher likely impact: it delivers a concrete, large-scale, reproducible benchmark with an automated error forensics pipeline and a clear, surprising empirical finding (a near-universal 4x4 behavioral threshold and structured confabulation modes). The released dataset/labels can immediately support model evaluation, training, interpretability, and tooling across the LLM ecosystem, making applications and cross-field uptake (ML, HCI, education, verification) strong and timely. Paper 2 is conceptually interesting but more speculative, with narrower empirical validation and less immediate, scalable utility.
CASPO introduces a novel, practical framework (confidence-aware alignment + pruning) that improves reasoning reliability and efficiency across multiple benchmarks and model families, with broad applicability to reasoning LLMs. It offers a scalable solution without external verifiers, addresses a critical problem in LLM deployment, and releases both code and datasets. While LinAlg-Bench provides valuable diagnostic insights about LLM failure modes (the fabrication-to-abandonment transition is interesting), it is primarily an analytical/benchmark contribution limited to linear algebra. CASPO's methodological contribution has broader downstream impact and practical utility.
LinAlg-Bench has higher potential impact due to its broader relevance to the rapidly growing LLM evaluation field. It reveals a fundamental structural limitation (working memory thresholds) affecting all frontier models, introduces novel concepts like 'constraint-aware confabulation,' and provides a forensic error taxonomy useful across AI safety and reasoning research. The finding of a universal fabrication-to-abandonment transition at 4x4 scale is a significant insight into LLM behavior. Paper 2, while methodologically sound, addresses a narrower NLP subfield (personality prediction) with incremental architectural improvements over existing baselines.
Paper 2 has higher potential impact due to broader applicability and methodological contributions. GIM targets a widely relevant evaluation gap (integration across cognitive domains) and provides a scalable psychometric framework (continuous-response 2PL IRT) calibrated on >200k outputs across 28 models, enabling robust cross-configuration/model comparisons and contamination diagnostics via public/private split. Its findings on test-time compute and configuration sensitivity are timely and actionable for both research and deployment. Paper 1 is novel and rigorous but narrower (linear algebra-specific), with more limited cross-field reach.
Paper 2 likely has higher scientific impact because it delivers a broadly useful, publicly released benchmark plus an automated forensic error taxonomy across many frontier models, enabling reproducible measurement and follow-on work. Its key finding (a near-universal 4x4 behavioral threshold and fabrication/abandonment transition) is timely and generalizable to reasoning, evaluation, and safety research, with applications spanning model training, tool use, and interpretability. Paper 1 is innovative for personalized systems, but its impact is narrower (personalization/commitment control) and depends more on adoption of a specific framework and validator scope.
LinAlg-Bench offers higher scientific impact due to its novel forensic error taxonomy revealing a near-universal structural failure mode (fabrication-to-abandonment transition) at specific computational thresholds across 10 frontier LLMs. The finding of a working memory limit rather than knowledge gap is a fundamental insight about LLM architecture. The systematic classification of 1,156 failures into fine-grained error types provides actionable diagnostic value. Paper 2 addresses logicality in scientific reasoning but is more incremental—training on higher-quality data improving performance is less surprising. Paper 1's benchmark and error taxonomy have broader implications for understanding and improving LLM reasoning.