Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy

Jun 9, 2026 · arXiv:2606.10833v1

cs.AI

#2335 of 3489 · Artificial Intelligence

Tournament Score: 1355 ± 44

Win Rate: 35%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Abstract

Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation"

1. Core Contribution

This paper makes two intertwined contributions: (1) EngVQA, a multimodal benchmark of 696 engineering problems across five disciplines (Fluid Mechanics, Heat & Mass Transfer, Mechanics of Materials, Thermodynamics, and Dynamics), and (2) EngJudge, an 8-stage process-oriented evaluation framework that decomposes engineering solutions into interpretable reasoning stages with dependency-aware error propagation.

The central insight is that engineering reasoning requires evaluating *how* a model arrives at an answer—not just *whether* the final answer is correct. The paper identifies that errors in early stages (e.g., wrong assumptions, misread diagrams) cascade through downstream computation, and designs a DAG-based propagation mechanism to model this. This is a meaningful conceptual advance over flat scoring or final-answer-only evaluation.
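To make the cascading-error idea concrete, here is a minimal sketch of dependency-aware propagation over a stage DAG. The stage names, the edges, and the min-based capping rule are all illustrative assumptions; this assessment does not reproduce the paper's actual topology or its propagation equation, so the sketch should not be read as the authors' method.

```python
from graphlib import TopologicalSorter

# Hypothetical stage DAG: each stage maps to the stages it depends on.
# Names and edges are illustrative, not the paper's actual topology.
DEPENDS_ON = {
    "diagram_reading": [],
    "assumptions": ["diagram_reading"],
    "governing_equations": ["assumptions"],
    "computation": ["governing_equations"],
    "final_answer": ["computation"],
}


def propagate(raw_scores, depends_on):
    """Cap each stage's score by the weakest score among its upstream stages.

    Assumed rule (not the paper's equation): a stage cannot be trusted
    more than the least trustworthy stage it builds on.
    """
    effective = {}
    # static_order() yields each stage only after all of its dependencies.
    for stage in TopologicalSorter(depends_on).static_order():
        upstream = [effective[dep] for dep in depends_on[stage]]
        effective[stage] = min([raw_scores[stage], *upstream])
    return effective


# Made-up per-stage scores in [0, 1]: only the assumptions stage is weak.
raw_scores = {
    "diagram_reading": 1.0,
    "assumptions": 0.4,
    "governing_equations": 0.9,
    "computation": 1.0,
    "final_answer": 1.0,
}
effective_scores = propagate(raw_scores, DEPENDS_ON)
```

In this toy run, the weak assumptions stage drags every downstream stage down to 0.4, even though the computation itself was flawless. That is the behavior a flat, per-stage average would miss.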

2. Methodological Rigor

Strengths of the evaluation framework design: The 8-stage decomposition is empirically motivated by a pilot error analysis (Appendix A), which examined failure patterns in Gemini-2.0-flash-exp solutions. The authors demonstrate that error types cluster into distinct reasoning operations and exhibit sparse structured dependencies—directly informing the DAG topology. This data-driven design philosophy is commendable and distinguishes the work from heuristically-imposed evaluation structures.

The penalty-based scoring with four severity levels (2, 4, 7, 10 points), fatal error capping, and three meta-evaluation checks (verbosity, coverage, physical sanity) creates a rich grading rubric. The mathematical formulation of dependency propagation (Equation 2) is straightforward and interpretable.
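The rubric described above can be sketched roughly as follows. Only the four penalty values (2, 4, 7, 10) come from the assessment; the severity labels, the 10-point starting score, the floor at zero, and the value of the fatal-error cap are assumptions made for illustration.

```python
# Penalty points per severity level, as quoted in the assessment.
# The level names are hypothetical labels.
PENALTY = {"minor": 2, "moderate": 4, "major": 7, "critical": 10}

MAX_SCORE = 10  # assumption: penalties are deducted from a 10-point score
FATAL_CAP = 3   # hypothetical ceiling applied when a fatal error is present


def stage_score(errors, has_fatal=False):
    """Deduct a penalty per error, floor at zero, then apply the fatal cap."""
    total_penalty = sum(PENALTY[severity] for severity in errors)
    score = max(0, MAX_SCORE - total_penalty)
    if has_fatal:
        # A fatal error limits the score no matter how few points were lost.
        score = min(score, FATAL_CAP)
    return score


two_small_errors = stage_score(["minor", "moderate"])     # 10 - 2 - 4
fatal_but_minor = stage_score(["minor"], has_fatal=True)  # capped
two_major_errors = stage_score(["major", "major"])        # floored at zero
```

Under these assumptions, two major errors are enough to zero out a stage, which gives a sense of how quickly the severity levels compound.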

Concerns about rigor:

The benchmark contains only 696 problems, which is relatively modest. The distribution is uneven (236 Thermodynamics vs. 93 Fluid Mechanics), which may affect statistical reliability of cross-subject comparisons.

Only two generator models (Qwen3-VL-8B and Gemini-2.5-Flash) are evaluated—a narrow sample that limits generalizability claims about "SOTA VLMs."

The human validation study, while showing impressive correlation (r=0.975, MAE=0.67), uses only 9 evaluators rating 4 questions each. The Likert-to-numerical mapping (Appendix E) introduces assumptions: reconstructing "human scores" by adding δ offsets to the automated scores creates a somewhat circular validation. True independent human scoring would be more convincing.

The paper acknowledges but does not adequately address data contamination risks. Since problems appear drawn from standard textbooks, frontier models may have encountered them during pretraining.
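The circularity concern about the human validation study can be illustrated with a small sketch. All numbers below are made up, and the reconstruction rule (human score = automated score + δ) is taken from the description above rather than from the paper's appendix. The point is only that when "human" scores are derived from the automated ones, the MAE reduces to the mean absolute offset and the correlation is high almost by construction.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_abs_error(xs, ys):
    """Mean absolute difference between paired scores."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical automated scores on the 10-point scale (not from the paper).
auto_scores = [2.0, 3.5, 5.0, 8.0]
# Hypothetical offsets derived from Likert responses.
deltas = [0.5, -0.5, 1.0, -1.0]
# "Human" scores reconstructed by shifting the automated scores.
human_scores = [a + d for a, d in zip(auto_scores, deltas)]

r = pearson(auto_scores, human_scores)
mae = mean_abs_error(auto_scores, human_scores)
# mae equals the mean of |delta| by construction.
```

Independent human grading would not have this property, since the human scores would not share the automated scores as a common baseline.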

3. Potential Impact

Immediate applications: The framework could serve as a diagnostic tool for engineering education AI systems, helping identify where tutoring agents fail (e.g., algebraic execution vs. conceptual setup). The finding that current VLMs score below 4/10 on EngJudge across all subjects is a sobering calibration for anyone deploying these models in technical contexts.

Broader influence: The process-oriented evaluation paradigm with dependency-aware propagation could generalize beyond engineering to other domains requiring multi-step reasoning with causal dependencies (e.g., medical diagnosis, legal reasoning, scientific experiment design). The DAG-based trust propagation is a reusable design pattern.

Limitations on impact: The computational cost (11 LLM judge calls per solution) significantly limits scalability. The reliance on a specific LLM (Gemini-3.1-Pro-Preview) as the judge introduces model-specific biases and vendor dependencies. The dramatic score differences between SinglePass (~8.0) and EngJudge (~2.9) for Gemini-2.5-Flash raise questions about whether EngJudge is calibrated appropriately or is excessively punitive—the ablation shows that removing any single component substantially raises scores, suggesting the multiplicative combination may over-penalize.
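The over-penalization concern can be illustrated with simple arithmetic. The sketch below assumes, hypothetically, that the final score is a 10-point base multiplied by several per-component factors in [0, 1]. The factor values and the number of components are made up, and the paper's actual combination rule may differ.

```python
import math


def combined_score(factors, base=10.0):
    """Multiply a base score by every component factor (assumed rule)."""
    return base * math.prod(factors)


# Five hypothetical components, each retaining 80% of the score.
all_components = [0.8, 0.8, 0.8, 0.8, 0.8]

full_score = combined_score(all_components)          # roughly 3.3
ablated_score = combined_score(all_components[:-1])  # roughly 4.1
```

Each component looks mild in isolation, yet together they remove about two thirds of the score, and dropping any one of them raises the result noticeably. This is the pattern the ablation results suggest.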

4. Timeliness & Relevance

The paper addresses a timely gap. As VLMs are increasingly marketed for STEM education and technical assistance, rigorous evaluation of their engineering reasoning is critical. The observation that models produce "physically invalid yet superficially plausible solutions" is practically important. The work arrives alongside related efforts (EngiBench, EEE-Bench, SeePhys) but distinguishes itself through process-oriented evaluation—a meaningful differentiator.

5. Strengths & Limitations

Key Strengths:

Well-motivated framework design grounded in empirical error analysis rather than intuition

Strong conceptual contribution in modeling error propagation through a dependency DAG

Detailed, reproducible evaluation prompts (Appendix F provides all prompts verbatim)

Clear demonstration that holistic LLM-as-judge evaluation suffers from severe leniency bias

The correlation analysis (Figure 5) empirically validates the DAG structure

Thoughtful ablation study (Table 4, Table 10) demonstrating contributions of each component

Notable Weaknesses:

Limited model diversity in evaluation (only 2 generators)

Small-scale human validation with methodological concerns about score reconstruction

Very low absolute scores under EngJudge (most < 3/10) may indicate the framework is too strict rather than models being uniformly poor—calibration validation against known-good solutions would strengthen claims

The paper does not evaluate whether EngJudge scores predict real-world engineering competence or educational outcomes

No analysis of evaluator (judge LLM) consistency across runs or sensitivity to prompt variations

The "average topics per question" metric as a proxy for difficulty is weakly justified

Missing comparisons: The paper would benefit from evaluating more models (GPT-4o, Claude, Llama variants) and comparing against MMMU engineering subsets directly. Testing on problems guaranteed to be outside training data (e.g., newly created problems) would address contamination concerns.

Summary

This is a solid benchmark paper that identifies a real evaluation gap and proposes a principled framework to address it. The empirically-grounded DAG design and process-oriented evaluation are the strongest contributions. However, the limited scale of both the benchmark and the validation study, narrow model coverage, and potential over-punitiveness of the scoring framework temper the impact. The work opens a productive research direction but requires broader validation to establish EngJudge as a community standard.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5.5 · Novelty 6 · Clarity 7

Generated Jun 10, 2026

Comparison History (17)

Lost vs. A History-Aware Visually Grounded Critic for Computer Use Agents

Paper 2 addresses a critical bottleneck in autonomous Computer Use Agents (long-horizon planning and visual grounding in GUIs), which has broad applications across web, mobile, and desktop environments. Its cross-platform performance improvements offer immediate, wide-ranging impact for AI agents. Paper 1 introduces a valuable but narrower domain-specific benchmark for engineering reasoning, making its broader scientific and practical impact slightly more constrained compared to the generalized agent framework in Paper 2.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Trace2Policy: From Expert Behavior Traces to Self-Evolving Decision Agents

Paper 2 introduces a highly innovative methodology for transforming expert traces into deterministic code, bridging neuro-symbolic AI and agentic systems. Its successful real-world enterprise deployment, cost-efficiency analysis, and generalization to multiple benchmarks demonstrate exceptional practical utility and methodological rigor. While Paper 1 offers a valuable benchmark for VLMs, Paper 2's broader implications for reliable, low-cost AI deployment and its novel iterative refinement approach afford it a significantly higher potential impact across both academia and industry.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios

STAGE-Claw addresses a more fundamental and broadly impactful challenge: automated benchmark generation and state-based evaluation for LLM-powered personal agents, a rapidly growing area. Its scalable, automated framework for creating and validating benchmarks tackles key limitations (static tasks, coarse scoring) affecting the entire agent ecosystem. While EngVQA provides valuable domain-specific evaluation for engineering reasoning with a strong process-oriented framework, its scope is narrower. STAGE-Claw's contribution to agent evaluation infrastructure has broader applicability across the AI community and addresses more timely scalability concerns.

claude-opus-4-6 · Jun 10, 2026

Won vs. When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

Paper 2 likely has higher scientific impact: it introduces a new benchmark (EngVQA) plus a general stage-wise evaluation framework that can become shared infrastructure for measuring progress in multimodal engineering reasoning, with clear real-world relevance (education, technical decision-making) and strong validation (human agreement, high correlation). Its impact can span VLM evaluation, reliability, and domain-specific AI. Paper 1 offers a solid, novel insight into subgoal persistence in latent hierarchical reasoning, but is narrower in immediate applicability and primarily advances a specific modeling design within limited tasks.

gpt-5.2 · Jun 10, 2026

Lost vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Paper 1 addresses a fundamental and broadly applicable problem—memory retention for long-horizon language agents—with a novel constrained optimization framework (OSL-MR) that introduces rigorous formalism (observability constraints, delayed costs, budget feasibility) to a problem mostly handled by heuristics. This has broad impact across all applications of long-context agents. Paper 2 introduces a useful benchmark (EngVQA) with a stage-wise evaluation framework, but benchmarks tend to have more incremental impact unless widely adopted. Paper 1's methodological contribution—bridging optimization theory with practical agent memory—offers deeper conceptual novelty and broader applicability across fields.

claude-opus-4-6 · Jun 10, 2026

Lost vs. ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

Paper 1 likely has higher impact due to stronger novelty and urgency: it benchmarks agentic LLM capabilities directly tied to biosecurity and includes wet-lab validation showing real-world execution (robotic DNA assembly), which raises immediate safety and governance implications. Its applications span AI evaluation, biotechnology automation, and biosecurity policy, giving broad cross-field relevance and timeliness. Paper 2 is methodologically solid and useful for VLM assessment in engineering, but is narrower in societal stakes and lacks comparable real-world validation beyond benchmarking and stage-wise scoring.

gpt-5.2 · Jun 10, 2026

Lost vs. Evaluating Research-Level Math Proofs via Strict Step-Level Verification

Paper 2 addresses a more fundamental and broadly impactful problem—rigorous step-level verification of mathematical proofs—with a novel framework that reveals deep insights about LLM reasoning failures (context poisoning, pedantic hyper-rigor). Its contributions have broader implications for automated theorem proving, formal verification, and agentic reasoning systems. The discovery that remaining errors stem from implicit domain conventions rather than logical hallucinations is a significant conceptual advance. Paper 1, while valuable for engineering education evaluation, is more application-specific and incremental in its benchmarking contribution.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Recalling Too Well: Sycophancy Evaluation and Mitigation in Memory-Augmented Models

Paper 1 is likely to have higher scientific impact because it identifies and systematically measures a broadly consequential failure mode (memory-amplified sycophancy) that affects correctness and safety across many LLM applications. It contributes a benchmark (MIST), a cross-system empirical study spanning multiple memory systems and model families, a mechanistic hypothesis (memory extraction/compression), and practical mitigations—making it actionable for deployed systems. Paper 2 is timely and rigorous with a useful benchmark and stage-wise evaluation, but its impact is more domain-specific (engineering VLM reasoning) and primarily diagnostic rather than offering mitigation.

gpt-5.2 · Jun 10, 2026

Won vs. The Agent's First Day: Benchmarking Learning, Exploration, and Scheduling in the Workplace Scenarios

Paper 1 (EngVQA) addresses a clear, well-defined gap in VLM evaluation for engineering reasoning with a novel stage-wise evaluation framework that achieves strong human correlation (0.975 Pearson). Its focus on process-oriented evaluation rather than just final answers is methodologically rigorous and broadly applicable. Paper 2 tackles interesting challenges in dynamic agent evaluation but covers a broader, less focused scope. Paper 1's concrete benchmark with 696 problems across 5 engineering subjects, combined with its practical relevance to engineering education and AI-assisted technical decision-making, gives it stronger potential for adoption and citation impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Paper 1 has higher potential impact due to several factors: (1) it addresses a broader and more fundamental question about VLM reasoning capabilities in engineering domains, which spans multiple scientific fields; (2) the 8-stage evaluation framework for process-oriented assessment is methodologically novel and transferable to other reasoning benchmarks; (3) it has stronger real-world implications for AI in engineering education and scientific assistance; (4) the high correlation (0.975) with human evaluation validates the framework rigorously. Paper 2, while valuable, addresses a narrower domain (Office automation) with more incremental contributions to benchmarking LLM agents.

claude-opus-4-6 · Jun 10, 2026
