Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy
Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark for evaluating engineering reasoning across 5 engineering subjects containing 696 problems. We introduce an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open and closed source VLMs on our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.
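The two agreement statistics quoted in the abstract (Pearson correlation and mean absolute error between human and automated grades) can be computed as in the minimal sketch below. The function names and the sample grades are illustrative assumptions, not the paper's data or code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = sqrt(sum(a * a for a in dx)) * sqrt(sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / norm

def mean_absolute_error(xs, ys):
    """Average absolute gap between paired grades."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Made-up grades on a 10-point scale, for illustration only.
human_grades = [2, 4, 6, 8]
judge_grades = [3, 5, 7, 9]
r = pearson(human_grades, judge_grades)
mae = mean_absolute_error(human_grades, judge_grades)
```

In this toy case the judge is offset by exactly one point on every problem, so the correlation is essentially perfect while the mean absolute error is one point. The two statistics measure different things, which is why reporting both is informative.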
This paper makes two intertwined contributions: (1) EngVQA, a multimodal benchmark of 696 engineering problems across five disciplines (Fluid Mechanics, Heat & Mass Transfer, Mechanics of Materials, Thermodynamics, and Dynamics), and (2) EngJudge, an 8-stage process-oriented evaluation framework that decomposes engineering solutions into interpretable reasoning stages with dependency-aware error propagation.
The central insight is that engineering reasoning requires evaluating *how* a model arrives at an answer—not just *whether* the final answer is correct. The paper identifies that errors in early stages (e.g., wrong assumptions, misread diagrams) cascade through downstream computation, and designs a DAG-based propagation mechanism to model this. This is a meaningful conceptual advance over flat scoring or final-answer-only evaluation.
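To make the cascading idea concrete, here is a minimal sketch of dependency-aware propagation over a stage DAG. It is not the paper's Equation 2: the stage names, the edges, and the rule (each stage's local score is discounted by its weakest upstream stage) are all assumptions chosen for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the stages it depends on.
# The actual EngJudge stages and edges are not reproduced here.
DEPENDS_ON = {
    "diagram_reading": [],
    "assumptions": ["diagram_reading"],
    "governing_equations": ["assumptions"],
    "computation": ["governing_equations"],
    "final_answer": ["computation"],
}

def propagate(local_scores, depends_on=DEPENDS_ON):
    """Discount each stage's local score (0..1) by its weakest upstream stage.

    Assumed rule: effective = local * min(effective score of each parent).
    """
    effective = {}
    # static_order() yields every stage after all of its dependencies.
    for stage in TopologicalSorter(depends_on).static_order():
        parents = depends_on[stage]
        trust = min((effective[p] for p in parents), default=1.0)
        effective[stage] = local_scores[stage] * trust
    return effective

# A solution that is locally flawless except for a half-wrong assumption.
local = {stage: 1.0 for stage in DEPENDS_ON}
local["assumptions"] = 0.5
result = propagate(local)
```

Under this assumed rule, the single early error caps every downstream stage at 0.5 even though each of those stages was executed correctly in isolation. That is the behavior a flat, per-stage average would miss.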
Strengths of the evaluation framework design: The 8-stage decomposition is empirically motivated by a pilot error analysis (Appendix A), which examined failure patterns in Gemini-2.0-flash-exp solutions. The authors demonstrate that error types cluster into distinct reasoning operations and exhibit sparse, structured dependencies, which directly inform the DAG topology. This data-driven design is commendable and distinguishes the work from heuristically imposed evaluation structures.
The penalty-based scoring with four severity levels (2, 4, 7, 10 points), fatal error capping, and three meta-evaluation checks (verbosity, coverage, physical sanity) creates a rich grading rubric. The mathematical formulation of dependency propagation (Equation 2) is straightforward and interpretable.
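As a rough illustration of how penalty-based scoring with fatal-error capping might work, consider the sketch below. Only the four penalty sizes (2, 4, 7, 10) come from the paper as summarized here. The subtraction from a 10-point maximum, the cap value of 3.0, and the function name are assumptions, and the three meta-evaluation checks are omitted.

```python
SEVERITY_POINTS = (2, 4, 7, 10)  # the four penalty sizes named above

def stage_score(penalties, has_fatal_error=False, max_score=10.0, fatal_cap=3.0):
    """Score one stage by subtracting penalties from max_score.

    Assumptions (not from the paper): penalties are summed, the result is
    floored at zero, and a fatal error caps the score at fatal_cap.
    """
    for points in penalties:
        if points not in SEVERITY_POINTS:
            raise ValueError(f"unknown severity: {points}")
    score = max(0.0, max_score - sum(penalties))
    if has_fatal_error:
        score = min(score, fatal_cap)
    return score

two_small_errors = stage_score([2, 4])                    # 10 - 6
capped = stage_score([2], has_fatal_error=True)           # 8, capped to 3
floored = stage_score([7, 10])                            # cannot go below 0
```

The cap means a solution with one fatal error cannot earn a high score however clean the rest of the stage is. This is one plausible source of the large gap between EngJudge and single-pass grading.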
Immediate applications: The framework could serve as a diagnostic tool for engineering education AI systems, helping identify where tutoring agents fail (e.g., algebraic execution vs. conceptual setup). The finding that current VLMs score below 4/10 on EngJudge across all subjects is a sobering data point for anyone deploying these models in technical contexts.
Broader influence: The process-oriented evaluation paradigm with dependency-aware propagation could generalize beyond engineering to other domains requiring multi-step reasoning with causal dependencies (e.g., medical diagnosis, legal reasoning, scientific experiment design). The DAG-based trust propagation is a reusable design pattern.
Limitations on impact: The computational cost (11 LLM judge calls per solution) limits scalability. The reliance on a specific LLM (Gemini-3.1-Pro-Preview) as the judge introduces model-specific biases and vendor dependencies. The large gap between the SinglePass (~8.0) and EngJudge (~2.9) scores for Gemini-2.5-Flash raises the question of whether EngJudge is appropriately calibrated or excessively punitive. The ablation shows that removing any single component substantially raises scores, which suggests the multiplicative combination may over-penalize.
The paper addresses a timely gap. As VLMs are increasingly marketed for STEM education and technical assistance, rigorous evaluation of their engineering reasoning is critical. The observation that models produce "physically invalid yet superficially plausible solutions" is practically important. The work arrives alongside related efforts (EngiBench, EEE-Bench, SeePhys) but distinguishes itself through process-oriented evaluation—a meaningful differentiator.
Missing comparisons: The paper would benefit from evaluating more models (GPT-4o, Claude, Llama variants) and comparing against MMMU engineering subsets directly. Testing on problems guaranteed to be outside training data (e.g., newly created problems) would address contamination concerns.
This is a solid benchmark paper that identifies a real evaluation gap and proposes a principled framework to address it. The empirically grounded DAG design and process-oriented evaluation are the strongest contributions. However, the limited scale of both the benchmark and the validation study, the narrow model coverage, and the potential over-punitiveness of the scoring framework temper the impact. The work opens a productive research direction but requires broader validation to establish EngJudge as a community standard.
Generated Jun 10, 2026
Paper 2 addresses a critical bottleneck in autonomous Computer Use Agents (long-horizon planning and visual grounding in GUIs), which has broad applications across web, mobile, and desktop environments. Its cross-platform performance improvements offer immediate, wide-ranging impact for AI agents. Paper 1 introduces a valuable but narrower domain-specific benchmark for engineering reasoning, making its broader scientific and practical impact slightly more constrained compared to the generalized agent framework in Paper 2.
Paper 2 introduces a highly innovative methodology for transforming expert traces into deterministic code, bridging neuro-symbolic AI and agentic systems. Its successful real-world enterprise deployment, cost-efficiency analysis, and generalization to multiple benchmarks demonstrate exceptional practical utility and methodological rigor. While Paper 1 offers a valuable benchmark for VLMs, Paper 2's broader implications for reliable, low-cost AI deployment and its novel iterative refinement approach afford it a significantly higher potential impact across both academia and industry.
STAGE-Claw addresses a more fundamental and broadly impactful challenge: automated benchmark generation and state-based evaluation for LLM-powered personal agents, a rapidly growing area. Its scalable, automated framework for creating and validating benchmarks tackles key limitations (static tasks, coarse scoring) affecting the entire agent ecosystem. While EngVQA provides valuable domain-specific evaluation for engineering reasoning with a strong process-oriented framework, its scope is narrower. STAGE-Claw's contribution to agent evaluation infrastructure has broader applicability across the AI community and addresses more timely scalability concerns.
Paper 2 likely has higher scientific impact: it introduces a new benchmark (EngVQA) plus a general stage-wise evaluation framework that can become shared infrastructure for measuring progress in multimodal engineering reasoning, with clear real-world relevance (education, technical decision-making) and strong validation (human agreement, high correlation). Its impact can span VLM evaluation, reliability, and domain-specific AI. Paper 1 offers a solid, novel insight into subgoal persistence in latent hierarchical reasoning, but is narrower in immediate applicability and primarily advances a specific modeling design within limited tasks.
Paper 1 addresses a fundamental and broadly applicable problem—memory retention for long-horizon language agents—with a novel constrained optimization framework (OSL-MR) that introduces rigorous formalism (observability constraints, delayed costs, budget feasibility) to a problem mostly handled by heuristics. This has broad impact across all applications of long-context agents. Paper 2 introduces a useful benchmark (EngVQA) with a stage-wise evaluation framework, but benchmarks tend to have more incremental impact unless widely adopted. Paper 1's methodological contribution—bridging optimization theory with practical agent memory—offers deeper conceptual novelty and broader applicability across fields.
Paper 1 likely has higher impact due to stronger novelty and urgency: it benchmarks agentic LLM capabilities directly tied to biosecurity and includes wet-lab validation showing real-world execution (robotic DNA assembly), which raises immediate safety and governance implications. Its applications span AI evaluation, biotechnology automation, and biosecurity policy, giving broad cross-field relevance and timeliness. Paper 2 is methodologically solid and useful for VLM assessment in engineering, but is narrower in societal stakes and lacks comparable real-world validation beyond benchmarking and stage-wise scoring.
Paper 2 addresses a more fundamental and broadly impactful problem—rigorous step-level verification of mathematical proofs—with a novel framework that reveals deep insights about LLM reasoning failures (context poisoning, pedantic hyper-rigor). Its contributions have broader implications for automated theorem proving, formal verification, and agentic reasoning systems. The discovery that remaining errors stem from implicit domain conventions rather than logical hallucinations is a significant conceptual advance. Paper 1, while valuable for engineering education evaluation, is more application-specific and incremental in its benchmarking contribution.
Paper 1 is likely to have higher scientific impact because it identifies and systematically measures a broadly consequential failure mode (memory-amplified sycophancy) that affects correctness and safety across many LLM applications. It contributes a benchmark (MIST), a cross-system empirical study spanning multiple memory systems and model families, a mechanistic hypothesis (memory extraction/compression), and practical mitigations—making it actionable for deployed systems. Paper 2 is timely and rigorous with a useful benchmark and stage-wise evaluation, but its impact is more domain-specific (engineering VLM reasoning) and primarily diagnostic rather than offering mitigation.
Paper 1 (EngVQA) addresses a clear, well-defined gap in VLM evaluation for engineering reasoning with a novel stage-wise evaluation framework that achieves strong human correlation (0.975 Pearson). Its focus on process-oriented evaluation rather than just final answers is methodologically rigorous and broadly applicable. Paper 2 tackles interesting challenges in dynamic agent evaluation but covers a broader, less focused scope. Paper 1's concrete benchmark with 696 problems across 5 engineering subjects, combined with its practical relevance to engineering education and AI-assisted technical decision-making, gives it stronger potential for adoption and citation impact.
Paper 1 has higher potential impact due to several factors: (1) it addresses a broader and more fundamental question about VLM reasoning capabilities in engineering domains, which spans multiple scientific fields; (2) the 8-stage evaluation framework for process-oriented assessment is methodologically novel and transferable to other reasoning benchmarks; (3) it has stronger real-world implications for AI in engineering education and scientific assistance; (4) the high correlation (0.975) with human evaluation validates the framework rigorously. Paper 2, while valuable, addresses a narrower domain (Office automation) with more incremental contributions to benchmarking LLM agents.