EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

Gioele Molinari, Florian Felten, Soheyl Massoudi, Mark Fuge

May 19, 2026

arXiv:2605.19743v1

cs.AI (primary), cs.LG, cs.MA

#1543 of 2292 · Artificial Intelligence

Tournament Score: 1368 ± 40 (scale 1050 to 1800)

Win Rate: 55%

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 6 · Clarity 7.5

Abstract

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring isolating retrieval contributions to parameter selection; and (3) a High Performance Computing (HPC) benchmark evaluating end-to-end ML training orchestration on a SLURM cluster. Alongside the benchmark, we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging, with task completion dropping to 20-53% for the conditional style on Photonics2D. RAG gating confirms near-perfect retrieval-augmented scores ($\approx 1.0$) versus near-zero without retrieval, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: EngiAI

1. Core Contribution

EngiAI addresses a genuine gap at the intersection of LLM-based multi-agent systems and engineering design automation. The paper makes two intertwined contributions: (1) a benchmark suite with three evaluation dimensions—workflow execution under varying cognitive demands, gated RAG scoring for parameter selection, and HPC training orchestration—and (2) a reference multi-agent implementation built on LangGraph that coordinates seven specialized agents (engineering, RAG, ArXiv, search, HPC, CLI, and 3D printer control) through a supervisor architecture.

The benchmark design is the more novel contribution. The seven prompt styles (FULL, NATURAL, W-RAND, W-DERIVED, W-DISTRACT, W-COND, W-MULTI) systematically probe distinct failure modes: numerical fidelity, semantic grounding under competing values, conditional branching on simulation outputs, and working-memory demands across sequential tool calls. These go beyond standard tool-calling benchmarks by requiring agents to chain numeric outputs from one tool into downstream decisions—a realistic demand in engineering workflows that existing benchmarks do not capture.

2. Methodological Rigor

Strengths in evaluation design: The gated RAG scoring mechanism is well-conceived—it prevents agents from receiving credit for parameters guessed from parametric memory by requiring that `search_documents` be invoked. The three-condition ablation (RAG-on, RAG-off, Empty RAG) cleanly isolates retrieval contributions. The discovery that Gemini scores well on P0 even with an empty index (likely due to memorized defaults) demonstrates the mechanism's diagnostic value.
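The gating idea described above can be illustrated with a minimal sketch. This is not the paper's scoring code: the function name `gated_rag_score`, the 5% relative tolerance, and the parameter names `volfrac` and `rmin` are assumptions made for illustration. Only the gate on the `search_documents` tool comes from the text.

```python
from typing import Dict, List


def gated_rag_score(
    tool_calls: List[str],
    chosen: Dict[str, float],
    reference: Dict[str, float],
    gate_tool: str = "search_documents",
    rel_tol: float = 0.05,  # hypothetical tolerance, not taken from the paper
) -> float:
    """Fraction of reference parameters matched, gated on retrieval use.

    If the retrieval tool was never invoked, the score is 0.0 even when the
    chosen values happen to be right, so guesses from parametric memory
    receive no credit.
    """
    if gate_tool not in tool_calls:
        return 0.0
    if not reference:
        return 0.0
    hits = 0
    for name, ref_value in reference.items():
        value = chosen.get(name)
        if value is not None and abs(value - ref_value) <= rel_tol * abs(ref_value):
            hits += 1
    return hits / len(reference)


# Hypothetical parameter names and values, for illustration only.
reference = {"volfrac": 0.35, "rmin": 2.0}
guessed = {"volfrac": 0.35, "rmin": 2.0}

with_retrieval = gated_rag_score(["search_documents", "run_optimizer"], guessed, reference)
without_retrieval = gated_rag_score(["run_optimizer"], guessed, reference)
```

Under these assumptions, identical parameter choices score 1.0 when retrieval was invoked and 0.0 when it was not, which is the separation the RAG-on and RAG-off conditions rely on.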

Weaknesses: The experimental scope is limited. Only two EngiBench problems (Beams2D, Photonics2D) are tested, with Photonics2D receiving only three of seven prompt styles. Four LLM backends is a modest sample, and the two open-source models come from the same family (Qwen3), limiting generalizability claims. The HPC benchmark is restricted to two proprietary models. The paper acknowledges these limitations but they substantially constrain the conclusions.

The analysis is entirely descriptive—no statistical significance tests, confidence intervals, or sensitivity analyses are provided. With 15 runs per cell (3 seeds × 5 samples), the standard deviations reported are sometimes large relative to mean differences (e.g., W-COND task completion for Qwen3.5-4B: 0.60 ± 0.49), making it difficult to draw firm conclusions about model comparisons in several cases.
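To make the uncertainty concrete, the sketch below computes a 95% Wilson score interval for a completion rate of 0.60 over 15 runs. It assumes task completion is a binary outcome per run, which the review does not state; the reported standard deviation of 0.49 is consistent with that assumption, since a Bernoulli variable with mean 0.6 has standard deviation sqrt(0.6 × 0.4) ≈ 0.49. The Wilson interval is one reasonable choice here, not a method used in the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 9 of 15 runs completed corresponds to a mean task completion of 0.60.
bernoulli_std = math.sqrt(0.6 * 0.4)
low, high = wilson_interval(9, 15)
print(f"std={bernoulli_std:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The interval spans roughly 0.36 to 0.80, so two models whose means differ by 0.1 or 0.2 at this sample size cannot be separated with confidence.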

The composite scoring methodology, while documented transparently with weights justified via AHP analysis, involves many degrees of freedom. The 3D watertightness sub-score is uniformly zero due to a pipeline limitation (non-manifold meshes from threshold-based STL extraction), effectively wasting 11% of the design quality weight. This is acknowledged but not corrected.
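The effect of a sub-score that is always zero can be shown with a small weighted-sum sketch. Only the 11% watertightness weight comes from the review; the other sub-score names and weights below are placeholders chosen to sum to 1, and renormalizing over the live sub-scores is one possible correction, not something the paper does.

```python
from typing import Dict, Set


def composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Plain weighted sum of sub-scores."""
    return sum(weights[name] * scores[name] for name in weights)


def renormalized(scores: Dict[str, float], weights: Dict[str, float], dead: Set[str]) -> float:
    """Weighted sum over the sub-scores that can vary, with weights rescaled to 1."""
    live = {name: w for name, w in weights.items() if name not in dead}
    total = sum(live.values())
    return sum(w * scores[name] for name, w in live.items()) / total


# Placeholder weights: only the 0.11 for watertightness is taken from the review.
weights = {"iou": 0.45, "pixel_accuracy": 0.44, "watertight_3d": 0.11}
# A design that is perfect on every sub-score the pipeline can actually produce.
best_case = {"iou": 1.0, "pixel_accuracy": 1.0, "watertight_3d": 0.0}

ceiling = composite(best_case, weights)
rescaled = renormalized(best_case, weights, dead={"watertight_3d"})
```

With these placeholder weights the best attainable design-quality score is 0.89 rather than 1.0, which compresses the range available for distinguishing models.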

The absence of a single-agent baseline with identical tool access is a notable omission for validating the multi-agent supervisor architecture's value.

3. Potential Impact

Practical relevance: The system addresses a real need—making ML-driven engineering design accessible to non-experts through natural language. The integration of topology optimization, RAG, HPC orchestration, and 3D printer control in a unified framework is practically valuable, even if each component individually is not novel.

Benchmark utility: If released publicly, the benchmark could become a useful evaluation tool for the growing number of engineering agent systems. The prompt-style taxonomy (especially W-COND and W-DISTRACT) provides reusable evaluation patterns applicable beyond engineering to any domain requiring conditional reasoning over tool outputs.

Findings with broader implications: The conditional branching results are the paper's most interesting finding. W-COND performance drops sharply on Photonics2D (best model: 53% TC), and the failure mode is specifically *branch inversion*—agents parse the conditional structure correctly but swap the branches. This is a precise, actionable insight about LLM reasoning that could inform prompt engineering and model development broadly. The HPC orchestration finding that GPT-5-mini degrades from 70% to 50% on natural vs. explicit prompts at the final step similarly provides useful evidence about multi-step instruction following degradation over long-horizon workflows.
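Branch inversion can be separated from other conditional failures with a simple check over the agent's trace. The sketch below is an assumed illustration, not the paper's evaluator: the function name, the "greater than threshold" form of the condition, and the action names are all hypothetical.

```python
def classify_conditional_outcome(
    observed_value: float,
    threshold: float,
    action_if_above: str,
    action_if_below: str,
    action_taken: str,
) -> str:
    """Label how an agent handled 'if value > threshold do A, else do B'.

    Returns 'correct', 'branch_inversion' (the agent took the other branch of
    the same conditional), or 'other_failure' (it did something else entirely).
    """
    if observed_value > threshold:
        expected, opposite = action_if_above, action_if_below
    else:
        expected, opposite = action_if_below, action_if_above
    if action_taken == expected:
        return "correct"
    if action_taken == opposite:
        return "branch_inversion"
    return "other_failure"


# Hypothetical W-COND style instruction: if the simulated objective exceeds 0.5,
# re-run the optimizer; otherwise export the design.
outcome = classify_conditional_outcome(
    observed_value=0.62,
    threshold=0.5,
    action_if_above="rerun_optimizer",
    action_if_below="export_design",
    action_taken="export_design",
)
print(outcome)
```

Tallying these three labels separately is what distinguishes an agent that misread the conditional structure from one that understood it but swapped the branches.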

4. Timeliness & Relevance

The paper is highly timely. LLM-based agent systems are proliferating rapidly, but engineering-specific evaluation frameworks lag behind. The positioning against prior systems in the paper's Table 1 (which shows that no prior system covers all six capability dimensions) is compelling, though the comparison is somewhat apples-to-oranges, since the systems address different design phases and levels of complexity.

The inclusion of HPC orchestration as a benchmark dimension is forward-looking—as ML-augmented design becomes more common, the ability to autonomously manage training pipelines becomes increasingly important.

5. Strengths & Limitations

Key Strengths:

Well-structured prompt taxonomy that systematically isolates cognitive demands not tested by existing benchmarks

Gated RAG scoring is an elegant evaluation design that prevents credit attribution to parametric memory

The branch-inversion failure mode on W-COND is a precise, novel finding about LLM reasoning

Generational comparison (Qwen3-4B → Qwen3.5-4B) provides useful data on open-source model progress

Comprehensive appendix with all raw scores enables alternative scoring analyses

Notable Limitations:

Narrow problem coverage (two problems, one model family for open-source)

No single-agent ablation to validate the multi-agent architecture's benefit

No user study with domain engineers

Descriptive statistics only—no significance testing

The system itself (supervisor + specialized agents) is architecturally straightforward; the engineering contribution is primarily in the benchmark design rather than the agent architecture

Source code is "planned for public release" rather than available, limiting immediate reproducibility

Design quality scores are remarkably similar across models within each style (IoU ≈ 0.40, PA ≈ 0.73 for Beams2D), suggesting the discriminative power lies primarily in task completion and tool efficiency rather than design quality itself

Overall Assessment

This is a competent systems paper that identifies a real evaluation gap and proposes a reasonable benchmark to fill it. The prompt-style taxonomy and gated RAG scoring are genuinely useful contributions. However, the experimental scope is limited, the agent architecture is not particularly novel, and the analysis lacks statistical rigor. The most impactful elements are the benchmark design patterns (especially conditional branching and semantic disambiguation), which could transfer to other domains. The findings, while interesting, are preliminary given the narrow model and problem coverage.

Rating: 5.8 / 10

Significance 6 · Rigor 5 · Novelty 6 · Clarity 7.5

Generated May 20, 2026

Comparison History (22)

vs. ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking

gemini-3.1 · 5/22/2026

Paper 2 introduces a comprehensive benchmark suite and a multi-agent framework bridging LLMs with complex, real-world engineering workflows like HPC orchestration and 3D printing. Benchmarks in emerging intersections of AI and physical sciences typically drive significant future research and garner high citations. While Paper 1 presents a strong algorithmic contribution for ranking and explainability, Paper 2 offers broader interdisciplinary impact, higher potential for real-world application, and addresses the highly timely challenge of evaluating multi-agent systems in specialized domains.

vs. Towards Direct Evaluation of Harness Optimizers via Priority Ranking

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: it delivers an end-to-end multi-agent engineering design framework plus a benchmark spanning workflow reasoning, RAG isolation, and real HPC/SLURM orchestration with manufacturing integration. This supports real-world adoption in engineering pipelines and provides a standardized evaluation suite across models and tasks. Paper 1 is novel and methodologically useful for direct optimizer evaluation, but is narrower in scope (harness optimization) and its benchmark size/domain breadth is more limited, reducing cross-field impact.

vs. Scaling Observation-aware Planning in Uncertain Domains

gpt-5.2 · 5/22/2026

Paper 1 likely has higher impact due to timeliness and breadth: it targets the rapidly expanding LLM-agent ecosystem and provides a benchmark suite plus reference multi-agent implementation spanning retrieval, simulation, HPC orchestration, and manufacturing control. Benchmarks and tooling can become community infrastructure with broad cross-domain adoption. The methodology includes gated RAG evaluation and varied prompt/agent workflow tests, supporting rigorous comparisons across models. Paper 2 offers strong algorithmic advances for OOP/POMDP fragments with large performance gains, but the niche scope and narrower applicability may limit overall impact despite higher theoretical rigor.

vs. Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to broader applicability and stronger real-world relevance: it delivers a benchmark suite spanning workflow prompting, RAG attribution via gated scoring, and HPC/SLURM orchestration, plus a reference multi-agent implementation integrating simulation, retrieval, and manufacturing (3D printing). This creates reusable infrastructure for evaluating and building engineering-design agents across ML, HCI, systems, and CAD/CAE domains. Paper 1 is novel for ToM/BDI persuasive dialogue and provides a dataset and reasoning framework, but its impact is more niche (persuasion/ToM) and less directly tied to deployable engineering workflows.

vs. Governance by Construction for Generalist Agents

claude-opus-4.6 · 5/21/2026

Paper 2 introduces both a novel multi-agent framework and a comprehensive benchmark suite for LLM-driven engineering design, combining simulation, retrieval, and HPC orchestration with rigorous empirical evaluation across multiple LLM backends. It addresses a clear gap in evaluating multi-agent systems for engineering tasks and provides reproducible quantitative results. Paper 1 presents a governance architecture for enterprise agents that, while practically useful, is primarily a demo/system paper describing policy enforcement mechanisms without deep empirical validation or novel algorithmic contributions. Paper 2's broader methodological contribution and benchmark utility give it higher potential impact.

vs. AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

gpt-5.2 · 5/21/2026

Paper 1 offers a clearer methodological contribution: a structured benchmark suite with gated RAG scoring, multi-agent workflow dimensions, and an HPC orchestration benchmark, plus a reference MAS implementation and comparative results across multiple LLMs. This is timely for evaluating agentic LLM systems and is likely reusable across engineering, ML evaluation, and tool-using agent research. Paper 2 is relevant infrastructure for publishing and may have practical adoption impact, but the abstract suggests less technical novelty and rigor (evaluation metrics/controls unclear), making its scientific impact less certain.

vs. Governance by Construction for Generalist Agents

claude-opus-4.6 · 5/21/2026

Paper 2 introduces both a novel multi-agent framework and a comprehensive benchmark suite for LLM-driven engineering design, addressing a clear gap in evaluation methodology. It provides rigorous empirical results across multiple LLM backends, evaluation dimensions (workflow, RAG, HPC), and problem domains, offering reproducible scientific contributions. Paper 1, while practically valuable, is primarily a demo/system paper describing a governance architecture without substantial empirical evaluation or novel scientific methodology. Paper 2's benchmark contributions have broader impact potential across the engineering AI and multi-agent systems communities.

vs. AiraXiv: An AI-Driven Open-Access Platform for Human and AI Scientists

gpt-5.2 · 5/21/2026

Paper 2 has higher likely scientific impact: it contributes a concrete multi-agent framework plus a benchmark suite with clearly defined evaluation dimensions, measurable results across multiple LLM backends, and methodological controls (e.g., gated RAG scoring) that strengthen rigor and reproducibility. Its applications span engineering design, simulation/RAG, HPC orchestration, and manufacturing workflows, making it broadly useful to both AI-systems and engineering communities. Paper 1 is timely and potentially transformative as infrastructure, but appears more platform/proposal-oriented with impact dependent on adoption and less on generalizable scientific evaluation.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental gap in LLM scientific reasoning by systematically investigating and improving logical faithfulness, which is broadly applicable across scientific domains. Its methodology—assessment criteria and logicality-guided training—offers transferable contributions beyond physics. Paper 2, while practical and well-constructed, is more narrowly focused on engineering design benchmarks and multi-agent orchestration. Paper 1's focus on the essence of scientific reasoning (logicality) has broader implications for the growing field of LLM-based science, making it likely to influence more research directions and attract wider citation.

vs. Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a broader interdisciplinary problem (survey methodology + LLMs + disaster research) with novel theoretical contributions (PMT-constrained knowledge graph, A-TLM method, subgroup-stratified bias auditing). It introduces methodologically rigorous comparisons against established statistical baselines, proposes new reporting standards, and has clear real-world applications in disaster preparedness and survey science—fields with massive user bases. Paper 1, while technically solid, is more narrowly focused on engineering design benchmarking for LLM agents, serving a smaller community. Paper 2's contributions to imputation methodology and bias auditing have broader cross-field applicability.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

claude-opus-4.6 · 5/20/2026

OpenComputer addresses the broader and more impactful problem of general-purpose computer-use agents with verifiable evaluation across 33 desktop applications. Its contributions—state verifiers, self-evolving verification, task generation, and trajectory evaluation—have wider applicability across AI agent research. The framework exposes fundamental gaps in frontier and open-source models for robust computer automation, which is a highly timely research direction. EngiAI, while rigorous, targets a narrower engineering design niche with a domain-specific multi-agent benchmark that will impact fewer research communities.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gemini-3.1 · 5/20/2026

Paper 1 addresses a critical, universal bottleneck in LLM agent deployment: privacy and security against adversarial third-party interactions. Its focus on the vulnerabilities of smaller, on-device models provides essential insights for the broader AI alignment and security communities. While Paper 2 offers a robust and rigorous framework for engineering design, its impact is highly domain-specific. Paper 1's generalizable findings on privacy-utility trade-offs have far-reaching implications across almost all real-world LLM agent applications, giving it a higher potential for broad scientific impact.

vs. Streamlined Constraint Reasoning via CNN Pattern Recognition on Enumerated Solutions

gpt-5.2 · 5/20/2026

Paper 1 is more methodologically and scientifically novel: it introduces a concrete, testable pipeline that links solution enumeration, contrastive CNN pattern learning, and grounded streamliner synthesis, yielding very large, quantifiable speedups on classic hard CP benchmarks. The approach advances automated constraint reasoning and could transfer to other combinatorial optimization settings. Paper 2 is timely and useful as an evaluation/benchmarking contribution for LLM engineering agents, but much of its innovation is systems integration and benchmarking rather than a new core scientific method, and its impact may be more transient as LLM tooling and benchmarks evolve quickly.

vs. Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

claude-opus-4.6 · 5/20/2026

Paper 1 introduces a novel conceptual contribution—'embedding by elicitation'—where LLMs serve as adaptive semantic representation builders for Bayesian optimization over natural language. This is a genuinely new idea that bridges LLM capabilities with principled optimization, with broad applicability beyond prompt tuning to any text-based optimization problem. Paper 2, while solid engineering work, is primarily a benchmark and reference implementation for multi-agent engineering design—a more incremental contribution in an increasingly crowded space of LLM agent frameworks. Paper 1's methodological innovation has greater potential to influence future research directions across multiple fields.

vs. HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

gpt-5.2 · 5/20/2026

Paper 2 has higher likely scientific impact due to a clear, high-stakes real-world application (72-hour flood alerts protecting food security), domain-specific methodological innovation (explicit deseasonalization to prevent leakage and SAR-based upstream proxy adding lead time), and an operational end-to-end pipeline including damage estimation. Its contributions can influence hydrology, remote sensing, disaster risk reduction, and applied ML. Paper 1 is timely and useful for LLM-agent evaluation, but benchmarks/frameworks may see narrower adoption and faster obsolescence as agent tooling and models evolve.

vs. Minimax Optimal Variance-Aware Regret Bounds for Multinomial Logistic MDPs

claude-opus-4.6 · 5/20/2026

Paper 2 establishes minimax optimal regret bounds for MNL mixture MDPs, introducing a variance-aware constant that tightens existing bounds and proving matching lower bounds. This is a fundamental theoretical contribution that fully characterizes the regret complexity of an important RL problem class for the first time. Its methodological rigor (matching upper and lower bounds) and broad applicability to structured MDPs (e.g., robust MDPs) give it lasting theoretical impact. Paper 1, while practically useful, is primarily an engineering benchmark/framework contribution with more incremental novelty and narrower impact scope.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

gpt-5.2 · 5/20/2026

Paper 2 is likely to have higher impact due to greater novelty and broader applicability: it introduces a benchmark suite plus a reference multi-agent framework spanning workflow reasoning, RAG attribution, and HPC orchestration, with quantitative results across multiple LLM backends. This creates a reusable evaluation standard and tooling that can influence research in LLM agents, engineering design automation, retrieval evaluation, and ML ops/HPC. Paper 1 provides valuable empirical systems insights for multi-model LLM serving, but it is more specialized (scheduler/offloading/preemption characterization) and primarily informs infrastructure design rather than establishing a widely adoptable benchmark.

vs. Agentic Trading: When LLM Agents Meet Financial Markets

gemini-3.1 · 5/20/2026

Paper 1 presents a novel multi-agent framework and benchmark suite bridging LLMs with complex, real-world engineering tasks like HPC orchestration and 3D printing. This end-to-end integration of software AI with physical manufacturing demonstrates high innovation and broad applicability in engineering domains. In contrast, Paper 2 is primarily a systematic review and audit of existing literature in financial trading. While Paper 2 provides valuable methodological critiques, Paper 1's introduction of a new open benchmark and functional reference architecture offers greater potential to drive immediate technical advancements and novel applications in applied AI.

vs. CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to broader cross-field applicability and stronger real-world integration: it benchmarks and operationalizes multi-agent LLM workflows spanning simulation/optimization, RAG, HPC orchestration, and manufacturing (3D printing). This ties evaluation directly to reproducible, end-to-end engineering tasks with clear deployment relevance and timely importance as agentic systems move into production. Paper 2 is novel and rigorous for affective reasoning evaluation, but its applications are narrower and less immediately tied to measurable external-world task execution.

vs. Self-supervised Hierarchical Visual Reasoning with World Model

claude-opus-4.6 · 5/20/2026

ResDreamer introduces a novel hierarchical world model architecture with residual reconstruction that addresses fundamental challenges in reinforcement learning—error accumulation in visual foresight and scalability. Its purely self-supervised approach, grounded in the 'Bitter Lesson' philosophy, offers broad applicability across open-world RL domains. Paper 1, while practically useful, is more of an engineering benchmark/framework for LLM-driven design—a narrower contribution. Paper 2's methodological innovation in hierarchical representation learning has greater potential to influence the broader ML/RL community and spawn follow-up work.