Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Yizhe Chi, Deyao Hong, Dapeng Jiang, Tianwei Luo, Kaisen Yang, Boshi Zhang, Zhe Cao, Xiaoyan Fan
Abstract
Current LLM agent benchmarks, which predominantly focus on binary pass/fail tasks such as code generation or search-based question answering, often neglect the value of real-world engineering, which is typically captured through the iterative optimization of feasible designs. To this end, we introduce Frontier-Eng, a human-verified benchmark for generative optimization -- an iterative propose-execute-evaluate loop in which an agent generates candidate artifacts, receives executable verifier feedback, and revises them under a fixed interaction budget -- spanning tasks across five broad engineering categories. Unlike previous suites, Frontier-Eng tasks are grounded in industrial-grade simulators and verifiers that provide continuous reward signals and enforce hard feasibility constraints under constrained budgets. We evaluate eight frontier language models using representative search frameworks, finding that while Claude 4.6 Opus achieves the most robust performance, the benchmark remains challenging for all models. Our analysis suggests a dual power-law decay in improvement frequency (∝ 1/iteration) and magnitude (∝ 1/improvement count). We further show that although width improves parallelism and diversity, depth remains crucial for hard-won improvements under a fixed budget. Frontier-Eng establishes a new standard for assessing the capacity of AI agents to integrate domain knowledge with executable feedback to solve complex, open-ended engineering problems.
AI Impact Assessments
(3 models) Scientific Impact Assessment: Frontier-Eng
1. Core Contribution
Frontier-Eng introduces a benchmark of 47 engineering optimization tasks spanning five categories (computing/quantum, operations research, robotics/control, optics/communications, physical sciences) designed to evaluate LLM agents in a generative optimization paradigm — an iterative propose-execute-evaluate loop under fixed budgets. The central insight is that most existing LLM benchmarks evaluate binary pass/fail tasks, while real engineering value accrues through iterative improvement of feasible designs under hard constraints. The benchmark formalizes this as a triple τ = (C, x₀, E), with task context, initial feasible solution, and frozen evaluator returning both feasibility indicators and continuous scalar scores.
The key novelty is the combination of: (a) continuous reward signals rather than binary outcomes, (b) hard feasibility constraints enforced by domain-specific simulators, (c) budget-aware evaluation where the metric is the best feasible solution found within a fixed interaction budget, and (d) cross-domain coverage with a unified evaluation interface.
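The triple τ = (C, x₀, E) and the budget-aware metric described above can be illustrated with a minimal sketch. Everything named here (`Task`, `toy_evaluator`, `propose`, `run_loop`) is hypothetical and not taken from the benchmark's code; a random perturbation stands in for the LLM proposer, and a one-dimensional toy objective stands in for a simulator.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Task:
    context: str                                      # C: task description (unused by the toy proposer)
    x0: float                                         # x0: initial feasible solution
    evaluate: Callable[[float], Tuple[bool, float]]   # E: frozen evaluator -> (feasible, score)


def toy_evaluator(x: float) -> Tuple[bool, float]:
    """Hypothetical evaluator: hard constraint 0 <= x <= 4, score peaks at x = 3."""
    feasible = 0.0 <= x <= 4.0
    score = -(x - 3.0) ** 2  # higher is better
    return feasible, score


def propose(best_x: float, rng: random.Random) -> float:
    """Stand-in for the agent: perturb the best feasible solution so far."""
    return best_x + rng.uniform(-0.5, 0.5)


def run_loop(task: Task, budget: int, seed: int = 0) -> Tuple[float, float, List[float]]:
    """Propose-execute-evaluate for `budget` iterations; track the best feasible score."""
    rng = random.Random(seed)
    feasible, best_score = task.evaluate(task.x0)
    assert feasible, "x0 must be feasible by construction"
    best_x = task.x0
    history = [best_score]
    for _ in range(budget):
        candidate = propose(best_x, rng)
        ok, score = task.evaluate(candidate)
        # Infeasible candidates never replace the incumbent, whatever their score.
        if ok and score > best_score:
            best_x, best_score = candidate, score
        history.append(best_score)
    return best_x, best_score, history


task = Task(context="maximize -(x-3)^2 subject to 0 <= x <= 4",
            x0=1.0, evaluate=toy_evaluator)
best_x, best_score, history = run_loop(task, budget=100)
```

The reported quantity is `best_score`, the best feasible score within the budget, so the history is non-decreasing by construction.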
2. Methodological Rigor
Strengths in design: The benchmark employs a layered anti-hack architecture (read-only evaluators, sandboxed execution, verifier-parsed scoring) that addresses a genuine concern in agentic benchmarks. The two-stage quality control (automated + human review) adds credibility. The evaluation protocol is thoughtfully designed — average rank as the primary metric sidesteps the incomparability of heterogeneous score units, while Dolan-Moré performance profiles recover magnitude information.
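To make the two metrics concrete, here is a small sketch of average rank and a Dolan-Moré style performance profile. It assumes each task's score has already been converted to a positive cost where lower is better; how the paper normalizes its heterogeneous scores is not described in this assessment, and the function names and numbers below are made up for illustration.

```python
def average_ranks(costs):
    """costs: {task: {model: cost}}, lower is better. Tied models share the mean rank."""
    per_model = {}
    for task_costs in costs.values():
        ordered = sorted(task_costs.items(), key=lambda kv: kv[1])
        i = 0
        while i < len(ordered):
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            mean_rank = (i + j) / 2 + 1  # 1-based rank, averaged over the tie group
            for k in range(i, j + 1):
                per_model.setdefault(ordered[k][0], []).append(mean_rank)
            i = j + 1
    return {m: sum(r) / len(r) for m, r in per_model.items()}


def performance_profile(costs, model, tau):
    """Fraction of tasks on which `model` is within a factor `tau` of the best cost."""
    hits = 0
    for task_costs in costs.values():
        best = min(task_costs.values())
        if task_costs[model] / best <= tau:
            hits += 1
    return hits / len(costs)


# Illustrative, made-up costs with very different units per task.
costs = {
    "task_a": {"m1": 1.0, "m2": 2.0, "m3": 4.0},
    "task_b": {"m1": 30.0, "m2": 10.0, "m3": 20.0},
}
ranks = average_ranks(costs)
```

Ranks ignore how far apart the costs are, which is what makes them comparable across tasks; the profile brings that magnitude back by reporting how often a model lands within a given factor of the per-task best.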
Experimental breadth: Eight frontier models are evaluated under three search frameworks (OpenEvolve, ShinkaEvolve, ABMCTS), with trajectory-level analysis providing deeper insights than simple leaderboard rankings. The dual power-law finding (improvement frequency ∝ 1/iteration, magnitude ∝ 1/improvement count) is an interesting empirical observation, though the R² = 0.84 with a constrained slope of -1 is somewhat forced — the authors fix the exponent rather than fitting it freely, which weakens the claim.
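The difference between fixing the exponent and fitting it freely can be sketched as follows. This is an illustrative least-squares fit in log-log space on synthetic data, not the authors' fitting procedure; the assessment does not say how their R² was computed, so measuring it in log space is an assumption.

```python
import math


def r_squared(ys, preds):
    """Coefficient of determination; can be negative for a poorly constrained fit."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def fit_power_law(xs, ys, fixed_slope=None):
    """Fit log y = intercept + slope * log x; optionally hold the slope fixed."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    if fixed_slope is None:
        sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
        sxx = sum((a - mx) ** 2 for a in lx)
        slope = sxy / sxx
    else:
        slope = fixed_slope
    intercept = my - slope * mx  # least-squares intercept for the chosen slope
    preds = [intercept + slope * a for a in lx]
    return slope, intercept, r_squared(ly, preds)


# Synthetic data that truly follows y = 5 * x^(-0.6), not x^(-1).
xs = list(range(1, 51))
ys = [5.0 * x ** -0.6 for x in xs]

free_slope, _, free_r2 = fit_power_law(xs, ys)
fixed_slope, _, fixed_r2 = fit_power_law(xs, ys, fixed_slope=-1.0)
```

On this synthetic data the free fit recovers the true exponent with R² near 1, while the slope fixed at -1 still yields a moderate R² of about 0.56. A respectable R² under a constrained slope therefore does not by itself show that the exponent is -1.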
Limitations: The paper evaluates only a single budget (100 iterations) for most experiments, with 500 iterations for the scaling analysis. Statistical significance is not reported — there are no confidence intervals, no repeated runs with different seeds for the search frameworks, and no hypothesis testing for model comparisons. Given the stochasticity of both LLM sampling and some evaluators, this is a notable gap. The depth-vs-width analysis (Section 3.3.2) uses only 10 of 47 tasks, limiting generalizability of the "depth dominates width" conclusion.
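One way such uncertainty could be reported is a paired percentile bootstrap over tasks. The sketch below is a generic illustration rather than anything from the paper; `bootstrap_mean_diff_ci` and the per-task ranks are hypothetical.

```python
import random


def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Paired percentile bootstrap CI for the mean of (a_i - b_i) over tasks."""
    assert len(a) == len(b) and len(a) > 0
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Made-up per-task ranks for two models on ten tasks (lower is better).
ranks_a = [1, 2, 1, 3, 2, 1, 4, 2, 1, 3]
ranks_b = [2, 1, 3, 4, 2, 3, 3, 4, 2, 5]
mean_diff, ci_lo, ci_hi = bootstrap_mean_diff_ci(ranks_a, ranks_b)
```

If the interval excludes zero, the rank gap is unlikely to be an artifact of which tasks happened to be sampled. Variation across random seeds of the search framework would need repeated runs and is not captured by resampling tasks alone.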
3. Potential Impact
Benchmark impact: If adopted, Frontier-Eng could meaningfully shift how the community evaluates LLM agents, moving beyond pass/fail metrics toward optimization quality. The engineering grounding makes results more interpretable to practitioners. The 47-task scale across five domains is substantial enough to resist overfitting by any single method.
Practical relevance: The tasks connect directly to real engineering workflows — battery charging optimization, structural design, GPU kernel engineering, job-shop scheduling. Improvements on these tasks translate to tangible value (faster computations, lighter structures, more efficient processes). This practical grounding is a significant strength over purely synthetic benchmarks.
Methodological insight: The dual power-law observation and depth-vs-width analysis provide actionable guidance for designing generative optimization systems. The case studies (Appendix B) offer particularly rich qualitative insights — for example, the finding that mid-tier models benefit from incremental scaffolding while frontier models can perform architectural "leapfrogs" is practically useful.
Limitations to impact: The benchmark requires significant computational resources (multiple frontier model API calls × 100+ iterations × 47 tasks × multiple frameworks), which may limit adoption. The paper does not discuss cost or carbon footprint. Reproducibility depends on API-accessed models whose behavior changes over time.
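For a sense of scale, the product above can be worked through with the counts stated in this assessment (8 models, 3 frameworks, 47 tasks, 100 iterations). The sketch assumes every model is run under every framework on every task, which the paper may not have done, so the result is an illustrative upper-bound style count rather than a reported figure.

```python
def total_evaluations(n_models, n_frameworks, n_tasks, iterations):
    """Number of propose-execute-evaluate steps for a full model x framework x task grid."""
    return n_models * n_frameworks * n_tasks * iterations


# Assumption: full cross product at the 100-iteration budget.
full_grid = total_evaluations(n_models=8, n_frameworks=3, n_tasks=47, iterations=100)
print(full_grid)  # 112800
```

Each step involves at least one model call plus one simulator run, so even this single-seed grid implies on the order of a hundred thousand API calls.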
4. Timeliness & Relevance
This work arrives at an important moment. The success of FunSearch, AlphaEvolve, and Learning to Discover at Test Time has demonstrated LLMs' capacity for iterative optimization, but no comprehensive cross-domain benchmark existed. The gap between binary coding benchmarks (SWE-bench) and real engineering optimization is well-articulated. The framing of "generative optimization" as a distinct evaluation paradigm is timely and could crystallize a research direction.
However, the benchmark's longevity faces risks: as models improve, tasks that are currently challenging may become trivially solved, and the fixed-budget evaluation may need recalibration. The authors acknowledge this implicitly but provide no saturation analysis.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
6. Additional Observations
The paper is well-written and clearly structured. The case study in Section 2.5 (SustainableDataCenterControl) effectively grounds the abstract formulation. The comprehensive task catalog in Appendix A demonstrates serious curatorial effort. The finding that Claude systematically outperforms across tasks while GPT-OSS shows more persistent late-stage updates is nuanced and avoids simplistic model rankings.
Generated Apr 15, 2026
Comparison History (57)
Paper 1 offers a fundamental methodological breakthrough in Out-of-Distribution (OOD) detection, a critical challenge for AI safety. By leveraging diffusion models as universal feature extractors via geometric analysis, it achieves a 'train-once, deploy-anywhere' paradigm with massive (~500x) sample efficiency improvements. While Paper 2 provides a highly valuable and timely benchmark for evaluating LLM agents in engineering, Paper 1 presents a highly novel, theoretically grounded algorithmic advancement that solves a pervasive bottleneck in robust model deployment across multiple domains, giving it a higher potential for foundational scientific impact.
Paper 1 introduces a novel paradigm (train-once, deploy-anywhere) for OOD detection with strong theoretical grounding in information geometry and diffusion models. Its ~500× sample efficiency improvement and cross-domain generalization from a single pretrained model represent a significant methodological advance with broad practical implications for safe AI deployment. Paper 2 contributes a useful benchmark for LLM agents but benchmarks are incremental by nature and more narrowly scoped. Paper 1's theoretical insights (connecting diffusion score functions to Sobolev norms for OOD detection) and practical framework have greater potential to influence multiple research directions.
Paper 1 presents a concrete, novel benchmark (Frontier-Eng) with 47 engineering tasks, empirical evaluations of 8 frontier models, and quantitative findings (dual power-law decay, depth vs. width analysis). It fills a clear gap in LLM agent evaluation with reproducible methodology and actionable results. Paper 2, while addressing an important conceptual problem (contextual multi-objective optimization), is primarily a position/framework paper without empirical validation. Concrete benchmarks with empirical findings tend to generate more citations and downstream research than conceptual frameworks, as they provide tools the community can directly build upon.
Paper 1 is likely higher impact due to a concrete, human-verified benchmark with industrial-grade simulators, continuous rewards, and hard constraints—an immediately usable artifact that can standardize evaluation and drive measurable progress in agentic generative optimization. It demonstrates empirical results across models and provides quantitative scaling observations, indicating methodological rigor and actionable insights. Paper 2 is timely and broadly relevant conceptually, but is primarily a framework/proposal with less empirical grounding; its impact depends on later formalization and benchmarks/implementations.
Paper 1 introduces a novel benchmark grounded in industrial-grade simulators, bridging AI with real-world engineering. By shifting focus from binary pass/fail tasks to continuous generative optimization, it opens new avenues for AI applications in physical sciences and engineering. While Paper 2 presents a strong algorithmic improvement for LLM reasoning, Paper 1's interdisciplinary approach and potential to drive practical, high-impact industrial applications give it broader real-world and scientific impact.
Paper 1 introduces a rigorous, human-verified benchmark that connects LLM agents to real-world engineering via industrial simulators. By moving beyond binary evaluations to generative optimization and continuous feedback, it opens a highly impactful, cross-disciplinary domain for AI applications. Paper 2 presents an innovative metacognitive framework, but it represents a more incremental architectural improvement within the well-explored domain of LLM reasoning.
Paper 2 likely has higher impact: it introduces a human-verified, simulator-backed benchmark for generative optimization on real engineering tasks, directly enabling standardized evaluation and progress in agentic design workflows with executable feedback—high real-world applicability and broad relevance (ML, robotics, CAD, systems, optimization). The methodology appears rigorous (industrial-grade verifiers, fixed budgets, multiple models/frameworks) and timely given the push toward tool-using agents. Paper 1 is novel and important for prompting/scientific reliability, but its contribution is primarily diagnostic and narrower in immediate downstream ecosystem effects.
Paper 1 likely has higher scientific impact: it introduces a human-verified, executable-feedback benchmark grounded in industrial simulators, enabling rigorous, scalable evaluation of self-improving agent loops with continuous rewards and feasibility constraints—directly relevant to real-world engineering deployment. Its methodological contribution (benchmark + analyses of optimization dynamics and depth/width tradeoffs) can become a standard across ML, agent systems, and applied engineering domains. Paper 2 is novel and timely as a cautionary finding about in-context learning, but its main contribution is diagnostic and narrower in application compared to a widely adoptable benchmark infrastructure.
Paper 2 addresses a critically timely societal concern—LLM persuasion and its psychological mechanisms—with a rigorous longitudinal study design (770 participants, 3,080 conversations). Its findings on psychological susceptibility, fallacious reasoning by LLMs, and trust dynamics have broad implications across psychology, AI safety, policy, and platform governance. The interdisciplinary nature and real-world relevance to AI regulation give it wider impact. Paper 1, while valuable as a benchmark contribution, serves a narrower AI/engineering audience and is more incremental in advancing the field.
Paper 2 likely has higher impact due to a broadly reusable, human-verified benchmark targeting self-evolving agents on real-world engineering tasks with executable verifiers and continuous rewards—an area central to current LLM-agent progress. Its methodological contribution (47 industrial-grade tasks, fixed-budget propose-execute-evaluate loop, multi-model evaluation, empirical scaling/decay laws) is likely to be adopted across ML, robotics, systems, and design optimization, enabling standardized comparison and driving follow-on work. Paper 1 is novel and societally important but is narrower in generalizability and may be more context-dependent.
Paper 2 addresses a fundamental question about AI safety—whether iterative self-training amplifies undesirable traits like sycophancy or misalignment. This has broad implications for AI alignment, model training practices, and safety policy. Its finding that iterative finetuning is 'mostly idempotent' is a surprising, clean result with immediate practical relevance as self-training becomes widespread. Paper 1, while valuable as a benchmark contribution, is more incremental—adding another benchmark to an already crowded space. Paper 2's insights are more generalizable across the field and timely given current concerns about recursive self-improvement and AI safety.
Paper 2 addresses a fundamental question about AI safety—whether iterative self-training amplifies undesirable traits like sycophancy or misalignment. This has broad implications for the entire LLM training pipeline and AI safety community. Its finding that iterative finetuning is 'mostly idempotent' with an amplification-coherence tradeoff provides actionable safety insights. Paper 1 introduces a useful benchmark but is more incremental, extending existing evaluation paradigms. Paper 2's relevance to AI alignment, its rigorous experimental design across multiple training regimes, and its timeliness given concerns about recursive self-improvement give it higher potential impact.
Paper 1 has higher potential impact due to its more novel, broadly applicable benchmarking framework for iterative generative optimization with executable feedback, continuous rewards, and feasibility constraints across diverse real-world engineering domains. This addresses a key gap in evaluating agentic systems beyond pass/fail tasks and can become a foundational standard used across robotics, design, simulation-based optimization, and AI safety/evaluation. Paper 2 is strong and timely, but is closer to an integrated agentic AutoML system evaluated on an existing benchmark; its impact may be more incremental and sensitive to rapid tooling/model improvements.
Paper 1 has higher potential impact due to its unifying theoretical contribution: it bridges Bayesian variational inference (Free Energy Principle), stochastic game theory, and thermodynamics, with formal equivalence results (stationary points ↔ approximate Nash equilibria) and a new higher-order synergy measure (free-energy Harsanyi dividend). This offers broad cross-field relevance (neuroscience, biology, multi-agent AI, statistical physics) and generates falsifiable predictions validated across domains. Paper 2 is timely and useful as an evaluation benchmark, but its impact is narrower and more incremental (benchmark + empirical scaling observations) and depends on community adoption.
D3-Gym offers higher scientific impact because it provides an automatically constructable, large-scale benchmark (565 tasks from 239 real repositories) with verifiable environments for scientific discovery, and critically demonstrates that training on sampled trajectories yields substantial downstream gains (7.8 points on ScienceAgentBench). This dual contribution—benchmark + training data pipeline—has broader utility for the community. While Frontier-Eng is a well-designed engineering optimization benchmark with interesting scaling analyses, D3-Gym's open-source artifacts, automatic construction workflow, and demonstrated training benefits make it more immediately impactful for advancing AI-for-science research.
Paper 1 pioneers the integration of LLM agents with industrial-grade simulators for generative optimization in real-world engineering tasks. This pushes beyond standard pass/fail software benchmarks into physical and engineering domains with continuous constraints, offering broader cross-disciplinary impact and potential real-world applications than Paper 2's efficiency improvements for conversational agent evaluation.
Paper 2 introduces a novel, rigorous benchmark grounded in real-world engineering and industrial simulators, moving beyond traditional pass/fail tasks. High-quality benchmarks historically drive significant progress in the AI community by establishing new research directions. While Paper 1 offers a valuable efficiency improvement for evaluation, Paper 2 sets a new standard for assessing self-evolving agents in complex, open-ended scenarios, likely leading to broader applicability and higher long-term impact on frontier model development.
Paper 2 introduces a novel theoretical framework for understanding MoE models by decomposing hidden states into control and content channels, revealing that expert paths (not individual experts) are the natural unit of interpretability. This provides fundamental mechanistic insight applicable across all MoE architectures, with broad implications for interpretability, model design, and AI safety. Paper 1, while practically useful as a benchmark, is more incremental—adding another evaluation suite to an already crowded space. Paper 2's conceptual contribution (monosemantic paths, control/content decomposition) is more likely to influence future research directions across multiple subfields.
Paper 1 likely has higher impact due to broader, timely relevance: it introduces a human-verified benchmark for self-evolving LLM agents on 47 real-world engineering optimization tasks with executable verifiers, continuous rewards, and feasibility constraints—directly targeting a major gap in agent evaluation. The benchmark can be reused across the community, enabling standardized comparisons and accelerating progress across multiple domains (AI agents, optimization, simulation-based design). Paper 2 is innovative and high-value for MRI, but its impact is narrower (medical imaging) and more application-specific.
Paper 1 introduces a comprehensive benchmark spanning multiple engineering disciplines, addressing a fundamental gap in evaluating LLM agents on continuous, real-world optimization tasks. Benchmarks typically have profound, field-wide impacts by standardizing evaluation and driving future research. While Paper 2 presents an impressive, highly effective framework for RTL optimization, its focus is narrowly confined to hardware design and EDA, limiting its broader scientific impact compared to Paper 1's generalizable approach to agentic engineering.