FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization
Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li
Abstract
Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models both in one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.
AI Impact Assessments
Scientific Impact Assessment: FrontierOR
1. Core Contribution
FrontierOR introduces a benchmark of 180 optimization tasks derived from top-tier OR publications, designed to evaluate whether LLMs can move beyond formulating optimization problems to actually *designing efficient algorithms* that scale to realistic problem sizes (10²–10⁷ variables/constraints). This is a meaningful conceptual shift: prior benchmarks (NL4OPT, OptiBench, CO-Bench) primarily test formulation accuracy on small instances, whereas FrontierOR tests algorithmic engineering — whether LLM-generated code can outperform or match Gurobi baselines in both solution quality and runtime on large instances where monolithic solver calls fail.
The benchmark includes natural-language problem descriptions (no mathematical notation), hidden expert-verified Gurobi baselines, standalone feasibility checkers, and a multi-dimensional evaluation suite (execution rate, feasibility, solution quality, quality–time efficiency). The "Hard" subset of 50 tasks where Gurobi itself struggles adds a particularly meaningful evaluation tier.
2. Methodological Rigor
Strengths: The construction pipeline is thorough — papers are sourced from leading OR venues, formulations are extracted and verified, Gurobi baselines are cross-checked against feasibility checkers, and 15 OR experts conducted a three-week multi-round review. The evaluation protocol is well-designed: the cascaded metric (feasibility → quality → QTE) properly orders concerns, and the continuous metrics (Δq, Δt) in Appendix D supplement binary pass rates with magnitude information. The containerized single-core execution environment ensures fair comparison.
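To make the cascade concrete, here is a minimal sketch of how a feasibility → quality → QTE ordering could be scored for a single instance. The review does not give the paper's exact definitions, so the relative-gap formulas for Δq and Δt, the minimization convention, the tolerance `tol`, and the names `RunResult` and `cascade` are all illustrative assumptions rather than FrontierOR's actual evaluator.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    executed: bool    # code ran to completion (assumed stage 0)
    feasible: bool    # solution passed the feasibility checker
    objective: float  # objective value; minimization is assumed here
    runtime: float    # wall-clock seconds


def cascade(candidate, baseline_objective, baseline_runtime, tol=1e-6):
    """Return the furthest stage a candidate reaches (hypothetical labels)."""
    if not candidate.executed:
        return "failed_execution"
    if not candidate.feasible:
        return "infeasible"
    # Assumed relative gaps: negative values mean the candidate beats the baseline.
    dq = (candidate.objective - baseline_objective) / max(abs(baseline_objective), 1e-12)
    dt = (candidate.runtime - baseline_runtime) / baseline_runtime
    if dq > tol:
        return "feasible_only"   # feasible but worse than the baseline objective
    if dt > 0:
        return "quality_only"    # matches quality but is slower
    return "qte"                 # matches quality and is at least as fast


# Example: same objective as the baseline, half the runtime.
print(cascade(RunResult(True, True, 100.0, 5.0), 100.0, 10.0))
```

Ordering the checks this way means a fast but infeasible solution never earns credit for speed, which is the point of the cascade described above.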
Concerns: All seven LLMs are evaluated in the one-shot setting, but self-evolution is tested with only one backbone (GPT-5.3-Codex) on 20 tasks, and the self-evolution budget of 30 candidates is modest. The choice of Gurobi as the sole baseline is reasonable but somewhat limiting: many OR practitioners use specialized algorithms, and comparing only against a general-purpose solver may not capture the full difficulty spectrum. Additionally, although the paper reports 180 tasks, the diversity within each task (the number of large instances per task) is not clearly reported, making it difficult to assess statistical power.
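The statistical-power concern can be illustrated with a rough calculation. The sketch below computes a Wilson score interval for a pass rate, assuming purely for illustration that a 50% rate on 20 tasks corresponds to 10 successes in 20 independent trials. The paper may aggregate over instances rather than tasks, so the counts are an assumption, and `wilson_interval` is a helper written for this example.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumed counts: 10 of 20 hard tasks pass (illustrative reading of "50%").
low, high = wilson_interval(10, 20)
print(f"{low:.2f} to {high:.2f}")  # roughly 0.30 to 0.70
```

Under these assumptions the interval spans roughly 30% to 70%, which shows how little a single 20-task run constrains the underlying rate.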
The natural-language descriptions deliberately exclude mathematical notation, which tests a specific capability (inferring structure from prose) but may not reflect how practitioners actually communicate optimization problems. This design choice could conflate language understanding failures with algorithmic design failures.
3. Potential Impact
Benchmark impact: FrontierOR fills a genuine gap. The OR/optimization community lacks standardized benchmarks for evaluating algorithmic creativity rather than just formulation correctness. If adopted, it could become a standard evaluation platform for LLM-based optimization agents, similar to how MIPLIB serves the solver community.
Practical implications: The finding that the best one-shot model achieves QTE on only 31% of instances is a sobering calibration of current capabilities. The algorithmic pattern analysis (Figure 3a) — showing that stronger models produce more diverse algorithm families (decomposition, local search, matheuristics) while weaker models default to monolithic solver calls — provides actionable insight for developers building LLM-based optimization tools.
Agent/self-evolution insights: The comparison of OpenEvolve, EoH, and CORAL reveals that different search strategies (depth vs. breadth vs. migration) suit different problem structures. CORAL's multi-agent approach reaching 50% QTE on hard tasks (vs. 15% one-shot) demonstrates the potential of test-time computation scaling for optimization, though 50% still signals a major capability gap.
4. Timeliness & Relevance
This paper is highly timely. The intersection of LLMs and optimization is seeing rapid growth (AlphaEvolve, EoH, CORAL all from 2024-2026), yet evaluation infrastructure has lagged behind. The community needs benchmarks that test beyond toy problems, and FrontierOR addresses this directly. The shift from "can LLMs formulate?" to "can LLMs design efficient algorithms?" matches the natural progression of the field.
The paper also appears at a moment when commercial interest in AI-for-optimization is surging (Gurobi, FICO, and others integrating LLM features), making rigorous capability benchmarking commercially as well as scientifically relevant.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
FrontierOR makes a substantive contribution as an evaluation infrastructure for an important emerging capability. Its main value is in establishing that current LLMs are far from reliable algorithm designers and in providing a platform to track progress. The benchmark design is thoughtful, the evaluation is multi-faceted, and the findings are informative. However, the experimental scope (especially for self-evolution) could be broader, and several methodological concerns (contamination, single-run variance, absence of human baselines) temper the strength of the conclusions.
Generated May 26, 2026
Comparison History (22)
Paper 2 (FrontierOR) likely has higher impact because it introduces a scalable, expert-verified benchmark suite for a timely, high-stakes capability—LLM-driven efficient algorithm design in large-scale optimization—spanning 180 tasks from top OR papers with standardized instances and hidden evaluation. Benchmarks can catalyze broad, cross-field progress (LLMs, agents, OR, benchmarking, software engineering) and provide enduring infrastructure for measurement and comparison. Paper 1 is a strong, practical systems contribution, but appears more architecture-specific (Qwen3-VL) and narrower in breadth despite clear real-world efficiency gains.
Paper 2 likely has higher impact due to a clearer leap in task difficulty and real-world relevance: benchmarking LLMs on scalable algorithm design for large-scale optimization, with expert-verified hidden evaluation and tasks grounded in top OR papers. This targets a concrete, high-value capability gap (going beyond formulation to efficient algorithms) with broad applicability across operations research, industrial optimization, and AI-for-science/engineering. Paper 1 is novel in automated benchmark generation for tool-using agents, but its impact is more confined to agent evaluation methodology and may depend on ecosystem adoption of the specific tool/task framework.
Paper 1 likely has higher impact due to stronger novelty and broader cross-field relevance: it targets a hard, under-benchmarked capability (LLM-driven scalable algorithm design) with realistic large-scale OR tasks curated from top venues and a hidden expert-verified evaluation suite, supporting rigorous and durable measurement. Its applications span logistics, scheduling, energy, and general optimization, making it broadly useful to both ML and OR communities. Paper 2 is timely and valuable for retrieval-augmented forecasting agents, but its scope is narrower (forecasting with external documents) and may have more domain-specific impact.
Paper 1 addresses a fundamental limitation in evaluating LLM reasoning by introducing a general, multi-dimensional framework. Its findings on the disconnect between accuracy and logical coherence have broad, field-wide implications for AI benchmarking, safety, and accountability. Paper 2, while highly valuable, introduces a benchmark for a specific domain (operations research and optimization algorithm design), making its potential impact narrower than the generalized evaluation framework proposed in Paper 1.
Paper 2 likely has higher scientific impact because it introduces a large, realistic, expert-verified benchmark (FrontierOR) that targets a timely and broadly relevant gap: whether LLMs can design scalable optimization algorithms beyond naive formulate-and-solve. High-quality benchmarks often catalyze progress across academia and industry by standardizing evaluation and revealing capability bottlenecks. Its applications span operations research, ML, and agentic coding, with immediate utility for model developers and researchers. Paper 1 is innovative but more niche (text-based embodied tasks) and depends on a specific training pipeline and model.
Paper 2 likely has higher impact: it targets a central, broadly relevant open problem in LLM agents—whether episodic experience can be distilled into reusable procedural skills—spanning multiple agent environments and touching continual learning, memory, program induction, and safety/robustness. Its design explicitly disentangles skill abstraction from base capability and raw trajectory reuse, and probes distribution shift, shortcuts, and composition, making it timely and widely applicable. Paper 1 is rigorous and valuable but more domain-specific (operations research algorithm design), narrowing breadth despite strong real-world relevance.
Paper 2 addresses a critical, highly timely issue: the massive energy consumption and environmental impact of LLM inference in data centers. By treating GPU power caps as a controllable resource, it offers immediate, practical benefits for energy efficiency and operational costs across all LLM deployments. While Paper 1 introduces a valuable benchmark for operations research, Paper 2's potential economic and environmental impact gives it a broader and more significant scientific and real-world footprint.
Paper 2 has higher likely scientific impact due to its broad, timely relevance to AI governance and open-weight model supply-chain accountability, with immediate policy and platform applications. It introduces a clear, quantifiable concept (the “governance horizon”), audits a very large real-world dataset, and derives actionable design implications validated via comparisons/interventions. Its conclusions generalize across stakeholders (research, platforms, regulators) and fields (ML, security, policy, software ecosystems). Paper 1 is valuable and novel for OR/LLM evaluation but is narrower in domain reach and downstream policy leverage.
Paper 1 introduces a novel benchmark for evaluating LLMs on large-scale optimization algorithm design. Benchmarks in the rapidly evolving LLM space often have a broad and high scientific impact by standardizing evaluation and driving future research directions. While Paper 2 offers strong methodological rigor and important real-world environmental applications, Paper 1's potential to catalyze widespread advancements in LLM capabilities for operations research gives it a broader and more foundational scientific impact.
Paper 1 addresses a critical and universally relevant bottleneck in modern AI: trajectory-level hallucinations in autonomous LLM agents. Its framework for auditing intermediate reasoning steps has broad applicability across AI safety, multi-agent systems, and industrial deployments. Paper 2, while highly rigorous and valuable for Operations Research, focuses on a narrower domain of optimization algorithm design, giving Paper 1 a higher potential for widespread cross-disciplinary impact.
Paper 2 likely has higher scientific impact because it introduces a broadly useful, timely benchmark (FrontierOR) for a core emerging capability—LLM-driven efficient algorithm design—grounded in real, large-scale operations research problems with hidden expert-verified evaluation. Benchmarks often become lasting community infrastructure, enabling standardized comparison across models and stimulating follow-up work across ML, OR, and agentic coding. Paper 1 is a strong method improving VLM reliability, but its impact may be narrower (specific framework/heuristic library) and more susceptible to being subsumed by rapidly improving base VLMs.
FrontierOR introduces a comprehensive benchmark (180 tasks) for evaluating LLMs on large-scale optimization algorithm design, addressing a significant gap between toy benchmarks and real-world OR problems. It has broader impact across optimization, OR, and AI communities, with a publicly available benchmark that enables systematic future evaluation. While SpecAlign addresses an important niche problem in hardware verification, FrontierOR's scope is wider, its benchmark methodology is more reusable, and it reveals fundamental limitations of current LLMs in algorithm design—insights relevant to the broader AI research community.
Paper 2 (CausaLab) is likely to have higher impact: it introduces a scalable, interactive benchmark that directly targets causal discovery and experimental design—core capabilities for “AI scientist” agents with broad relevance across ML, robotics, biomedicine, and scientific automation. The environment cleanly separates predictive accuracy from mechanism recovery via an explicit SCM-hypothesis DSL and ground-truth evaluability, enabling more rigorous diagnosis of causal reasoning failures and intervention strategy quality. Its framing is timely given current emphasis on agentic science and causality. Paper 1 is valuable for OR/optimization, but its impact is more domain-bounded.
Paper 1 introduces a large-scale, practical benchmark evaluating LLMs on complex operations research tasks. Benchmarks bridging LLMs and scalable algorithm design have immense real-world applications in logistics, finance, and engineering, and typically attract high citation counts by driving sub-field progress. This will likely yield broader scientific and industrial impact compared to Paper 2's mechanistic interpretability framework, which, while rigorous, addresses a more niche aspect of spatial reasoning architectures.
Paper 1 establishes a fundamental mathematical impossibility theorem (a quadrilemma) regarding AI explainability, which has profound and long-lasting implications for AI safety, governance, and theoretical machine learning. In contrast, while Paper 2 introduces a valuable and rigorous benchmark for LLMs in optimization, benchmarks typically have a shorter lifespan of scientific relevance as models evolve. The theoretical and cross-disciplinary breadth of Paper 1 gives it a higher potential for foundational scientific impact.
Paper 1 likely has higher impact due to its broad, timely benchmark infrastructure for evaluating LLMs on scalable algorithm design in realistic large-scale optimization—an area with clear real-world stakes across industries (logistics, energy, scheduling) and multiple research communities (LLMs/agents, OR, benchmarking, software engineering). Its methodological rigor (expert-derived tasks, standardized instances, hidden evaluation) supports durable, field-wide adoption. Paper 2 is novel and relevant for diffusion-LLM safety, but targets a narrower model class and application niche; its techniques may be less broadly reusable than a benchmark that can steer progress across many models and tasks.
Paper 2 addresses a critical bottleneck in the highly impactful field of general-purpose AI agents: the lack of scalable, verifiable training data for Computer-Use Agents (CUAs). While Paper 1 introduces a valuable benchmark for the specific niche of Operations Research, Paper 2 provides a comprehensive, scalable pipeline that co-generates environments and rewards, yields a massive dataset (32k+ tasks), and trains state-of-the-art open-source models that outperform existing baselines on major benchmarks. This demonstrates broader applicability, higher methodological innovation, and immediate real-world utility for developing autonomous agents.
Paper 2 (FrontierOR) likely has higher impact because it introduces a large, realistic, expert-verified benchmark targeting a critical and under-measured capability (scalable algorithm design for large-scale optimization). Such benchmarks tend to become community standards, enabling reproducible evaluation and driving progress across LLMs, agents, and operations research. Its applications span OR, industrial optimization, and AI evaluation, with strong timeliness as LLMs move into decision/optimization workflows. Paper 1 is a solid coordination method with efficiency gains, but its contribution is narrower and more model/benchmark-dependent.
FrontierOR has higher potential impact due to several factors: (1) It addresses the broader and more fundamental challenge of LLM-based algorithm design for large-scale optimization, which spans many fields beyond finance; (2) Operations research optimization is applicable across logistics, manufacturing, healthcare, and many industries, giving it wider breadth of impact; (3) The benchmark methodology (180 tasks from top-tier OR venues with expert-verified evaluation) is more rigorous and scalable; (4) The gap it identifies (LLMs struggling to design efficient algorithms beyond correct formulations) points to a deeper scientific challenge. WorkstreamBench, while valuable, is more domain-specific to finance spreadsheets.
Paper 2 introduces a novel foundational framework (Inverse Learning) that bridges Reinforcement Learning and Optimal Control. Its demonstrated ability to significantly outperform existing baselines while reducing compute by orders of magnitude across highly diverse fields—from embodied AI/robotics to quantum computing—suggests massive cross-disciplinary impact. In contrast, Paper 1 is primarily a benchmarking dataset for LLMs in operations research; while valuable, it evaluates existing limitations rather than providing a broadly applicable algorithmic breakthrough.