FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

Minwei Kong, Chonghe Jiang, Ao Qu, Wenbin Ouyang, Zhaoming Zeng, Xiaotong Guo, Zhekai Li, Junyi Li

May 24, 2026

arXiv:2605.25246v1

cs.AI (primary)

#1143 of 2682 · Artificial Intelligence

Tournament Score: 1427±40

Win Rate: 55%

Rating: 7.2/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Abstract

Large language models (LLMs) are increasingly used for optimization modeling and solver-code generation, yet practical operations research and optimization problems often require a harder capability: designing scalable algorithms that exploit problem structure and outperform direct formulation-and-solve baselines. Existing benchmarks are limited to small or simplified examples far below real-world scale and complexity. We introduce FrontierOR, among the first benchmarks to systematically evaluate LLM-based efficient algorithm design for realistic large-scale optimization problems. FrontierOR includes 180 tasks derived from methodologically diverse papers published in top-tier operations research venues, each with standardized instances and a hidden, expert-verified evaluation suite. We evaluate seven LLMs spanning frontier, cost-effective, and open-source models in both one-shot and test-time evolution settings. The results reveal that frontier models still struggle to move from executable formulations to efficient optimization algorithms: the strongest one-shot model outperforms Gurobi in only 31% of cases in both solution quality and computational efficiency, and even strong coding agents with test-time evolution achieve only 50% on selected hard tasks. FrontierOR establishes a practical evaluation platform for LLM-based optimization algorithm design, which enables future LLMs and agents to be systematically tested on whether they can move beyond correct formulation toward a feasible, high-quality, and efficient algorithm. Our FrontierOR Benchmark is available at https://anonymous.4open.science/r/efficient-opt-bench-F03D.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FrontierOR

1. Core Contribution

FrontierOR introduces a benchmark of 180 optimization tasks derived from top-tier OR publications, designed to evaluate whether LLMs can move beyond formulating optimization problems to actually *designing efficient algorithms* that scale to realistic problem sizes (10²–10⁷ variables/constraints). This is a meaningful conceptual shift: prior benchmarks (NL4OPT, OptiBench, CO-Bench) primarily test formulation accuracy on small instances, whereas FrontierOR tests algorithmic engineering — whether LLM-generated code can outperform or match Gurobi baselines in both solution quality and runtime on large instances where monolithic solver calls fail.

The benchmark includes natural-language problem descriptions (no mathematical notation), hidden expert-verified Gurobi baselines, standalone feasibility checkers, and a multi-dimensional evaluation suite (execution rate, feasibility, solution quality, quality–time efficiency). The "Hard" subset of 50 tasks where Gurobi itself struggles adds a particularly meaningful evaluation tier.

2. Methodological Rigor

Strengths: The construction pipeline is thorough — papers are sourced from leading OR venues, formulations are extracted and verified, Gurobi baselines are cross-checked against feasibility checkers, and 15 OR experts conducted a three-week multi-round review. The evaluation protocol is well-designed: the cascaded metric (feasibility → quality → QTE) properly orders concerns, and the continuous metrics (Δq, Δt) in Appendix D supplement binary pass rates with magnitude information. The containerized single-core execution environment ensures fair comparison.
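To make the cascade concrete, the following is a minimal sketch of how a feasibility → quality → QTE ordering could be scored against a baseline run. It is not the benchmark's actual harness: the `RunResult` schema, the function names, the tolerance, and the exact pass rules (objective at least as good as the baseline, runtime no worse than the baseline) are assumptions made for illustration, as is the sign convention for the relative gaps standing in for Δq and Δt.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RunResult:
    """Outcome of one candidate program on one instance (hypothetical schema)."""
    executed: bool               # program ran to completion without crashing
    feasible: bool               # solution passed the feasibility checker
    objective: Optional[float]   # objective value, None if no solution was produced
    runtime_s: Optional[float]   # wall-clock runtime in seconds


def cascaded_outcome(run: RunResult, base_objective: float, base_runtime_s: float,
                     minimize: bool = True, tol: float = 1e-6) -> str:
    """Return the furthest stage reached: failed, executed, feasible, quality, or qte."""
    if not run.executed:
        return "failed"
    if not run.feasible or run.objective is None or run.runtime_s is None:
        return "executed"
    # Assumed quality rule: match or beat the baseline objective within a relative tolerance.
    slack = tol * max(1.0, abs(base_objective))
    if minimize:
        quality_ok = run.objective <= base_objective + slack
    else:
        quality_ok = run.objective >= base_objective - slack
    if not quality_ok:
        return "feasible"
    # Assumed efficiency rule: runtime must not exceed the baseline runtime.
    if run.runtime_s > base_runtime_s:
        return "quality"
    return "qte"


def relative_gaps(run: RunResult, base_objective: float, base_runtime_s: float,
                  minimize: bool = True) -> Tuple[float, float]:
    """Signed relative gaps (dq, dt); negative means the candidate beat the baseline."""
    sign = 1.0 if minimize else -1.0
    dq = sign * (run.objective - base_objective) / max(abs(base_objective), 1e-12)
    dt = (run.runtime_s - base_runtime_s) / max(base_runtime_s, 1e-12)
    return dq, dt
```

Under these assumed rules, a run only counts toward a later stage if it has passed every earlier one, which is what keeps a fast but infeasible program from being credited for efficiency.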

Concerns: The paper evaluates all 7 LLMs in the one-shot setting but tests self-evolution with only one backbone (GPT-5.3-Codex) on 20 tasks. The self-evolution budget of 30 candidates is modest. The choice of Gurobi as the sole baseline is reasonable but somewhat limiting: many OR practitioners use specialized algorithms, and comparing only against a general-purpose solver may not capture the full difficulty spectrum. Additionally, while the paper reports 180 tasks, the number of large instances per task is not clearly reported, making it difficult to assess statistical power.

The natural-language descriptions deliberately exclude mathematical notation, which tests a specific capability (inferring structure from prose) but may not reflect how practitioners actually communicate optimization problems. This design choice could conflate language understanding failures with algorithmic design failures.

3. Potential Impact

Benchmark impact: FrontierOR fills a genuine gap. The OR/optimization community lacks standardized benchmarks for evaluating algorithmic creativity rather than just formulation correctness. If adopted, it could become a standard evaluation platform for LLM-based optimization agents, similar to how MIPLIB serves the solver community.

Practical implications: The finding that the best one-shot model achieves QTE on only 31% of instances is a sobering calibration of current capabilities. The algorithmic pattern analysis (Figure 3a) — showing that stronger models produce more diverse algorithm families (decomposition, local search, matheuristics) while weaker models default to monolithic solver calls — provides actionable insight for developers building LLM-based optimization tools.

Agent/self-evolution insights: The comparison of OpenEvolve, EoH, and CORAL reveals that different search strategies (depth vs. breadth vs. migration) suit different problem structures. CORAL's multi-agent approach reaching 50% QTE on hard tasks (vs. 15% one-shot) demonstrates the potential of test-time computation scaling for optimization, though 50% still signals a major capability gap.

4. Timeliness & Relevance

This paper is highly timely. The intersection of LLMs and optimization is seeing rapid growth (AlphaEvolve, EoH, CORAL all from 2024-2026), yet evaluation infrastructure has lagged behind. The community needs benchmarks that test beyond toy problems, and FrontierOR addresses this directly. The shift from "can LLMs formulate?" to "can LLMs design efficient algorithms?" matches the natural progression of the field.

The paper also appears at a moment when commercial interest in AI-for-optimization is surging (Gurobi, FICO, and others integrating LLM features), making rigorous capability benchmarking commercially as well as scientifically relevant.

5. Strengths & Limitations

Key Strengths:

Scale and diversity: 180 tasks across 9 problem classes, 5 optimization paradigms, and multiple application domains — significantly larger and more diverse than competitors (CO-Bench: 36, HeuriGym: 9).

Literature grounding: Tasks derived from real papers ensures practical relevance and provides natural difficulty calibration.

Multi-dimensional evaluation: The separation of feasibility, quality, and efficiency metrics enables nuanced diagnosis of failure modes.

Failure mode taxonomy: The analysis in Figure 3b showing how failure modes stratify by model strength (formulation errors → heuristic search failures) is a genuinely useful diagnostic contribution.

Detailed case studies: The appendix examples (Tasks 1-3) with side-by-side success/failure analysis provide concrete, reproducible illustrations.

Notable Limitations:

Reproducibility concerns: The benchmark URL points to an anonymous repository; long-term maintenance and community adoption remain uncertain.

Limited self-evolution evaluation: Only GPT-5.3-Codex is used as the backbone for self-evolution experiments, and only on 20 tasks — the generalizability of these findings is unclear.

Gurobi-only baseline: Some problems may be better served by specialized solvers (CP-SAT, SCIP, domain-specific algorithms), and the Gurobi baseline may be unnecessarily strong or weak depending on problem structure.

No human expert baseline: The paper compares LLMs against Gurobi but never against human algorithm designers, which would be the most meaningful comparison for assessing whether LLMs can replicate expert-level algorithm engineering.

Potential data contamination: Papers from 1992-2025 are likely in LLM training data; the paper does not address whether models may have memorized algorithmic approaches from the source papers.

Single-run evaluation: No mention of variance across multiple generations or statistical significance testing for one-shot results.
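As an illustration of the uncertainty a single run leaves unreported, the sketch below computes a Wilson score interval for a binomial pass rate. The function name and the example counts are hypothetical; the paper's per-instance counts are not given here, so nothing below reproduces its results. The calculation also treats instances as independent, which is a simplification when several instances come from the same task.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95% coverage)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    # The interval is centred on a value pulled from p_hat toward 0.5.
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical counts: 56 passes out of 180 evaluations (about 31%).
low, high = wilson_interval(56, 180)
print(f"pass rate {56 / 180:.3f}, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting an interval of this kind alongside each pass rate, or repeating generations and reporting the spread, would make differences between models easier to judge.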

Overall Assessment

FrontierOR makes a substantive contribution as an evaluation infrastructure for an important emerging capability. Its main value is in establishing that current LLMs are far from reliable algorithm designers and in providing a platform to track progress. The benchmark design is thoughtful, the evaluation is multi-faceted, and the findings are informative. However, the experimental scope (especially for self-evolution) could be broader, and several methodological concerns (contamination, single-run variance, absence of human baselines) temper the strength of the conclusions.

Rating: 7.2/10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Generated May 26, 2026

Comparison History (22)

vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

gpt-5.2 · 5/28/2026

Paper 2 (FrontierOR) likely has higher impact because it introduces a scalable, expert-verified benchmark suite for a timely, high-stakes capability—LLM-driven efficient algorithm design in large-scale optimization—spanning 180 tasks from top OR papers with standardized instances and hidden evaluation. Benchmarks can catalyze broad, cross-field progress (LLMs, agents, OR, benchmarking, software engineering) and provide enduring infrastructure for measurement and comparison. Paper 1 is a strong, practical systems contribution, but appears more architecture-specific (Qwen3-VL) and narrower in breadth despite clear real-world efficiency gains.

vs. A Matter of TASTE: Improving Coverage and Difficulty of Agent Benchmarks

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact due to a clearer leap in task difficulty and real-world relevance: benchmarking LLMs on scalable algorithm design for large-scale optimization, with expert-verified hidden evaluation and tasks grounded in top OR papers. This targets a concrete, high-value capability gap (going beyond formulation to efficient algorithms) with broad applicability across operations research, industrial optimization, and AI-for-science/engineering. Paper 1 is novel in automated benchmark generation for tool-using agents, but its impact is more confined to agent evaluation methodology and may depend on ecosystem adoption of the specific tool/task framework.

vs. Dr-CiK: A Testbed for Foresight-Driven Agents

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to stronger novelty and broader cross-field relevance: it targets a hard, under-benchmarked capability (LLM-driven scalable algorithm design) with realistic large-scale OR tasks curated from top venues and a hidden expert-verified evaluation suite, supporting rigorous and durable measurement. Its applications span logistics, scheduling, energy, and general optimization, making it broadly useful to both ML and OR communities. Paper 2 is timely and valuable for retrieval-augmented forecasting agents, but its scope is narrower (forecasting with external documents) and may have more domain-specific impact.

vs. Measuring Reasoning Quality in LLMs: A Multi-Dimensional Behavioral Framework

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental limitation in evaluating LLM reasoning by introducing a general, multi-dimensional framework. Its findings on the disconnect between accuracy and logical coherence have broad, field-wide implications for AI benchmarking, safety, and accountability. Paper 2, while highly valuable, introduces a benchmark for a specific domain (operations research and optimization algorithm design), making its potential impact narrower than the generalized evaluation framework proposed in Paper 1.

vs. Test-Time Deep Thinking to Explore Implicit Rules

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact because it introduces a large, realistic, expert-verified benchmark (FrontierOR) that targets a timely and broadly relevant gap: whether LLMs can design scalable optimization algorithms beyond naive formulate-and-solve. High-quality benchmarks often catalyze progress across academia and industry by standardizing evaluation and revealing capability bottlenecks. Its applications span operations research, ML, and agentic coding, with immediate utility for model developers and researchers. Paper 1 is innovative but more niche (text-based embodied tasks) and depends on a specific training pipeline and model.

vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact: it targets a central, broadly relevant open problem in LLM agents—whether episodic experience can be distilled into reusable procedural skills—spanning multiple agent environments and touching continual learning, memory, program induction, and safety/robustness. Its design explicitly disentangles skill abstraction from base capability and raw trajectory reuse, and probes distribution shift, shortcuts, and composition, making it timely and widely applicable. Paper 1 is rigorous and valuable but more domain-specific (operations research algorithm design), narrowing breadth despite strong real-world relevance.

vs. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical, highly timely issue: the massive energy consumption and environmental impact of LLM inference in data centers. By treating GPU power caps as a controllable resource, it offers immediate, practical benefits for energy efficiency and operational costs across all LLM deployments. While Paper 1 introduces a valuable benchmark for operations research, Paper 2's potential economic and environmental impact gives it a broader and more significant scientific and real-world footprint.

vs. A governance horizon for ethical-use constraints in open-weight AI models

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact due to its broad, timely relevance to AI governance and open-weight model supply-chain accountability, with immediate policy and platform applications. It introduces a clear, quantifiable concept (the “governance horizon”), audits a very large real-world dataset, and derives actionable design implications validated via comparisons/interventions. Its conclusions generalize across stakeholders (research, platforms, regulators) and fields (ML, security, policy, software ecosystems). Paper 1 is valuable and novel for OR/LLM evaluation but is narrower in domain reach and downstream policy leverage.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gemini-3.1 · 5/26/2026

Paper 1 introduces a novel benchmark for evaluating LLMs on large-scale optimization algorithm design. Benchmarks in the rapidly evolving LLM space often have a broad and high scientific impact by standardizing evaluation and driving future research directions. While Paper 2 offers strong methodological rigor and important real-world environmental applications, Paper 1's potential to catalyze widespread advancements in LLM capabilities for operations research gives it a broader and more foundational scientific impact.

vs. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

gemini-3.1 · 5/26/2026

Paper 1 addresses a critical and universally relevant bottleneck in modern AI: trajectory-level hallucinations in autonomous LLM agents. Its framework for auditing intermediate reasoning steps has broad applicability across AI safety, multi-agent systems, and industrial deployments. Paper 2, while highly rigorous and valuable for Operations Research, focuses on a narrower domain of optimization algorithm design, giving Paper 1 a higher potential for widespread cross-disciplinary impact.

vs. Pseudocode-Guided Structured Reasoning for Automating Reliable Inference in Vision-Language Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact because it introduces a broadly useful, timely benchmark (FrontierOR) for a core emerging capability—LLM-driven efficient algorithm design—grounded in real, large-scale operations research problems with hidden expert-verified evaluation. Benchmarks often become lasting community infrastructure, enabling standardized comparison across models and stimulating follow-up work across ML, OR, and agentic coding. Paper 1 is a strong method improving VLM reliability, but its impact may be narrower (specific framework/heuristic library) and more susceptible to being subsumed by rapidly improving base VLMs.

vs. SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

claude-opus-4.6 · 5/26/2026

FrontierOR introduces a comprehensive benchmark (180 tasks) for evaluating LLMs on large-scale optimization algorithm design, addressing a significant gap between toy benchmarks and real-world OR problems. It has broader impact across optimization, OR, and AI communities, with a publicly available benchmark that enables systematic future evaluation. While SpecAlign addresses an important niche problem in hardware verification, FrontierOR's scope is wider, its benchmark methodology is more reusable, and it reveals fundamental limitations of current LLMs in algorithm design—insights relevant to the broader AI research community.

vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

gpt-5.2 · 5/26/2026

Paper 2 (CausaLab) is likely to have higher impact: it introduces a scalable, interactive benchmark that directly targets causal discovery and experimental design—core capabilities for “AI scientist” agents with broad relevance across ML, robotics, biomedicine, and scientific automation. The environment cleanly separates predictive accuracy from mechanism recovery via an explicit SCM-hypothesis DSL and ground-truth evaluability, enabling more rigorous diagnosis of causal reasoning failures and intervention strategy quality. Its framing is timely given current emphasis on agentic science and causality. Paper 1 is valuable for OR/optimization, but its impact is more domain-bounded.

vs. Interaction Locality in Hierarchical Recursive Reasoning

gemini-3.1 · 5/26/2026

Paper 1 introduces a large-scale, practical benchmark evaluating LLMs on complex operations research tasks. Benchmarks bridging LLMs and scalable algorithm design have immense real-world applications in logistics, finance, and engineering, and typically attract high citation counts by driving sub-field progress. This will likely yield broader scientific and industrial impact compared to Paper 2's mechanistic interpretability framework, which, while rigorous, addresses a more niche aspect of spatial reasoning architectures.

vs. Fundamental Limitation in Explaining AI

gemini-3.1 · 5/26/2026

Paper 1 establishes a fundamental mathematical impossibility theorem (a quadrilemma) regarding AI explainability, which has profound and long-lasting implications for AI safety, governance, and theoretical machine learning. In contrast, while Paper 2 introduces a valuable and rigorous benchmark for LLMs in optimization, benchmarks typically have a shorter lifespan of scientific relevance as models evolve. The theoretical and cross-disciplinary breadth of Paper 1 gives it a higher potential for foundational scientific impact.

vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to its broad, timely benchmark infrastructure for evaluating LLMs on scalable algorithm design in realistic large-scale optimization—an area with clear real-world stakes across industries (logistics, energy, scheduling) and multiple research communities (LLMs/agents, OR, benchmarking, software engineering). Its methodological rigor (expert-derived tasks, standardized instances, hidden evaluation) supports durable, field-wide adoption. Paper 2 is novel and relevant for diffusion-LLM safety, but targets a narrower model class and application niche; its techniques may be less broadly reusable than a benchmark that can steer progress across many models and tasks.

vs. CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical bottleneck in the highly impactful field of general-purpose AI agents: the lack of scalable, verifiable training data for Computer-Use Agents (CUAs). While Paper 1 introduces a valuable benchmark for the specific niche of Operations Research, Paper 2 provides a comprehensive, scalable pipeline that co-generates environments and rewards, yields a massive dataset (32k+ tasks), and trains state-of-the-art open-source models that outperform existing baselines on major benchmarks. This demonstrates broader applicability, higher methodological innovation, and immediate real-world utility for developing autonomous agents.

vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

gpt-5.2 · 5/26/2026

Paper 2 (FrontierOR) likely has higher impact because it introduces a large, realistic, expert-verified benchmark targeting a critical and under-measured capability (scalable algorithm design for large-scale optimization). Such benchmarks tend to become community standards, enabling reproducible evaluation and driving progress across LLMs, agents, and operations research. Its applications span OR, industrial optimization, and AI evaluation, with strong timeliness as LLMs move into decision/optimization workflows. Paper 1 is a solid coordination method with efficiency gains, but its contribution is narrower and more model/benchmark-dependent.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

claude-opus-4.6 · 5/26/2026

FrontierOR has higher potential impact due to several factors: (1) It addresses the broader and more fundamental challenge of LLM-based algorithm design for large-scale optimization, which spans many fields beyond finance; (2) Operations research optimization is applicable across logistics, manufacturing, healthcare, and many industries, giving it wider breadth of impact; (3) The benchmark methodology (180 tasks from top-tier OR venues with expert-verified evaluation) is more rigorous and scalable; (4) The gap it identifies (LLMs struggling to design efficient algorithms beyond correct formulations) points to a deeper scientific challenge. WorkstreamBench, while valuable, is more domain-specific to finance spreadsheets.

vs. Neuro-Inspired Inverse Learning for Planning and Control

gemini-3.1 · 5/26/2026

Paper 2 introduces a novel foundational framework (Inverse Learning) that bridges Reinforcement Learning and Optimal Control. Its demonstrated ability to significantly outperform existing baselines while reducing compute by orders of magnitude across highly diverse fields—from embodied AI/robotics to quantum computing—suggests massive cross-disciplinary impact. In contrast, Paper 1 is primarily a benchmarking dataset for LLMs in operations research; while valuable, it evaluates existing limitations rather than providing a broadly applicable algorithmic breakthrough.