Open-World Evaluations for Measuring Frontier AI Capabilities

Sayash Kapoor, Peter Kirgis, Andrew Schwartz, Stephan Rabanser, J. J. Allaire, Rishi Bommasani, Harry Coppock, Magda Dubois

#474 of 2292 · Artificial Intelligence
Tournament Score: 1475±42 (scale 1050 to 1800)
Win Rate: 65% (13 wins, 7 losses, 20 matches)
Rating: 6.5/10 (Significance 7, Rigor 5.5, Novelty 6, Clarity 8.5)

Abstract

Benchmark-based evaluation remains important for tracking frontier AI progress. But it can both overstate and understate deployed capability because it privileges tasks that can be precisely specified, automatically graded, easy to optimize for, and run with low budgets and short time horizons. We advocate for a complementary class of evaluations, which we term open-world evaluations: long-horizon, messy, real-world tasks assessed through small-sample qualitative analysis rather than benchmark-scale automation. In this paper we survey recent open-world evaluations, identify their strengths and limitations, and introduce CRUX (Collaborative Research for Updating AI eXpectations), a project for conducting such evaluations regularly. As a first instance, we task an AI agent with developing and publishing a simple iOS application to the Apple App Store. The agent completed the task with only a single avoidable manual intervention, suggesting that open-world evaluations can provide early warning of capabilities that may soon become widespread. We conclude with recommendations for designing and reporting open-world evals.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Open-World Evaluations for Measuring Frontier AI Capabilities

1. Core Contribution

The paper makes two intertwined contributions: a conceptual framework for "open-world evaluations" — long-horizon, real-world tasks assessed through qualitative log analysis rather than benchmark-scale automation — and a concrete instantiation (CRUX #1) in which an AI agent autonomously develops and publishes an iOS app to the Apple App Store. The conceptual contribution is to name, taxonomize, and legitimize an evaluation methodology that has been emerging organically across labs (Anthropic's C compiler, Project Vend, Claude Plays Pokemon, etc.) but lacked a unifying framework or shared methodological standards.

The five-dimensional taxonomy (openness, complexity/duration, number of tasks, human intervention, method of evaluation) provides a useful vocabulary for positioning evaluations along a gradient from simple Q&A benchmarks to messy real-world deployments. The six methodological recommendations (specify construct, document interventions, analyze/release logs, real-time monitoring, dry runs, cost reporting) are practical and well-motivated.
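As a rough illustration of how the five taxonomy dimensions could work as a shared vocabulary, the sketch below records an evaluation as a small Python record. The class name, field names, enum values, and the `is_open_world_style` heuristic are assumptions made for this example and are not taken from the paper. The CRUX #1 entry uses only what the abstract states (one task, one avoidable intervention); the contrasting benchmark entry is invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Method(Enum):
    AUTOMATED_GRADING = "automated grading"
    QUALITATIVE_LOG_ANALYSIS = "qualitative log analysis"


@dataclass(frozen=True)
class EvaluationProfile:
    """Positions one evaluation along the five taxonomy dimensions."""
    name: str
    open_world: bool                        # openness
    duration_hours: Optional[float]         # complexity/duration; None if unknown
    num_tasks: int                          # number of tasks
    avoidable_interventions: Optional[int]  # human intervention
    method: Method                          # method of evaluation

    def is_open_world_style(self) -> bool:
        # Heuristic assumed for this sketch, not a definition from the paper.
        return self.open_world and self.method is Method.QUALITATIVE_LOG_ANALYSIS


# CRUX #1 as described in the abstract: one task, one avoidable intervention.
crux_1 = EvaluationProfile(
    name="CRUX #1 (iOS app)",
    open_world=True,
    duration_hours=None,  # not stated in this review, so left unset
    num_tasks=1,
    avoidable_interventions=1,
    method=Method.QUALITATIVE_LOG_ANALYSIS,
)

# Hypothetical Q&A benchmark, invented here purely for contrast.
toy_benchmark = EvaluationProfile(
    name="toy Q&A benchmark (hypothetical)",
    open_world=False,
    duration_hours=0.1,
    num_tasks=1000,
    avoidable_interventions=0,
    method=Method.AUTOMATED_GRADING,
)

print(crux_1.is_open_world_style(), toy_benchmark.is_open_world_style())
```

A record like this only makes the dimensions explicit; where a given evaluation sits on each one remains a judgment call.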

2. Methodological Rigor

The paper is transparent about its own methodological limitations, which is both a strength and a challenge. The CRUX #1 experiment has n=1, is not reproducible in the traditional sense, and the authors explicitly acknowledge this. The honesty is commendable, but it means the empirical contribution is thin by conventional standards.

The experimental design is reasonable for what it is: dry runs were conducted, interventions were classified and documented, logs are released, and costs are reported — the paper practices what it preaches. The distinction between avoidable and unavoidable interventions (e.g., Apple-mandated 2FA vs. the agent losing track of credentials) is clearly drawn. However, some analytical judgments are acknowledged as subjective (e.g., classifying the OpenClaw daemon crash as infrastructure rather than agent failure).
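The avoidable/unavoidable distinction can be pictured as a simple tagged log. The sketch below is a minimal illustration under stated assumptions: the `Intervention` type and `summarize` helper are hypothetical, the two entries are examples rather than a transcription of the released CRUX #1 logs, and it reads Apple-mandated 2FA as the unavoidable case and the lost credentials as the avoidable one.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Intervention:
    description: str
    avoidable: bool  # True if the agent could plausibly have proceeded without help


def summarize(interventions):
    """Count interventions by class so a report can state both totals."""
    counts = Counter(
        "avoidable" if item.avoidable else "unavoidable" for item in interventions
    )
    return {"avoidable": counts["avoidable"], "unavoidable": counts["unavoidable"]}


# Illustrative entries only (assumed classification, see the text above).
log = [
    Intervention("Apple-mandated 2FA prompt", avoidable=False),
    Intervention("agent lost track of credentials", avoidable=True),
]
print(summarize(log))
```

Keeping the classification as an explicit field makes subjective calls, such as how to treat an infrastructure crash, visible to readers of the log.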

The survey of prior open-world evaluations (Section 2.3, Appendix C, Table 1) is useful but necessarily incomplete and somewhat informal. The paper does not offer quantitative meta-analysis — it couldn't, given the heterogeneity of the evaluations surveyed.

One concern: the paper's core claim that open-world evaluations provide "early warning of capabilities" rests on a single demonstration. The iOS app task was deliberately chosen to be simple (a breathing exercise app), and the non-coding aspects (navigating Apple's submission process) are well-documented online. Whether this generalizes to genuinely novel or harder deployment scenarios remains untested.

3. Potential Impact

Methodological influence. The paper could meaningfully shape how the AI evaluation community thinks about complementing benchmarks. The gradient framework (Figure 2) is pedagogically effective and could become a standard reference. If CRUX becomes an ongoing project with regular iterations, it could establish norms for a currently ad hoc practice.

Policy relevance. The framing around early warning for policymakers and app store operators is timely. The responsible disclosure to Apple and the observation about potential agent-driven spam submissions are practically relevant. The paper explicitly connects evaluation methodology to governance decisions, which broadens its audience.

For AI developers. The recommendations around log analysis, intervention documentation, and cost reporting could improve the quality of capability claims from labs. Currently, many open-world evaluations (as the paper documents) lack these basic methodological commitments.

Limitations of impact. The paper does not introduce new technical methods for building or improving AI agents. It is primarily a position/methods paper with a single case study. Its influence will depend heavily on whether CRUX continues and whether the community adopts its recommendations.

4. Timeliness & Relevance

The paper is exceptionally timely. The rapid saturation of traditional benchmarks (well-illustrated by Figure 1), the proliferation of ad hoc open-world experiments from labs, and the growing policy stakes of AI capability assessment all create demand for the kind of methodological reflection this paper provides. The observation that Anthropic's Mythos Preview system card acknowledges saturating "most concrete, objectively-scored evaluations" underscores the urgency.

The paper addresses a genuine methodological vacuum: there was no systematic framework or set of norms for the growing class of real-world agent evaluations, and this paper provides one.

5. Strengths & Limitations

Key Strengths:

  • *Crystallizes an emerging practice* into a coherent framework with clear terminology and taxonomy
  • *Practices what it preaches*: CRUX #1 follows the paper's own recommendations, with logs released, interventions documented, and costs reported
  • *Balanced treatment of benchmarks*: does not dismiss them but articulates specific, well-supported reasons they both over- and under-estimate capabilities
  • *Rich qualitative findings* from CRUX #1: the fabricated phone number, the emergent cost optimization (from $35/hr to $3/hr), and the credential-recovery episode are individually interesting observations
  • *Strong author team* spanning academia, government AI safety institutes, and industry, lending credibility and breadth of perspective
  • *Responsible disclosure* to Apple demonstrates ethical practice

Notable Limitations:

  • *Empirical thinness*: a single n=1 experiment on a deliberately simple task provides limited evidence for the paper's broad claims
  • *The recommendations are sensible but not deeply novel*: documenting interventions, running dry runs, and reporting costs are general good practices rather than methodological innovations
  • *No formal framework for comparing or aggregating open-world evaluations*: the paper acknowledges this but doesn't propose even partial solutions
  • *Selection bias in the survey*: the surveyed evaluations are predominantly from well-resourced labs, raising questions about accessibility and inclusivity of this evaluation paradigm
  • *The "early warning" claim is weakly supported*: one successful app submission doesn't establish predictive validity — we'd need longitudinal evidence that open-world evaluation results precede real-world capability diffusion

Additional Observations

    The paper's positioning is strategic: by framing open-world evaluations as complementary rather than competitive with benchmarks, it avoids antagonizing the benchmarking community while still making a case for methodological pluralism. The cost analysis ($25 for development vs. $975 for monitoring) is a useful practical insight. The discussion of evaluation awareness (Section 3.1.1) is thoughtful and reveals a genuine dilemma for capability evaluations.
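To make the cost split concrete, the short calculation below takes the two figures quoted in the cost analysis as $25 for development and $975 for monitoring, and assumes for illustration that these are the only two cost components. The variable names are hypothetical.

```python
# Figures as quoted in the review's cost analysis (USD).
development_usd = 25
monitoring_usd = 975

# Assumption for this sketch: no other cost components.
total_usd = development_usd + monitoring_usd
monitoring_share = monitoring_usd / total_usd

print(f"total: ${total_usd}, monitoring share: {monitoring_share:.1%}")
```

Under that assumption, monitoring accounts for nearly all of the spend, which is why the review singles the split out as a practical insight.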

    The paper reads more as a manifesto and inaugural case study for an ongoing research program than as a self-contained contribution. Its ultimate impact will depend on execution of future CRUX iterations and community adoption.

    Rating: 6.5/10
    Significance 7 · Rigor 5.5 · Novelty 6 · Clarity 8.5

    Generated May 21, 2026

    Comparison History (20)

    vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling
    gpt-5.2 · 5/22/2026

    Paper 2 has higher potential impact because it reframes how frontier AI capability is measured, proposing a broadly applicable evaluation paradigm (open-world evals) that can influence research, policy, and deployment practices across many subfields. Its real-world, long-horizon tasks address a timely gap in benchmark-centric evaluation and can become a standard complement to existing methods (via CRUX). Paper 1 is a solid, method-level contribution to agentic test-time scaling with measurable gains, but its impact is narrower to LLM-agent workflow optimization and may be more incremental.

    vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
    gpt-5.2 · 5/22/2026

    Paper 1 has higher potential impact due to its broader, timelier framing of frontier AI evaluation: it challenges benchmark-centric paradigms and proposes open-world evaluations applicable across domains (agents, safety, governance, deployment readiness). The CRUX proposal and real deployment case study (shipping an iOS app) are likely to influence evaluation practice and policy beyond a single subfield. Paper 2 is methodologically strong with a valuable dataset/metrics for grounded personality perception in MLLMs, but its scope is narrower (social cognition/personality) and impacts fewer adjacent fields.

    vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?
    gemini-3.1 · 5/22/2026

    While Paper 1 provides a rigorous and valuable benchmark for a specific domain (personality perception), Paper 2 addresses a fundamental limitation in how frontier AI capabilities are currently measured across all domains. By advocating for and demonstrating 'open-world evaluations,' Paper 2 has a broader potential impact on AI safety, policy, and general capability tracking, shaping future methodologies for evaluating autonomous agents in real-world scenarios.

    vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?
    gpt-5.2 · 5/22/2026

    Paper 2 likely has higher impact: it contributes a large, standardized, multi-turn, long-horizon benchmark (502 solvable instances, 102 targets) with a public leaderboard, enabling reproducible comparison and sustained community progress in an economically and societally critical domain (drug discovery). Its methodological rigor and concrete task specification make it broadly usable for model training/evaluation and tool-augmented agent research. Paper 1 is timely and conceptually important, but its evidence base is smaller-sample and more qualitative, with less immediately scalable infrastructure for broad adoption.

    vs. Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents
    gemini-3.1 · 5/22/2026

    Paper 1 introduces a highly practical and novel methodological framework for adapting LLM agents without retraining. Its rigorous empirical validation across 18 models and demonstration of high transferability and significant performance gains (88.5% average improvement) offer immediate, broad utility for agent development. Paper 2, while conceptually valuable for AI evaluation and policy, acts more as a position paper with a single qualitative case study, lacking the algorithmic rigor and broad, direct applicability of Paper 1.

    vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
    claude-opus-4.6 · 5/22/2026

    Paper 1 introduces a broadly applicable evaluation paradigm ('open-world evaluations') for frontier AI systems, addressing a fundamental gap in how the AI community measures capabilities. Its conceptual contribution—complementing benchmarks with long-horizon, real-world tasks—has wide relevance across all of AI safety, policy, and capability research. The CRUX project provides an institutional framework for ongoing impact. Paper 2, while technically strong, addresses a narrower domain (EDA/Verilog agents) with a specialized test-time scaling method. Paper 1's breadth of influence across AI evaluation, governance, and multiple application domains gives it higher potential impact.

    vs. PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents
    gpt-5.2 · 5/21/2026

    Paper 1 proposes a concrete, novel system (a persistent, budgeted “context map” cache with explicit modules) and reports substantial efficiency and accuracy gains across tasks, models, and even a production coding agent—suggesting strong real-world applicability and methodological rigor with measurable benchmarks. Its ideas could broadly influence long-context agent design, memory/caching strategies, and tooling for recurring corpora. Paper 2 is timely and valuable conceptually for evaluation practice, but is largely a survey/framework with limited empirical depth (small-sample qualitative case study), making near-term scientific impact less certain.

    vs. MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning
    gemini-3.1 · 5/21/2026

    Paper 2 offers a higher potential scientific impact due to its fundamental algorithmic innovation (MAP framework) and broad applicability across interactive LLM agents. While Paper 1 raises a critical point about AI evaluation, its approach relies on qualitative, small-sample case studies. In contrast, Paper 2 provides a rigorous methodology, introduces a new dataset (MAP-2K), and demonstrates generalizable empirical performance gains on complex benchmarks like ARC-AGI-3. The release of a concrete, plug-and-play paradigm and training dataset will likely spur more direct follow-up technical research and downstream applications.

    vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving
    claude-opus-4.6 · 5/21/2026

    Paper 2 addresses a fundamental challenge in AI evaluation methodology that applies broadly across the entire frontier AI ecosystem, not just one domain. Its proposal of 'open-world evaluations' as a complement to benchmarks is timely given rapid AI progress, has broad cross-field relevance, and introduces a reusable framework (CRUX) for ongoing capability assessment. While Paper 1 makes a solid technical contribution to autonomous driving testing, its scope is narrower. Paper 2's potential to reshape how the community evaluates and anticipates AI capabilities gives it higher estimated impact.

    vs. PALS: Power-Aware LLM Serving for Mixture-of-Experts Models
    gpt-5.2 · 5/21/2026

    Paper 1 likely has higher scientific impact due to a concrete, novel systems contribution (treating GPU power caps as a control variable jointly optimized with batching) with demonstrated, reproducible gains (up to 26.3% energy efficiency, fewer QoS violations) and immediate applicability to widespread LLM serving stacks (integration into vLLM, no retraining). It is timely given datacenter energy constraints and could influence both systems research and production deployments. Paper 2 is important conceptually but is more of a survey/proposal with limited methodological rigor and harder-to-standardize, lower-scalability evaluation evidence.

    vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
    claude-opus-4.6 · 5/21/2026

    DeepWeb-Bench provides a concrete, reproducible benchmark with detailed evaluation of 9 frontier models, revealing actionable insights about failure modes (derivation vs. retrieval bottlenecks, model specialization). It offers immediately usable artifacts (data, rubrics, code) that the research community can adopt. Paper 2 introduces valuable conceptual framing for open-world evaluations but is more of a position/survey paper with a single case study (iOS app deployment). DeepWeb-Bench's empirical rigor, granular error taxonomy, and practical benchmark release give it broader and more immediate scientific utility for advancing deep research systems.

    vs. Probing Embodied LLMs: When Higher Observation Fidelity Hurts Problem Solving
    claude-opus-4.6 · 5/21/2026

    Paper 1 introduces a novel evaluation paradigm (open-world evaluations) for frontier AI systems, addresses a critical gap in AI assessment methodology, and presents a concrete framework (CRUX) with broad implications for AI safety and policy. Its relevance spans the entire AI community and policymakers. Paper 2 presents an interesting counterintuitive finding about observation fidelity in embodied LLMs, but its scope is narrower, focused on a specific puzzle task. Paper 1's timeliness regarding frontier AI governance and its potential to reshape how the field evaluates AI capabilities gives it significantly broader impact.

    vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination
    gpt-5.2 · 5/21/2026

    Paper 2 likely has higher scientific impact due to breadth, timeliness, and real-world relevance: it argues for a new evaluation paradigm (open-world evals) applicable across models, domains, and safety/governance contexts, and proposes an ongoing program (CRUX) that could shape how frontier capabilities are measured. While Paper 1 is more mechanistically novel and methodologically rigorous within MLLM hallucination, its impact is narrower (specific failure mode + intervention) and may be less field-wide than an evaluation framework influencing research, deployment, and policy.

    vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards
    gemini-3.1 · 5/21/2026

    While Paper 1 presents a strong, rigorous technical solution to a specific problem in generative modeling (off-manifold drift), Paper 2 addresses a fundamental and highly timely crisis in AI research: the saturation and limitations of standard benchmarks. By introducing 'open-world evaluations' and demonstrating them on real-world, long-horizon tasks, Paper 2 has the potential for significantly broader impact. It is likely to influence how frontier AI capabilities are measured across the entire field, informing not only technical AI development but also AI safety, policy, and deployment strategies.

    vs. From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)
    gemini-3.1 · 5/21/2026

    Paper 1 offers broader scientific impact by addressing a critical, field-wide bottleneck: the evaluation of frontier AI systems. As AI advances, static benchmarks fail to capture real-world, long-horizon capabilities, making the proposed open-world evaluations highly relevant for AI safety, policy, and capability tracking across all domains. Paper 2 presents a strong, practical application of multi-agent systems in telecommunications, but its impact is narrower and domain-specific compared to the foundational, paradigm-shifting evaluation methodology proposed in Paper 1.

    vs. From Automated to Autonomous: Hierarchical Agent-native Network Architecture (HANA)
    gemini-3.1 · 5/21/2026

    Paper 1 addresses a fundamental and critical challenge in the broader field of AI: evaluating frontier models beyond static benchmarks. Its introduction of open-world evaluations has widespread implications for AI safety, alignment, and policy. Paper 2 presents a strong, practical application of multi-agent systems in telecommunications, but its scope and potential impact are much more domain-specific compared to the foundational AI evaluation issues tackled in Paper 1.

    vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
    gemini-3.1 · 5/21/2026

    Paper 1 addresses a critical, timely issue in frontier AI evaluation by advocating for open-world, real-world tasks to supplement automated benchmarks. Its focus on long-horizon capabilities and safety has profound implications for AI policy, safety, and broad real-world applications. While Paper 2 offers a rigorous benchmark for a specific domain (programmatic video generation), Paper 1's conceptual shift in how we evaluate state-of-the-art AI systems ensures a much broader and more significant impact across the entire AI ecosystem.

    vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
    gemini-3.1 · 5/21/2026

    Paper 2 addresses a fundamental and critical challenge in AI: the limitations of static benchmarks for evaluating frontier models. By proposing a framework for open-world, long-horizon evaluations, it has a broader potential impact across the entire field of AI development and safety compared to Paper 1, which focuses on a more specific niche of programmatic video generation.

    vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
    gemini-3.1 · 5/21/2026

    Paper 2 addresses a critical and universal challenge in modern AI: accurately evaluating frontier models beyond easily gamed static benchmarks. Its proposed open-world evaluation framework has broad implications for AI safety, policy, and capability tracking, influencing how the entire field assesses progress. While Paper 1 offers a valuable technical optimization for LLM training stability, its impact is confined to a specific subfield of systems and optimization, making Paper 2's potential breadth and societal relevance significantly higher.

    vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency
    gemini-3.1 · 5/21/2026

    Paper 2 addresses the critical, high-cost problem of LLM training instability. By introducing a control layer above the optimizer, it demonstrates significant improvements in training efficiency, stability under stress, and final perplexity for large models. This offers immense practical utility and immediate applicability for the AI industry, likely driving broader and more direct scientific and economic impact than the conceptual evaluation framework proposed in Paper 1.