Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Maksim Ivanov, Abhijay Rana

#930 of 2682 · Artificial Intelligence
Tournament Score: 1442±41
Win Rate: 62% (13 wins, 8 losses, 21 matches)
Rating: 7/10 (Significance, Rigor, Novelty, Clarity)

Abstract

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Anchor — Mitigating Artifact Drift in Agent Benchmark Generation

1. Core Contribution

The paper identifies and formalizes a failure mode in agent benchmark construction called artifact drift: inconsistencies between a task's instruction, environment, oracle solution, and verifier that arise when these are created by loosely coupled processes. The authors cite concrete, well-known examples (do-nothing agents passing 38% of τ-bench tasks; SWE-bench test strengthening reranking 40.9% of leaderboard positions) that make this problem tangible and urgent.

The proposed solution, Anchor, compiles all four task artifacts from a single solved constraint-satisfaction specification (using OR-Tools CP-SAT). This is conceptually clean: a domain expert formalizes a workflow as a parametric constraint program, the solver certifies an optimal solution, and deterministic translation layers produce the instruction, environment, oracle, and verifier. Because all artifacts are projections of the same solved specification, inter-artifact consistency is guaranteed by construction rather than post-hoc auditing.
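
The single-source idea can be illustrated with a minimal sketch. Everything in it is a hypothetical stand-in: the toy procurement spec, the supplier names, and the function names are assumptions made for illustration, and an exhaustive search replaces the CP-SAT solver the paper uses. The only point is that the instruction, environment, oracle, and verifier are all derived from one solved specification, so they cannot disagree with one another.

```python
from itertools import product

# Hypothetical parametric spec: buy exactly `demand` units at minimum cost.
SPEC = {
    "demand": 10,
    "suppliers": {
        "Acme": {"unit_cost": 4, "max_units": 6},
        "Bolt": {"unit_cost": 5, "max_units": 10},
    },
}

def total_cost(spec, order):
    return sum(spec["suppliers"][name]["unit_cost"] * qty for name, qty in order.items())

def is_feasible(spec, order):
    suppliers = spec["suppliers"]
    if any(name not in suppliers for name in order):
        return False  # order references a supplier the environment does not have
    within_caps = all(
        0 <= order.get(name, 0) <= info["max_units"] for name, info in suppliers.items()
    )
    return within_caps and sum(order.values()) == spec["demand"]

def solve(spec):
    """Exhaustive search standing in for a real constraint solver."""
    names = list(spec["suppliers"])
    ranges = [range(spec["suppliers"][n]["max_units"] + 1) for n in names]
    best = None
    for quantities in product(*ranges):
        order = dict(zip(names, quantities))
        if is_feasible(spec, order):
            if best is None or total_cost(spec, order) < total_cost(spec, best):
                best = order
    return best

def compile_task(spec):
    """Derive all four artifacts from the same solved specification."""
    oracle = solve(spec)
    optimal_cost = total_cost(spec, oracle)
    limits = ", ".join(
        f"{name} (${info['unit_cost']}/unit, max {info['max_units']})"
        for name, info in spec["suppliers"].items()
    )
    instruction = (
        f"Order exactly {spec['demand']} units at minimum total cost. Suppliers: {limits}."
    )
    environment = {"suppliers": spec["suppliers"]}

    def verifier(final_order):
        # State-based check: only the end state matters, not how it was reached.
        return is_feasible(spec, final_order) and total_cost(spec, final_order) == optimal_cost

    return {
        "instruction": instruction,
        "environment": environment,
        "oracle": oracle,
        "verifier": verifier,
    }

task = compile_task(SPEC)
```

Changing `demand` or a supplier cap in `SPEC` and recompiling yields a new task whose instruction, oracle, and verifier stay in agreement, which is the property the paper attributes to parametric generation.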

The applied output, ERP-Bench, is a 300-task benchmark spanning procurement and manufacturing workflows in Odoo 19, a production-grade ERP system. The benchmark is evaluated across 5 frontier models, 3 harnesses (coding, browser, computer-use), and 18,000 trials.

2. Methodological Rigor

The methodology is well-structured. Several aspects stand out:

  • Formal consistency guarantees: The single-source generation approach is more principled than post-hoc validation. The five end-to-end validity checks (no-op agent scores zero on 300/300 tasks, oracle replay scores full credit on 300/300, LLM judge cross-checks, reward-hacking canary with zero triggers across 16,159 trials, expert spot-checks) provide strong evidence that the pipeline works as claimed.
  • Controlled difficulty: The parametric difficulty recipes produce monotonically decreasing pass rates across easy/medium/hard tiers in every harness, with Spearman correlations linking specific task parameters to realized difficulty. This is a meaningful advance over benchmarks with uncontrolled or post-hoc difficulty labeling.
  • Scale of evaluation: 18,000 trials across 5 models and 3 harnesses is substantial. The feasibility-optimality gap analysis (26.1% constraint satisfaction vs. 17.4% full optimality) reveals a nuanced failure mode that coarser benchmarks would miss.
  • Honest limitations: The paper acknowledges that Anchor doesn't eliminate all errors (the constraint program can encode incomplete logic, renderers can mistranslate), that the domain is limited to terminal-state-verifiable workflows, and that instructions are more explicit than real workplace requests.
However, there are methodological concerns. The 29 workflow patterns required ~40 person-hours from 10 domain experts, but the paper doesn't quantify the engineering effort for CP-SAT formalization per pattern. The claim of scalability rests on scaling instances per pattern, not patterns themselves; the bottleneck is still expert formalization. The expert spot-check involved only 15 tasks across 2 experts, which is a thin validation layer for a 300-task benchmark.
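
Two of the validity checks listed above, the no-op agent and the oracle replay, are simple enough to sketch. The task layout used here (`initial_state`, `oracle_state`, `verifier`) and the helper names are hypothetical and not taken from the paper's released code; the sketch only shows what such an audit tests for.

```python
def audit_tasks(tasks):
    """Return (task_id, reason) pairs for tasks failing either sanity check."""
    failures = []
    for task in tasks:
        # A no-op agent leaves the environment in its initial state.
        noop_reward = task["verifier"](task["initial_state"])
        # Oracle replay applies the certified solution and checks the end state.
        oracle_reward = task["verifier"](task["oracle_state"])
        if noop_reward != 0.0:
            failures.append((task["id"], "no-op agent received reward"))
        if oracle_reward != 1.0:
            failures.append((task["id"], "oracle replay did not earn full credit"))
    return failures

def make_task(task_id, initial_state, oracle_state, target_state):
    """Build a toy task whose verifier compares the end state to a target."""
    def verifier(state):
        return 1.0 if state == target_state else 0.0
    return {
        "id": task_id,
        "initial_state": initial_state,
        "oracle_state": oracle_state,
        "verifier": verifier,
    }

solved = {"orders": [("Acme", 6)]}
empty = {"orders": []}

# Consistent task: the verifier targets the oracle's end state.
good = make_task("po-001", empty, solved, solved)
# Drifted task: the verifier was written against the wrong target.
drifted = make_task("po-002", empty, solved, empty)

report = audit_tasks([good, drifted])
```

In this toy case the drifted task fails both checks at once: doing nothing earns full reward and the oracle earns none, which is the kind of disagreement the review's reported 300/300 results rule out.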

3. Potential Impact

Benchmark methodology: The artifact drift framework provides a useful conceptual vocabulary for the benchmark construction community. The single-source generation principle could be adopted beyond ERP to any domain where workflows are expressible as constraint programs (CRM, HRIS, healthcare intake, logistics). The paper explicitly suggests these extensions.

Enterprise AI evaluation: ERP-Bench fills a genuine gap. Most agent benchmarks target software engineering (SWE-bench) or consumer web tasks (WebArena). Enterprise operations benchmarks are scarce despite the enormous economic value of back-office automation ($2.91T manufacturing GDP cited). The benchmark's grounding in a production-grade ERP system and consultation with practitioners adds ecological validity.

Training data for agent RL: The paper positions Anchor as a potential source of verifiable training curricula for reinforcement learning on enterprise tasks, analogous to AlphaProof's use of formalized mathematical problems. This is forward-looking but not demonstrated in the paper.

Practical findings: The harness comparison reveals that coding agents dramatically outperform browser/computer-use agents on identical tasks (73% vs. 24% vs. 17% pass@5 for GPT-5.5), at 3-14x lower cost. This has implications for how enterprises should deploy agents. The pass-all-5 rate of 9.8% (coding) underscores that frontier agents are far from reliable enough for unattended deployment.
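
The difference between the two reliability metrics quoted above can be made concrete with a short sketch. It assumes that pass@5 counts a task as passed when at least one of its five trials succeeds, and that pass-all-5 requires all five to succeed; the review does not spell out the paper's exact definitions, and the trial data below is invented for illustration.

```python
def pass_at_k(results):
    """Fraction of tasks with at least one successful trial."""
    return sum(any(trials) for trials in results.values()) / len(results)

def pass_all_k(results):
    """Fraction of tasks on which every trial succeeded."""
    return sum(all(trials) for trials in results.values()) / len(results)

# Hypothetical outcomes: five trials per task, True means the verifier passed.
results = {
    "task-a": [True, True, True, True, True],
    "task-b": [False, True, False, False, False],
    "task-c": [False, False, False, False, False],
    "task-d": [True, False, True, True, True],
}

at_least_once = pass_at_k(results)
every_time = pass_all_k(results)
```

Under these assumed definitions, a task such as `task-b` counts toward pass@5 but not toward pass-all-5, which is why the two numbers can diverge as sharply as the 73% and 9.8% figures reported for the coding harness.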

4. Timeliness & Relevance

The paper addresses a current bottleneck: as AI agents move toward economically consequential enterprise workflows, benchmarks must be simultaneously realistic, verifiable, and scalable. The cited failures in existing benchmarks (SWE-bench, τ-bench) demonstrate this is an active pain point. The paper's RLEval '26 venue placement is appropriate.

The constraint-programming approach is timely given the convergence of interest in verifiable rewards for RL training of agents, where benchmark quality directly affects training signal quality.

5. Strengths & Limitations

Key strengths:

  • Clean formalization of the artifact drift problem with a principled solution
  • Strong empirical validation of consistency (zero reward-hacking, perfect oracle replay)
  • Controlled difficulty with predictive parameter-to-difficulty mapping
  • Multi-harness evaluation revealing interface-dependent performance gaps
  • Open-source release of both generator and benchmark
Notable weaknesses:

  • Domain scope: Currently limited to procurement/manufacturing in one ERP system. The generalization to other enterprise domains is claimed but undemonstrated.
  • Formalization bottleneck: The approach still requires substantial expert + engineer effort per workflow pattern. The paper doesn't sufficiently address how this scales.
  • Instruction explicitness: As acknowledged, instructions expose constraints and objectives more explicitly than real workplace requests, limiting ecological validity for downstream deployment assessment.
  • Limited model diversity: Only 5 models are evaluated; the open-weight models perform poorly, making cross-model comparisons somewhat lopsided.
  • No RL training demonstration: The potential for curriculum-based agent training is mentioned but not explored.
6. Additional Observations

The feasibility-optimality gap is one of the most interesting findings: it suggests that current agents fail primarily on constraint adherence rather than optimization, implying that improving business-rule following (not planning quality) is the key bottleneck. The failure analysis (Section H) with concrete examples adds substantial interpretive value.
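
A rough sketch of how such a gap could be measured follows. The trial fields (`constraints_met`, `cost`) and the three outcome labels are hypothetical, and the data is invented; the sketch assumes that every optimal trial is also feasible, so the gap is the share of trials that satisfy the constraints but miss the solver-certified optimum. With the reported figures, that share would be 26.1 − 17.4 = 8.7 percentage points.

```python
from collections import Counter

def classify(trial, optimal_cost):
    """Label a trial by comparing its end state to the certified optimum."""
    if not trial["constraints_met"]:
        return "infeasible"
    if trial["cost"] > optimal_cost:
        return "feasible_suboptimal"
    return "optimal"

def summarize(trials, optimal_cost):
    counts = Counter(classify(trial, optimal_cost) for trial in trials)
    total = len(trials)
    feasible = counts["feasible_suboptimal"] + counts["optimal"]
    return {
        "feasible_rate": feasible / total,
        "optimal_rate": counts["optimal"] / total,
        "gap": counts["feasible_suboptimal"] / total,
    }

# Hypothetical trials on one task whose certified optimal cost is 44.
trials = [
    {"constraints_met": False, "cost": 40},  # cheap, but violates a constraint
    {"constraints_met": True, "cost": 50},   # valid, but more expensive than optimal
    {"constraints_met": True, "cost": 44},   # valid and optimal
    {"constraints_met": False, "cost": 60},
]

summary = summarize(trials, optimal_cost=44)
```

Separating the two failure types is only possible because the optimum is known in advance; a verifier without a certified solution could report feasibility but not the gap.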

The comparison to AlphaProof's approach is apt but somewhat aspirational: AlphaProof's formalization was automated, while Anchor's is manual, which fundamentally limits scaling characteristics.

Rating: 7/10
Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Generated May 27, 2026

Comparison History (21)

    vs. Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher impact due to its novel, generalizable approach to a key bottleneck in agent evaluation: generating realistic, verifiable, non-reward-hackable long-horizon enterprise tasks. Anchor’s constraint-based joint generation of instructions, environments, certified solutions, and verifiers is methodologically rigorous and broadly applicable beyond ERP (any workflow/task benchmark generation). ERP-Bench targets timely, economically relevant agent capabilities and could become a standard for auditable evaluation. Paper 2 is strong and practical, but split inference/compression for privacy and bandwidth is a more incremental extension in a crowded area.

    vs. Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher impact due to broader real-world applicability and timeliness: it provides a generalizable, rigorous pipeline (constraint optimization + solver-certified solutions + state-based verifiers) that directly addresses a key blocker for agent evaluation/training in enterprise settings—artifact drift. The resulting ERP-Bench is a sizable, production-grade, economically relevant benchmark with auditable optimality and controllable difficulty, enabling reliable progress measurement across models and institutions. Paper 1 is novel for faithfulness verification in Agentic XAI, but its impact is more specialized and benchmark gains are narrower.

    vs. Measuring Progress Toward AGI: A Cognitive Framework
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental, widely-discussed challenge—measuring progress toward AGI—by proposing a comprehensive cognitive taxonomy grounded in decades of cognitive science research. Its breadth of impact spans AI, cognitive science, neuroscience, and AI governance/policy, making it relevant to a very large audience. Paper 2, while methodologically rigorous and practically useful, addresses a narrower problem (artifact drift in agent benchmarks for enterprise tasks). Paper 1's framework has greater potential to shape discourse, standardize evaluation across the field, and influence policy, giving it higher estimated scientific impact.

    vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
    gemini-3.1 · 5/28/2026

    Paper 1 presents a method to automate AI model building, which has profound implications for accelerating scientific discovery across multiple disciplines by lowering the barrier to entry for natural scientists. Its evolving, knowledge-enhanced architecture addresses a major limitation in current autonomous AI agents. While Paper 2 offers a rigorous approach to benchmarking enterprise agents, Paper 1's potential to democratize AI development and its state-of-the-art performance on MLE benchmarks give it a significantly broader and deeper scientific impact.

    vs. Cultural Binding Heads in Language Models
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a critical infrastructure problem in AI agent evaluation—artifact drift in benchmarks—with a principled, generalizable pipeline (Anchor) and a concrete benchmark (ERP-Bench) for economically significant enterprise tasks. Its contributions span methodology (constraint optimization for task generation), evaluation infrastructure (300 long-horizon tasks), and empirical findings on frontier model performance. The breadth of applicability to agent benchmarking broadly, combined with real-world enterprise relevance and public release, gives it higher potential impact than Paper 2, which, while offering interesting mechanistic insights into cultural binding heads, addresses a narrower interpretability question with more incremental findings (1-3pp improvements).

    vs. Can LLMs Introspect? A Reality Check
    gemini-3.1 · 5/27/2026

    Paper 1 tackles fundamental theoretical questions about LLM capabilities (introspection and metacognition) and rigorously challenges flawed assumptions in existing evaluation paradigms. By redirecting future research and preventing the field from building on invalid behavioral evidence, it promises a broader and more profound scientific impact across AI, cognitive science, and alignment compared to the valuable but more domain-specific benchmarking methodology presented in Paper 2.

    vs. Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses a fundamental and broadly applicable problem—instance-level tool selection under imperfect conditions—with a novel RL framework (GRPO-based) that has theoretical grounding (Single-Oracle risk gap) and practical implications for safety-critical medical AI. Its contributions (instance-level selection formulation, disagreement-aware synergy learning, entropy-guided sampling) are generalizable beyond medicine. Paper 1, while practically valuable for ERP benchmarking, is more narrowly focused on a specific benchmark generation methodology. Paper 2's methodological innovations and validation across seven benchmarks suggest broader scientific influence.

    vs. How Well Do Models Follow Their Constitutions?
    claude-opus-4.6 · 5/27/2026

    Paper 2 addresses the critical and timely problem of AI alignment auditing—evaluating whether frontier models actually follow their published behavioral specifications. Its multi-method audit pipeline is broadly applicable across labs and model generations, with clear governance implications. The finding that models improve across generations but cluster failures around specific categories (agentic deployments, identity questioning, fabricated claims) has direct relevance for AI safety policy. Paper 1 makes a solid engineering contribution to benchmark generation for enterprise agents, but its scope is narrower (ERP systems) and its impact is more domain-specific. Paper 2's breadth of impact across AI safety, governance, and alignment research gives it higher potential scientific impact.

    vs. Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
    claude-opus-4.6 · 5/27/2026

    Paper 1 introduces a novel, well-formalized framework (Anchor) addressing a clearly defined problem (artifact drift) in AI agent benchmarking, backed by a substantial benchmark (ERP-Bench, 300 tasks) with reproducible methodology. It tackles the timely and high-impact area of AI agent evaluation for enterprise tasks, offering broader methodological contributions applicable beyond ERP. Paper 2 addresses important LLM reliability concerns but presents a more incremental neuro-symbolic verification approach validated on a single domain-specific system, with moderate detection rates and narrower generalizability.

    vs. Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering
    gpt-5.2 · 5/27/2026

    Paper 2 has higher estimated impact due to a broadly applicable, rigorous pipeline for generating verifiable, scalable agent benchmarks with certified solutions and drift-resistant verifiers—addressing a timely bottleneck in agent evaluation and enterprise deployment. Its constraint-optimization formalization, harness-agnostic end-state rewards, and release of a production-grade ERP benchmark/dataset increase real-world adoption and cross-field relevance (agents, benchmarking, program synthesis, optimization, enterprise systems). Paper 1 is novel for controllable counterfactual writing generation, but is narrower in scope and likely more incremental relative to existing controlled decoding/LLM prompting methods.

    vs. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models
    claude-opus-4.6 · 5/27/2026

    PALoRA addresses a fundamental and broadly relevant challenge in LLM fine-tuning—the plasticity-stability dilemma—with both theoretical grounding and empirical validation across multiple models and reasoning benchmarks. Its spectral analysis framework and orthogonality-constrained adaptation method are generalizable contributions to parameter-efficient fine-tuning, impacting a wide research community. Paper 2 introduces a useful benchmark generation pipeline for enterprise agent evaluation, but its scope is narrower (ERP systems, business workflows) and its impact is more domain-specific, limiting its breadth of scientific influence.

    vs. Hypothesis Generation and Inductive Inference in Children and Language Models
    gpt-5.2 · 5/27/2026

    Paper 1 likely has higher scientific impact due to a concrete, scalable solution to a pressing, widely felt problem in agent evaluation: artifact drift in benchmark/task generation. Anchor’s joint generation of instructions, environments, certified optimal solutions, and verifiers is a methodological innovation with strong rigor and immediate real-world applicability (enterprise workflows), plus a sizable released benchmark (ERP-Bench) that can become a community standard. Its impact spans ML evaluation, agent reliability, and enterprise automation. Paper 2 is timely and interesting for cognitive science–LLM comparisons, but its broader downstream tooling/standardization and practical leverage are less direct.

    vs. Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning
    gemini-3.1 · 5/27/2026

    While Paper 1 presents a strong, rigorous approach to improving LLM trustworthiness in the legal domain via formal reasoning, Paper 2 tackles a more broadly applicable and urgent bottleneck: evaluating long-horizon AI agents. By identifying and mitigating 'artifact drift' in benchmark generation, Paper 2's Anchor pipeline offers a scalable, verifiable methodology for agent evaluation across various enterprise and real-world domains, likely leading to wider methodological adoption and impact across the broader AI agent research community.

    vs. Representation Without Control: Testing the Realization Effect in Language Models
    claude-opus-4.6 · 5/27/2026

    Paper 2 introduces a concrete, reusable pipeline (Anchor) addressing a recognized practical problem (artifact drift in agent benchmarks) and releases a substantial benchmark (ERP-Bench) for economically valuable enterprise tasks. This has broader impact: it enables reproducible evaluation of AI agents on real-world business operations, serves multiple research communities, and provides infrastructure others can build upon. Paper 1 offers a thoughtful mechanistic interpretability study with an important null result (representation without causal control), but its scope is narrower, focused on one cognitive effect in one model, limiting its broader influence.

    vs. Automatic Layer Selection for Hallucination Detection
    gpt-5.2 · 5/27/2026

    Paper 2 likely has higher impact due to introducing a general task-generation framework (Anchor) that addresses a fundamental evaluation failure mode (artifact drift) and produces jointly consistent instructions, environments, certified solutions, and verifiers. It yields a substantial new benchmark (ERP-Bench) in a production-grade ERP setting, with clear real-world relevance to enterprise agent deployment and broad applicability to benchmarking, verification, and programmatic environment design. Paper 1 is useful and timely for LLM hallucination detection, but its contribution is narrower and more incremental (layer selection criterion + truncation) with more limited cross-field reach.

    vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation
    claude-opus-4.6 · 5/27/2026

    Paper 2 introduces a novel framework (Anchor) addressing a well-defined problem (artifact drift) in AI agent benchmarking, along with a concrete benchmark (ERP-Bench) for enterprise tasks. It tackles a timely and important challenge as AI agents move into real-world business operations, offers methodological rigor through constraint optimization and solver-certified solutions, and has broader impact potential across agent evaluation research. Paper 1 is primarily an engineering contribution packaging existing techniques (LLM-based entity linking) into a library, with more incremental novelty.

    vs. Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry
    gpt-5.2 · 5/27/2026

    Paper 1 has higher likely scientific impact due to a broadly applicable, novel framework (Anchor) that addresses a general and timely bottleneck in agent evaluation: artifact drift and benchmark unsoundness. Its method jointly generates instructions, environments, certified optimal solutions, and verifiers from a single formal specification—strong methodological rigor and clear potential to influence benchmarking standards across AI/ML, software engineering, and enterprise automation. Paper 2 is valuable but more domain-specific (steel VOCs) and largely integrates existing KG+LLM techniques for a narrower application area.

    vs. FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue
    gpt-5.2 · 5/27/2026

    Paper 1 has higher likely scientific impact due to a more novel methodological contribution (constraint-program “anchor” jointly generating instructions, environments, certified solutions, and verifiers) addressing a timely, cross-cutting problem in agent evaluation: benchmark validity and reward hacking. Its applicability spans many enterprise and agentic domains beyond ERP, and the approach could influence how future benchmarks are constructed and audited. Paper 2 is rigorous and valuable for offshore wind engineering, but its impact is more domain-specific (tabular surrogate benchmarking for FOWT fatigue) and less broadly generalizable across fields.

    vs. ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence
    gemini-3.1 · 5/27/2026

    Paper 2 addresses a critical bottleneck in the highly impactful field of autonomous scientific research: verifiability and hallucination. Its Chain-of-Evidence framework and ScientistOne system have broad, cross-disciplinary implications for accelerating scientific discovery, whereas Paper 1 focuses more narrowly on enterprise business operations and ERP benchmarking. The ability to automate rigorous, verifiable research inherently possesses a broader and more transformative potential scientific impact.

    vs. A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks
    gpt-5.2 · 5/27/2026

    Paper 1 has higher potential impact due to a more novel, generalizable methodology: a constraint-optimization “single source of truth” that jointly generates instructions, environments, certified solutions, and verifiers, directly addressing a pervasive benchmarking failure mode (artifact drift). This improves rigor and auditability and can transfer beyond ERP to other agentic domains needing verifiable long-horizon evaluation. ERP-Bench targets economically important enterprise workflows and provides controllable difficulty plus solver-certified optimality, likely influencing both research and industry evaluation practices. Paper 2 is valuable but narrower: mainly a dataset/benchmark limited to four conditions and a specific sentence-selection task.