Anchor: Mitigating Artifact Drift in Agent Benchmark Generation
Maksim Ivanov, Abhijay Rana
Abstract
AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.
AI Impact Assessments
Scientific Impact Assessment: "Anchor: Mitigating Artifact Drift in Agent Benchmark Generation"
1. Core Contribution
The paper identifies and formalizes a failure mode in agent benchmark construction called artifact drift: inconsistencies between a task's instruction, environment, oracle solution, and verifier that arise when these are created by loosely coupled processes. The authors cite concrete, well-known examples (do-nothing agents passing 38% of τ-bench tasks; SWE-bench test strengthening reranking 40.9% of leaderboard positions) that make this problem tangible and urgent.
The proposed solution, Anchor, compiles all four task artifacts from a single solved constraint-satisfaction specification (using OR-Tools CP-SAT). This is conceptually clean: a domain expert formalizes a workflow as a parametric constraint program, the solver certifies an optimal solution, and deterministic translation layers produce the instruction, environment, oracle, and verifier. Because all artifacts are projections of the same solved specification, inter-artifact consistency is guaranteed by construction rather than post-hoc auditing.
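The single-source idea can be illustrated with a toy sketch. Everything here is hypothetical and is not the authors' code: the `Spec` fields, the vendor data, and the `compile_task` helper are invented for illustration, and a brute-force search stands in for the OR-Tools CP-SAT solve so that the example needs only the standard library. The point is only that the instruction, environment, oracle, and verifier are all derived from one solved specification, so they cannot disagree about what the task requires.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Spec:
    """Hypothetical parametric spec: buy `demand` units from capacity-limited vendors."""
    demand: int
    unit_price: dict  # vendor name -> price per unit
    capacity: dict    # vendor name -> maximum units available (same keys as unit_price)


def solve(spec):
    """Exhaustive search standing in for a CP-SAT solve; returns (min_cost, plan)."""
    vendors = sorted(spec.unit_price)
    best = None
    for qtys in product(*(range(spec.capacity[v] + 1) for v in vendors)):
        if sum(qtys) != spec.demand:
            continue  # violates the demand constraint
        cost = sum(q * spec.unit_price[v] for v, q in zip(vendors, qtys))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(vendors, qtys)))
    if best is None:
        raise ValueError("specification is infeasible")
    return best


def compile_task(spec):
    """Derive all four artifacts from the one solved specification."""
    optimal_cost, plan = solve(spec)
    vendors = sorted(spec.unit_price)
    offers = "; ".join(
        f"{v}: {spec.unit_price[v]} per unit, at most {spec.capacity[v]} units"
        for v in vendors
    )
    instruction = (
        f"Purchase exactly {spec.demand} units at minimum total cost. "
        f"Vendor offers: {offers}."
    )
    environment = {
        v: {"price": spec.unit_price[v], "capacity": spec.capacity[v]} for v in vendors
    }

    def verifier(end_state):
        """Grade only the final purchased quantities, not the path taken."""
        known = all(v in spec.capacity for v in end_state)
        feasible = (
            known
            and sum(end_state.values()) == spec.demand
            and all(0 <= q <= spec.capacity[v] for v, q in end_state.items())
        )
        cost = sum(q * spec.unit_price[v] for v, q in end_state.items()) if feasible else None
        return {"feasible": feasible, "optimal": feasible and cost == optimal_cost}

    return {
        "instruction": instruction,
        "environment": environment,
        "oracle": plan,
        "verifier": verifier,
    }


task = compile_task(Spec(demand=10, unit_price={"A": 5, "B": 4}, capacity={"A": 10, "B": 6}))
print(task["instruction"])
print(task["oracle"])                        # solver-certified plan
print(task["verifier"]({"A": 10, "B": 0}))   # feasible but more expensive than the oracle
```

Changing `demand`, prices, or capacities yields a new task whose oracle and verifier are regenerated together. The verifier reports feasibility and optimality separately, mirroring the distinction the abstract draws between satisfying explicit constraints and reaching a fully optimal solution.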
The applied output, ERP-Bench, is a 300-task benchmark spanning procurement and manufacturing workflows in Odoo 19, a production-grade ERP system. The benchmark is evaluated across 5 frontier models, 3 harnesses (coding, browser, computer-use), and 18,000 trials.
2. Methodological Rigor
The methodology is well-structured. Several aspects stand out:
However, there are methodological concerns. The 29 workflow patterns required ~40 person-hours from 10 domain experts, but the paper does not quantify the engineering effort for CP-SAT formalization per pattern. The claim of scalability rests on scaling instances per pattern, not the number of patterns, so expert formalization remains the bottleneck. The expert spot-check covered only 15 tasks across 2 experts, a thin validation layer for a 300-task benchmark.
3. Potential Impact
Benchmark methodology: The artifact drift framework provides a useful conceptual vocabulary for the benchmark construction community. The single-source generation principle could be adopted beyond ERP to any domain where workflows are expressible as constraint programs (CRM, HRIS, healthcare intake, logistics). The paper explicitly suggests these extensions.
Enterprise AI evaluation: ERP-Bench fills a genuine gap. Most agent benchmarks target software engineering (SWE-bench) or consumer web tasks (WebArena). Enterprise operations benchmarks are scarce despite the enormous economic value of back-office automation ($2.91T manufacturing GDP cited). The benchmark's grounding in a production-grade ERP system and consultation with practitioners adds ecological validity.
Training data for agent RL: The paper positions Anchor as a potential source of verifiable training curricula for reinforcement learning on enterprise tasks, analogous to AlphaProof's use of formalized mathematical problems. This is forward-looking but not demonstrated in the paper.
Practical findings: The harness comparison reveals that coding agents dramatically outperform browser/computer-use agents on identical tasks (73% vs. 24% vs. 17% pass@5 for GPT-5.5), at 3-14x lower cost. This has implications for how enterprises should deploy agents. The pass-all-5 rate of 9.8% (coding) underscores that frontier agents are far from reliable enough for unattended deployment.
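The two reliability metrics quoted above can be made concrete with a small sketch. It assumes that pass@5 means "at least one of the 5 trials on a task passes" and that pass-all-5 means "every one of the 5 trials passes"; the review does not spell out the definitions, so this reading is an assumption. The function names and the trial data are made up for illustration and are not results from the paper.

```python
def pass_at_k(trials_by_task):
    """Fraction of tasks where at least one trial passed."""
    return sum(any(results) for results in trials_by_task.values()) / len(trials_by_task)


def pass_all_k(trials_by_task):
    """Fraction of tasks where every trial passed."""
    return sum(all(results) for results in trials_by_task.values()) / len(trials_by_task)


# Made-up outcomes: 4 tasks, 5 trials each (True = verifier accepted the end state).
trials = {
    "task_a": [True, True, True, True, True],
    "task_b": [False, False, True, False, False],
    "task_c": [False, False, False, False, False],
    "task_d": [True, False, True, True, False],
}

print(pass_at_k(trials))   # 0.75: tasks a, b, and d each have at least one pass
print(pass_all_k(trials))  # 0.25: only task a passes every trial
```

The gap between the two numbers is what matters for deployment: a high pass@k shows the agent can solve a task, while a low pass-all-k shows it does not do so consistently.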
4. Timeliness & Relevance
The paper addresses a current bottleneck: as AI agents move toward economically consequential enterprise workflows, benchmarks must be simultaneously realistic, verifiable, and scalable. The cited failures in existing benchmarks (SWE-bench, τ-bench) demonstrate this is an active pain point. The paper's RLEval '26 venue placement is appropriate.
The constraint-programming approach is timely given the convergence of interest in verifiable rewards for RL training of agents, where benchmark quality directly affects training signal quality.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
6. Additional Observations
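The feasibility-optimality gap is one of the most interesting findings. In the abstract's figures, frontier models satisfy explicit task constraints in 26.1% of trials and reach a fully optimal solution in 17.4%, so most unsuccessful trials already violate a constraint, and only about 8.7 percentage points of trials are feasible but suboptimal. This suggests that current agents fail primarily on constraint adherence rather than optimization, implying that improving business-rule following (not planning quality) is the key bottleneck. The failure analysis (Section H) with concrete examples adds substantial interpretive value.
The comparison to AlphaProof's approach is apt but somewhat aspirational: AlphaProof's formalization was automated, while Anchor's is manual, which limits how far the pipeline can scale without additional expert effort.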
Generated May 27, 2026
Comparison History (21)
Paper 1 likely has higher impact due to its novel, generalizable approach to a key bottleneck in agent evaluation: generating realistic, verifiable, non-reward-hackable long-horizon enterprise tasks. Anchor’s constraint-based joint generation of instructions, environments, certified solutions, and verifiers is methodologically rigorous and broadly applicable beyond ERP (any workflow/task benchmark generation). ERP-Bench targets timely, economically relevant agent capabilities and could become a standard for auditable evaluation. Paper 2 is strong and practical, but split inference/compression for privacy and bandwidth is a more incremental extension in a crowded area.
Paper 2 likely has higher impact due to broader real-world applicability and timeliness: it provides a generalizable, rigorous pipeline (constraint optimization + solver-certified solutions + state-based verifiers) that directly addresses a key blocker for agent evaluation/training in enterprise settings—artifact drift. The resulting ERP-Bench is a sizable, production-grade, economically relevant benchmark with auditable optimality and controllable difficulty, enabling reliable progress measurement across models and institutions. Paper 1 is novel for faithfulness verification in Agentic XAI, but its impact is more specialized and benchmark gains are narrower.
Paper 1 addresses a fundamental, widely-discussed challenge—measuring progress toward AGI—by proposing a comprehensive cognitive taxonomy grounded in decades of cognitive science research. Its breadth of impact spans AI, cognitive science, neuroscience, and AI governance/policy, making it relevant to a very large audience. Paper 2, while methodologically rigorous and practically useful, addresses a narrower problem (artifact drift in agent benchmarks for enterprise tasks). Paper 1's framework has greater potential to shape discourse, standardize evaluation across the field, and influence policy, giving it higher estimated scientific impact.
Paper 1 presents a method to automate AI model building, which has profound implications for accelerating scientific discovery across multiple disciplines by lowering the barrier to entry for natural scientists. Its evolving, knowledge-enhanced architecture addresses a major limitation in current autonomous AI agents. While Paper 2 offers a rigorous approach to benchmarking enterprise agents, Paper 1's potential to democratize AI development and its state-of-the-art performance on MLE benchmarks give it a significantly broader and deeper scientific impact.
Paper 1 addresses a critical infrastructure problem in AI agent evaluation—artifact drift in benchmarks—with a principled, generalizable pipeline (Anchor) and a concrete benchmark (ERP-Bench) for economically significant enterprise tasks. Its contributions span methodology (constraint optimization for task generation), evaluation infrastructure (300 long-horizon tasks), and empirical findings on frontier model performance. The breadth of applicability to agent benchmarking broadly, combined with real-world enterprise relevance and public release, gives it higher potential impact than Paper 2, which, while offering interesting mechanistic insights into cultural binding heads, addresses a narrower interpretability question with more incremental findings (1-3pp improvements).
Paper 1 tackles fundamental theoretical questions about LLM capabilities (introspection and metacognition) and rigorously challenges flawed assumptions in existing evaluation paradigms. By redirecting future research and preventing the field from building on invalid behavioral evidence, it promises a broader and more profound scientific impact across AI, cognitive science, and alignment compared to the valuable but more domain-specific benchmarking methodology presented in Paper 2.
Paper 2 addresses a fundamental and broadly applicable problem—instance-level tool selection under imperfect conditions—with a novel RL framework (GRPO-based) that has theoretical grounding (Single-Oracle risk gap) and practical implications for safety-critical medical AI. Its contributions (instance-level selection formulation, disagreement-aware synergy learning, entropy-guided sampling) are generalizable beyond medicine. Paper 1, while practically valuable for ERP benchmarking, is more narrowly focused on a specific benchmark generation methodology. Paper 2's methodological innovations and validation across seven benchmarks suggest broader scientific influence.
Paper 2 addresses the critical and timely problem of AI alignment auditing—evaluating whether frontier models actually follow their published behavioral specifications. Its multi-method audit pipeline is broadly applicable across labs and model generations, with clear governance implications. The finding that models improve across generations but cluster failures around specific categories (agentic deployments, identity questioning, fabricated claims) has direct relevance for AI safety policy. Paper 1 makes a solid engineering contribution to benchmark generation for enterprise agents, but its scope is narrower (ERP systems) and its impact is more domain-specific. Paper 2's breadth of impact across AI safety, governance, and alignment research gives it higher potential scientific impact.
Paper 1 introduces a novel, well-formalized framework (Anchor) addressing a clearly defined problem (artifact drift) in AI agent benchmarking, backed by a substantial benchmark (ERP-Bench, 300 tasks) with reproducible methodology. It tackles the timely and high-impact area of AI agent evaluation for enterprise tasks, offering broader methodological contributions applicable beyond ERP. Paper 2 addresses important LLM reliability concerns but presents a more incremental neuro-symbolic verification approach validated on a single domain-specific system, with moderate detection rates and narrower generalizability.
Paper 2 has higher estimated impact due to a broadly applicable, rigorous pipeline for generating verifiable, scalable agent benchmarks with certified solutions and drift-resistant verifiers—addressing a timely bottleneck in agent evaluation and enterprise deployment. Its constraint-optimization formalization, harness-agnostic end-state rewards, and release of a production-grade ERP benchmark/dataset increase real-world adoption and cross-field relevance (agents, benchmarking, program synthesis, optimization, enterprise systems). Paper 1 is novel for controllable counterfactual writing generation, but is narrower in scope and likely more incremental relative to existing controlled decoding/LLM prompting methods.
PALoRA addresses a fundamental and broadly relevant challenge in LLM fine-tuning—the plasticity-stability dilemma—with both theoretical grounding and empirical validation across multiple models and reasoning benchmarks. Its spectral analysis framework and orthogonality-constrained adaptation method are generalizable contributions to parameter-efficient fine-tuning, impacting a wide research community. Paper 2 introduces a useful benchmark generation pipeline for enterprise agent evaluation, but its scope is narrower (ERP systems, business workflows) and its impact is more domain-specific, limiting its breadth of scientific influence.
Paper 1 likely has higher scientific impact due to a concrete, scalable solution to a pressing, widely felt problem in agent evaluation: artifact drift in benchmark/task generation. Anchor’s joint generation of instructions, environments, certified optimal solutions, and verifiers is a methodological innovation with strong rigor and immediate real-world applicability (enterprise workflows), plus a sizable released benchmark (ERP-Bench) that can become a community standard. Its impact spans ML evaluation, agent reliability, and enterprise automation. Paper 2 is timely and interesting for cognitive science–LLM comparisons, but its broader downstream tooling/standardization and practical leverage are less direct.
While Paper 1 presents a strong, rigorous approach to improving LLM trustworthiness in the legal domain via formal reasoning, Paper 2 tackles a more broadly applicable and urgent bottleneck: evaluating long-horizon AI agents. By identifying and mitigating 'artifact drift' in benchmark generation, Paper 2's Anchor pipeline offers a scalable, verifiable methodology for agent evaluation across various enterprise and real-world domains, likely leading to wider methodological adoption and impact across the broader AI agent research community.
Paper 2 introduces a concrete, reusable pipeline (Anchor) addressing a recognized practical problem (artifact drift in agent benchmarks) and releases a substantial benchmark (ERP-Bench) for economically valuable enterprise tasks. This has broader impact: it enables reproducible evaluation of AI agents on real-world business operations, serves multiple research communities, and provides infrastructure others can build upon. Paper 1 offers a thoughtful mechanistic interpretability study with an important null result (representation without causal control), but its scope is narrower, focused on one cognitive effect in one model, limiting its broader influence.
Paper 2 likely has higher impact due to introducing a general task-generation framework (Anchor) that addresses a fundamental evaluation failure mode (artifact drift) and produces jointly consistent instructions, environments, certified solutions, and verifiers. It yields a substantial new benchmark (ERP-Bench) in a production-grade ERP setting, with clear real-world relevance to enterprise agent deployment and broad applicability to benchmarking, verification, and programmatic environment design. Paper 1 is useful and timely for LLM hallucination detection, but its contribution is narrower and more incremental (layer selection criterion + truncation) with more limited cross-field reach.
Paper 2 introduces a novel framework (Anchor) addressing a well-defined problem (artifact drift) in AI agent benchmarking, along with a concrete benchmark (ERP-Bench) for enterprise tasks. It tackles a timely and important challenge as AI agents move into real-world business operations, offers methodological rigor through constraint optimization and solver-certified solutions, and has broader impact potential across agent evaluation research. Paper 1 is primarily an engineering contribution packaging existing techniques (LLM-based entity linking) into a library, with more incremental novelty.
Paper 1 has higher likely scientific impact due to a broadly applicable, novel framework (Anchor) that addresses a general and timely bottleneck in agent evaluation: artifact drift and benchmark unsoundness. Its method jointly generates instructions, environments, certified optimal solutions, and verifiers from a single formal specification—strong methodological rigor and clear potential to influence benchmarking standards across AI/ML, software engineering, and enterprise automation. Paper 2 is valuable but more domain-specific (steel VOCs) and largely integrates existing KG+LLM techniques for a narrower application area.
Paper 1 has higher likely scientific impact due to a more novel methodological contribution (constraint-program “anchor” jointly generating instructions, environments, certified solutions, and verifiers) addressing a timely, cross-cutting problem in agent evaluation: benchmark validity and reward hacking. Its applicability spans many enterprise and agentic domains beyond ERP, and the approach could influence how future benchmarks are constructed and audited. Paper 2 is rigorous and valuable for offshore wind engineering, but its impact is more domain-specific (tabular surrogate benchmarking for FOWT fatigue) and less broadly generalizable across fields.
Paper 2 addresses a critical bottleneck in the highly impactful field of autonomous scientific research: verifiability and hallucination. Its Chain-of-Evidence framework and ScientistOne system have broad, cross-disciplinary implications for accelerating scientific discovery, whereas Paper 1 focuses more narrowly on enterprise business operations and ERP benchmarking. The ability to automate rigorous, verifiable research inherently possesses a broader and more transformative potential scientific impact.
Paper 1 has higher potential impact due to a more novel, generalizable methodology: a constraint-optimization “single source of truth” that jointly generates instructions, environments, certified solutions, and verifiers, directly addressing a pervasive benchmarking failure mode (artifact drift). This improves rigor and auditability and can transfer beyond ERP to other agentic domains needing verifiable long-horizon evaluation. ERP-Bench targets economically important enterprise workflows and provides controllable difficulty plus solver-certified optimality, likely influencing both research and industry evaluation practices. Paper 2 is valuable but narrower: mainly a dataset/benchmark limited to four conditions and a specific sentence-selection task.