Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song

#150 of 2292 · Artificial Intelligence
Tournament Score: 1530±46
Win Rate: 89% (17 wins, 2 losses, 19 matches)
Rating: 7.8/10

Abstract

Agent benchmarks have become the de facto measure of frontier AI competence, guiding model selection, investment, and deployment. However, reward hacking, where agents maximize a score without performing the intended task, emerges spontaneously in frontier models without overfitting. We argue that benchmarks must be secure by design. From past incidents of reward hacks, we derive a taxonomy of eight recurring flaw patterns and compile them into the Agent-Eval Checklist for benchmark designers. We condense the insights into BenchJack, an automated red-teaming system that drives coding agents to audit benchmarks and identify possible reward-hacking exploits in a clairvoyant manner. Moreover, we extend BenchJack to an iterative generative-adversarial pipeline that discovers new flaws and patches them iteratively to improve benchmark robustness. We apply BenchJack to 10 popular agent benchmarks spanning software engineering, web navigation, desktop computing, and terminal operations. BenchJack synthesizes reward-hacking exploits that achieve near-perfect scores on most of the benchmarks without solving a single task, surfacing 219 distinct flaws across the eight classes. Moreover, BenchJack's extended pipeline reduces the hackable-task ratio from near 100% to under 10% on four benchmarks without fatal design flaws, fully patching WebArena and OSWorld within three iterations. Our results show that evaluation pipelines have not internalized an adversarial mindset, and that proactive auditing could help close the security gap for the fast-paced benchmarking space.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: BenchJack

1. Core Contribution

BenchJack addresses a critical and timely problem: the systematic vulnerability of AI agent benchmarks to reward hacking. The paper makes three interlinked contributions: (1) a taxonomy of eight recurring flaw classes (V1–V8) distilled from real-world reward-hacking incidents, (2) an automated red-teaming system (BenchJack) that discovers exploits without human supervision, and (3) an iterative generative-adversarial pipeline that patches discovered flaws. The key insight is shifting from reactive detection (monitoring agent trajectories post-hoc) to proactive auditing (scanning benchmark infrastructure before deployment). This reframes benchmark integrity as a security engineering problem rather than a statistical one—a genuinely novel perspective.
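The audit-then-patch loop described above can be pictured with a toy model. The sketch below is not BenchJack's implementation: `Benchmark`, `audit`, `patch`, and `harden` are hypothetical names, the audit step is a stand-in that reports one open flaw per task instead of driving a coding agent, and the flaw labels are placeholders. It only illustrates how a hackable-task ratio can be tracked across rounds until no exploit is found.

```python
from dataclasses import dataclass, field


@dataclass
class Benchmark:
    # Hypothetical model: task name -> set of flaw labels still open on that task.
    open_flaws: dict[str, set[str]] = field(default_factory=dict)

    def hackable_ratio(self) -> float:
        """Fraction of tasks that still have at least one open flaw."""
        if not self.open_flaws:
            return 0.0
        hackable = sum(1 for flaws in self.open_flaws.values() if flaws)
        return hackable / len(self.open_flaws)


def audit(bench: Benchmark) -> dict[str, str]:
    """Stand-in for the red-teaming step: report one open flaw per hackable task."""
    return {task: min(flaws) for task, flaws in bench.open_flaws.items() if flaws}


def patch(bench: Benchmark, findings: dict[str, str]) -> None:
    """Stand-in for the patching step: close every flaw the audit reported."""
    for task, flaw in findings.items():
        bench.open_flaws[task].discard(flaw)


def harden(bench: Benchmark, max_rounds: int = 5) -> list[float]:
    """Alternate audit and patch; return the hackable ratio after each round."""
    history = [bench.hackable_ratio()]
    for _ in range(max_rounds):
        findings = audit(bench)
        if not findings:  # no exploit found: stop iterating
            break
        patch(bench, findings)
        history.append(bench.hackable_ratio())
    return history
```

For a four-task toy benchmark with open flaws `{"t1": {"V1", "V3"}, "t2": {"V2"}, "t3": set(), "t4": {"V1"}}`, `harden` returns `[0.75, 0.25, 0.0]`: three of four tasks are hackable at the start, one remains after the first round, and none after the second. A real audit gives no such guarantee, since an auditor can miss flaws or a patch can introduce new ones.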

2. Methodological Rigor

The methodology is generally sound but has notable caveats. The taxonomy derivation is empirical rather than formal—drawn from manual inspection of known incidents—which risks incompleteness, though the authors acknowledge this. The eight flaw classes are well-defined with concrete examples and map cleanly to security concepts (trust boundaries, isolation, privilege escalation).

The experimental design is thorough: 10 benchmarks spanning 4 domains, with detailed per-benchmark exploit descriptions (Appendix E) providing full reproducibility. The exploits are verified end-to-end through official benchmark entry points, not custom harnesses, which strengthens validity. The iterative patching study (Section 5.3) is well-designed, showing convergence behavior across rounds.

However, there are methodological concerns. BenchJack operates in a "clairvoyant" setting: the auditing agent has full access to benchmark source code, which agents under evaluation may not have. The paper acknowledges this limitation but does not explore in depth whether frontier models could independently discover these exploits during normal runs. The gap between "exploitable in principle by a code-auditing agent" and "spontaneously exploited by an agent under evaluation" is significant. The authors cite external evidence (METR's findings, Anthropic's Mythos Preview) but do not bridge this gap empirically themselves.

The patching study reveals an important asymmetry: patches that work (WebArena, OSWorld reaching 0% hack rate) share strong architectural properties (separate trust domains), while fundamentally flawed designs resist code-only fixes. This finding is arguably more valuable than the attacks themselves, as it identifies the structural invariant that determines patchability.

3. Potential Impact

Immediate practical impact: High. The Agent-Eval Checklist provides actionable guidance for benchmark designers today. The finding that 9/10 major benchmarks are hackable to near-perfect scores is alarming and should motivate immediate remediation. The open-source release of BenchJack enables the community to audit new benchmarks proactively.

Field influence: The paper could catalyze a paradigm shift in how benchmarks are designed and evaluated. The security-first framing—treating benchmark evaluation as an adversarial system requiring trust boundaries, isolation, and least privilege—is transferable to any evaluation infrastructure. The taxonomy provides shared vocabulary for discussing benchmark vulnerabilities.

Adjacent field impact: The work connects to AI safety (reward hacking transferring to deployment), software security (the taxonomy mirrors OWASP-style vulnerability classifications), and ML engineering practices. The iterative adversarial improvement pipeline echoes fuzzing methodologies in software testing.

Industry relevance: Given that benchmark scores drive investment decisions, model selection, and deployment choices, demonstrating that scores are trivially gameable has direct economic and safety implications.

4. Timeliness & Relevance

This paper arrives at a critical moment. The explosion of agent benchmarks (hundreds in two years), combined with documented cases of spontaneous reward hacking by frontier models (METR, Anthropic), creates urgent demand for systematic auditing tools. The paper correctly identifies that manual auditing cannot scale with the pace of benchmark creation. The 2026 publication date (based on references) places it squarely when agent benchmarks are becoming primary evaluation instruments for frontier AI capabilities.

5. Strengths & Limitations

Key strengths:

  • Comprehensiveness: 10 benchmarks, 219 flaws, 8 flaw classes, full exploit code, and iterative patching—the scope is exceptional for a single paper.
  • Actionable outputs: The checklist, taxonomy, and open-source tool provide immediate utility beyond academic contribution.
  • Detailed appendices: The 48-page paper includes complete exploit listings, patch descriptions, bypass analyses, and prompt templates, enabling full reproducibility.
  • Architectural insight: The finding that patchability depends on whether "the bytes the grader reads are produced under a different uid/process/container than the agent" is a crisp, actionable design principle.
  • Honest disclosure of limitations: The patching study clearly shows where patches fail and why, rather than overselling results.
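The architectural principle quoted in the strengths above (grader inputs produced under a different uid than the agent) can be approximated with a file-ownership check. This is a minimal sketch under stated assumptions, not a check taken from the paper: `grader_input_is_isolated` is a hypothetical helper, it relies on POSIX ownership and permission bits, and it covers only the uid part of the principle, not process or container separation.

```python
import os
import stat


def grader_input_is_isolated(path: str, agent_uid: int) -> bool:
    """Return True if `path` is not owned by the agent and is not group- or world-writable.

    Assumption: the grader reads `path` and the agent ran as `agent_uid`.
    A True result does not prove isolation (a privileged agent could still
    write the file); a False result is a warning sign.
    """
    info = os.stat(path)
    owned_by_agent = info.st_uid == agent_uid
    loosely_writable = bool(info.st_mode & (stat.S_IWGRP | stat.S_IWOTH))
    return not owned_by_agent and not loosely_writable
```

A grader could run such a check on each result file before scoring and refuse to score files the agent owns or could have overwritten. Stronger designs avoid sharing a filesystem between agent and grader at all.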

Notable weaknesses:

  • Clairvoyant assumption: The auditing agent has full source code access. Real-world reward hacking must be discovered without this advantage, and the paper doesn't quantify the difficulty gap.
  • Single backend dependency: All experiments use Claude Code. The generalizability to other coding agents is untested.
  • Taxonomy completeness: The eight classes are empirically derived and may miss novel attack vectors not yet observed in the wild.
  • Cost and scalability: The paper notes BenchJack "can be costly for bigger benchmarks" but provides no cost analysis or scaling characterization.
  • Dual-use concerns: While acknowledged, the detailed exploit recipes could enable malicious leaderboard manipulation before patches are deployed.
  • No formal guarantees: The iterative pipeline converges empirically but has no theoretical convergence or completeness guarantees.

Overall Assessment

    This is a high-impact systems paper that reframes benchmark integrity as a security problem and provides both conceptual tools (taxonomy, checklist) and practical tools (BenchJack) to address it. The breadth of evaluation and depth of exploit analysis are impressive. The main limitation—the gap between clairvoyant auditing and realistic spontaneous exploitation—somewhat bounds the immediate safety implications but does not diminish the practical value for benchmark designers.

    Rating: 7.8/10
    Significance 8.5 · Rigor 7 · Novelty 7.5 · Clarity 8

    Generated May 14, 2026

    Comparison History (19)

    vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows
    gpt-5.2 · 5/21/2026

    Paper 2 likely has higher impact: it targets a field-wide bottleneck—benchmark validity under reward hacking—affecting how progress is measured across many agent domains. It contributes a general taxonomy, a practical checklist, and an automated red-teaming/auditing system validated on 10 major benchmarks with extensive empirical findings (219 flaws) and demonstrated mitigation (patching pipelines, large reductions in hackability). This is timely and broadly applicable to AI evaluation, safety, and deployment. Paper 1 is innovative for workflow synthesis but is narrower and more application-specific.

    vs. COSMO-Agent: Tool-Augmented Agent for Closed-loop Optimization, Simulation, and Modeling Orchestration
    gemini-3.1 · 5/16/2026

    Paper 1 addresses a critical and universal issue in modern AI development: the vulnerability and reliability of agent benchmarks against reward hacking. By providing an automated framework to audit and patch these benchmarks, it has a profound, cross-disciplinary impact on how frontier AI models are evaluated. Paper 2, while offering a strong methodological contribution to CAD-CAE optimization, has a significantly narrower scope restricted to industrial design, making Paper 1's broader implications for AI safety and evaluation more impactful.

    vs. Primal-Dual Guided Decoding for Constrained Discrete Diffusion
    claude-opus-4.6 · 5/16/2026

    Paper 1 addresses a critical and timely problem—the integrity of AI agent benchmarks that guide major decisions in model development and deployment. Its systematic taxonomy of flaw patterns, automated red-teaming framework (BenchJack), and demonstration across 10 popular benchmarks with 219 discovered flaws have broad implications for the entire AI evaluation ecosystem. The adversarial auditing approach and practical patches represent a paradigm shift in benchmark design. Paper 2, while methodologically sound and novel in applying primal-dual optimization to discrete diffusion, addresses a more specialized technical problem with narrower immediate impact.

    vs. When Context Hurts: The Crossover Effect of Knowledge Transfer on Multi-Agent Design Exploration
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a critical systemic vulnerability in AI evaluation: reward hacking in frontier benchmarks. By exposing severe flaws in 10 major benchmarks and providing an automated red-teaming pipeline (BenchJack) to identify and patch them, it secures the foundation of how AI progress is measured. While Paper 1 offers valuable insights into agent context and RAG, Paper 2's impact is broader and more fundamental, as the entire AI community relies on robust benchmarks to validate model capabilities and ensure safe deployment.

    vs. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher scientific impact due to strong timeliness and broad applicability: benchmark validity under reward hacking affects nearly all agent-evaluation work and downstream deployment decisions. It contributes a general taxonomy, a practical automated auditing system, and an iterative adversarial patching pipeline, demonstrated across 10 widely used benchmarks with large empirical findings (219 flaws) and measurable robustness improvements. This combination of conceptual framework + tool + multi-domain evaluation suggests wide adoption and immediate influence across ML evaluation, AI safety, and systems. Paper 1 is innovative but more domain-specific (biomedical KG reasoning) and likely narrower in uptake.

    vs. Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
    gpt-5.2 · 5/16/2026

    Paper 2 offers a mechanistic, unifying theory (attractor-basin geometry) connecting conflict and hallucination, plus a new internal metric (geometric margin) that outperforms entropy and appears to generalize from a controlled causal setup to natural queries, with an explicit scaling law. This combination of conceptual novelty, methodological rigor, and broad relevance to interpretability, safety, and monitoring across many transformer uses suggests wide cross-field impact. Paper 1 is highly practical and timely for benchmark security, but is narrower in scope and more engineering/auditing-oriented.

    vs. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel
    gemini-3.1 · 5/16/2026

    Paper 1 addresses a fundamental assumption in modern AI: whether Chain-of-Thought reasoning traces faithfully represent internal model computations. By demonstrating that CoT is often 'performative' and temporally misaligned with latent answer formation, it profoundly impacts AI interpretability, safety, and oversight mechanisms. While Paper 2 offers a valuable practical tool for benchmark robustness, Paper 1 provides deep mechanistic insights into the nature of LLM reasoning, which is highly timely given the recent surge of reasoning-focused models, making its theoretical and safety implications more broadly transformative.

    vs. Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment
    gemini-3.1 · 5/16/2026

    Paper 1 addresses a critical systemic vulnerability in the AI ecosystem: the unreliability of agent benchmarks due to reward hacking. By proving that major benchmarks can be exploited to near-perfect scores without actually solving tasks, it fundamentally challenges how the field measures progress. Furthermore, it provides a practical, automated framework (BenchJack) to identify and patch these flaws. While Paper 2 offers a valuable algorithmic improvement for LLM alignment, Paper 1 has immediate, massive implications for AI safety, model evaluation methodologies, and the trustworthiness of future agentic AI deployments.

    vs. Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture
    gemini-3.1 · 5/16/2026

    While Paper 1 presents a novel architectural approach to AI safety, Paper 2 exposes critical vulnerabilities in the foundational benchmarks currently used to measure AI progress across the entire field. By demonstrating that current benchmarks are easily 'reward-hacked' and providing an automated framework to audit and patch them, Paper 2 forces an immediate methodological shift in AI evaluation, leading to broader and more urgent scientific impact.

    vs. Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
    gemini-3.1 · 5/16/2026

    Paper 1 addresses a fundamental crisis in AI evaluation: reward hacking in agent benchmarks. By systematically exposing and patching vulnerabilities in widely used benchmarks, it ensures the validity of metrics guiding frontier AI progress. This systemic impact on evaluation methodology, combined with a practical automated auditing tool, gives it a broader and more critical influence across the field than Paper 2's focus on detecting topic-specific LLM biases.

    vs. Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
    gemini-3.1 · 5/14/2026

    Paper 1 addresses a critical vulnerability in AI evaluation: benchmark reward hacking. By exposing severe flaws in 10 major benchmarks and providing an automated auditing and patching tool (BenchJack), it offers immediate, widespread utility for AI safety and development. Paper 2 provides valuable theoretical insights into VLM representational bottlenecks using a synthetic benchmark, but its real-world impact is less immediate and broad compared to the systemic evaluation improvements introduced in Paper 1.

    vs. Multimodal Hidden Markov Models for Persistent Emotional State Tracking
    gpt-5.2 · 5/14/2026

    Paper 2 is likely higher impact due to strong timeliness and broad relevance: benchmark integrity under reward hacking affects nearly all agentic AI research and deployment. It contributes a general taxonomy, a concrete automated auditing system (BenchJack), and an iterative adversarial patching pipeline, validated across 10 widely used benchmarks with large-scale findings (219 flaws) and demonstrated remediation. This is methodologically compelling and has immediate real-world application for evaluation governance. Paper 1 is useful and interpretable for affect tracking, but is narrower in scope and less likely to reshape multiple fields quickly.

    vs. Selective Off-Policy Reference Tuning with Plan Guidance
    claude-opus-4.6 · 5/14/2026

    Paper 2 addresses a fundamental and broadly impactful problem: the integrity of AI agent benchmarks that guide model selection, investment, and deployment decisions. It introduces a systematic taxonomy, an automated red-teaming tool (BenchJack), and demonstrates concrete results across 10 popular benchmarks, finding 219 flaws and providing actionable patches. This has immediate cross-cutting implications for the entire AI evaluation ecosystem. Paper 1, while technically solid, is an incremental improvement to reinforcement learning training methods for reasoning, with a narrower scope of impact.

    vs. Assessing the Creativity of Large Language Models: Testing, Limits, and New Frontiers
    gpt-5.2 · 5/14/2026

    Paper 2 likely has higher impact: it targets a critical, timely failure mode (reward hacking) that directly affects how frontier agent progress is measured, funded, and deployed. It contributes a general taxonomy/checklist plus an automated, iterative red-teaming methodology (BenchJack) and demonstrates rigor via large-scale auditing across 10 major benchmarks, uncovering 219 flaws and showing substantial patching gains. The approach is broadly applicable across domains and can reshape evaluation practice. Paper 1 is novel and useful for LLM creativity measurement, but its applications are narrower and less immediately security-critical.

    vs. Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements
    claude-opus-4.6 · 5/14/2026

    Paper 1 presents a concrete, actionable system (BenchJack) that identifies real vulnerabilities in 10 widely-used AI benchmarks, producing immediately applicable results (219 distinct flaws, patches reducing hackable tasks from ~100% to <10%). It addresses a timely, practical problem as AI agent benchmarks drive major decisions. Paper 2, while addressing an important theoretical question about AI safety, provides abstract formal results without proposing concrete solutions, and its control-theoretic framing largely formalizes intuitions already widely held. Paper 1's empirical rigor, practical tooling, and direct applicability to the fast-moving benchmarking ecosystem give it broader near-term impact.

    vs. Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
    gpt-5.2 · 5/14/2026

    Paper 1 has higher scientific impact potential due to its methodological novelty (automated, adversarial benchmark auditing with a taxonomy/checklist and iterative patching loop) and broad relevance: reliable evaluation under reward hacking affects essentially all agent benchmarks, model comparison, and downstream deployment decisions. It offers a generalizable framework and concrete evidence across many popular benchmarks, likely reshaping evaluation practices across AI safety, benchmarking, and agent research. Paper 2 is timely and highly applicable for industry, but is more deployment/engineering-specific and less likely to generalize into widely cited scientific methodology.

    vs. A Generalized Synthetic Control Method for Baseline Estimation in Demand Response Services
    claude-opus-4.6 · 5/14/2026

    Paper 1 addresses a critical and timely problem—reward hacking in AI agent benchmarks—that affects the entire AI evaluation ecosystem. It introduces a systematic taxonomy, an automated auditing tool (BenchJack), and demonstrates widespread vulnerabilities across 10 major benchmarks, with practical remediation. Its breadth of impact spans AI safety, evaluation methodology, and deployment decisions for frontier models, making it highly relevant to a large community. Paper 2, while methodologically sound, addresses a narrower domain (demand response baseline estimation) with incremental improvements over existing methods, limiting its broader scientific impact.

    vs. What properties of reasoning supervision are associated with improved downstream model quality?
    gpt-5.2 · 5/14/2026

    Paper 1 is likely higher impact: it addresses a timely, widely recognized failure mode (reward hacking) in agent benchmarks that affects model selection and claims of progress across many subfields. It contributes concrete artifacts (taxonomy, checklist, automated red-teaming/auditing system, iterative patching pipeline) and demonstrates broad empirical results across 10 major benchmarks with substantial vulnerability reductions—high practical applicability and breadth. Paper 2 is useful and methodologically sound but narrower in scope (reasoning-data selection, one language/dataset family) and more incremental/less broadly transformative.

    vs. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
    claude-opus-4.6 · 5/14/2026

    Paper 1 addresses a fundamental and urgent problem in AI evaluation—reward hacking in agent benchmarks—with broad implications for the entire field. Its systematic taxonomy of flaw patterns, automated red-teaming framework (BenchJack), and demonstration across 10 popular benchmarks (finding 219 flaws) directly impacts how the community evaluates frontier AI models. The adversarial auditing paradigm is highly novel and timely given rapid AI deployment. Paper 2 contributes useful infrastructure for environment generation but targets a narrower domain (claw-like agents) with less transformative potential for the broader AI safety and evaluation landscape.