Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian

May 27, 2026

arXiv:2605.27922v1

cs.AI (primary)

#629 of 2682 · Artificial Intelligence

Tournament Score: 1466 ± 49 (scale 1050 to 1800)

Win Rate: 69%

Rating: 5.8 / 10

Significance 7 · Rigor 4.5 · Novelty 6.5 · Clarity 7.5

Abstract

LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Harness-Bench

1. Core Contribution

Harness-Bench introduces a diagnostic benchmark that makes the harness—the system layer managing context, tools, state, permissions, and recovery—a first-class axis of evaluation in LLM agent benchmarks. The key insight is that existing benchmarks either (a) hold the harness fixed while varying models, (b) evaluate complete opaque agent systems, or (c) abstract away execution entirely. By fixing external task conditions (prompts, sandboxes, budgets, evaluators) while varying both the harness and the model backend, Harness-Bench enables a factorial study of model–harness interactions.

The benchmark comprises 106 sandboxed offline tasks across 8 workflow categories, evaluated with 6 configurable harnesses × 8 model backends (plus Codex as a reference), producing 5,194 execution trajectories. The paper's central empirical finding—a 23.8-point gap between the best and worst harness configurations averaged across models—provides concrete evidence that agent capability should be reported at the model–harness configuration level.
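The trajectory count can be sanity-checked with simple arithmetic. The sketch below assumes that every harness-model pair runs each task exactly once and that the Codex reference contributes one additional run per task; that reading is an assumption, but it does reproduce the reported total.

```python
# Assumption: one run per (harness, model, task) triple, plus one Codex
# reference run per task. The counts themselves come from the review text.
HARNESSES = 6
MODELS = 8
TASKS = 106

factorial_runs = HARNESSES * MODELS * TASKS  # 6 * 8 * 106 = 5088
reference_runs = TASKS                       # Codex reference: 106
total_runs = factorial_runs + reference_runs

print(total_runs)  # 5194
```

Under this reading there is a single trajectory per cell of the design.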

The concept of execution alignment (the degree to which a harness preserves correspondence between model reasoning, workspace state, tool actions, and evaluator conditions) is a useful conceptual contribution, though it remains more descriptive than formalized.

2. Methodological Rigor

Strengths in design: The factorial experimental design (6 harnesses × 8 models × 106 tasks) is well-structured for studying interaction effects. The four-criteria task validation process (realism, solvability, oracle-checkability, integrity) is sensible. The multiplicative scoring formula (Security × Completion × Process) is intentionally conservative and transparent. The decision to preserve each harness's native execution behavior rather than forcing a common internal implementation is pragmatically sound and ecologically valid.
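To illustrate why a multiplicative score is conservative, here is a minimal sketch. The function name and the assumption that each component lies in [0, 1] are hypothetical; the paper's exact component definitions are not reproduced here.

```python
def task_score(security: float, completion: float, process: float) -> float:
    """Multiply three component scores, each assumed to lie in [0, 1]."""
    components = (("security", security),
                  ("completion", completion),
                  ("process", process))
    for name, value in components:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return security * completion * process


print(task_score(1.0, 1.0, 0.5))  # 0.5
print(task_score(0.0, 1.0, 1.0))  # 0.0
```

Because the components are multiplied rather than averaged, a security violation (a zero factor) wipes out the score even when the task is otherwise completed, and no single strong component can compensate for a weak one.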

Weaknesses and concerns:

Statistical analysis is thin. The paper reports means and variances but lacks formal statistical tests (e.g., ANOVA decompositions of variance into model, harness, and interaction effects). The "harness dependence" metric is simply the variance of harness-level averages—a coarse measure. No confidence intervals, significance tests, or effect sizes are reported. For a paper making claims about systematic variation, this is a notable gap.
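As a rough illustration of the difference between the coarse metric and a variance decomposition, the sketch below uses hypothetical cell means (rows are models, columns are harnesses). "Harness dependence" is read here as the per-model variance across harness-level averages; whether the paper uses population or sample variance is not stated, so population variance is an assumption.

```python
from statistics import mean, pvariance

# Hypothetical task-averaged scores: one row per model, one column per harness.
cell_means = {
    "model_a": [0.60, 0.50, 0.40],
    "model_b": [0.80, 0.78, 0.76],
}


def harness_dependence(row):
    """Variance of one model's harness-level averages (population variance)."""
    return pvariance(row)


def two_way_sums_of_squares(table):
    """Split total variation into model, harness, and interaction parts."""
    rows = list(table.values())
    n_rows, n_cols = len(rows), len(rows[0])
    grand = mean(v for row in rows for v in row)
    row_means = [mean(row) for row in rows]
    col_means = [mean(row[j] for row in rows) for j in range(n_cols)]
    ss_model = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_harness = n_rows * sum((c - grand) ** 2 for c in col_means)
    ss_total = sum((v - grand) ** 2 for row in rows for v in row)
    return {
        "model": ss_model,
        "harness": ss_harness,
        "interaction": ss_total - ss_model - ss_harness,
        "total": ss_total,
    }


for name, row in cell_means.items():
    print(name, harness_dependence(row))
print(two_way_sums_of_squares(cell_means))
```

With only one observation per cell, the interaction term cannot be separated from run-to-run noise, which is why a formal test would also need repeated runs.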

Process scoring relies on LLM judges (Claude Sonnet 4.6), which introduces potential bias, especially since Claude is also one of the model backends being evaluated. The paper acknowledges this limitation but does not quantify inter-rater reliability or judge consistency.
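One common way to quantify judge consistency is Cohen's kappa between two raters. The sketch below is illustrative only: it assumes process judgments have been discretized into labels such as "pass" and "fail", and that a second judge (another model or a human) has labeled the same trajectories. Neither assumption comes from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two judges on four trajectories.
judge_1 = ["pass", "pass", "fail", "fail"]
judge_2 = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(judge_1, judge_2))  # 0.5
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates that the judges agree no more often than their label frequencies alone would predict.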

106 tasks is a modest scale. While manually curated, the task suite may not provide sufficient statistical power to draw robust conclusions about category-level differences (some categories have only 7–12 tasks).
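The power concern can be made concrete with the standard error of a completion rate. This is a back-of-the-envelope sketch that assumes binary task outcomes and independent tasks; the rate of 0.5 is a hypothetical worst case, not a reported figure.

```python
import math


def proportion_standard_error(rate: float, n_tasks: int) -> float:
    """Binomial standard error of an observed completion rate."""
    if n_tasks <= 0 or not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1] and n_tasks positive")
    return math.sqrt(rate * (1.0 - rate) / n_tasks)


# Category sizes mentioned in the review (7 and 12) versus the full suite (106).
for n in (7, 12, 106):
    print(n, round(proportion_standard_error(0.5, n), 3))
```

At a 50% completion rate, the uncertainty for a 7-task category is roughly four times that of the full 106-task suite, so category-level gaps of ten or twenty points can easily fall within noise.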

No repeated runs are reported. LLM agents are stochastic, yet each model–harness–task triple appears to be run once. Without repeated trials, it's impossible to disentangle systematic harness effects from run-to-run variance.
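With repeated trials, run-to-run variance could be compared directly against the spread between harnesses. The sketch below uses hypothetical repeated scores for one model on two harnesses; the harness names and numbers are placeholders for illustration.

```python
from statistics import mean, variance

# Hypothetical scores from three repeated runs per harness (same model, same tasks).
repeated_scores = {
    "harness_x": [0.62, 0.58, 0.60],
    "harness_y": [0.45, 0.55, 0.50],
}


def within_and_between(runs_by_harness):
    """Return (mean within-harness variance, variance of harness means)."""
    within = mean(variance(runs) for runs in runs_by_harness.values())
    between = variance([mean(runs) for runs in runs_by_harness.values()])
    return within, between


within, between = within_and_between(repeated_scores)
print("run-to-run variance:", within)
print("between-harness variance:", between)
```

A harness effect is only convincing when the between-harness variance clearly exceeds what run-to-run noise would produce on its own. With a single run per cell, as the paper appears to use, the within-harness term cannot be estimated at all.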

Harness selection is not well-justified. The 6 harnesses span different design philosophies, but the paper does not clearly articulate why these particular harnesses were chosen or how representative they are of the broader ecosystem.

3. Potential Impact

The paper addresses a genuine blind spot in agent evaluation. As LLM agents become production systems, understanding how the execution layer affects outcomes is critical for:

Practitioners choosing and configuring agent frameworks

Researchers reporting agent results (the paper's argument for configuration-level reporting is compelling)

Developers building and debugging agent execution stacks

Safety/alignment researchers concerned with execution-layer failure modes

The failure taxonomy (Table 3) identifying contract/format violations (36.4%), tool/recovery failures (24.6%), and evidence/grounding failures (14.6%) provides actionable diagnostic categories that could guide harness improvement.

However, the practical impact may be limited by the benchmark's scope: 106 tasks, offline-only, and a specific set of harnesses that may not age well as the ecosystem evolves rapidly.

4. Timeliness & Relevance

This is highly timely. The explosion of agent frameworks (LangChain, AutoGen, CrewAI, Claude Code, Codex, etc.) has created exactly the confusion this benchmark aims to address. The community routinely conflates model capability with system-level performance. The paper's framing aligns with growing recognition (Kapoor et al., 2025; Yang et al., 2024) that evaluation infrastructure for agents is insufficient.

The concept of measuring harness effects independently is an idea whose time has come. Whether Harness-Bench specifically becomes the standard tool for this is less certain, but the problem framing is valuable.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated evaluation axis (harness as variable)

Practical relevance to the rapidly growing agent ecosystem

Careful task curation with explicit quality criteria

Rich evidence collection (traces, usage statistics, validator outputs) enabling multi-dimensional analysis

The finding that stronger models show lower cross-harness variance is an interesting empirical observation

Open-source release of code and data

Notable Limitations:

Lack of statistical rigor in analysis (no formal tests, no repeated runs, no confidence intervals)

LLM-based process scoring introduces circularity concerns

Modest task scale for the granularity of claims being made

Category-level analysis (Appendix C) is underpowered given small category sizes

The "execution alignment" concept is described qualitatively but not operationalized as a measurable quantity

No ablation studies that could isolate which specific harness mechanisms matter most

The paper explicitly disclaims causal claims, which is honest but limits the actionability of findings

Some evaluated harnesses and models are supported only by 2026 references that are difficult to verify

Overall Assessment

Harness-Bench makes a timely and conceptually valuable contribution by formalizing the harness as an independent evaluation axis. The experimental design is thoughtful and the empirical findings are suggestive. However, the statistical analysis falls short of what is needed to firmly establish the paper's claims, the task scale is modest, and the lack of repeated runs is a significant methodological gap. The paper reads better as a position piece with supporting evidence than as a rigorous empirical study. Its lasting impact will likely depend on community adoption and on whether the benchmark evolves to address its current limitations.

Rating: 5.8 / 10

Significance 7 · Rigor 4.5 · Novelty 6.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (13)

vs. Do Clinical Models Change Treatment Decisions?

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact because it targets a broadly relevant, under-measured variable—agent “harness” effects—across many models and realistic tool-using workflows, providing a sizable, instrumented benchmark with traces to diagnose failure modes. This can influence evaluation standards, reporting practices, and system design across the fast-growing LLM agent ecosystem (software engineering, productivity, robotics-like tool use). Paper 2 is timely and important for clinical NLP, but its scope is narrower and more domain-specific, with real-world deployment constrained by regulation and data access.

vs. Position: AI Safety Requires Effective Controllability

gpt-5.2 · 5/28/2026

Paper 1 is likely higher impact due to a concrete, broadly useful benchmark and dataset enabling systematic study of harness (execution-layer) effects across models—an under-measured, timely factor in real-world agent performance. It offers methodological rigor (sandboxed tasks, oracle-checkable validators, large trajectory set, rich traces) and clear applications for evaluating/debugging agent stacks, improving reliability, efficiency, and auditability. Paper 2 is important conceptually and timely for safety, but as a position piece its impact depends more on subsequent adoption and empirical depth; its benchmark scope appears narrower and tied to a specific agent setup.

vs. Satisfiability Solving with LLMs: A Matched-Pair Evaluation of Reasoning Capability

claude-opus-4.6 · 5/28/2026

Harness-Bench addresses a critical and timely gap in LLM agent evaluation—isolating the effect of the execution harness from the model itself—which has broad practical implications for the rapidly growing agent deployment ecosystem. It introduces a comprehensive benchmark with 5,194 trajectories and identifies a novel concept (execution-alignment failures) relevant across many agent applications. Paper 2 makes a solid methodological contribution (ADR metric, paired-formula protocol) for evaluating LLM reasoning on SAT, but targets a narrower, more theoretical question with less immediate breadth of impact across the field.

vs. MemFail: Stress-Testing Failure Modes of LLM Memory Systems

gemini-3.1 · 5/28/2026

Paper 1 introduces a paradigm shift in evaluating LLM agents by highlighting the critical role of the system 'harness', rather than just the base model. This addresses a major gap in current methodologies. While Paper 2 offers valuable insights into memory systems, Paper 1's broader focus on the entire execution stack and its potential to redefine standard reporting practices for agent capabilities gives it a higher potential for widespread methodological impact across the field of AI agents.

vs. Retrying vs Resampling in AI Control

claude-opus-4.6 · 5/28/2026

Paper 2 addresses a timely and critical problem in AI safety—how to handle potentially adversarial behavior in deployed AI coding agents. It provides concrete, actionable findings (retrying leaks exploitable information, resampling strategies, contradictions with prior work) that directly inform the design of safer AI systems. Its findings challenge established assumptions in the AI control literature, which amplifies its impact. Paper 1 introduces a useful benchmark for studying harness effects on agent performance, which is valuable but more incremental. Paper 2's safety implications give it broader urgency and cross-field relevance.

vs. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

gemini-3.1 · 5/28/2026

Paper 2 presents a high-impact, interdisciplinary application of LLM agents to neuroimaging, addressing significant bottlenecks in medical research. By automating complex preprocessing and achieving high accuracy in Alzheimer's Disease classification, it offers immediate, tangible benefits to clinical and neuroscience research. While Paper 1 provides a useful AI benchmark, Paper 2's direct real-world healthcare application and ability to accelerate reproducible scientific analysis give it a broader and more profound potential scientific impact.

vs. Utility-Aware Multimodal Contrastive Learning for Product Image Generation

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical and foundational challenge in the rapidly expanding field of autonomous LLM agents: standardized evaluation. By demonstrating that the system 'harness' significantly affects agent performance, it challenges current evaluation paradigms and provides a necessary benchmark for the broader AI research community. Paper 2, while highly valuable for commercial applications, proposes a more domain-specific application of contrastive learning (e-commerce product images), giving it a narrower scope of scientific impact compared to the fundamental methodological shift proposed in Paper 1.

vs. Revealing Algorithmic Deductive Circuits for Logical Reasoning

claude-opus-4.6 · 5/28/2026

Paper 1 offers deeper mechanistic insights into how LLMs perform reasoning, identifying specific attention heads responsible for deductive reasoning steps through causal mediation analysis. This contributes fundamental understanding of LLM internals with broad implications for interpretability and model improvement. Paper 2 introduces a useful benchmark for evaluating harness effects on agent workflows, but benchmarks tend to have more incremental impact and shorter relevance windows. Paper 1's contributions to mechanistic interpretability address a more foundational question with broader applicability across the field.

vs. Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a novel conceptual framework (GEM) that redefines long-term agent memory as a fundamentally new data-management workload, with formal correctness conditions and proof that existing paradigms are insufficient. This has broader theoretical and practical impact across database systems, AI agents, and memory management. Paper 1, while useful, is primarily a benchmarking contribution for harness-level evaluation—important but more incremental. Paper 2 opens new research directions at the intersection of databases and AI agents, likely inspiring more follow-on work across multiple communities.

vs. Can LLMs Introspect? A Reality Check

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a new, reproducible benchmark targeting an under-measured but practically critical variable (agent harness configuration) with substantial empirical scale (106 tasks, 5,194 trajectories) and rich logged artifacts/traces enabling broad downstream research (reliability, tool use, eval methodology, systems). Its applications are immediate for deployed agent stacks and could standardize reporting at the model–harness level. Paper 2 is timely and conceptually important as a critique with stronger controls, but its scope is narrower (specific introspection paradigms) and is less directly enabling for engineering and cross-field benchmarking.

vs. Cross-Entropy Games and Frost Training

gemini-3.1 · 5/28/2026

Paper 1 introduces a critical benchmark for evaluating the system-level 'harness' of LLM agents, addressing a major gap in real-world agent evaluation. Benchmarks that redefine how capabilities are measured (model-harness pairing vs. base model alone) typically have broader, longer-lasting impact across the field than specific training optimizations like the one presented in Paper 2.

vs. DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a broadly applicable benchmark isolating “harness” effects in LLM agent execution, a timely and under-measured source of variance as agents move into real deployments. Its methodology (sandboxed offline tasks, oracle-checkable outcomes, traces, large trajectory count) enables rigorous, reproducible diagnosis across many systems and research groups. The resulting framing—reporting capability at the model–harness configuration level—could shift evaluation practice across agent tooling, alignment, reliability, and systems engineering. Paper 2 is useful and practical, but more incremental and narrower to claim–citation verification.

vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical, highly practical gap in the rapidly expanding field of LLM agents by isolating the impact of the system 'harness' from the base model. This insight fundamentally shifts how agent capabilities should be evaluated and reported. While Paper 2 offers a valuable cognitive benchmark for Theory of Mind, Paper 1 has broader real-world applicability for building, diagnosing, and deploying reliable autonomous systems, giving it a higher potential for immediate and widespread scientific and engineering impact.