Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian
Abstract
LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.
AI Impact Assessments
Scientific Impact Assessment: Harness-Bench
1. Core Contribution
Harness-Bench introduces a diagnostic benchmark that makes the harness—the system layer managing context, tools, state, permissions, and recovery—a first-class axis of evaluation in LLM agent benchmarks. The key insight is that existing benchmarks either (a) hold the harness fixed while varying models, (b) evaluate complete opaque agent systems, or (c) abstract away execution entirely. By fixing external task conditions (prompts, sandboxes, budgets, evaluators) while varying both the harness and the model backend, Harness-Bench enables a factorial study of model–harness interactions.
The benchmark comprises 106 sandboxed offline tasks across 8 workflow categories, evaluated with 6 configurable harnesses × 8 model backends (plus Codex as a reference), producing 5,194 execution trajectories. The paper's central empirical finding—a 23.8-point gap between the best and worst harness configurations averaged across models—provides concrete evidence that agent capability should be reported at the model–harness configuration level.
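The reported trajectory count can be checked against the stated design. The sketch below is only a sanity check, and it assumes the Codex reference contributes exactly one run per task; the text does not state this, but the assumption reproduces the reported total.

```python
# Counts taken from the assessment text.
NUM_TASKS = 106
NUM_HARNESSES = 6
NUM_MODELS = 8

# Assumption (not stated in the source): the Codex reference is run once per task.
CODEX_REFERENCE_RUNS = NUM_TASKS

# Full factorial grid: every harness paired with every model on every task.
factorial_runs = NUM_HARNESSES * NUM_MODELS * NUM_TASKS
total_runs = factorial_runs + CODEX_REFERENCE_RUNS

print(factorial_runs)  # 5088
print(total_runs)      # 5194
```

Under this assumption the factorial grid accounts for 5,088 trajectories and the Codex reference for the remaining 106.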
Execution alignment (the degree to which a harness preserves correspondence between model reasoning, workspace state, tool actions, and evaluator conditions) is a useful conceptual contribution, though it remains more descriptive than formalized.
2. Methodological Rigor
Strengths in design: The factorial experimental design (6 harnesses × 8 models × 106 tasks) is well-structured for studying interaction effects. The four-criteria task validation process (realism, solvability, oracle-checkability, integrity) is sensible. The multiplicative scoring formula (Security × Completion × Process) is intentionally conservative and transparent. The decision to preserve each harness's native execution behavior rather than forcing a common internal implementation is pragmatically sound and ecologically valid.
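To see why a multiplicative score is conservative, consider the minimal sketch below. It assumes each component is normalized to [0, 1]; the function name `run_score`, the range check, and the example values are hypothetical illustrations, not the paper's actual implementation.

```python
def run_score(security: float, completion: float, process: float) -> float:
    """Multiply the three components; each is assumed to lie in [0, 1]."""
    components = {"security": security, "completion": completion, "process": process}
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return security * completion * process


# Hypothetical component values, chosen only for illustration.
partial_run = run_score(security=1.0, completion=0.8, process=0.5)
unsafe_run = run_score(security=0.0, completion=1.0, process=1.0)

# For contrast: an unweighted mean would still give the unsafe run partial credit.
unsafe_mean = (0.0 + 1.0 + 1.0) / 3

print(partial_run)
print(unsafe_run)
print(unsafe_mean)
```

Because the components are multiplied, a zero on any one of them (here, security) zeroes the whole score, whereas an averaged score would still award the same run about two thirds of the credit.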
Weaknesses and concerns:
3. Potential Impact
The paper addresses a genuine blind spot in agent evaluation. As LLM agents become production systems, understanding how the execution layer affects outcomes is critical for:
The failure taxonomy (Table 3) identifying contract/format violations (36.4%), tool/recovery failures (24.6%), and evidence/grounding failures (14.6%) provides actionable diagnostic categories that could guide harness improvement.
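The three named categories do not cover every failure. A quick tally, assuming the Table 3 shares are percentages of all classified failures and sum to 100% (the remaining categories are not listed in this assessment):

```python
# Shares quoted from the assessment text, in percent of failures.
named_shares = {
    "contract/format violations": 36.4,
    "tool/recovery failures": 24.6,
    "evidence/grounding failures": 14.6,
}

# Round to one decimal to avoid floating-point noise in the sums.
named_total = round(sum(named_shares.values()), 1)

# Assumption: shares sum to 100%, so the rest belongs to categories not named here.
unnamed_remainder = round(100.0 - named_total, 1)

print(named_total)        # 75.6
print(unnamed_remainder)  # 24.4
```

Roughly three quarters of the failures fall into the three named categories, which is why they are the natural first targets for harness improvement.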
However, the practical impact may be limited by the benchmark's scope: 106 tasks, offline-only, and a specific set of harnesses that may not age well as the ecosystem evolves rapidly.
4. Timeliness & Relevance
This is highly timely. The explosion of agent frameworks (LangChain, AutoGen, CrewAI, Claude Code, Codex, etc.) has created exactly the confusion this benchmark aims to address. The community routinely conflates model capability with system-level performance. The paper's framing aligns with growing recognition (Kapoor et al., 2025; Yang et al., 2024) that evaluation infrastructure for agents is insufficient.
The concept of measuring harness effects independently is an idea whose time has come. Whether Harness-Bench specifically becomes the standard tool for this is less certain, but the problem framing is valuable.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
Harness-Bench makes a timely and conceptually valuable contribution by formalizing the harness as an independent evaluation axis. The experimental design is thoughtful and the empirical findings are suggestive. However, the statistical analysis falls short of what's needed to firmly establish the paper's claims, the task scale is modest, and the lack of repeated runs is a significant methodological gap. The paper is better as a position piece with supporting evidence than as a rigorous empirical study. Its lasting impact will likely depend on community adoption and whether the benchmark evolves to address its current limitations.
Generated May 28, 2026
Comparison History (13)
Paper 1 likely has higher impact because it targets a broadly relevant, under-measured variable—agent “harness” effects—across many models and realistic tool-using workflows, providing a sizable, instrumented benchmark with traces to diagnose failure modes. This can influence evaluation standards, reporting practices, and system design across the fast-growing LLM agent ecosystem (software engineering, productivity, robotics-like tool use). Paper 2 is timely and important for clinical NLP, but its scope is narrower and more domain-specific, with real-world deployment constrained by regulation and data access.
Paper 1 is likely higher impact due to a concrete, broadly useful benchmark and dataset enabling systematic study of harness (execution-layer) effects across models—an under-measured, timely factor in real-world agent performance. It offers methodological rigor (sandboxed tasks, oracle-checkable validators, large trajectory set, rich traces) and clear applications for evaluating/debugging agent stacks, improving reliability, efficiency, and auditability. Paper 2 is important conceptually and timely for safety, but as a position piece its impact depends more on subsequent adoption and empirical depth; its benchmark scope appears narrower and tied to a specific agent setup.
Harness-Bench addresses a critical and timely gap in LLM agent evaluation—isolating the effect of the execution harness from the model itself—which has broad practical implications for the rapidly growing agent deployment ecosystem. It introduces a comprehensive benchmark with 5,194 trajectories and identifies a novel concept (execution-alignment failures) relevant across many agent applications. Paper 2 makes a solid methodological contribution (ADR metric, paired-formula protocol) for evaluating LLM reasoning on SAT, but targets a narrower, more theoretical question with less immediate breadth of impact across the field.
Paper 1 introduces a paradigm shift in evaluating LLM agents by highlighting the critical role of the system 'harness', rather than just the base model. This addresses a major gap in current methodologies. While Paper 2 offers valuable insights into memory systems, Paper 1's broader focus on the entire execution stack and its potential to redefine standard reporting practices for agent capabilities gives it a higher potential for widespread methodological impact across the field of AI agents.
Paper 2 addresses a timely and critical problem in AI safety—how to handle potentially adversarial behavior in deployed AI coding agents. It provides concrete, actionable findings (retrying leaks exploitable information, resampling strategies, contradictions with prior work) that directly inform the design of safer AI systems. Its findings challenge established assumptions in the AI control literature, which amplifies its impact. Paper 1 introduces a useful benchmark for studying harness effects on agent performance, which is valuable but more incremental. Paper 2's safety implications give it broader urgency and cross-field relevance.
Paper 2 presents a high-impact, interdisciplinary application of LLM agents to neuroimaging, addressing significant bottlenecks in medical research. By automating complex preprocessing and achieving high accuracy in Alzheimer's Disease classification, it offers immediate, tangible benefits to clinical and neuroscience research. While Paper 1 provides a useful AI benchmark, Paper 2's direct real-world healthcare application and ability to accelerate reproducible scientific analysis give it a broader and more profound potential scientific impact.
Paper 1 addresses a critical and foundational challenge in the rapidly expanding field of autonomous LLM agents: standardized evaluation. By demonstrating that the system 'harness' significantly affects agent performance, it challenges current evaluation paradigms and provides a necessary benchmark for the broader AI research community. Paper 2, while highly valuable for commercial applications, proposes a more domain-specific application of contrastive learning (e-commerce product images), giving it a narrower scope of scientific impact compared to the fundamental methodological shift proposed in Paper 1.
Paper 1 offers deeper mechanistic insights into how LLMs perform reasoning, identifying specific attention heads responsible for deductive reasoning steps through causal mediation analysis. This contributes fundamental understanding of LLM internals with broad implications for interpretability and model improvement. Paper 2 introduces a useful benchmark for evaluating harness effects on agent workflows, but benchmarks tend to have more incremental impact and shorter relevance windows. Paper 1's contributions to mechanistic interpretability address a more foundational question with broader applicability across the field.
Paper 2 introduces a novel conceptual framework (GEM) that redefines long-term agent memory as a fundamentally new data-management workload, with formal correctness conditions and proof that existing paradigms are insufficient. This has broader theoretical and practical impact across database systems, AI agents, and memory management. Paper 1, while useful, is primarily a benchmarking contribution for harness-level evaluation—important but more incremental. Paper 2 opens new research directions at the intersection of databases and AI agents, likely inspiring more follow-on work across multiple communities.
Paper 1 likely has higher impact: it introduces a new, reproducible benchmark targeting an under-measured but practically critical variable (agent harness configuration) with substantial empirical scale (106 tasks, 5,194 trajectories) and rich logged artifacts/traces enabling broad downstream research (reliability, tool use, eval methodology, systems). Its applications are immediate for deployed agent stacks and could standardize reporting at the model–harness level. Paper 2 is timely and conceptually important as a critique with stronger controls, but its scope is narrower (specific introspection paradigms) and is less directly enabling for engineering and cross-field benchmarking.
Paper 1 introduces a critical benchmark for evaluating the system-level 'harness' of LLM agents, addressing a major gap in real-world agent evaluation. Benchmarks that redefine how capabilities are measured (model-harness pairing vs. base model alone) typically have broader, longer-lasting impact across the field than specific training optimizations like the one presented in Paper 2.
Paper 1 likely has higher impact: it introduces a broadly applicable benchmark isolating “harness” effects in LLM agent execution, a timely and under-measured source of variance as agents move into real deployments. Its methodology (sandboxed offline tasks, oracle-checkable outcomes, traces, large trajectory count) enables rigorous, reproducible diagnosis across many systems and research groups. The resulting framing—reporting capability at the model–harness configuration level—could shift evaluation practice across agent tooling, alignment, reliability, and systems engineering. Paper 2 is useful and practical, but more incremental and narrower to claim–citation verification.
Paper 1 addresses a critical, highly practical gap in the rapidly expanding field of LLM agents by isolating the impact of the system 'harness' from the base model. This insight fundamentally shifts how agent capabilities should be evaluated and reported. While Paper 2 offers a valuable cognitive benchmark for Theory of Mind, Paper 1 has broader real-world applicability for building, diagnosing, and deploying reliable autonomous systems, giving it a higher potential for immediate and widespread scientific and engineering impact.