Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo

Jun 10, 2026 · arXiv:2606.12344v1

cs.LG · cs.CL

#4344 of 5669 · cs.LG

Tournament Score: 1329 ± 42 · Win Rate: 32%

Rating: 5/10 (Significance 5.5 · Rigor 5 · Novelty 4 · Clarity 6)

Abstract

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Claw-SWE-Bench

1. Core Contribution

Claw-SWE-Bench addresses a real but relatively narrow problem in coding-agent evaluation: the conflation of model capability, harness/scaffold design, and task selection when reporting SWE-bench-style results. The paper's key insight is that existing SWE-bench evaluations bundle together the prompt template, agent loop, tool interface, timeout, patch extraction, and stopping logic, making it impossible to attribute performance differences to any single factor.

The paper contributes: (1) an adapter protocol that allows heterogeneous agent harnesses to be evaluated under standardized conditions; (2) a 350-instance multilingual benchmark drawn from existing SWE-bench sources; (3) an 80-instance "Lite" subset designed via cost-aware, rank-aware selection; and (4) experimental evidence quantifying the independent contributions of model choice (~29.4 pp) and harness choice (~27.4 pp) to Pass@1 variation.

The conceptual contribution—treating the harness as a first-class experimental variable alongside the model—is sound and addresses a genuine gap. However, the novelty is primarily in experimental design and engineering infrastructure rather than in algorithmic or theoretical innovation. The benchmark instances are entirely drawn from existing sources (SWE-bench-Multilingual and SWE-bench-Verified-Mini), and the adapter protocol, while useful, is essentially a software engineering interface specification.

2. Methodological Rigor

Strengths: The paper demonstrates careful experimental methodology. The standardization of prompt templates, runtime budgets (3600s per instance), worker concurrency, Docker environments, and patch extraction procedures is well-documented. The future-commit cleanup addressing data leakage in SWE-bench-Multilingual containers is a valuable fairness correction. The bare-vs-full adapter diagnostic (19.1% vs. 73.4% Pass@1) effectively demonstrates that adapter design is not merely an engineering convenience but fundamentally determines whether general-purpose agents can be meaningfully evaluated.

Weaknesses: The most significant methodological limitation is that all results are single-run aggregates with no variance estimates. The authors acknowledge this but do not address it. With stochastic LLM outputs and varying API latencies, the reported differences (especially smaller ones) lack statistical confidence intervals. The 5-claw × 2-model sweep is quite limited—only 10 cells—which restricts the generalizability of interaction claims. The Lite-80 subset selection, while methodologically sophisticated (17-column calibration, K-sweep sensitivity analysis), is calibrated on the same systems it's designed to evaluate, raising concerns about overfitting to the current generation of harnesses.
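To make the missing-variance concern concrete, the sketch below computes a 95% Wilson score interval for a single-run Pass@1 over 350 instances. The resolved counts (257 and 67) are assumptions inferred by rounding the reported 73.4% and 19.1%; they are not taken from the paper. The interval reflects only sampling uncertainty over instances, not run-to-run stochasticity of the model, which would require repeated runs.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


N = 350  # benchmark size stated in the abstract
# Assumed counts: the reported percentages rounded to whole instances.
full_resolved = round(0.734 * N)
bare_resolved = round(0.191 * N)

for label, k in [("full adapter", full_resolved), ("bare adapter", bare_resolved)]:
    lo, hi = wilson_interval(k, N)
    print(f"{label}: {k}/{N} resolved, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With 350 instances the interval spans several percentage points on each side, so the bare-vs-full gap is far outside sampling noise, while differences of a few points between neighboring systems may not be.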

The cost accounting methodology deserves scrutiny. Costs depend on API provider pricing (which changes frequently), cache hit rates (which vary with provider policies), and other factors outside experimental control. The paper acknowledges these dependencies, but they somewhat undermine its "cost as first-class axis" claim.
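A small sketch illustrates how strongly cache hit rate can move the reported cost for an otherwise identical run. The pricing model (separate per-million-token prices for uncached input, cached input, and output) is a common provider scheme assumed here for illustration; the token counts and prices are hypothetical and are not the paper's accounting.

```python
def run_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cache_hit_rate: float,
    price_in: float,         # USD per 1M uncached input tokens (hypothetical)
    price_cached_in: float,  # USD per 1M cached input tokens (hypothetical)
    price_out: float,        # USD per 1M output tokens (hypothetical)
) -> float:
    """Cost of one run under an assumed three-tier token pricing scheme."""
    if not 0.0 <= cache_hit_rate <= 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1]")
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    total = uncached * price_in + cached * price_cached_in + output_tokens * price_out
    return total / 1_000_000


# Same hypothetical run, priced at two different cache hit rates.
for rate in (0.0, 0.9):
    cost = run_cost_usd(2_000_000, 50_000, rate, 1.0, 0.1, 4.0)
    print(f"cache hit rate {rate:.0%}: ${cost:.2f}")
```

Under these assumed prices the same token usage costs several times more without caching, which is why cost comparisons across providers need the cache policy reported alongside the dollar figure.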

3. Potential Impact

Practical utility: The benchmark fills a genuine need for standardized harness comparison. As the agent ecosystem grows, the ability to isolate harness contributions from model contributions becomes increasingly important for both researchers and practitioners. The Pareto frontier analysis (Figure 1) demonstrating that accuracy and cost are not monotonically related is a useful practical insight.
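For readers unfamiliar with the analysis, a cost-accuracy Pareto frontier can be computed as in the minimal sketch below. The system names and numbers are hypothetical placeholders, not values from the paper's Figure 1.

```python
from typing import NamedTuple


class Run(NamedTuple):
    name: str
    cost_usd: float   # total API cost for the benchmark (hypothetical values)
    pass_at_1: float  # percent of instances resolved (hypothetical values)


def pareto_frontier(runs: list[Run]) -> list[Run]:
    """Keep runs that no cheaper-or-equal run matches or beats on accuracy."""
    frontier: list[Run] = []
    best_acc = float("-inf")
    # Scan from cheapest to most expensive; on cost ties, try the more accurate first.
    for run in sorted(runs, key=lambda r: (r.cost_usd, -r.pass_at_1)):
        if run.pass_at_1 > best_acc:
            frontier.append(run)
            best_acc = run.pass_at_1
    return frontier


runs = [
    Run("A", 100.0, 60.0),
    Run("B", 150.0, 58.0),  # dominated: costs more than A and E, resolves less
    Run("C", 200.0, 70.0),
    Run("D", 400.0, 70.5),
    Run("E", 120.0, 65.0),
]
print([r.name for r in pareto_frontier(runs)])
```

In this toy data, system B is excluded because cheaper systems are more accurate, while D stays on the frontier despite doubling the cost of C for half a point, which is the kind of trade-off an accuracy-only leaderboard hides.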

Scope limitations: The impact is constrained to the coding-agent evaluation community, which, while growing, remains a niche within the broader AI/ML field. The benchmark does not introduce new tasks, new evaluation metrics, or new algorithmic approaches. The adapter protocol, while useful for the five evaluated harnesses, requires per-harness implementation effort and may not generalize trivially to fundamentally different agent architectures.

Community adoption risk: With 350 instances (and 80 in Lite), the benchmark is relatively small. The multilingual coverage (8 languages, 43 repos) is inherited from upstream sources rather than independently curated. Whether the community adopts this as a standard depends heavily on whether the harness-comparison framing resonates beyond the initial set of evaluated systems.

4. Timeliness & Relevance

The paper addresses a timely problem. The proliferation of coding agents (SWE-agent, OpenHands, AutoCodeRover, and now general-purpose agents like OpenClaw) has created genuine confusion about performance attribution. The paper correctly identifies that SWE-bench leaderboard entries conflate multiple factors, and the agent community needs better experimental controls.

The paper references models and systems from mid-2026 (GPT 5.5, Claude Opus 4.7, GLM 5.1, DeepSeek-V4), indicating it targets the current frontier. However, the rapid pace of model and harness development means the specific numerical results may become stale quickly; the lasting value would be in the evaluation framework itself.

5. Strengths & Limitations

Key Strengths:

Clear identification of the harness conflation problem in SWE-bench evaluation

Thorough standardization of experimental conditions (same prompt, budget, workspace, evaluator)

The bare-vs-full adapter comparison compellingly demonstrates adapter necessity

Cost-aware reporting with Pareto analysis is a welcome addition to accuracy-only reporting

Extensive reproducibility documentation (Appendices A-F are unusually detailed)

The Lite-80 subset methodology is principled and well-validated

Notable Weaknesses:

No new task instances—entirely derivative of existing benchmarks

Single-run results without variance estimates undermine statistical claims

Limited claw sweep (5 harnesses × 2 models) restricts interaction analysis

The "adapter protocol" contribution is more software engineering than research methodology

Cache hit rate variations across providers introduce confounds that are acknowledged but not resolved

The paper is extremely long (35+ pages including appendices) relative to its conceptual density, with extensive configuration documentation that reads more as a software manual than a research paper

Some self-referential quality: the benchmark is partly designed to evaluate OpenClaw, which is affiliated with the authors' organization (TokenRhythm Technologies)

Additional Observations:

The paper's framing occasionally overstates novelty. Prior work like SWE-Effi and HAL partially addressed the harness dimension; the contribution is in making it a fully controlled variable rather than identifying an entirely new problem. The Lite subset, while methodologically interesting, addresses primarily a cost/convenience concern rather than a scientific one.

The paper would benefit from fewer harness-specific implementation details (which dominate the appendices) and more analysis of *why* certain harness designs work better—the current results show that harness matters but offer limited insight into which harness design decisions drive performance.

Rating: 5/10

Significance 5.5 · Rigor 5 · Novelty 4 · Clarity 6

Generated Jun 11, 2026

Comparison History (19)

Lost vs. Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Paper 1 addresses a fundamental challenge in diffusion models—improving coverage of low-density regions without additional training—which has broad implications for generative modeling research. Its training-free approach to improving recall while maintaining FID is methodologically elegant and applicable across diffusion model applications. Paper 2, while useful as a benchmark contribution for coding agents, is more incremental and narrowly focused on evaluation infrastructure for a specific agent framework (OpenClaw). Benchmarks have impact, but Paper 1's novel sampling technique addresses a more fundamental and broadly applicable problem in generative AI.

claude-opus-4-6 · Jun 12, 2026

Lost vs. Uncertainty Estimation for Molecular Diffusion Models

Paper 1 introduces a principled methodological advancement for uncertainty estimation in molecular diffusion models, directly impacting critical applications like drug discovery and computational chemistry. In contrast, Paper 2 provides an engineering benchmark for LLM coding agents, which, while highly useful for software engineering AI, represents an incremental evaluation tool rather than a fundamental scientific breakthrough.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Navigating the Safety-Fidelity Trade-off: Massive-Variate Time Series Forecasting for Power Systems via Probabilistic Scenarios

Paper 2 likely has higher impact due to a substantial new benchmark at unprecedented scale (up to ~37k channels) grounded in AC power-flow physics, plus constraint-aware probabilistic metrics that formalize a broadly relevant safety–fidelity trade-off. Its real-world application domain (power-system operations) is high-stakes and timely, and the proposed model (PowerForge) is evaluated across multiple grids, baselines, and seeds. Paper 1 is valuable for agent evaluation standardization, but its contribution is narrower and more tooling/protocol-centric, with less cross-domain societal impact.

gpt-5.2 · Jun 12, 2026

Won vs. To GAN or Not To GAN: Segmentation Analysis on Mars DEM

Paper 2 introduces a novel, comprehensive benchmark for evaluating autonomous AI agents on coding tasks, addressing a critical bottleneck in a rapidly advancing and high-impact field (LLM agents). In contrast, Paper 1 is an application of standard deep learning techniques to a niche planetary science problem with somewhat negative results. Paper 2 offers broader applicability, greater methodological rigor, and higher relevance to current top-tier AI research.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Paper 2 is more likely to have higher scientific impact because it offers broadly applicable mechanistic insights into a widely used post-training paradigm (on-policy distillation), with findings on sparsity, layer/FFN concentration, optimizer dependence, and update geometry that can inform theory, optimization, efficiency (subnetwork training), and future algorithm design across LMs and VLMs. Paper 1 is valuable and timely as an evaluation benchmark/standardization effort for agent harnesses, but its impact is narrower (benchmarking infrastructure) and more contingent on community adoption compared to the generalizable training-dynamics insights of Paper 2.

gpt-5.2 · Jun 12, 2026

Won vs. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Paper 1 addresses a critical and highly timely challenge in the rapidly growing field of LLM agents: standardizing the evaluation of agent harnesses on coding tasks. Benchmarks like SWE-bench drive significant progress and attract high citations. Paper 2 presents a rigorous and useful method for ensemble compression, but operates in a more mature and narrower area of classical machine learning, making its broader impact and adoption likely lower compared to the immediate relevance of agentic coding benchmarks.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. PAWS: Preference Learning with Advantage-Weighted Segments

Paper 2 introduces a highly timely and relevant benchmark for evaluating autonomous LLM agents on complex coding tasks. In the current AI landscape, standardized benchmarks for agentic frameworks drive immense community adoption, resulting in high citation rates and broad impact across both academia and industry. While Paper 1 provides a valuable methodological algorithmic improvement for PbRL, Paper 2 offers foundational evaluation infrastructure that addresses a critical bottleneck in deploying real-world software engineering agents.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

Paper 2 likely has higher scientific impact: it contributes new theory (computable two-sided a posteriori error bounds) addressing a core trust/certification gap in PINNs, with clear methodological rigor and broadly relevant implications for scientific computing, control, and ML reliability. Its assumptions (localized monotonicity/one-sided Lipschitz) are weaker and more verifiable than prior global conditions, improving practicality and timeliness amid growing interest in trustworthy neural PDE/ODE solvers. Paper 1 is valuable infrastructure for agent evaluation, but its impact is more specialized to LLM tooling/benchmarks and may age faster as benchmarks shift.

gpt-5.2 · Jun 11, 2026

Lost vs. RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

RCAP addresses a fundamental and broadly applicable problem in machine learning—efficient training via dynamic dataset pruning with robustness to class imbalance. It demonstrates consistent improvements across 6 datasets, 5 models, and 3 training paradigms, showing strong generalizability. The finding that 10% of data can outperform full training on imbalanced datasets is practically significant, offering ~8.7x speedups. In contrast, Claw-SWE-Bench is a narrower benchmark contribution for evaluating coding agents, primarily serving the SWE-bench community. RCAP's broader applicability across ML domains and its methodological contribution give it higher potential impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Geodesics of Dynamic Graphs for Regime Change Detection

Paper 2 is more novel conceptually (regimes as geodesic trajectories in graph space) and advances methodology for a broad, longstanding problem (change point detection in evolving networks). Its approach is likely applicable across domains (social, mobility, biological, physical systems) and is timely given increased interest in temporal graphs. Paper 1 is valuable and practical for agent evaluation, but it is primarily an engineering/benchmarking contribution with impact concentrated in LLM/agent tooling; similar benchmarks proliferate, which may limit broader scientific novelty.

gpt-5.2 · Jun 11, 2026
