OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

May 28, 2026

arXiv:2605.29253v1

cs.AI (primary)

#1052 of 2821 · Artificial Intelligence

Tournament Score: 1437±46

Win Rate: 65%

Rating: 5.8/10

Significance 6.5 · Rigor 5.5 · Novelty 6 · Clarity 5

Abstract

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: OpenClawBench

1. Core Contribution

OpenClawBench addresses the "Outcome-Process Gap" — the observation that LLM agents can pass task-level oracles while exhibiting process-level anomalies such as unsafe writes, ignored errors, unresolved ambiguity, and capability overcommitment. The paper contributes: (a) a 31,264-trajectory dataset from 6 source models on BFCL tasks with multi-layered annotations (binary labels, 5-class taxonomy, onset localization, severity, recoverability); (b) FullTax, a multi-stage silver annotation protocol with 96% agreement against a 300-trajectory human audit; and (c) a LoRA-fine-tuned Gemma 3 12B detector achieving F1=0.729, outperforming GPT-5.4 zero-shot by +0.302.

The central insight — that 9.33% of oracle-passing executions contain process anomalies, rising to 92.7% among high-risk oracle-passing runs — is intuitive but important to quantify empirically.
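The 9.33% figure can be checked against the counts quoted in the abstract. The following is a minimal sketch, assuming the 31,135 oracle-passing executions and the 2,904 FullTax-anomalous ones are the denominator and numerator behind the reported rate; the function name `gap_rate` is illustrative and does not come from the paper.

```python
# Counts quoted in the abstract.
ORACLE_PASSING = 31_135           # executions that pass the task oracle
ANOMALOUS_AMONG_PASSING = 2_904   # of those, labeled process-anomalous by FullTax


def gap_rate(anomalous: int, passing: int) -> float:
    """Share of oracle-passing executions that are still process-anomalous.

    Hypothetical helper for illustration; not part of OpenClawBench.
    """
    if passing <= 0:
        raise ValueError("passing must be positive")
    if not 0 <= anomalous <= passing:
        raise ValueError("anomalous must lie between 0 and passing")
    return anomalous / passing


rate = gap_rate(ANOMALOUS_AMONG_PASSING, ORACLE_PASSING)
print(f"Outcome-Process Gap rate: {rate:.2%}")
```

Under that assumption the computed share rounds to the 9.33% quoted in the review.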

2. Methodological Rigor

Strengths in methodology:

The pipeline is meticulously documented: trajectory normalization, ReAct-style structuring, oracle fusion, risk slicing, and four-stage FullTax annotation with quality tiers. This level of procedural transparency is commendable.

The separation of oracle outcome from anomaly labels is conceptually clean and well-motivated.

Cross-backbone hold-out evaluation (removing gpt-oss-20B from training) shows the detector generalizes, with only -0.026 F1 degradation.

The class-balancing ablation and confusion matrix analysis add useful diagnostic depth.

Weaknesses in methodology:

The silver labels are entirely LLM-generated (DeepSeek-family model). The 300-trajectory human audit (96% agreement) was conducted by the authors themselves rather than by independent annotators, so it offers only limited independent evidence of label reliability.

The "both-high-confidence" filtering reduces the supervised pool from 30,398 to ~26,500 trajectories, which amounts to cherry-picking the cases where the LLM judge is most confident; this inflates apparent annotation quality.

The 5-class taxonomy is heavily imbalanced: capability_gap_overcommitment (52%) and write_under_unresolved_ambiguity (34%) dominate, while three tail classes have <35 test examples each, making the reported macro-F1 effectively a 3-class metric as the authors acknowledge.

The detector comparison is somewhat unfair: a fine-tuned specialized model vs. GPT-5.4 zero-shot. The +0.302 improvement largely reflects calibration (reducing over-prediction from 42% to 18%) rather than fundamental detection capability, as both achieve ~82% recall.
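The calibration point in the last item can be made concrete by solving the F1 definition for precision. This is a rough sketch, assuming both detectors sit at exactly 0.82 recall (the review only says "~82%") and that the zero-shot F1 is 0.729 minus the reported 0.302 gain; the resulting precision values are back-of-the-envelope estimates, not numbers reported by the paper.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and recall R."""
    denom = 2 * recall - f1
    if denom <= 0:
        raise ValueError("no valid precision exists for this F1/recall pair")
    return f1 * recall / denom


RECALL = 0.82                        # assumption: identical for both detectors
f1_finetuned = 0.729                 # reported for the LoRA-fine-tuned detector
f1_zero_shot = f1_finetuned - 0.302  # derived from the reported +0.302 gap

p_ft = implied_precision(f1_finetuned, RECALL)
p_zs = implied_precision(f1_zero_shot, RECALL)

print(f"fine-tuned: F1={f1_finetuned:.3f}, implied precision={p_ft:.3f}")
print(f"zero-shot:  F1={f1_zero_shot:.3f}, implied precision={p_zs:.3f}")
```

With equal recall, the whole F1 difference has to come from precision: under these assumptions the implied precision is roughly 0.66 for the fine-tuned detector and roughly 0.29 for the zero-shot baseline, which fits the review's reading that the gain is mostly reduced over-prediction.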

3. Potential Impact

The paper addresses a genuinely important gap in agent evaluation. Current benchmarks focus almost exclusively on task completion, while deployed agents increasingly need runtime monitoring for safety-critical applications. The process-anomaly framing could influence:

Agent safety research: Providing a concrete, trainable signal beyond pass/fail evaluation.

Runtime monitoring systems: The 12B detector, which the paper describes as deployable on commodity hardware (8×A100), is practically relevant.

Evaluation methodology: The outcome-process gap concept could become a standard consideration in agent benchmark design.

However, the practical impact is bounded by several factors: the dataset covers only BFCL function-calling tasks (not web, code, or multi-modal agent settings); only 6 open-weight source models are represented; and the taxonomy is English-only and BFCL-specific. The generalizability to production agent deployments remains undemonstrated.

4. Timeliness & Relevance

This is highly timely. As LLM agents are deployed in production (coding assistants, customer service, autonomous workflows), the gap between task success and process reliability becomes a real operational concern. The paper cites contemporaneous work (Wink, Auditable Agents, TrajAD) showing this is an active research front. OpenClawBench fills a specific niche: naturally-arising (not synthetically injected) process anomalies with structured supervision.

5. Strengths & Limitations

Key Strengths:

The conceptual framing of the Outcome-Process Gap is clear, useful, and likely to be adopted by the community.

The dataset scale (31K trajectories, 122K steps) and annotation depth (multi-stage, quality-tiered) are substantial.

The paper is extraordinarily thorough in documentation — the appendices provide complete reproduction details for every pipeline stage.

The finding that a small fine-tuned model outperforms a frontier model via better calibration is practically useful.

Notable Limitations:

The definition of "process anomaly" is somewhat circular: it is whatever FullTax labels as anomalous, which is whatever the DeepSeek LLM judge identifies. The 300-trajectory author-conducted audit provides limited independent validation.

The 9.33% anomaly rate among oracle-passing trajectories is presented as the headline finding, but many of these may be benign: the silver judge over-flags in 10 of the 12 cases where it disagrees with the human auditors.

The paper is excessively long and repetitive — the same statistics (31,264 trajectories, 96% agreement, F1=0.729, +0.302 over GPT-5.4) are repeated dozens of times. The 37-page appendix, while thorough, makes the contribution harder to evaluate.

The taxonomy's empirical grounding is unclear — were the 5 subtypes discovered from data or imposed a priori? The 3 "absorbed" candidates suggest some post-hoc adjustment.

No analysis of what downstream consequences process anomalies have — do they correlate with user harm, cost, or downstream failures?
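The audit figures cited in the limitations above are mutually consistent, which a short calculation shows. This sketch assumes the 96% agreement is measured over all 300 audited trajectories and is not heavily rounded; the variable names are illustrative.

```python
AUDITED = 300      # trajectories in the human audit
AGREEMENT = 0.96   # reported silver-vs-human agreement
OVER_FLAGS = 10    # disagreements where the silver judge flagged and humans did not

# Disagreements implied by the agreement rate.
disagreements = round(AUDITED * (1 - AGREEMENT))

# Share of disagreements that are over-flags, and the same count
# expressed as a share of all audited trajectories.
over_flag_share = OVER_FLAGS / disagreements
over_flag_rate = OVER_FLAGS / AUDITED

print(f"disagreements: {disagreements}")
print(f"over-flags among disagreements: {over_flag_share:.1%}")
print(f"over-flags among all audited:   {over_flag_rate:.1%}")
```

The implied 12 disagreements match the "10/12" figure, so the judge's errors in the audit sample lean toward false positives, though 12 cases are too few to estimate an over-flag rate precisely.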

Additional Observations:

The paper claims the detector is "deployable" on commodity hardware but requires 8×A100 GPUs, which is far from commodity for most practitioners.

The cross-backbone generalization test uses only one held-out backbone; broader generalization claims are appropriately caveated but still limited.

Reproducibility appears strong given the detailed documentation, though the reliance on specific LLM judges (DeepSeek) and agent frameworks (OpenClaw) creates dependencies.

Overall, OpenClawBench makes a meaningful contribution by operationalizing process-level agent auditing with a concrete dataset and baseline detector. The conceptual contribution (Outcome-Process Gap) is likely more impactful than the specific dataset, which is narrowly scoped to BFCL tasks. The work would benefit from stronger independent human validation, broader task coverage, and analysis of the practical consequences of detected anomalies.

Rating: 5.8/10

Significance 6.5 · Rigor 5.5 · Novelty 6 · Clarity 5

Generated May 29, 2026

Comparison History (20)

vs. KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical and broad issue in AI agent evaluation by exposing the 'Outcome-Process Gap' where task success hides dangerous process anomalies. By providing a large-scale benchmark and taxonomy for agent safety and reliability, it offers foundational infrastructure that will broadly impact the rapidly growing field of autonomous agents. While Paper 1 presents an innovative multimodal approach, its impact is more narrowly confined to the time series forecasting domain.

vs. Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical bottleneck in the real-world deployment of autonomous agents: process safety and reliability. By exposing the 'Outcome-Process Gap' and providing a large-scale benchmark, it challenges the standard success-only evaluation paradigm. This has profound implications for AI safety, auditing, and operational monitoring, giving it broader and more urgent real-world impact compared to Paper 1's narrower methodological improvement in reasoning distillation.

vs. Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

claude-opus-4.6 · 5/29/2026

Paper 2 introduces a novel, training-free diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation by making failure dynamics observable through logits alone. Its practical early-stop rule that cuts jailbreaks by >50% with no false alarms on benign queries offers immediate real-world applicability. The conceptual shift from binary outcome evaluation to temporal process observation is elegantly simple yet broadly applicable. Paper 1, while addressing an important gap (outcome vs. process anomalies), is more narrowly focused on agent trajectory benchmarking with moderate detector performance (F1=0.729) and relies on a specific pipeline setup, limiting its broader impact.

vs. Teaching Values to Machines: Simulating Human-Like Behavior in LLMs

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical, timely bottleneck in autonomous AI: the 'Outcome-Process Gap' where task success masks dangerous or erroneous agent behaviors. By providing a large-scale dataset (OpenClawBench) and demonstrating that nearly 10% of 'successful' executions contain anomalies, it directly challenges current evaluation paradigms. While Paper 2 is an impressive interdisciplinary study on LLM personas, Paper 1 provides foundational infrastructure essential for the safe, reliable real-world deployment of agentic systems, giving it broader immediate utility and impact in AI safety and engineering.

vs. Anchorless Diversification for Parallel LLM Ideation

gpt-5.2 · 5/29/2026

Paper 1 likely has higher impact: it introduces a large, annotated benchmark (31k real agent trajectories) targeting a timely, safety-critical gap (Outcome–Process Gap) in agent evaluation, with structured taxonomy, localization, and severity labels enabling broad downstream research (reliability, safety, monitoring, training, auditing). The dataset + supervision pipeline and demonstrated detector performance suggest methodological rigor and immediate applicability. Paper 2 offers useful inference-time diversification techniques for ideation, but is more niche and incremental, with narrower cross-field impact and fewer durable artifacts than a benchmark that can standardize evaluation.

vs. MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical and highly novel challenge in AI safety: the 'Outcome-Process Gap' in autonomous agents. As the deployment of real-world agents accelerates, evaluating execution safety rather than just task success is essential. Its large-scale dataset and comprehensive taxonomy provide foundational infrastructure for auditing agent reliability, offering broader applicability and timeliness compared to the more specific multimodal harm detection focus of Paper 1.

vs. Revealing Algorithmic Deductive Circuits for Logical Reasoning

claude-opus-4.6 · 5/29/2026

Paper 1 provides fundamental mechanistic insights into how LLMs perform reasoning, identifying specific attention heads responsible for deductive steps and revealing how higher layers integrate information for global reasoning strategies. This mechanistic interpretability work has broad implications for understanding, improving, and debugging LLM reasoning capabilities. Paper 2 introduces a useful benchmark for detecting process anomalies in agent executions, but is more narrowly scoped as an applied evaluation resource. Paper 1's contributions to understanding transformer internals during reasoning are more foundational and likely to influence a wider range of future research.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical and highly timely challenge in the deployment of autonomous LLM agents: process safety and reliability despite apparent task success. As agentic AI rapidly expands into real-world applications, auditable evaluation frameworks like OpenClawBench will have broad, immediate impact across academia and industry. Paper 2 provides rigorous insights into reasoning failures, but its focus on masked diffusion models—a narrower niche in current language modeling—limits its comparative breadth and immediate practical impact.

vs. Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models

gpt-5.2 · 5/29/2026

Paper 2 is more novel and timely: it formalizes the Outcome-Process Gap and provides a large, richly annotated benchmark (31k+ trajectories) with taxonomy, evidence, localization, severity, and recoverability—enabling broad methodological and applied work on agent reliability, auditing, and safety across many domains. It offers concrete supervision and baseline detector results, supporting real-world deployment monitoring. Paper 1 is a useful, rigorous benchmark within EEG transformers, but is narrower in scope (positional encoding variants on limited tasks) and primarily confirms task-dependence rather than introducing a broadly enabling new resource or paradigm.

vs. Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

gemini-3.1 · 5/29/2026

Paper 1 addresses a critical and timely challenge in autonomous AI agents—detecting process anomalies masked by apparent task success (AI safety and reliability). Its contribution of a large-scale dataset (31k trajectories) and measurable evaluation metrics provides foundational infrastructure for agentic AI research. Paper 2 offers a valuable educational application, but its evaluation is limited to a single course with a small sample size, making Paper 1's methodological rigor and breadth of impact across the rapidly growing AI field significantly higher.

vs. The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

gemini-3.1 · 5/29/2026

Paper 2 discovers a fundamental and surprising phenomenon (an inverse scaling law in robustness to distractor instructions) and provides both mechanistic insights and an RL-based solution. Uncovering core weaknesses in current LLM scaling and alignment paradigms has broader foundational implications for the field than the creation of a new benchmarking dataset, making Paper 2 more likely to drive significant future research.

vs. SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

gemini-3.1 · 5/29/2026

Paper 1 pioneers a novel application of generative AI in hardware design by introducing the first LLM for PCB schematic generation. By proposing a semantic-grounded code representation to overcome tool-specific geometric syntax, it solves a major bottleneck in Electronic Design Automation (EDA). This breakthrough has immense real-world industrial applicability and bridges the gap between AI and hardware engineering. While Paper 2 offers a valuable benchmark for agent reliability, Paper 1 introduces a foundational capability in a highly specialized, economically critical domain, giving it a higher potential for disruptive scientific and industrial impact.

vs. Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

gemini-3.1 · 5/29/2026

Paper 1 proposes a fundamental paradigm shift in LLM reasoning and agent aggregation, moving from answer-level consensus (majority voting) to trace-level synthesis. By demonstrating that an aggregator can recover correct solutions even when agents unanimously fail (the aggregation paradox), it unlocks higher performance ceilings for test-time compute. This approach has broad, immediate applications across math, coding, and scientific reasoning. While Paper 2 provides a valuable benchmark for agent safety and reliability, Paper 1's algorithmic innovation addresses a core bottleneck in scaling reasoning capabilities, making its potential scientific and practical impact significantly higher.

vs. MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to strong real-world applicability (therapeutic molecular design), broad cross-field relevance (LLMs + cheminformatics + docking/structural biology), and a concrete methodological innovation (fragment-based molecule-native representation plus tool-using multi-agent workflow) validated on multiple benchmarks with large reported gains. Paper 1 is timely and valuable for agent reliability auditing, but as a benchmark/dataset contribution its impact is narrower and more incremental relative to existing evaluation efforts, and its immediate downstream societal/industrial leverage is less direct than improved molecular design pipelines.

vs. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

gemini-3.1 · 5/29/2026

Paper 1 presents a large-scale, rigorous benchmark (31k+ trajectories) addressing a critical issue in AI safety: the 'Outcome-Process Gap' in autonomous agents. By formalizing process-side anomalies that are masked by task success, it offers high value for deploying reliable real-world AI. In contrast, while Paper 2 explores a novel intersection of biosecurity and SAEs, it explicitly acknowledges its preliminary, hackathon-level methodology and very small sample size (n=75). Therefore, Paper 1 has significantly higher potential for immediate, broad scientific and practical impact.

vs. Mind Your Tone: Does Tone Alter LLM Performance?

gpt-5.2 · 5/29/2026

Paper 1 has higher impact: it introduces a large, annotated benchmark (31k real agent trajectories) targeting an under-evaluated but deployment-critical issue (process-side anomalies despite task success), with a structured taxonomy and supervision enabling auditing, monitoring, and model training. This offers immediate real-world applications in agent reliability/safety and a reusable resource likely to drive follow-on work across evaluation, alignment, and systems. Paper 2 is timely and useful but narrower (prompt tone effects on MCQ accuracy) and less methodologically and infrastructurally transformative.

vs. Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental flaw in agent evaluation by exposing the 'Outcome-Process Gap', where successful task completion masks unsafe or erroneous behaviors. Its large-scale dataset and novel taxonomy for process-side anomalies provide a critical foundation for auditing real-world autonomous agents, likely driving broader methodological shifts across the field compared to the narrower focus on XAI faithfulness in Paper 1.

vs. When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

gemini-3.1 · 5/29/2026

Paper 2 introduces a large-scale benchmark dataset (OpenClawBench) addressing a critical flaw in current agent evaluation: the Outcome-Process Gap. By providing structured anomaly supervision and a trained detector for real-world agent trajectories, it offers broad utility for the rapidly growing field of autonomous agents. While Paper 1 provides valuable insights for policy analysis, Paper 2's foundational dataset and taxonomy are likely to drive broader methodological advancements and higher citation impact across the general AI/ML community.

vs. Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

gpt-5.2 · 5/29/2026

Paper 1 likely has higher scientific impact due to its technical novelty (formalizing the Outcome-Process Gap), creation of a large-scale annotated benchmark (31k+ trajectories) with rich supervision (taxonomy, evidence, localization, severity), and a demonstrated baseline detector, enabling broad, reusable evaluation and monitoring of agent reliability across AI systems. Its applications span safety, auditing, and deployment of autonomous agents, making it timely and widely relevant. Paper 2 is useful but limited by small sample size (n=72), cross-sectional design, and mainly exploratory measurement development, reducing generalizability and methodological rigor.

vs. On the Geometry of Games and their Solvers

gemini-3.1 · 5/29/2026

Paper 2 proposes a unifying geometric framework for equilibrium computation, bridging discrete taxonomies in game theory and optimization. Its foundational theoretical insights have broad applicability across machine learning (e.g., GANs, MARL) and economics. While Paper 1 provides a valuable and timely benchmark for LLM agents, Paper 2's novel paradigm for understanding solver dynamics offers deeper, more generalizable scientific impact across multiple disciplines.