OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han
Abstract
Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
AI Impact Assessments
Scientific Impact Assessment: OpenClawBench
1. Core Contribution
OpenClawBench addresses the "Outcome-Process Gap" — the observation that LLM agents can pass task-level oracles while exhibiting process-level anomalies such as unsafe writes, ignored errors, unresolved ambiguity, and capability overcommitment. The paper contributes: (a) a 31,264-trajectory dataset from 6 source models on BFCL tasks with multi-layered annotations (binary labels, 5-class taxonomy, onset localization, severity, recoverability); (b) FullTax, a multi-stage silver annotation protocol with 96% agreement against a 300-trajectory human audit; and (c) a LoRA-fine-tuned Gemma 3 12B detector achieving F1=0.729, outperforming GPT-5.4 zero-shot by +0.302.
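To make the annotation layers listed above concrete, one trajectory record could be sketched as follows. This is an illustrative guess at a schema, not the dataset's actual format: every class and field name is hypothetical, the severity value is a free-form string because the source does not give a scale, and the five enum members simply reuse the five anomaly kinds named in the abstract, which may not match the paper's 5-class taxonomy exactly.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class AnomalyClass(Enum):
    # Assumption: these mirror the five anomaly kinds named in the abstract.
    UNRESOLVED_AMBIGUITY = "unresolved_ambiguity"
    UNSAFE_EXTERNAL_WRITE = "unsafe_external_write"
    IGNORED_ERROR = "ignored_error"
    WEAKLY_GROUNDED_COMMITMENT = "weakly_grounded_commitment"
    CAPABILITY_OVERCOMMITMENT = "capability_overcommitment"


@dataclass
class TrajectoryAnnotation:
    """Hypothetical record combining the outcome and process layers."""
    trajectory_id: str
    oracle_passed: bool                       # task-oracle outcome
    is_anomalous: bool                        # binary process label
    anomaly_class: Optional[AnomalyClass] = None
    evidence: List[str] = field(default_factory=list)  # supporting excerpts
    onset_step: Optional[int] = None          # first step where the anomaly appears
    span: Optional[Tuple[int, int]] = None    # inclusive (start, end) step range
    severity: Optional[str] = None            # scale not specified in the source
    recoverable: Optional[bool] = None

    def in_outcome_process_gap(self) -> bool:
        # A run sits in the gap when it passes the oracle but is process-anomalous.
        return self.oracle_passed and self.is_anomalous


# Made-up example record, for illustration only.
example = TrajectoryAnnotation(
    trajectory_id="demo-0001",
    oracle_passed=True,
    is_anomalous=True,
    anomaly_class=AnomalyClass.IGNORED_ERROR,
    evidence=["tool returned an error; agent continued without handling it"],
    onset_step=3,
    span=(3, 5),
    severity="medium",
    recoverable=True,
)
print(example.in_outcome_process_gap())
```

Keeping the oracle outcome and the process label as separate fields is what lets a single record express the mismatch the paper studies.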
The central insight — that 9.33% of oracle-passing executions contain process anomalies, rising to 92.7% among high-risk oracle-passing runs — is intuitive but important to quantify empirically.
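The 9.33% figure can be checked directly against the counts reported in the abstract. The short sketch below does only that arithmetic; the variable and function names are illustrative, and the 92.7% high-risk figure is not recomputed because the source does not give its underlying counts.

```python
# Counts as reported in the abstract.
TOTAL_TRAJECTORIES = 31_264
ORACLE_PASSING = 31_135
PASSING_BUT_ANOMALOUS = 2_904


def rate(part: int, whole: int) -> float:
    """Return part / whole, rejecting an empty denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


# Share of oracle-passing runs that are still labeled process-anomalous.
gap_rate = rate(PASSING_BUT_ANOMALOUS, ORACLE_PASSING)

# Share of all annotated trajectories that pass the task oracle.
pass_rate = rate(ORACLE_PASSING, TOTAL_TRAJECTORIES)

print(f"Outcome-Process Gap rate: {gap_rate:.2%}")  # 9.33%
print(f"Oracle pass rate: {pass_rate:.2%}")
```

The denominator is the oracle-passing subset rather than the full dataset, since the gap is defined only over runs that a success-only evaluation would accept.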
2. Methodological Rigor
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper addresses a genuinely important gap in agent evaluation. Current benchmarks focus almost exclusively on task completion, while deployed agents increasingly need runtime monitoring for safety-critical applications. The process-anomaly framing could influence:
However, the practical impact is bounded by several factors: the dataset covers only BFCL function-calling tasks (not web, code, or multi-modal agent settings); only 6 open-weight source models are represented; and the taxonomy is English-only and BFCL-specific. The generalizability to production agent deployments remains undemonstrated.
4. Timeliness & Relevance
This is highly timely. As LLM agents are deployed in production (coding assistants, customer service, autonomous workflows), the gap between task success and process reliability becomes a real operational concern. The paper cites contemporaneous work (Wink, Auditable Agents, TrajAD) showing this is an active research front. OpenClawBench fills a specific niche: naturally-arising (not synthetically injected) process anomalies with structured supervision.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations:
Overall, OpenClawBench makes a meaningful contribution by operationalizing process-level agent auditing with a concrete dataset and baseline detector. The conceptual contribution (Outcome-Process Gap) is likely more impactful than the specific dataset, which is narrowly scoped to BFCL tasks. The work would benefit from stronger independent human validation, broader task coverage, and analysis of the practical consequences of detected anomalies.
Generated May 29, 2026
Comparison History (20)
Paper 2 addresses a critical and broad issue in AI agent evaluation by exposing the 'Outcome-Process Gap' where task success hides dangerous process anomalies. By providing a large-scale benchmark and taxonomy for agent safety and reliability, it offers foundational infrastructure that will broadly impact the rapidly growing field of autonomous agents. While Paper 1 presents an innovative multimodal approach, its impact is more narrowly confined to the time series forecasting domain.
Paper 2 addresses a critical bottleneck in the real-world deployment of autonomous agents: process safety and reliability. By exposing the 'Outcome-Process Gap' and providing a large-scale benchmark, it challenges the standard success-only evaluation paradigm. This has profound implications for AI safety, auditing, and operational monitoring, giving it broader and more urgent real-world impact compared to Paper 1's narrower methodological improvement in reasoning distillation.
Paper 2 introduces a novel, training-free diagnostic framework (TLO) that addresses a fundamental limitation in LLM safety evaluation by making failure dynamics observable through logits alone. Its practical early-stop rule that cuts jailbreaks by >50% with no false alarms on benign queries offers immediate real-world applicability. The conceptual shift from binary outcome evaluation to temporal process observation is elegantly simple yet broadly applicable. Paper 1, while addressing an important gap (outcome vs. process anomalies), is more narrowly focused on agent trajectory benchmarking with moderate detector performance (F1=0.729) and relies on a specific pipeline setup, limiting its broader impact.
Paper 1 addresses a critical, timely bottleneck in autonomous AI: the 'Outcome-Process Gap' where task success masks dangerous or erroneous agent behaviors. By providing a large-scale dataset (OpenClawBench) and demonstrating that nearly 10% of 'successful' executions contain anomalies, it directly challenges current evaluation paradigms. While Paper 2 is an impressive interdisciplinary study on LLM personas, Paper 1 provides foundational infrastructure essential for the safe, reliable real-world deployment of agentic systems, giving it broader immediate utility and impact in AI safety and engineering.
Paper 1 likely has higher impact: it introduces a large, annotated benchmark (31k real agent trajectories) targeting a timely, safety-critical gap (Outcome–Process Gap) in agent evaluation, with structured taxonomy, localization, and severity labels enabling broad downstream research (reliability, safety, monitoring, training, auditing). The dataset + supervision pipeline and demonstrated detector performance suggest methodological rigor and immediate applicability. Paper 2 offers useful inference-time diversification techniques for ideation, but is more niche and incremental, with narrower cross-field impact and fewer durable artifacts than a benchmark that can standardize evaluation.
Paper 2 addresses a critical and highly novel challenge in AI safety: the 'Outcome-Process Gap' in autonomous agents. As the deployment of real-world agents accelerates, evaluating execution safety rather than just task success is essential. Its large-scale dataset and comprehensive taxonomy provide foundational infrastructure for auditing agent reliability, offering broader applicability and timeliness compared to the more specific multimodal harm detection focus of Paper 1.
Paper 1 provides fundamental mechanistic insights into how LLMs perform reasoning, identifying specific attention heads responsible for deductive steps and revealing how higher layers integrate information for global reasoning strategies. This mechanistic interpretability work has broad implications for understanding, improving, and debugging LLM reasoning capabilities. Paper 2 introduces a useful benchmark for detecting process anomalies in agent executions, but is more narrowly scoped as an applied evaluation resource. Paper 1's contributions to understanding transformer internals during reasoning are more foundational and likely to influence a wider range of future research.
Paper 1 addresses a critical and highly timely challenge in the deployment of autonomous LLM agents: process safety and reliability despite apparent task success. As agentic AI rapidly expands into real-world applications, auditable evaluation frameworks like OpenClawBench will have broad, immediate impact across academia and industry. Paper 2 provides rigorous insights into reasoning failures, but its focus on masked diffusion models—a narrower niche in current language modeling—limits its comparative breadth and immediate practical impact.
Paper 2 is more novel and timely: it formalizes the Outcome-Process Gap and provides a large, richly annotated benchmark (31k+ trajectories) with taxonomy, evidence, localization, severity, and recoverability—enabling broad methodological and applied work on agent reliability, auditing, and safety across many domains. It offers concrete supervision and baseline detector results, supporting real-world deployment monitoring. Paper 1 is a useful, rigorous benchmark within EEG transformers, but is narrower in scope (positional encoding variants on limited tasks) and primarily confirms task-dependence rather than introducing a broadly enabling new resource or paradigm.
Paper 1 addresses a critical and timely challenge in autonomous AI agents—detecting process anomalies masked by apparent task success (AI safety and reliability). Its contribution of a large-scale dataset (31k trajectories) and measurable evaluation metrics provides foundational infrastructure for agentic AI research. Paper 2 offers a valuable educational application, but its evaluation is limited to a single course with a small sample size, making Paper 1's methodological rigor and breadth of impact across the rapidly growing AI field significantly higher.
Paper 2 discovers a fundamental and surprising phenomenon (an inverse scaling law in robustness to distractor instructions) and provides both mechanistic insights and an RL-based solution. Uncovering core weaknesses in current LLM scaling and alignment paradigms has broader foundational implications for the field than the creation of a new benchmarking dataset, making Paper 2 more likely to drive significant future research.
Paper 1 pioneers a novel application of generative AI in hardware design by introducing the first LLM for PCB schematic generation. By proposing a semantic-grounded code representation to overcome tool-specific geometric syntax, it solves a major bottleneck in Electronic Design Automation (EDA). This breakthrough has immense real-world industrial applicability and bridges the gap between AI and hardware engineering. While Paper 2 offers a valuable benchmark for agent reliability, Paper 1 introduces a foundational capability in a highly specialized, economically critical domain, giving it a higher potential for disruptive scientific and industrial impact.
Paper 1 proposes a fundamental paradigm shift in LLM reasoning and agent aggregation, moving from answer-level consensus (majority voting) to trace-level synthesis. By demonstrating that an aggregator can recover correct solutions even when agents unanimously fail (the aggregation paradox), it unlocks higher performance ceilings for test-time compute. This approach has broad, immediate applications across math, coding, and scientific reasoning. While Paper 2 provides a valuable benchmark for agent safety and reliability, Paper 1's algorithmic innovation addresses a core bottleneck in scaling reasoning capabilities, making its potential scientific and practical impact significantly higher.
Paper 2 has higher potential impact due to strong real-world applicability (therapeutic molecular design), broad cross-field relevance (LLMs + cheminformatics + docking/structural biology), and a concrete methodological innovation (fragment-based molecule-native representation plus tool-using multi-agent workflow) validated on multiple benchmarks with large reported gains. Paper 1 is timely and valuable for agent reliability auditing, but as a benchmark/dataset contribution its impact is narrower and more incremental relative to existing evaluation efforts, and its immediate downstream societal/industrial leverage is less direct than improved molecular design pipelines.
Paper 1 presents a large-scale, rigorous benchmark (31k+ trajectories) addressing a critical issue in AI safety: the 'Outcome-Process Gap' in autonomous agents. By formalizing process-side anomalies that are masked by task success, it offers high value for deploying reliable real-world AI. In contrast, while Paper 2 explores a novel intersection of biosecurity and SAEs, it explicitly acknowledges its preliminary, hackathon-level methodology and very small sample size (n=75). Therefore, Paper 1 has significantly higher potential for immediate, broad scientific and practical impact.
Paper 1 has higher impact: it introduces a large, annotated benchmark (31k real agent trajectories) targeting an under-evaluated but deployment-critical issue (process-side anomalies despite task success), with a structured taxonomy and supervision enabling auditing, monitoring, and model training. This offers immediate real-world applications in agent reliability/safety and a reusable resource likely to drive follow-on work across evaluation, alignment, and systems. Paper 2 is timely and useful but narrower (prompt tone effects on MCQ accuracy) and less methodologically and infrastructurally transformative.
Paper 2 addresses a fundamental flaw in agent evaluation by exposing the 'Outcome-Process Gap', where successful task completion masks unsafe or erroneous behaviors. Its large-scale dataset and novel taxonomy for process-side anomalies provide a critical foundation for auditing real-world autonomous agents, likely driving broader methodological shifts across the field compared to the narrower focus on XAI faithfulness in Paper 1.
Paper 2 introduces a large-scale benchmark dataset (OpenClawBench) addressing a critical flaw in current agent evaluation: the Outcome-Process Gap. By providing structured anomaly supervision and a trained detector for real-world agent trajectories, it offers broad utility for the rapidly growing field of autonomous agents. While Paper 1 provides valuable insights for policy analysis, Paper 2's foundational dataset and taxonomy are likely to drive broader methodological advancements and higher citation impact across the general AI/ML community.
Paper 1 likely has higher scientific impact due to its technical novelty (formalizing the Outcome-Process Gap), creation of a large-scale annotated benchmark (31k+ trajectories) with rich supervision (taxonomy, evidence, localization, severity), and a demonstrated baseline detector, enabling broad, reusable evaluation and monitoring of agent reliability across AI systems. Its applications span safety, auditing, and deployment of autonomous agents, making it timely and widely relevant. Paper 2 is useful but limited by small sample size (n=72), cross-sectional design, and mainly exploratory measurement development, reducing generalizability and methodological rigor.
Paper 2 proposes a unifying geometric framework for equilibrium computation, bridging discrete taxonomies in game theory and optimization. Its foundational theoretical insights have broad applicability across machine learning (e.g., GANs, MARL) and economics. While Paper 1 provides a valuable and timely benchmark for LLM agents, Paper 2's novel paradigm for understanding solver dynamics offers deeper, more generalizable scientific impact across multiple disciplines.