Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents
Bowen Ye, Rang Li, Qibin Yang, Yuanxin Liu, Linli Yao, Hanglong Lv, Zhihui Xie, Chenxin An
Abstract
Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
AI Impact Assessments (3 models)
Scientific Impact Assessment: Claw-Eval
1. Core Contribution
Claw-Eval addresses three identified gaps in autonomous agent evaluation: trajectory-opaque grading, underspecified safety/robustness evaluation, and narrow modality coverage. The benchmark provides 300 human-verified tasks across 9 categories, evaluated through three independent evidence channels (execution traces, audit logs, environment snapshots), with 2,159 fine-grained rubric items. The key conceptual contribution is the *temporal firewall* separating execution from grading—no evaluation artifacts exist in the sandbox during agent execution, preventing evaluation-aware behavior.
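To make the difference between output-only and trajectory-aware grading concrete, here is a minimal Python sketch. It is not Claw-Eval's grader: the `Evidence` record, the `FORBIDDEN` action set, and both grading functions are hypothetical names introduced for illustration, and only two of the three evidence channels are modelled.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Evidence:
    """Hypothetical record of one finished run, collected after execution."""
    final_output: str
    audit_log: List[str] = field(default_factory=list)  # actions the environment recorded


# Hypothetical safety rule: actions the agent must never take for this task.
FORBIDDEN: Set[str] = {"delete_user_data"}


def output_only_grade(ev: Evidence, expected: str) -> bool:
    """Trajectory-opaque check: looks at the final output and nothing else."""
    return ev.final_output == expected


def trajectory_aware_grade(ev: Evidence, expected: str) -> bool:
    """Passes only if the output is right AND no forbidden action was logged."""
    violated = any(action in FORBIDDEN for action in ev.audit_log)
    return ev.final_output == expected and not violated


# A run that reaches the right answer through an unsafe intermediate step.
run = Evidence(
    final_output="report sent",
    audit_log=["read_inbox", "delete_user_data", "send_report"],
)
print(output_only_grade(run, "report sent"))       # True
print(trajectory_aware_grade(run, "report sent"))  # False
```

Because both graders read only the recorded `Evidence` after the run has ended, nothing grading-related needs to exist in the sandbox while the agent is acting, which is the spirit of the temporal firewall described above.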
The scoring protocol is multi-dimensional (Completion × Safety gate + Robustness), and the multi-trial evaluation (Pass@k for ceiling, Pass^k for floor) provides a more complete picture of deployable capability than single-run metrics. The controlled error injection mechanism for robustness testing is a genuinely useful addition that no prior benchmark systematically provides.
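The ceiling/floor distinction can be sketched in a few lines of Python. This sketch assumes the usual empirical definitions: with k trials per task, Pass@k counts a task if at least one trial passes, and Pass^k counts it only if every trial passes. The pass threshold, the function name `summarize`, and the toy scores are assumptions made for illustration and are not taken from the paper.

```python
from typing import Dict, List

PASS_THRESHOLD = 0.8  # hypothetical cut-off; the source does not state one


def summarize(trial_scores: Dict[str, List[float]],
              threshold: float = PASS_THRESHOLD) -> Dict[str, float]:
    """Aggregate per-task trial scores into Average Score, Pass@k, and Pass^k."""
    n_tasks = len(trial_scores)
    if n_tasks == 0:
        raise ValueError("need at least one task")
    total = 0.0
    any_pass = 0
    all_pass = 0
    for scores in trial_scores.values():
        passed = [s >= threshold for s in scores]
        total += sum(scores) / len(scores)  # mean score over this task's trials
        any_pass += any(passed)             # ceiling: some trial succeeded
        all_pass += all(passed)             # floor: every trial succeeded
    return {
        "average_score": total / n_tasks,
        "pass_at_k": any_pass / n_tasks,
        "pass_hat_k": all_pass / n_tasks,
    }


# Toy data (made up): four tasks, three trials each.
toy = {
    "task_a": [0.9, 0.9, 0.9],  # consistently passes
    "task_b": [0.9, 0.5, 0.9],  # passes sometimes
    "task_c": [0.1, 0.3, 0.1],  # never passes
    "task_d": [1.0, 1.0, 0.5],  # passes sometimes
}
result = summarize(toy)
print(result["pass_at_k"], result["pass_hat_k"])  # 0.75 0.25
```

In the toy data, three of four tasks succeed at least once, but only one succeeds every time. That gap between the two numbers is what the review calls the difference between ceiling and floor.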
2. Methodological Rigor
Strengths in design:
Concerns:
3. Potential Impact
The benchmark fills a genuine gap. Most existing agent benchmarks indeed check only final outputs, and the systematic demonstration that trajectory-opaque evaluation misses nearly half of safety violations is an important finding for the community. The framework's modular, declarative task schema—where adding new domains requires only a task definition and grader script—supports extensibility.
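As a rough illustration of what "a task definition and grader script" could look like, the Python sketch below pairs a declarative task record with a grader function in a small registry. Every name here (`TaskSpec`, `REGISTRY`, `register`, `keyword_grader`, `grade`) and every field is an assumption made for the example; the source does not describe Claw-Eval's actual schema.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical declarative task definition."""
    task_id: str
    category: str
    prompt: str
    rubric: List[str]  # fine-grained items the grader checks


Grader = Callable[[TaskSpec, Dict[str, str]], float]
REGISTRY: Dict[str, Tuple[TaskSpec, Grader]] = {}


def register(spec: TaskSpec, grader: Grader) -> None:
    """Adding a task means registering one spec and one grader."""
    if spec.task_id in REGISTRY:
        raise ValueError(f"duplicate task id: {spec.task_id}")
    REGISTRY[spec.task_id] = (spec, grader)


def keyword_grader(spec: TaskSpec, evidence: Dict[str, str]) -> float:
    """Toy grader: fraction of rubric items mentioned in the final output."""
    output = evidence.get("final_output", "").lower()
    if not spec.rubric:
        return 0.0
    hits = sum(item.lower() in output for item in spec.rubric)
    return hits / len(spec.rubric)


def grade(task_id: str, evidence: Dict[str, str]) -> float:
    """Look up the task and apply its grader to the recorded evidence."""
    spec, grader = REGISTRY[task_id]
    return grader(spec, evidence)


register(
    TaskSpec("demo-001", "service-orchestration",
             "Summarize the open invoices.", ["invoice", "total"]),
    keyword_grader,
)
score = grade("demo-001", {"final_output": "Found 3 open invoice records."})
print(score)  # 0.5
```

The point of the design is that the harness only ever touches `register` and `grade`, so a new domain does not require changes to the harness itself.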
The finding that Pass@3 remains stable while Pass^3 drops up to 24% under error injection is practically important for deployment decisions: it quantifies the gap between "can do" and "reliably does." The multi-turn dialogue analysis showing question quality (r=0.87) vastly outpredicts round count (r=0.07) provides actionable guidance for agent design.
However, 300 tasks may be insufficient for a benchmark aspiring to be comprehensive across 9 categories and 3 modalities. The benchmark's impact will depend heavily on community adoption and whether the infrastructure is released in a usable form.
4. Timeliness & Relevance
The paper is highly timely. As LLM agents move from research prototypes to production deployments (e.g., Claude Code, OpenClaw), the need for safety-aware, trajectory-auditable evaluation is acute. The reward hacking concern, where agents game output-only metrics, is a recognized and growing problem that this work directly addresses. The controlled perturbation testing fills a gap that no existing benchmark covers and that matters directly for deployment-readiness assessment.
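One simple way to picture controlled perturbation testing is a wrapper that makes a tool fail a fixed number of times before it behaves normally. The Python sketch below is only an illustration under that assumption; `with_error_injection`, `InjectedError`, the stand-in `lookup` tool, and the two toy agents are hypothetical and do not reflect how Claw-Eval injects errors.

```python
from typing import Callable, Optional


class InjectedError(RuntimeError):
    """Marks a failure produced by the harness rather than by the real tool."""


def with_error_injection(tool: Callable[[str], str],
                         fail_first_n: int) -> Callable[[str], str]:
    """Wrap `tool` so that its first `fail_first_n` calls raise InjectedError."""
    calls = {"count": 0}

    def wrapped(arg: str) -> str:
        calls["count"] += 1
        if calls["count"] <= fail_first_n:
            raise InjectedError(f"injected failure on call {calls['count']}")
        return tool(arg)

    return wrapped


def lookup(city: str) -> str:
    """Stand-in tool; a real harness would call a mock service here."""
    return f"weather for {city}: sunny"


def naive_agent(tool: Callable[[str], str], query: str) -> Optional[str]:
    """Calls the tool once and gives up on an injected error."""
    try:
        return tool(query)
    except InjectedError:
        return None


def retrying_agent(tool: Callable[[str], str], query: str,
                   max_attempts: int = 3) -> Optional[str]:
    """Retries up to `max_attempts` times before giving up."""
    for _ in range(max_attempts):
        try:
            return tool(query)
        except InjectedError:
            continue
    return None


print(naive_agent(with_error_injection(lookup, 1), "Oslo"))     # None
print(retrying_agent(with_error_injection(lookup, 1), "Oslo"))  # weather for Oslo: sunny
```

Because the failure count is fixed rather than random, the same perturbation can be replayed across models and trials, which is what makes the comparison controlled.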
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Summary
Claw-Eval makes a solid architectural contribution to agent evaluation by unifying trajectory auditing, safety/robustness assessment, and cross-modal coverage. Its strongest contribution is the empirical demonstration that evaluation methodology matters—trajectory-opaque grading is systematically unreliable. The benchmark is timely and addresses real needs, though its relatively small scale, limited statistical analysis of judge reliability, and some questionable model references temper enthusiasm. It represents a meaningful step toward trustworthy agent evaluation but would benefit from larger task sets, human baselines, and more rigorous judge calibration.
Generated Apr 8, 2026
Comparison History (204)
Paper 2 has higher potential impact due to greater novelty (agentic, end-to-end architecture and mechanism discovery beyond standard Transformers), stronger real-world implications (new scalable foundation-model families and training improvements), and broader cross-field relevance (NAS, systems, scaling laws, long-context modeling, automated research). If results and comparisons hold, it could directly influence future model design and accelerate progress toward automated AI R&D. Paper 1 is timely and valuable for trustworthy evaluation, but its impact is primarily infrastructural/benchmarking and likely more incremental than a paradigm-shifting method for discovering new model architectures.
Claw-Eval addresses a critical and timely gap in evaluating autonomous LLM agents—a rapidly growing area with broad implications. Its contributions (trajectory-aware grading, safety/robustness evaluation, multi-modal coverage) are highly relevant as agent deployment scales. The benchmark serves the entire community and reveals actionable insights (e.g., 44% of safety violations missed by existing methods). Paper 1 offers an incremental improvement to GP-based symbolic regression using BERT for guided mutations/crossovers, which, while technically sound, addresses a narrower problem with more limited breadth of impact.
Paper 1 likely has higher scientific impact due to broader relevance and stronger methodological contribution: it delivers a general, trajectory-aware evaluation framework for autonomous agents with safety/robustness metrics, multi-modal coverage, and statistically meaningful protocols (multi-trial Pass@k vs Pass^k) tested across 14 frontier models. This benchmark infrastructure can influence many subfields (agent design, safety, multimodal systems, evaluation methodology). Paper 2 is innovative and highly applicable to manufacturing, but its scope is narrower and evidence appears limited (single model, 72 invocations).
Paper 1 introduces a comprehensive evaluation benchmark for LLM agents, a rapidly growing and highly influential field. Benchmarks typically achieve high scientific impact as they establish standard evaluation protocols adopted by many subsequent researchers. Paper 1 addresses critical gaps like trajectory opacity, safety, and robustness across multiple modalities. In contrast, Paper 2 focuses on optimizing the latency of a specific autonomous driving architecture (Alpamayo 1), which, while methodologically sound, has a much narrower scope and limited broader applicability.
Paper 2 introduces a comprehensive, trustworthy evaluation suite (Claw-Eval) for autonomous agents, addressing critical gaps in safety, robustness, and trajectory-aware grading. Evaluation frameworks and benchmarks typically have a broader and longer-lasting impact across the AI community by establishing standards for measuring progress. In contrast, Paper 1 presents a specific multi-agent methodology for algorithmic problem solving, which, while valuable, has a narrower scope and application.
Paper 1 likely has higher impact: it introduces a substantial, trajectory-aware evaluation suite for real-world autonomous agents with explicit safety/robustness scoring, rich multimodal coverage, and evidence-backed grading—immediately useful to many labs and deployments. Its methodological contribution (multi-channel logging + fine-grained rubrics + reliability metrics like Pass^k) addresses pressing evaluation gaps as agents move into production. Paper 2 is novel and timely for theory of transformer reasoning, but appears narrower in scope (Horn clauses, specific masking/depth settings) and may have less immediate cross-field and applied adoption than a widely usable benchmark.
Claw-Eval addresses fundamental limitations in how autonomous agents are evaluated—trajectory-opaque grading, safety/robustness gaps, and narrow coverage—which are critical bottlenecks for the entire field. Its findings (e.g., 44% of safety violations missed by traditional evaluation) have broad implications for trustworthy AI deployment. While Uno-Orchestra offers strong engineering contributions to multi-agent orchestration with impressive performance gains, Claw-Eval's impact is broader: it establishes evaluation infrastructure and methodology that will influence how all future agent systems are assessed, making it more foundational for the field.
Paper 1 likely has higher impact: it tackles a widely recognized bottleneck—trustworthy evaluation of autonomous agents—by introducing trajectory-aware, safety/robustness-aware benchmarking with multi-channel evidence and fine-grained rubrics. This is broadly applicable across agent research, safety, and deployment, and is timely as agents move into real software environments. Its methodological contribution (hybrid grading, multi-trial Pass metrics, error injection) can influence how the field measures progress. Paper 2 is novel and useful but more domain-specific (immersive role-play) and less foundational than evaluation infrastructure.
Paper 1 likely has higher impact due to broader relevance and immediate applicability: it proposes an end-to-end, trajectory-aware, safety/robustness-inclusive benchmark for autonomous agents across diverse task types and modalities, addressing widely recognized evaluation gaps. Its methodology (multi-channel evidence, fine-grained rubrics, multi-trial metrics) generalizes across agent settings and can shape how the field measures deployability. Paper 2 is rigorous and novel but targets a narrower niche (agent-repair leaderboards and evaluator-channel coupling), making its cross-field influence and adoption likely more limited.
Claw-Eval addresses a critical and timely gap in autonomous agent evaluation with a comprehensive benchmark (300 tasks, 2,159 rubric items, 14 models) that introduces trajectory-aware grading, safety/robustness evaluation, and multimodal coverage. Its practical impact is broader—it provides actionable infrastructure for the rapidly growing field of LLM agents. Paper 1, while methodologically sound, addresses a narrower question (thinking mode effects on moral judgments) with incremental findings (high agreement, marginal improvements). Claw-Eval's benchmark utility and relevance to deployment safety give it substantially wider and more lasting impact.
Claw-Eval addresses a more fundamental and broadly impactful problem: trustworthy evaluation of autonomous LLM agents. Its contributions—trajectory-aware grading, safety/robustness evaluation, and multimodal coverage—serve as critical infrastructure for the entire agent research community. The benchmark's findings (e.g., 44% missed safety violations) have immediate implications for deployment safety. While Strat-Reasoner offers solid improvements in strategic reasoning for multi-agent games, its scope is narrower. Claw-Eval's breadth across 14 models, 300 tasks, and actionable insights for reliable deployment gives it wider cross-field impact and greater timeliness.
Paper 1 (Claw-Eval) introduces a comprehensive evaluation framework for autonomous agents, addressing a critical bottleneck in the AI field: reliable, trajectory-aware benchmarking for safety and robustness. Because standardized benchmarks are foundational to AI progress, this paper is likely to see widespread adoption and citation across the broader AI, LLM, and software engineering communities. In contrast, Paper 2 presents a strong but more narrowly focused application in AR and human-computer interaction, which, while highly innovative, has a smaller target audience and consequently a narrower potential scientific impact.
Claw-Eval addresses a broader and more pressing problem—trustworthy evaluation of autonomous LLM agents—which is highly timely given the rapid deployment of AI agents. Its contributions (trajectory-aware grading, safety/robustness evaluation, multi-modality coverage) have wider applicability across the AI safety and agent development communities. The finding that trajectory-opaque evaluation misses 44% of safety violations is actionable and impactful. ReasonAudio, while valuable for audio retrieval, targets a narrower subfield with less immediate broad impact. Claw-Eval's methodological rigor (300 tasks, 2,159 rubric items, 14 models, three-trial protocol) and relevance to AI safety amplify its potential influence.
Claw-Eval addresses a timely and critical gap in evaluating autonomous LLM agents, a rapidly growing field. Its comprehensive benchmark covering safety, robustness, and trajectory-aware grading has broad applicability across the AI community. The finding that trajectory-opaque evaluation misses 44% of safety violations is highly impactful for deployment practices. While Paper 1 makes solid algorithmic contributions to online learning in Tree MDPs with elegant theoretical results, its scope is narrower (sequential games, bandit algorithms). Paper 2's relevance to the mainstream LLM agent ecosystem gives it broader potential impact and timeliness.
Paper 1 addresses a critical bottleneck in a highly active and broad field: the reliable evaluation of LLM-based autonomous agents. By introducing a comprehensive, trajectory-aware, and multimodal evaluation framework, it offers foundational tools that will benefit researchers across AI, software engineering, and robotics. Paper 2 is methodologically sound and valuable for surgical AI, but its impact is relatively niche, confined mostly to healthcare applications. Consequently, Paper 1 has a significantly wider breadth of impact and timeliness.
Paper 1 establishes fundamental information-theoretic limits for a widely-used class of explainability methods (LIME, KernelSHAP), providing novel theoretical foundations (converse and achievability results) that reframe explanation quality as a channel capacity problem. This offers deep, lasting insight into why and when explanations fail, with broad implications across XAI. Paper 2, while practically useful, is primarily an engineering benchmark contribution for LLM agents—a fast-moving area where benchmarks are quickly superseded. Paper 1's theoretical framework is more likely to have enduring scientific impact across multiple fields.
While Paper 1 offers a profound theoretical unification of fundamental probabilistic and generative algorithms, Paper 2 addresses an urgent and highly relevant bottleneck in current AI research: the trustworthy evaluation of autonomous LLM agents. By providing a comprehensive, trajectory-aware benchmark covering safety, robustness, and multimodal tasks, Claw-Eval is positioned for rapid, widespread adoption. In the fast-moving field of LLM agents, robust evaluation frameworks typically drive immediate empirical progress and garner higher short-to-medium-term scientific impact across a broad range of applied and foundational AI development.
Paper 2 introduces a novel theoretical framework connecting explainable AI to information theory, deriving fundamental limits (strong converse, achievability) for masking-based explanation methods like SHAP and LIME. This provides deep, generalizable insights with broad implications across ML interpretability, information theory, and trustworthy AI. Paper 1, while practically useful as a benchmark suite for LLM agents, is more incremental—improving evaluation methodology rather than establishing fundamental theoretical contributions. Paper 2's information-theoretic bounds will likely influence how the community designs and understands explanation methods for years to come.
Paper 1 likely has higher impact due to a substantial, reusable benchmark artifact (300 tasks, trajectory-aware evidence, fine-grained rubrics) that can become community infrastructure for evaluating agent reliability, safety, and robustness across modalities—high real-world relevance and broad applicability. Its methodological contribution (multi-channel auditing, multi-trial metrics) addresses pressing evaluation failures in deployed agents. Paper 2 is novel and timely with a clear theoretical framing and actionable prompting insight, but its scope is narrower (self-correction on a few NLP benchmarks) and may have less cross-field and tooling impact than a widely adoptable evaluation suite.
Paper 1 likely has higher impact: it delivers a substantial, reusable evaluation infrastructure (300 tasks, trajectory-aware evidence, 2,159 rubric items) directly addressing urgent gaps in agent benchmarking (safety/robustness, multimodality, interaction paradigms). This enables broad, cross-model and cross-lab comparability and can influence deployment standards and research directions across AI safety, HCI, and agent systems. Paper 2 is novel and rigorous with a useful control-theoretic framing, but its scope is narrower (self-correction on select datasets) and may translate more as a diagnostic/prompting guideline than a field-wide benchmark resource.