Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World

Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan

#1445 of 2682 · Artificial Intelligence
Tournament Score: 1400±40
Win Rate: 53% (10 wins, 9 losses, 19 matches)
Rating: 6.8/10 (Significance, Rigor, Novelty, Clarity)

Abstract

Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Claw-Anything

1. Core Contribution

Claw-Anything introduces a benchmark for evaluating LLM-based personal assistant agents under significantly broader operational scope than existing benchmarks. The key innovation lies in expanding agent context along three dimensions simultaneously: (1) long-horizon event streams spanning months of simulated user activity, (2) interdependent backend services (averaging 10.1 per task vs. ~1-4 in prior work), and (3) cross-device interaction combining CLI and GUI interfaces. The benchmark also evaluates proactive assistance—agents anticipating user needs without explicit requests—which is largely absent from prior evaluations.

Alongside the benchmark, the authors release an automated data-generation pipeline that synthesizes coherent digital worlds through iterative event injection, producing 200 human-verified evaluation tasks and 2,000 training environments. The pipeline's ability to generate useful training data is demonstrated by a 23.7% improvement when fine-tuning Qwen3.5-27B.

2. Methodological Rigor

Strengths in methodology:

  • The iterative environment synthesis (Algorithm 1) is well-designed: starting from a minimal persona seed and progressively building complexity through multi-round event injection creates naturally entangled, realistic environments with noise, conflicting signals, and cross-service dependencies.
  • The four-stage pipeline (synthesis → task/verifier generation → automatic filtering → human verification) provides reasonable quality assurance, combining rule-based checks, LLM-based filtering, and execution-based validation.
  • The ablation studies are thorough and well-structured, systematically examining each contextual dimension (event streams, services, devices), pipeline parameters (noise ratio, simulation rounds, fixture conflicts), and evaluation settings (proactivity, skill loading).
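The multi-round event injection described in the first bullet can be illustrated with a toy sketch. This is not the authors' Algorithm 1: every name here (`Event`, `World`, `inject_round`, `synthesize`) and the service list are hypothetical, and a random placeholder stands in for the LLM that would generate events conditioned on the current world state. It only shows the general shape of the loop: start from a persona seed, then append events round by round, with a configurable fraction marked as task-irrelevant noise.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Event:
    round_idx: int   # simulation round in which the event was injected
    service: str     # hypothetical backend service name
    text: str        # placeholder payload (an LLM would write this in practice)
    is_noise: bool   # True if the event is irrelevant to any downstream task


@dataclass
class World:
    persona: str
    events: list = field(default_factory=list)


def inject_round(world, round_idx, services, events_per_round, noise_ratio, rng):
    """Append one round of events; roughly `noise_ratio` of them are noise."""
    for i in range(events_per_round):
        is_noise = rng.random() < noise_ratio
        service = rng.choice(services)
        kind = "noise" if is_noise else "signal"
        world.events.append(
            Event(round_idx, service, f"{kind} event {i} on {service}", is_noise)
        )


def synthesize(persona, services, rounds, events_per_round, noise_ratio, seed=0):
    """Grow a world state from a persona seed over several injection rounds."""
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    world = World(persona)
    for r in range(rounds):
        inject_round(world, r, services, events_per_round, noise_ratio, rng)
    return world


world = synthesize(
    "hypothetical persona seed",
    ["mail", "calendar", "notes"],
    rounds=3,
    events_per_round=4,
    noise_ratio=0.5,
)
print(len(world.events))  # 3 rounds x 4 events = 12
```

In this sketch, `noise_ratio` and `rounds` play the role of the noise-ratio and simulation-round parameters that the third bullet says the ablations vary; the real pipeline additionally generates tasks and verifiers and filters the results.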
Concerns:

  • The environments rely on mock services rather than real-world systems, which the authors acknowledge. This raises questions about ecological validity—real services have idiosyncratic behaviors, rate limits, and error modes not captured by mocks.
  • The evaluation uses Claude Sonnet 4.5 as a judge model, which itself achieves only 28% pass@1 on the benchmark. Using a model that struggles on the task as an evaluator introduces potential reliability concerns, though the outcome-dominated scoring scheme partially mitigates this.
  • With only 200 evaluation tasks (150 CLI + 50 CLI+GUI), the benchmark is relatively small, potentially limiting statistical power for fine-grained comparisons. The 50 GUI tasks are particularly thin for drawing robust conclusions about cross-device performance.
  • The paper references future models (GPT-5.5, Claude Opus 4.7, Qwen3.6-27B) with 2026 dates, indicating this is either forward-looking or uses speculative model names, which complicates reproducibility assessment.
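The statistical-power concern can be made concrete with a rough calculation. The sketch below uses a simple normal-approximation (Wald) interval for a pass rate; the choice of interval and the function name `wald_interval` are this example's assumptions, not something taken from the paper. It also assumes, purely for illustration, that the reported 34.5% pass@1 applies equally to the full 200-task set and to the 50-task CLI+GUI subset.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Approximate 95% confidence interval for a pass rate p_hat over n tasks."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # binomial standard error
    half = z * se
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


# Illustrative assumption: the same 34.5% pass rate on both task counts.
for n in (200, 50):
    lo, hi = wald_interval(0.345, n)
    print(f"n={n}: [{lo:.3f}, {hi:.3f}] (half-width {100 * (hi - lo) / 2:.1f} points)")
```

Under these assumptions the interval is about ±6.6 points at 200 tasks and about ±13.2 points at 50 tasks, so differences of a few points between models on the GUI subset would be hard to distinguish from sampling noise.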
3. Potential Impact

Direct impact on the agent benchmark ecosystem: Claw-Anything fills a genuine gap. The comparison table (Table 1) clearly shows that existing benchmarks operate with ~1-4 services per task, no event streams, CLI-only interfaces, and context lengths of 2-12k words, while Claw-Anything provides 10.1 services, multi-month event streams, CLI+GUI, and 191.7k words of context. That is at least a 16-fold increase in context length (191.7k vs. at most 12k words) and roughly 2.5 to 10 times as many services per task.

Training data infrastructure: Perhaps equally important is the automated pipeline. The demonstration that 1,500 trajectories can improve a base model by 23.7% suggests this pipeline could serve as practical data infrastructure for the community, similar to how SWE-smith served the coding agent community.

Broader implications: The benchmark pushes the field toward evaluating agents in settings that more closely resemble real deployment scenarios for personal assistants, where users expect seamless coordination across their entire digital footprint. The proactive assistance evaluation dimension is particularly forward-looking.

4. Timeliness & Relevance

This work is highly timely. The personal assistant agent space is rapidly evolving with systems like OpenClaw, Hermes Agent, and commercial offerings from Google and others. The gap between narrow benchmark performance and real-world utility is a recognized problem. By explicitly targeting the "always-on" assistant paradigm with broad digital access, the benchmark addresses a current bottleneck: existing evaluations don't capture the complexity that deployed systems will face.

The focus on proactive assistance is also well-timed, as the field is transitioning from reactive tool-use to anticipatory agent behavior.

5. Strengths & Limitations

Key Strengths:

  • *Conceptual clarity*: The framing of "operational scope" as the key variable—expanding what agents can see and do—is clean and compelling.
  • *Comprehensive ablations*: The paper systematically demonstrates that each dimension (history length, service count, device heterogeneity, noise, conflicts) independently contributes to difficulty, validating the design choices.
  • *Dual-purpose infrastructure*: The pipeline serves both evaluation and training, maximizing its practical value.
  • *Failure mode analysis*: The identification of the "investigation-execution gap" as the dominant failure mode provides actionable insight for model developers.
  • *Scaling analysis*: Figure 6 shows clear trajectory scaling behavior, suggesting further improvements are achievable.
Notable Weaknesses:

  • *Mock environments*: The reliance on simulated services limits fidelity. Real-world APIs have authentication flows, rate limits, latency, and failure modes that mock services don't capture.
  • *Limited device coverage*: Only CLI (Linux Docker) and GUI (Android Docker) are included. Real users interact across laptops, tablets, wearables, and IoT devices.
  • *Evaluation scale*: 200 tasks is modest for a benchmark aspiring to comprehensive coverage of "anything" in the user's digital world.
  • *LLM-generated environments*: Despite filtering, the environments are ultimately LLM-synthesized, which may introduce systematic biases in the types of scenarios represented.
  • *Reproducibility concerns*: The references to specific model versions with 2026 dates and the dependence on proprietary models for judging make exact reproduction uncertain.
  • *Limited analysis of proactive tasks*: With proactive pass@1 at only 6.7%, few successful proactive trajectories exist, leaving little material for meaningful analysis of this dimension.
Overall Assessment

Claw-Anything makes a solid contribution by significantly raising the bar for personal assistant evaluation. Its primary value lies in demonstrating that current frontier models perform far worse when given broader but noisier access to user state (GPT-5.5 reaches only 34.5% pass@1, which the authors report as substantially below prior benchmarks), and in providing a scalable pipeline for generating such complex environments. The work is well-motivated, technically sound in its core pipeline design, and addresses a genuine gap. However, the reliance on mock services, modest evaluation set size, and incomplete device coverage temper the claim of benchmarking "anything" in the user's digital world.

Rating: 6.8/10
Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 26, 2026

    Comparison History (19)

    vs. CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical and immediate bottleneck in modern AI: memory and latency in Vision-Language Models. By enabling genuine hardware efficiency and significantly reducing KV-cache without accuracy loss, its methodology has immediate, widespread applicability for deployment. Paper 1 introduces a highly novel and valuable benchmark for personal assistants, but fundamental efficiency improvements like CIVIC generally yield broader cross-field impact, allowing larger models to run efficiently on constrained hardware.

    vs. DREAM-R: Multimodal Speculative Reasoning with RL-Based Refined Drafting, Precise Verification, and Fully Parallel Execution
    claude-opus-4.6 · 5/28/2026

    DREAM-R addresses a fundamental efficiency bottleneck in large multimodal model inference through a principled combination of RL-based training, verification mechanisms, and parallel execution. Its contributions (SAPO, TBVM, FPSR) are technically deep and broadly applicable to any reasoning-intensive LMM deployment, with demonstrated speedups preserving accuracy. Paper 2, while addressing an important benchmark gap for always-on assistants, is primarily a benchmark contribution with more limited methodological novelty. DREAM-R's impact spans inference optimization, RL for alignment, and speculative decoding—areas with broader cross-field relevance and immediate practical deployment implications.

    vs. OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling
    gpt-5.2 · 5/27/2026

    Paper 2 (Claw-Anything) likely has higher scientific impact due to its strong real-world applicability and timeliness: always-on assistants with broad digital access are a central near-term deployment target, and the benchmark captures long-horizon, multi-service, multi-device, GUI+CLI interaction plus proactive behavior under noisy/conflicting context. It also contributes a scalable automated environment-generation pipeline and demonstrates measurable training gains, increasing methodological utility beyond evaluation. Paper 1 is novel and valuable for interpretability of ToM in LLMs, but its applications are more indirect and narrower than the systems-level agent setting in Paper 2.

    vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
    claude-opus-4.6 · 5/27/2026

    Claw-Anything addresses a more novel and forward-looking problem—always-on personal assistants with broad digital world access—which is a less explored but increasingly important research direction. It introduces a benchmark spanning multiple dimensions (long-horizon histories, backend services, multi-device interaction) that goes beyond existing narrow evaluations. The finding that GPT-5.5 achieves only 34.5% highlights significant open challenges, motivating future research. While Paper 1 makes solid contributions to mobile GUI navigation with scaling analysis and benchmarking tools, it operates in a more established domain with incremental advances. Paper 2's broader scope and novel evaluation paradigm give it higher potential impact.

    vs. Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments
    gemini-3.1 · 5/27/2026

    Paper 2 introduces a highly ambitious benchmark that defines a new frontier for LLM agents: always-on, cross-device personal assistance with long-horizon context. By expanding the evaluation scope to include proactive assistance and rich contextual noise, it is likely to drive future research directions and serve as a standard for next-generation agents, offering broader field impact than the robust training methodology proposed in Paper 1.

    vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills
    gemini-3.1 · 5/26/2026

    Paper 1 addresses a fundamental cognitive and architectural question in AI: the transition from episodic memory to procedural skills. Its surprising finding that raw-trajectory reuse outperforms distilled skills offers deep scientific insights that could redirect how agent learning and memory are fundamentally designed. While Paper 2 offers an ambitious and highly practical systems benchmark for personal assistants, Paper 1's focus on the underlying mechanisms of abstraction and generalization gives it higher potential for core scientific impact across AI and machine learning.

    vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
    gpt-5.2 · 5/26/2026

    Paper 1 is likely to have higher impact because it targets the broadly relevant and timely problem of always-on personal assistants, introducing a benchmark that meaningfully expands context (long-horizon histories, multi-service dependencies, GUI/CLI across devices) and enables evaluation of proactive assistance under realistic noise. Its automated data-generation pipeline and evidence of scalable training gains increase practical adoption and downstream research utility. Paper 2 is strong and novel for programmatic spatiotemporal reasoning, but its application scope is narrower (code-based video generation) and may influence fewer adjacent fields than general agentic personal-assistant evaluation.

    vs. Generative Auto-Bidding with Unified Modeling and Exploration
    gpt-5.2 · 5/26/2026

    Paper 1 likely has higher scientific impact due to broader cross-field relevance and timeliness: it introduces a new benchmark and scalable data-generation pipeline for always-on LLM agents, a central, rapidly growing research direction affecting evaluation, robustness, long-horizon reasoning, and proactive assistance across AI/agent systems. Its contributions are broadly reusable beyond a single domain. Paper 2 is methodologically solid and has strong real-world impact via large-scale deployment, but the core ideas (DT + Q-guidance + safety fallback) are more incremental and primarily concentrated within ad bidding/industrial recommender economics.

    vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents
    claude-opus-4.6 · 5/26/2026

    Co-ReAct introduces a novel, principled framework for step-level rubric-guided reasoning in ReAct agents, with a dedicated rubric generator trained via a novel list-wise Spearman rank-correlation GRPO objective. It demonstrates consistent improvements across multiple models and benchmarks, and the rubric generator serves as a modular drop-in component. While Paper 2 introduces a valuable benchmark for always-on assistants, benchmarks typically have narrower methodological impact compared to new training/inference frameworks. Co-ReAct's approach is more broadly applicable across reasoning agent architectures and introduces reusable methodological innovations.

    vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
    claude-opus-4.6 · 5/26/2026

    MobileGym provides a verifiable, scalable simulation platform enabling RL training for mobile GUI agents with strong sim-to-real transfer (95.1% retention). Its contributions—deterministic state-based judging, parallel rollouts, and demonstrated RL improvements—address fundamental infrastructure needs for the field. While Claw-Anything introduces an important benchmark for always-on assistants, MobileGym's combination of a reusable training platform, verifiable rewards enabling online RL, and validated sim-to-real transfer represents a more foundational contribution with broader methodological impact for the agent research community.

    vs. A governance horizon for ethical-use constraints in open-weight AI models
    gemini-3.1 · 5/26/2026

    Paper 2 addresses an urgent and universally relevant issue: AI governance and supply-chain accountability. By empirically auditing over 2 million models, it provides rigorous, large-scale evidence of the failure of current ethical-use constraints in open-weight models. This highly novel approach bridges technical ecosystems and policy design, offering immediate real-world implications for global AI regulation, whereas Paper 1 focuses on a technical benchmark for future AI agents.

    vs. Trace2Skill: Verifier-Guided Skill Evolution for Long-Context EDA Agents
    claude-opus-4.6 · 5/26/2026

    Trace2Skill presents a novel test-time scaling framework with a concrete skill evolution mechanism (oracle, mutator, selector loop) that addresses a well-defined and important problem in EDA/hardware design. It demonstrates breakthrough results on previously unsolved tasks without requiring model fine-tuning, offering a generalizable methodology. Paper 2 introduces a valuable benchmark but is primarily an evaluation resource. While benchmarks are important, Trace2Skill's methodological contribution—evolvable skill policies from rollout traces with dense verifier feedback—has broader applicability beyond EDA and represents a more fundamental algorithmic advance in LLM agent reasoning.

    vs. Reasoning as an Attack Surface: Adaptive Evolutionary CoT Jailbreaks for LLMs
    claude-opus-4.6 · 5/26/2026

    Paper 2 introduces a novel benchmark (Claw-Anything) addressing a significant gap in evaluating always-on personal assistants—a rapidly growing area with broad real-world applications. It defines new evaluation dimensions, provides an automated data-generation pipeline, and demonstrates substantial room for improvement even in frontier models (GPT-5.5 at 34.5%). This has broader impact across the agent/assistant community by establishing infrastructure others can build on. Paper 1, while technically sound, is incremental in the jailbreak attack space, improving on existing CoT-based attack methods with evolutionary search—a narrower contribution with less cross-field impact.

    vs. TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
    gpt-5.2 · 5/26/2026

    Paper 1 likely has higher impact: it targets the broader and more consequential setting of always-on personal assistants spanning multi-device GUI/CLI, long-horizon user histories, and interdependent services, enabling evaluation of proactive assistance—capabilities central to near-term agent deployment. Its simulated months-long, noisy, conflicting user-state setup is a novel benchmark axis beyond terminal-only tasks, and the released scalable data-generation pipeline with demonstrated training gains increases practical adoption and research leverage. Paper 2 is rigorous and valuable, but its scope is narrower (terminal workflows) and likely affects a smaller slice of agent applications.

    vs. Agent-Centric Social Trajectory Prediction: A Free Energy Principle Perspective
    gemini-3.1 · 5/26/2026

    Paper 2 introduces a comprehensive benchmark and scalable data infrastructure for evaluating LLM-based personal assistants, a highly active and rapidly growing field. Its focus on long-horizon, cross-device AI agents addresses a critical bottleneck in agentic AI evaluation, making it highly likely to see widespread adoption and catalyze future research. Paper 1 offers a theoretically interesting approach to trajectory prediction, but its impact is more narrowly confined to the robotics and autonomous navigation communities.

    vs. Hera: Learning Long-Horizon Coordination for Device-Cloud Collaborative LLM Agents
    gpt-5.2 · 5/26/2026

    Paper 2 (Hera) likely has higher scientific impact due to a more novel algorithmic contribution (step-level device–cloud coordination with a two-stage IL→cost-aware RL training scheme), clearer methodological rigor and generalizable evaluation across multiple established long-horizon benchmarks, and strong real-world applicability to practical deployment constraints (latency/cost/privacy). Paper 1 provides an important, timely benchmark and data pipeline for always-on assistants, but benchmark papers can have narrower methodological novelty and their impact depends heavily on adoption; Hera’s approach is more broadly transferable across agent systems and edge/cloud settings.

    vs. AgentFugue: Agent Scaling for Long-Horizon Tasks through Collective Reasoning
    gemini-3.1 · 5/26/2026

    Paper 2 introduces a fundamental algorithmic innovation by addressing how to effectively 'scale out' multi-agent systems without centralized orchestration. This collective reasoning framework offers a novel paradigm for utilizing test-time compute in long-horizon tasks, giving it broader methodological impact across various AI domains compared to Paper 1, which primarily introduces a domain-specific benchmark, albeit an ambitious one.

    vs. $D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher impact: it introduces a broad, timely benchmark and scalable data-generation pipeline for always-on personal assistants, an area with wide real-world relevance and cross-field reach (agents, HCI, systems, evaluation). By expanding context to long-horizon histories, multi-service dependencies, and GUI/CLI across devices, it enables new research directions and measurable gaps even for strong models. Paper 1 is novel and rigorous for diffusion-LLM safety monitoring, but its impact is narrower (specific to D-LLMs and moderation routing) and depends on diffusion LLM adoption.

    vs. Agentic Proving for Program Verification
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher impact: it introduces a broadly applicable benchmark and scalable data-generation pipeline for always-on assistants spanning long-horizon context, multiple services, and GUI/CLI across devices—closely aligned with real-world deployment. Its results expose a substantial capability gap and provide infrastructure (2,000 environments, measurable training gains) that many groups can reuse, driving follow-on work across agent evaluation, HCI, security/privacy, and multimodal interaction. Paper 1 is strong and timely but is narrower (Lean/CLEVER) and partly highlights benchmark shortcomings rather than delivering a widely reusable new standard.