Claw-Anything: Benchmarking Always-On Personal Assistants with Broader Access to User's Digital World
Yusong Lin, Xinyuan Liang, Haiyang Wang, Qipeng Gu, Siqi Cheng, Jiangui Chen, Shuzhe Wu, Feiyang Pan
Abstract
Large language model agents are increasingly envisioned as always-on personal assistants with access to anything relevant in the user's digital world. Yet current systems operate over only narrow slices of that world, limiting context-sensitive reasoning and effective assistance. Existing benchmarks similarly provide only partial user state and therefore fail to capture performance in such a broad, always-on setting. To address this gap, we introduce Claw-Anything, a benchmark that expands agent context along three dimensions: long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across multiple devices. To instantiate this setting, we simulate months of user activity through multi-round event injection, producing complex world states and realistic noise, including irrelevant events and conflicting signals. Agents must reason over rich contextual environments while remaining robust to such noise. This expanded scope also enables the evaluation of proactive assistance, requiring agents to anticipate user needs and deliver timely recommendations. Experiments show that GPT-5.5 achieves only 34.5% pass@1, substantially below prior benchmarks, underscoring a gap between current agent capabilities and the demands of always-on personal assistance. Alongside the benchmark, we release an automated data-generation pipeline that yields 2,000 training environments and improves the base model by 23.7%, demonstrating its utility as scalable data infrastructure.
AI Impact Assessments
Scientific Impact Assessment: Claw-Anything
1. Core Contribution
Claw-Anything introduces a benchmark for evaluating LLM-based personal assistant agents under significantly broader operational scope than existing benchmarks. The key innovation lies in expanding agent context along three dimensions simultaneously: (1) long-horizon event streams spanning months of simulated user activity, (2) interdependent backend services (averaging 10.1 per task vs. ~1-4 in prior work), and (3) cross-device interaction combining CLI and GUI interfaces. The benchmark also evaluates proactive assistance—agents anticipating user needs without explicit requests—which is largely absent from prior evaluations.
Alongside the benchmark, the authors release an automated data-generation pipeline that synthesizes coherent digital worlds through iterative event injection, producing 200 human-verified evaluation tasks and 2,000 training environments. The pipeline's ability to generate useful training data is demonstrated by a 23.7% improvement when fine-tuning Qwen3.5-27B.
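The iterative event injection described above can be pictured with a toy sketch. This is not the authors' pipeline: the names Event, World, inject_round and build_world, the service list, the 30-day round length, and the noise_ratio parameter are all hypothetical, chosen only to illustrate how repeated injection rounds could accumulate a months-long history that mixes task-relevant events with irrelevant noise.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Event:
    day: int          # simulated day on which the event occurs
    service: str      # hypothetical backend service that records it
    text: str         # free-text description of the event
    relevant: bool    # False marks injected noise


@dataclass
class World:
    events: list = field(default_factory=list)


def inject_round(world, round_idx, rng, services,
                 noise_ratio=0.5, events_per_round=4):
    """Add one simulated month of events, then keep the history ordered by day."""
    start_day = round_idx * 30  # assumption: one round covers 30 days
    for i in range(events_per_round):
        relevant = rng.random() >= noise_ratio
        kind = "task-relevant" if relevant else "noise"
        world.events.append(Event(
            day=start_day + rng.randrange(30),
            service=rng.choice(services),
            text=f"round {round_idx} event {i} ({kind})",
            relevant=relevant,
        ))
    world.events.sort(key=lambda e: e.day)


def build_world(rounds=3, seed=0):
    """Run several injection rounds with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    world = World()
    for r in range(rounds):
        inject_round(world, r, rng, ["calendar", "email", "notes"])
    return world


if __name__ == "__main__":
    for event in build_world().events:
        print(event.day, event.service, event.text)
```

In this toy version each round only appends independent events. The abstract describes conflicting signals and interdependent services, which would require later rounds to read and contradict earlier state; the sketch leaves that out.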
2. Methodological Rigor
Strengths in methodology:
Concerns:
3. Potential Impact
Direct impact on the agent benchmark ecosystem: Claw-Anything fills a genuine gap. The comparison table (Table 1) clearly shows that existing benchmarks operate with ~1-4 services per task, no event streams, CLI-only interfaces, and context lengths of 2-12k words, while Claw-Anything provides 10.1 services, multi-month event streams, CLI+GUI, and 191.7k words of context. This represents an order-of-magnitude increase in environmental complexity.
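As a quick check on the order-of-magnitude description, the fold increases can be worked out from the figures quoted above. The sketch below assumes the approximate ranges "~1-4 services" and "2-12k words" are the endpoints for prior benchmarks; the helper name fold_range is hypothetical.

```python
# Figures quoted in the comparison above; the prior-work values are approximate ranges.
PRIOR_SERVICES = (1.0, 4.0)      # services per task in prior benchmarks
CLAW_SERVICES = 10.1             # services per task in Claw-Anything
PRIOR_CONTEXT_K = (2.0, 12.0)    # context length in thousands of words
CLAW_CONTEXT_K = 191.7


def fold_range(new_value, prior_range):
    """Return (smallest, largest) fold increase of new_value over a prior range."""
    low, high = prior_range
    return new_value / high, new_value / low


services_fold = fold_range(CLAW_SERVICES, PRIOR_SERVICES)
context_fold = fold_range(CLAW_CONTEXT_K, PRIOR_CONTEXT_K)

print(f"services: {services_fold[0]:.1f}x to {services_fold[1]:.1f}x")
print(f"context:  {context_fold[0]:.1f}x to {context_fold[1]:.1f}x")
```

Under these assumptions the service count grows by roughly 2.5x to 10x and the context length by roughly 16x to 96x, so the order-of-magnitude characterization holds most clearly for context length.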
Training data infrastructure: Perhaps equally important is the automated pipeline. The demonstration that 1,500 trajectories can improve a base model by 23.7% suggests this pipeline could serve as practical data infrastructure for the community, similar to how SWE-smith served the coding agent community.
Broader implications: The benchmark pushes the field toward evaluating agents in settings that more closely resemble real deployment scenarios for personal assistants, where users expect seamless coordination across their entire digital footprint. The proactive assistance evaluation dimension is particularly forward-looking.
4. Timeliness & Relevance
This work is highly timely. The personal assistant agent space is rapidly evolving with systems like OpenClaw, Hermes Agent, and commercial offerings from Google and others. The gap between narrow benchmark performance and real-world utility is a recognized problem. By explicitly targeting the "always-on" assistant paradigm with broad digital access, the benchmark addresses a current bottleneck: existing evaluations don't capture the complexity that deployed systems will face.
The focus on proactive assistance is also well-timed, as the field is transitioning from reactive tool-use to anticipatory agent behavior.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Overall Assessment
Claw-Anything makes a solid contribution by significantly raising the bar for personal assistant evaluation. Its primary value lies in demonstrating that current frontier models perform far worse when given broader but noisier access to user state (GPT-5.5 reaches only 34.5% pass@1, below what is reported on simpler benchmarks), and in providing a scalable pipeline for generating such complex environments. The work is well-motivated, technically sound in its core pipeline design, and addresses a genuine gap. However, the reliance on mock services, modest evaluation set size, and incomplete device coverage temper the claim of benchmarking "anything" in the user's digital world.
Generated May 26, 2026
Comparison History (19)
Paper 2 addresses a critical and immediate bottleneck in modern AI: memory and latency in Vision-Language Models. By enabling genuine hardware efficiency and significantly reducing KV-cache without accuracy loss, its methodology has immediate, widespread applicability for deployment. Paper 1 introduces a highly novel and valuable benchmark for personal assistants, but fundamental efficiency improvements like CIVIC generally yield broader cross-field impact, allowing larger models to run efficiently on constrained hardware.
DREAM-R addresses a fundamental efficiency bottleneck in large multimodal model inference through a principled combination of RL-based training, verification mechanisms, and parallel execution. Its contributions (SAPO, TBVM, FPSR) are technically deep and broadly applicable to any reasoning-intensive LMM deployment, with demonstrated speedups preserving accuracy. Paper 2, while addressing an important benchmark gap for always-on assistants, is primarily a benchmark contribution with more limited methodological novelty. DREAM-R's impact spans inference optimization, RL for alignment, and speculative decoding—areas with broader cross-field relevance and immediate practical deployment implications.
Paper 2 (Claw-Anything) likely has higher scientific impact due to its strong real-world applicability and timeliness: always-on assistants with broad digital access are a central near-term deployment target, and the benchmark captures long-horizon, multi-service, multi-device, GUI+CLI interaction plus proactive behavior under noisy/conflicting context. It also contributes a scalable automated environment-generation pipeline and demonstrates measurable training gains, increasing methodological utility beyond evaluation. Paper 1 is novel and valuable for interpretability of ToM in LLMs, but its applications are more indirect and narrower than the systems-level agent setting in Paper 2.
Claw-Anything addresses a more novel and forward-looking problem—always-on personal assistants with broad digital world access—which is a less explored but increasingly important research direction. It introduces a benchmark spanning multiple dimensions (long-horizon histories, backend services, multi-device interaction) that goes beyond existing narrow evaluations. The finding that GPT-5.5 achieves only 34.5% highlights significant open challenges, motivating future research. While Paper 1 makes solid contributions to mobile GUI navigation with scaling analysis and benchmarking tools, it operates in a more established domain with incremental advances. Paper 2's broader scope and novel evaluation paradigm give it higher potential impact.
Paper 2 introduces a highly ambitious benchmark that defines a new frontier for LLM agents: always-on, cross-device personal assistance with long-horizon context. By expanding the evaluation scope to include proactive assistance and rich contextual noise, it is likely to drive future research directions and serve as a standard for next-generation agents, offering broader field impact than the robust training methodology proposed in Paper 1.
Paper 1 addresses a fundamental cognitive and architectural question in AI: the transition from episodic memory to procedural skills. Its surprising finding that raw-trajectory reuse outperforms distilled skills offers deep scientific insights that could redirect how agent learning and memory are fundamentally designed. While Paper 2 offers an ambitious and highly practical systems benchmark for personal assistants, Paper 1's focus on the underlying mechanisms of abstraction and generalization gives it higher potential for core scientific impact across AI and machine learning.
Paper 1 is likely to have higher impact because it targets the broadly relevant and timely problem of always-on personal assistants, introducing a benchmark that meaningfully expands context (long-horizon histories, multi-service dependencies, GUI/CLI across devices) and enables evaluation of proactive assistance under realistic noise. Its automated data-generation pipeline and evidence of scalable training gains increase practical adoption and downstream research utility. Paper 2 is strong and novel for programmatic spatiotemporal reasoning, but its application scope is narrower (code-based video generation) and may influence fewer adjacent fields than general agentic personal-assistant evaluation.
Paper 1 likely has higher scientific impact due to broader cross-field relevance and timeliness: it introduces a new benchmark and scalable data-generation pipeline for always-on LLM agents, a central, rapidly growing research direction affecting evaluation, robustness, long-horizon reasoning, and proactive assistance across AI/agent systems. Its contributions are broadly reusable beyond a single domain. Paper 2 is methodologically solid and has strong real-world impact via large-scale deployment, but the core ideas (DT + Q-guidance + safety fallback) are more incremental and primarily concentrated within ad bidding/industrial recommender economics.
Co-ReAct introduces a novel, principled framework for step-level rubric-guided reasoning in ReAct agents, with a dedicated rubric generator trained via a novel list-wise Spearman rank-correlation GRPO objective. It demonstrates consistent improvements across multiple models and benchmarks, and the rubric generator serves as a modular drop-in component. While Paper 2 introduces a valuable benchmark for always-on assistants, benchmarks typically have narrower methodological impact compared to new training/inference frameworks. Co-ReAct's approach is more broadly applicable across reasoning agent architectures and introduces reusable methodological innovations.
MobileGym provides a verifiable, scalable simulation platform enabling RL training for mobile GUI agents with strong sim-to-real transfer (95.1% retention). Its contributions—deterministic state-based judging, parallel rollouts, and demonstrated RL improvements—address fundamental infrastructure needs for the field. While Claw-Anything introduces an important benchmark for always-on assistants, MobileGym's combination of a reusable training platform, verifiable rewards enabling online RL, and validated sim-to-real transfer represents a more foundational contribution with broader methodological impact for the agent research community.
Paper 2 addresses an urgent and universally relevant issue: AI governance and supply-chain accountability. By empirically auditing over 2 million models, it provides rigorous, large-scale evidence of the failure of current ethical-use constraints in open-weight models. This highly novel approach bridges technical ecosystems and policy design, offering immediate real-world implications for global AI regulation, whereas Paper 1 focuses on a technical benchmark for future AI agents.
Trace2Skill presents a novel test-time scaling framework with a concrete skill evolution mechanism (oracle, mutator, selector loop) that addresses a well-defined and important problem in EDA/hardware design. It demonstrates breakthrough results on previously unsolved tasks without requiring model fine-tuning, offering a generalizable methodology. Paper 2 introduces a valuable benchmark but is primarily an evaluation resource. While benchmarks are important, Trace2Skill's methodological contribution—evolvable skill policies from rollout traces with dense verifier feedback—has broader applicability beyond EDA and represents a more fundamental algorithmic advance in LLM agent reasoning.
Paper 2 introduces a novel benchmark (Claw-Anything) addressing a significant gap in evaluating always-on personal assistants—a rapidly growing area with broad real-world applications. It defines new evaluation dimensions, provides an automated data-generation pipeline, and demonstrates substantial room for improvement even in frontier models (GPT-5.5 at 34.5%). This has broader impact across the agent/assistant community by establishing infrastructure others can build on. Paper 1, while technically sound, is incremental in the jailbreak attack space, improving on existing CoT-based attack methods with evolutionary search—a narrower contribution with less cross-field impact.
Paper 1 likely has higher impact: it targets the broader and more consequential setting of always-on personal assistants spanning multi-device GUI/CLI, long-horizon user histories, and interdependent services, enabling evaluation of proactive assistance—capabilities central to near-term agent deployment. Its simulated months-long, noisy, conflicting user-state setup is a novel benchmark axis beyond terminal-only tasks, and the released scalable data-generation pipeline with demonstrated training gains increases practical adoption and research leverage. Paper 2 is rigorous and valuable, but its scope is narrower (terminal workflows) and likely affects a smaller slice of agent applications.
Paper 2 introduces a comprehensive benchmark and scalable data infrastructure for evaluating LLM-based personal assistants, a highly active and rapidly growing field. Its focus on long-horizon, cross-device AI agents addresses a critical bottleneck in agentic AI evaluation, making it highly likely to see widespread adoption and catalyze future research. Paper 1 offers a theoretically interesting approach to trajectory prediction, but its impact is more narrowly confined to the robotics and autonomous navigation communities.
Paper 2 (Hera) likely has higher scientific impact due to a more novel algorithmic contribution (step-level device–cloud coordination with a two-stage IL→cost-aware RL training scheme), clearer methodological rigor and generalizable evaluation across multiple established long-horizon benchmarks, and strong real-world applicability to practical deployment constraints (latency/cost/privacy). Paper 1 provides an important, timely benchmark and data pipeline for always-on assistants, but benchmark papers can have narrower methodological novelty and their impact depends heavily on adoption; Hera’s approach is more broadly transferable across agent systems and edge/cloud settings.
Paper 2 introduces a fundamental algorithmic innovation by addressing how to effectively 'scale out' multi-agent systems without centralized orchestration. This collective reasoning framework offers a novel paradigm for utilizing test-time compute in long-horizon tasks, giving it broader methodological impact across various AI domains compared to Paper 1, which primarily introduces a domain-specific benchmark, albeit an ambitious one.
Paper 2 likely has higher impact: it introduces a broad, timely benchmark and scalable data-generation pipeline for always-on personal assistants, an area with wide real-world relevance and cross-field reach (agents, HCI, systems, evaluation). By expanding context to long-horizon histories, multi-service dependencies, and GUI/CLI across devices, it enables new research directions and measurable gaps even for strong models. Paper 1 is novel and rigorous for diffusion-LLM safety monitoring, but its impact is narrower (specific to D-LLMs and moderation routing) and depends on diffusion LLM adoption.
Paper 2 likely has higher impact: it introduces a broadly applicable benchmark and scalable data-generation pipeline for always-on assistants spanning long-horizon context, multiple services, and GUI/CLI across devices—closely aligned with real-world deployment. Its results expose a substantial capability gap and provide infrastructure (2,000 environments, measurable training gains) that many groups can reuse, driving follow-on work across agent evaluation, HCI, security/privacy, and multimodal interaction. Paper 1 is strong and timely but is narrower (Lean/CLEVER) and partly highlights benchmark shortcomings rather than delivering a widely reusable new standard.