SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu

#396 of 2320 · Artificial Intelligence
Tournament Score: 1486±44
Win Rate: 73% (16 wins, 6 losses, 22 matches)
Rating: 7/10

Abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess agent capabilities in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: SaaS-Bench

1. Core Contribution

SaaS-Bench introduces a benchmark of 106 tasks across 23 real, deployable open-source SaaS systems organized into six professional domains. The key novelty lies at the intersection of several dimensions that prior benchmarks address only partially: (a) real SaaS deployments with full frontend-backend logic rather than toy or simplified websites, (b) cross-application coordination (93.4% of tasks span ≥2 applications), (c) genuinely long-horizon tasks averaging 100+ interaction steps, and (d) multimodal inputs including images and documents. The benchmark addresses a real evaluation gap: existing benchmarks like WebArena, OSWorld, and WorkArena rely on simplified environments, single applications, or short-horizon tasks that fail to capture the complexity of real professional workflows. The weighted checkpoint scoring system is a practical design choice that enables measuring partial progress on tasks where end-to-end completion is extremely rare.
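To make the difference between partial progress and strict completion concrete, the following is a minimal sketch of how a weighted checkpoint score could be computed. The `Checkpoint` class, the checkpoint names, the weights, and the normalization by total weight are all illustrative assumptions; the paper's actual scoring code may differ.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    name: str      # hypothetical checkpoint label
    weight: float  # relative importance assigned by the task author (assumed)
    passed: bool   # outcome of the verifier for this checkpoint


def checkpoint_score(checkpoints):
    """Weighted fraction of checkpoints passed, in [0, 1]."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("total checkpoint weight must be positive")
    return sum(c.weight for c in checkpoints if c.passed) / total


def resolved(checkpoints):
    """Strict completion: every checkpoint must pass."""
    return all(c.passed for c in checkpoints)


# Hypothetical task with three checkpoints; the middle one fails.
task = [
    Checkpoint("create_record", 3.0, True),
    Checkpoint("link_across_apps", 5.0, False),
    Checkpoint("send_summary", 2.0, True),
]
print(checkpoint_score(task), resolved(task))
```

Under this scheme a task can earn substantial checkpoint credit while still counting as unresolved, which is the pattern the review describes.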

2. Methodological Rigor

Environment construction is well-executed. Using real open-source SaaS systems containerized via Docker ensures reproducibility while maintaining ecological validity. The semantic data population strategy (combining LLM-generated and real dataset imports) is sensible, though the paper provides limited detail on how faithfully this approximates production data distributions.

Task construction follows a multi-stage Builder-Challenger-Refiner pipeline with both LLM synthesis and human expert review. The 45% survival rate through quality control suggests meaningful filtering. The static and execution check rubrics (detailed in Appendix A) are thorough, with six scoring dimensions and anti-pattern flags. This is a strength—many benchmarks lack such transparent quality control documentation.

Evaluation methodology is sound. The three verification types (State-Check, Content-Check, LLM-Judge) cover different output modalities. Environment reset before each task prevents cross-contamination. However, several concerns arise: (1) Only 106 tasks across 23 systems means sparse coverage per application. (2) The use of browser-use as the sole execution framework means the benchmark partially measures the framework's capabilities rather than purely the LLM's. (3) Some models (DeepSeek V4 Pro, MiniMax M2.7) are evaluated only on text-only tasks, making cross-model comparisons incomplete.

The pass@k analysis (k=1,2,3) is a valuable addition that reveals execution variance, though k=3 is still limited for drawing strong conclusions about reliability.
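As a rough illustration of what a pass@k figure summarizes, the sketch below uses the standard unbiased estimator 1 - C(n-c, k)/C(n, k) for n runs with c successes. The review does not say which estimator the paper uses, so this choice, the function names, and the sample outcomes are assumptions.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task: n runs, c of them successful."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random size-k
    # subset of the n runs contains no successful run.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(successes_per_task, n, k):
    """Average pass@k over tasks; successes_per_task[i] is c for task i."""
    per_task = [pass_at_k(n, c, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical outcomes: 4 tasks, 3 runs each, successful runs per task.
successes = [0, 0, 1, 3]
for k in (1, 2, 3):
    print(k, round(benchmark_pass_at_k(successes, n=3, k=k), 3))
```

With only three runs per task, each per-task estimate can take very few distinct values, which is one way to see why k=3 supports only coarse conclusions about reliability.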

3. Potential Impact

Immediate impact: SaaS-Bench fills a genuine gap in CUA evaluation. The finding that the best model achieves <4% resolved score while scoring ~43% on checkpoints is a powerful result that quantifies the "fragility of composition"—a concept the authors formalize well. This finding alone should redirect research attention toward long-horizon reliability rather than step-level accuracy.

Practical relevance: The benchmark directly targets the scenario enterprises care about—automating professional SaaS workflows. The six domains (software engineering, business operations, healthcare, teamwork, agriculture, media) represent genuine market verticals.

Failure mode taxonomy: The discussion section's detailed case studies (§5.1-5.4) contribute beyond the benchmark itself, identifying four failure modes: fragility of long-horizon completion, error cascading in multi-app workflows, unreliable self-evaluation, and high intra-model variance. The entity-type misclassification case (§5.2, bof_032) is particularly illuminating—showing how a subtle UI semantics error cascades through 30% of a task's score.

Architectural implications: The paper's observation about absent closed-loop outcome verification (§5.3) and the mathematical formalization of compositional fragility (p^N probability) provide actionable insights for agent design.
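The p^N relationship can be explored numerically with a short sketch. It assumes every step succeeds independently with the same probability p, which is a simplification; the step count of 100 is taken from the reported task length, and the function names are illustrative.

```python
def end_to_end_success(p, n):
    """Probability that all n steps succeed, assuming independent steps
    that each succeed with probability p."""
    return p ** n


def required_step_accuracy(target, n):
    """Per-step success rate needed to reach a target end-to-end rate
    over n steps, under the same independence assumption."""
    return target ** (1.0 / n)


# End-to-end success over 100 steps for several per-step success rates.
for p in (0.90, 0.99, 0.999):
    print(p, end_to_end_success(p, 100))

# Per-step rate implied by a 4% end-to-end rate over 100 steps.
print(required_step_accuracy(0.04, 100))
```

Under these assumptions, a per-step success rate of 0.99 yields an end-to-end rate of roughly 0.37 over 100 steps, so high step-level accuracy alone does not guarantee reliable long-horizon completion.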

4. Timeliness & Relevance

This benchmark arrives at a critical moment. CUAs are being deployed commercially (Anthropic's computer use, OpenAI's CUA), and the gap between benchmark performance and real-world utility is increasingly problematic. The paper's reference to Online-Mind2Web's finding that "much of reported progress disappears outside controlled offline settings" contextualizes the need well. The benchmark tests models from 2026 (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro), indicating it targets the current frontier.

5. Strengths & Limitations

Key Strengths:

  • Ecological validity: Real SaaS systems, not toy environments
  • Compositional complexity: Cross-application, long-horizon, multimodal
  • Transparent quality control with detailed rubrics and survival statistics
  • Excellent qualitative analysis of failure modes with specific trajectory excerpts
  • Checkpoint-based evaluation that captures partial progress
  • Strong differentiation from prior benchmarks (Table 1)
  • Reproducibility through Docker deployment

Notable Limitations:

  • Scale: 106 tasks is relatively small; statistical power for per-domain, per-model comparisons is limited
  • Single execution framework: All agents use browser-use, conflating framework limitations with model limitations
  • Evaluation coverage: Some models tested only on text-only tasks; most models have only 1 run (pass@k only for 4 models)
  • Open-source SaaS bias: Real enterprise SaaS (Salesforce, SAP, Workday) may present different challenges than open-source alternatives
  • Static benchmark: Tasks are fixed; no mechanism for generating new tasks at scale (unlike WebArena-Infinity)
  • Limited baseline diversity: Only LLM-based browser agents tested; no specialized agent architectures, no RL-trained agents, no agents with explicit planning modules
  • Verification reliability: LLM-Judge for open-ended outputs introduces non-deterministic evaluation, but the paper doesn't report inter-annotator agreement or judge reliability
  • Missing analysis: The paper lacks ablation studies on what makes tasks hard—is it primarily the number of steps, the number of applications, domain-specific knowledge requirements, or UI complexity? Figure 8 provides correlational evidence but not causal analysis.
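The concern about scale can be illustrated with a confidence interval on a small success count. The sketch below computes a Wilson score interval; the count of 4 successes out of 106 tasks is a hypothetical value chosen to be consistent with "fewer than 4%", not a figure reported in the paper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("require n > 0 and 0 <= successes <= n")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


# Hypothetical: 4 resolved tasks out of 106.
low, high = wilson_interval(4, 106)
print(low, high)
```

An interval of this kind is wide relative to the point estimate, and it widens further when the 106 tasks are split by domain or model.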

    Overall Assessment

    SaaS-Bench makes a solid contribution to CUA evaluation by providing a benchmark that is more realistic than existing alternatives along multiple dimensions simultaneously. The qualitative failure analysis is the paper's strongest intellectual contribution, offering insights that transcend the specific benchmark. The main limitations are its relatively modest scale (106 tasks) and the narrow diversity of agents evaluated. The benchmark should catalyze progress on long-horizon agent reliability, though its longevity as a benchmark may be limited by the rapid evolution of both SaaS systems and CUA capabilities.

    Rating: 7/10
    Significance 7.5 · Rigor 6.5 · Novelty 6.5 · Clarity 8

    Generated May 18, 2026

    Comparison History (22)

    vs. Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher scientific impact: it introduces a broadly useful, timely benchmark for computer-use agents in realistic SaaS workflows, with deployable systems, long-horizon tasks, multimodal settings, and nuanced evaluation (partial progress checkpoints). Benchmarks often become shared infrastructure that catalyzes progress across models, agent architectures, evaluation, and HCI/security. Its findings (very low completion rates) create a clear research agenda. Paper 1 is novel and methodologically interesting for automated algorithm design, but its impact may be narrower (focused on heuristic synthesis for combinatorial optimization) and its empirical gains are competitive rather than clearly transformative.

    vs. Latent Action Reparameterization for Efficient Agent Inference
    claude-opus-4.6 · 5/19/2026

    SaaS-Bench addresses a critical gap in evaluating computer-use agents on realistic professional workflows, providing a concrete benchmark across 23 real SaaS systems with 106 tasks. The finding that even the best models complete fewer than 4% of tasks reveals a stark capability gap that will drive significant future research. While LAR's latent action reparameterization is a solid methodological contribution to inference efficiency, SaaS-Bench has broader impact potential: it defines a new evaluation paradigm for the rapidly growing CUA field, will likely be widely adopted as a standard benchmark, and its cross-domain coverage invites contributions from multiple research communities.

    vs. From LLM-Generated Conjectures to Lean Formalizations: Automated Polynomial Inequality Proving via Sum-of-Squares Certificates
    gpt-5.2 · 5/18/2026

    Paper 2 likely has higher impact due to broad applicability and timeliness: a realistic, deployable benchmark for computer-use agents across 23 SaaS systems and multiple domains can standardize evaluation, drive rapid iteration, and influence both academia and industry. Its methodological contribution (task design, long-horizon workflows, weighted checkpoints, reproducibility) enables wide adoption and comparability across agent architectures. Paper 1 is novel and rigorous, but its impact is narrower (polynomial inequality proving/formal methods) and depends on uptake within a smaller community.

    vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
    gemini-3.1 · 5/18/2026

    Paper 2 addresses the critical, universally relevant challenge of AI safety, compliance, and governance by integrating formal methods with LLMs. This approach offers a robust methodological framework for auditing and monitoring AI behavior, which has broad implications across multiple fields, including AI safety, software engineering, and regulation. While Paper 1 provides a valuable benchmark for evaluating agents, Paper 2's methodological innovation and alignment with urgent real-world AI governance needs give it a higher potential for broad scientific and societal impact.

    vs. Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning
    claude-opus-4.6 · 5/18/2026

    SaaS-Bench addresses a critical gap in evaluating computer-use agents on realistic professional workflows using real SaaS systems. With 23 deployable systems across 6 domains and 106 tasks, it provides a much-needed benchmark revealing that even state-of-the-art agents complete fewer than 4% of tasks. This has broad impact across the AI agent community, informing future research directions. Paper 2 (BISON) presents a solid bilevel planning approach but is more incremental, combining known ideas (symbolic planning + imitation learning) in a relatively narrow robotics setting. SaaS-Bench's timeliness amid the CUA boom gives it higher impact potential.

    vs. PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI
    gpt-5.2 · 5/18/2026

    Paper 2 likely has higher scientific impact because it introduces a broadly reusable, open benchmark (23 SaaS systems, 106 realistic long-horizon tasks, multimodal, checkpointed evaluation) that can become a community standard for measuring and driving progress in computer-use agents. Its applicability spans agent planning, HCI, multimodal learning, RL, and software engineering, and it is timely given rapid CUA development. Paper 1 is highly practical and well-evaluated, but appears more platform/enterprise-specific and relies on LLM-as-judge evaluation, which may limit generalizability and rigor compared to a public benchmark.

    vs. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
    gpt-5.2 · 5/18/2026

    Paper 1 likely has higher scientific impact: it introduces the first fully open, end-to-end auditable pipeline for clinical LLM-CDSS, addressing a major barrier (reproducibility, provenance, decontamination, clinician validation) in a high-stakes domain with immediate real-world and regulatory relevance. It contributes reusable assets (audited corpus, training/eval protocol) and demonstrates competitive/SoTA performance across multiple base models, broadening adoption. Paper 2 is timely and useful as a benchmark, but its impact may be narrower (evaluation-only) and more incremental relative to the crowded agent-benchmark space.

    vs. TMAS: Scaling Test-Time Compute via Multi-Agent Synergy
    claude-opus-4.6 · 5/18/2026

    TMAS addresses the fundamental challenge of improving LLM reasoning through test-time compute scaling, which is a rapidly growing and broadly applicable research direction. Its multi-agent synergy framework with hierarchical memories and hybrid reward RL introduces novel, generalizable techniques that could impact many reasoning tasks. While SaaS-Bench provides a valuable benchmark for evaluating computer-use agents in realistic professional workflows, benchmarks typically have narrower impact than new methodological frameworks. TMAS's contributions to structured test-time scaling are more likely to influence subsequent research across multiple domains.

    vs. Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems
    gpt-5.2 · 5/18/2026

    Paper 1 likely has higher scientific impact because it introduces a deployable, reproducible benchmark spanning 23 real SaaS systems and 106 long-horizon tasks with partial-credit evaluation—an enabling artifact that can standardize evaluation and drive measurable progress in computer-use agents. Its strong empirical finding (<4% end-to-end success for top models) is timely and directly actionable for the community, with clear real-world relevance across professional workflows. Paper 2 is a useful synthesis and roadmap, but as a survey it is less methodologically transformative and typically has less direct, benchmark-driven impact on future empirical work.

    vs. RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision
    gpt-5.2 · 5/18/2026

    Paper 2 (SaaS-Bench) likely has higher scientific impact due to broader applicability and timeliness: it benchmarks computer-use agents in realistic, long-horizon, cross-application professional workflows across multiple domains, a key bottleneck for deploying agents. Its deployable SaaS-based tasks, multimodal setting, and graded checkpoint evaluation can influence agent research, evaluation standards, and real-world automation. Paper 1 is valuable but narrower (EDA/RTL generation) and focuses on benchmark maintenance; its impact is more field-specific and depends on adoption within a smaller community.

    vs. Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design
    gemini-3.1 · 5/18/2026

    Paper 1 demonstrates a foundational breakthrough by using LLM agents to autonomously design neural architectures that outperform current state-of-the-art models like Llama 3.2. It tackles the highly ambitious goal of recursive AI self-improvement, offering massive implications for the future of foundational model design and scaling efficiency. While Paper 2 provides a valuable benchmarking tool for evaluating agent workflows, Paper 1 introduces a paradigm-shifting generative methodology with broader, more fundamental scientific consequences for the entire field of deep learning.

    vs. An Algebraic Exposition of the Theory of Dyadic Morality
    claude-opus-4.6 · 5/18/2026

    SaaS-Bench addresses a timely, high-demand problem in AI agent evaluation with a concrete, reproducible benchmark spanning 23 real-world SaaS systems. It reveals significant limitations of current LLM-based agents (<4% task completion), providing a clear research target for the rapidly growing computer-use agent community. Its practical relevance to enterprise AI adoption and broad applicability across multiple professional domains gives it wider immediate impact. Paper 2, while intellectually interesting in formalizing moral psychology for AI, addresses a narrower niche with more theoretical contributions and less immediate empirical validation or adoption potential.

    vs. Holistic Evaluation and Failure Diagnosis of AI Agents
    gpt-5.2 · 5/18/2026

    Paper 2 is likely higher impact because it introduces a new, deployable, realistic benchmark spanning 23 real SaaS systems and six domains, addressing a timely need for evaluating computer-use agents in professional, long-horizon workflows. Its breadth (multi-domain, multimodal, dynamic state, cross-app coordination) makes it broadly useful across agent research, HCI, benchmarking, and enterprise automation, and its open release supports adoption. Paper 1 is methodologically innovative for failure localization, but it is narrower (evaluation framework atop an existing benchmark) and may have more limited cross-field uptake than a widely reusable real-world benchmark.

    vs. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast
    gpt-5.2 · 5/18/2026

    Paper 1 likely has higher scientific impact due to its broad, timely benchmark contribution: a deployable, realistic SaaS-based suite spanning multiple professional domains with long-horizon, multimodal tasks and graded evaluation. Such benchmarks often become shared infrastructure, enabling standardized measurement and driving progress across agent planning, tool use, robustness, and multimodal interaction. Paper 2 is innovative and shows strong gains, but evidence is confined to a single environment/attacker setting, limiting generality and cross-field adoption. Overall, SaaS-Bench’s wider applicability and community utility suggest greater impact.

    vs. Prompting Policies for Multi-step Reasoning and Tool-Use in Black-box LLMs with Iterative Distillation of Experience
    gemini-3.1 · 5/18/2026

    Paper 2 introduces a highly relevant and challenging benchmark for Computer-Using Agents in real-world SaaS environments. By demonstrating that current state-of-the-art models fail dramatically (sub-4% success rate) on realistic professional workflows, it exposes critical limitations and establishes a clear roadmap for future research. While Paper 1 offers a strong methodological improvement for prompt optimization, foundational benchmarks like SaaS-Bench typically drive broader community efforts and have a higher long-term scientific impact.

    vs. Coding Agent Is Good As World Simulator
    gpt-5.2 · 5/18/2026

    Paper 1 likely has higher impact: it introduces a deployable, realistic benchmark for computer-use agents across 23 real SaaS systems and diverse professional workflows, addressing a timely evaluation gap with clear real-world relevance and broad applicability to agent research, robustness, planning, and HCI. The strong empirical finding (state-of-the-art <4% success) can reshape research priorities and provide a standardized target for the community. Paper 2 is innovative but seems more complex to validate rigorously, may be domain-specific to physics simulation, and its gains depend on implementation details and evaluation breadth.

    vs. See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation
    gpt-5.2 · 5/18/2026

    Paper 1 likely has higher impact: it introduces a broadly useful, realistic benchmark for computer-use agents across 23 deployable SaaS systems and long-horizon professional workflows—an urgent bottleneck for reliable agent evaluation and progress tracking. Its design (dynamic states, cross-app coordination, weighted checkpoints) can influence many subfields (agent planning, UI grounding, RL, evaluation methodology) and is highly timely given rapid CUA adoption. Paper 2 is innovative and rigorous for a narrower domain (educational animation rendering), with clearer immediate application but less cross-domain reach.

    vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
    gpt-5.2 · 5/18/2026

    Paper 1 introduces a novel, scalable personalization mechanism (learning discrete buyer types from raw clickstreams via behavior-aware VQ-VAE and grounding LLM agents with persona tokens) and validates it on very large real-world data (8.37M buyers, 42 storefronts) with strong alignment and task gains, suggesting direct deployment impact in e-commerce simulation/personalized agents. Paper 2 is a valuable benchmark with broad relevance, but is primarily an evaluation dataset with fewer methodological innovations and less direct demonstrated real-world performance gains beyond exposing current limitations.

    vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents
    gpt-5.2 · 5/18/2026

    Paper 2 (ShopGym) likely has higher impact due to a more novel, generalizable methodology: converting live storefronts into reproducible, inspectable, resettable sandboxes and synthesizing grounded tasks, explicitly addressing a core benchmarking bottleneck (realism vs. controllability). It offers scalable environment construction, validation via structural analysis and correlation to live-store performance, and broad utility for agent evaluation, safety, and reproducibility research. Paper 1 is valuable but primarily contributes a fixed benchmark suite; its approach is less extensible and may face maintenance/non-stationarity challenges across many SaaS systems.

    vs. Deterministic Event-Graph Substrates as World Models for Counterfactual Reasoning
    gemini-3.1 · 5/18/2026

    Paper 1 addresses a highly active and critical area in AI research: evaluating LLM agents on complex, real-world computer tasks. Its introduction of a realistic benchmark for SaaS workflows has immediate and broad applicability for both academia and industry. While Paper 2 offers strong methodological rigor in formalizing counterfactual reasoning, Paper 1's timeliness, high relevance to current agentic AI trends, and potential to drive forward practical enterprise AI capabilities give it a significantly higher potential for broad scientific and real-world impact.