JobBench: Aligning Agent Work With Human Will

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao

#897 of 2682 · Artificial Intelligence
Tournament Score: 1445±41
Win Rate: 68% (13 wins, 6 losses, 19 matches)
Rating: 6.8/10 (Significance, Rigor, Novelty, Clarity)

Abstract

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: JobBench: Aligning Agent Work With Human Will

1. Core Contribution

JobBench introduces a benchmark of 130 agentic tasks across 35 occupations, where task selection is grounded in worker-reported delegation preferences rather than economic value. The key conceptual contribution is reframing AI workplace evaluation from "replacement" to "enhancement" — building on Workbank survey data where 1,500+ workers rated which duties they'd prefer AI to handle. Each task is packaged as a heterogeneous workspace of reference files (averaging 3.9 files per task across 17 formats), graded by chained binary rubrics averaging 35.6 criteria per task.

The benchmark addresses a genuine gap: existing occupational benchmarks like GDPVal, Remote Labor Index, and $OneMillion-Bench all scope tasks by economic value, asking whether agents can produce deliverables that substitute for human labor. JobBench instead asks whether agents can perform the specific subtasks that professionals themselves want offloaded.

2. Methodological Rigor

Strengths in design:

  • The occupation selection methodology is well-motivated: intersecting Workbank automation-desire scores (>3 on 1-5 scale) with OEWS 2024 economic exposure data, then applying a feasibility filter requiring tasks to be digitalizable, evaluable, and supportable.
  • The chained rubric design is rigorous — a rubric receives credit only when all criteria in the chain pass, preventing partial credit for arriving at correct facts through incorrect reasoning.
  • The 95.4% union pass rate across sampled runs provides evidence that rubrics are achievable, addressing a common benchmark concern about impossible criteria.
  • The three-stage quality gate (automated audit, annotator review, solve trial) with 71% task retention shows disciplined curation.
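
The chained-rubric rule described above (credit only when every criterion in a chain passes) can be sketched as follows. This is a minimal illustration, not the paper's grader: the function names are hypothetical, and the aggregation into a task score (fraction of chains earning credit) is an assumption, since the review does not state how chains are combined.

```python
from typing import Sequence


def chain_earns_credit(chain: Sequence[bool]) -> bool:
    """A chain earns credit only if it is non-empty and every criterion passed."""
    return len(chain) > 0 and all(chain)


def chained_score(chains: Sequence[Sequence[bool]]) -> float:
    """Fraction of chains earning credit (assumed aggregation, not from the paper)."""
    if not chains:
        return 0.0
    return sum(chain_earns_credit(c) for c in chains) / len(chains)


def flat_score(chains: Sequence[Sequence[bool]]) -> float:
    """Per-criterion partial credit, shown only for contrast with chaining."""
    verdicts = [v for c in chains for v in c]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Two hypothetical chains of binary verdicts; the second has one failed step.
example = [
    [True, True, True],
    [True, False, True],
]
print(chained_score(example))          # 0.5
print(round(flat_score(example), 3))   # 0.833
```

The gap between the two numbers shows what chaining removes: the second chain gets no credit for its two passing criteria because one step in its reasoning failed.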

Concerns:

  • The paper uses Grok-4.1-Fast as judge for cost reasons ($2 vs. $40), validated against Opus-4.5 with only "0.7% variance." However, the agreement analysis covers only three configurations (Figure 5b), which is thin for establishing judge reliability across 36 configurations and 4,631 criteria.
  • The 51.7% real-world / 48.3% synthesized split for main-set reference files raises questions about ecological validity for the synthesized portion.
  • The expert pool description is somewhat vague — "26.5 distinct experts per occupation" from Prolific, plus Upwork freelancers with >90% job success rate. The actual annotation workflow and inter-annotator agreement metrics are not reported.
  • The paper does not report variance across multiple runs for the same model-scaffold configuration, making it difficult to assess score stability.
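
One way the judge-reliability concern could be quantified is by comparing two judges' binary verdicts on the same criteria. The sketch below computes raw agreement and Cohen's kappa. It is an illustration under assumptions: the function names and sample verdicts are hypothetical, and nothing here reproduces the paper's own "0.7% variance" analysis.

```python
from typing import Sequence


def raw_agreement(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Share of criteria on which the two judges return the same verdict."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Agreement corrected for the agreement expected by chance."""
    observed = raw_agreement(a, b)
    n = len(a)
    pass_rate_a = sum(a) / n
    pass_rate_b = sum(b) / n
    # Chance agreement: both judges pass, or both judges fail, independently.
    expected = pass_rate_a * pass_rate_b + (1 - pass_rate_a) * (1 - pass_rate_b)
    if expected == 1.0:
        return 1.0  # both judges are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from a cheap judge and a reference judge on four criteria.
cheap_judge = [True, True, False, False]
reference_judge = [True, False, False, False]
print(raw_agreement(cheap_judge, reference_judge))  # 0.75
print(cohens_kappa(cheap_judge, reference_judge))   # 0.5
```

Reporting a chance-corrected statistic per configuration, rather than a single aggregate figure, would show whether agreement holds for configurations with very high or very low pass rates.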

3. Potential Impact

    Direct impact on benchmark design: JobBench offers a compelling alternative framing for occupational AI evaluation. The human-will-centered selection criterion could influence how future benchmarks scope their task sets, moving beyond pure economic exposure.

    Occupational analysis (Section 3.3): The analysis of research papers and YC startups mapped to occupations is an interesting secondary contribution, revealing that attention correlates *negatively* with model capability — communities focus more on areas where agents still struggle. The research-vs-startup attention divergence (Figure 7c) provides actionable intelligence for both communities.

    Practical deployment guidance: The scaffold comparison (Table 3) and cost analysis (Figure 5) provide practical value, showing that scaffold choice can shift scores by 6+ points with the same base model, and that GPT-5.5 at $44 dominates the cost-performance frontier while Opus-4.7 at $210 provides only marginal gains.

    Limitations of impact scope: The benchmark is U.S.-centric, English-only, document-heavy, and covers only 35 of hundreds of occupations. The 130 tasks, while carefully designed, provide thin per-occupation coverage (averaging 3.7 tasks per occupation), limiting fine-grained occupational conclusions.

4. Timeliness & Relevance

    The paper arrives at a moment of intense debate about AI's labor-market effects, with multiple competing occupational benchmarks (GDPVal, Remote Labor Index, $OneMillion-Bench) all appearing in 2025-2026. JobBench's human-centered framing is timely and fills a genuine conceptual gap. The comparison showing GDPVal approaching saturation (70-83%) while JobBench-Main remains below 46% suggests the benchmark will remain discriminative for frontier models in the near term.

    The paper's use of future model versions (Claude Opus 4.7, GPT-5.5, Gemini 3) suggests this is from a near-future or speculative timeline, which is unusual but doesn't diminish the methodological contribution if the benchmark design itself is evaluated independently.

5. Strengths & Limitations

Key strengths:

  • Novel and well-argued conceptual framing that distinguishes enhancement from replacement
  • Rich task design with heterogeneous file formats, cross-source conflicts, and chained reasoning requirements
  • Comprehensive evaluation across 36 model-scaffold configurations with four different agentic scaffolds
  • Detailed task examples (Appendix E) demonstrate genuine professional complexity
  • The Table 1 case comparison effectively illustrates how JobBench differs from GDPVal at the task level

Notable weaknesses:

  • The causal link between "what workers want delegated" and "what enhances rather than replaces" is assumed rather than demonstrated. Workers might want to delegate core competencies, and automating those could still lead to displacement.
  • The benchmark's reliance on Workbank survey data means it inherits any biases in that survey's methodology, population, or question design.
  • No human baseline performance is reported, making it difficult to contextualize the 45.9% ceiling.
  • The easy/main split rationale is somewhat ad hoc — fewer reasoning challenges, no web search, fewer conflicts — without systematic control over difficulty factors.
  • Reproducibility concerns: the paper does not release rubric-level results or provide sufficient detail about the synthesized reference files to enable independent verification.

Overall Assessment

    JobBench makes a meaningful conceptual contribution by reframing occupational AI benchmarking around worker preferences rather than economic value, backed by a technically sound benchmark design with rigorous chained rubrics. The evaluation is comprehensive across models and scaffolds. However, the paper would benefit from stronger validation of its judge reliability, human baselines, and a more critical examination of whether delegation preference truly maps to enhancement rather than replacement. The benchmark's discriminative power and professional reasoning demands are genuine strengths that should sustain its relevance as frontier models improve.

    Rating: 6.8/10
    Significance 7 · Rigor 6.5 · Novelty 7.5 · Clarity 7.5

    Generated May 27, 2026

    Comparison History (19)

    vs. On the Origin of Synthetic Information by Means of Steganographic Inheritance
    claude-opus-4.6 · 5/28/2026

    JobBench addresses a timely, practical problem in AI agent evaluation with a concrete benchmark covering 130 tasks across 35 occupations, evaluated on 36 models. Its human-centered framing (augmentation vs. replacement) is novel and directly actionable for the AI community. Paper 2 proposes an interesting steganographic provenance tracking framework but is more conceptual and niche, with narrower immediate applicability. JobBench's breadth of impact—spanning AI evaluation, labor economics, and agent development—combined with its methodological rigor (fact-anchored rubrics, extensive model evaluation) gives it higher potential for widespread adoption and citation.

    vs. Got a Secret? LLM Agents Can't Keep It: Evaluating Privacy in Multi-Agent Systems
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher scientific impact because it introduces a novel, multi-agent, month-long social simulation to reveal privacy failures that standard single-agent benchmarks miss, directly affecting safety evaluation paradigms for agentic systems. Its findings (social contagion of leakage, robustness against explicit instructions) are broadly relevant across AI safety, privacy, multi-agent systems, and deployment policy, and are timely as persistent agent communities become common. Paper 2 is valuable and rigorous as a benchmark for human-aligned occupational delegation, but its impact is more domain-scoped to evaluation of work agents rather than uncovering a new systemic safety failure mode.

    vs. Verifiable Benchmarking of Long-Horizon Spatial Biology
    claude-opus-4.6 · 5/28/2026

    JobBench has broader impact across fields: it covers 35 occupations with 130 tasks, evaluates 36 models, and introduces a paradigm shift from replacement to enhancement framing in AI labor economics. Its methodological contribution (fact-anchored rubrics, heterogeneous workspaces) is generalizable. SpatialBench-Long is rigorous and valuable but narrower in scope (24 evaluations in spatial biology). JobBench's reframing of how we evaluate occupational AI agents has potential to influence AI policy, workforce development, and benchmark design across many domains.

    vs. Measuring Progress Toward AGI: A Cognitive Framework
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a more fundamental and broadly impactful problem—how to measure progress toward AGI—by proposing a cognitive taxonomy grounded in decades of cognitive science research. This framework has potential to shape how the entire AI field tracks progress, informs governance, and structures evaluation. While Paper 2 (JobBench) makes a valuable contribution with its human-centered benchmark for occupational AI agents, it is more narrowly scoped as one benchmark among many. Paper 1's interdisciplinary foundation and its potential to become a standard framework for AGI evaluation give it greater breadth of impact.

    vs. Multi-Adapter Representation Interventions via Energy Calibration
    gemini-3.1 · 5/28/2026

    Paper 2 introduces a timely, comprehensive benchmark that shifts the evaluation paradigm of AI agents from job replacement to human empowerment. This conceptual shift and robust evaluation framework will likely guide future agent development and have a broader interdisciplinary impact across AI, HCI, and economics than Paper 1's algorithmic improvement for LLM alignment, which addresses a narrower technical niche.

    vs. Plan Before Search: Search Agents Need Plan
    gemini-3.1 · 5/28/2026

    JobBench introduces a paradigm-shifting benchmark that redefines occupational AI evaluation from economic replacement to human enhancement. By aligning AI capabilities with human delegation preferences, it offers profound societal relevance and broad multidisciplinary impact. While Paper 1 provides a valuable methodological improvement for agent reasoning, Paper 2 addresses urgent ethical and economic questions, likely driving extensive future research in AI alignment, HCI, and agent evaluation.

    vs. MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
    gemini-3.1 · 5/28/2026

    Paper 2 presents a novel benchmark that fundamentally challenges the current paradigm of AI automation, shifting focus from job replacement to human enhancement. This conceptual innovation, combined with its broad applicability across 35 occupations and relevance to pressing issues in AI alignment and labor economics, gives it a much wider potential impact across multiple fields compared to Paper 1's domain-specific, though rigorous, methodological advance in chemical diagram parsing.

    vs. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows
    gemini-3.1 · 5/27/2026

    Paper 1 offers a profound paradigm shift by evaluating AI agents based on human empowerment rather than economic replacement. Its extensive scope (130 tasks across 35 occupations) and detailed evaluation framework position it to become a foundational benchmark for human-AI collaboration. While Paper 2 tackles the crucial technical issue of trajectory hallucinations, Paper 1's broader socio-technical relevance, interdisciplinary appeal, and potential to steer the future development of agentic AI towards human-centric workflows give it a higher overall scientific and societal impact.

    vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding
    gpt-5.2 · 5/27/2026

    Paper 1 likely has higher overall scientific impact due to broader cross-field relevance and immediate usability: a large, carefully rubric-graded benchmark for agentic work can shape evaluation practices across AI, HCI, and economics-of-AI, influencing what systems are optimized for. Its novelty is in reframing occupational agent benchmarking around human delegation priorities and providing high-fidelity, workspace-style tasks with rigorous criteria. Paper 2 is methodologically strong and innovative within fMRI decoding, but its impact is narrower (neuroimaging/brain decoding) despite clear technical advances and efficiency gains.

    vs. Towards end-to-end LLM-based censoring-aware survival analysis
    gpt-5.2 · 5/27/2026

    Paper 1 likely has higher scientific impact due to broader, timely relevance and cross-field reach: it introduces a large, rigorous benchmark (130 tasks, 35 occupations, rubric-based grading) that can standardize evaluation of agentic systems across the AI community and influence research agendas (human-alignment framing vs. GDP/replacement). Paper 2 is a useful methodological contribution with clear clinical applicability, but it targets a narrower domain (survival analysis) and shows modest gains; impact may be more limited to medical ML and may face adoption/regulatory barriers.

    vs. EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions
    gpt-5.2 · 5/27/2026

    Paper 1 (JobBench) likely has higher scientific impact due to broader cross-field relevance and timeliness: it reframes agent evaluation around human-prioritized delegation rather than economic replacement, influencing how occupational agents are designed, assessed, and governed. Its scale (130 tasks, 35 occupations) and rubric-based, fact-anchored grading support methodological rigor and wide adoption across NLP, HCI, AI ethics, and labor economics. Paper 2 is innovative and rigorous for coding agents, but its domain is narrower (software) and impact is more specialized.

    vs. Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
    gemini-3.1 · 5/27/2026

    Paper 1 investigates the mechanistic interpretability of Large Reasoning Models (LRMs), a highly timely frontier in AI. Its discovery that Chain-of-Thought traces dynamically interact with residual streams to encode refusal presents a fundamental shift in AI safety and steering. While Paper 2 introduces a valuable, human-centric benchmark for AI agents, Paper 1 offers deeper methodological insights into the internal mechanics and vulnerabilities of state-of-the-art models. This mechanistic discovery will have a broader and more immediate scientific impact on fundamental AI architecture, alignment, and security research.

    vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation
    claude-opus-4.6 · 5/27/2026

    Paper 1 offers a more comprehensive and rigorous scientific contribution with a large-scale dataset (16K+ tasks, 650+ apps), an open-source evaluation toolkit, and systematic analysis of data scaling and reinforcement learning vs supervised finetuning. These methodological insights into training paradigms and the reusable infrastructure (HyperTrack, GUIEvalKit) provide foundational tools likely to be widely adopted. Paper 2 introduces an interesting human-centered benchmarking philosophy, but its 130-task benchmark is smaller in scale and its primary contribution is more conceptual (reframing AI evaluation from replacement to enhancement) than methodological.

    vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
    gpt-5.2 · 5/27/2026

    Paper 1 (MobileGym) likely has higher scientific impact due to a more technically novel and enabling infrastructure: verifiable, deterministic state-based judging, structured full-state capture/fork/compare, and highly parallel RL rollouts for mobile GUI agents—capabilities that can accelerate algorithmic research and reproducibility across the field. It also provides a sizable, parameterized benchmark with deterministic evaluation and shows sim-to-real transfer, increasing real-world applicability. Paper 2 (JobBench) is timely and socially important, but is primarily a benchmark/reframing with less methodological/technical innovation and narrower direct impact on agent training pipelines.

    vs. Test-Time Deep Thinking to Explore Implicit Rules
    gemini-3.1 · 5/27/2026

    Paper 2 introduces a comprehensive and timely benchmark that shifts the paradigm of AI agent evaluation from economic replacement to human-centric delegation. Benchmarks of this scale and philosophical importance typically have foundational, widespread impact by guiding future research directions and setting evaluation standards across the field. While Paper 1 offers a solid algorithmic improvement for agent reasoning, Paper 2's broad applicability and potential to redefine human-AI collaboration give it a higher potential for broad scientific impact.

    vs. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models
    gemini-3.1 · 5/27/2026

    Paper 1 has higher potential impact due to its broader applicability and paradigm-shifting approach. While Paper 2 tackles a critical methodological flaw in financial ML (data contamination/look-ahead bias), its impact is largely confined to quantitative finance. JobBench addresses a central, universal challenge in AI: agent evaluation and labor alignment. By shifting the focus from economic replacement to human-centric delegation across 35 occupations and evaluating 36 models, Paper 1 establishes a comprehensive benchmark likely to influence broad AI development, HCI, and policy discussions on the future of work.

    vs. Boosting Knowledge Graph Foundation Models via Enhanced Negative Sampling
    claude-opus-4.6 · 5/27/2026

    JobBench introduces a novel benchmark paradigm that reframes AI agent evaluation from economic replacement to human-centered delegation, covering 130 tasks across 35 occupations with rigorous evaluation. It addresses a timely and broadly impactful question about how AI should augment human work, evaluates 36 frontier models, and has potential to redirect an entire research community's focus. Paper 2, while solid, proposes an incremental improvement (better negative sampling) for knowledge graph foundation models—a narrower contribution with less potential to reshape research directions or influence policy and societal outcomes.

    vs. BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization
    gemini-3.1 · 5/27/2026

    Paper 2 introduces a comprehensive benchmark for evaluating AI agents across diverse occupational workflows, addressing the critical and timely challenge of human-AI alignment and labor enhancement. Benchmarks in agentic AI currently drive broad industry and academic focus. Paper 1, while methodologically innovative, applies to a more niche domain (brick structure generation) and is likely to have a narrower impact compared to a widely applicable AI agent benchmark.

    vs. Advancing Graph Few-Shot Learning via In-Context Learning
    claude-opus-4.6 · 5/27/2026

    Paper 1 presents a technically novel method (VISION) that combines in-context learning with graph few-shot learning, introducing unsupervised meta-learning and a dual-context fusion module. It addresses well-defined limitations in an established research area with rigorous methodology and experimental validation. Paper 2 (JobBench) introduces a valuable benchmark with a compelling human-centric framing, but benchmarks generally have narrower methodological contributions. Paper 1's innovations in combining ICL with graph learning and unsupervised task generation are more likely to spawn follow-up research and broader methodological impact across graph learning applications.