JobBench: Aligning Agent Work With Human Will
Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao
Abstract
Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.
AI Impact Assessments
Scientific Impact Assessment: JobBench: Aligning Agent Work With Human Will
1. Core Contribution
JobBench introduces a benchmark of 130 agentic tasks across 35 occupations, where task selection is grounded in worker-reported delegation preferences rather than economic value. The key conceptual contribution is reframing AI workplace evaluation from "replacement" to "enhancement," building on WORKBank survey data in which 1,500+ workers rated which duties they would prefer AI to handle. Each task is packaged as a heterogeneous workspace of reference files (averaging 3.9 files per task across 17 formats) and graded by chained binary rubrics averaging 35.6 criteria per task.
The benchmark addresses a genuine gap: existing occupational benchmarks like GDPVal, Remote Labor Index, and $OneMillion-Bench all scope tasks by economic value, asking whether agents can produce deliverables that substitute for human labor. JobBench instead asks whether agents can perform the specific subtasks that professionals themselves want offloaded.
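To make the "chained binary rubrics" idea concrete, here is a minimal sketch of how such a rubric could be scored. The assessment does not spell out the paper's chaining rule, so the rule used here is an assumption: a criterion earns credit only if it is met and its prerequisite criterion (if any) also earned credit. The names `Criterion`, `depends_on`, and `score_task` are hypothetical and do not come from JobBench.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Criterion:
    """One binary rubric item (hypothetical structure, not the paper's schema)."""
    cid: str                          # criterion identifier
    met: bool                         # judge's binary verdict for this item
    depends_on: Optional[str] = None  # prerequisite criterion, if chained


def score_task(criteria: list[Criterion]) -> float:
    """Return the fraction of criteria that earn credit.

    Assumed chaining rule: a criterion earns credit only when it is met
    AND its prerequisite earned credit. Criteria must be listed with
    prerequisites before their dependents.
    """
    if not criteria:
        return 0.0
    credited: dict[str, bool] = {}
    for c in criteria:
        parent_ok = c.depends_on is None or credited.get(c.depends_on, False)
        credited[c.cid] = c.met and parent_ok
    return sum(credited.values()) / len(criteria)


# Illustrative verdicts only; these are not results from the benchmark.
example = [
    Criterion("c1", met=True),
    Criterion("c2", met=True, depends_on="c1"),
    Criterion("c3", met=False),
    Criterion("c4", met=True, depends_on="c3"),  # met, but its prerequisite failed
]
print(score_task(example))
```

Under this assumed rule, an early factual error propagates to every dependent criterion, which is one plausible reason chained rubrics would yield lower scores than independent checklists.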
2. Methodological Rigor
Strengths in design:
Concerns:
3. Potential Impact
Direct impact on benchmark design: JobBench offers a compelling alternative framing for occupational AI evaluation. The human-will-centered selection criterion could influence how future benchmarks scope their task sets, moving beyond pure economic exposure.
Occupational analysis (Section 3.3): The analysis of research papers and YC startups mapped to occupations is an interesting secondary contribution, revealing that attention correlates *negatively* with model capability — communities focus more on areas where agents still struggle. The research-vs-startup attention divergence (Figure 7c) provides actionable intelligence for both communities.
Practical deployment guidance: The scaffold comparison (Table 3) and cost analysis (Figure 5) provide practical value — showing that scaffold choice can shift scores by 6+ points with the same base model, and that GPT-5.5 at 210 provides only marginal gains.
Limitations of impact scope: The benchmark is U.S.-centric, English-only, document-heavy, and covers only 35 of hundreds of occupations. The 130 tasks, while carefully designed, provide thin per-occupation coverage (averaging 3.7 tasks per occupation), limiting fine-grained occupational conclusions.
4. Timeliness & Relevance
The paper arrives at a moment of intense debate about AI's labor-market effects, with multiple competing occupational benchmarks (GDPVal, Remote Labor Index, $OneMillion-Bench) all appearing in 2025-2026. JobBench's human-centered framing is timely and fills a genuine conceptual gap. The comparison showing GDPVal approaching saturation (70-83%) while JobBench-Main remains below 46% suggests the benchmark will remain discriminative for frontier models in the near term.
The paper's use of future model versions (Claude Opus 4.7, GPT-5.5, Gemini 3) suggests this is from a near-future or speculative timeline, which is unusual but doesn't diminish the methodological contribution if the benchmark design itself is evaluated independently.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall Assessment
JobBench makes a meaningful conceptual contribution by reframing occupational AI benchmarking around worker preferences rather than economic value, backed by a technically sound benchmark design with rigorous chained rubrics. The evaluation is comprehensive across models and scaffolds. However, the paper would benefit from stronger validation of its judge reliability, human baselines, and a more critical examination of whether delegation preference truly maps to enhancement rather than replacement. The benchmark's discriminative power and professional reasoning demands are genuine strengths that should sustain its relevance as frontier models improve.
Generated May 27, 2026
Comparison History (19)
JobBench addresses a timely, practical problem in AI agent evaluation with a concrete benchmark covering 130 tasks across 35 occupations, evaluated on 36 models. Its human-centered framing (augmentation vs. replacement) is novel and directly actionable for the AI community. Paper 2 proposes an interesting steganographic provenance tracking framework but is more conceptual and niche, with narrower immediate applicability. JobBench's breadth of impact—spanning AI evaluation, labor economics, and agent development—combined with its methodological rigor (fact-anchored rubrics, extensive model evaluation) gives it higher potential for widespread adoption and citation.
Paper 1 likely has higher scientific impact because it introduces a novel, multi-agent, month-long social simulation to reveal privacy failures that standard single-agent benchmarks miss, directly affecting safety evaluation paradigms for agentic systems. Its findings (social contagion of leakage, robustness against explicit instructions) are broadly relevant across AI safety, privacy, multi-agent systems, and deployment policy, and are timely as persistent agent communities become common. Paper 2 is valuable and rigorous as a benchmark for human-aligned occupational delegation, but its impact is more domain-scoped to evaluation of work agents rather than uncovering a new systemic safety failure mode.
JobBench has broader impact across fields: it covers 35 occupations with 130 tasks, evaluates 36 models, and introduces a paradigm shift from replacement to enhancement framing in AI labor economics. Its methodological contribution (fact-anchored rubrics, heterogeneous workspaces) is generalizable. SpatialBench-Long is rigorous and valuable but narrower in scope (24 evaluations in spatial biology). JobBench's reframing of how we evaluate occupational AI agents has potential to influence AI policy, workforce development, and benchmark design across many domains.
Paper 1 addresses a more fundamental and broadly impactful problem—how to measure progress toward AGI—by proposing a cognitive taxonomy grounded in decades of cognitive science research. This framework has potential to shape how the entire AI field tracks progress, informs governance, and structures evaluation. While Paper 2 (JobBench) makes a valuable contribution with its human-centered benchmark for occupational AI agents, it is more narrowly scoped as one benchmark among many. Paper 1's interdisciplinary foundation and its potential to become a standard framework for AGI evaluation give it greater breadth of impact.
Paper 2 introduces a timely, comprehensive benchmark that shifts the evaluation paradigm of AI agents from job replacement to human empowerment. This conceptual shift and robust evaluation framework will likely guide future agent development and have a broader interdisciplinary impact across AI, HCI, and economics than Paper 1's algorithmic improvement for LLM alignment, which addresses a narrower technical niche.
JobBench introduces a paradigm-shifting benchmark that redefines occupational AI evaluation from economic replacement to human enhancement. By aligning AI capabilities with human delegation preferences, it offers profound societal relevance and broad multidisciplinary impact. While Paper 1 provides a valuable methodological improvement for agent reasoning, Paper 2 addresses urgent ethical and economic questions, likely driving extensive future research in AI alignment, HCI, and agent evaluation.
Paper 2 presents a novel benchmark that fundamentally challenges the current paradigm of AI automation, shifting focus from job replacement to human enhancement. This conceptual innovation, combined with its broad applicability across 35 occupations and relevance to pressing issues in AI alignment and labor economics, gives it a much wider potential impact across multiple fields compared to Paper 1's domain-specific, though rigorous, methodological advance in chemical diagram parsing.
Paper 1 offers a profound paradigm shift by evaluating AI agents based on human empowerment rather than economic replacement. Its extensive scope (130 tasks across 35 occupations) and detailed evaluation framework position it to become a foundational benchmark for human-AI collaboration. While Paper 2 tackles the crucial technical issue of trajectory hallucinations, Paper 1's broader socio-technical relevance, interdisciplinary appeal, and potential to steer the future development of agentic AI towards human-centric workflows give it a higher overall scientific and societal impact.
Paper 1 likely has higher overall scientific impact due to broader cross-field relevance and immediate usability: a large, carefully rubric-graded benchmark for agentic work can shape evaluation practices across AI, HCI, and economics-of-AI, influencing what systems are optimized for. Its novelty is in reframing occupational agent benchmarking around human delegation priorities and providing high-fidelity, workspace-style tasks with rigorous criteria. Paper 2 is methodologically strong and innovative within fMRI decoding, but its impact is narrower (neuroimaging/brain decoding) despite clear technical advances and efficiency gains.
Paper 1 likely has higher scientific impact due to broader, timely relevance and cross-field reach: it introduces a large, rigorous benchmark (130 tasks, 35 occupations, rubric-based grading) that can standardize evaluation of agentic systems across the AI community and influence research agendas (human-alignment framing vs. GDP/replacement). Paper 2 is a useful methodological contribution with clear clinical applicability, but it targets a narrower domain (survival analysis) and shows modest gains; impact may be more limited to medical ML and may face adoption/regulatory barriers.
Paper 1 (JobBench) likely has higher scientific impact due to broader cross-field relevance and timeliness: it reframes agent evaluation around human-prioritized delegation rather than economic replacement, influencing how occupational agents are designed, assessed, and governed. Its scale (130 tasks, 35 occupations) and rubric-based, fact-anchored grading support methodological rigor and wide adoption across NLP, HCI, AI ethics, and labor economics. Paper 2 is innovative and rigorous for coding agents, but its domain is narrower (software) and impact is more specialized.
Paper 1 investigates the mechanistic interpretability of Large Reasoning Models (LRMs), a highly timely frontier in AI. Its discovery that Chain-of-Thought traces dynamically interact with residual streams to encode refusal presents a fundamental shift in AI safety and steering. While Paper 2 introduces a valuable, human-centric benchmark for AI agents, Paper 1 offers deeper methodological insights into the internal mechanics and vulnerabilities of state-of-the-art models. This mechanistic discovery will have a broader and more immediate scientific impact on fundamental AI architecture, alignment, and security research.
Paper 1 offers a more comprehensive and rigorous scientific contribution with a large-scale dataset (16K+ tasks, 650+ apps), an open-source evaluation toolkit, and systematic analysis of data scaling and reinforcement learning vs supervised finetuning. These methodological insights into training paradigms and the reusable infrastructure (HyperTrack, GUIEvalKit) provide foundational tools likely to be widely adopted. Paper 2 introduces an interesting human-centered benchmarking philosophy, but its 130-task benchmark is smaller in scale and its primary contribution is more conceptual (reframing AI evaluation from replacement to enhancement) than methodological.
Paper 1 (MobileGym) likely has higher scientific impact due to a more technically novel and enabling infrastructure: verifiable, deterministic state-based judging, structured full-state capture/fork/compare, and highly parallel RL rollouts for mobile GUI agents—capabilities that can accelerate algorithmic research and reproducibility across the field. It also provides a sizable, parameterized benchmark with deterministic evaluation and shows sim-to-real transfer, increasing real-world applicability. Paper 2 (JobBench) is timely and socially important, but is primarily a benchmark/reframing with less methodological/technical innovation and narrower direct impact on agent training pipelines.
Paper 2 introduces a comprehensive and timely benchmark that shifts the paradigm of AI agent evaluation from economic replacement to human-centric delegation. Benchmarks of this scale and philosophical importance typically have foundational, widespread impact by guiding future research directions and setting evaluation standards across the field. While Paper 1 offers a solid algorithmic improvement for agent reasoning, Paper 2's broad applicability and potential to redefine human-AI collaboration give it a higher potential for broad scientific impact.
Paper 1 has higher potential impact due to its broader applicability and paradigm-shifting approach. While Paper 2 tackles a critical methodological flaw in financial ML (data contamination/look-ahead bias), its impact is largely confined to quantitative finance. JobBench addresses a central, universal challenge in AI: agent evaluation and labor alignment. By shifting the focus from economic replacement to human-centric delegation across 35 occupations and evaluating 36 models, Paper 1 establishes a comprehensive benchmark likely to influence broad AI development, HCI, and policy discussions on the future of work.
JobBench introduces a novel benchmark paradigm that reframes AI agent evaluation from economic replacement to human-centered delegation, covering 130 tasks across 35 occupations with rigorous evaluation. It addresses a timely and broadly impactful question about how AI should augment human work, evaluates 36 frontier models, and has potential to redirect an entire research community's focus. Paper 2, while solid, proposes an incremental improvement (better negative sampling) for knowledge graph foundation models—a narrower contribution with less potential to reshape research directions or influence policy and societal outcomes.
Paper 2 introduces a comprehensive benchmark for evaluating AI agents across diverse occupational workflows, addressing the critical and timely challenge of human-AI alignment and labor enhancement. Benchmarks in agentic AI currently drive broad industry and academic focus. Paper 1, while methodologically innovative, applies to a more niche domain (brick structure generation) and is likely to have a narrower impact compared to a widely applicable AI agent benchmark.
Paper 1 presents a technically novel method (VISION) that combines in-context learning with graph few-shot learning, introducing unsupervised meta-learning and a dual-context fusion module. It addresses well-defined limitations in an established research area with rigorous methodology and experimental validation. Paper 2 (JobBench) introduces a valuable benchmark with a compelling human-centric framing, but benchmarks generally have narrower methodological contributions. Paper 1's innovations in combining ICL with graph learning and unsupervised task generation are more likely to spawn follow-up research and broader methodological impact across graph learning applications.