Design and Report Benchmarks for Knowledge Work

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

May 22, 2026

arXiv:2605.23262v1

cs.AI (primary)

#1373 of 2682 · Artificial Intelligence

Tournament Score: 1406 ± 42

Win Rate: 55%

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

The development of LLM agents has led to a growing body of work on knowledge-work AI, including coding, research, and healthcare. However, current knowledge-work evaluation and benchmark design still largely follow the logic of traditional NLP tasks. As a result, higher benchmark performance does not reliably show that a system can carry out knowledge work in real-world deployment settings. This paper contributes a three-step approach for making explicit how benchmarked tasks represent the work claims attached to their scores: defining the work activity under evaluation, specifying the tested setting, and scoring the appropriate work product. We review work studies showing that knowledge work is organized through roles and responsibilities, local materials and tools, and artifacts that must remain usable in downstream workflows. We then translate these concerns into benchmark design and reporting guidance, covering how tasks should be mapped to work activities, how tested settings should specify materials, tools, roles, and constraints, and how scoring should focus on the work product left by the system. To name the work activity being evaluated and distinguish it from common benchmark tasks, we derive an inventory of 18 work activities from the O*NET occupational task database. We demonstrate the approach through three benchmark case analyses: GDPval, a non-code occupational deliverable benchmark; OfficeQA Pro, a grounded document-analysis benchmark scored by final answers; and APEX-SWE, a software-engineering benchmark with executable scored products. These cases show how benchmark design choices shape the strongest work claim a score can support, and where gaps arise between the benchmarked task, tested setting, scored product, and broader work claim.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents a three-step framework for designing and reporting benchmarks that evaluate LLM agents on knowledge work: (1) defining the work activity under evaluation, (2) specifying the tested setting, and (3) scoring the appropriate work product. The central argument is that current AI benchmarks inherit NLP-style evaluation logic—bounded input-output tasks with automated metrics—which creates a gap between what benchmark scores measure and the broader work-capability claims they are used to support. The paper also derives an inventory of 18 cross-occupation work activities from the O*NET occupational task database, intended as a shared vocabulary for benchmark designers.

The paper addresses a real and growing problem: as LLM agents are deployed in professional settings (coding, healthcare, legal, administrative work), benchmark scores are routinely over-interpreted as evidence of occupational competence. The framework attempts to discipline this inference chain by making explicit what work a benchmark actually tests.
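To make the three steps concrete, here is a minimal sketch of how they could be captured as a reporting record that flags unstated fields. This is an illustration only: the class names, field names, and example values are assumptions made for this sketch and are not taken from the paper's own reporting template.

```python
from dataclasses import dataclass, field


@dataclass
class SettingSpec:
    """Step 2: the setting the system was tested in (field names are illustrative)."""
    materials: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)
    role: str = ""
    constraints: list[str] = field(default_factory=list)


@dataclass
class BenchmarkReport:
    """Hypothetical reporting record, not the paper's own template."""
    work_activity: str     # Step 1: the work activity under evaluation
    setting: SettingSpec   # Step 2: materials, tools, role, constraints
    scored_product: str    # Step 3: the work product that is scored
    work_claim: str        # the claim the score is meant to support

    def missing_fields(self) -> list[str]:
        """Return the names of fields left empty, in declaration order."""
        checks = {
            "work_activity": self.work_activity.strip(),
            "setting.materials": self.setting.materials,
            "setting.tools": self.setting.tools,
            "setting.role": self.setting.role.strip(),
            "setting.constraints": self.setting.constraints,
            "scored_product": self.scored_product.strip(),
            "work_claim": self.work_claim.strip(),
        }
        return [name for name, value in checks.items() if not value]


# Invented example values for a made-up benchmark.
report = BenchmarkReport(
    work_activity="review and reconcile financial records",
    setting=SettingSpec(
        materials=["ledger export (CSV)"],
        tools=["spreadsheet"],
        role="",          # left blank on purpose
        constraints=[],   # left blank on purpose
    ),
    scored_product="reconciled ledger file",
    work_claim="can reconcile a monthly ledger without supervision",
)
print(report.missing_fields())  # ['setting.role', 'setting.constraints']
```

A record like this only checks that each element is stated; it says nothing about whether the stated activity, setting, and product actually justify the work claim.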

Methodological Rigor

The paper is primarily a conceptual and methodological contribution rather than an empirical one, so rigor should be assessed on the coherence of the framework and the quality of its grounding.

Strengths in grounding: The theoretical foundations are well-chosen. Drawing on Abbott's sociology of professions, Suchman's situated action theory, Carlile's boundary objects, and Malone/Crowston's coordination theory provides genuine intellectual depth. The connection between these sociological/organizational insights and benchmark design decisions is clearly articulated—roles and responsibilities justify specifying work activities, situated action justifies specifying tested settings, and coordination/boundary objects justify scoring work products rather than just outputs.

The O*NET inventory construction follows a reasonable pipeline (Job Zones 3–5 filtering → knowledge-work screening → profession-neutral rewriting → embedding/clustering → expert consolidation), though several decisions rely heavily on GPT-5.5-assisted classification with limited transparency about inter-rater reliability or sensitivity to prompting choices. The ESCO cross-check in Appendix B is a useful sanity check showing all 18 activities are populated across both ontologies, though it stops short of formal validation.

The case analyses (GDPval, OfficeQA Pro, APEX-SWE) are illustrative rather than evaluative. They demonstrate how the framework can be applied but do not produce new empirical findings about system capabilities or benchmark validity. The analyses are careful in distinguishing what each benchmark's score can and cannot support, but they examine only one or a few instances from each benchmark, which limits generalizability.

A notable weakness is the absence of any empirical demonstration that applying this framework actually improves benchmark design or changes evaluation outcomes. The paper does not show, for instance, that scoring work products (rather than outputs) leads to different system rankings, or that work-activity labels change how practitioners interpret scores.
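One way such a demonstration could be set up is to score the same systems twice, once on outputs and once on work products, and measure how far the two rankings agree. The sketch below uses Kendall's tau-a (no tie correction) for that comparison. The system names and scores are invented for illustration and are not results from the paper or from any benchmark.

```python
from itertools import combinations


def kendall_tau(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Kendall tau-a between two score tables over the same systems.

    Returns 1.0 for identical rankings and -1.0 for fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    if set(scores_a) != set(scores_b):
        raise ValueError("both score tables must cover the same systems")
    systems = sorted(scores_a)
    if len(systems) < 2:
        raise ValueError("need at least two systems to compare rankings")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both schemes order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores: the same four systems under two scoring schemes.
output_scores = {"sys_a": 0.82, "sys_b": 0.75, "sys_c": 0.60, "sys_d": 0.55}
product_scores = {"sys_a": 0.40, "sys_b": 0.65, "sys_c": 0.50, "sys_d": 0.30}

tau = kendall_tau(output_scores, product_scores)
print(round(tau, 3))  # 0.333
```

In this invented example four of the six system pairs keep their order and two flip, giving a tau of 1/3. A value well below 1 on real data would be the kind of evidence the review asks for, namely that the choice of scored product changes which system looks best.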

Potential Impact

The paper could influence benchmark design practices across multiple domains—software engineering, healthcare, legal AI, enterprise automation—by providing a structured vocabulary and reporting template. The three-step framework is simple enough to be actionable: benchmark papers could adopt it as a reporting checklist without fundamental redesign.

The 18-work-activity inventory, if adopted, could serve as a coordination device for the field, enabling cross-benchmark coverage audits (e.g., "which work activities does healthcare AI evaluation actually cover?"). This addresses a genuine gap: current benchmarks are organized by domain or component task, making it difficult to assess whether evaluation covers the activities that matter for deployment.
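A coverage audit of this kind could be as simple as the following sketch, which assumes each benchmark has already been labeled with the work activities it tests. The activity labels are numbered placeholders rather than the paper's 18 activity names, and the benchmark names and labels are invented.

```python
# Placeholder labels standing in for the 18 activities; the real names are not reproduced here.
INVENTORY = [f"activity_{i:02d}" for i in range(1, 19)]


def coverage_audit(
    benchmark_labels: dict[str, list[str]], inventory: list[str]
) -> tuple[dict[str, list[str]], list[str]]:
    """Map each covered activity to its benchmarks and list the uncovered activities."""
    inventory_set = set(inventory)
    covered: dict[str, list[str]] = {}
    for benchmark, activities in benchmark_labels.items():
        for activity in activities:
            if activity not in inventory_set:
                raise ValueError(f"{benchmark}: unknown activity {activity!r}")
            covered.setdefault(activity, []).append(benchmark)
    # Preserve inventory order so the gap list is easy to scan.
    uncovered = [a for a in inventory if a not in covered]
    return covered, uncovered


# Hypothetical labeling of three made-up benchmarks.
labels = {
    "bench_x": ["activity_03", "activity_07"],
    "bench_y": ["activity_07"],
    "bench_z": ["activity_12"],
}

covered, uncovered = coverage_audit(labels, INVENTORY)
print(len(uncovered))          # 15
print(covered["activity_07"])  # ['bench_x', 'bench_y']
```

The audit is only as good as the labeling step, which this sketch takes as given.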

The practical impact depends on uptake. Unlike a new model architecture or training technique, a reporting framework requires community adoption to matter. The paper's influence may be more in shaping discourse and norms than in producing direct technical advances.

Timeliness & Relevance

This paper is highly timely. The rapid deployment of coding agents (Devin, Cursor, etc.), research assistants, and enterprise AI tools has created acute demand for evaluation methods that go beyond traditional NLP benchmarks. The disconnect between SWE-bench scores and real software engineering capability is already a recognized issue (as the paper notes, citing Wang et al. 2025 on "solved" issues that don't meet developer expectations). The framework directly addresses this credibility gap.

The paper also arrives at a moment when major AI labs are releasing work-oriented benchmarks (GDPval from OpenAI, APEX-SWE, OfficeQA Pro), making the design guidance immediately relevant to ongoing benchmark development.

Strengths

1. Intellectually grounded: The framework is not ad hoc but derived from established traditions in organizational sociology and situated cognition, giving it theoretical legitimacy.

2. Actionable taxonomy: The 18 work activities with explicit definitions, contrasts, and design notes (Table C.1) provide concrete guidance rather than abstract principles.

3. Clear distinction between output and product: The paper's key conceptual move—distinguishing "work output" (visible content) from "work product" (artifact usable in downstream workflows)—is genuinely useful and underappreciated.

4. Appropriate epistemic humility: The paper is careful about what it claims. It explicitly states the inventory is preliminary, deployment evidence is still needed, and the framework doesn't address all benchmark quality concerns.

Limitations

1. No empirical validation: The framework is not tested against actual benchmark redesign outcomes. Does applying it change system rankings? Does it predict deployment success better?

2. Inventory construction opacity: Heavy reliance on LLM-assisted screening and classification introduces unquantified noise. The expert panel process is described but not detailed enough for replication.

3. Limited case analysis depth: Analyzing single instances from each benchmark is illustrative but insufficient to characterize entire benchmark suites.

4. Boundary conditions unclear: The paper doesn't address when the framework is unnecessary (simple component benchmarks may not need work-activity framing) or when it's insufficient (highly novel AI-mediated workflows that don't map to existing O*NET categories).

5. Risk of over-formalization: The framework could become a checklist that benchmark papers nominally satisfy without genuine improvement in evaluation quality.

Overall Assessment

This is a well-crafted conceptual contribution that identifies a real and important problem—the gap between benchmark scores and work-capability claims—and provides a structured, theoretically grounded approach to address it. Its impact will depend heavily on community adoption. The paper would be significantly strengthened by empirical demonstrations showing that applying the framework changes evaluation outcomes in meaningful ways. As it stands, it is a useful design guide and a thoughtful critique of current evaluation practices, but its ultimate contribution remains prospective.

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 25, 2026

Comparison History (22)

vs. Representation Without Control: Testing the Realization Effect in Language Models

claude-opus-4.6 · 5/26/2026

Paper 1 makes a methodologically rigorous and novel contribution by demonstrating that representational readout and causal control are dissociable in LLMs—a finding with broad implications for mechanistic interpretability research. It introduces a clear three-level evaluation framework (behavioral sensitivity, linear readout, causal steering) and provides a concrete cautionary result against over-interpreting probe-based evidence. Paper 2 offers useful benchmark design guidance but is more of a framework/taxonomy contribution with less empirical novelty. Paper 1's findings are more likely to influence ongoing interpretability and AI behavioral simulation research directions.

vs. JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a fundamental methodological gap in how LLM benchmarks relate to real-world knowledge work, providing a systematic framework (work activity definition, tested setting specification, work product scoring) with practical guidance applicable across the entire field of AI evaluation. Its contribution—a reusable benchmark design methodology grounded in occupational science—has broader and more lasting impact than Paper 2, which presents an incremental model improvement (JT-Safe-V2) in the crowded space of safety-aligned LLMs. Paper 1's framework could reshape how the community designs and interprets benchmarks, affecting many downstream research efforts.

vs. NeurIPS: Neuro-anatomical Inductive Priors for Sphere-based Brain Decoding

gpt-5.2 · 5/26/2026

Paper 1 targets a fast-moving, cross-domain bottleneck: how to validly evaluate LLM “knowledge work” performance. Its framework (work-activity definition, tested setting specification, work-product scoring) plus an O*NET-derived activity inventory and multi-benchmark case analyses can reshape benchmark design, reporting standards, and downstream claims across many AI subfields and applications. Paper 2 is technically strong and timely for neuroimaging, but its impact is likely narrower to fMRI/surface-decoding communities. Overall, Paper 1 has broader and more field-spanning potential impact.

vs. Insuring Every Action: An Authority Frontier Framework for Runtime Actuarial Control of Autonomous AI Agents

gpt-5.2 · 5/26/2026

Paper 2 offers a more technically novel, operational framework (AAI + Authority Frontier) for runtime control and measurement of autonomous-agent side effects, with clear real-world applicability to safety, finance, and governance of tool-using agents. It defines concrete protocols, metrics, and cross-domain normalization, and demonstrates instantiations across multiple environments, suggesting stronger methodological and translational impact. Paper 1 is timely and valuable for benchmark design in knowledge work, but is primarily conceptual/guidance-oriented with narrower immediate deployment leverage compared to a runtime control interface and evaluation primitive.

vs. AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

claude-opus-4.6 · 5/26/2026

Paper 2 applies rigorous psychometric methods (CFA, Generalizability Theory) to quantitatively decompose benchmark reliability, yielding actionable diagnostics for the entire AI evaluation ecosystem. Its findings—such as unreliable scaling law slopes, local dependence among benchmarks, and the outsized role of contributor metadata—are immediately applicable across all LLM benchmarking efforts. Paper 1 offers valuable conceptual guidance for knowledge-work benchmarks but is more prescriptive and narrower in scope. Paper 2's quantitative framework and surprising empirical findings have broader methodological impact and are more likely to reshape evaluation practices.

vs. Second Guess: Detecting Uncertainty Through Abstention and Answer Stability in Small Language Models

gemini-3.1 · 5/26/2026

Paper 2 addresses a systemic and fundamental issue in AI research: the misalignment between current NLP benchmarks and real-world knowledge work. By proposing a comprehensive framework for designing and reporting agentic evaluations, it has the potential to influence how future AI systems are validated across multiple domains. In contrast, Paper 1 offers a valuable but narrower methodological improvement (a prompting technique) specifically for uncertainty detection in small language models.

vs. CODESKILL: Learning Self-Evolving Skills for Coding Agents

gpt-5.2 · 5/26/2026

Paper 2 has higher likely scientific impact because it targets the evaluation foundations of knowledge-work agents across domains, offering a general framework (activity/setting/product) and an 18-activity taxonomy grounded in O*NET, plus concrete benchmark case analyses. This can reshape how the field designs, reports, and interprets benchmarks—affecting many subareas (coding, research, healthcare, office work) and improving external validity. Paper 1 is technically novel and practically useful for coding agents, but its impact is narrower (primarily software-engineering agents and skill-memory methods) and depends on adoption within that slice.

vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

gemini-3.1 · 5/25/2026

Paper 1 addresses a critical bottleneck in AI: the misalignment between traditional NLP benchmarks and real-world knowledge work. By proposing a comprehensive evaluation framework based on occupational studies, it has the potential to fundamentally shift how AI agents are tested across diverse industries. While Paper 2 offers a timely security contribution, Paper 1's foundational approach to evaluation provides broader methodological impact and wider real-world applicability across all sectors integrating AI workflows.

vs. Parallel Context Compaction for Long-Horizon LLM Agent Serving

gpt-5.2 · 5/25/2026

Paper 2 has higher potential scientific impact because it proposes a general, field-shaping framework for designing and reporting benchmarks for real-world knowledge work, grounded in work-studies concepts and operationalized via an 18-activity inventory and concrete case analyses. This can influence evaluation practices across many domains (coding, research, healthcare, enterprise), improving rigor and relevance of future benchmarks and deployment claims. Paper 1 is a useful, timely systems contribution for long-horizon LLM serving, but its impact is narrower (agent infrastructure) and more likely to be superseded by evolving context-window and memory architectures.

vs. DART: Semantic Recoverability for Structured Tool Agents

claude-opus-4.6 · 5/25/2026

Paper 2 addresses a fundamental and broadly applicable problem in LLM evaluation methodology—how benchmarks for knowledge-work AI should be designed and reported to ensure scores reflect real-world capability. Its framework (18 work activities from O*NET, three-step approach) provides reusable infrastructure for the entire AI evaluation community across multiple domains. Paper 1, while technically rigorous in formalizing semantic recoverability for tool agents, addresses a narrower problem (mid-execution failure recovery) with a more specialized audience. Paper 2's potential to reshape evaluation practices across coding, research, healthcare, and other knowledge-work domains gives it broader and more lasting impact.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

gpt-5.2 · 5/25/2026

Paper 2 likely has higher impact due to a concrete, technically novel system (neurosymbolic scaffold + expert-iteration) demonstrated on Lean 4 with measurable gains: a 7B model outperforming much larger baselines and improving proof refactoring across new structural metrics. This is methodologically testable and directly advances automated theorem proving, formal verification, and LLM training-data quality, with clear downstream applications in math libraries and software/hardware verification. Paper 1 offers valuable conceptual guidance for benchmark design, but is less directly actionable and may diffuse into slower, qualitative adoption.

vs. Foundation Protocol: A Coordination Layer for Agentic Society

gemini-3.1 · 5/25/2026

Paper 1 addresses a critical and immediate bottleneck in AI research: the misalignment between traditional NLP benchmarks and real-world knowledge work. By providing a rigorous, actionable framework for benchmark design and reporting, it offers high methodological value that can immediately influence how AI agents are evaluated globally. While Paper 2 presents a visionary protocol for multi-agent coordination, Paper 1's concrete methodological rigor and immediate applicability to current LLM evaluation give it a more assured and foundational scientific impact.

vs. From Raw Experience to Skill Consumption: A Systematic Study of Model-Generated Agent Skills

claude-opus-4.6 · 5/25/2026

Paper 1 provides a comprehensive empirical framework for understanding model-generated agent skills across the full lifecycle, with actionable findings (e.g., meta-skills reducing negative transfer) backed by experiments across five domains. It addresses a core challenge in the rapidly growing field of language agents with concrete, reproducible contributions. Paper 2 offers valuable conceptual guidance for benchmark design in knowledge work but is primarily a methodological/position paper with case analyses rather than novel empirical results, limiting its immediate technical impact despite its relevance.

vs. AutoResearch AI: Towards AI-Powered Research Automation for Scientific Discovery

gemini-3.1 · 5/25/2026

Paper 1 offers a comprehensive framework for the automation of scientific discovery, an emerging paradigm with profound, cross-disciplinary implications. By conceptualizing 'AutoResearch' and defining the transition from task-level AI to workflow-level automation, it provides a foundational taxonomy for future AI scientists. While Paper 2 addresses a critical need in LLM evaluation methodology, Paper 1's focus on reshaping the scientific method itself promises broader impact, higher timeliness, and relevance to researchers across all scientific domains, making it highly likely to become a widely cited foundational text.

vs. CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

gemini-3.1 · 5/25/2026

Paper 1 addresses a highly timely and critical challenge in AI: bridging the gap between LLM benchmarks and real-world knowledge work. Its systematic framework has broad applicability across multiple domains like software engineering and healthcare. In contrast, Paper 2 explores a hybrid algorithmic approach for a niche scheduling problem but explicitly admits it is not competitive with state-of-the-art solvers, significantly limiting its practical utility and breadth of impact.

vs. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, optimizer-like framework (SkillOpt) for reproducible, controllable skill improvement with validation-gated edits, and reports extensive empirical gains across many models/benchmarks/harnesses plus transfer. This combination of methodological specificity, measurable performance improvements, and near-term deployability makes it broadly useful for agent training and tooling. Paper 1 offers important conceptual guidance for benchmark design and interpretation, but is less directly actionable as a new method and may yield slower, more diffuse downstream adoption compared to a strong, general-purpose optimization technique.

vs. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact: it introduces a concrete, broadly applicable inference-time technique (IdleSpec) that improves LLM-agent performance under realistic tool/interaction latency, with measurable gains on multiple established benchmarks (GAIA, FRAMES, MLE-Bench). This is timely for deployed agent systems and can transfer across domains that involve asynchronous tools and long-horizon tasks. Paper 1 provides important conceptual guidance for benchmarking knowledge work, but it is primarily methodological/position-oriented and may diffuse impact more slowly without an immediately adoptable, quantitative standard or artifact.

vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

claude-opus-4.6 · 5/25/2026

Paper 1 introduces a novel, well-defined concept (epistemic miscalibration in planning) with a concrete technical solution (EPC-AW) and quantitative experimental validation showing 9.75% improvement. It addresses a fundamental failure mode in LLM multi-agent systems that is both timely and practically important. Paper 2 provides valuable benchmark design guidance, but is more of a methodological framework/position paper without introducing new technical methods. While Paper 2 addresses an important meta-problem, Paper 1's concrete technical contribution with empirical results is more likely to drive follow-on research and real-world system improvements.

vs. FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

gemini-3.1 · 5/25/2026

While Paper 1 demonstrates impressive industrial-scale application in recommender systems, Paper 2 addresses a critical and timely methodological bottleneck in AI evaluation: benchmarking LLM agents for real-world knowledge work. By providing a structured evaluation framework for AI agents, Paper 2 has the potential for much broader cross-disciplinary impact, influencing how future AI systems are designed, tested, and validated across multiple domains.

vs. Solving the Aircraft Disassembly Scheduling Problem

gpt-5.2 · 5/25/2026

Paper 1 is likely higher impact due to broader relevance and timeliness: it targets a central, cross-domain bottleneck in evaluating LLM-based “knowledge work” systems and proposes a generalizable framework plus an activity inventory and case analyses that could influence how many benchmarks are designed and interpreted. Its potential applications span AI research, industry deployment validation, and policy/standardization. Paper 2 is methodologically solid and practically valuable, but is more domain-specific (aircraft disassembly scheduling) and incremental within established CP/MIP scheduling literature, limiting breadth of impact.