Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

Tengchao Lv, Dongdong Zhang, Jiayu Ding, Yilin Jia, Yuzhong Zhao, Yupan Huang, Wenshan Wu, Xiangyang Zhou

Jun 9, 2026 · arXiv:2606.10956v1

cs.AI · cs.CL

#2702 of 3539 · Artificial Intelligence

Tournament Score: 1326 ± 44 (axis range 1050 to 1800)

Win Rate: 38%

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Abstract

The deployment of Large Language Model (LLM) agents for computer automation is accelerating, yet their ability to navigate complex, professional-grade productivity software is largely untested. We argue that Office automation is an ideal environment for benchmarking document-automation capability, as it requires long-horizon planning and reasoning, precise parameter configuration, and multi-application integration. To quantify this capability, we introduce an evaluation based on China's National Computer Rank Examination (NCRE), featuring 200 comprehensive practical-operation tasks across Word, Excel, and PowerPoint. Each task is scored on a 100-point rubric scale using 7,118 machine-gradable criteria, and Score Rate (SR) denotes the mean percentage of rubric points earned across these tasks. We benchmark 7 frontier LLMs and observe stark limitations: single-turn models score a maximum of 36.6%. A stronger agentic system with execution feedback, iterative repair, and broader Office automation access reaches 68.8%, but remains below the 95.5% community-reference score used as a scoring sanity check. Ultimately, our experiments demonstrate that despite recent advancements in code generation, achieving reliable fine-grained Office document automation remains a significant challenge for current code-generating LLM and agent systems.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?"

1. Core Contribution

The paper introduces OFFICEEVAL, a benchmark of 200 practical Office automation tasks (Word, Excel, PowerPoint) derived from China's National Computer Rank Examination (NCRE), scored against 7,118 machine-gradable criteria. The central novelty lies in repurposing a real, nationally standardized human certification exam—designed by domain experts and administered to over 110 million candidates—as an AI evaluation framework. This provides an externally anchored scoring rubric (100-point scale per task with partial credit) rather than synthetic or crowdsourced tasks with binary pass/fail metrics. The benchmark evaluates 7 frontier LLMs in single-turn code generation and autonomous coding-agent settings, revealing that even the best systems (68.8% for Codex agent) fall well short of community-reference solutions (95.5%).

2. Methodological Rigor

Strengths in evaluation design: The deterministic, criterion-level scoring via Office Open XML parsing and COM automation is a significant methodological strength. With 7,118 independent evaluation signals, the benchmark offers much higher granularity than binary task-completion metrics. The partial-credit scoring provides informative signal about capability gradients rather than cliff-edge pass/fail outcomes.
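To make the idea of criterion-level, partial-credit scoring concrete, here is a minimal sketch. It is not the paper's grader: the sample XML run, the three criteria, and their point weights are all hypothetical, and the helper names (`attr`, `score_task`, `score_rate`) are invented for illustration. It only assumes the standard WordprocessingML namespace and the fact that `w:sz` stores font size in half-points.

```python
import xml.etree.ElementTree as ET

# Standard WordprocessingML namespace used inside word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical run as it might appear in a submitted .docx (bold, 12 pt, blue).
SAMPLE_RUN = f'''<w:r xmlns:w="{W}">
  <w:rPr><w:b/><w:sz w:val="24"/><w:color w:val="0000FF"/></w:rPr>
  <w:t>Quarterly Report</w:t>
</w:r>'''


def attr(run, tag):
    """Return the w:val attribute of a run property, or None if it is absent."""
    node = run.find(f"w:rPr/w:{tag}", NS)
    return None if node is None else node.get(f"{{{W}}}val")


# (description, rubric points, predicate); every entry here is an assumption.
CRITERIA = [
    ("title is bold", 20, lambda r: r.find("w:rPr/w:b", NS) is not None),
    ("title is 12 pt (w:sz counts half-points)", 40, lambda r: attr(r, "sz") == "24"),
    ("title is red", 40, lambda r: attr(r, "color") == "FF0000"),
]


def score_task(run, criteria):
    """Partial credit: percentage of rubric points whose criterion passes."""
    earned = sum(points for _, points, check in criteria if check(run))
    total = sum(points for _, points, _ in criteria)
    return 100.0 * earned / total


def score_rate(task_scores):
    """Score Rate (SR): mean per-task percentage across all tasks."""
    return sum(task_scores) / len(task_scores)


run = ET.fromstring(SAMPLE_RUN)
task_score = score_task(run, CRITERIA)  # bold and size pass, color fails
print(task_score, score_rate([task_score, 100.0]))
```

A binary pass/fail metric would record this document as a failure, whereas the per-criterion view shows that only the color setting is wrong.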

Statistical practices: Results are averaged over 3 independent runs with reported standard deviations (0.4–2.1pp). Statistical tests (paired t-tests, bootstrap CIs) are used to assess significance of model differences. The cross-language analysis (Chinese vs. English) with 5 models adds an important control.
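For readers unfamiliar with the bootstrap confidence intervals mentioned here, the following is a generic sketch of a paired percentile bootstrap over per-task score differences. The review does not describe the paper's exact procedure, so the function name, resample count, and the example scores are assumptions chosen only for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-task difference (A minus B).

    Tasks are resampled with replacement; pairing is preserved because each
    resampled item is the difference between the two systems on one task.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot_means[int((alpha / 2) * n_resamples)]
    hi = boot_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lo, hi


# Hypothetical per-task scores (0 to 100) for two systems on the same six tasks.
system_a = [72, 65, 80, 55, 90, 61]
system_b = [60, 66, 71, 50, 85, 58]
observed, lower, upper = paired_bootstrap_ci(system_a, system_b)
print(f"mean difference {observed:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the resulting interval excludes zero, the difference between the two systems is unlikely to be explained by which tasks happened to be sampled.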

Concerns: The coding-agent comparison conflates multiple variables (execution feedback, repair budget, tool access, scaffolding) without ablation—acknowledged by the authors but still limiting causal inference. The community-reference solutions (95.5%) serve as a sanity check rather than a formal human baseline, which somewhat weakens the human-AI comparison narrative. The copyright restrictions on NCRE materials prevent full data release, which limits exact reproducibility, though the authors compensate with detailed pipeline documentation, prompts, and criterion-level statistics.

Evaluation scope: Only code-generation agents are tested; GUI-based agents (which might better leverage visual feedback for layout-heavy tasks) are excluded. This is a meaningful gap given the paper's own finding that visual categories like Animation and Graphics & Media are the weakest.

3. Potential Impact

Benchmarking community: OFFICEEVAL fills a genuine gap. Prior Office automation benchmarks (OfficeBench, OdysseyBench, SheetCopilot, PPTCBench) either focus on single applications, use synthetic tasks, or rely on workflow-level evaluation. The combination of multi-application coverage, real exam provenance, and deterministic fine-grained grading is novel and practically useful.

Agent development: The error taxonomy (Tables 5-6) provides actionable diagnostic information. The finding that 97.4% of non-crash weighted loss in the best coding agent stems from implementation-knowledge errors (wrong OOXML property paths, enumeration constants, color encodings) directly suggests engineering interventions: skill libraries, retrieval-augmented generation over Office documentation, or fine-tuning on Office API specifications.
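As one plausible illustration of a color-encoding error (an example chosen here, not one reported in the review), OOXML stores colors as RRGGBB hex strings, while Office COM color properties take an integer in the byte order produced by VBA's RGB function, with red in the lowest byte. The helper names below are hypothetical.

```python
def ooxml_hex(r, g, b):
    """Color as written in OOXML attributes such as w:color: 'RRGGBB'."""
    return f"{r:02X}{g:02X}{b:02X}"


def com_rgb(r, g, b):
    """Color integer as computed by VBA's RGB(): red in the lowest byte."""
    return r + (g << 8) + (b << 16)


red_hex = ooxml_hex(255, 0, 0)   # 'FF0000'
red_com = com_rgb(255, 0, 0)     # 255
naive = int(red_hex, 16)         # 16711680, which COM reads as pure blue

print(red_hex, red_com, naive == com_rgb(0, 0, 255))
```

Code that runs without error but passes the hex value straight to a COM property would therefore produce blue text where red was requested, which is the kind of silent mismatch a criterion-level grader catches.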

Industry relevance: Office automation is a massive market. Demonstrating that frontier LLMs fail at fine-grained document manipulation despite strong code generation capabilities is commercially relevant and sets realistic expectations for AI-powered productivity tools.

Cross-field influence: The methodology of repurposing standardized human professional exams as AI benchmarks could generalize to other certification domains (CAD, database administration, graphic design).

4. Timeliness & Relevance

The paper addresses a timely need. LLM-based agents are being aggressively marketed for computer automation, yet rigorous evaluation of their Office capabilities has lagged. The paper arrives at a moment when coding agents (Claude Code, Codex) are being deployed commercially, making the performance gap quantification particularly relevant. The 2026 publication date and evaluation of very recent models (Claude Opus 4.7, GPT-5.5, Gemini 3.1 Pro) ensure currency.

5. Strengths & Limitations

Key Strengths:

External validity anchor: Using a real, nationally administered exam with established difficulty calibration is far stronger than synthetic benchmarks. The 60-point passing threshold and community-reference score provide meaningful interpretive context.

Diagnostic depth: The criterion-level taxonomy that distinguishes execution crashes from implementation-knowledge errors, further broken into actionable subtypes (OOXML paths, enumerations, color encodings), goes beyond aggregate scoring to provide genuine diagnostic value.

Comprehensive coverage: 200 tasks across 3 applications, 2 difficulty levels, 8 skill categories, with 7,118 criteria—this is dense evaluation coverage.

Revealing finding about the "implementation knowledge gap": The insight that agents can write code that runs (98-99% execution success) but still fails on fine-grained Office-specific semantics is a valuable contribution that redirects attention from code generation quality to domain-specific knowledge.

Notable Limitations:

Reproducibility constraints: Copyright restrictions prevent data release; researchers must independently obtain NCRE materials.

Limited agent paradigms tested: No GUI-based agents, no retrieval-augmented approaches, no fine-tuned models.

Chinese-centric: While the cross-language analysis partially addresses this, a natively English exam (e.g., Microsoft Office Specialist) would strengthen generalizability claims.

Potential contamination: NCRE materials may appear in training data; the direction and magnitude of contamination effects are unknown and could inflate scores.

Single evaluation paradigm: Only code-based automation is tested, yet many real Office workflows involve mixed modalities.

Additional Observations

The paper's finding that PowerPoint is hardest in single-turn but shows the largest agent improvement is insightful—it reveals that the bottleneck shifts from API coverage (python-pptx limitations) to precise parameter knowledge (COM constants) as tool access broadens. The skill-level analysis showing complementary strengths between Claude Opus 4.7 (document structure) and GPT-5.5 (data-centric operations) despite near-identical aggregate scores demonstrates the value of fine-grained evaluation.

The paper is well-structured and clearly written, with appropriate caveats about its limitations. It represents solid benchmark engineering and evaluation work from Microsoft Research, though it is primarily empirical rather than theoretically novel.

Rating: 7/10

Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

Generated Jun 10, 2026

Comparison History (21)

Lost vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

Paper 2 addresses a fundamental cognitive capability in AI (Theory of Mind) with a novel methodological framework (recursive perspective construction) backed by theoretical analysis. While Paper 1 provides a valuable empirical benchmark for a specific practical application (Office automation), Paper 2's advancement of reasoning capabilities has broader implications for agentic AI and human-AI interaction, leading to higher potential scientific impact across multiple subfields of artificial intelligence.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Paper 2 proposes a foundational architectural framework for AI agent security, addressing critical vulnerabilities in enterprise deployment. While Paper 1 provides a useful benchmark for LLM capabilities in office tasks, Paper 2's focus on runtime governance, composite principals, and composable enforcement primitives offers a broader systemic impact, paving the way for the safe, widespread adoption of autonomous agents across diverse industries.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Paper 2 likely has higher impact: it introduces a scalable, fine-grained, machine-gradable benchmark (200 tasks, 7,118 criteria) grounded in a real standardized exam, enabling broad, repeatable evaluation of LLM agents’ long-horizon software automation. This addresses a timely, widely relevant capability gap with applicability across AI, HCI, software engineering, and agent evaluation. Paper 1 is a valuable domain-specific engineering automation framework with open code, but its impact is narrower (concrete barrier design) and depends more on niche regulatory context, limiting cross-field breadth despite strong applied relevance.

gpt-5.2 · Jun 11, 2026

Lost vs. Embodied-BenchClaw: An Autonomous Multi-Agent System for Embodied Spatial Intelligence Benchmark Construction

Paper 2 has higher impact potential due to greater novelty (autonomous, multi-agent benchmark construction with reusable skill library and QC), broader applicability across embodied AI domains (robotics, navigation, UAV, indoor/outdoor spatial reasoning), and stronger timeliness as benchmark saturation and maintenance are acute issues. Its pipeline approach could scale evaluation infrastructure and influence multiple subfields. Paper 1 is a valuable, rigorous benchmark for office automation, but is narrower in scope and closer to an incremental extension of existing LLM evaluation paradigms.

gpt-5.2 · Jun 11, 2026

Won vs. ComBench: A Benchmark for Rigorous Proof Reasoning and Constructive Realization in Olympiad-Level Combinatorics

Paper 1 addresses the practically important and timely problem of LLM-based office automation with a large-scale, well-structured benchmark (200 tasks, 7,118 criteria) that directly evaluates real-world productivity software use. This has broader impact across industry and research, as office automation affects millions of users. Paper 2, while rigorous, targets a narrower niche (Olympiad combinatorics reasoning) with only 100 problems. Paper 1's findings on the significant gap between LLM capabilities and reliable document automation have more immediate implications for the rapidly growing LLM agent deployment ecosystem.

claude-opus-4-6 · Jun 10, 2026

Lost vs. ActiveMem: Distributed Active Memory for Long-Horizon LLM Reasoning

Paper 1 likely has higher impact: it proposes a novel, broadly applicable architecture (distributed active memory decoupled from reasoning) that addresses a central scaling bottleneck in long-horizon LLM agents, with demonstrated SOTA gains and reduced overhead—suggesting real-world utility and methodological contribution. Its ideas can transfer across many agentic settings (tool use, planning, retrieval, multi-step reasoning). Paper 2 is timely and rigorous as a benchmark, but is primarily evaluative and domain-specific (Office automation), with narrower cross-field methodological innovation.

gpt-5.2 · Jun 10, 2026

Won vs. Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Paper 2 has higher likely scientific impact: it introduces a concrete, standardized, machine-gradable benchmark with clear metrics, broad utility for evaluating agentic LLMs, and directly actionable findings for automation research and industry. Its methodology (200 tasks, 7,118 criteria, multiple model baselines, sanity-check reference) supports reproducibility and rigorous comparison over time. Paper 1 is largely conceptual and definitional, with ambitious claims but unclear testable hypotheses, evaluation protocols, or implementation evidence, which limits near-term rigor and adoption despite potential long-term philosophical relevance.

gpt-5.2 · Jun 10, 2026

Lost vs. Parthenon Law: A Self-Evolving Legal-Agent Framework

Paper 1 introduces a novel, self-evolving architecture for agents in a complex, high-stakes domain (legal), demonstrating how agents can improve without weight updates. This architectural contribution and its large-scale empirical validation offer broader methodological insights for expert-domain AI. In contrast, Paper 2 is primarily a benchmarking study that highlights current LLM limitations in Office software; while useful, its impact is more observational than fundamentally innovative.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. AutoPDE: Reliable Agentic PDE Solving via Explicitly Represented Solver Strategies

AutoPDE addresses a fundamental challenge in scientific computing—automating PDE solving—with a novel architecture that explicitly separates solver strategy from code generation. This has broad applications across science and engineering, introduces a methodologically rigorous framework with reusable skills and adaptive tuning, and demonstrates significant improvement over baselines. Paper 2, while introducing a useful benchmark for office automation, addresses a narrower application domain with less scientific novelty, primarily documenting LLM limitations rather than proposing a transformative solution. AutoPDE's impact spans computational science, AI-for-science, and numerical methods research.

claude-opus-4-6 · Jun 10, 2026

Won vs. From 0-to-1 to 1-to-N: Reproducible Engineering Evidence for MetaAI Recursive Self-Design

Paper 1 introduces a concrete, well-defined benchmark (200 tasks, 7,118 criteria) for evaluating LLMs on professional Office automation tasks, addressing a practical gap in AI evaluation. It provides rigorous empirical results across 7 frontier models with clear metrics. Paper 2, while addressing an interesting topic (recursive self-design), is primarily a conceptual framework and literature mapping exercise. Its proposed protocol (MetaAI-Mini) lacks experimental results, significantly limiting its immediate scientific impact. Paper 1's benchmark and findings have broader, more immediate utility for the AI research community.

claude-opus-4-6 · Jun 10, 2026
