KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

Ankit Maloo

#406 of 3022 · Artificial Intelligence
Tournament Score: 1496±32 · Win Rate: 70% · Wins: 26 · Losses: 11 · Matches: 37
Rating: 6.5/10

Abstract

We introduce the first version of KWBench (Knowledge Work Bench), a benchmark for unprompted problem recognition in large language models: can an LLM identify a professional scenario before attempting to solve it? Existing frontier benchmarks have saturated, and most knowledge-work evaluations to date reduce to extraction or task completion against a specification. KWBench targets the step before that: recognizing the governing structure of the situation from raw inputs alone. The benchmark contains 223 tasks sourced from practitioners across acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design. Each task encodes a formal game-theoretic pattern (principal-agent conflict, signaling, mechanism design failure, strategic omission, coalitional dynamics, strategic interdependence) and carries structured ground truth recording the expert reading of the situation and the anticipated failure modes. Models receive raw data and a task prompt with no indication of problem type. Scoring is a three-tier rubric gated by a mandatory conjunctive check. Mandatory criteria encode the predicted wrong paths. We evaluate 16 models. The best model passes on 27.9% of tasks. The top two models agree on only 31.7% of their passes. Among the top 8, 44 tasks are solved by exactly one model; routing across the top 8 covers 50.7% of the benchmark, nearly double the best single model. Conditional on passing, quality scores converge (approximately 83% across models); unconditional scores do not. The same models articulate the relevant game-theoretic concept correctly when asked, then fail to apply it unprompted. We release KWBench to shift how frontier models are evaluated on knowledge work, scoring them on whether they recognize the right problem from the situation alone, not only on how well they execute once the problem has been framed for them.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: KWBench

Core Contribution

KWBench introduces a genuinely novel evaluation axis: unprompted problem recognition — whether LLMs can identify the correct framing of a professional scenario before executing on it. The paper argues convincingly that existing benchmarks test execution given a correctly specified problem, while real knowledge work requires first recognizing *what* the problem is. The benchmark contains 223 tasks spanning acquisitions, contract negotiations, clinical pharmacy, organizational politics, fraud analysis, and incentive design, each encoding a game-theoretic pattern (signaling, principal-agent, mechanism design failure, etc.) that must be recognized from raw inputs alone.

The conceptual contribution is sharp and well-articulated: the distinction between perfect-information problems (where benchmarks saturate) and imperfect-information games (where knowledge workers actually operate) provides a clean theoretical motivation. The "chess vs. poker" framing is effective and the mapping of six game-theoretic patterns to professional scenarios is intellectually coherent.

Methodological Rigor

Strengths in design: The "don't instruct, measure" principle is methodologically sound. The mandatory conjunctive gate — score zero if any core criterion fails — is well-justified by the argument that domain expertise is conjunctive (one missed liability clause compromises the entire contract review). The three-stage rubric construction (metadata specification → multi-model generation → human synthesis) provides some quality control.
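
The gating logic described above can be sketched in a few lines. This is an illustration only: the `RubricResult` structure, the tier names, and the tier weights are assumptions made for the example, not the paper's actual schema or weighting.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Judge verdicts for one model response (hypothetical structure)."""
    mandatory: list[bool]          # one pass/fail verdict per mandatory criterion
    tier_scores: dict[str, float]  # per-tier quality, each in [0, 1]


# Hypothetical tier names and weights; the paper's real values are not given here.
TIER_WEIGHTS = {"core": 0.5, "depth": 0.3, "polish": 0.2}


def gated_score(result: RubricResult) -> float:
    """Return 0.0 if any mandatory criterion fails, else the weighted tier score."""
    if not all(result.mandatory):
        return 0.0  # conjunctive gate: a single failed criterion zeroes the task
    return sum(w * result.tier_scores.get(t, 0.0) for t, w in TIER_WEIGHTS.items())


full_marks = {"core": 1.0, "depth": 1.0, "polish": 1.0}
print(gated_score(RubricResult([True, False], full_marks)))  # gate fails
print(gated_score(RubricResult([True, True], full_marks)))   # gate passes
```

Because the gate is conjunctive, a response with perfect tier scores still receives zero when one mandatory criterion fails, which is why a small systematic judging bias on mandatory criteria would move pass rates substantially.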

Significant weaknesses: The paper acknowledges but does not resolve several critical methodological gaps:

1. No human baseline. This is the most damaging omission. Without knowing how human experts perform, we cannot interpret the 27.9% pass rate. Is this benchmark impossibly hard, poorly calibrated, or genuinely measuring a meaningful gap? The authors claim practitioner validation of "realism and difficulty calibration," but structured consultations are not the same as having experts take the test.

2. Single-judge evaluation. All scoring relies on Gemini 3 Flash as judge. While the binary, verifiable nature of criteria helps, the paper provides no inter-rater reliability data, no multi-judge comparison, and no analysis of judge failure modes. Given the centrality of the mandatory gate, even a small systematic bias in judging could dramatically affect results.

3. No recognition ablation. The paper's central claim — that models *possess* game-theoretic knowledge but fail to *apply* it unprompted — is stated but not formally tested. Running the same tasks with explicit game-theoretic hints would be a straightforward and critical experiment.

4. Rubric construction concerns. The rubrics are generated by LLMs and synthesized by a single author. The "practitioner validation" is described vaguely ("structured consultations") without details on how many practitioners reviewed how many tasks, what their credentials were, or what the disagreement rate was.

5. Best-of-3 evaluation protocol introduces selection bias. Taking the best run inflates reported performance, though the authors note modest variance (1-3 percentage points).
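
To make the best-of-3 concern in item 5 concrete, the sketch below computes how much a best-of-k protocol can inflate a per-task pass probability. It assumes the k runs are statistically independent, which is an illustrative assumption and not something the paper states; the function name is hypothetical.

```python
def best_of_k_pass_rate(p_single: float, k: int = 3) -> float:
    """Probability that at least one of k independent runs passes a task."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be in [0, 1]")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


# Illustrative value, not a figure from the paper.
print(round(best_of_k_pass_rate(0.20, k=3), 3))  # 0.488
```

Under full independence, a 20% single-run pass probability would become roughly 49% under best-of-3. The 1-3 percentage point variance the authors report would imply far more correlated runs than this bound assumes, but reporting single-run or mean-of-3 figures alongside best-of-3 would remove the ambiguity.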

Potential Impact

The paper targets a real and important gap. As LLMs are increasingly deployed in advisory roles — drafting memos, reviewing contracts, triaging decisions — the failure mode of "polished analysis of the wrong problem" is genuinely dangerous. The examples are vivid and compelling: a PIP that would fail in court, an acquisition offer misread as a valuation exercise, a deal celebrated when the buying process hasn't started.

Practical implications are significant:

  • The finding that no single model dominates (Jaccard overlap of 31.7% between the pass sets of the top two models) has direct implications for system architecture, suggesting ensemble/routing approaches.
  • The "cooperative default" analysis — that RLHF training may systematically suppress adversarial reasoning — is a valuable hypothesis for the alignment community.
  • The 107 unsolved tasks provide a concrete capability target.

Limitations on impact: The benchmark's professional focus means it primarily benefits enterprise AI deployment rather than broader ML research. The task count (223) is modest, and the domain skew toward Western corporate norms (acknowledged by the authors) limits generalizability.
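
The overlap and routing figures cited in the practical implications can be computed from per-model pass sets. The sketch below uses toy data; the model names, task IDs, and function names are hypothetical, and the coverage figure assumes an oracle router that always picks a model that passes, so it is an upper bound on what a real router would achieve.

```python
from collections import Counter


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two models' sets of passed task IDs."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def routing_coverage(passes: dict[str, set[str]], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model (oracle routing)."""
    covered = set().union(*passes.values())
    return len(covered) / n_tasks


def unique_solves(passes: dict[str, set[str]]) -> set[str]:
    """Tasks passed by exactly one model."""
    counts = Counter(task for solved in passes.values() for task in solved)
    return {task for task, c in counts.items() if c == 1}


# Toy data, not results from the paper.
toy = {"model_a": {"t1", "t2", "t3"}, "model_b": {"t3", "t4"}}
print(jaccard(toy["model_a"], toy["model_b"]))  # 0.25
print(routing_coverage(toy, n_tasks=10))        # 0.4
print(sorted(unique_solves(toy)))               # ['t1', 't2', 't4']
```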

    Timeliness & Relevance

    This is highly timely. Frontier benchmarks are saturating (MMLU, HumanEval), and the field is actively searching for meaningful evaluation axes. The deployment of LLMs in knowledge work is accelerating faster than our ability to measure their fitness for it. The specific failure mode KWBench targets — confident, well-structured output that answers the wrong question — is arguably the most dangerous failure mode in professional AI deployment, precisely because it evades casual quality checks.

    Strengths

    1. Novel and well-defined evaluation axis. The recognition-execution distinction is crisp, empirically supported (decoupled scores, knowledge-application gap), and practically important.

    2. Rich task design. The detailed walkthrough examples (Appendix B, C) demonstrate genuine depth. The PIP example alone is a masterclass in what "testing for pitfalls, not correct answers" means.

    3. Surprising empirical findings. The disjoint recognition profiles (low Jaccard overlap), the coverage analysis showing every top-8 model contributes unique passes, and the convergence of conditional scores are genuinely informative results.

    4. Structured expert annotations (5,800 items across 223 tasks) are a separable, reusable contribution.

    5. Transparent about limitations. The paper clearly states what it does not do.

    Limitations

    1. Absence of human baselines undermines interpretability of all absolute numbers.

    2. Single author, single judge creates concentration risk in both construction and evaluation.

    3. Small benchmark size (223 tasks) with only 85 in the core game-theoretic category.

    4. Potential subjectivity in "correct" framing. While the paper argues criteria are objective (verifiable traps), reasonable experts might disagree on whether a given scenario truly requires adversarial framing.

    5. Reproducibility concerns. The reliance on practitioner knowledge that is "rarely articulated" makes independent validation difficult.

    6. 38 tasks adapted from existing benchmarks without clear analysis of how they compare to the 185 original tasks.
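
The sample-size concern in item 3 can be quantified with a confidence interval on the headline pass rate. The sketch below computes a Wilson score interval, treating the 27.9% figure as a simple binomial proportion over 223 independent tasks. That treatment is an assumption made here for illustration (best-of-3 selection and task correlation would both complicate it), and the function name is hypothetical.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(0.279, 223)
print(f"{low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 22% to 34%, so differences of a few points between closely ranked models would sit well inside the sampling noise of a 223-task benchmark.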

    Overall Assessment

    KWBench makes a compelling conceptual contribution by identifying and operationalizing a previously unmeasured capability axis. The empirical findings are interesting and the task design is sophisticated. However, the methodological gaps — particularly the absence of human baselines, recognition ablation, and multi-judge validation — prevent the paper from fully substantiating its claims. The paper is stronger as a provocation and framework than as a definitive measurement instrument. It opens an important research direction but needs significant validation work to become a trusted evaluation standard.

    Rating: 6.5/10
    Significance 7.5 · Rigor 5 · Novelty 8 · Clarity 8.5

    Generated Apr 20, 2026

    Comparison History (37)

    vs. LLMs can persuade only psychologically susceptible humans on societal issues, via trust in AI and emotional appeals, amid logical fallacies
    gemini-3 · 5/5/2026

    Paper 2 introduces a novel benchmark targeting a critical, underexplored gap in LLM capabilities—unprompted problem recognition. By moving beyond saturated task-completion evaluations, it sets a new target for AI reasoning and agentic systems in real-world knowledge work. Benchmarks that expose fundamental model limitations typically drive significant, field-wide progress and garner high citations, giving it a higher potential for broad scientific impact than the behavioral findings in Paper 1.

    vs. Rectification Difficulty and Optimal Sample Allocation in LLM-Augmented Surveys
    claude-opus-4.6 · 5/5/2026

    KWBench introduces a novel evaluation paradigm—unprompted problem recognition—that addresses a fundamental gap in LLM benchmarking. Its finding that models can articulate game-theoretic concepts but fail to apply them unprompted reveals a critical limitation with broad implications for AI deployment in knowledge work. The benchmark spans multiple professional domains, offers a public resource for the community, and addresses benchmark saturation, a timely concern. Paper 1, while methodologically rigorous and practically useful, represents an incremental optimization within an established framework (Prediction-Powered Inference) with narrower scope.

    vs. When AI reviews science: Can we trust the referee?
    gemini-3 · 5/5/2026

    Paper 1 introduces a fundamentally novel evaluation paradigm for LLMs (unprompted problem recognition), addressing a critical gap in current AI benchmarks. While Paper 2 tackles an important and timely issue (AI in peer review), Paper 1's benchmark has broader implications across multiple disciplines of knowledge work and cognitive AI evaluation. By shifting the focus from task execution to problem identification, KWBench has the potential to guide the next generation of frontier model development, yielding a wider and more foundational scientific impact.

    vs. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
    claude-opus-4.6 · 4/21/2026

    Agent-World addresses a fundamental bottleneck in agent training—lack of realistic environments and continuous learning mechanisms—with a scalable, self-evolving framework validated across 23 benchmarks. Its contributions to environment synthesis, multi-environment RL, and self-evolving training have broad applicability across the rapidly growing AI agent ecosystem. While KWBench introduces a valuable and novel evaluation paradigm (unprompted problem recognition), it is primarily a benchmark contribution with a narrower scope. Agent-World's methodological contributions and demonstrated scaling laws offer more transformative potential for advancing general agent intelligence.

    vs. From Fallback to Frontline: When Can LLMs be Superior Annotators of Human Perspectives?
    gpt-5.2 · 4/21/2026

    Paper 2 offers a more general, theory-driven reframing of LLMs-as-annotators as latent opinion estimators, deriving conditions/regimes where LLMs can statistically outperform humans and where they cannot—insights broadly applicable across HCI, NLP evaluation, computational social science, and survey/measurement. This has clear real-world implications for scaling subjective annotation and estimating subgroup perspectives, and is timely given widespread LLM annotation use. Paper 1 is novel as a benchmark targeting unprompted problem recognition in knowledge work, but its impact is narrower (benchmark-centric, 223 tasks) and more domain-specific, with less immediate cross-field theoretical leverage.

    vs. ASMR-Bench: Auditing for Sabotage in ML Research
    claude-opus-4.6 · 4/20/2026

    KWBench introduces a fundamentally novel evaluation paradigm—unprompted problem recognition—that addresses a significant gap in LLM benchmarking. Its findings (best model at 27.9%, low inter-model agreement, the recognition-application gap) reveal deep limitations in current frontier models with broad implications across knowledge work domains. The benchmark spans multiple professional fields and introduces rigorous game-theoretic grounding. ASMR-Bench addresses an important but narrower AI safety concern (sabotage detection in ML codebases) with a smaller benchmark (9 codebases). KWBench's broader applicability, novel conceptual contribution, and richer empirical findings suggest higher impact.

    vs. Grounding Clinical AI Competency in Human Cognition Through the Clinical World Model and Skill-Mix Framework
    gemini-3 · 4/20/2026

    Paper 1 introduces an empirical benchmark addressing a critical, unsolved capability in LLMs (unprompted problem recognition), applying broadly across multiple domains of knowledge work. Its quantitative evaluation of frontier models reveals a significant performance gap, which is highly likely to drive immediate follow-up research and model optimization. While Paper 2 offers a valuable theoretical framework for clinical AI, conceptual models typically have a slower, more domain-restricted impact compared to actionable, broadly applicable AI benchmarks.

    vs. Do Agent Rules Shape or Distort? Guardrails Beat Guidance in Coding Agents
    gemini-3 · 4/20/2026

    Paper 2 introduces a novel benchmarking paradigm (unprompted problem recognition) that addresses a critical gap in LLM evaluation across diverse knowledge-work domains. While Paper 1 offers valuable, immediate insights for coding agents, Paper 2 has a broader scientific impact by fundamentally shifting how we evaluate and develop frontier models for complex, real-world reasoning and autonomous problem framing.

    vs. Experience Compression Spectrum: Unifying Memory, Skills, and Rules in LLM Agents
    gemini-3 · 4/20/2026

    Paper 2 introduces a novel benchmark (KWBench) addressing a critical gap in LLM evaluation (unprompted problem recognition) where frontier models currently struggle. Challenging new benchmarks that reveal fundamental model limitations typically drive immediate follow-up research, model development, and high citation rates, leading to higher measurable scientific impact compared to theoretical or survey frameworks like the one proposed in Paper 1.

    vs. Agent-Aided Design for Dynamic CAD Models
    gemini-3 · 4/20/2026

    Paper 1 introduces a paradigm shift in LLM evaluation, focusing on unprompted problem recognition rather than mere execution. Because existing AI benchmarks are rapidly saturating, a rigorous benchmark evaluating situational awareness addresses a critical bottleneck in AI development. While Paper 2 offers a valuable advance in AI-aided CAD design, its impact is largely confined to mechanical engineering and manufacturing. Paper 1's findings regarding LLM failure modes in structural reasoning have broad implications across AI alignment, cognitive science, and diverse knowledge-work domains like law, medicine, and finance, giving it a much wider scientific footprint.

    vs. How Much LLM Does a Self-Revising Agent Actually Need?
    claude-opus-4.6 · 4/20/2026

    KWBench introduces a novel benchmark addressing a fundamental gap in LLM evaluation—unprompted problem recognition rather than task execution. It covers a broad range of professional domains with 223 tasks, evaluates 16 models, and reveals striking findings (best model at 27.9%, model complementarity, recognition-application gaps). This has broad impact potential across AI evaluation, knowledge work automation, and model development. Paper 2 makes a useful methodological contribution about decomposing agent competence but is narrower in scope (single game domain, 54 games) and yields modest, inconclusive empirical results.

    vs. Stein Variational Black-Box Combinatorial Optimization
    gemini-3 · 4/20/2026

    Paper 1 addresses a critical and highly timely challenge in AI: the saturation of LLM benchmarks and the need to evaluate unprompted reasoning in real-world professional tasks. Its novel approach to problem recognition and broad applicability across diverse knowledge domains gives it significantly higher potential for widespread scientific and industry impact compared to the more specialized optimization improvements proposed in Paper 2.

    vs. SPPO: Sequence-Level PPO for Long-Horizon Reasoning Tasks
    claude-opus-4.6 · 4/20/2026

    SPPO addresses a core challenge in LLM alignment—efficient reinforcement learning for reasoning—with a concrete algorithmic contribution (sequence-level PPO) that demonstrates strong empirical results while reducing computational costs. This directly impacts the rapidly growing field of reasoning LLM training, with broad applicability. KWBench introduces an interesting benchmark for unprompted problem recognition, but benchmarks typically have narrower impact unless widely adopted, and its niche focus on game-theoretic pattern recognition in knowledge work limits its breadth. SPPO's methodological contribution is more likely to influence mainstream LLM training pipelines.

    vs. Integrating Graphs, Large Language Models, and Agents: Reasoning and Retrieval
    gemini-3 · 4/20/2026

    While Paper 1 is a timely survey that will likely attract citations, Paper 2 introduces a novel benchmark that addresses a critical, under-explored gap in AI evaluation: unprompted problem recognition. By demonstrating that current frontier models fail significantly at identifying problem structures without explicit prompts (27.9% max pass rate), Paper 2 establishes a new paradigm for evaluating AI in knowledge work. This will actively drive future research in model training and reasoning architectures, offering a higher fundamental scientific impact than a literature review.

    vs. Learning to Reason with Insight for Informal Theorem Proving
    claude-opus-4.6 · 4/20/2026

    KWBench introduces a fundamentally new evaluation paradigm—unprompted problem recognition—that addresses a critical gap in how LLMs are assessed for real-world knowledge work. Its cross-domain scope (acquisitions, clinical pharmacy, fraud analysis, etc.), game-theoretic grounding, and striking empirical findings (best model at 27.9%, low inter-model agreement, recognition-application gap) open new research directions in LLM evaluation, routing/ensemble strategies, and professional AI deployment. Paper 2 makes a solid contribution to mathematical reasoning via insight-aware training, but operates within the well-established LLM math reasoning space with incremental methodology (progressive SFT). KWBench's broader applicability and novel evaluation framework give it higher potential impact.

    vs. FVD: Inference-Time Alignment of Diffusion Models via Fleming-Viot Resampling
    claude-opus-4.6 · 4/20/2026

    Paper 1 (FVD) introduces a principled methodological advance with broad applicability across diffusion model alignment tasks, showing strong empirical gains (7% ImageReward improvement, 14-20% FID improvement, 66x speedup). It addresses a fundamental problem (diversity collapse in SMC samplers) with a theoretically grounded solution from population dynamics. Paper 2 (KWBench) is a valuable benchmark contribution highlighting an important gap in LLM evaluation, but benchmarks typically have narrower long-term impact unless widely adopted. FVD's technical contribution is more likely to influence subsequent methods across generative modeling.

    vs. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
    claude-opus-4.6 · 4/20/2026

    Paper 2 addresses a fundamental efficiency bottleneck in language model inference with a novel MoE flow matching framework achieving dramatic speedups (40x over AR, 1000x over diffusion) while maintaining quality. This has immediate, broad practical impact across all LLM applications. Paper 1 introduces an interesting but niche benchmark for unprompted problem recognition with a relatively small task set (223 tasks). While Paper 1 highlights an underexplored evaluation dimension, Paper 2's methodological innovation in non-autoregressive generation has greater potential to influence the field broadly, combining architectural novelty (MoE-FM) with compelling empirical results across multiple architectures.

    vs. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
    gpt-5.2 · 4/20/2026

    Paper 2 likely has higher scientific impact: it introduces a novel modeling framework (MoE flow matching) that targets a major bottleneck—fast, high-quality language generation—showing dramatic, quantifiable speedups with competitive quality and compatibility with mainstream architectures (Transformer/Mamba). This has clear, broad real-world applicability (serving, on-device, low-latency generation) and cross-field relevance (diffusion/flow methods, MoE, NAR generation). Paper 1 is timely and valuable as an evaluation benchmark, but its impact is narrower (223-task benchmark) and more incremental relative to core model capability advances.

    vs. EVGeoQA: Benchmarking LLMs on Dynamic, Multi-Objective Geo-Spatial Exploration
    claude-opus-4.6 · 4/20/2026

    KWBench introduces a fundamentally new evaluation paradigm—unprompted problem recognition—that addresses a critical gap in how LLMs are assessed for real-world knowledge work. Its findings (best model at 27.9%, low inter-model agreement, recognition-application gap) reveal deep limitations with broad implications across AI evaluation, decision support, and professional applications. The benchmark spans multiple high-stakes domains with game-theoretic grounding. EVGeoQA, while well-constructed, addresses a narrower niche (EV charging geo-spatial QA) with more incremental contributions to an existing benchmark tradition.

    vs. MEDLEY-BENCH: Scale Buys Evaluation but Not Control in AI Metacognition
    gemini-3 · 4/20/2026

    Paper 1 introduces a highly novel paradigm for LLM evaluation—unprompted problem recognition—addressing a critical bottleneck in autonomous agent deployment. By shifting focus from prompted execution to structural recognition in real-world knowledge work, it offers immense practical value and a fresh methodological approach, likely driving significant future research in agentic AI.