Generative AI and the Productivity Divide: Human-AI Complementarities in Education

Lihi Idan, Bharat Anand

May 18, 2026

arXiv:2605.18143v1

cs.AI (primary)

#569 of 2292 · Artificial Intelligence

Tournament Score: 1463 ± 45 (scale 1050 to 1800)

Win Rate: 68%

Rating: 4.5 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 5.5

Abstract

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants (analogs of early-career knowledge workers) were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by AI Interaction Competence (AIC), the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates the heterogeneous productivity effects of generative AI (GenAI) access in a knowledge-work learning context. The central claim is that while LLM access raises average task performance, the gains are unevenly distributed—not along traditional academic dimensions (GPA, prior knowledge) but along a newly proposed construct called AI Interaction Competence (AIC), defined as the ability to effectively prompt, filter, and verify LLM outputs. The paper additionally tests whether lightweight managerial interventions (scaffolding via conceptual maps, additional study time, peer collaboration) can reduce outcome variance.

The introduction of AIC as a moderating construct is the paper's most distinctive conceptual contribution. The finding that traditional academic markers (GPA, prior domain knowledge) do not moderate treatment effects while AIC does is provocative and, if robust, has significant implications for workforce development and organizational design around AI tools.

Methodological Rigor

The study has notable methodological strengths but also significant limitations that temper confidence in its conclusions.

Strengths:

Randomized controlled design with pre- and post-intervention assessments

Stratification into novice/advanced learners based on objective baseline scores

Multiple outcome measures (performance, attrition, preferences)

Attention to validity checks (randomization balance, self-assessment calibration)

Weaknesses:

1. Sample size and power concerns. With 179 participants total and 29 dropouts, the effective sample is ~150, distributed across at least 6 treatment arms. Several sub-conditions (e.g., scaffolding, peer, time) likely contain very small cell sizes (perhaps 15-25 per arm). The paper reports several results at p < 0.10 rather than p < 0.05, suggesting underpowered tests. The three-way interaction analysis (Treatment × Pre-intervention score × AIC) with this sample size is particularly concerning for reliability.

2. AIC measurement is poorly specified. The paper never clearly operationalizes how AIC was measured at baseline. It is described as "inferred from behavioral performance" rather than directly assessed, yet it is used as a baseline moderator. If AIC is derived from post-treatment behavior or correlated with treatment outcomes by construction, this introduces serious endogeneity. The paper acknowledges the weak correlation between self-assessed AIC and performance (ρ = 0.46) but does not resolve the measurement question.

3. Ecological validity. Using students studying LLMs via LLMs is a recursive design the authors acknowledge, but it introduces confounds: familiarity with the tool and familiarity with the content domain are entangled. The generalizability to "knowledge work" more broadly is asserted but not demonstrated.

4. Selective reporting and effect magnitude. The paper emphasizes a "17% productivity lift" and "47% increase in variance" in the Discussion, but these specific numbers do not appear in the Results section with accompanying standard errors or confidence intervals. No regression tables are provided: the paper reports p-values but rarely coefficients, standard errors, or effect sizes in a systematic way.

5. Domain choice. Studying LLMs likely inflates engagement and motivation in the LLM condition relative to what would occur in a typical workplace learning context, potentially biasing the average treatment effect upward.

6. Attrition. Differential attrition (20 vs. 9 dropouts) between conditions is acknowledged but not formally addressed through bounds analysis or inverse probability weighting. This could bias treatment effect estimates, particularly if dropouts from the baseline condition were lower-performing.
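The power concern in item 1 can be made concrete with a rough calculation. The sketch below is illustrative only and is not taken from the paper: it assumes the 150 completers are split evenly across exactly 6 arms (the review only says "at least 6"), and it uses a normal approximation for a two-sided, two-arm comparison of means at conventional thresholds (alpha = 0.05, power = 0.80). The function name `min_detectable_effect` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_effect(n_per_arm: int, alpha: float = 0.05,
                          power: float = 0.80) -> float:
    """Smallest standardized mean difference (Cohen's d) detectable when
    comparing two equally sized arms, two-sided test, normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the target power
    return (z_alpha + z_power) * sqrt(2 / n_per_arm)


# Figures quoted in the review; the even split across arms is an ASSUMPTION.
enrolled, dropouts, arms = 179, 29, 6
completers = enrolled - dropouts   # effective sample after attrition
n_per_arm = completers // arms     # participants per arm under an even split

mde = min_detectable_effect(n_per_arm)
print(f"{completers} completers, ~{n_per_arm} per arm, "
      f"minimum detectable d ~ {mde:.2f}")
```

Under these assumptions a pairwise comparison between two arms would only reliably detect a standardized difference of roughly 0.8, which is conventionally considered a large effect. Interaction terms such as Treatment × Pre-intervention score × AIC would need considerably more data than a simple two-arm contrast.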

Potential Impact

The paper addresses a genuinely important question: who benefits from GenAI and why? The conceptual framing—that AI adoption is a "capability design" problem rather than a procurement problem—resonates with practical organizational challenges. The finding that scaffolding reduces variance without lowering means is actionable and, if replicated, could influence how firms structure AI-assisted workflows.

However, the impact is limited by the preliminary nature of the evidence. The construct of AIC, while intuitively appealing, lacks psychometric validation, a clear measurement protocol, and discriminant validity analysis against related constructs (e.g., digital literacy, metacognition, critical thinking). Without this, AIC risks being a label for "people who perform well with AI perform well with AI."

The managerial recommendations (micro-trainings, SOPs, prompt templates) are sensible but not strongly grounded in the experimental evidence. The scaffolding intervention showed only weakly significant effects, and the micro-training recommendation appears to be extrapolated rather than tested.

Timeliness & Relevance

The paper is highly timely. The question of how GenAI affects productivity inequality is one of the most pressing in both management research and public policy. There is genuine demand for causal evidence on this topic, and the number of rigorous experiments remains limited. The paper positions itself well relative to recent high-profile studies (Brynjolfsson et al., 2023; Dell'Acqua et al., 2025).

Strengths & Limitations Summary

Key Strengths:

Addresses a high-salience, policy-relevant question with experimental evidence

Provocative finding that traditional academic markers do not moderate AI gains while AIC does

Tests actionable interventions (scaffolding) that organizations could implement

Clean conceptual framing linking complementarity theory to AI adoption

Key Limitations:

Critically underpowered for the number of treatment arms and interaction analyses conducted

AIC construct is poorly operationalized and potentially circular

No regression tables, effect sizes, or confidence intervals presented systematically

Recursive design (studying LLMs with LLMs) limits generalizability

Differential attrition unaddressed

Reference list is minimal (13 references), missing key recent work (e.g., Noy & Zhang, 2023; Peng et al., 2023)

Working paper status with limited peer review

Overall Assessment

This paper tackles an important and timely question with an appropriate experimental design, but the execution falls short of what would be needed for confident causal claims. The underpowered analyses, absent statistical tables, unclear AIC operationalization, and differential attrition are substantial concerns. The conceptual contribution (AIC as a new axis of inequality) is compelling as a hypothesis but requires much stronger empirical grounding to be influential. As a working paper, it offers a useful framework and suggestive evidence, but the findings should be treated as preliminary.

Rating: 4.5 / 10

Significance 6 · Rigor 3.5 · Novelty 5.5 · Clarity 5.5

Generated May 19, 2026

Comparison History (19)

vs. Implicit Safety Alignment from Crowd Preferences

gemini-3.1 · 5/22/2026

Paper 2 addresses a highly timely and universally relevant problem across multiple disciplines (economics, management, HCI, education) by empirically measuring GenAI's productivity impacts. Its introduction of 'AI Interaction Competence' and actionable insights for reducing inequality offer broader real-world applications and societal relevance compared to Paper 1, which, while technically sound, focuses on a narrower methodological advancement within reinforcement learning and AI safety.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 is likely higher impact: it introduces a broadly usable diagnostic benchmark for a timely, high-stakes problem (privacy/intent-following in LLM agents) with clear real-world applicability for deployed systems and regulation. The methodology (adversarial two-model setup, policy dimensions, deterministic scoring, large multi-domain dataset) is readily extensible and can become a standard evaluation tool across academia and industry, influencing model training, red-teaming, and procurement. Paper 1 is rigorous and valuable, but its scope is narrower (education/early-career analogs) and its core contribution (heterogeneous gains + interaction competence) is less likely to become a cross-field infrastructure artifact.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how GenAI affects productivity inequality—a topic with enormous breadth of impact across economics, education, management, and policy. Its RCT methodology is rigorous, and the novel construct of AI Interaction Competence (AIC) as a predictor of differential gains is highly citable and actionable. The finding that scaffolding reduces variance has immediate practical implications for firms and educators. Paper 1, while technically sound, addresses a narrower optimization problem within LLM agent engineering with more limited cross-disciplinary relevance.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental question about how GenAI affects productivity inequality—a topic with broad implications across economics, education, management, and policy. Its RCT methodology, novel construct of AI Interaction Competence (AIC), and actionable finding that scaffolding reduces variance make it highly relevant and timely. Paper 2 is a solid engineering contribution (benchmarking framework for computer-use agents) but has narrower impact, primarily within the AI agents community. Paper 1's findings about human-AI complementarities will likely influence organizational adoption strategies, workforce training, and educational policy at scale.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

gemini-3.1 · 5/19/2026

Paper 1 offers broader multi-disciplinary impact by addressing the critical socio-economic implications of GenAI adoption. Through a rigorous RCT, it introduces 'AI Interaction Competence' (AIC), providing actionable insights for education, economics, and management. While Paper 2 makes valuable technical contributions to LLM memory architectures, Paper 1's findings on workplace productivity inequality and actionable mitigation strategies possess wider real-world applicability and timeliness across diverse fields.

vs. Stateful Reasoning via Insight Replay

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamental limitation of Chain-of-Thought reasoning in LLMs—attention degradation over long reasoning traces—and proposes InsightReplay, a novel, generalizable method with rigorous evaluation across 24 settings (multiple model scales, families, and benchmarks). It offers a mechanistic insight into why longer CoT can hurt performance and provides a practical solution applicable broadly to LLM reasoning. Paper 1, while methodologically sound (RCT design) and policy-relevant, addresses a narrower educational/organizational question with findings (AI skill matters, scaffolding helps) that are somewhat expected. Paper 2's technical contribution has broader impact potential across the rapidly growing LLM reasoning research community.

vs. TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to its methodological and technical contributions: it formalizes “full-scene household reasoning,” proposes a training-free, model-agnostic framework (TaskGround), and introduces a new human-validated benchmark (FullHome) that can standardize evaluation and accelerate follow-on research. It is timely for embodied/household agents and practical constraints (privacy, local compute), and its approach can generalize to other grounding-and-planning settings. Paper 2 is rigorous and societally relevant, but its impact may be narrower (education/workforce context) and less likely to seed reusable technical artifacts than Paper 1’s framework + benchmark.

vs. SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical, broad socioeconomic issue—the productivity divide and human-AI complementarity—with implications extending across economics, management, education, and HCI. Its introduction of 'AI Interaction Competence' offers a foundational concept for understanding heterogeneous AI adoption. In contrast, Paper 2 is a highly technical, domain-specific benchmark for LLM agents, which, while valuable to the AI community, has a narrower scope and potentially shorter lifespan of relevance compared to the lasting theoretical and practical implications of Paper 1.

vs. Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

claude-opus-4.6 · 5/19/2026

Paper 1 addresses the widely relevant topic of GenAI's heterogeneous productivity effects through a rigorous RCT, introducing the novel concept of AI Interaction Competence (AIC) as a key moderator. Its findings have broad implications across education, management, and policy, given the massive adoption of LLMs across industries. The actionable insight that scaffolding interventions can reduce inequality in AI-mediated performance has immediate real-world applications. Paper 2, while technically strong and achieving SOTA in generalized planning, addresses a narrower AI planning community. Paper 1's timeliness and cross-disciplinary relevance give it greater potential impact.

vs. A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

gemini-3.1 · 5/19/2026

Paper 1 explores the socioeconomic and productivity impacts of Generative AI, a highly timely and universally relevant topic. By introducing the concept of 'AI Interaction Competence' (AIC) and demonstrating how AI can create a new productivity divide, its findings have broad, cross-disciplinary implications for economics, education, HCI, and management. Paper 2, while methodologically rigorous and useful for healthcare, presents a more incremental and domain-specific advancement in multi-modal sleep stage classification, yielding a narrower overall scientific and societal impact compared to Paper 1.

vs. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

gpt-5.2 · 5/19/2026

Paper 1 is likely to have higher scientific impact: it proposes a novel, generalizable RLVR exploration framework (strategy-conditioned rollouts + unified objective with distillation) that directly addresses a core bottleneck in training reasoning-capable LLMs, with strong benchmark gains and clear computational efficiency implications. Its methods can influence multiple adjacent areas (LLM alignment/training, RL exploration, verifiable reasoning, scalable optimization). Paper 2 is timely and practically relevant, but its impact may be narrower (education/management) and more context-dependent despite solid RCT rigor.

vs. Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses a broadly applicable and highly timely issue—the productivity impacts of Generative AI across knowledge workers. Its use of a randomized controlled experiment provides strong methodological rigor. By identifying 'AI Interaction Competence' as a key driver of inequality and offering a scaffolding solution, its findings will likely influence multiple massive fields including economics, education, HCI, and management, leading to higher overall scientific impact and citation volume than Paper 1's niche ecological application.

vs. New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a timely, broadly relevant question about GenAI's heterogeneous productivity effects with a rigorous RCT design. It introduces the novel concept of AI Interaction Competence (AIC) as a key moderator, which has immediate implications across education, management, and policy. Its breadth of impact spans multiple fields (economics, education, organizational behavior, AI policy), and its practical recommendations are directly actionable. Paper 1, while technically sound, addresses a narrower optimization problem with incremental improvements over existing methods, limiting its audience and real-world impact.

vs. Do Self-Evolving Agents Forget? Capability Degradation and Preservation in Lifelong LLM Agent Adaptation

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a fundamental and underexplored problem in the rapidly growing field of self-evolving LLM agents—capability erosion during continual adaptation. It identifies a systematic phenomenon across four evolution dimensions and proposes a general mitigation framework (CPE), offering broad technical impact across AI/ML research. Paper 2 provides valuable empirical insights on AI productivity heterogeneity via an RCT, but its findings (skill-dependent gains, scaffolding helps) are more incremental and domain-specific. Paper 1's novelty, breadth of technical contribution, and timeliness in the fast-moving agents space give it higher potential impact.

vs. LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

gpt-5.2 · 5/19/2026

Paper 1 likely has higher impact due to broad, timely relevance: it provides randomized controlled evidence on how GenAI affects productivity and inequality via a measurable skill (AI Interaction Competence), with an actionable mitigation (scaffolding). This generalizes across education, workforce training, management, and AI policy, and directly informs deployment practices. Paper 2 is technically novel for degradation modeling and RUL with knowledge-conditioned RAG, but its application scope is narrower (reliability/prognostics) and impact may remain within that subfield unless widely adopted.

vs. ADR: An Agentic Detection System for Enterprise Agentic AI Security

claude-opus-4.6 · 5/19/2026

Paper 2 presents a novel, production-deployed security framework (ADR) addressing a critical emerging problem—securing enterprise AI agents—with strong empirical validation at scale (Uber, 10+ months, 7,200 hosts). It introduces a new benchmark (ADR-Bench), demonstrates significant performance improvements over baselines, and addresses a timely, high-stakes problem with broad cross-field implications (cybersecurity, AI safety, enterprise systems). Paper 1, while methodologically sound, addresses a narrower question about AI-mediated learning with more incremental findings about user heterogeneity and scaffolding interventions.

vs. Imperfect World Models are Exploitable

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and universally relevant topic: the productivity impact of GenAI on knowledge workers. By introducing and empirically validating the concept of 'AI Interaction Competence', it provides a highly citable framework applicable across economics, management, education, and HCI. While Paper 2 offers rigorous theoretical contributions to AI safety, Paper 1's immediate real-world applicability, randomized controlled experimental design, and broad cross-disciplinary appeal give it a higher potential for widespread scientific and societal impact.

vs. MMSkills: Towards Multimodal Skills for General Visual Agents

gemini-3.1 · 5/19/2026

Paper 2 addresses the immediate, widespread socioeconomic impact of Generative AI, offering highly relevant empirical findings on productivity and inequality through a rigorous RCT. Its introduction of AI Interaction Competence provides broad, cross-disciplinary impact across economics, management, education, and HCI. In contrast, Paper 1, while methodologically innovative, is more narrowly focused on visual agent architectures within specific AI research subfields.

vs. ScreenSearch: Uncertainty-Aware OS Exploration

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to strong methodological rigor (randomized controlled experiment), high timeliness, and broad cross-field relevance (education, labor economics, HCI, AI policy/management). Its core construct (AI Interaction Competence) and evidence on heterogeneous treatment effects plus a mitigating intervention are directly actionable and generalizable to real-world GenAI adoption, affecting many domains. Paper 1 is novel and technically valuable for GUI agent exploration, but its impact is narrower to OS/agent systems research and depends more on downstream uptake and benchmarking standards.