The Impact of AI Usage and Informativeness on Skill Development in Logical Reasoning

Shang Wu, Hongyu Yao, Catarina Belem, Shuyuan Fu, Mark Steyvers, Padhraic Smyth

May 20, 2026

arXiv:2605.21695v1

cs.AI (primary), cs.HC

#1239 of 2292 · Artificial Intelligence

Tournament Score: 1403 ± 47

Win Rate: 74%

Rating: 5/10

Significance 5.5 · Rigor 4.5 · Novelty 4.5 · Clarity 7

Abstract

Artificial intelligence (AI) is being increasingly integrated into human problem-solving, yet its effects on individual skill development remain unclear. We examine how both AI usage and informativeness can shape learning in the context of a controlled logical reasoning task with on-demand access to AI assistance. We find that greater AI usage is associated with weaker skill development: heavy AI users underperform relative to comparable peers, whereas light AI users perform similarly to matched users who do not use AI. We also find in our study that these patterns are mediated by AI informativeness. Low-information AI neither improves immediate performance nor preserves performance after AI assistance is removed, and is linked to weaker learning overall. On the other hand, high-information AI was found to improve short-run performance without reducing post-AI outcomes on average in our experiments, but with heterogeneous effects. Our findings in general suggest that AI can, depending on context, either complement human skill development by amplifying independent reasoning or can act as a substitute that undermines such reasoning, with the implication that regulating AI access and usage will be important for promoting skill development in the presence of AI assistance.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates how AI usage intensity and informativeness affect human skill development in logical reasoning tasks, using a controlled pre-/post-assessment experimental design. The main novelty lies in the intersection of three factors: (1) measuring skill development *after* AI removal rather than during AI use, (2) experimentally manipulating AI informativeness while holding accuracy constant, and (3) examining individual heterogeneity in how users engage with AI assistance. The key finding is that heavy AI usage is associated with weaker skill development, and this relationship is moderated by AI informativeness—low-information AI uniformly harms learning while high-information AI produces heterogeneous effects that widen ability gaps.

Methodological Rigor

The experimental design has notable strengths but also significant limitations:

Strengths:

The three-phase design (pre-AI → AI exposure → post-AI) is well-suited for isolating learning effects from contemporaneous AI support. This addresses a genuine gap, as most prior work only measures performance *during* AI use.

Propensity score matching (PSM) is used to account for baseline ability differences between usage groups, which is appropriate given the observational nature of usage behavior within treatment conditions.

Holding AI accuracy at 100% eliminates trust dynamics as a confound when studying informativeness.
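The matching step mentioned above can be illustrated with a minimal sketch. It assumes a single covariate (Phase 1 accuracy), a binary heavy-use flag, and greedy 1:1 nearest-neighbor matching with a caliper. The data, the function names `fit_logistic` and `match_pairs`, and the caliper value are all hypothetical; the paper's actual matching specification is not described in this review.

```python
import math

def fit_logistic(xs, ts, lr=0.5, steps=5000):
    """Fit P(t=1 | x) = sigmoid(a + b*x) by batch gradient ascent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, t in zip(xs, ts):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += t - p
            grad_b += (t - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def match_pairs(scores, treated, caliper=0.2):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Returns (treated_index, control_index) pairs whose propensity
    scores differ by at most `caliper` (an assumed tolerance).
    """
    controls = [i for i, t in enumerate(treated) if t == 0]
    pairs = []
    for i, t in enumerate(treated):
        if t != 1 or not controls:
            continue
        j = min(controls, key=lambda c: abs(scores[c] - scores[i]))
        if abs(scores[j] - scores[i]) <= caliper:
            pairs.append((i, j))
            controls.remove(j)  # each control is used at most once
    return pairs

# Hypothetical data, NOT taken from the paper.
phase1 = [0.30, 0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70, 0.75]
heavy = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]   # 1 = heavy AI user
post = [0.35, 0.40, 0.55, 0.45, 0.60, 0.55, 0.70, 0.70, 0.65, 0.80]

a, b = fit_logistic(phase1, heavy)
scores = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in phase1]
pairs = match_pairs(scores, heavy)

# Mean post-AI gap between heavy users and their matched controls.
gaps = [post[i] - post[j] for i, j in pairs]
mean_gap = sum(gaps) / len(gaps) if gaps else float("nan")
print(f"{len(pairs)} matched pairs, mean post-AI gap = {mean_gap:.3f}")
```

Matching of this kind balances only the covariates fed into the score, which is why unobserved differences between usage groups remain a concern.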

Weaknesses:

The sample size is modest (N=132 after exclusions, with 42-47 per condition), limiting statistical power for the subgroup analyses that form the paper's most interesting findings. Several reported effects are marginal (p=0.08, p<0.10).

AI usage intensity is an endogenous behavioral choice, not a randomly assigned treatment. Even with PSM on Phase 1 performance, unobserved confounders (motivation, cognitive style, fatigue tolerance) could drive both heavy AI use and weaker learning. The paper sometimes conflates correlation with causation in its language despite this limitation.

The task is highly artificial—ordering six objects based on logical constraints—which limits ecological validity. The authors acknowledge this but don't adequately discuss how task specificity constrains generalizability.

The "simulated AI" with perfect accuracy is unrealistic, and participants were not told how accurate it was, creating a somewhat artificial trust environment.

Phase 2 lasts only 20 minutes, making claims about "skill development" somewhat stretched. This is more accurately described as short-term learning transfer.

The exclusion of 28/160 participants (17.5%) is substantial, and the paper provides limited detail on whether exclusion patterns differed across conditions.

Potential Impact

The paper addresses a timely concern about AI's effects on human cognition and learning. The practical implications are relevant to:

Education policy: Informing decisions about AI tool access in educational settings

AI system design: Suggesting that informativeness levels matter for downstream learning

Workplace training: Highlighting risks of AI dependency for skill acquisition

However, the impact is tempered by the narrow experimental context. The logic puzzle paradigm, while controlled, is far removed from the educational and professional domains where these findings would matter most (e.g., medical diagnosis, programming, essay writing). The findings largely confirm intuitions rather than reveal surprising mechanisms—that heavy reliance on external tools reduces independent learning is well-established in educational psychology literature on scaffolding and desirable difficulties.

Timeliness & Relevance

The paper is highly timely. As LLMs and AI assistants become ubiquitous, understanding their impact on human skill development is urgent. The paper arrives amid growing concern about "cognitive offloading" and "deskilling" in AI-assisted environments. It speaks directly to debates in education about ChatGPT policies and in medicine about diagnostic deskilling with AI support. The HHAI 2026 venue is appropriate.

Strengths & Limitations

Key Strengths:

1. The pre-/post-AI assessment design is the paper's strongest methodological contribution, directly measuring what matters—performance after AI is removed.

2. The distinction between AI informativeness levels is a clean experimental manipulation that yields interpretable results.

3. The heterogeneity analysis revealing that high-information AI widens ability gaps is the most novel and policy-relevant finding.

4. The "solo share" metric captures a meaningful behavioral dimension of cognitive engagement.

Notable Weaknesses:

1. Causal claims exceed design: Usage intensity is not randomly assigned, yet much of the discussion implies causal relationships. The PSM approach mitigates but does not resolve this.

2. Limited statistical power: Key subgroup findings rely on small cells (splitting 43 high-info participants by ability level yields ~21 per group), and several results hover around conventional significance thresholds.

3. Task artificiality: Logic puzzles with deterministic solutions and perfect AI accuracy are far from real-world AI-assisted learning scenarios.

4. Short time horizon: 20 minutes of AI exposure is insufficient to study "skill development" in any meaningful sense—this is more accurately characterized as short-term performance transfer.

5. Missing controls and analyses: No analysis of learning curves within phases, no examination of which specific problems benefited from AI, and limited analysis of what strategies participants actually developed.

6. Self-selection bias: The finding that heavy AI users perform worse is confounded by the possibility that individuals who rely heavily on AI are systematically different in ways not captured by Phase 1 performance alone (e.g., motivation, self-regulation).

7. Inflated confidence finding: The observation about miscalibrated self-assessment among lower-ability participants is suggestive but based on single survey items without validated scales.
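The power concern in weakness 2 can be made concrete with a rough calculation. The sketch below uses a normal approximation to the power of a two-sided, two-sample comparison with equal group sizes. The standardized effect size d = 0.5 is an assumed "medium" effect chosen for illustration, not a value reported in the paper. The group size 21 comes from the subgroup split above, and 45 approximates the 42-47 participants per condition.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power for a two-sided two-sample test.

    d is the standardized mean difference (an assumed effect size),
    n_per_group is the number of participants in each group.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    # Probability of rejecting in either tail under the alternative.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (21, 45):
    print(f"n per group = {n}: approx. power = {approx_power(0.5, n):.2f}")
```

Under these assumptions, both group sizes fall short of the conventional 80% power target, and the subgroup split falls well short, which is consistent with several results hovering near significance thresholds.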

Overall Assessment

This paper makes a relevant and timely contribution to an important question, with a reasonably well-designed experiment. However, its impact is limited by the artificial task domain, modest sample sizes, short time horizons, and causal inference challenges. The findings are directionally interesting and policy-relevant but not definitive. The paper is best viewed as a well-motivated pilot study that points toward important dynamics requiring larger-scale, longer-term, and more ecologically valid investigation. It is appropriate for a workshop or short conference paper at HHAI but would need substantially more evidence to influence policy or practice.

Rating: 5/10

Significance 5.5 · Rigor 4.5 · Novelty 4.5 · Clarity 7

Generated May 22, 2026

Comparison History (19)

vs. The Log is the Agent: Event-Sourced Reactive Graphs for Auditable, Forkable Agentic Systems

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a timely, broadly relevant question about AI's impact on human skill development with empirical evidence from controlled experiments. Its findings on how AI usage intensity and informativeness affect learning have immediate implications for education, workforce training, and AI policy—topics of enormous current societal interest. The nuanced finding that AI can complement or substitute for human reasoning depending on context is novel and actionable. Paper 2 presents an interesting software architecture (ActiveGraph) for agentic systems, but it is more niche, lacks empirical validation of its claimed benefits, and primarily contributes to AI engineering rather than generating broadly impactful scientific insights.

vs. A Causal Argumentation Method for Explainability of Machine Learning Models

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical technical bottleneck in machine learning (explainability) by introducing a novel integration of causal discovery and argumentation frameworks. This methodological advancement has broad, scalable applicability across numerous high-stakes domains requiring interpretable AI, offering foundational algorithmic tools for future AI development that typically yield higher cross-disciplinary citations than the behavioral insights presented in Paper 2.

vs. HarnessAPI: A Skill-First Framework for Unified Streaming APIs and MCP Tools

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a fundamental question about how AI usage affects human skill development and learning, with broad implications for education, workforce development, and AI policy. Its controlled experimental design examining causal mechanisms (AI informativeness as mediator) provides rigorous evidence on a timely topic relevant across multiple fields. Paper 1, while practically useful, is an engineering contribution (a Python framework reducing boilerplate) with narrow impact limited to a specific developer community and lacking scientific novelty beyond software engineering convenience.

vs. Teaching AI Through Benchmark Construction: QuestBench as a Course-Based Practice for Accountable Knowledge Work

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental question about AI's impact on human skill development with rigorous experimental methodology, producing generalizable insights about AI-as-complement vs AI-as-substitute that have broad implications across education, workforce development, and AI policy. Its findings on AI usage intensity and informativeness mediating learning outcomes are novel and timely, relevant to nearly every domain where AI assistance is deployed. Paper 2, while creative in its pedagogical approach, is more niche—focused on a specific classroom practice and benchmark artifact with narrower applicability and less generalizable scientific contributions.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

gpt-5.2 · 5/22/2026

Paper 1 is more methodologically innovative and timely for autonomous systems: it proposes a concrete, training-free latent-communication framework (ILD, CHSA, SSKD) addressing clear bottlenecks (latency, information loss, identity confusion) and validates in closed-loop CARLA, implying near-term deployment relevance for connected AVs and multi-agent robotics. Its ideas may generalize to other multi-agent settings (robot swarms, decentralized inference), broadening impact. Paper 2 is important and applicable to education/policy, but likely less novel methodologically and more context-dependent; rigor is hard to judge from abstract alone.

vs. Beyond the Org Chart: AI and the Transformation of Invisible Work

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to a more novel, causally oriented contribution: controlled experiments on AI usage/informativeness and downstream skill development, with mediating mechanisms and heterogeneous effects. This is timely and broadly relevant to education, human-AI interaction, labor economics, and policy (e.g., regulating AI access). Its methodological rigor appears stronger than Paper 2’s small-N, single-firm qualitative interviews, which are valuable for insight and hypothesis generation but have limited generalizability and weaker causal claims, thus narrower scientific reach.

vs. ExComm: Exploration-Stage Communication for Error-Resilient Agentic Test-Time Scaling

gemini-3.1 · 5/22/2026

While Paper 1 presents a strong technical advancement in AI agent architecture, Paper 2 addresses a critical and highly timely societal issue: the impact of AI on human skill development. Its findings have far-reaching implications across multiple disciplines, including education, cognitive psychology, human-computer interaction, and AI policy, giving it a broader potential scientific and real-world impact compared to the specialized algorithmic improvements in Paper 1.

vs. EXG: Self-Evolving Agents with Experience Graphs

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to a more novel technical contribution (structured experience graphs for self-evolving LLM agents), clear methodology with benchmarked performance/efficiency gains, and broad applicability across agentic systems, continual learning, memory/knowledge representation, and software automation. Its timeliness is high given rapid adoption of deployable agents and the need for scalable improvement mechanisms. Paper 2 is important and relevant for human-AI learning and policy, but its impact may be narrower and more context-dependent (specific task/experimental setting) and less likely to generalize into widely reusable methods or systems.

vs. Scaling Observation-aware Planning in Uncertain Domains

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly timely and broadly relevant issue—how AI affects human skill development—with significant real-world implications for education, policy, and cognitive science. While Paper 2 offers impressive algorithmic improvements (orders of magnitude) in POMDP planning, its impact is largely confined to the specific subfield of robotics and decision-making. Paper 1's cross-disciplinary appeal and societal relevance give it higher potential scientific impact.

vs. The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to broader cross-field relevance and timeliness: it addresses how AI assistance affects human skill acquisition, a central question for education, workplace training, human–AI interaction, and AI governance. The experimental, controlled-task design and mediation via “informativeness” suggests stronger causal/methodological rigor than a primarily observational/computational humanities comparison. Its findings have direct real-world implications for regulating AI access and designing assistive systems. Paper 2 is innovative and valuable within digital humanities, but its impact is likely narrower and more domain-specific.

vs. The Shape of Testimony: A Scalable Framework for Oral History Archive Comparison

gemini-3.1 · 5/22/2026

Paper 2 addresses an extremely timely and broad societal issue: the impact of AI on human learning and skill development. Its findings have wide-ranging implications across education, psychology, human-computer interaction, and AI policy. While Paper 1 provides an innovative computational framework for digital humanities, its primary impact is largely confined to history and archival studies, giving Paper 2 a significantly higher potential for cross-disciplinary and real-world scientific impact.

vs. Parametric Modular Answer Set Programs Made Declarative

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly timely and widely relevant issue: the impact of AI on human learning and skill development. Its findings have broad implications across education, cognitive science, human-computer interaction, and AI policy. In contrast, Paper 2 focuses on a very niche theoretical advancement in answer set programming. While methodologically rigorous, Paper 2's impact is largely confined to a specific subfield of logic programming, whereas Paper 1 has significant real-world applicability and interdisciplinary appeal.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact: it introduces a novel, timely benchmark for end-to-end LLM agent spreadsheet generation in high-stakes finance, with a multidimensional evaluation taxonomy that can become a standard tool for model assessment and drive measurable progress across agentic AI, HCI, and enterprise automation. Its applications are concrete and immediate (auditing, reliability, workflow automation), and benchmarks typically catalyze broad follow-on research. Paper 2 is important and relevant, but its contribution (AI assistance effects on learning) is less methodologically distinctive from prior human-AI/education studies and may have narrower generalizability beyond the specific task setting.

vs. Conflict-Aware Additive Guidance for Flow Models under Compositional Rewards

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a timely and broadly impactful question about how AI usage affects human skill development, with implications spanning education, workforce training, and AI policy. Its findings—that heavy AI use can substitute for rather than complement learning—have immediate real-world relevance as AI tools become ubiquitous. Paper 2 makes a solid technical contribution to guided sampling in flow/diffusion models, but its impact is more narrowly scoped to the generative modeling community. Paper 1's breadth of societal impact and timeliness give it higher potential scientific impact.

vs. Knowledge Graph Re-engineering Along the Ontological Continuum (extended version)

claude-opus-4.6 · 5/22/2026

Paper 2 addresses the timely and broadly relevant question of how AI usage affects human skill development, with empirical findings from controlled experiments. Its implications span education, cognitive science, AI policy, and workforce development, giving it wide interdisciplinary appeal. The finding that AI can either complement or substitute for human reasoning, depending on informativeness and usage intensity, has immediate practical applications for AI tool design and educational policy. Paper 1, while intellectually interesting, is more niche—proposing a theoretical framework for knowledge graph re-engineering with limited empirical validation beyond a case study, and targeting a narrower community.

vs. AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it introduces a reusable, conversation-grounded EI benchmark with participant-provided turn-by-turn annotations, enabling standardized evaluation and model comparison across many systems—high novelty, timeliness, and broad applicability to NLP, HCI, safety, and alignment. The benchmark can directly influence model development and deployment practices. Paper 1 addresses an important question (AI assistance and skill development) with real-world relevance, but appears narrower in scope (a controlled logical reasoning task) and its impact may depend on generalizability beyond the specific experimental setting.

vs. ArborKV: Structure-Aware KV Cache Management for Scaling Tree-based LLM Reasoning

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamentally important and timely question about how AI usage affects human skill development, with broad implications across education, workforce policy, and AI governance. Its findings that AI can either complement or substitute for human reasoning, depending on informativeness and usage patterns, are highly relevant to ongoing societal debates about AI integration. Paper 2, while technically solid and useful for LLM inference optimization, addresses a narrower systems-level problem (KV cache management) that is more incremental and has a smaller audience. Paper 1's cross-disciplinary relevance and policy implications give it greater potential impact.

vs. KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

gpt-5.2 · 5/22/2026

Paper 1 has higher likely scientific impact due to broader cross-domain relevance (education, human-AI interaction, cognition, policy), high timeliness given rapid AI adoption, and clearer causal/experimental framing around how AI assistance affects skill development. Its findings can inform AI tool design and regulation across many settings. Paper 2 is innovative and applied, but is more domain-specific (circular manufacturing IT/CPPS architectures) and design-science evaluations often have narrower citation and adoption outside industrial engineering, despite solid real-world applicability.

vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental and timely question about how AI usage affects human skill development, with broad implications across education, workforce training, and AI policy. Its findings—that AI can either complement or substitute for human learning depending on informativeness—have wide applicability and relevance to ongoing societal debates about AI integration. Paper 2, while technically solid, is a more incremental engineering contribution focused on a specific application (spreadsheet automation via RL fine-tuning), with narrower impact scope and less conceptual novelty. Paper 1's insights are more likely to influence multiple fields and policy discussions.
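As a consistency check, the 19 verdicts above can be tallied against the 74% win rate shown in the header. The sketch below encodes each comparison as a win or loss for this paper, based on a reading of the judges' text rather than on data exported from the site. Opponent titles are abbreviated, and the `comparisons` list and its field layout are illustrative.

```python
from collections import defaultdict

# (opponent, judge, this_paper_won), in the order listed above.
comparisons = [
    ("The Log is the Agent", "claude-opus-4.6", True),
    ("A Causal Argumentation Method", "gemini-3.1", False),
    ("HarnessAPI", "claude-opus-4.6", True),
    ("Teaching AI Through Benchmark Construction", "claude-opus-4.6", True),
    ("LACO", "gpt-5.2", False),
    ("Beyond the Org Chart", "gpt-5.2", True),
    ("ExComm", "gemini-3.1", True),
    ("EXG", "gpt-5.2", False),
    ("Scaling Observation-aware Planning", "gemini-3.1", True),
    ("The Shape of Testimony", "gpt-5.2", True),
    ("The Shape of Testimony", "gemini-3.1", True),
    ("Parametric Modular Answer Set Programs", "gemini-3.1", True),
    ("WorkstreamBench", "gpt-5.2", False),
    ("Conflict-Aware Additive Guidance", "claude-opus-4.6", True),
    ("Knowledge Graph Re-engineering", "claude-opus-4.6", True),
    ("AttuneBench", "gpt-5.2", False),
    ("ArborKV", "claude-opus-4.6", True),
    ("KAPPS", "gpt-5.2", True),
    ("Spreadsheet-RL", "claude-opus-4.6", True),
]

wins = sum(won for _, _, won in comparisons)
losses = len(comparisons) - wins
win_rate = wins / len(comparisons)

# Per-judge record as [wins, matches].
by_judge = defaultdict(lambda: [0, 0])
for _, judge, won in comparisons:
    by_judge[judge][0] += int(won)
    by_judge[judge][1] += 1

print(f"{wins} wins, {losses} losses, win rate {win_rate:.0%}")
for judge, (w, n) in sorted(by_judge.items()):
    print(f"  {judge}: {w}/{n}")
```

On this reading the tally is 14 wins and 5 losses out of 19 matches, which rounds to the 74% shown in the header. The split by judge is uneven, so the aggregate score depends partly on which model happened to judge each pairing.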