How AI Agents Reshape Knowledge Work: Autonomy, Efficiency, and Scope

Jeremy Yang, Kate Zyskowski, Noah Yonack, Jerry Ma

cs.AI · econ.GN
#723 of 3489 · Artificial Intelligence
Tournament Score: 1465 ± 45 (scale 1050 to 1800)
Win Rate: 75% (18 wins, 6 losses, 24 matches)
Rating: 7.5 / 10
Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 8.5

Abstract

Frontier AI systems are bridging the gap between intelligence and utility by shifting from conversational assistants to autonomous agents that execute tasks end to end. Using production data from Perplexity's Search and Computer products, we study this transition by examining how AI agents accelerate and reshape knowledge work. Three key empirical findings emerge. First, using sessions with near-identical initial query pairs as natural experiments for the same underlying task attempted with both products, Computer performs 26 minutes of autonomous work per user session, versus 33 seconds for Search. Computer automates task decomposition and execution that Search users might otherwise manually orchestrate and implement. As a result, Computer shifts follow-up query distribution toward higher-order work such as verification and extension. Autonomy also increases execution quality, with per-query dissatisfaction rates 55% lower on Computer than on Search. Second, due to its autonomy advantage, Computer reduces completion time from 269 to 36 minutes on matched tasks, lowering estimated time and cost by 87% and 94%, respectively, compared to humans equipped with Search alone. Third, Computer changes the scope of work that users attempt: Computer queries more often cross occupational boundaries, require higher-order cognition, draw on broader expertise, take the form of composite tasks that bundle interdependent subtasks into a single query, and unlock work activities that are essentially absent from Search usage among the same users. Together, the evidence indicates that AI agents accelerate workflows, enhance output quality, reduce costs, and expand the breadth and depth of automated work.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper provides the first large-scale field evidence on how the transition from conversational AI assistants to autonomous AI agents affects knowledge work along three dimensions: autonomy, efficiency, and scope. Using production data from Perplexity's Search (conversational assistant) and Computer (autonomous agent) products, the authors document that agents perform 48× more machine work per session, reduce task completion time by 87% and cost by 94%, and expand the complexity and cross-occupational breadth of work users attempt. The paper also contributes a simple task-based conceptual framework (knapsack model with fixed delegation costs and lower marginal execution costs for agents) that generates testable predictions about task selection under different AI regimes.
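The headline ratios can be sanity-checked from the rounded figures quoted in the abstract. The sketch below is only an arithmetic check, not the authors' computation; the constant and function names are illustrative. With rounded inputs the machine-work ratio comes out near 47×, slightly under the 48× cited above, which presumably reflects unrounded session durations in the paper (an assumption). The 94% cost figure cannot be reproduced this way because the abstract does not state the cost inputs.

```python
# Rounded figures quoted in the abstract (not the paper's underlying data).
AGENT_WORK_SECONDS = 26 * 60   # Computer: 26 minutes of autonomous work per session
ASSISTANT_WORK_SECONDS = 33    # Search: 33 seconds per session
BASELINE_MINUTES = 269         # completion time for a human equipped with Search
AGENT_MINUTES = 36             # completion time with Computer on matched tasks


def ratio(numerator: float, denominator: float) -> float:
    """Return how many times larger the numerator is than the denominator."""
    return numerator / denominator


def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


work_ratio = ratio(AGENT_WORK_SECONDS, ASSISTANT_WORK_SECONDS)
time_saving = percent_reduction(BASELINE_MINUTES, AGENT_MINUTES)

print(f"machine-work ratio: {work_ratio:.1f}x")
print(f"time reduction: {time_saving:.1f}%")
```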

Methodological Rigor

Strengths in identification strategy. The matched-pair design—pairing near-identical initial queries (cosine similarity >0.99) from the same users across both products—is the paper's methodological centerpiece. This within-user, within-task comparison controls for both user heterogeneity and task content, providing arguably cleaner identification than most field studies in this space. The authors appropriately acknowledge that this covers only a subset of Computer usage (tasks with Search analogues) and note the likely underestimation given Computer queries tend to be more complex.
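A minimal sketch of the matching step, assuming each initial query has already been turned into an embedding vector. The embedding model, the data layout, and the function names are hypothetical; the assessment specifies only the same-user restriction and the 0.99 cosine-similarity threshold.

```python
import math
from typing import List, Sequence, Tuple

SIMILARITY_THRESHOLD = 0.99  # threshold quoted in the assessment

# Hypothetical record layout: (user_id, query_id, embedding vector)
Query = Tuple[str, str, Sequence[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_pairs(
    search_queries: List[Query],
    computer_queries: List[Query],
    threshold: float = SIMILARITY_THRESHOLD,
) -> List[Tuple[str, str, float]]:
    """Pair Search and Computer queries from the same user above the threshold."""
    pairs = []
    for s_user, s_id, s_vec in search_queries:
        for c_user, c_id, c_vec in computer_queries:
            if s_user != c_user:
                continue  # within-user design: never pair across users
            similarity = cosine_similarity(s_vec, c_vec)
            if similarity > threshold:
                pairs.append((s_id, c_id, similarity))
    return pairs


# Toy vectors for illustration only.
search = [("u1", "s1", [1.0, 0.0, 0.2])]
computer = [
    ("u1", "c1", [1.0, 0.01, 0.2]),  # near-identical query, same user
    ("u1", "c2", [0.0, 1.0, 0.0]),   # unrelated query, same user
    ("u2", "c3", [1.0, 0.0, 0.2]),   # identical query, different user
]
matched = match_pairs(search, computer)
print([(s_id, c_id) for s_id, c_id, _ in matched])
```

The quadratic loop is fine for a toy example. At production scale an approximate nearest-neighbour index would be the usual substitute, though the assessment does not say what the authors used.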

Triangulation. The efficiency estimates are triangulated via three independent methods: tool-based time mapping, LLM-based estimation, and user interviews (n=25). The convergence across methods (87% vs. 84% time savings) strengthens the findings. The sensitivity analysis showing robustness to 16× overstatement in per-tool time estimates and 26× inflation in oversight assumptions is particularly convincing.

Weaknesses. Several methodological concerns warrant attention. First, the paper relies heavily on LLM-based classification for scope analysis (Bloom's taxonomy, O*NET mapping, task composability), introducing measurement error of unknown magnitude. While the authors note that "the magnitude of the gaps suggests that the patterns are unlikely to be driven by classification noise alone," no formal validation of classifier accuracy is provided. Second, the "do"-tool gating for Computer queries creates a selection issue: Computer sessions that don't invoke execution tools are excluded, potentially inflating the autonomy and efficiency gaps for the measured subset. Third, the 10-minute fixed oversight assumption for Computer + Human is somewhat arbitrary, though sensitivity analysis partially addresses this. Fourth, the complementarity analysis (Appendix D) uses exact matching on observable characteristics, but unobservable differences between Computer adopters and non-adopters may still confound the DiD estimates.

The conceptual framework, while clean, is stylized to the point of being almost tautological—if agents have lower marginal costs and higher fixed costs, it follows mechanically that they dominate for complex tasks. The framework's value lies more in organizing the empirical analysis than in generating non-obvious predictions.
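The mechanical nature of that prediction is easy to see in a toy cost comparison. The sketch below is not the paper's knapsack model; it assumes a linear cost per unit of task complexity and uses hypothetical parameter values, purely to show where the crossover falls.

```python
# Hypothetical parameters, chosen only for illustration.
AGENT_FIXED_COST = 10.0         # one-off cost of delegating a task to the agent
AGENT_MARGINAL_COST = 1.0       # cost per unit of task complexity with the agent
ASSISTANT_MARGINAL_COST = 5.0   # cost per unit of complexity with the assistant


def assistant_cost(complexity: float) -> float:
    """Assistant regime: no fixed cost, higher marginal cost."""
    return ASSISTANT_MARGINAL_COST * complexity


def agent_cost(complexity: float) -> float:
    """Agent regime: fixed delegation cost, lower marginal cost."""
    return AGENT_FIXED_COST + AGENT_MARGINAL_COST * complexity


def break_even_complexity() -> float:
    """Complexity at which both regimes cost the same: F / (m_assistant - m_agent)."""
    return AGENT_FIXED_COST / (ASSISTANT_MARGINAL_COST - AGENT_MARGINAL_COST)


def cheaper_regime(complexity: float) -> str:
    """Return the regime with the lower total cost at this complexity."""
    return "agent" if agent_cost(complexity) < assistant_cost(complexity) else "assistant"


print(break_even_complexity())
print(cheaper_regime(1.0), cheaper_regime(5.0))
```

Any parameter choice with a positive fixed cost and a lower agent marginal cost yields the same qualitative result: the assistant wins below the break-even point and the agent wins above it.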

Potential Impact

This paper is likely to be highly influential across several dimensions:

1. Economics of AI literature: It provides the first comprehensive empirical evidence on agent-level (not just assistant-level) productivity impacts across diverse knowledge work domains, filling an important gap between coding-specific studies (Demirer et al., Cui et al.) and theoretical exposure analyses (Eloundou et al., Felten et al.).

2. Organizational design: The scope expansion findings—users crossing occupational boundaries and undertaking more complex, multi-domain tasks—have direct implications for how firms structure teams, define roles, and coordinate work. The evidence that 23% of Computer queries involve task statements absent from the same users' Search behavior suggests genuine capability expansion, not just acceleration.

3. Labor economics: The finding that agents don't just speed up existing tasks but change what tasks are attempted provides empirical grounding for theoretical models of task creation/displacement (Acemoglu & Restrepo). The horizontal expansion across occupations and vertical expansion into higher-order cognition suggest agents may compress occupational boundaries.

4. AI product design: The paper provides a useful two-dimensional framework (autonomy × context integration) for characterizing AI products and documents how product architecture affects user behavior and economic outcomes.

Timeliness & Relevance

This paper is exceptionally timely. The shift from chatbots to agents is the central narrative in AI product development in 2025-2026, yet rigorous empirical evidence has been scarce. The 90-day observation window (Feb-May 2026) captures the very early deployment of general-purpose agent orchestration systems, making this among the first systematic studies of this product category. The breadth of knowledge work domains covered (18 domains, from legal to healthcare to finance) distinguishes it from the coding-dominated agent literature.

Key Strengths

  • Scale and ecological validity: Production data from hundreds of thousands of users performing real tasks, not lab experiments with artificial constraints
  • Within-user design: Same users observed across both products, controlling for individual heterogeneity
  • Multi-method triangulation for efficiency estimates
  • Comprehensive scope analysis using established taxonomies (Bloom's, O*NET) at multiple granularity levels
  • Breadth across knowledge work domains rather than narrow focus on coding

Notable Limitations

  • Conflict of interest: Three of four authors are Perplexity employees, and the paper effectively evaluates Perplexity's own product. While the methodology is generally sound, this creates incentive alignment concerns regarding framing and selective reporting.
  • Generalizability: The 90-day early-adoption window captures power users and paying subscribers, not representative workers.
  • No quality validation beyond satisfaction signals: User satisfaction (next-turn dissatisfaction) is a noisy proxy for output quality; no independent quality assessment of Computer outputs is conducted.
  • Missing counterfactuals: The paper doesn't observe what users actually do outside the Perplexity ecosystem, potentially missing important workflow context.
  • LLM-as-classifier validation gap: Heavy reliance on LLM classification without ground-truth validation for the scope analysis.

Overall Assessment

    This is a well-executed empirical study that addresses a timely and important question with large-scale production data and a thoughtful identification strategy. The findings on autonomy and efficiency are convincing; the scope expansion results are more novel but rest on less validated measurement. The conflict of interest is a legitimate concern but does not invalidate the methodology. The paper will likely serve as a key reference point for understanding how autonomous agents reshape knowledge work.

    Rating: 7.5 / 10
    Significance 8 · Rigor 6.5 · Novelty 7.5 · Clarity 8.5

    Generated Jun 8, 2026

    Comparison History (24)

    Lost vs. WorldKernel: A World Model is the Coupling Kernel of Admissible Possible Worlds

    Paper 2 addresses foundational theoretical limitations in causal modeling and counterfactual reasoning, introducing a novel mathematical framework (WorldKernel). While Paper 1 offers valuable empirical insights into current AI utility, Paper 2 tackles core bottlenecks in AI reasoning and world modeling. This deep theoretical contribution has the potential to significantly reshape future AI architectures, learning paradigms, and the broader field of causal inference, giving it higher long-term scientific impact.

    gemini-3.1-pro-preview · Jun 10, 2026

    Lost vs. From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

    Paper 1 has higher estimated scientific impact due to a clearly novel, training-free, entropy-guided adaptive inference mechanism for long-context LLMs with concrete algorithmic contributions (per-head/segment allocation and decoding-time latent KV compression), strong methodological rigor via multi-model benchmarking and measurable speedups at 100k+ tokens, and broad applicability to widely used LLM inference stacks. Paper 2 is timely and valuable empirically, but is more product-specific/observational, with narrower generalizable methodological contribution and higher risk of confounds and limited reproducibility.

    gpt-5.2 · Jun 9, 2026

    Won vs. Personalization Meets Safety: Mechanisms, Risks, and Mitigations in Personalized LLMs

    Paper 2 presents novel empirical findings from production data analyzing how AI agents reshape knowledge work, with concrete quantitative results (87% time reduction, 94% cost reduction, 55% lower dissatisfaction). It addresses the timely transition from conversational AI to autonomous agents with real-world production evidence, offering broad implications across economics, HCI, and AI research. Paper 1, while comprehensive, is a survey/review paper synthesizing existing work on personalized LLM safety—valuable but inherently less novel. Paper 2's empirical contributions and immediate practical relevance give it higher impact potential.

    claude-opus-4-6 · Jun 9, 2026

    Won vs. Front-to-Attractors: Modifying the Front-to-Front Heuristic in Bidirectional Search

    Paper 2 presents novel empirical findings about how AI agents reshape knowledge work using large-scale production data, addressing a timely and broadly relevant topic. Its findings on autonomy, efficiency, cost reduction, and scope expansion have wide-ranging implications across economics, HCI, organizational science, and AI policy. Paper 1, while technically solid, makes an incremental contribution to bidirectional search heuristics—a narrower subfield of AI. Paper 2's relevance to the rapidly evolving AI agent landscape and its potential to influence policy, labor economics, and product design gives it substantially broader impact potential.

    claude-opus-4-6 · Jun 8, 2026

    Lost vs. StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

    Paper 1 offers a novel, technically specific method (entity-stain tracking + evidence linking) that advances process reward modeling and credit assignment for GUI-agent RL, with clear methodological contributions, benchmarks (AndroidWorld/OGRBench), and measurable gains. Its ideas may generalize to other long-horizon, partially observable agent settings (web, robotics, tool-use), giving broader cross-field impact. Paper 2 is timely and potentially influential for policy/industry, but is primarily an observational study on proprietary production data with limited methodological innovation and harder-to-reproduce claims, which may constrain long-term scientific impact.

    gpt-5.2 · Jun 8, 2026

    Lost vs. Breaking the Reversal Curse in Autoregressive Language Models via Identity Bridge

    Paper 1 addresses a fundamental, widely-recognized limitation of autoregressive LLMs (the reversal curse) with both rigorous theoretical proofs and empirical validation. By providing a novel, low-cost solution that challenges existing assumptions about LLM memorization vs. rule-learning, it promises widespread methodological adoption in foundational model training. Paper 2, while valuable for HCI and economics, is an observational study of a specific commercial product's workflow impact, making Paper 1's core algorithmic and theoretical contributions far more foundational and transformative for the AI field.

    gemini-3.1-pro-preview · Jun 8, 2026

    Won vs. Accelerated Fourier SAT (AFSAT): Fully Realising a GPU-based Symmetric Pseudo-Boolean SAT Solver

    Paper 1 presents novel empirical findings about how AI agents reshape knowledge work using production-scale data, addressing a timely and broadly impactful topic. Its findings on autonomy, efficiency gains (87% time reduction, 94% cost reduction), and scope expansion have immediate implications for economics, labor markets, HCI, and AI policy. The natural experiment methodology and large-scale production data provide strong evidence. Paper 2, while technically solid, is a narrower engineering contribution improving a GPU-based SAT solver, with more limited audience and applicability. Paper 1's relevance to the rapidly evolving AI agent landscape gives it significantly broader impact potential.

    claude-opus-4-6 · Jun 8, 2026

    Won vs. Workflow-to-Skill: Skill Creation via Routing-Workflow-Semantics-Attachments Decomposition

    Paper 1 provides large-scale empirical evidence from real-world production data detailing how autonomous AI agents impact knowledge work. Its findings on efficiency gains and shifting work scopes have broad, cross-disciplinary implications spanning economics, HCI, and AI policy. While Paper 2 offers a strong technical contribution for LLM agent skill creation, its impact is narrower and largely confined to agent architecture. Paper 1's broader societal and scientific relevance gives it higher potential impact.

    gemini-3.1-pro-preview · Jun 8, 2026

    Lost vs. Towards World Models in Biomedical Research

    While Paper 1 provides valuable empirical data on AI's impact on productivity, Paper 2 proposes a transformative paradigm for biomedical research. Developing 'biomedical world models' to simulate biological futures could revolutionize drug discovery, personalized medicine, and clinical trials. The potential to proactively model complex biological systems and interventions offers profound, life-saving scientific consequences that outweigh the efficiency gains of current AI agents in standard knowledge work.

    gemini-3.1-pro-preview · Jun 8, 2026

    Won vs. The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective

    Paper 1 presents novel empirical findings with large-scale production data quantifying how AI agents reshape knowledge work, offering concrete metrics on autonomy, efficiency gains (87% time reduction, 94% cost reduction), and scope expansion. Its real-world evidence on a timely topic (autonomous AI agents vs. assistants) has broad implications for economics, labor, and AI deployment. Paper 2 proposes a useful conceptual framework mapping the sim-to-real gap to foundation model agents via MDP formalization, but is primarily a position/agenda paper without substantial empirical validation, limiting its immediate impact.

    claude-opus-4-6 · Jun 8, 2026