Nonslop: A Gamified Experiment in Human-AI Collaborative Writing

Maria Edwards, Julian Togelius

cs.AI
#3270 of 3489 · Artificial Intelligence
Tournament Score: 1233 ± 47
Win Rate: 19% (3 wins, 13 losses, 16 matches)
Rating: 4.2 / 10
Significance 4.5 · Rigor 3 · Novelty 6 · Clarity 6.5

Abstract

The rapid proliferation of large language models (LLMs) raises critical questions about human creativity and individual expression in an era of AI-assisted creation. When do humans adopt AI suggestions, and what are the implications for individual voice? This study examines these questions through a gamified writing exercise where 74 participants (214 responses) replied to prompts while AI-generated word suggestions were available as they wrote. The game simulates a dystopian future in which an AI is attempting to learn from what remains of human individuality, and disincentivizes AI-like writing. In doing so, it attempts to create conditions that reveal authentic user preferences rather than default behaviors, such as accepting a readily available AI-generated suggestion. This is a deliberate inversion of the "helpful assistant" design pattern: the system explicitly forbids participants from accepting AI suggestions. We analyze user behavior patterns across different task types, user behaviors, and response characteristics to understand the factors influencing human-AI interaction in creative tasks. The study focuses on when users choose to maintain creative autonomy and when they violate the rules of the game and accept AI assistance. It also explores how these choices relate to response patterns, task characteristics, and user behavior. This gamified approach offers both a framework for studying authentic human-AI interaction and a provocative lens for understanding the tension between efficiency and authenticity in AI-augmented creativity.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: *Nonslop: A Gamified Experiment in Human-AI Collaborative Writing*

1. Core Contribution

Nonslop introduces a gamified web-based experiment that inverts the standard "helpful assistant" paradigm of AI writing tools. Rather than encouraging users to accept AI-generated suggestions, the system frames suggestion adoption as a rule violation within a dystopian narrative. The paper's core claim is that this inversion can reveal authentic user preferences about AI assistance that are otherwise obscured by frictionless interface designs. The study collects behavioral data from 74 participants (214 valid submissions) and analyzes when and why users adopt AI-generated word suggestions under disincentivized conditions.

The paper makes three stated contributions: (1) a lightweight framework for studying human-AI writing interaction, (2) empirical analysis of AI suggestion use under discouragement, and (3) observations linking task type and user behavior to AI adoption patterns. The most interesting finding is the differential adoption rate across prompt categories: explanatory prompts showed roughly 6x higher transgression (AI adoption) rates than creative prompts, which suggests that the perceived "correctness" or objectivity of a task influences willingness to accept AI assistance.

2. Methodological Rigor

The methodology has significant limitations that the authors partially acknowledge. The sample size is small (74 users, 214 submissions), and substantial attrition occurred: 53% of attempted users could not participate due to browser/OS/network restrictions from the Web-LLM requirement. This introduces non-trivial selection bias — participants skew toward users with compatible hardware, likely more technically sophisticated individuals. The authors acknowledge this but do not adequately discuss its implications for generalizability.

The k-means clustering (k=3) on only 74 users with three features is statistically fragile. No validation metrics (e.g., silhouette scores, gap statistics) are reported for the clustering, nor is there justification for k=3 beyond apparent interpretability. The "minimalist" cluster (72% of users) may simply reflect disengagement or confusion rather than a deliberate strategy, especially given reported latency issues with in-browser inference.
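The kind of validation the review asks for can be sketched as follows. This is a minimal, dependency-free illustration of the silhouette coefficient, not the authors' analysis: the six three-feature points and both labelings are hypothetical, chosen only to show that a well-separated partition scores near 1 while an arbitrary one scores much lower.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to every other member of a cluster.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: distance within own cluster
        b = min(mean_dist(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical per-user feature vectors (three features, as in the review's description).
points = [(0, 0, 0), (0, 1, 0), (10, 0, 0), (10, 1, 0), (0, 10, 0), (1, 10, 0)]
good_labels = [0, 0, 1, 1, 2, 2]  # matches the visible grouping
bad_labels = [0, 1, 2, 0, 1, 2]   # arbitrary assignment

print("good partition:", round(silhouette_score(points, good_labels), 3))
print("bad partition: ", round(silhouette_score(points, bad_labels), 3))
```

In practice one would compute this score for several candidate values of k on the real user features and report the comparison; a library routine such as scikit-learn's `silhouette_score` would serve the same purpose.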

The prompt categorization into five types was performed post-hoc and qualitatively, introducing potential confirmation bias. The correlation between prompt popularity and AI adoption (r = −0.261) is weak and based on aggregated prompt-level data with few data points, making it difficult to draw meaningful conclusions.
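To make the concern about few data points concrete, the sketch below computes an approximate confidence interval for a Pearson correlation using the Fisher z-transformation. The value r = -0.261 is taken from the review; the number of prompts is not stated there, so every n used below is a hypothetical placeholder.

```python
import math
from statistics import NormalDist


def pearson_ci(r, n, confidence=0.95):
    """Approximate confidence interval for Pearson's r via the Fisher z-transform."""
    if n <= 3:
        raise ValueError("need more than 3 observations")
    z = math.atanh(r)                      # Fisher transform of r
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


r = -0.261  # reported prompt-level correlation
for n in (10, 20, 50, 200):  # hypothetical numbers of prompts
    lo, hi = pearson_ci(r, n)
    verdict = "includes 0" if lo < 0 < hi else "excludes 0"
    print(f"n={n:>3}: 95% CI = [{lo:.3f}, {hi:.3f}] ({verdict})")
```

Under these assumptions, the interval spans zero for small prompt counts, which is consistent with the review's caution about drawing conclusions from this correlation.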

The post-playtest survey (n=7) is too small to contribute meaningfully and is appropriately flagged as anecdotal. The use of GPT-4o-mini for scoring introduces another layer of AI judgment that could interact with the phenomena being studied, though this is acknowledged.

The 73.8% zero-transgression rate is presented as the central finding, but it is difficult to interpret without a control condition. Without baseline rates of suggestion acceptance under a neutral or encouraging framing, it is impossible to tell how much the game's disincentive structure altered behavior and how much the low rate simply reflects ordinary behavior toward unfamiliar, small-model suggestions that may not have been particularly compelling.
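If a control condition existed, the comparison could be as simple as a two-proportion z-test. The sketch below is illustrative only. It assumes the 73.8% figure corresponds to roughly 158 of the 214 submissions (the review does not state the denominator), and the control counts are entirely hypothetical because no control data exists.

```python
import math
from statistics import NormalDist


def two_proportion_z(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0  # both samples are all-success or all-failure
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed: about 158 of 214 submissions had zero transgressions (73.8%).
game_zero, game_n = 158, 214
# Hypothetical control condition with neutral framing; these counts are invented.
control_zero, control_n = 120, 214

z, p = two_proportion_z(game_zero, game_n, control_zero, control_n)
print(f"z = {z:.2f}, two-sided p = {p:.4f}")
```

This test treats submissions as independent, which does not hold when 74 users contribute 214 submissions; a real analysis would more plausibly use a mixed-effects or per-user model.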

3. Potential Impact

The conceptual framing — studying AI adoption by inverting incentives — is genuinely interesting and could inspire follow-up work. The idea that most AI adoption in commercial tools may be driven by interface design rather than genuine preference is a provocative and timely hypothesis. If validated at scale, this could influence how writing tool designers think about defaults and friction.

However, the current execution is too preliminary to have substantial direct impact. The sample is small, the technical constraints are severe, and the findings are largely descriptive without strong causal claims. The framework contribution is more conceptual than methodological — the system itself is relatively simple, and replication would likely require addressing the browser compatibility issues that plagued this deployment.

The connection to critical play and design research (DeepTingle, computational poetics) is well-articulated and positions the work within an interesting interdisciplinary space, though this framing is more of a contribution to discourse than to empirical science.

4. Timeliness & Relevance

The paper addresses an undeniably timely topic. As AI writing assistance becomes ubiquitous, understanding when and why users accept suggestions is increasingly important. The question of whether widespread AI adoption reflects genuine preference or interface-driven defaults is highly relevant to ongoing debates about AI's impact on creativity, homogenization of language, and human agency.

The "inversion" approach — studying resistance rather than adoption — fills a genuine gap, as most existing work studies AI-assisted writing in contexts designed to maximize adoption. This perspective is valuable even if the current study's execution is limited.

5. Strengths & Limitations

Strengths:

  • Creative and well-motivated experimental design that inverts standard paradigms
  • The finding that prompt type (explanatory vs. creative) influences AI adoption is intuitive but empirically supported and potentially useful
  • Honest and thorough discussion of limitations
  • Interesting positioning within critical play and computational poetics traditions
  • The dystopian game narrative is a clever mechanism for making AI suggestion use a conscious, observable choice

Limitations:

  • Small sample with severe selection bias (53% attrition before participation)
  • No control condition to benchmark against standard AI-assisted writing
  • Latency issues with in-browser inference may have suppressed AI suggestion use for technical rather than behavioral reasons
  • Short response lengths (mean 9-41 words depending on cluster) limit the granularity of analysis
  • Statistical methods are basic and sometimes unjustified (k-means without validation)
  • The 0.5B parameter model may generate low-quality suggestions that users would reject regardless of incentive structure, confounding the interpretation
  • No comparison between easy and hard modes despite their fundamentally different mechanics
  • Post-hoc prompt categorization without inter-rater reliability

Overall Assessment

    Nonslop presents a creative experimental concept that addresses a genuine research gap, but the execution is too preliminary and technically constrained to support strong conclusions. The most impactful contribution is conceptual — the idea of inverting incentive structures to study authentic AI adoption preferences. The empirical findings, while suggestive, require larger-scale replication with better controls before they can meaningfully advance the field. This reads as an early-stage workshop or position paper rather than a complete empirical study.

    Rating: 4.2 / 10
    Significance 4.5 · Rigor 3 · Novelty 6 · Clarity 6.5

    Generated Jun 11, 2026

    Comparison History (16)

    Lost vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

    Paper 2 has higher potential impact due to a clear, safety-critical real-world application (automated compliant highway barrier design), a concrete closed-loop multi-agent methodology with reported quantitative gains (>98% accuracy) and open-source code enabling reproducibility and adoption. Its finding that small models can outperform very large ones under constrained agentic optimization is timely and broadly relevant to LLM deployment, engineering automation, and cost-efficient AI. Paper 1 is novel in HCI framing, but its scope (74 participants) and mainly behavioral insights suggest narrower, less directly deployable impact.

    gpt-5.2 · Jun 11, 2026

    Won vs. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

    Paper 1 explores a highly timely and broadly debated topic—the impact of LLMs on human creativity and authenticity. Its novel experimental design, which inverts the typical 'helpful assistant' paradigm using a gamified dystopian scenario, offers a fresh methodological framework for HCI and AI research. While Paper 2 presents a solid, practical application for infrastructure inspection, Paper 1's interdisciplinary relevance and innovative approach give it a higher potential for broad scientific impact across AI, psychology, and HCI.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

    Paper 1 introduces a technically rigorous framework using dialogue policy optimization and decomposed process rewards to assess creativity. Its approach to mitigating reward hacking in educational AI has broad applications in AI-mediated learning and alignment. Paper 2, while offering an interesting behavioral experiment, lacks the algorithmic innovation and technical depth of Paper 1, making Paper 1 more likely to drive future research in both AI assessment and human-AI interaction.

    gemini-3.1-pro-preview · Jun 11, 2026

    Won vs. Evidence-Based Intelligent Diagnostic and Therapeutic Visualization System with Large Language Models: Multi-Turn Interaction and Multimodal Treatment Plan Generation

    Paper 2 addresses a broadly relevant and timely question about human-AI interaction in creative writing, with a novel gamified methodology that inverts typical AI-assistant paradigms. Its findings about when humans accept vs. resist AI suggestions have broad implications across HCI, cognitive science, and AI design. Paper 1, while technically sound, addresses a narrower domain (Traditional Chinese Medicine diagnostics) with incremental integration of existing techniques (knowledge graphs, LLMs), limiting its cross-disciplinary impact and audience reach.

    claude-opus-4-6 · Jun 11, 2026

    Lost vs. PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

    Paper 1 likely has higher scientific impact due to a more novel and actionable systems contribution: an event-sourced, local-first memory layer with deterministic projections and a pre-action governance gate, plus open-source tooling and reproducibility/provenance benefits. Its real-world applicability to AI-assisted software engineering is immediate and broad (agents, IDE tooling, auditing, safety, MLOps-style provenance). While Paper 2 is timely and interesting for HCI/creativity research, its methodological scale (74 participants) and domain specificity suggest narrower downstream impact compared to a deployable infrastructure component for coding agents.

    gpt-5.2 · Jun 11, 2026

    Lost vs. Lung-R1: A Knowledge Graph-Guided LLM for Pulmonary Diagnostic Reasoning

    Lung-R1 presents a novel knowledge graph (LungKG) with 59K nodes and 164K edges for pulmonary diagnosis, combined with KG-guided reinforcement learning—a concrete, reusable resource with clear clinical applications. It demonstrates state-of-the-art results on multiple benchmarks with rigorous evaluation across 20 systems. The direct medical application (pulmonary diagnosis from EMRs) has significant real-world impact potential. Paper 1, while interesting as an HCI study, has a smaller sample size (74 participants), narrower scope, and more exploratory findings about human-AI creative interaction without comparable methodological depth or breadth of impact.

    claude-opus-4-6 · Jun 11, 2026

    Lost vs. Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

    Paper 1 addresses a critical bottleneck in large language models—the quadratic computational cost of long-context self-attention. By demonstrating a novel RL-based method to make efficient sliding-window attention competitive in complex mathematical reasoning, it offers immediate, high-impact applications in foundational AI development. Paper 2 presents an interesting HCI study on creativity, but its small sample size and niche gamified setup limit its broader scientific impact compared to the core architectural advancements in Paper 1.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. Towards Responsibly Non-Compliant Machines

    Paper 2 addresses a fundamental and highly timely issue in AI safety and alignment: responsible non-compliance. Its framework for task refusal, security, and liability has broad implications across AI engineering, ethics, and policy, offering wider real-world impact and broader relevance across fields than Paper 1's specific HCI experiment on creative writing.

    gemini-3.1-pro-preview · Jun 11, 2026

    Lost vs. MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

    Paper 2 likely has higher scientific impact due to a more novel and generalizable technical contribution (claim-level “market” verification plus code synthesis and verification), strong methodological rigor (evaluation across 10 benchmarks with a fixed backbone, explicit error-checking and repair), and clear real-world applicability in high-stakes financial/tabular reasoning. Its approach can transfer to other domains needing grounded numerical reasoning (science, medicine, auditing). Paper 1 is timely and interesting for HCI/creativity studies, but the scale and generalizability are more limited, with narrower cross-field impact and less demonstrated performance/replicability evidence.

    gpt-5.2 · Jun 11, 2026

    Lost vs. Human-Enhanced Loop Modeling (HELM): Agent-Based Finite Element Modeling of Concrete Bridge Barriers

    Paper 1 presents a highly rigorous, quantifiable framework combining AI agents and human-in-the-loop for automating complex finite element modeling in structural engineering. It addresses a significant real-world challenge in safety-critical infrastructure, demonstrating a massive improvement in success rates. Paper 2, while offering interesting HCI insights into AI-assisted writing, relies on a relatively small behavioral study and lacks the concrete, domain-shifting technological application and rigorous engineering impact demonstrated by Paper 1.

    gemini-3.1-pro-preview · Jun 11, 2026