Dynamics of collective creativity in AI art competitions

Mason Youngblood, Jeff Nusz, Joel Simon

May 16, 2026

arXiv:2605.17141v1

cs.AI (primary)

#1114 of 2292 · Artificial Intelligence

Tournament Score: 1415 ± 45 (scale 1050 to 1800)

Win Rate: 56%

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 7 · Clarity 7.5

Abstract

Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily "remix parties" where users iteratively build on each other's work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic "attractors" (e.g., steampunk scenes, alien architecture). We also find that while more novel "parent" images produce more novel and complex "children" that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper leverages Artbreeder's "remix parties"—daily events where users iteratively build on each other's AI-generated images from a single seed—as a naturally occurring iterated learning experiment. The core contribution is extending the iterated learning paradigm from linear, human-only transmission chains to networked, human-AI hybrid systems. The study analyzes 130,882 images from 368 remix parties over 13 months and identifies several key dynamics: (1) images simplify and converge toward thematic "attractors" over successive remixes, consistent with classical iterated learning predictions; (2) a paradox where novel images receive more likes but are less likely to be remixed; and (3) larger populations produce more novelty but lower complexity.

The most interesting finding is the decomposition of appreciation versus transmission—consumers value novelty, but producers select simpler, less novel inputs for remixing. This asymmetry is invisible in traditional iterated learning chains where the learner and selector are the same person, making it a genuinely novel theoretical insight.
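To make the two signals concrete, here is a minimal sketch of a branching lineage in which appreciation (likes) and transmission (remix counts) are read off separately for each image. The image ids, like counts, and the `images` layout are hypothetical toy values, not the paper's data format.

```python
from collections import defaultdict

# Hypothetical toy lineage: image id -> (parent id, or None for the seed; like count).
images = {
    "seed": (None, 3),
    "a": ("seed", 5),
    "b": ("seed", 1),
    "c": ("a", 8),
    "d": ("a", 0),
    "e": ("c", 2),
}

def generation(image_id):
    """Number of remix steps separating an image from the seed."""
    depth = 0
    parent = images[image_id][0]
    while parent is not None:
        depth += 1
        parent = images[parent][0]
    return depth

# Transmission signal: the direct remixes (children) each image received.
children = defaultdict(list)
for image_id, (parent, _likes) in images.items():
    if parent is not None:
        children[parent].append(image_id)

def grandchildren(image_id):
    """Remixes of an image's remixes."""
    return [g for c in children[image_id] for g in children[c]]

# Appreciation (likes) and transmission (children, grandchildren) side by side.
for image_id, (_parent, likes) in images.items():
    print(image_id, generation(image_id), likes,
          len(children[image_id]), len(grandchildren(image_id)))
```

In a linear chain every image has exactly one child, so the remix count carries no information; only in a branching structure like this can likes and remix counts diverge.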

Methodological Rigor

The methodology is generally sound but has notable limitations the authors partially acknowledge. The use of OpenCLIP embeddings to project images and text into a shared representational space is well-motivated, and the operationalization of novelty via neural density estimation (masked autoregressive flow) is more sophisticated than simple distance metrics. The odd/even party split for training and evaluation of the density estimator avoids data leakage. Image complexity via SAM segment counts correlates with perceptual complexity, though this is a coarse proxy.
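The split-then-score idea can be sketched in a few lines. This is only an illustration under stated assumptions: a diagonal Gaussian is swapped in for the masked autoregressive flow, random 2-D vectors stand in for OpenCLIP embeddings, novelty is assumed to be scored as negative log-density, and which half (odd or even) is used for fitting is a guess.

```python
import math
import random

random.seed(0)

# Hypothetical stand-in for image embeddings: party id -> list of 2-D vectors.
parties = {
    party_id: [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
    for party_id in range(1, 11)
}

# Fit on odd-numbered parties and score even-numbered ones, so no scored
# image contributes to the density it is scored against (assumed direction).
train = [v for pid, vecs in parties.items() if pid % 2 == 1 for v in vecs]
held_out = [v for pid, vecs in parties.items() if pid % 2 == 0 for v in vecs]

def fit_diagonal_gaussian(vectors):
    """Per-dimension mean and variance; a simple stand-in density model."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    variances = [
        sum((v[d] - means[d]) ** 2 for v in vectors) / len(vectors)
        for d in range(dims)
    ]
    return means, variances

def novelty(vector, means, variances):
    """Negative log-density: larger values mean a more atypical vector."""
    return sum(
        0.5 * math.log(2 * math.pi * var) + (x - mu) ** 2 / (2 * var)
        for x, mu, var in zip(vector, means, variances)
    )

means, variances = fit_diagonal_gaussian(train)
typical = novelty((0.0, 0.0), means, variances)
atypical = novelty((4.0, 4.0), means, variances)
scores = [novelty(v, means, variances) for v in held_out]
print(f"typical: {typical:.2f}  atypical: {atypical:.2f}")
```

A point far from the bulk of the training data receives a higher score than one near the center, which is the sense in which such a measure captures statistical atypicality rather than creative value.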

The Bayesian structural equation model is an appropriate choice for the complex causal structure, and running 15 chains of 2,000 iterations each should give adequate sampling for convergence diagnostics. However, several methodological choices weaken the analysis:

Mean imputation for missing data (~32-37% for text-related variables) is acknowledged as potentially attenuating effects, but given the substantial missingness rates, this is a significant concern. The authors frame their estimates as "conservative lower bounds," which is reasonable but still limiting.

The R² for grandchildren is only 0.090, meaning the model explains very little variance in remixing behavior—the very mechanism that drives the transmission dynamics central to the paper's argument.

The novelty measure captures statistical atypicality, not creative value. While the authors note that more novel images get more likes (suggesting some alignment with perceived quality), incoherent AI outputs could also score as "novel."

The causal DAG assumes specific directional relationships, but the observational nature of the data limits genuine causal inference. Unmeasured confounders (user skill, community norms, time-of-day effects) could explain some patterns.
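The attenuation concern raised for mean imputation can be illustrated with a small simulation. It is a sketch under simplifying assumptions that may not hold for the paper's data: a single predictor, values missing completely at random at a 35% rate, and a simulated linear relationship. Under those assumptions the correlation shrinks by roughly the square root of the observed fraction.

```python
import math
import random

random.seed(1)

def pearson(xs, ys):
    """Pearson correlation computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Simulated data: y depends linearly on x plus noise (illustrative only).
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# Knock out about 35% of x completely at random, then fill with the observed mean.
missing_rate = 0.35
missing = [random.random() < missing_rate for _ in range(n)]
observed = [xi for xi, m in zip(x, missing) if not m]
observed_mean = sum(observed) / len(observed)
x_imputed = [observed_mean if m else xi for xi, m in zip(x, missing)]

r_full = pearson(x, y)
r_imputed = pearson(x_imputed, y)
print(f"full: {r_full:.3f}  mean-imputed: {r_imputed:.3f}")
```

Imputed rows carry no information about the relationship, so they dilute the correlation toward zero. That is consistent with reading the reported effects as lower bounds, but only if missingness is unrelated to the outcome.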

Potential Impact

This paper sits at an important intersection of cultural evolution, computational creativity, and human-AI interaction. Its potential impact spans several areas:

1. Cultural evolution theory: The finding that classical iterated learning signatures persist in networked, AI-mediated systems is significant. It suggests these dynamics are robust to substantial changes in transmission structure, or alternatively, that we need a broader conception of "learner biases" that includes algorithmic biases.

2. Human-AI collaboration research: The paper provides empirical evidence for how generative AI tools shape collective creative processes, relevant to the rapidly growing field of human-AI co-creation.

3. Platform design: The consumer-producer paradox (novel work is appreciated but not remixed) has direct implications for designing creative platforms—how do you encourage exploration when producers gravitate toward simpler inputs?

4. Computational social science: The methodological pipeline (OpenCLIP embeddings → density estimation → Bayesian SEM) is transferable to other platform-scale studies of cultural production.

However, the impact is somewhat limited by the platform specificity. Artbreeder remix parties have particular affordances and norms that may not generalize to other creative domains. The instruction to "keep some aspect of the original" explicitly constrains the creative space.

Timeliness & Relevance

The paper is highly timely. Generative AI is rapidly transforming cultural production, and understanding how human-AI hybrid systems shape collective creativity is an urgent question. The study directly addresses the gap between controlled iterated learning experiments and real-world, at-scale cultural dynamics. The connection to ongoing debates about AI's role in creative industries gives this work broader relevance beyond academic cultural evolution.

Strengths

Scale and ecological validity: 130,882 images across 368 parties over 13 months provides substantial statistical power and ecological validity that lab experiments cannot match.

Novel theoretical decomposition: Separating consumer appreciation (likes) from producer selection (remixing) reveals dynamics invisible in linear transmission chains.

Attractor convergence finding: The demonstration that images converge toward thematic attractors (steampunk, alien architecture) in a branching, AI-mediated system is a compelling extension of iterated learning theory.

Honest limitation discussion: The authors are forthright about what the data can and cannot tell us.

Code availability: Open-source code supports reproducibility.

Limitations

Low explanatory power for transmission: The R² = 0.090 for grandchildren undermines claims about what drives remixing behavior.

Missing data handling: Mean imputation at 32-37% missingness is a substantial weakness.

Cannot disentangle bias sources: Human cognitive biases, AI model biases, and platform norms all contribute to the observed patterns, and the observational design cannot separate them.

Single platform: Generalizability beyond Artbreeder is uncertain.

No individual-level analysis: User heterogeneity is unmodeled—some users may drive disproportionate amounts of novelty or convergence.

Preprint status: The paper has not yet undergone peer review.

Overall Assessment

This is a well-conceived study that applies cultural evolution theory to a timely and understudied phenomenon—collective creativity in human-AI systems at scale. The consumer-producer paradox and the persistence of iterated learning dynamics in networked AI-mediated settings are genuinely interesting findings. However, methodological limitations (mean imputation, low R² for key outcomes, inability to disentangle bias sources) temper the strength of the conclusions. The paper makes a solid contribution to cultural evolution and human-AI interaction research, though its impact would be strengthened by complementary experimental work that can establish causal mechanisms.

Rating: 6.5 / 10

Significance 7 · Rigor 5.5 · Novelty 7 · Clarity 7.5

Generated May 19, 2026

Comparison History (18)

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

gpt-5.2 · 5/20/2026

Paper 1 has higher likely impact: it introduces a reusable, large-scale benchmark substrate for emergent delegation/orchestration in long-horizon agent workflows with standardized interfaces, metrics, deterministic annotations, and extensive reference sweeps plus released artifacts—directly enabling rigorous, comparable progress across many LLM-agent methods and vendors. Its applications (agent routing, tool/model selection, cost/latency-quality tradeoffs) are immediate and broadly relevant to ML systems and deployment. Paper 2 is timely and methodologically solid observational science with cross-disciplinary interest, but its contributions are more domain-specific and less likely to catalyze widespread methodological advances.

vs. PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamentally interdisciplinary question about collective creativity dynamics in human-AI systems, with broad implications for cultural evolution, computational creativity, and social science. Its empirical analysis of 130,882 images reveals paradoxical behavioral patterns (users prefer remixing less novel works despite novel parents producing more liked outputs), offering genuinely novel theoretical insights. Paper 1, while methodologically solid with a useful benchmark, addresses a narrower technical problem (programmatic video generation evaluation) with more limited cross-disciplinary appeal. Paper 2's findings about human-AI co-creation dynamics are timely and relevant to a much wider audience.

vs. Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

claude-opus-4.6 · 5/19/2026

Paper 2 has higher estimated scientific impact due to its broader interdisciplinary relevance spanning cultural evolution, creativity research, human-AI interaction, and computational social science. It analyzes a large empirical dataset (130K+ images) revealing fundamental dynamics of collective creativity in human-AI systems—a timely topic with growing real-world significance. Its findings about cultural attractors, the paradox of novelty preference vs. remixing behavior, and group-size effects have implications across multiple fields. Paper 1, while technically innovative, addresses a narrower problem in executable world models within a specific game environment, limiting its breadth of impact.

vs. RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

gemini-3.1 · 5/19/2026

Paper 1 offers a large-scale, rigorous empirical analysis of a novel phenomenon (human-AI cultural evolution), providing foundational insights into collective creativity. In contrast, Paper 2 presents a specialized engineering framework for RAG and KG construction with only preliminary experimental validation. Paper 1's robust methodology, large dataset, and broad implications for computational social science and human-computer interaction give it a higher potential for lasting scientific impact.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

gpt-5.2 · 5/19/2026

Paper 1 has higher potential impact due to its methodological and translational reach: it proposes a unified, intervention-aware framework linking forecasting, counterfactual trajectory estimation, and policy evaluation while explicitly handling time-varying confounding and informative observation—core barriers to clinically actionable AI. Its applications (treatment-sensitive predictions, policy stress-testing, safer closed-loop learning health systems) are high-stakes and broadly relevant across biostatistics, causal inference, ML, and healthcare delivery. Paper 2 is novel and well-powered empirically, but its impact is more domain-specific (computational social science/creativity) and less likely to reshape high-consequence decision pipelines.

vs. Skim: Speculative Execution for Fast and Efficient Web Agents

gemini-3.1 · 5/19/2026

Paper 2 offers a highly practical, systems-level solution to a major bottleneck in modern AI: the cost and latency of autonomous web agents. By applying speculative execution to web navigation, it achieves quantifiable, significant improvements (1.9x cost reduction, 33.4% latency reduction) without sacrificing accuracy. While Paper 1 provides fascinating theoretical insights into human-AI cultural evolution, Paper 2 has immediate, broad, and highly scalable real-world applications across the booming field of AI agent research and industry deployment.

vs. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel neuro-symbolic framework with a Probabilistic Inconsistency Signal that reframes temporal QA as a structural alignment problem rather than a reasoning deficit. Its methodological rigor (perfect accuracy on controlled benchmarks, deterministic failure localization) and direct implications for reliable AI systems give it high impact potential in the active neuro-symbolic AI field. Paper 2 offers interesting empirical findings on human-AI co-creativity but is more observational and narrower in its technical contributions, with impact largely confined to computational social science and cultural evolution.

vs. Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to its large-scale, real-world dataset (130,882 images) and broad relevance across cultural evolution, computational social science, HCI, network science, and human–AI co-creation. It studies an emergent, timely phenomenon (AI-mediated cultural production) in a naturalistic setting, yielding generalizable insights (attractors, novelty/complexity trade-offs, preference paradox) that can inform theory and platform design. Paper 1 is methodologically innovative and highly applicable to LLM evaluation/routing, but its impact is more domain-specific to NLP/LLM assessment practices.

vs. TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

gemini-3.1 · 5/19/2026

Paper 1 offers profound, cross-disciplinary insights into cultural evolution and human-AI co-creation. While Paper 2 provides a valuable, industry-specific LLM benchmark, Paper 1 explores fundamental scientific questions about collective creativity, uncovering paradoxical dynamics in networked human-AI systems. Its findings on cultural attractors and transmission biases have broad theoretical implications across sociology, cognitive science, and AI, giving it a higher potential for foundational scientific impact compared to the applied, domain-specific utility of Paper 2.

vs. Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact: it proposes a concrete algorithmic advance for RL in open-ended generation (pairwise preference rewards + explicit group-level diversity in a unified objective), directly targeting major, timely issues in LLM alignment (reward modeling cost, RLVR diversity collapse). This is broadly applicable across many generative NLP tasks and can be integrated into existing RLHF/RLAIF pipelines, increasing practical adoption potential. Paper 2 is a strong large-scale empirical study of human–AI cultural dynamics, but its contributions are primarily descriptive and may have narrower methodological transfer to core ML systems development.

vs. POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

gemini-3.1 · 5/19/2026

Paper 1 explores the intersection of human-AI interaction, cultural evolution, and collective creativity, offering broad, interdisciplinary insights into how AI tools shape cultural production. This timeliness and relevance to emerging societal trends give it broader scientific and cultural impact compared to Paper 2, which, while methodologically rigorous, focuses on a highly specialized technical problem in time series anomaly detection.

vs. ShopGym: An Integrated Framework for Realistic Simulation and Scalable Benchmarking of E-Commerce Web Agents

claude-opus-4.6 · 5/19/2026

Paper 2 addresses fundamental questions about collective creativity, cultural evolution, and human-AI co-creation using a large-scale empirical dataset. Its findings about attractor dynamics, the paradox between preference and novelty, and how group size affects creative output have broad implications across cognitive science, cultural evolution, AI, and social science. Paper 1, while technically solid, addresses a narrower engineering problem (benchmarking e-commerce web agents) with impact largely limited to the AI agents community. Paper 2's interdisciplinary relevance and novel empirical insights give it higher potential impact.

vs. Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights

gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in modern AI—LLM hallucinations in RAG systems. By providing a comprehensive benchmark and analyzing realistic label noise, it offers immediate, highly relevant practical applications for AI safety and reliability. This will likely lead to widespread adoption and high citation counts in the rapidly moving NLP field. While Paper 2 offers fascinating insights into cultural evolution and HCI, its practical applications are less immediate and its target audience is narrower compared to the massive, ongoing efforts in LLM evaluation.

vs. Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

gemini-3.1 · 5/19/2026

Paper 2 addresses fundamental scientific questions regarding cultural evolution and collective creativity in modern human-AI systems, supported by a large empirical dataset. Its findings on the paradoxical preferences and evolutionary dynamics of human-AI co-creation offer broad, interdisciplinary impact across cognitive science, sociology, and HCI. In contrast, Paper 1 presents a practical software engineering framework; while useful for developers, its contribution is primarily technical and tooling-focused rather than advancing foundational scientific knowledge.

vs. NeuroAgent: LLM Agents for Multimodal Neuroimaging Analysis and Research

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical bottleneck in neuroimaging research by automating complex, multi-modal preprocessing and analysis workflows using LLM agents. Its demonstration on Alzheimer's Disease classification with high accuracy highlights direct, high-impact clinical and scientific applications. While Paper 1 offers interesting insights into cultural evolution and human-AI co-creation, Paper 2 provides a highly practical, methodologically rigorous tool that can significantly accelerate research across neuroscience and medical imaging.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a novel intersection of cultural evolution, collective creativity, and human-AI co-creation at scale, analyzing a unique large-scale dataset (130K+ images). It contributes fundamental insights about how creativity emerges in networked human-AI systems, with broad implications across cultural evolution, computational creativity, and social computing. Paper 2, while rigorous, addresses a more incremental question (contamination in LLM legal reasoning) within a narrower domain. Paper 1's findings about attractor dynamics and the paradox of novelty preferences are more likely to inspire cross-disciplinary research.

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a fundamentally novel interdisciplinary question about collective creativity dynamics in human-AI systems, with broad implications across cultural evolution, computational creativity, cognitive science, and sociology. Its large-scale empirical analysis (130K+ images) of real-world creative behavior reveals paradoxical findings about novelty preferences that challenge existing theories. Paper 1, while technically solid, is an incremental improvement to RL training for diffusion MLLMs—a rapidly evolving area where specific methods are quickly superseded. Paper 2's findings about cultural attractors in human-AI co-creation have longer-lasting scientific significance.

vs. Learning to Learn from Multimodal Experience

gemini-3.1 · 5/19/2026

Paper 1 proposes a fundamental advancement in AI agent architecture by shifting from static to adaptive multimodal memory. This addresses a critical bottleneck in the highly active field of multimodal agents, offering broad applicability across various real-world AI applications. While Paper 2 provides fascinating empirical insights into cultural evolution and human-AI interaction, Paper 1's algorithmic innovations have a higher potential for widespread technological integration and foundational impact in artificial intelligence.