Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution

Can Gurkan, Forrest Stonedahl, Uri Wilensky

#1584 of 3355 · Artificial Intelligence
Tournament Score: 1411±46
Win Rate: 59% · Wins: 10 · Losses: 7 · Matches: 17
Rating: 6.8/10

Abstract

When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates a fundamental but largely overlooked question in the rapidly growing field of LLM-driven evolutionary computation: what happens when LLMs are used as mutation operators without selection pressure? The authors demonstrate that LLM-based mutation consistently converges toward restricted "attractor regions" in program space, with 87% of chains showing over 93% of mutations revisiting previously seen structural forms. The key insight is the separation between surface-level variation (terminal substitutions) and structural diversity (skeleton-level convergence), revealing that apparent diversity often masks deep structural homogeneity.
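
The program/skeleton distinction and the revisit statistic can be made concrete with a small sketch. The Lisp-like syntax, the `fd`/`rt` primitives, and the rule that a skeleton keeps operator symbols while masking every terminal are assumptions for illustration; the paper's actual DSL and skeleton definition may differ.

```python
import re

def tokenize(src):
    """Split a Lisp-like program into parentheses and atoms."""
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    """Parse a token list into nested Python lists (consumes the tokens)."""
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return tok

def skeleton(tree):
    """Keep operator symbols (list heads); replace every terminal with '_'."""
    if isinstance(tree, list):
        if not tree:
            return ()
        head = tree[0] if isinstance(tree[0], str) else skeleton(tree[0])
        return (head,) + tuple(skeleton(child) for child in tree[1:])
    return "_"

def revisit_rate(chain):
    """Fraction of mutation steps whose skeleton already appeared earlier in the chain."""
    seen = {skeleton(parse(tokenize(chain[0])))}
    revisits = 0
    for program in chain[1:]:
        s = skeleton(parse(tokenize(program)))
        if s in seen:
            revisits += 1
        seen.add(s)
    return revisits / (len(chain) - 1)

# Hypothetical 4-step chain: every program text is unique, but only two skeletons occur.
chain = [
    "(if (> x 3) (fd 1) (rt 90))",
    "(if (> y 5) (fd 2) (rt 45))",  # same skeleton, new terminals
    "(if (< x 3) (fd 1) (lt 90))",  # new skeleton (different operators)
    "(if (> x 7) (fd 1) (rt 10))",  # returns to the first skeleton
]
print("unique programs:", len(set(chain)))
print("skeleton revisit rate:", round(revisit_rate(chain), 2))
```

In this toy chain all four programs are textually distinct, yet two of the three mutations land on a skeleton that was already seen, which is the pattern of surface variation over structural repetition described above.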

The paper frames LLM-driven mutation as a dynamical system, borrowing concepts from attractor theory to characterize the behavior. This framing is genuinely novel in the GP/LLM intersection—while convergence has been noted anecdotally in LLM-driven evolutionary systems, no prior work has systematically isolated and characterized the mutation operator's intrinsic bias by removing selection pressure entirely.

Methodological Rigor

The experimental design is thoughtful and well-controlled. Three experiments systematically vary prompt design (50 variants), stochastic replication (30 independent runs per condition), and model family (7 models across 3 providers). The use of a constrained, strongly typed DSL is a deliberate methodological choice that enables precise structural analysis through the program/skeleton dual representation, though it also limits generalizability.

The inclusion of a classical GP subtree mutation baseline is critical and well-executed—it demonstrates that the DSL itself supports sustained exploration (~270 unique programs, ~143 unique skeletons per 300-step chain), confirming that convergence is a property of the LLM operator, not the search space.
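
For readers unfamiliar with the baseline, classical subtree mutation can be sketched as follows. This is a minimal, untyped version: the `FUNCTIONS` and `TERMINALS` sets, the depth limit, and the uniform node choice are assumptions, and the paper's strongly typed DSL is not reproduced.

```python
import random

# Hypothetical primitive set (name -> arity); not the paper's DSL.
FUNCTIONS = {"add": 2, "mul": 2, "neg": 1}
TERMINALS = ["x", "y", "1", "2"]

def random_tree(rng, depth):
    """Grow a random expression tree no deeper than `depth`."""
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    op = rng.choice(sorted(FUNCTIONS))
    return [op] + [random_tree(rng, depth - 1) for _ in range(FUNCTIONS[op])]

def node_paths(tree, path=()):
    """List the index path of every node, root first."""
    paths = [path]
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            paths.extend(node_paths(child, path + (i,)))
    return paths

def replace_at(tree, path, new_subtree):
    """Return a copy of `tree` with the node at `path` replaced."""
    if not path:
        return new_subtree
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new_subtree)
    return copy

def subtree_mutation(tree, rng, max_depth=2):
    """Swap a uniformly chosen node for a freshly grown random subtree."""
    target = rng.choice(node_paths(tree))
    return replace_at(tree, target, random_tree(rng, max_depth))

def to_tuple(tree):
    """Convert nested lists to nested tuples so trees can be stored in a set."""
    return tuple(to_tuple(c) for c in tree) if isinstance(tree, list) else tree

# A 300-step mutation chain with no selection, mirroring the chain length in the review.
rng = random.Random(0)
program = random_tree(rng, 3)
chain = [program]
for _ in range(300):
    program = subtree_mutation(program, rng)
    chain.append(program)
unique_programs = len({to_tuple(p) for p in chain})
print(unique_programs, "unique programs in a chain of", len(chain))
```

Because each replacement subtree is drawn from the primitive set without reference to the current program, the operator has no built-in preference for previously visited forms. The counts this toy prints are not comparable to the ~270 and ~143 figures reported for the paper's DSL.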

The analysis toolkit is comprehensive: cumulative unique counts, transition graphs, cycle analysis, degree entropy, and Levenshtein distance heatmaps. The cycle analysis revealing dominance of 2-cycles and self-loops is particularly compelling and connects nicely to Wang et al.'s findings in paraphrasing dynamics.
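
Several of these measures are simple to compute once a chain is reduced to a sequence of state labels (for example, skeleton identifiers). The sketch below is illustrative only: the state labels are made up, and "degree entropy" is read here as the Shannon entropy of the out-degree distribution, which is one plausible interpretation and may not match the paper's definition.

```python
from collections import Counter
from math import log2

def transition_counts(states):
    """Count directed transitions between consecutive states in a chain."""
    return Counter(zip(states, states[1:]))

def self_loops(edges):
    """States that transition directly to themselves."""
    return {a for (a, b) in edges if a == b}

def two_cycles(edges):
    """Unordered pairs {a, b} connected by edges in both directions."""
    return {frozenset((a, b)) for (a, b) in edges if a != b and (b, a) in edges}

def degree_entropy(edges):
    """Shannon entropy (bits) of the out-degree distribution.

    Out-degree is the number of distinct successors of a state; states with
    no outgoing edge are ignored. This definition is an assumption.
    """
    successors = {}
    for (a, b) in edges:
        successors.setdefault(a, set()).add(b)
    degree_hist = Counter(len(s) for s in successors.values())
    total = sum(degree_hist.values())
    return -sum((n / total) * log2(n / total) for n in degree_hist.values())

# Hypothetical sequence of skeleton labels visited by one mutation chain.
states = ["A", "B", "A", "B", "A", "A", "C", "A", "B"]
edges = transition_counts(states)
print("self-loops:", sorted(self_loops(edges)))
print("2-cycles:", sorted(sorted(pair) for pair in two_cycles(edges)))
print("degree entropy (bits):", round(degree_entropy(edges), 3))
```

A chain dominated by self-loops and 2-cycles, as the review describes, would show most of its transition mass on a handful of such edges.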

However, several methodological limitations merit attention. The DSL is small and Lisp-like—quite distant from the Python/general-purpose code that dominates LLM training data. This raises legitimate questions about whether convergence would be more or less severe in languages better represented in pretraining corpora. The analysis is purely genotypic, meaning structurally convergent programs might still exhibit meaningful behavioral diversity. The model sensitivity experiment lacks replication (single runs per model-prompt pair), weakening model-level claims. The chain length of 300 steps, while reasonable, leaves open whether longer chains might eventually escape attractors.

Potential Impact

The practical implications are significant for a growing community. Multiple high-profile systems (FunSearch, AlphaEvolve, ELM, LLaMEA, ReEvo) use LLMs as mutation operators, and this work provides a mechanistic explanation for why diversity-preservation mechanisms have been empirically necessary. The finding that prompt design dramatically modulates convergence—with semantically similar prompts producing vastly different exploration profiles—is directly actionable for practitioners.

The paper could influence system design in several ways: (1) motivating systematic prompt-level mutation-chain analysis before deploying LLM-GP systems; (2) informing the design of diversity maintenance mechanisms by characterizing the specific nature of convergence (structural vs. terminal); (3) guiding model selection, given that reasoning-enabled variants consistently show less convergence.

The connection to broader iterative LLM phenomena (model collapse, paraphrasing attractors, image generation loops) positions this work as contributing to a larger theoretical understanding of LLM behavior under recursive self-application. This cross-domain relevance amplifies its potential influence.

Timeliness & Relevance

This paper is exceptionally timely. LLM-driven evolutionary computation is experiencing explosive growth, with systems like AlphaEvolve and FunSearch achieving prominent results. Yet the fundamental properties of LLMs as variation operators remain poorly understood. This work addresses a critical gap—most prior work evaluates LLM mutation operators only within optimization loops, making it impossible to separate operator bias from selection effects. The isolation of the mutation operator's intrinsic dynamics fills an important conceptual void.

The paper also arrives at a moment when the community is grappling with reproducibility and reliability of LLM-driven systems, making mechanistic understanding of their components particularly valuable.

Strengths

1. Clean experimental isolation: Removing selection pressure to study the operator in isolation is methodologically elegant and enables causal attribution of convergence to the LLM itself.

2. Multi-level analysis: The program/skeleton decomposition reveals that convergence is primarily structural, with surface variation creating an illusion of diversity—a nuanced insight that aggregate metrics would miss.

3. Comprehensive sensitivity analysis: Varying prompts, models, and replications systematically demonstrates robustness while identifying modulating factors.

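4. Strong baseline comparison: The GP subtree mutation baseline sustains exploration in the same DSL (~270 unique programs and ~143 unique skeletons per 300-step chain), which rules out a small search space as the cause of convergence.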

5. Excellent visualization: Transition graphs and Levenshtein heatmaps make complex dynamics intuitive and reveal temporal fine structure (attractor hopping, checkered cycling patterns).

6. Reproducibility: Code is publicly available; the experimental setup is clearly documented with all prompt variants listed.
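
The "checkered cycling patterns" mentioned under item 5 follow directly from how a pairwise distance matrix behaves on a cycling chain. The sketch below assumes distances are computed on raw program strings and uses a made-up chain that alternates between two programs; the paper may compute distances over a different representation.

```python
def levenshtein(a, b):
    """Edit distance between two sequences via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def distance_matrix(chain):
    """Pairwise edit distances between every two steps of a chain."""
    return [[levenshtein(p, q) for q in chain] for p in chain]

# Hypothetical chain trapped in a 2-cycle.
chain = ["(fd 1)", "(rt 90)", "(fd 1)", "(rt 90)"]
matrix = distance_matrix(chain)
for row in matrix:
    print(row)
```

Entries are zero wherever two steps hold the same program and positive elsewhere, so a chain that alternates between two programs yields a checkerboard when the matrix is drawn as a heatmap.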

Limitations

1. Limited DSL scope: The constrained Lisp-like DSL may not represent the dynamics in richer programming environments where LLMs have stronger priors.

2. No behavioral analysis: Structural convergence may or may not imply behavioral convergence; this remains uninvestigated.

3. No interaction with selection: The paper explicitly isolates mutation from selection, but the practical question of how selection interacts with these attractor dynamics remains open.

4. Temperature fixed at 1.0: Temperature, a natural control variable for exploration, is held constant and never varied.

5. Workshop paper scope: Some experiments (particularly model sensitivity) lack sufficient replication for strong statistical claims.

Overall Assessment

This is a well-conceived empirical study that identifies and characterizes a fundamental phenomenon, structural convergence under iterated LLM mutation, with clear implications for the design of LLM-driven evolutionary systems. While the scope is appropriately bounded for a workshop paper and several extensions are needed (richer languages, behavioral analysis, selection interaction), the core finding is robust, clearly communicated, and immediately relevant to a rapidly growing research area. The dynamical systems framing provides a useful conceptual lens that could seed a productive line of follow-up research.

Rating: 6.8/10
Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8.5

Generated Jun 5, 2026

Comparison History (17)

vs. Off-Policy Evaluation with Strategic Agents via Local Disclosure
claude-opus-4.6 · 6/8/2026

Paper 2 addresses a fundamental and timely question about LLM-driven program evolution, revealing an intrinsic convergence bias in LLM mutations. This finding has broad implications across multiple rapidly growing fields—LLM-based code generation, automated program synthesis, evolutionary computation, and AI-driven scientific discovery. The robustness of the finding across models and prompts makes it highly generalizable. Paper 1, while methodologically sound, addresses a more niche intersection of OPE and strategic behavior with restrictive assumptions (conditional log-normal distribution). Paper 2's insights are more likely to influence a wider research community given the current centrality of LLMs.

vs. Online Pandora's Box for Contextual LLM Cascading
gpt-5.2 · 6/8/2026

Paper 2 has higher potential impact: it introduces a timely, practically motivated online learning framework for LLM API cascading with a novel output-mediated feedback structure, and provides methodological rigor via a principled policy (GMM + UCB) with provable regret bounds. Its applications (adaptive model selection, cost-quality tradeoffs) are immediate and broad across ML systems, bandits/online decision-making, and operations research. Paper 1 is insightful and novel diagnostically for LLM-driven program evolution, but is more specialized and primarily descriptive, with narrower direct real-world deployment implications and less formal theoretical grounding.

vs. Formal Semantics for Agentic Tool Protocols: A Process Calculus Approach
gpt-5.2 · 6/6/2026

Paper 2 has higher estimated impact: it introduces a formal semantics and verification framework for widely used agent-tool protocols (SGD and MCP), proves bisimilarity under a mapping, identifies expressivity gaps, and proposes principled, type-theoretic extensions (MCP+) with isomorphism proofs. This is novel, methodologically rigorous, timely for LLM agent safety, and broadly applicable across PL, formal methods, security, and AI systems engineering. Paper 1 offers valuable empirical insight into LLM mutation dynamics, but its impact is narrower and less foundational than a general verification calculus for agent protocols.

vs. Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety
gemini-3.1 · 6/5/2026

Paper 2 uncovers a fundamental behavioral limitation of LLMs (convergence to attractor regions) in program evolution, which broadly impacts the highly active fields of LLM-based code generation, evolutionary algorithms, and open-ended exploration. Paper 1, while highly valuable for autonomous driving safety, is more applied and regulatory-focused, mapping existing XAI methods to safety standards rather than revealing a novel underlying scientific phenomenon.

vs. From Reward-Hack Activations to Agentic Risk States: Context-Calibrated Mechanistic Monitoring in LLM Agents
claude-opus-4.6 · 6/5/2026

Paper 1 identifies a fundamental and previously uncharacterized limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for evolutionary computation, program synthesis, and open-ended search. The finding is robust across models and prompts, methodologically rigorous with comparisons to classical GP, and challenges core assumptions in the rapidly growing field of LLM-guided evolutionary algorithms. Paper 2 addresses important AI safety monitoring but is more incremental, combining known techniques (activation probing, entropy, steering) in a narrower agentic setting with less generalizable insights.

vs. FIDES: Faithful Inference via Deep Evidence Signals for Retrieval-Memory Conflict in RAG
gpt-5.2 · 6/5/2026

Paper 1 introduces a broadly applicable, training-free decoding method (FIDES) that targets a central, timely failure mode in RAG—retrieval vs. parametric-memory conflict—using token-level adaptive intervention from multiple internal signals, and demonstrates consistent gains across many benchmarks and model scales up to 70B. This combination of methodological novelty, immediate deployability, and wide relevance to LLM reliability makes its likely impact higher. Paper 2 offers an important diagnostic finding about convergence in LLM-driven program evolution, but it is more domain-specific and primarily descriptive, with less direct, general-purpose intervention.

vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
gemini-3.1 · 6/5/2026

Paper 1 identifies a fundamental limitation (bias toward structural homogeneity and convergence) in LLM-driven program evolution. This insight has broad, cross-disciplinary implications for AI, evolutionary algorithms, and open-ended exploration, impacting how researchers design LLM-based optimization systems. Paper 2 presents a strong, practical improvement for autonomous driving world models, but its impact is relatively confined to the specialized domain of end-to-end autonomous driving systems compared to the broader theoretical and methodological relevance of Paper 1.

vs. LLM-WikiRace Benchmark: How Far Can LLMs Plan over Real-World Knowledge Graphs?
gemini-3.1 · 6/5/2026

Paper 1 uncovers a fundamental limitation (structural convergence) in LLM-driven program evolution, offering deep theoretical insights critical for the future of AI self-improvement and open-ended exploration. In contrast, Paper 2 introduces a useful but narrower benchmark based on a known game (Wikirace), which, while highlighting planning deficits, provides less novelty and long-term fundamental scientific impact.

vs. The Self-Correction Illusion: LLMs Correct Others but Not Themselves
gpt-5.2 · 6/5/2026

Paper 2 offers a more novel and broadly relevant causal finding: self-correction failures largely arise from chat-template role labeling, isolated via byte-identical claims and verified controls. It is timely for agentic LLM design, has immediate real-world application (a training-free prompt-structure intervention), and generalizes across multiple model families and domains with strong statistical evidence. Paper 1 is insightful and methodologically solid but is narrower (DSL program evolution setting) and its impact is more specialized to LLM-guided genetic programming, whereas Paper 2 affects evaluation protocols, safety, reliability, and interface design across many LLM applications.

vs. Unsupervised Skill Discovery for Agentic Data Analysis
claude-opus-4.6 · 6/5/2026

Paper 2 addresses a fundamental question about LLM-driven program evolution that has broad implications across multiple fields (evolutionary computation, program synthesis, AI-driven search). Its finding that LLMs exhibit systematic convergence bias is a novel, rigorous empirical contribution that challenges assumptions underlying many LLM-based optimization systems (e.g., FunSearch, EvoPrompting). This insight is widely applicable and timely. Paper 1, while showing solid improvements on data analysis benchmarks, is more incremental and narrowly focused on a specific application domain with a complex multi-component framework.

vs. Position Paper: Post-Solve Robustness in Decision Engines: Feasible Regions and Smoothness Under Perturbations
gemini-3.1 · 6/5/2026

Paper 1 presents empirical findings on a highly timely and broadly relevant topic: the limitations of LLMs in program evolution. Identifying 'structural homogeneity' and 'attractor regions' directly impacts the rapidly growing fields of AI, program synthesis, and open-ended exploration. Paper 2, while addressing a crucial industrial problem in optimization, is a position paper with a narrower focus on MILP robustness, making Paper 1's empirical contributions likely to garner broader and more immediate scientific attention.

vs. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
claude-opus-4.6 · 6/5/2026

Paper 2 reveals a fundamental limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for the rapidly growing fields of LLM-based code generation, automated program synthesis, and evolutionary computation. This finding is highly novel, methodologically rigorous (controlled experiments across models, prompts, with GP baselines), and challenges core assumptions underlying many LLM-augmented search/optimization systems. Paper 1, while practically useful for UI/UX evaluation, addresses a narrower application domain with incremental improvements over existing LLM-based evaluation approaches.

vs. Entropy-Based Evaluation of AI Agents: A Lightweight Framework for Measuring Behavioral Patterns
gemini-3.1 · 6/5/2026

Paper 2 reveals a fundamental limitation in LLM-driven program evolution—structural convergence and attractor regions—which challenges current assumptions about LLM creativity in open-ended search. Uncovering these intrinsic biases provides deeper theoretical insights and will significantly impact how future evolutionary algorithms and automated programming systems are designed. While Paper 1 offers a useful and practical evaluation framework, Paper 2 provides a more profound scientific discovery regarding model behavior and generalization limits.

vs. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents
gpt-5.2 · 6/5/2026

Paper 1 likely has higher scientific impact due to stronger novelty and broader conceptual relevance: it uncovers and quantifies an intrinsic convergence/attractor bias in LLM-driven program mutation, contrasted against a classical GP operator, yielding a generally applicable insight for LLM-based search, program synthesis, and open-ended evolution. Its methodological framing (mutation chains, cycle analysis, robustness across prompts/models) targets a foundational limitation that could influence many downstream systems. Paper 2 is timely and application-relevant for agent safety, but is closer to an engineering framework/finetuning recipe whose impact may be narrower and more benchmark-dependent.

vs. Fix the Mind, Not the Move: Interpretable AI Assistance via Knowledge-Gap Localization
gemini-3.1 · 6/5/2026

Paper 2 offers a broader paradigm shift in human-AI collaboration by moving from behavioral correction to cognitive intervention ('fixing the mind'). Its applications span AI tutoring, assistive technologies, and cognitive modeling. The inclusion of zero-shot compositional generalization and a successful user study demonstrates strong methodological rigor and high real-world applicability. While Paper 1 provides valuable insights into LLM mutation dynamics, its impact is largely confined to the niche fields of genetic programming and automated code generation.

vs. SoCRATES: Towards Reliable Automated Evaluation of Proactive LLM Mediation across Domains and Socio-cognitive Variations
gpt-5.2 · 6/5/2026

Paper 1 is likely to have higher impact: it introduces a broadly useful benchmark and evaluation methodology (multi-domain, socio-cognitive variation, topic-localized scoring) with validated human alignment, directly addressing a major, timely gap in LLM evaluation for social/interactive settings. Its applications span AI evaluation, HCI, computational social science, and responsible AI, and it produces actionable diagnostics across adaptation axes. Paper 2 is novel and rigorous, but its scope is narrower (LLM program mutation dynamics in a DSL) and its immediate cross-field and real-world applicability is more limited.

vs. Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills
claude-opus-4.6 · 6/5/2026

Trace2Skill presents a practical framework with demonstrated effectiveness across multiple domains, showing strong quantitative improvements (up to 57.65 percentage points) and transferability across model scales and families. It addresses a timely, high-demand problem in LLM agent skill acquisition with broad applicability. Paper 2 provides valuable analytical insights into LLM mutation convergence, but its impact is more niche, primarily relevant to the LLM-driven program evolution/genetic programming community. While Paper 2 identifies an important limitation, Paper 1 offers a constructive solution to a widely-felt need, giving it broader and more immediate scientific impact.