Mutation Without Variation: Convergence Dynamics in LLM-Driven Program Evolution
Can Gurkan, Forrest Stonedahl, Uri Wilensky
Abstract
When an LLM repeatedly mutates a program, does it explore new forms or circle back to the same ones? We study this question by analyzing LLM-driven mutation chains in the absence of selection pressure within a domain-specific language, varying prompt design, model family, and stochastic replication. We find that LLM-based mutation consistently converges toward restricted attractor regions in program space. Convergence is especially severe at the structural level: in 87% of chains, over 93% of mutations revisit a previously seen structural form, with most variation confined to terminal substitutions within recurring templates. Cycle analysis reveals short cycles and self-loops dominating the transition structure. The rate of convergence varies with prompt wording and model choice, but the phenomenon is robust across conditions. A classical GP subtree mutation operator does not exhibit comparable convergence, suggesting that the effect is intrinsic to the LLM mutation pipeline. These findings reveal a tension at the heart of LLM-driven program evolution: the same capabilities that enable semantics-aware program transformation also carry a systematic bias toward structural homogeneity that must be accounted for if such systems are to sustain open-ended exploration. Source code is available at https://github.com/can-gurkan/lmca.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper investigates a fundamental but largely overlooked question in the rapidly growing field of LLM-driven evolutionary computation: what happens when LLMs are used as mutation operators without selection pressure? The authors demonstrate that LLM-based mutation consistently converges toward restricted "attractor regions" in program space, with 87% of chains showing over 93% of mutations revisiting previously seen structural forms. The key insight is the separation between surface-level variation (terminal substitutions) and structural diversity (skeleton-level convergence), revealing that apparent diversity often masks deep structural homogeneity.
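The program/skeleton distinction can be made concrete with a minimal sketch. It assumes, purely for illustration, that programs are S-expressions in which the first symbol of each list is an operator and every other atom is a terminal; the paper's actual DSL grammar is not reproduced here, and the names `skeleton_of` and `revisit_rate` are hypothetical.

```python
import re

def tokenize(src):
    # Split an S-expression string into parentheses and atoms.
    return re.findall(r"[()]|[^\s()]+", src)

def parse(tokens):
    # Recursive-descent parse into nested lists; consumes tokens from the front.
    tok = tokens.pop(0)
    if tok == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return tok

def skeleton(node, hole="_"):
    # Assumption: the head of each list is an operator and is kept;
    # every other atom is treated as a terminal and replaced by a hole.
    if isinstance(node, list):
        if not node:
            return []
        head = node[0] if isinstance(node[0], str) else skeleton(node[0], hole)
        return [head] + [skeleton(child, hole) for child in node[1:]]
    return hole

def to_string(node):
    if isinstance(node, list):
        return "(" + " ".join(to_string(c) for c in node) + ")"
    return node

def skeleton_of(src):
    return to_string(skeleton(parse(tokenize(src))))

def revisit_rate(chain):
    # Fraction of mutation steps (every element after the first) whose
    # skeleton already appeared earlier in the chain.
    seen, revisits = set(), 0
    for i, prog in enumerate(chain):
        sk = skeleton_of(prog)
        if i > 0 and sk in seen:
            revisits += 1
        seen.add(sk)
    return revisits / max(len(chain) - 1, 1)

chain = ["(fd 1)", "(fd 2)", "(rt 90)", "(fd 3)"]
print(skeleton_of("(if (> x 3) (fd 1) (rt 90))"))  # (if (> _ _) (fd _) (rt _))
print(len(set(chain)), len({skeleton_of(p) for p in chain}))  # 4 2
```

In this toy chain all four programs are distinct, yet only two skeletons occur, which is the kind of gap between surface variation and structural diversity the assessment describes.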
The paper frames LLM-driven mutation as a dynamical system, borrowing concepts from attractor theory to characterize the behavior. This framing is genuinely novel in the GP/LLM intersection—while convergence has been noted anecdotally in LLM-driven evolutionary systems, no prior work has systematically isolated and characterized the mutation operator's intrinsic bias by removing selection pressure entirely.
Methodological Rigor
The experimental design is thoughtful and well-controlled. Three experiments systematically vary prompt design (50 variants), stochastic replication (30 independent runs per condition), and model family (7 models across 3 providers). The use of a constrained, strongly typed DSL is a deliberate methodological choice that enables precise structural analysis through the program/skeleton dual representation, though it also limits generalizability.
The inclusion of a classical GP subtree mutation baseline is critical and well-executed—it demonstrates that the DSL itself supports sustained exploration (~270 unique programs, ~143 unique skeletons per 300-step chain), confirming that convergence is a property of the LLM operator, not the search space.
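For readers unfamiliar with the baseline, the following is a rough sketch of classical subtree mutation applied as a selection-free chain. The primitive set, depth limit, and growth probability are hypothetical choices made for illustration, and the sketch ignores the strong typing of the paper's DSL.

```python
import random

# Hypothetical primitive set (name -> arity); not the paper's DSL.
FUNCTIONS = {"add": 2, "mul": 2, "neg": 1}
TERMINALS = ["x", "y", "1", "2"]

def grow(rng, depth):
    # Grow a random tree: stop at a terminal when depth runs out or by chance.
    if depth == 0 or rng.random() < 0.3:
        return rng.choice(TERMINALS)
    name = rng.choice(sorted(FUNCTIONS))
    return [name] + [grow(rng, depth - 1) for _ in range(FUNCTIONS[name])]

def paths(tree, prefix=()):
    # Yield the index path to every node; the root has the empty path.
    # Index 0 of a list holds the function name, so children start at 1.
    yield prefix
    if isinstance(tree, list):
        for i, child in enumerate(tree[1:], start=1):
            yield from paths(child, prefix + (i,))

def replace_at(tree, path, new):
    # Return a copy of tree with the node at path replaced by new.
    if not path:
        return new
    copy = list(tree)
    copy[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return copy

def subtree_mutate(tree, rng, max_depth=3):
    # Pick a node uniformly at random and replace it with a fresh random subtree.
    target = rng.choice(list(paths(tree)))
    return replace_at(tree, target, grow(rng, max_depth))

def run_chain(start, steps, seed=0):
    # Mutation chain without selection: each program is the mutant of the previous one.
    rng = random.Random(seed)
    chain = [start]
    for _ in range(steps):
        chain.append(subtree_mutate(chain[-1], rng))
    return chain

chain = run_chain(["add", "x", "y"], steps=300)
print(len({repr(p) for p in chain}), "unique programs in", len(chain), "steps")
```

Because each replacement subtree is drawn independently of the program's history, nothing in this operator pulls the chain back toward earlier forms, which is why it serves as a useful contrast to the LLM operator.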
The analysis toolkit is comprehensive: cumulative unique counts, transition graphs, cycle analysis, degree entropy, and Levenshtein distance heatmaps. The cycle analysis revealing dominance of 2-cycles and self-loops is particularly compelling and connects nicely to Wang et al.'s findings in paraphrasing dynamics.
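As an illustration of the transition-graph measures, the sketch below counts self-loops and 2-cycles in a chain of states (for example, skeleton strings) and computes one plausible reading of degree entropy. The paper's exact definitions are not given in this assessment, so the definitions here, in particular the entropy over the out-degree distribution, are assumptions.

```python
import math
from collections import Counter, defaultdict

def transition_counts(chain):
    # One directed edge (a, b) per consecutive pair of states in the chain.
    return Counter(zip(chain, chain[1:]))

def cycle_summary(chain):
    edges = transition_counts(chain)
    # Self-loop traversals: steps where the state did not change.
    self_loops = sum(n for (a, b), n in edges.items() if a == b)
    # 2-cycles: unordered pairs of distinct states with edges in both directions.
    two_cycles = sum(1 for (a, b) in edges if a < b and (b, a) in edges)
    return {"self_loops": self_loops, "two_cycles": two_cycles}

def out_degree_entropy(chain):
    # Assumed definition: Shannon entropy (in bits) of the empirical
    # distribution of out-degrees, where a node's out-degree is its
    # number of distinct successors.
    successors = defaultdict(set)
    for a, b in zip(chain, chain[1:]):
        successors[a].add(b)
    degrees = Counter(len(s) for s in successors.values())
    total = sum(degrees.values())
    return -sum((n / total) * math.log2(n / total) for n in degrees.values())

chain = ["A", "B", "A", "B", "B", "C", "A"]
print(cycle_summary(chain))  # {'self_loops': 1, 'two_cycles': 1}
print(round(out_degree_entropy(chain), 4))  # 0.9183
```

A chain dominated by self-loops and 2-cycles will show high counts in `cycle_summary` relative to its length, while an exploratory chain visits mostly new states and produces few repeated edges.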
However, several methodological limitations merit attention. The DSL is small and Lisp-like—quite distant from the Python/general-purpose code that dominates LLM training data. This raises legitimate questions about whether convergence would be more or less severe in languages better represented in pretraining corpora. The analysis is purely genotypic, meaning structurally convergent programs might still exhibit meaningful behavioral diversity. The model sensitivity experiment lacks replication (single runs per model-prompt pair), weakening model-level claims. The chain length of 300 steps, while reasonable, leaves open whether longer chains might eventually escape attractors.
Potential Impact
The practical implications are significant for a growing community. Multiple high-profile systems (FunSearch, AlphaEvolve, ELM, LLaMEA, ReEvo) use LLMs as mutation operators, and this work provides a mechanistic explanation for why diversity-preservation mechanisms have been empirically necessary. The finding that prompt design dramatically modulates convergence—with semantically similar prompts producing vastly different exploration profiles—is directly actionable for practitioners.
The paper could influence system design in several ways: (1) motivating systematic prompt-level mutation-chain analysis before deploying LLM-GP systems; (2) informing the design of diversity maintenance mechanisms by characterizing the specific nature of convergence (structural vs. terminal); (3) guiding model selection, given that reasoning-enabled variants consistently show less convergence.
The connection to broader iterative LLM phenomena (model collapse, paraphrasing attractors, image generation loops) positions this work as contributing to a larger theoretical understanding of LLM behavior under recursive self-application. This cross-domain relevance amplifies its potential influence.
Timeliness & Relevance
This paper is exceptionally timely. LLM-driven evolutionary computation is experiencing explosive growth, with systems like AlphaEvolve and FunSearch achieving prominent results. Yet the fundamental properties of LLMs as variation operators remain poorly understood. This work addresses a critical gap—most prior work evaluates LLM mutation operators only within optimization loops, making it impossible to separate operator bias from selection effects. The isolation of the mutation operator's intrinsic dynamics fills an important conceptual void.
The paper also arrives at a moment when the community is grappling with reproducibility and reliability of LLM-driven systems, making mechanistic understanding of their components particularly valuable.
Strengths
1. Clean experimental isolation: Removing selection pressure to study the operator in isolation is methodologically elegant and enables causal attribution of convergence to the LLM itself.
2. Multi-level analysis: The program/skeleton decomposition reveals that convergence is primarily structural, with surface variation creating an illusion of diversity—a nuanced insight that aggregate metrics would miss.
3. Comprehensive sensitivity analysis: Varying prompts, models, and replications systematically demonstrates robustness while identifying modulating factors.
4. Strong baseline comparison: The GP subtree mutation baseline conclusively rules out DSL size as a confound.
5. Excellent visualization: Transition graphs and Levenshtein heatmaps make complex dynamics intuitive and reveal temporal fine structure (attractor hopping, checkered cycling patterns).
6. Reproducibility: Code is publicly available; the experimental setup is clearly documented with all prompt variants listed.
Limitations
1. Limited DSL scope: The constrained Lisp-like DSL may not represent the dynamics in richer programming environments where LLMs have stronger priors.
2. No behavioral analysis: Structural convergence may or may not imply behavioral convergence; this remains uninvestigated.
3. No interaction with selection: The paper explicitly isolates mutation from selection, but the practical question of how selection interacts with these attractor dynamics remains open.
4. Temperature fixed at 1.0: Temperature is a natural control variable for exploration that is not varied.
5. Workshop paper scope: Some experiments (particularly model sensitivity) lack sufficient replication for strong statistical claims.
Overall Assessment
This is a well-conceived empirical study that identifies and characterizes a fundamental phenomenon, structural convergence under iterated LLM mutation, with clear implications for the design of LLM-driven evolutionary systems. While the scope is appropriately bounded for a workshop paper and several extensions are needed (richer languages, behavioral analysis, selection interaction), the core finding is robust, clearly communicated, and immediately relevant to a rapidly growing research area. The dynamical systems framing provides a useful conceptual lens that could seed a productive line of follow-up research.
Generated Jun 5, 2026
Comparison History (17)
Paper 2 addresses a fundamental and timely question about LLM-driven program evolution, revealing an intrinsic convergence bias in LLM mutations. This finding has broad implications across multiple rapidly growing fields—LLM-based code generation, automated program synthesis, evolutionary computation, and AI-driven scientific discovery. The robustness of the finding across models and prompts makes it highly generalizable. Paper 1, while methodologically sound, addresses a more niche intersection of OPE and strategic behavior with restrictive assumptions (conditional log-normal distribution). Paper 2's insights are more likely to influence a wider research community given the current centrality of LLMs.
Paper 2 has higher potential impact: it introduces a timely, practically motivated online learning framework for LLM API cascading with a novel output-mediated feedback structure, and provides methodological rigor via a principled policy (GMM + UCB) with provable regret bounds. Its applications (adaptive model selection, cost-quality tradeoffs) are immediate and broad across ML systems, bandits/online decision-making, and operations research. Paper 1 is insightful and novel diagnostically for LLM-driven program evolution, but is more specialized and primarily descriptive, with narrower direct real-world deployment implications and less formal theoretical grounding.
Paper 2 has higher estimated impact: it introduces a formal semantics and verification framework for widely used agent-tool protocols (SGD and MCP), proves bisimilarity under a mapping, identifies expressivity gaps, and proposes principled, type-theoretic extensions (MCP+) with isomorphism proofs. This is novel, methodologically rigorous, timely for LLM agent safety, and broadly applicable across PL, formal methods, security, and AI systems engineering. Paper 1 offers valuable empirical insight into LLM mutation dynamics, but its impact is narrower and less foundational than a general verification calculus for agent protocols.
Paper 2 uncovers a fundamental behavioral limitation of LLMs (convergence to attractor regions) in program evolution, which broadly impacts the highly active fields of LLM-based code generation, evolutionary algorithms, and open-ended exploration. Paper 1, while highly valuable for autonomous driving safety, is more applied and regulatory-focused, mapping existing XAI methods to safety standards rather than revealing a novel underlying scientific phenomenon.
Paper 1 identifies a fundamental and previously uncharacterized limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for evolutionary computation, program synthesis, and open-ended search. The finding is robust across models and prompts, methodologically rigorous with comparisons to classical GP, and challenges core assumptions in the rapidly growing field of LLM-guided evolutionary algorithms. Paper 2 addresses important AI safety monitoring but is more incremental, combining known techniques (activation probing, entropy, steering) in a narrower agentic setting with less generalizable insights.
Paper 1 introduces a broadly applicable, training-free decoding method (FIDES) that targets a central, timely failure mode in RAG—retrieval vs. parametric-memory conflict—using token-level adaptive intervention from multiple internal signals, and demonstrates consistent gains across many benchmarks and model scales up to 70B. This combination of methodological novelty, immediate deployability, and wide relevance to LLM reliability makes its likely impact higher. Paper 2 offers an important diagnostic finding about convergence in LLM-driven program evolution, but it is more domain-specific and primarily descriptive, with less direct, general-purpose intervention.
Paper 1 identifies a fundamental limitation (bias toward structural homogeneity and convergence) in LLM-driven program evolution. This insight has broad, cross-disciplinary implications for AI, evolutionary algorithms, and open-ended exploration, impacting how researchers design LLM-based optimization systems. Paper 2 presents a strong, practical improvement for autonomous driving world models, but its impact is relatively confined to the specialized domain of end-to-end autonomous driving systems compared to the broader theoretical and methodological relevance of Paper 1.
Paper 1 uncovers a fundamental limitation (structural convergence) in LLM-driven program evolution, offering deep theoretical insights critical for the future of AI self-improvement and open-ended exploration. In contrast, Paper 2 introduces a useful but narrower benchmark based on a known game (Wikirace), which, while highlighting planning deficits, provides less novelty and long-term fundamental scientific impact.
Paper 2 offers a more novel and broadly relevant causal finding: self-correction failures largely arise from chat-template role labeling, isolated via byte-identical claims and verified controls. It is timely for agentic LLM design, has immediate real-world application (a training-free prompt-structure intervention), and generalizes across multiple model families and domains with strong statistical evidence. Paper 1 is insightful and methodologically solid but is narrower (DSL program evolution setting) and its impact is more specialized to LLM-guided genetic programming, whereas Paper 2 affects evaluation protocols, safety, reliability, and interface design across many LLM applications.
Paper 2 addresses a fundamental question about LLM-driven program evolution that has broad implications across multiple fields (evolutionary computation, program synthesis, AI-driven search). Its finding that LLMs exhibit systematic convergence bias is a novel, rigorous empirical contribution that challenges assumptions underlying many LLM-based optimization systems (e.g., FunSearch, EvoPrompting). This insight is widely applicable and timely. Paper 1, while showing solid improvements on data analysis benchmarks, is more incremental and narrowly focused on a specific application domain with a complex multi-component framework.
Paper 1 presents empirical findings on a highly timely and broadly relevant topic: the limitations of LLMs in program evolution. Identifying 'structural homogeneity' and 'attractor regions' directly impacts the rapidly growing fields of AI, program synthesis, and open-ended exploration. Paper 2, while addressing a crucial industrial problem in optimization, is a position paper with a narrower focus on MILP robustness, making Paper 1's empirical contributions likely to garner broader and more immediate scientific attention.
Paper 2 reveals a fundamental limitation of LLM-driven program evolution—systematic convergence toward structural attractors—which has broad implications for the rapidly growing fields of LLM-based code generation, automated program synthesis, and evolutionary computation. This finding is highly novel, methodologically rigorous (controlled experiments across models, prompts, with GP baselines), and challenges core assumptions underlying many LLM-augmented search/optimization systems. Paper 1, while practically useful for UI/UX evaluation, addresses a narrower application domain with incremental improvements over existing LLM-based evaluation approaches.
Paper 2 reveals a fundamental limitation in LLM-driven program evolution—structural convergence and attractor regions—which challenges current assumptions about LLM creativity in open-ended search. Uncovering these intrinsic biases provides deeper theoretical insights and will significantly impact how future evolutionary algorithms and automated programming systems are designed. While Paper 1 offers a useful and practical evaluation framework, Paper 2 provides a more profound scientific discovery regarding model behavior and generalization limits.
Paper 1 likely has higher scientific impact due to stronger novelty and broader conceptual relevance: it uncovers and quantifies an intrinsic convergence/attractor bias in LLM-driven program mutation, contrasted against a classical GP operator, yielding a generally applicable insight for LLM-based search, program synthesis, and open-ended evolution. Its methodological framing (mutation chains, cycle analysis, robustness across prompts/models) targets a foundational limitation that could influence many downstream systems. Paper 2 is timely and application-relevant for agent safety, but is closer to an engineering framework/finetuning recipe whose impact may be narrower and more benchmark-dependent.
Paper 2 offers a broader paradigm shift in human-AI collaboration by moving from behavioral correction to cognitive intervention ('fixing the mind'). Its applications span AI tutoring, assistive technologies, and cognitive modeling. The inclusion of zero-shot compositional generalization and a successful user study demonstrates strong methodological rigor and high real-world applicability. While Paper 1 provides valuable insights into LLM mutation dynamics, its impact is largely confined to the niche fields of genetic programming and automated code generation.
Paper 1 is likely to have higher impact: it introduces a broadly useful benchmark and evaluation methodology (multi-domain, socio-cognitive variation, topic-localized scoring) with validated human alignment, directly addressing a major, timely gap in LLM evaluation for social/interactive settings. Its applications span AI evaluation, HCI, computational social science, and responsible AI, and it produces actionable diagnostics across adaptation axes. Paper 2 is novel and rigorous, but its scope is narrower (LLM program mutation dynamics in a DSL) and its immediate cross-field and real-world applicability is more limited.
Trace2Skill presents a practical framework with demonstrated effectiveness across multiple domains, showing strong quantitative improvements (up to 57.65 percentage points) and transferability across model scales and families. It addresses a timely, high-demand problem in LLM agent skill acquisition with broad applicability. Paper 2 provides valuable analytical insights into LLM mutation convergence, but its impact is more niche, primarily relevant to the LLM-driven program evolution/genetic programming community. While Paper 2 identifies an important limitation, Paper 1 offers a constructive solution to a widely-felt need, giving it broader and more immediate scientific impact.