Beyond Static Evaluation: Co-Evolutionary Mechanisms for LLM-Driven Strategy Evolution in Adversarial Games

Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong

Jun 9, 2026 · arXiv:2606.10389v1

cs.AI

#2161 of 3489 · Artificial Intelligence

Tournament Score: 1370±45 (scale 1050 to 1800)

Win Rate: 48%

Rating: 5.8/10

Significance 6 · Rigor 5 · Novelty 5.5 · Clarity 7

Abstract

Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies -- including lookahead search and adaptive interception -- demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and corresponding evaluation codes developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

This paper addresses a genuine problem in applying LLM-driven code evolution to adversarial multi-agent games: the non-stationarity of the evaluation landscape. The authors propose three mechanisms—evaluator co-evolution (adding discovered champions to the opponent pool), hierarchical deep evaluation (replacing noisy 3-game scores with statistically reliable 20-game assessments), and weakness pressure (dynamically up-weighting difficult opponents)—integrated into FAMOU, a framework for evolving complete strategy code (500–1700 lines of Python) in a 3v3 maritime capture-the-flag domain.
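To make the three mechanisms concrete, here is a minimal Python sketch of how they could fit together in one evaluator. It is not FAMOU's implementation. The class name `CoEvoEvaluator`, the `play` callback, the two-tier screening rule, and the weight formula `1 + pressure * (1 - win_rate)` are all assumptions made for illustration; only the 3-game and 20-game budgets come from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: play(candidate, opponent) returns 1.0 for a win, 0.0 otherwise.
PlayFn = Callable[[str, str], float]


@dataclass
class CoEvoEvaluator:
    play: PlayFn
    opponents: List[str]
    quick_games: int = 3     # noisy screening budget (figure taken from the review)
    deep_games: int = 20     # deep-evaluation budget (figure taken from the review)
    pressure: float = 2.0    # assumed strength of weakness pressure
    weights: Dict[str, float] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.opponents = list(self.opponents)  # avoid mutating the caller's list
        for opp in self.opponents:
            self.weights.setdefault(opp, 1.0)

    def win_rates(self, candidate: str, games: int) -> Dict[str, float]:
        """Per-opponent win rate over `games` matches."""
        return {
            opp: sum(self.play(candidate, opp) for _ in range(games)) / games
            for opp in self.opponents
        }

    def score(self, candidate: str, games: int) -> float:
        """Weighted mean win rate across the current opponent pool."""
        rates = self.win_rates(candidate, games)
        total = sum(self.weights[o] for o in self.opponents)
        return sum(self.weights[o] * rates[o] for o in self.opponents) / total

    def evaluate(self, candidate: str, best_score: float) -> Tuple[float, bool]:
        """Hierarchical evaluation: cheap screen first, deep evaluation only if it passes."""
        quick = self.score(candidate, self.quick_games)
        if quick < best_score:
            return quick, False
        deep = self.score(candidate, self.deep_games)
        return deep, deep > best_score

    def apply_weakness_pressure(self, champion: str) -> None:
        """Up-weight the opponents the current champion struggles against."""
        for opp, rate in self.win_rates(champion, self.deep_games).items():
            self.weights[opp] = 1.0 + self.pressure * (1.0 - rate)

    def add_champion(self, champion: str) -> None:
        """Evaluator co-evolution: a promoted champion joins the opponent pool."""
        if champion not in self.weights:
            self.opponents.append(champion)
            self.weights[champion] = 1.0
```

In this sketch the opponent pool and its weights are the only evaluator state, so the non-stationarity the paper targets shows up as changes to `opponents` and `weights` between generations.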

The core insight—that evaluation design matters as much as or more than search design in adversarial LLM-driven evolution—is well-articulated and practically important. However, the individual mechanisms are fairly straightforward applications of well-established ideas: co-evolution traces back to Hillis (1990), opponent reweighting is a form of competitive fitness sharing (Rosin & Belew 1997), and hierarchical evaluation is essentially noise reduction through increased sampling. The novelty lies primarily in their combination and application to LLM code evolution, not in the mechanisms themselves.

2. Methodological Rigor

The experimental design has both strengths and notable weaknesses:

Strengths:

Controlled comparison across two backbone LLMs (DeepSeek-V4-Flash and Gemini-2.5-Flash) and three frameworks (FAMOU, OpenEvolve, ShinkaEvolve)

Statistical testing (Wilcoxon signed-rank, paired t-tests, bootstrap CIs) for main comparisons

Cross-evaluation matrix (6×6, 1,800 total games) that reveals pairwise dynamics beyond aggregate metrics

Inclusion of both seen and unseen opponents for generalization assessment
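For readers unfamiliar with the bootstrap confidence intervals mentioned above, the following sketch shows a percentile bootstrap on paired per-opponent score differences. The function name `paired_bootstrap_ci` and its defaults are assumptions, and the paper's exact resampling procedure is not described in this review, so treat this as a generic illustration rather than the authors' analysis.

```python
import random
import statistics
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for a - b.

    Each entry of `a` and `b` is the score of two methods against the same
    opponent, so resampling is done over the paired differences.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)
    # Resample the differences with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return statistics.fmean(diffs), lo, hi
```

If the interval excludes zero, the paired difference is unlikely to be a resampling artifact. With only a handful of opponents, however, the interval reflects opponent-to-opponent variation within one run, not run-to-run variation.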

Weaknesses:

Ablation experiments are single-run only, which the authors acknowledge but which substantially weakens the ablation conclusions. The p-values reported for ablations (e.g., p=0.084 for co-evolution removal, p=0.160 for weakness pressure) are from per-opponent comparisons within that single run, not from replicated experiments.

The non-additive interaction finding (individual drops sum to 0.304 vs. full-vanilla gap of 0.089) is intriguing but could simply reflect noise in single-run estimates rather than genuine synergistic effects.

The comparison is equal-iteration (T=400) rather than equal-compute. Since FAMOU incurs deep-evaluation overhead (each checkpoint takes ~10 hours), it spends more wall-clock time and more evaluation games than the baselines over the same 400 iterations, which leaves the baselines at a budget disadvantage. The authors note this but do not provide equal-compute comparisons.

The domain is restricted to a single game (MCTF 2026), limiting generalizability claims.

The Spearman correlation of ρ=0.11 for 3-game scores, which motivates deep evaluation, comes from preliminary experiments whose methodology is unclear.
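The last point, on few-game noise, can be explored with a toy simulation. The sketch below measures how well observed win rates rank a set of strategies by their true win probability as the number of games grows. Everything in it is an assumption for illustration: strategies are reduced to a fixed win probability drawn uniformly from [0.3, 0.7], games are independent coin flips, and the helper names are hypothetical. It does not attempt to reproduce the paper's ρ=0.11.

```python
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant input carries no rank information
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_rank_correlation(
    games: int, n_strategies: int = 30, trials: int = 200, seed: int = 0
) -> float:
    """Average Spearman correlation between true strength and observed win rate."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        true_p = [rng.uniform(0.3, 0.7) for _ in range(n_strategies)]
        observed = [
            sum(rng.random() < p for _ in range(games)) / games for p in true_p
        ]
        total += spearman(true_p, observed)
    return total / trials
```

Comparing `mean_rank_correlation(3)` with `mean_rank_correlation(20)` shows the direction of the effect under these assumptions. The size of the gap depends heavily on how spread out the true strengths are, which is exactly the detail the preliminary experiments leave unstated.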

3. Potential Impact

The paper's impact operates at several levels:

Immediate practical impact: The framework achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, demonstrating real-world transferability. This validates that LLM-driven code evolution can produce competitive strategies in adversarial settings.

Methodological template: The co-evolutionary evaluation approach could become a standard component in LLM-driven code evolution systems applied to adversarial or multi-agent domains. The principle that evaluation must co-evolve with solutions is general.

Evidence for LLM as mutation operator: The documentation of emergent tactical structures (H-DWA lookahead, K-Filter interception, A-Lock role locking) that were absent from seeds provides concrete evidence that LLMs can generate non-trivial algorithmic innovations. This is perhaps the most compelling contribution for the broader community.

However, the impact is somewhat limited by the single-domain evaluation and the relatively incremental nature of the individual mechanisms.

4. Timeliness & Relevance

The paper is highly timely. LLM-driven code evolution (FunSearch, ELM, ReEvo, OpenEvolve) is a rapidly growing area, but most work focuses on single-objective optimization problems with static evaluation. The extension to adversarial multi-agent settings with non-stationary evaluation landscapes addresses a current bottleneck. The connection to established co-evolutionary computation principles is appropriate and could help bridge the gap between the evolutionary computation and LLM communities.

5. Strengths & Limitations

Key Strengths:

Clear problem formulation: the "static evaluation becomes unreliable" problem is well-motivated with the Spearman correlation analysis

Practical validation through competition results

Documentation of emergent algorithmic innovations provides qualitative evidence beyond benchmark numbers

Punctuated equilibrium analysis (Figure 3) offers interesting insights into the dynamics of LLM-driven evolution

Code and evaluation available (reproducibility)

Notable Limitations:

Single game domain limits generalizability; the authors list RoboCup and StarCraft as future work

Single-run ablations weaken mechanism-level conclusions significantly

The 400-iteration budget may be insufficient (as the authors note), making it unclear whether baselines would catch up given more iterations

No comparison with RL-based approaches beyond the anecdotal mention that RL "failed to outperform handwritten heuristics"

The combined score metric (Eq. 5) with fixed 0.7/0.3 weighting is somewhat arbitrary

Computational cost (>30,000 LLM calls, ~10 hours per checkpoint) is substantial and not thoroughly analyzed relative to alternatives

The "vanilla" ablation (all mechanisms removed) still uses FAMOU's infrastructure, making it unclear how comparable it truly is to OpenEvolve/ShinkaEvolve
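The concern about the fixed 0.7/0.3 weighting can be checked with a simple sensitivity analysis. The sketch below assumes the combined score is a convex combination of two components, called `a` and `b` here because this review does not say what Eq. 5 combines. The strategy names and component values are hypothetical, chosen only to show that a ranking can flip near the chosen weight.

```python
from typing import Dict, List, Optional, Tuple

Components = Tuple[float, float]  # (component a, component b), both hypothetical


def combined(a: float, b: float, w: float = 0.7) -> float:
    """Convex combination w * a + (1 - w) * b (assumed form of the metric)."""
    return w * a + (1.0 - w) * b


def ranking(scores: Dict[str, Components], w: float) -> List[str]:
    """Strategy names sorted from best to worst combined score at weight w."""
    return sorted(scores, key=lambda name: combined(*scores[name], w), reverse=True)


def crossover_weight(x: Components, y: Components) -> Optional[float]:
    """Weight at which two strategies tie, or None if their gap never changes."""
    (ax, bx), (ay, by) = x, y
    denom = (ax - ay) + (by - bx)
    if denom == 0:
        return None
    return (by - bx) / denom


# Hypothetical component scores for two strategies.
example = {"X": (0.60, 0.40), "Y": (0.50, 0.62)}
```

With these made-up numbers, `X` ranks first at w = 0.7 and `Y` ranks first at w = 0.6, with the tie at w = 0.6875. Reporting rankings across a range of weights would show whether the paper's conclusions depend on the specific 0.7/0.3 split.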

Overall Assessment

This paper makes a solid applied contribution by demonstrating that co-evolutionary evaluation mechanisms improve LLM-driven code evolution in adversarial settings, validated through competition success. The individual mechanisms are well-motivated but not deeply novel, and the experimental evidence—while generally positive—is weakened by single-run ablations and single-domain evaluation. The emergent tactical innovations documented are compelling qualitative evidence for the potential of LLM-driven code evolution. The work is timely and addresses a real gap, but falls short of the rigor needed for strong mechanistic claims.

Rating: 5.8/10

Significance 6 · Rigor 5 · Novelty 5.5 · Clarity 7

Generated Jun 10, 2026

Comparison History (25)

Won vs. Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

Paper 1 demonstrates higher potential scientific impact through rigorous methodology and proven real-world applicability. It proposes actionable co-evolutionary mechanisms for LLM-driven discovery, validating them through comprehensive ablations and a 1st place win in a hardware competition. Its empirical grounding and open-source availability guarantee immediate utility in multi-agent systems. While Paper 2 presents a highly novel theoretical framework for AI alignment, its empirical approach (relying on linguistic mimicry) lacks the concrete architectural validation and immediate, demonstrable real-world transferability seen in Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. TreeSeeker: Tree-Structured Trial, Error, and Return in Deep Search

Paper 1 presents a highly innovative approach by combining LLM-driven code evolution with co-evolutionary mechanisms, demonstrating the discovery of nontrivial algorithmic innovations. Its scientific impact is significantly bolstered by its proven real-world transferability, evidenced by winning performances in a major hardware and simulation competition. While Paper 2 offers valuable improvements to LLM web search via tree-structured exploration, Paper 1 tackles a more fundamental challenge in AI self-improvement and multi-agent systems with exceptional methodological rigor and empirical success.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

Paper 1 introduces novel co-evolutionary mechanisms for LLM-driven strategy evolution in adversarial games, addressing a fundamental challenge (non-stationary evaluation landscapes) with broader applicability. It demonstrates real-world validation through competition success (1st place hardware, 3rd simulation at AAMAS 2026), shows emergent algorithmic innovations (lookahead search, adaptive interception), and contributes to the rapidly growing field of LLM-based code evolution. Paper 2, while solving a practical problem in BIM compliance checking, targets a narrower domain (AEC industry) with more incremental improvements (8.6% accuracy gain) and limited cross-field impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. Skill-Augmented AI Agents for Medical Research Analysis: An Exploratory Multi-Model Human Evaluation in an NSCLC Transcriptomic Biomarker Task

Paper 2 demonstrates significant methodological innovation in LLM-driven code evolution and validates its approach through strong, statistically reliable empirical results and real-world competition success (1st place in hardware round-robin). In contrast, Paper 1 presents an exploratory study with statistically non-significant results, high expert disagreement, and a primary conclusion that more research is needed. Paper 2's definitive algorithmic contributions, proven generalizability, and immediate applicability in multi-agent environments give it a substantially higher potential for scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Paper 2 introduces a comprehensive evaluation benchmark for autonomous agents, addressing critical flaws in current trajectory-opaque grading. Robust benchmarks typically yield broader scientific impact and higher citations than specific algorithmic applications, as they set the standard for evaluating future models across the community. While Paper 1 presents an innovative co-evolutionary method for multi-agent games with impressive competition results, Paper 2's focus on safety, robustness, and multi-modal workflows tackles a universally relevant and timely bottleneck in LLM development, giving it a much wider potential impact across the AI field.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. What Fits (Into Few Tokens) Doesn't Overfit: Compression and Generalization in ML Research Agents

Paper 1 has higher potential scientific impact because it offers a broadly applicable, falsifiable explanation for a central ML-science issue (why adaptive benchmark reuse often doesn’t overfit) using information-theoretic compression tests across diverse modalities and tasks. This targets methodology and interpretation across the whole ML research pipeline (including agentic research), potentially influencing benchmark design, evaluation norms, and theory. Paper 2 is a strong, timely systems contribution with clear competitive validation, but its impact is more domain-specific (adversarial game strategy evolution) and tied to a particular framework/task, with less general conceptual reach.

gpt-5.2 · Jun 10, 2026

Lost vs. When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models

Paper 2 addresses a fundamental and broadly impactful problem in AI safety—hidden failure modes in multi-turn reasoning models—that affects the entire LLM alignment community. Its novel CoT-Output 2x2 safety matrix framework introduces new conceptual vocabulary (context-injection failure, oversight paradox) with broad applicability across safety research. Paper 1, while strong and competition-validated, is more narrowly focused on adversarial game strategy evolution with domain-specific contributions. Paper 2's findings about reasoning unfaithfulness and the paradoxical effects of oversight have deeper implications for AI deployment safety, affecting a larger research community.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

Paper 2 offers a more broadly impactful contribution by identifying a precise, generalizable principle: LLMs serve as structural priors rather than optimizers for coupled MIMO controller tuning. It rigorously delineates when LLMs help versus when they don't, provides reproducible benchmarks, and addresses a fundamental industrial problem (control tuning) with practical constraints (on-premise, no plant model). The clarity of its negative results (LLMs add nothing for simple loops) strengthens its scientific value. Paper 1, while impressive in competition results, is more application-specific to a niche adversarial game domain with narrower transferability of insights.

claude-opus-4-6 · Jun 10, 2026

Won vs. A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

Paper 2 proposes highly novel co-evolutionary mechanisms for LLM-driven strategy evolution, advancing the frontier of multi-agent adversarial games. Its framework demonstrates significant algorithmic innovation, robust methodological rigor (including ablation studies), and proven real-world applicability by winning a recognized competition. In contrast, Paper 1 is primarily a reproduction and evaluation study of an existing model, offering valuable critical analysis but lacking the broad methodological innovations and transformative potential of Paper 2.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Paper 2 introduces novel co-evolutionary mechanisms for LLM-driven strategy evolution in adversarial games, addressing a fundamental challenge (shifting evaluation landscapes) with three well-defined mechanisms. It demonstrates real-world validation (competition placements), shows emergent algorithmic innovations, and has broader impact across AI, multi-agent systems, and evolutionary computation. Paper 1 applies existing techniques (LoRA, NEFTune) to a narrow domain (financial NER) with incremental improvements on a small dataset, representing solid engineering but limited novelty and narrower impact.

claude-opus-4-6 · Jun 10, 2026
