Haoran Li, Zengle Ge, Ziyang Zhang, Xiaomin Yuan, Yui Lo, Qianhui Liu, Bocheng An, Dongke Rong
Recent advances in LLM-driven code evolution have enabled automated discovery by iteratively generating and improving programs. However, applying these methods to adversarial multi-agent games introduces a fundamental challenge: the evaluation landscape shifts as strategies improve, causing fixed evaluators to become unreliable and evolution to stagnate. We propose three mechanisms to address this challenge: evaluator co-evolution, which incorporates discovered champions into the opponent pool; hierarchical deep evaluation, which replaces noisy few-game scores with statistically reliable assessments; and weakness pressure, which dynamically up-weights the most difficult opponents to break through plateaus. We implement these mechanisms within FAMOU, a framework built upon the same foundation-model code-evolution paradigm as OpenEvolve and ShinkaEvolve. On the MCTF 2026 3v3 maritime capture-the-flag task, FAMOU consistently outperforms both baselines under two backbone LLMs, achieving the highest combined score (0.526) and the best generalization to unseen opponents (61.7% win rate), while ablations confirm that each mechanism contributes to performance. Notably, the LLM mutation process generates tactical structures entirely absent from the seed strategies, including lookahead search and adaptive interception, demonstrating that code-level evolution can produce nontrivial algorithmic innovations in adversarial settings. The FAMOU-evolved strategy further achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, validating its real-world transferability. The optimized implementation and the corresponding evaluation code developed through our evolutionary process are available at: https://github.com/1xiangliu1/FAMOU-CoEvo
This paper addresses a genuine problem in applying LLM-driven code evolution to adversarial multi-agent games: the non-stationarity of the evaluation landscape. The authors propose three mechanisms—evaluator co-evolution (adding discovered champions to the opponent pool), hierarchical deep evaluation (replacing noisy 3-game scores with statistically reliable 20-game assessments), and weakness pressure (dynamically up-weighting difficult opponents)—integrated into FAMOU, a framework for evolving complete strategy code (500–1700 lines of Python) in a 3v3 maritime capture-the-flag domain.
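To make the three mechanisms concrete, the following is a minimal Python sketch of how they could fit together. It is not the FAMOU implementation: the class `CoEvolvingEvaluator`, the `play` callback, the linear weighting rule, and the `promote_threshold` and `pressure` values are all assumptions introduced for illustration. Only the 3-game and 20-game budgets come from the description above.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical callback: play(candidate, opponent, rng) -> True if the candidate wins.
PlayFn = Callable[[str, str, random.Random], bool]


def win_rate(play: PlayFn, candidate: str, opponent: str,
             games: int, rng: random.Random) -> float:
    """Fraction of `games` that the candidate wins against one opponent."""
    wins = sum(play(candidate, opponent, rng) for _ in range(games))
    return wins / games


@dataclass
class CoEvolvingEvaluator:
    play: PlayFn
    opponents: List[str]
    quick_games: int = 3            # noisy screening budget
    deep_games: int = 20            # deeper re-evaluation budget
    promote_threshold: float = 0.5  # assumed cutoff for deep evaluation
    pressure: float = 2.0           # assumed strength of weakness pressure

    def weights(self, rates: Dict[str, float]) -> Dict[str, float]:
        """Weakness pressure: opponents the candidate loses to get more weight."""
        raw = {o: 1.0 + self.pressure * (1.0 - r) for o, r in rates.items()}
        total = sum(raw.values())
        return {o: w / total for o, w in raw.items()}

    def score(self, candidate: str, games: int, rng: random.Random) -> float:
        """Weighted mean win rate over the current opponent pool."""
        rates = {o: win_rate(self.play, candidate, o, games, rng)
                 for o in self.opponents}
        w = self.weights(rates)
        return sum(w[o] * rates[o] for o in self.opponents)

    def evaluate(self, candidate: str, rng: random.Random) -> float:
        """Hierarchical evaluation: cheap screen first, deep games if promising."""
        quick = self.score(candidate, self.quick_games, rng)
        if quick < self.promote_threshold:
            return quick
        return self.score(candidate, self.deep_games, rng)

    def add_champion(self, champion: str) -> None:
        """Evaluator co-evolution: a discovered champion joins the opponent pool."""
        if champion not in self.opponents:
            self.opponents.append(champion)
```

As a sanity check on the weighting rule: with two opponents, one the candidate always beats and one it always loses to, the normalized weights are 0.25 and 0.75, so the weighted score is 0.25 rather than the unweighted mean of 0.5. The harder opponent dominates the fitness signal, which is the intended effect of weakness pressure.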
The core insight—that evaluation design matters as much as or more than search design in adversarial LLM-driven evolution—is well-articulated and practically important. However, the individual mechanisms are fairly straightforward applications of well-established ideas: co-evolution traces back to Hillis (1990), opponent reweighting is a form of competitive fitness sharing (Rosin & Belew 1997), and hierarchical evaluation is essentially noise reduction through increased sampling. The novelty lies primarily in their combination and application to LLM code evolution, not in the mechanisms themselves.
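The noise-reduction point can be quantified with a short calculation. The sketch below assumes each game is an independent win/loss trial with a fixed win probability, a simplification that the paper does not state; the helper name `win_rate_standard_error` is hypothetical. Under that assumption the standard error of an estimated win rate is sqrt(p(1-p)/n).

```python
import math


def win_rate_standard_error(p: float, games: int) -> float:
    """Standard error of a win-rate estimate from `games` independent trials."""
    return math.sqrt(p * (1.0 - p) / games)


# Worst case for noise is p = 0.5; compare the 3-game and 20-game budgets.
for games in (3, 20):
    se = win_rate_standard_error(0.5, games)
    print(f"{games:>2} games: standard error = {se:.3f}")
# prints:
#  3 games: standard error = 0.289
# 20 games: standard error = 0.112
```

At an even matchup, a 3-game score carries a standard error of roughly 29 percentage points, so it can barely separate a 50% strategy from an 80% one, while 20 games reduce the error to roughly 11 points. This supports the reading that the mechanism is primarily increased sampling.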
The experimental design has both strengths and notable weaknesses:
The paper's impact operates at several levels:
Immediate practical impact: The framework achieved 1st place in the hardware round-robin and 3rd in simulation at the AAMAS 2026 MCTF Competition, demonstrating real-world transferability. This validates that LLM-driven code evolution can produce competitive strategies in adversarial settings.
Methodological template: The co-evolutionary evaluation approach could become a standard component in LLM-driven code evolution systems applied to adversarial or multi-agent domains. The principle that evaluation must co-evolve with solutions is general.
Evidence for the LLM as a mutation operator: The documentation of emergent tactical structures (H-DWA lookahead, K-Filter interception, A-Lock role locking) that were absent from the seeds provides concrete evidence that LLMs can generate non-trivial algorithmic innovations. This is perhaps the most compelling contribution for the broader community.
However, the impact is somewhat limited by the single-domain evaluation and the relatively incremental nature of the individual mechanisms.
The paper is highly timely. LLM-driven code evolution (FunSearch, ELM, ReEvo, OpenEvolve) is a rapidly growing area, but most work focuses on single-objective optimization problems with static evaluation. The extension to adversarial multi-agent settings with non-stationary evaluation landscapes addresses a current bottleneck. The connection to established co-evolutionary computation principles is appropriate and could help bridge the gap between the evolutionary computation and LLM communities.
This paper makes a solid applied contribution by demonstrating that co-evolutionary evaluation mechanisms improve LLM-driven code evolution in adversarial settings, validated through competition success. The individual mechanisms are well-motivated but not deeply novel, and the experimental evidence—while generally positive—is weakened by single-run ablations and single-domain evaluation. The emergent tactical innovations documented are compelling qualitative evidence for the potential of LLM-driven code evolution. The work is timely and addresses a real gap, but falls short of the rigor needed for strong mechanistic claims.
Generated Jun 10, 2026
Paper 1 demonstrates higher potential scientific impact through rigorous methodology and proven real-world applicability. It proposes actionable co-evolutionary mechanisms for LLM-driven discovery, validating them through comprehensive ablations and a 1st place win in a hardware competition. Its empirical grounding and open-source availability guarantee immediate utility in multi-agent systems. While Paper 2 presents a highly novel theoretical framework for AI alignment, its empirical approach (relying on linguistic mimicry) lacks the concrete architectural validation and immediate, demonstrable real-world transferability seen in Paper 1.
Paper 1 presents a highly innovative approach by combining LLM-driven code evolution with co-evolutionary mechanisms, demonstrating the discovery of nontrivial algorithmic innovations. Its scientific impact is significantly bolstered by its proven real-world transferability, evidenced by winning performances in a major hardware and simulation competition. While Paper 2 offers valuable improvements to LLM web search via tree-structured exploration, Paper 1 tackles a more fundamental challenge in AI self-improvement and multi-agent systems with exceptional methodological rigor and empirical success.
Paper 1 introduces novel co-evolutionary mechanisms for LLM-driven strategy evolution in adversarial games, addressing a fundamental challenge (non-stationary evaluation landscapes) with broader applicability. It demonstrates real-world validation through competition success (1st place hardware, 3rd simulation at AAMAS 2026), shows emergent algorithmic innovations (lookahead search, adaptive interception), and contributes to the rapidly growing field of LLM-based code evolution. Paper 2, while solving a practical problem in BIM compliance checking, targets a narrower domain (AEC industry) with more incremental improvements (8.6% accuracy gain) and limited cross-field impact.
Paper 2 demonstrates significant methodological innovation in LLM-driven code evolution and validates its approach through strong, statistically reliable empirical results and real-world competition success (1st place in hardware round-robin). In contrast, Paper 1 presents an exploratory study with statistically non-significant results, high expert disagreement, and a primary conclusion that more research is needed. Paper 2's definitive algorithmic contributions, proven generalizability, and immediate applicability in multi-agent environments give it a substantially higher potential for scientific impact.
Paper 2 introduces a comprehensive evaluation benchmark for autonomous agents, addressing critical flaws in current trajectory-opaque grading. Robust benchmarks typically yield broader scientific impact and higher citations than specific algorithmic applications, as they set the standard for evaluating future models across the community. While Paper 1 presents an innovative co-evolutionary method for multi-agent games with impressive competition results, Paper 2's focus on safety, robustness, and multi-modal workflows tackles a universally relevant and timely bottleneck in LLM development, giving it a much wider potential impact across the AI field.
Paper 1 has higher potential scientific impact because it offers a broadly applicable, falsifiable explanation for a central ML-science issue (why adaptive benchmark reuse often doesn’t overfit) using information-theoretic compression tests across diverse modalities and tasks. This targets methodology and interpretation across the whole ML research pipeline (including agentic research), potentially influencing benchmark design, evaluation norms, and theory. Paper 2 is a strong, timely systems contribution with clear competitive validation, but its impact is more domain-specific (adversarial game strategy evolution) and tied to a particular framework/task, with less general conceptual reach.
Paper 2 addresses a fundamental and broadly impactful problem in AI safety—hidden failure modes in multi-turn reasoning models—that affects the entire LLM alignment community. Its novel CoT-Output 2x2 safety matrix framework introduces new conceptual vocabulary (context-injection failure, oversight paradox) with broad applicability across safety research. Paper 1, while strong and competition-validated, is more narrowly focused on adversarial game strategy evolution with domain-specific contributions. Paper 2's findings about reasoning unfaithfulness and the paradoxical effects of oversight have deeper implications for AI deployment safety, affecting a larger research community.
Paper 2 offers a more broadly impactful contribution by identifying a precise, generalizable principle: LLMs serve as structural priors rather than optimizers for coupled MIMO controller tuning. It rigorously delineates when LLMs help versus when they don't, provides reproducible benchmarks, and addresses a fundamental industrial problem (control tuning) with practical constraints (on-premise, no plant model). The clarity of its negative results (LLMs add nothing for simple loops) strengthens its scientific value. Paper 1, while impressive in competition results, is more application-specific to a niche adversarial game domain with narrower transferability of insights.
Paper 2 proposes highly novel co-evolutionary mechanisms for LLM-driven strategy evolution, advancing the frontier of multi-agent adversarial games. Its framework demonstrates significant algorithmic innovation, robust methodological rigor (including ablation studies), and proven real-world applicability by winning a recognized competition. In contrast, Paper 1 is primarily a reproduction and evaluation study of an existing model, offering valuable critical analysis but lacking the broad methodological innovations and transformative potential of Paper 2.
Paper 2 introduces novel co-evolutionary mechanisms for LLM-driven strategy evolution in adversarial games, addressing a fundamental challenge (shifting evaluation landscapes) with three well-defined mechanisms. It demonstrates real-world validation (competition placements), shows emergent algorithmic innovations, and has broader impact across AI, multi-agent systems, and evolutionary computation. Paper 1 applies existing techniques (LoRA, NEFTune) to a narrow domain (financial NER) with incremental improvements on a small dataset, representing solid engineering but limited novelty and narrower impact.