Discovering Novel LLM Experts via Task-Capability Coevolution

Andrew Dai, Boris Meinardus, Ciaran Regan, Yingtao Tian, Yujin Tang

#39 of 2292 · Artificial Intelligence
Tournament Score: 1576±28
Win Rate: 72% (39 wins, 15 losses, 54 matches)
Rating: 6.8/10

Abstract

Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended Assessment Coevolving with Diverse Capabilities (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without any explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection. Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs. Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: "Discovering Novel LLM Experts via Task-Capability Coevolution"

1. Core Contribution

AC/DC introduces a coevolutionary framework that jointly evolves populations of LLMs (via model merging) and synthetic evaluation tasks (via LLM-generated data). The key insight is that open-ended coevolution—where models and tasks co-adapt—can discover diverse collectives of specialist LLMs whose combined Coverage (fraction of problems solved by at least one model) exceeds that of larger monolithic models or manually curated expert ensembles. The framework combines evolutionary model merging (crossover + SVD-based mutation), Dominated Novelty Search (DNS) for quality-diversity selection, synthetic task generation with difficulty-adaptive evolution, and minimal criteria filtering (gibberish filter, impossible task filter).
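
The Coverage definition above (fraction of problems solved by at least one model) can be made concrete with a small sketch. The function name `coverage` and the toy pass/fail matrix are illustrative assumptions, not taken from the paper's code.

```python
from typing import Sequence


def coverage(solved: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one model.

    solved[m][p] is True if model m solves problem p (hypothetical layout).
    """
    if not solved or not solved[0]:
        raise ValueError("need at least one model and one problem")
    n_problems = len(solved[0])
    covered = sum(any(model[p] for model in solved) for p in range(n_problems))
    return covered / n_problems


# Toy data: three models, four problems (made up for illustration).
solved = [
    [True, False, False, True],
    [False, True, False, True],
    [False, False, False, True],
]
best_single = max(sum(row) / len(row) for row in solved)
print(coverage(solved), best_single)
```

In this toy case the collective covers three of four problems while the best single model solves two, which is the kind of gap the metric is designed to expose.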

The fundamental reframing is notable: instead of producing one better model, AC/DC produces a *population* of complementary models, shifting the paradigm from monolithic scaling to distributed specialization. The framework operates without any explicit benchmark optimization, yet discovered models generalize to out-of-distribution benchmarks.
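
As a rough illustration of why a quality-diversity selector such as Dominated Novelty Search can retain complementary specialists, the sketch below scores each model by its mean distance to the k nearest fitter models in descriptor space, and gives models with no fitter rival an infinite score. This is a simplified reading of dominated-novelty-style selection and is not AC/DC's implementation; the function names, the use of Euclidean distance, and the toy fitness and descriptor values are all assumptions.

```python
import math
from typing import List, Sequence


def dominated_novelty_scores(
    fitness: Sequence[float],
    descriptors: Sequence[Sequence[float]],
    k: int = 2,
) -> List[float]:
    """Mean distance to the k nearest strictly fitter individuals.

    Individuals with no fitter rival receive +inf (assumed convention).
    """
    scores = []
    for i, (f_i, d_i) in enumerate(zip(fitness, descriptors)):
        dists = sorted(
            math.dist(d_i, d_j)
            for j, (f_j, d_j) in enumerate(zip(fitness, descriptors))
            if j != i and f_j > f_i
        )
        nearest = dists[:k]
        scores.append(math.inf if not nearest else sum(nearest) / len(nearest))
    return scores


def select(fitness, descriptors, n_keep: int, k: int = 2) -> List[int]:
    """Indices of the n_keep individuals with the highest scores."""
    scores = dominated_novelty_scores(fitness, descriptors, k)
    ranked = sorted(range(len(fitness)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy population: model 1 is a near-duplicate of the stronger model 0,
# model 2 is weaker but has a very different skill profile.
fitness = [0.9, 0.8, 0.5]
descriptors = [(1.0, 0.0), (1.0, 0.1), (0.0, 1.0)]
print(select(fitness, descriptors, n_keep=2))
```

Under these assumptions the weaker but distinct model 2 is kept ahead of the stronger near-duplicate model 1, which matches the review's description of a population of complementary models rather than one best model.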

2. Methodological Rigor

Strengths in experimental design:

  • Evaluation across 4 model families (Qwen2, Qwen2.5, Qwen3, DeepSeek V1), providing evidence of generality.
  • 8 diverse benchmarks spanning knowledge, math, STEM, and code.
  • Novel open-ended benchmark variants (judge-evaluated MCQ without answer choices) that prevent Coverage "cheating."
  • Comprehensive ablation study systematically removing components (QD selection, gibberish filter, mutation, novelty filter).
  • Bootstrap hypothesis testing with 50,000 resamples and detailed statistical significance analysis across all comparisons.
  • Human study (3 reviewers, 94 assessments) validating synthetic task quality (97.8% correctness, 68.9% OOD).
  • Comparison against prior QD methods (CycleQD, DNS) that directly optimize benchmarks.
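
For readers unfamiliar with bootstrap hypothesis testing, a minimal paired bootstrap over per-problem outcomes might look like the sketch below. The review does not describe the paper's exact procedure, so the pairing scheme, the one-sided p-value estimate, and the name `paired_bootstrap_pvalue` are assumptions; only the default of 50,000 resamples comes from the text above.

```python
import random
from typing import Sequence


def paired_bootstrap_pvalue(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 50_000,
    seed: int = 0,
) -> float:
    """Estimate how often system A fails to beat system B under resampling.

    a[p] and b[p] are per-problem scores (e.g. 1 if covered, else 0) on the
    same problems. Returns the fraction of resamples where the summed
    difference a - b is <= 0.
    """
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    not_better = 0
    for _ in range(n_resamples):
        # Resample problems with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return not_better / n_resamples


# Toy per-problem Coverage indicators for two populations (made up).
acdc = [1, 1, 1, 0, 1, 1, 0, 1]
control = [1, 0, 1, 0, 0, 1, 0, 1]
print(paired_bootstrap_pvalue(acdc, control))
```

Resampling problems in pairs keeps each problem's difficulty common to both systems, so the estimate reflects the difference between systems rather than variation across problems.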

Methodological concerns:

  • The Coverage metric, while appropriate for the collective intelligence framing, inherently favors diverse populations. The Best-of-N results show substantially smaller gains, and the gap between Coverage and practical single-answer extraction remains significant—acknowledged by the authors but limiting real-world applicability.
  • The reliance on a fixed "scientist LLM" (Qwen2.5-72B-Instruct) for task generation constrains the exploration space and introduces a dependency that isn't co-evolved.
  • Reproducibility analysis shows only 2 runs for AC/DC vs. 3 for control, with moderately higher variance at N=8 (mean std dev 1.80 vs 0.59 points).
  • The Llama3 experiment fails to improve over baselines, with the explanation attributed to seed model incompatibility. This limitation is significant—the method's applicability depends on empirically testing seed combinations, reducing its practical accessibility.

3. Potential Impact

Direct applications:

  • Parameter-efficient deployment: collectives of 7B models achieving Coverage competitive with 72B models or GPT-4o, relevant for resource-constrained settings.
  • Automated model development pipelines reducing human curation overhead.
  • The synthetic task generation pipeline itself is valuable for automated evaluation.

Broader influence:

  • Opens a research direction at the intersection of open-endedness, model merging, and collective intelligence for LLMs. This is a genuinely novel intersection.
  • The skill vector representation and DNS-based selection provide a practical framework for measuring and optimizing model diversity.
  • Could influence how model hubs (e.g., Hugging Face) think about model collections—from independent uploads to coordinated, complementary populations.

Limitations on impact:

  • The Best-of-N extraction problem remains the critical bottleneck. Without reliable methods to select the correct answer from diverse candidates, Coverage improvements don't fully translate to practical gains (average BoN improvements of +0.99% to +1.34% vs. control, compared to Coverage gains of +2.04% to +10.19%).
  • The requirement for seed models fine-tuned from the same base architecture limits the diversity of starting conditions.
  • 324 GPU hours per run is modest but not negligible, and the method doesn't incorporate fine-tuning, meaning discovered capabilities are bounded by what crossover and mutation can extract from seed models.
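
The gap between Coverage and Best-of-N noted in the limitations can be illustrated with a toy selector. Majority voting is used here purely as a stand-in, since the review does not say how the paper extracts a single answer; the function name and the answer data are hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple


def coverage_and_majority_vote(
    answers: Sequence[Sequence[str]],
    gold: Sequence[str],
) -> Tuple[float, float]:
    """Return (Coverage, majority-vote accuracy).

    answers[m][p] is model m's answer to problem p; gold[p] is the reference.
    Ties in the vote go to the answer that appears first among the models.
    """
    n = len(gold)
    covered = 0
    voted_correct = 0
    for p in range(n):
        candidates = [model[p] for model in answers]
        if gold[p] in candidates:
            covered += 1
        top_answer, _ = Counter(candidates).most_common(1)[0]
        if top_answer == gold[p]:
            voted_correct += 1
    return covered / n, voted_correct / n


# Toy data: on the second problem only one specialist is right,
# so the vote discards the correct answer.
answers = [
    ["A", "B", "C"],
    ["A", "C", "D"],
    ["B", "C", "D"],
]
gold = ["A", "B", "D"]
cov, bon = coverage_and_majority_vote(answers, gold)
print(cov, bon)
```

Every problem is covered, yet the vote recovers only two of three answers. A lone specialist's correct answer is exactly what Coverage rewards and what a naive selector tends to lose.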

4. Timeliness & Relevance

The paper addresses a timely bottleneck: the diminishing returns of scaling monolithic models and the growing interest in model merging, mixture-of-experts, and collective inference strategies. The open-endedness angle is particularly relevant as the field moves toward self-improving and autonomous AI systems. The work also connects to the emerging "small model collective" paradigm as an alternative to frontier model scaling, a topic of increasing practical and economic interest.

5. Strengths & Limitations

Key strengths:

  • Novel and well-motivated combination of coevolution, model merging, and QD for LLM discovery.
  • No benchmark optimization during training, yet competitive OOD generalization.
  • Comprehensive evaluation: multiple model families, benchmarks, baselines, ablations, statistical tests, human study.
  • The qualitative analysis (emergent specialization, lineage trees, response diversity) is compelling and well-presented.
  • Strong Coverage gains against Big Models (+10.19% average at N=8) with a fraction of the parameters.

Notable weaknesses:

  • The Coverage-to-deployment gap (Best-of-N) limits practical utility; gains against Big Models become negative in BoN setting (-0.25% at N=8).
  • Method fails on Llama3 family, and the merging compatibility analysis (Appendix J) provides only post-hoc correlations with 80% accuracy on 5 data points.
  • Novelty is primarily recombinative—AC/DC cannot discover genuinely new knowledge, only recombine existing capabilities from seed models.
  • The coevolution dynamics analysis (Sections D.3 and D.6) is suggestive, but the 50-generation horizon is insufficient to demonstrate truly open-ended dynamics.

Summary

AC/DC represents a creative and well-executed contribution that bridges open-endedness research with practical LLM development. The Coverage results are strong and the experimental methodology is thorough. However, the practical impact is tempered by the Best-of-N extraction gap and seed model dependency. The paper's greatest value may be in establishing a research direction (coevolutionary LLM population discovery) rather than providing an immediately deployable system.

Rating: 6.8/10
Significance 7 · Rigor 7 · Novelty 7.5 · Clarity 7

Generated Apr 17, 2026

    Comparison History (54)

    vs. Generative structure search for efficient and diverse discovery of molecular and crystal structures
    claude-opus-4.6 · 5/6/2026

    Paper 2 addresses a fundamental challenge in molecular and materials discovery—efficient exploration of energy landscapes—with a principled framework unifying generative models and physics-based search. Its methodological contribution (GSS) is broadly applicable across chemistry and materials science, offers >10x efficiency gains, and generalizes beyond training data. Paper 1 presents an interesting open-ended evolution framework for LLMs, but its impact is more incremental within the rapidly shifting LLM landscape, where model merging and synthetic data generation are already explored. Paper 2's grounding in physical principles gives it more durable cross-disciplinary impact.

    vs. Long-Horizon Plan Execution in Large Tool Spaces through Entropy-Guided Branching
    gemini-3 · 5/5/2026

    Paper 2 proposes a paradigm-shifting framework for open-ended LLM development via coevolution, addressing the fundamental limitation of static training pipelines. This approach to automated, continuous capability discovery has broader implications for general AI advancement and foundation model training compared to Paper 1, which focuses on the narrower, albeit practical, domain of tool-use optimization and search algorithms.

    vs. When to Forget: A Memory Governance Primitive
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher scientific impact: it proposes an open-ended coevolution framework for jointly evolving tasks and LLM “experts,” a paradigm-level shift with broad relevance to continual learning, automated model discovery, evaluation, and multi-agent selection. If empirically validated, it has strong real-world applicability for building diverse capability portfolios without hand-designed training runs, and its breadth spans ML, LLM training, AutoML, and AI safety/evaluation. Paper 1 is novel and rigorous but narrower in scope (memory governance primitive) and primarily incremental to agent memory system design.

    vs. Equifinality in Mixture of Experts: Routing Topology Does Not Determine Language Modeling Quality
    gemini-3 · 5/5/2026

    Paper 2 introduces a novel open-ended framework for coevolving LLMs and tasks, addressing a critical bottleneck in continual model training. Its approach to autonomously discovering diverse capabilities has broad, transformative implications for frontier model development and multi-agent systems. While Paper 1 provides rigorous and valuable mechanistic insights into MoE routing, its impact is narrower, focusing on architectural simplification, and is currently demonstrated at a relatively small scale (80M parameters).

    vs. From Admission to Invariants: Measuring Deviation in Delegated Agent Systems
    gemini-3 · 5/5/2026

    Paper 2 proposes a paradigm shift in LLM development by introducing a coevolutionary framework for tasks and models. This open-ended approach to discovering emergent capabilities could fundamentally alter how foundation models are trained, extending beyond static datasets. While Paper 1 provides excellent theoretical insights into agent safety and monitoring, Paper 2's methodology addresses core scalability and capability discovery bottlenecks in AI, promising broader impact across the entire machine learning community and industry.

    vs. Poly-EPO: Training Exploratory Reasoning Models
    gpt-5.2 · 5/5/2026

    Paper 1 proposes a more novel, paradigm-shifting framework: open-ended coevolution of both tasks and LLM populations, enabling continual capability discovery in a single run and yielding archives of specialized experts with strong coverage without direct benchmark optimization. This has broad potential applications (automated model discovery, efficient expert ensembles, continual learning) and could influence multiple fields (AutoML, multi-agent systems, evolutionary computation, synthetic data/task generation). Paper 2 is timely and methodologically solid but is more incremental within post-training/RL, with impact primarily on reasoning and test-time compute scaling.

    vs. UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
    claude-opus-4.6 · 5/5/2026

    Paper 2 introduces a fundamentally new paradigm (AC/DC) for LLM development through open-ended coevolution of models and tasks, which has broader implications across the entire field of AI/ML. It addresses the core limitation of static training paradigms and offers a scalable, continual improvement framework. While Paper 1 makes a solid engineering contribution by unifying audio front-end tasks for full-duplex speech, it is more narrowly scoped to speech interaction systems. Paper 2's novelty in applying open-endedness and coevolution to LLM discovery has wider potential to reshape how models are developed across domains.

    vs. Model Spec Midtraining: Improving How Alignment Training Generalizes
    gpt-5.2 · 5/5/2026

    Paper 2 likely has higher impact: it proposes a simple, broadly applicable intervention (midtraining on synthetic spec documents) directly targeting a central, timely problem—alignment generalization. It shows large, safety-relevant gains (e.g., major reduction in agentic misalignment) and provides an experimental tool for studying what spec content improves generalization, increasing scientific and practical value. Paper 1 is novel and ambitious (open-ended coevolution for LLM capability discovery), but may face heavier engineering complexity and more uncertain reproducibility/standardization, potentially limiting near-term adoption compared to MSM.

    vs. Towards Understanding Specification Gaming in Reasoning Models
    claude-opus-4.6 · 5/5/2026

    Paper 2 addresses a fundamental safety challenge (specification gaming) in RL-trained reasoning models, providing systematic empirical evidence linking RL training to increased exploitation of specifications. This has immediate, broad implications for AI safety, alignment, and deployment practices. Its open-sourced evaluation suite enables reproducible research. Paper 1 presents a creative framework for LLM development via coevolution, but its impact is more incremental within the model development space. Paper 2's findings about a core failure mode of increasingly deployed reasoning models are more timely and consequential for the field.

    vs. ORPilot: A Production-Oriented Agentic LLM-for-OR Tool for Optimization Modeling
    gemini-3 · 5/5/2026

    Paper 1 introduces a foundational paradigm shift in LLM development through open-ended coevolution of models and tasks, impacting the broader AI community's approach to continual learning and model discovery. In contrast, Paper 2, while highly practical and valuable for Operations Research, focuses on a domain-specific application of existing LLM capabilities. The methodological innovation and broader applicability across the entire field of AI give Paper 1 a significantly higher potential for widespread scientific impact.

    vs. When Agents Evolve, Institutions Follow
    gpt-5.2 · 5/1/2026

    Paper 2 is likely higher impact: it proposes a new, scalable training/development paradigm (open-ended task–model coevolution) that could change how frontier LLMs are iteratively improved, with clear real-world relevance for continual capability discovery and efficiency (smaller models exceeding larger ones). The approach potentially generalizes across domains and connects to broader open-ended evolution research. Paper 1 is novel and useful for multi-agent system design, but its contributions are more evaluative/architectural within a fixed-model setting and may have narrower downstream leverage than a framework that alters the model improvement pipeline itself.

    vs. When Agents Evolve, Institutions Follow
    claude-opus-4.6 · 5/1/2026

    Paper 2 introduces a concrete, novel algorithmic framework (AC/DC) that addresses a fundamental limitation of current LLM training paradigms by applying open-ended coevolution to discover diverse LLM experts via model merging. It demonstrates practical results (smaller models surpassing larger ones, growing capability archives) with broad implications for LLM development. Paper 1 offers an interesting interdisciplinary framing mapping historical institutions to multi-agent architectures, but is more of an empirical evaluation of known organizational patterns rather than a new technical contribution. Paper 2's paradigm shift in model development methodology has greater potential for widespread adoption and follow-on research.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    claude-opus-4.6 · 4/22/2026

    The AAAI-26 AI Review Pilot represents the first large-scale real-world deployment of AI-assisted peer review across nearly 23,000 papers at a major conference. Its immediate practical impact on the scientific review process—a universal bottleneck across all fields—gives it extraordinary breadth. The finding that AI reviews were preferred over human reviews on key dimensions is paradigm-shifting for scientific publishing. While Paper 1 (AC/DC) presents a novel and interesting approach to open-ended LLM capability discovery, Paper 2 addresses a more universal problem with demonstrated real-world validation at unprecedented scale, likely influencing how all major conferences operate.

    vs. AI scientists produce results without reasoning scientifically
    claude-opus-4.6 · 4/22/2026

    Paper 1 addresses a fundamental and timely question about the epistemic validity of AI-driven scientific research, with rigorous methodology (25,000+ agent runs, 8 domains, two complementary evaluation lenses). Its finding that LLM agents fail to exhibit genuine scientific reasoning despite producing correct outputs has profound implications for the trustworthiness of AI-generated scientific knowledge—a concern spanning all scientific disciplines. Paper 2 presents an interesting engineering contribution (AC/DC framework for evolving LLM experts), but its impact is narrower, focused on model development methodology. Paper 1's insights are more likely to reshape policies, evaluation standards, and training paradigms across the AI-for-science ecosystem.

    vs. AI scientists produce results without reasoning scientifically
    gemini-3 · 4/22/2026

    Paper 1 provides a crucial, timely critique of 'AI scientists,' demonstrating that while LLMs can execute workflows, they fail at true scientific reasoning. This has profound implications across all scientific fields adopting AI, potentially shifting the focus from outcome-based evaluation to process-based reasoning training. Paper 2 offers an innovative LLM development method, but Paper 1's foundational questioning of AI validity in science gives it broader and more disruptive impact.

    vs. Using large language models for embodied planning introduces systematic safety risks
    gpt-5.2 · 4/22/2026

    Paper 2 likely has higher scientific impact: it introduces a large, deterministic benchmark (DESPITE) and provides a clear, scalable empirical finding that planning competence and safety awareness diverge, with strong implications for embodied AI and robotics deployment. The work is timely given rapid adoption of LLM planners and directly informs safety evaluation, model development, and policy. Paper 1 is innovative and potentially impactful for LLM training paradigms, but its coevolution/merging approach may face harder-to-validate generality and adoption barriers compared to a benchmarked safety result with immediate real-world relevance.

    vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
    gpt-5.2 · 4/22/2026

    Paper 2 likely has higher scientific impact due to its unprecedented real-world deployment at massive scale (all AAAI-26 submissions), strong timeliness, and immediate applicability to a critical bottleneck in science. It combines methodological engineering with large-scale user evaluation and introduces a benchmark, suggesting rigor and reproducibility. Its impact spans multiple fields (ML, HCI, scientometrics, research policy, academic publishing). Paper 1 is novel and important for LLM training paradigms, but its demonstrated impact is more contained within model-development research and depends on broader adoption and validation.

    vs. Using large language models for embodied planning introduces systematic safety risks
    gemini-3 · 4/22/2026

    Paper 1 introduces a highly novel paradigm for LLM development through task-capability coevolution, moving beyond static pre-training/post-training to an open-ended, continual discovery process. This fundamental methodological innovation has the potential to reshape how frontier models are trained and optimized. While Paper 2 provides a valuable benchmark for embodied AI safety, Paper 1's broader implications for AI self-improvement and capability discovery offer a higher ceiling for transformative scientific impact across the entire field of machine learning.

    vs. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
    gpt-5.2 · 4/22/2026

    Paper 2 likely has higher impact due to greater novelty and breadth: an open-ended coevolution framework that jointly evolves tasks and model variants (via merging) could change how LLMs are developed and continually improved, with applications across many domains and model families. If rigorously validated, it offers a scalable paradigm beyond static pre/post-training and could influence evaluation, capability discovery, and multi-agent/model selection. Paper 1 is timely and useful for embodied/robotics VLM reasoning, but its scope is narrower (a specific multimodal arithmetic benchmark + RL fine-tuning) and may primarily impact VLM/robotics subcommunities.

    vs. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic
    gpt-5.2 · 4/22/2026

    Paper 2 likely has higher impact due to greater novelty and breadth: an open-ended coevolution framework that jointly evolves tasks and model populations, potentially changing how LLMs are developed (continual capability discovery without explicit benchmark tuning). Its applications span model training pipelines, evaluation, agent selection, and efficient deployment via smaller specialist models. Paper 1 is timely and useful (datasets + RL post-training for visual relational arithmetic), but is narrower (vision-language relational reasoning) and closer to established post-training/dataset-construction patterns, with impact mainly in embodied/robotics and VLM benchmarking.