Discovering Novel LLM Experts via Task-Capability Coevolution
Andrew Dai, Boris Meinardus, Ciaran Regan, Yingtao Tian, Yujin Tang
Abstract
Frontier model developers aim to train models continually to possess emergent, diverse capabilities. To extend capabilities, the current pre-training and post-training paradigm requires manually starting training runs with static datasets or reward functions every time. Addressing this limitation, our work pursues the insight that open-endedness (via the coevolution of models and tasks) can discover models with increasingly novel skills in a single run. We introduce a new model development framework that extends coevolution to large language model (LLM) discovery, open-ended Assessment Coevolving with Diverse Capabilities (AC/DC). AC/DC evolves both LLMs via model merging and natural language tasks via synthetic data generation. AC/DC discovers growing archives of LLMs that surpass the capabilities of larger LLMs while taking up less GPU memory. In particular, our LLM populations achieve a broader Coverage of expertise than other curated models or baselines on downstream benchmarks, without any explicit benchmark optimization. Furthermore, AC/DC improves Coverage over time, continually innovates on tasks and models, and improves performance in multi-agent best-of-N selection. Our findings highlight the potential of coevolution as a means of discovering broader sets of capabilities from base LLMs. Overall, AC/DC brings us one step closer to a profoundly new paradigm of LLM development, where continual improvements to the diversity of model capabilities can be accelerated by leveraging existing models as stepping stones to increasingly powerful models.
AI Impact Assessments
(3 models)
Scientific Impact Assessment: "Discovering Novel LLM Experts via Task-Capability Coevolution"
1. Core Contribution
AC/DC introduces a coevolutionary framework that jointly evolves populations of LLMs (via model merging) and synthetic evaluation tasks (via LLM-generated data). The key insight is that open-ended coevolution—where models and tasks co-adapt—can discover diverse collectives of specialist LLMs whose combined Coverage (fraction of problems solved by at least one model) exceeds that of larger monolithic models or manually curated expert ensembles. The framework combines evolutionary model merging (crossover + SVD-based mutation), Dominated Novelty Search (DNS) for quality-diversity selection, synthetic task generation with difficulty-adaptive evolution, and minimal criteria filtering (gibberish filter, impossible task filter).
The fundamental reframing is notable: instead of producing one better model, AC/DC produces a *population* of complementary models, shifting the paradigm from monolithic scaling to distributed specialization. The framework operates without any explicit benchmark optimization, yet discovered models generalize to out-of-distribution benchmarks.
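The Coverage metric defined above (fraction of problems solved by at least one model) can be illustrated with a minimal sketch. The function names and the toy results matrix below are hypothetical and are not taken from the paper; the sketch only assumes that each model's per-problem outcome is available as a boolean.

```python
from typing import Sequence


def coverage(solved: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems solved by at least one model.

    solved[m][p] is True if model m solved problem p (assumed layout).
    """
    if not solved or not solved[0]:
        raise ValueError("need at least one model and one problem")
    n_problems = len(solved[0])
    # A problem counts as covered if any model in the population solved it.
    covered = sum(any(row[p] for row in solved) for p in range(n_problems))
    return covered / n_problems


def best_single_model_accuracy(solved: Sequence[Sequence[bool]]) -> float:
    """Accuracy of the strongest individual model, for comparison."""
    n_problems = len(solved[0])
    return max(sum(row) for row in solved) / n_problems


# Hypothetical toy data: 3 models evaluated on 4 problems.
results = [
    [True, True, False, False],   # model 0 solves problems 0 and 1
    [False, False, True, False],  # model 1 solves problem 2 only
    [True, False, False, False],  # model 2 solves problem 0 only
]

population_coverage = coverage(results)
best_single = best_single_model_accuracy(results)
print(population_coverage, best_single)
```

In this toy case the population covers three of four problems while the best individual model solves only two, which is the kind of gap between complementary specialists and a single model that the assessment describes.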
2. Methodological Rigor
Strengths in experimental design:
Methodological concerns:
3. Potential Impact
Direct applications:
Broader influence:
Limitations on impact:
4. Timeliness & Relevance
The paper addresses a timely bottleneck: the diminishing returns of scaling monolithic models and the growing interest in model merging, mixture-of-experts, and collective inference strategies. The open-endedness angle is particularly relevant as the field moves toward self-improving and autonomous AI systems. The work also connects to the emerging "small model collective" paradigm as an alternative to frontier model scaling—a topic of increasing practical and economic interest.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Summary
AC/DC represents a creative and well-executed contribution that bridges open-endedness research with practical LLM development. The Coverage results are strong and the experimental methodology is thorough. However, the practical impact is tempered by the Best-of-N extraction gap and seed model dependency. The paper's greatest value may be in establishing a research direction—coevolutionary LLM population discovery—rather than providing an immediately deployable system.
Generated Apr 17, 2026
Comparison History (54)
Paper 2 addresses a fundamental challenge in molecular and materials discovery—efficient exploration of energy landscapes—with a principled framework unifying generative models and physics-based search. Its methodological contribution (GSS) is broadly applicable across chemistry and materials science, offers >10x efficiency gains, and generalizes beyond training data. Paper 1 presents an interesting open-ended evolution framework for LLMs, but its impact is more incremental within the rapidly shifting LLM landscape, where model merging and synthetic data generation are already explored. Paper 2's grounding in physical principles gives it more durable cross-disciplinary impact.
Paper 2 proposes a paradigm-shifting framework for open-ended LLM development via coevolution, addressing the fundamental limitation of static training pipelines. This approach to automated, continuous capability discovery has broader implications for general AI advancement and foundation model training compared to Paper 1, which focuses on the narrower, albeit practical, domain of tool-use optimization and search algorithms.
Paper 2 likely has higher scientific impact: it proposes an open-ended coevolution framework for jointly evolving tasks and LLM “experts,” a paradigm-level shift with broad relevance to continual learning, automated model discovery, evaluation, and multi-agent selection. If empirically validated, it has strong real-world applicability for building diverse capability portfolios without hand-designed training runs, and its breadth spans ML, LLM training, AutoML, and AI safety/evaluation. Paper 1 is novel and rigorous but narrower in scope (memory governance primitive) and primarily incremental to agent memory system design.
Paper 2 introduces a novel open-ended framework for coevolving LLMs and tasks, addressing a critical bottleneck in continual model training. Its approach to autonomously discovering diverse capabilities has broad, transformative implications for frontier model development and multi-agent systems. While Paper 1 provides rigorous and valuable mechanistic insights into MoE routing, its impact is narrower, focusing on architectural simplification, and is currently demonstrated at a relatively small scale (80M parameters).
Paper 2 proposes a paradigm shift in LLM development by introducing a coevolutionary framework for tasks and models. This open-ended approach to discovering emergent capabilities could fundamentally alter how foundation models are trained, extending beyond static datasets. While Paper 1 provides excellent theoretical insights into agent safety and monitoring, Paper 2's methodology addresses core scalability and capability discovery bottlenecks in AI, promising broader impact across the entire machine learning community and industry.
Paper 1 proposes a more novel, paradigm-shifting framework: open-ended coevolution of both tasks and LLM populations, enabling continual capability discovery in a single run and yielding archives of specialized experts with strong coverage without direct benchmark optimization. This has broad potential applications (automated model discovery, efficient expert ensembles, continual learning) and could influence multiple fields (AutoML, multi-agent systems, evolutionary computation, synthetic data/task generation). Paper 2 is timely and methodologically solid but is more incremental within post-training/RL, with impact primarily on reasoning and test-time compute scaling.
Paper 2 introduces a fundamentally new paradigm (AC/DC) for LLM development through open-ended coevolution of models and tasks, which has broader implications across the entire field of AI/ML. It addresses the core limitation of static training paradigms and offers a scalable, continual improvement framework. While Paper 1 makes a solid engineering contribution by unifying audio front-end tasks for full-duplex speech, it is more narrowly scoped to speech interaction systems. Paper 2's novelty in applying open-endedness and coevolution to LLM discovery has wider potential to reshape how models are developed across domains.
Paper 2 likely has higher impact: it proposes a simple, broadly applicable intervention (midtraining on synthetic spec documents) directly targeting a central, timely problem—alignment generalization. It shows large, safety-relevant gains (e.g., major reduction in agentic misalignment) and provides an experimental tool for studying what spec content improves generalization, increasing scientific and practical value. Paper 1 is novel and ambitious (open-ended coevolution for LLM capability discovery), but may face heavier engineering complexity and more uncertain reproducibility/standardization, potentially limiting near-term adoption compared to MSM.
Paper 2 addresses a fundamental safety challenge (specification gaming) in RL-trained reasoning models, providing systematic empirical evidence linking RL training to increased exploitation of specifications. This has immediate, broad implications for AI safety, alignment, and deployment practices. Its open-sourced evaluation suite enables reproducible research. Paper 1 presents a creative framework for LLM development via coevolution, but its impact is more incremental within the model development space. Paper 2's findings about a core failure mode of increasingly deployed reasoning models are more timely and consequential for the field.
Paper 1 introduces a foundational paradigm shift in LLM development through open-ended coevolution of models and tasks, impacting the broader AI community's approach to continual learning and model discovery. In contrast, Paper 2, while highly practical and valuable for Operations Research, focuses on a domain-specific application of existing LLM capabilities. The methodological innovation and broader applicability across the entire field of AI give Paper 1 a significantly higher potential for widespread scientific impact.
Paper 2 is likely higher impact: it proposes a new, scalable training/development paradigm (open-ended task–model coevolution) that could change how frontier LLMs are iteratively improved, with clear real-world relevance for continual capability discovery and efficiency (smaller models exceeding larger ones). The approach potentially generalizes across domains and connects to broader open-ended evolution research. Paper 1 is novel and useful for multi-agent system design, but its contributions are more evaluative/architectural within a fixed-model setting and may have narrower downstream leverage than a framework that alters the model improvement pipeline itself.
Paper 2 introduces a concrete, novel algorithmic framework (AC/DC) that addresses a fundamental limitation of current LLM training paradigms by applying open-ended coevolution to discover diverse LLM experts via model merging. It demonstrates practical results (smaller models surpassing larger ones, growing capability archives) with broad implications for LLM development. Paper 1 offers an interesting interdisciplinary framing mapping historical institutions to multi-agent architectures, but is more of an empirical evaluation of known organizational patterns rather than a new technical contribution. Paper 2's paradigm shift in model development methodology has greater potential for widespread adoption and follow-on research.
The AAAI-26 AI Review Pilot represents the first large-scale real-world deployment of AI-assisted peer review across nearly 23,000 papers at a major conference. Its immediate practical impact on the scientific review process—a universal bottleneck across all fields—gives it extraordinary breadth. The finding that AI reviews were preferred over human reviews on key dimensions is paradigm-shifting for scientific publishing. While Paper 1 (AC/DC) presents a novel and interesting approach to open-ended LLM capability discovery, Paper 2 addresses a more universal problem with demonstrated real-world validation at unprecedented scale, likely influencing how all major conferences operate.
Paper 1 addresses a fundamental and timely question about the epistemic validity of AI-driven scientific research, with rigorous methodology (25,000+ agent runs, 8 domains, two complementary evaluation lenses). Its finding that LLM agents fail to exhibit genuine scientific reasoning despite producing correct outputs has profound implications for the trustworthiness of AI-generated scientific knowledge—a concern spanning all scientific disciplines. Paper 2 presents an interesting engineering contribution (AC/DC framework for evolving LLM experts), but its impact is narrower, focused on model development methodology. Paper 1's insights are more likely to reshape policies, evaluation standards, and training paradigms across the AI-for-science ecosystem.
Paper 1 provides a crucial, timely critique of 'AI scientists,' demonstrating that while LLMs can execute workflows, they fail at true scientific reasoning. This has profound implications across all scientific fields adopting AI, potentially shifting the focus from outcome-based evaluation to process-based reasoning training. Paper 2 offers an innovative LLM development method, but Paper 1's foundational questioning of AI validity in science gives it broader and more disruptive impact.
Paper 2 likely has higher scientific impact: it introduces a large, deterministic benchmark (DESPITE) and provides a clear, scalable empirical finding that planning competence and safety awareness diverge, with strong implications for embodied AI and robotics deployment. The work is timely given rapid adoption of LLM planners and directly informs safety evaluation, model development, and policy. Paper 1 is innovative and potentially impactful for LLM training paradigms, but its coevolution/merging approach may face harder-to-validate generality and adoption barriers compared to a benchmarked safety result with immediate real-world relevance.
Paper 2 likely has higher scientific impact due to its unprecedented real-world deployment at massive scale (all AAAI-26 submissions), strong timeliness, and immediate applicability to a critical bottleneck in science. It combines methodological engineering with large-scale user evaluation and introduces a benchmark, suggesting rigor and reproducibility. Its impact spans multiple fields (ML, HCI, scientometrics, research policy, academic publishing). Paper 1 is novel and important for LLM training paradigms, but its demonstrated impact is more contained within model-development research and depends on broader adoption and validation.
Paper 1 introduces a highly novel paradigm for LLM development through task-capability coevolution, moving beyond static pre-training/post-training to an open-ended, continual discovery process. This fundamental methodological innovation has the potential to reshape how frontier models are trained and optimized. While Paper 2 provides a valuable benchmark for embodied AI safety, Paper 1's broader implications for AI self-improvement and capability discovery offer a higher ceiling for transformative scientific impact across the entire field of machine learning.
Paper 2 likely has higher impact due to greater novelty and breadth: an open-ended coevolution framework that jointly evolves tasks and model variants (via merging) could change how LLMs are developed and continually improved, with applications across many domains and model families. If rigorously validated, it offers a scalable paradigm beyond static pre/post-training and could influence evaluation, capability discovery, and multi-agent/model selection. Paper 1 is timely and useful for embodied/robotics VLM reasoning, but its scope is narrower (a specific multimodal arithmetic benchmark + RL fine-tuning) and may primarily impact VLM/robotics subcommunities.
Paper 2 likely has higher impact due to greater novelty and breadth: an open-ended coevolution framework that jointly evolves tasks and model populations, potentially changing how LLMs are developed (continual capability discovery without explicit benchmark tuning). Its applications span model training pipelines, evaluation, agent selection, and efficient deployment via smaller specialist models. Paper 1 is timely and useful (datasets + RL post-training for visual relational arithmetic), but is narrower (vision-language relational reasoning) and closer to established post-training/dataset-construction patterns, with impact mainly in embodied/robotics and VLM benchmarking.