MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery
Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng
Abstract
Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.
AI Impact Assessments
Scientific Impact Assessment: MLEvolve
1. Core Contribution
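MLEvolve introduces a self-evolving multi-agent framework for automated machine learning algorithm discovery, addressing three identified limitations of existing MLE agents: inter-branch information isolation, memoryless search, and lack of hierarchical control. The framework integrates three novel components: Progressive MCGS, which extends tree search with graph-based reference edges for cross-branch information flow and an entropy-inspired schedule that shifts from exploration to exploitation; Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval; and hierarchical control that decouples strategic planning from code generation through adaptive coding modes.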
The problem addressed—automating end-to-end ML pipeline design—is significant and practically important. The solution is architecturally comprehensive, integrating search, memory, and generation into a unified framework.
2. Methodological Rigor
The methodology is generally well-formulated. The graph-based search space is formally defined with primary and reference edges, and the four expansion types (primary, intra-branch evolution, cross-branch reference, multi-branch aggregation) are clearly specified. The progressive exploration schedule with entropy-inspired soft switching between UCT and elite-guided exploitation is mathematically grounded.
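To make the selection mechanism concrete, the following is a minimal sketch of soft switching between UCT and elite-guided exploitation over a graph with primary and reference edges. It is an illustration under stated assumptions, not the paper's implementation: the `Node` fields, the logistic form of `exploit_probability`, and its `sharpness` and `midpoint` parameters are all hypothetical stand-ins for the entropy-inspired schedule, whose exact formula is not given here.

```python
import math
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One candidate solution in the search graph (hypothetical structure)."""
    name: str
    visits: int = 0
    total_reward: float = 0.0
    best_score: float = 0.0
    # Primary edges: children expanded directly from this node.
    children: list["Node"] = field(default_factory=list)
    # Reference edges: nodes from other branches consulted during expansion.
    references: list["Node"] = field(default_factory=list)


def uct_score(child: Node, parent_visits: int, c: float = 1.414) -> float:
    """Standard UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are tried first
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def exploit_probability(progress: float, sharpness: float = 10.0,
                        midpoint: float = 0.5) -> float:
    """Assumed logistic schedule (not the paper's formula).

    `progress` is the fraction of the search budget used, in [0, 1].
    The probability of elite-guided selection rises smoothly with it.
    """
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - midpoint)))


def select_child(parent: Node, progress: float, rng=random) -> Node:
    """Softly switch between UCT and elite-guided selection."""
    if rng.random() < exploit_probability(progress):
        # Exploit: follow the child with the best score seen so far.
        return max(parent.children, key=lambda n: n.best_score)
    # Explore: fall back to UCT.
    return max(parent.children, key=lambda n: uct_score(n, parent.visits))
```

Early in the run `exploit_probability` is close to 0, so selection is almost always UCT; late in the run it approaches 1 and the best-scoring branches dominate. A probabilistic switch rather than a hard cutoff is one simple way to obtain the gradual shift the review describes.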
However, several concerns exist:
3. Potential Impact
The practical impact could be significant. Automated ML engineering is a high-value target area, and achieving a 65.3% medal rate on MLE-Bench under half the standard time budget represents a genuine advance. The framework's generalization to mathematical optimization tasks (outperforming AlphaEvolve on 11 of 15 tasks) broadens its applicability.
Key impact areas include:
However, the framework's complexity (9+ specialized agents, intricate search mechanisms) may limit adoption outside well-resourced groups. The reliance on frontier LLMs also constrains accessibility.
4. Timeliness & Relevance
This work is highly timely. The MLE-Bench leaderboard is actively competitive, and the paper addresses a current bottleneck in LLM-based agent systems: how to sustain improvement over long horizons rather than plateauing early. The integration of search, memory, and hierarchical generation reflects emerging best practices in agentic AI. The comparison with AlphaEvolve (Google DeepMind's recent system) positions this work at the frontier.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment
MLEvolve is a well-engineered system paper that achieves competitive results on a challenging benchmark. Its main contributions are architectural—combining graph search, memory, and hierarchical code generation—rather than introducing fundamentally new algorithms. The empirical improvements over baselines are real but modest when accounting for the concurrent rapid progress in this area. The cross-domain generalization to mathematical optimization is a notable strength. The work will be most impactful as a reference architecture for future long-horizon LLM agent systems.
Generated Jun 5, 2026
Comparison History (22)
MLEvolve addresses the fundamental challenge of automated ML algorithm discovery with a comprehensive multi-agent framework featuring novel contributions (Progressive MCGS, Retrospective Memory, adaptive coding modes). It demonstrates SOTA results on established benchmarks (MLE-Bench) and outperforms notable systems like AlphaEvolve, showing cross-domain generalization. The breadth of impact spans automated ML, scientific discovery, and LLM-based agents. Paper 2, while addressing an interesting niche problem in meme understanding, has narrower scope and more incremental contributions (retrieval-augmented zero-shot framework) with limited cross-domain applicability.
Paper 2 likely has higher impact due to broader applicability and stronger real-world utility: an automated framework for ML algorithm discovery can influence many domains and workflows. It proposes multiple concrete system innovations (progressive graph-based tree search, retrospective memory, hierarchical planning/coding modes) and reports state-of-the-art results on established benchmarks under constrained budgets, plus cross-domain performance against specialized methods. Paper 1 is novel and timely for AI safety auditing, but is evaluated in a narrower, controlled GSM8K setup on a single model, with more limited immediate applicability and generality evidence.
Paper 2 presents a self-evolving framework for automated ML algorithm discovery, pushing the boundaries of AI-driven scientific research and demonstrating strong cross-domain generalization. This represents a significant step toward automated science, offering broad applications. Paper 1, while important for refining evaluation methodology, focuses on a narrower vulnerability (post-decision manipulability of LLM judges), making Paper 2's potential impact on the broader field of AI and scientific discovery substantially higher.
Paper 2 demonstrates higher potential scientific impact due to its focus on automated machine learning algorithm discovery. While Paper 1 offers a highly practical, cost-effective application for educational video generation, Paper 2 tackles a fundamental bottleneck in AI and scientific discovery. By introducing a self-evolving multi-agent framework capable of cross-domain generalization and long-horizon optimization, MLEvolve can accelerate research across numerous scientific fields. Advancing AI's ability to discover and optimize new algorithms provides a compounding technological multiplier that offers significantly broader theoretical and methodological impact than an applied video generation pipeline.
MLEvolve addresses a broader and more impactful problem—automated ML algorithm discovery—with a comprehensive multi-agent framework featuring novel components (Progressive MCGS, Retrospective Memory, adaptive coding modes). It demonstrates state-of-the-art results on MLE-Bench and outperforms AlphaEvolve on mathematical optimization, showing strong cross-domain generalization. Paper 2 (ReTreVal) presents valuable contributions to inference-time reasoning with cross-problem memory, but its scope is narrower (math/reasoning benchmarks) and its improvements, while solid, represent incremental gains on established benchmarks rather than enabling a fundamentally new capability like automated algorithm discovery.
Paper 2 offers a more broadly applicable and conceptually novel contribution: a reconstructed (active) memory access paradigm with an explicit Cue-Tag-Content graph that tightly couples retrieval and reasoning. This targets a central bottleneck for LLM agents across many domains (assistants, tools, robotics, scientific workflows), with demonstrated accuracy and efficiency gains on established long-context benchmarks. Paper 1 is impactful within AutoML/algorithm discovery, but its innovations are more system-integration and benchmark-specific, with narrower cross-field reach and potentially faster obsolescence as agent frameworks evolve.
Paper 2 is likely higher impact due to clearer methodological novelty and generality: it formalizes a concrete RL failure mode (credit misassignment in tool-augmented agents), quantifies it, and introduces a lightweight, theory-motivated correction (credit transfer via parameter-determinism) that is plug-and-play across multiple RL algorithms and multimodal search benchmarks. This directly targets a timely bottleneck in training tool-using agents and could broadly influence RLHF/tool-use training practices. Paper 1 is strong engineering with solid results, but impact may be narrower and more benchmark/framework-specific.
MLEvolve presents a novel technical framework addressing fundamental limitations in LLM-based ML engineering agents, with state-of-the-art results on established benchmarks and cross-domain generalization. It advances the rapidly growing field of automated ML/scientific discovery with concrete methodological innovations (Progressive MCGS, Retrospective Memory). Paper 2 proposes an educational competency model for AI reasoning skills—valuable but narrower in scope, validated only with simulated learners (not yet humans), and targets a less technically impactful domain. MLEvolve's breadth of applications and rigorous benchmarking suggest substantially higher scientific impact.
MLEvolve presents a concrete, implemented system with state-of-the-art empirical results on established benchmarks (MLE-Bench), outperforming notable baselines including AlphaEvolve. It addresses the timely and broadly applicable problem of automated ML algorithm discovery with novel technical contributions (Progressive MCGS, Retrospective Memory). Paper 1 is a perspective/review paper proposing hybrid modeling strategies for neurological disorders without novel experimental validation. While Paper 1 covers an important topic, Paper 2's demonstrated results, open-source code, cross-domain generalization, and positioning in the rapidly growing LLM-agent field give it higher near-term scientific impact.
Paper 1 has higher likely scientific impact due to greater novelty (self-evolving multi-agent LLM framework with progressive graph-based search and retrospective memory), broader applicability across ML engineering and algorithm discovery, and strong timeliness given rapid growth of LLM-agent research. It reports state-of-the-art results on a community benchmark (MLE-Bench) under constrained budgets and claims cross-domain generalization beyond ML (mathematical optimization), plus open-source release, all of which can accelerate adoption and follow-on work. Paper 2 targets an important application, but the algorithmic contribution appears more incremental within established DRL for inventory management.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: an LLM-based self-evolving framework for automated ML algorithm discovery can influence many domains beyond a single application area. Its methodological contributions (progressive tree/graph search with cross-branch information flow, retrospective memory, planning–coding decoupling) are general-purpose and directly target current challenges in long-horizon agentic systems. The evaluations report strong results on widely relevant benchmarks (MLE-Bench) and cross-domain tasks, suggesting wider adoption potential. Paper 1 is impactful for molecular design but is narrower in scope.
MLEvolve presents a significantly more novel and impactful contribution: a self-evolving multi-agent framework achieving state-of-the-art on MLE-Bench while outperforming AlphaEvolve on algorithm discovery tasks. It introduces multiple technical innovations (Progressive MCGS, Retrospective Memory, adaptive coding modes) with broad applicability across ML and scientific discovery. Paper 2 applies LLMs to epidemiological simulation in a relatively incremental way, combining existing ideas (LLM-based agents, ABM, census data) without fundamental methodological advances. MLEvolve's cross-domain generalization and strong benchmarks suggest wider and deeper scientific influence.
MLEvolve presents a significantly more novel and impactful contribution: a self-evolving multi-agent framework for automated ML algorithm discovery with several innovative components (Progressive MCGS, Retrospective Memory, adaptive coding modes). It achieves SOTA on MLE-Bench and outperforms AlphaEvolve, demonstrating broad cross-domain generalization. Paper 2 (GuardNet) addresses an important but narrower problem using relatively conventional techniques (BiLSTM ensembles) and achieves results that are acknowledged to be inferior to existing larger models, limiting its impact. MLEvolve's contributions to autonomous scientific discovery have far broader implications.
Paper 2 is likely to have higher scientific impact because it introduces a first formal measurement framework for appropriate reliance on set-valued AI advice—an increasingly common uncertainty-aware interface. Its contributions (new dimensions/metrics for both classification and regression in a sequential judge–advisor setting) are broadly applicable across HCI, AI evaluation, decision science, and policy, and can standardize empirical studies. Paper 1 is innovative and practically useful for AutoML/agentic search, but it is more engineering- and benchmark-driven with impact concentrated in ML systems, whereas Paper 2 offers a generalizable conceptual/measurement foundation with wider cross-field uptake.
Paper 2 has higher potential scientific impact because it introduces a novel, timely verification primitive for frontier AI training that could underpin enforceable governance and international agreements—an application with broad cross-field consequences (cryptography, systems, ML, policy). If realized, it changes how training claims are audited and could become infrastructure-level. Paper 1 is a strong engineering contribution to LLM-based AutoML/search with clear applicability, but it is more incremental within a fast-moving area and its impact is likely narrower and less durable than a widely adopted training-verification standard.
Paper 2 has higher impact potential due to greater novelty (self-evolving multi-agent framework with progressive graph-based search and retrospective memory), broader real-world applicability (automated ML algorithm discovery, long-horizon engineering), and wider cross-field relevance (AutoML, agent systems, optimization, software engineering). It reports strong empirical results on established benchmarks under tighter budgets and claims cross-domain generalization beyond ML. Paper 1 is valuable but primarily a comparative evaluation/benchmarking study in a narrower domain (Lean formalization) with more incremental methodological contribution.
MLEvolve presents a technically novel framework with strong empirical results on established benchmarks (MLE-Bench), introduces multiple methodological innovations (Progressive MCGS, Retrospective Memory), and demonstrates cross-domain generalization including outperforming AlphaEvolve. Paper 2 introduces a practical enterprise knowledge management framework with a real deployment study, but its contributions are more incremental and applied—focused on developer productivity rather than advancing fundamental capabilities. Paper 1's broader algorithmic contributions, rigorous benchmarking, and potential to advance automated ML discovery give it significantly higher scientific impact potential.
Paper 2 presents a foundational framework for automated machine learning algorithm discovery using LLMs, offering broad applicability across various scientific and mathematical domains. Its methodological innovations (Progressive MCGS, Retrospective Memory) have the potential to accelerate the pace of ML research itself. In contrast, Paper 1 offers a highly specific, incremental application of existing memory-augmented neural networks to maritime trajectory prediction, limiting its impact primarily to a single niche field.
Paper 2 proposes a fundamental paradigm shift in LLM agent architecture by unifying execution and adaptation into a single policy's action space. While Paper 1 offers a strong, specialized framework for ML algorithm discovery, Paper 2's 'ToolSelf' addresses a core bottleneck (static configurations) affecting all long-horizon agentic systems. This task-agnostic approach to emergent adaptivity provides greater breadth of impact across diverse domains, advancing the theoretical foundation of autonomous AI agents beyond domain-specific optimization.
Paper 2 (MLEvolve) likely has higher impact: it introduces a broadly applicable framework for automated ML algorithm discovery with innovations in search (Progressive MCGS), cross-branch knowledge sharing, and persistent retrospective memory. Its applications span many ML and scientific domains, and it shows strong empirical results on established benchmarks (MLE-Bench) plus cross-domain gains over specialized methods. Paper 1 is rigorous and valuable for autonomous driving safety/controllability, but its impact is more domain-specific and incremental relative to broader AutoML/agentic discovery trends. Paper 2 is also highly timely given rapid growth in LLM agents.