MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi, Zongsheng Cao, Shiyang Feng, Zichen Liang, Boyuan Sun, Tianshuo Peng

Jun 4, 2026

arXiv:2606.06473v1

cs.AI (primary), cs.CL

#1168 of 3404 · Artificial Intelligence

Tournament Score: 1437 ± 47

Win Rate: 77%

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6 · Clarity 7.5

Abstract

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLEvolve, an LLM-based self-evolving multi-agent framework for end-to-end machine learning algorithm discovery. By extending tree search to Progressive MCGS, MLEvolve enables cross-branch information flow through graph-based reference edges and gradually shifts the search from broad exploration to focused exploitation with an entropy-inspired progressive schedule. To allow the agent to evolve with accumulated experience, we introduce Retrospective Memory, which combines a cold-start domain knowledge base with a dynamic global memory for task-specific experience retrieval and reuse. For stable long-horizon iteration, we further decouple strategic planning from code generation with adaptive coding modes. Evaluation on MLE-Bench shows that MLEvolve achieves state-of-the-art performance across multiple dimensions including average medal rate and valid submission rate under a 12-hour budget (half the standard runtime). Moreover, MLEvolve also outperforms specialized algorithm discovery methods including AlphaEvolve on mathematical algorithm optimization tasks, demonstrating strong cross-domain generalization. Our code is available at https://github.com/InternScience/MLEvolve.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MLEvolve

1. Core Contribution

MLEvolve introduces a self-evolving multi-agent framework for automated machine learning algorithm discovery, addressing three identified limitations of existing MLE agents: inter-branch information isolation, memoryless search, and lack of hierarchical control. The framework integrates three novel components:

Progressive Monte Carlo Graph Search (MCGS): Extends tree search to a graph structure with reference edges enabling cross-branch information flow, combined with an entropy-inspired schedule that transitions from exploration to exploitation over time.

Retrospective Memory: A dual-memory system combining a static domain knowledge base for cold-start initialization with a dynamic global memory for runtime experience accumulation, using hybrid BM25+FAISS retrieval with reciprocal rank fusion.

Hierarchical Planning with Adaptive Code Generation: Decouples strategic planning from code implementation, with three coding modes (base, stepwise, diff) selected according to search state.
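To make the retrieval step in the Retrospective Memory component more concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists. It is an illustration only, not MLEvolve's implementation: the function name, the memory ids, and the constant k=60 (a commonly used default) are assumptions, and the two input lists simply stand in for the outputs of a BM25 search and a FAISS search.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into a single ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    taken from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical ranked results from a lexical and a dense retriever.
lexical_hits = ["m3", "m1", "m7"]
dense_hits = ["m1", "m5", "m3"]
fused = reciprocal_rank_fusion([lexical_hits, dense_hits])
print(fused)
```

Entries that rank well in both lists rise to the top, while entries found by only one retriever are kept but ranked lower, which is the usual motivation for fusing lexical and dense results by rank rather than by raw score.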

The problem addressed—automating end-to-end ML pipeline design—is significant and practically important. The solution is architecturally comprehensive, integrating search, memory, and generation into a unified framework.

2. Methodological Rigor

The methodology is generally well-formulated. The graph-based search space is formally defined with primary and reference edges, and the four expansion types (primary, intra-branch evolution, cross-branch reference, multi-branch aggregation) are clearly specified. The progressive exploration schedule with entropy-inspired soft switching between UCT and elite-guided exploitation is mathematically grounded.
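The soft switch between UCT and elite-guided exploitation can be sketched roughly as follows. This is a hedged illustration rather than the paper's formulation: the UCT score is the standard one, but the linear decay in explore_probability is a placeholder for the entropy-inspired schedule (whose exact form is not given in this assessment), and all function and field names are hypothetical.

```python
import math
import random


def uct_score(mean_reward, visits, parent_visits, c=1.414):
    """Standard UCT: mean reward plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited children are tried first
    return mean_reward + c * math.sqrt(math.log(parent_visits) / visits)


def explore_probability(step, total_steps, floor=0.1):
    """Assumed schedule: decays linearly from 1.0 to `floor` over the run."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - floor) * progress


def select_node(children, elites, step, total_steps, rng=random):
    """With probability p pick a child by UCT, otherwise exploit the best elite.

    children: list of dicts with 'mean_reward' and 'visits'.
    elites: non-empty list of the best nodes found so far, from any branch.
    """
    p = explore_probability(step, total_steps)
    if rng.random() < p:
        parent_visits = max(1, sum(ch["visits"] for ch in children))
        return max(
            children,
            key=lambda ch: uct_score(ch["mean_reward"], ch["visits"], parent_visits),
        )
    return max(elites, key=lambda n: n["mean_reward"])


# Hypothetical search state.
children = [
    {"name": "a", "mean_reward": 0.9, "visits": 10},
    {"name": "b", "mean_reward": 0.2, "visits": 0},
]
elites = [
    {"name": "e1", "mean_reward": 0.95},
    {"name": "e2", "mean_reward": 0.80},
]
early = select_node(children, elites, step=0, total_steps=100, rng=random.Random(0))
late = select_node(children, elites, step=100, total_steps=100, rng=random.Random(0))
```

Early in the run the selection is almost always UCT-driven, so the unvisited child is chosen; late in the run most selections go to the elite set, which mirrors the exploration-to-exploitation shift described above.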

However, several concerns exist:

Reproducibility challenges: Despite open-sourced code, the system involves numerous hyperparameters (Table 4 lists ~20), specialized agent prompts, and relies on expensive frontier LLMs (Gemini-3.1-Pro-preview). The computational cost per run (12 hours × 75 tasks × 3 seeds on H200 GPUs) is substantial.

Statistical reporting: Results report mean ± SEM over only 3 seeds, which provides limited statistical confidence. This matters most for percentage-based metrics, where several results show an SEM of 0.0; that value means all three seeds produced identical results, which is unusual for a stochastic method.

Ablation limitations: The component ablation is conducted on only 22 tasks (MLE-Bench Lite), and the detailed component analysis on only 9 tasks, raising questions about generalizability of these findings.

Fair comparison concerns: MLEvolve uses Gemini-3.1-Pro-preview while baselines use various LLMs (o1-preview, gpt-5, DeepSeek-R1, etc.), making direct comparison difficult. The paper partially addresses this with multi-LLM experiments but only on 8 tasks.
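For reference, the mean ± SEM statistic discussed under statistical reporting can be computed as in the sketch below. The numbers are hypothetical and are not results from the paper; the helper name is an assumption. The sketch shows why an SEM of exactly 0.0 implies that every seed returned the same value.

```python
import math
import statistics


def mean_and_sem(values):
    """Return (mean, standard error of the mean) using the sample std (n - 1)."""
    n = len(values)
    mean = statistics.fmean(values)
    # SEM is undefined for a single observation.
    sem = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return mean, sem


# Hypothetical medal rates (%) over three seeds; not taken from the paper.
varied = [64.0, 66.0, 65.0]
identical = [100.0, 100.0, 100.0]

print(mean_and_sem(varied))
print(mean_and_sem(identical))
```

With only three seeds the SEM is the sample standard deviation divided by sqrt(3), so a single differing seed is enough to move it away from zero.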

3. Potential Impact

The practical impact could be significant. Automated ML engineering is a high-value target area, and achieving 65.3% medal rate on MLE-Bench under half the standard time budget represents genuine advancement. The framework's generalization to mathematical optimization tasks (outperforming AlphaEvolve on 11/15 tasks) broadens its applicability.

Key impact areas include:

AutoML: Moving beyond component-level optimization to full pipeline automation

LLM-based agents: The MCGS framework and retrospective memory are generalizable to other long-horizon agentic tasks

AI for Science: The authors explicitly target this as a future direction

However, the framework's complexity (9+ specialized agents, intricate search mechanisms) may limit adoption outside well-resourced groups. The reliance on frontier LLMs also constrains accessibility.

4. Timeliness & Relevance

This work is highly timely. The MLE-Bench leaderboard is actively competitive, and the paper addresses a current bottleneck in LLM-based agent systems: how to sustain improvement over long horizons rather than plateauing early. The integration of search, memory, and hierarchical generation reflects emerging best practices in agentic AI. The comparison with AlphaEvolve (Google DeepMind's recent system) positions this work at the frontier.

5. Strengths & Limitations

Strengths:

Comprehensive architecture: The three-component design addresses genuine, distinct limitations of prior work

Strong empirical results: State-of-the-art on MLE-Bench at half the time budget, plus cross-domain generalization

100% valid submission rate: Practical reliability is notable

Progressive entropy dynamics (Figure 3): Empirically validates the exploration-exploitation transition

Well-structured case studies (Appendix G): Concretely demonstrates how graph-based operators function in practice

Open-source commitment: Code availability enhances impact

Limitations:

Engineering vs. scientific novelty: Many components are careful combinations of existing ideas (MCTS, hybrid retrieval, diff-based editing) rather than fundamentally new algorithms. The progressive schedule is inspired by but not rigorously derived from entropy principles.

Computational cost: Not reported explicitly, but running 75 tasks × 3 seeds × 12 hours with frontier LLMs implies significant API and compute costs, limiting reproducibility.

Limited analysis of failure modes: The paper doesn't deeply analyze where MLEvolve fails or what types of tasks remain challenging.

Benchmark limitations: MLE-Bench, while comprehensive, evaluates against historical Kaggle competitions. The mathematical optimization comparison uses only 15 tasks with task-dependent precision, making some comparisons marginal.

Incremental over concurrent work: The field is extremely active (MARS, ML-Master 2.0, AIBuildAI all from 2025-2026), and the relative improvement over MARS+ (65.3% vs 62.7%) and AIBuildAI (65.3% vs 63.1%) is modest given the SEM overlap.

No cost-normalized comparison: Comparing methods using different LLMs without normalizing for API costs or total compute makes fair evaluation difficult.

Overall Assessment

MLEvolve is a well-engineered system paper that achieves competitive results on a challenging benchmark. Its main contributions are architectural—combining graph search, memory, and hierarchical code generation—rather than introducing fundamentally new algorithms. The empirical improvements over baselines are real but modest when accounting for the concurrent rapid progress in this area. The cross-domain generalization to mathematical optimization is a notable strength. The work will be most impactful as a reference architecture for future long-horizon LLM agent systems.

Rating: 6.8 / 10

Significance 7 · Rigor 6.5 · Novelty 6 · Clarity 7.5

Generated Jun 5, 2026

Comparison History (22)

vs. I Know What You Meme, Even If it Emerged Today: Understanding Evolving Memes through Open-World Knowledge Acquisition

claude-opus-4.6 · 6/6/2026

MLEvolve addresses the fundamental challenge of automated ML algorithm discovery with a comprehensive multi-agent framework featuring novel contributions (Progressive MCGS, Retrospective Memory, adaptive coding modes). It demonstrates SOTA results on established benchmarks (MLE-Bench) and outperforms notable systems like AlphaEvolve, showing cross-domain generalization. The breadth of impact spans automated ML, scientific discovery, and LLM-based agents. Paper 2, while addressing an interesting niche problem in meme understanding, has narrower scope and more incremental contributions (retrieval-augmented zero-shot framework) with limited cross-domain applicability.

vs. Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

gpt-5.2 · 6/6/2026

Paper 2 likely has higher impact due to broader applicability and stronger real-world utility: an automated framework for ML algorithm discovery can influence many domains and workflows. It proposes multiple concrete system innovations (progressive graph-based tree search, retrospective memory, hierarchical planning/coding modes) and reports state-of-the-art results on established benchmarks under constrained budgets, plus cross-domain performance against specialized methods. Paper 1 is novel and timely for AI safety auditing, but is evaluated in a narrower, controlled GSM8K setup on a single model, with more limited immediate applicability and generality evidence.

vs. Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

gemini-3.1 · 6/6/2026

Paper 2 presents a self-evolving framework for automated ML algorithm discovery, pushing the boundaries of AI-driven scientific research and demonstrating strong cross-domain generalization. This represents a significant step toward automated science, offering broad applications. Paper 1, while important for refining evaluation methodology, focuses on a narrower vulnerability (post-decision manipulability of LLM judges), making Paper 2's potential impact on the broader field of AI and scientific discovery substantially higher.

vs. Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

gemini-3.1 · 6/6/2026

Paper 2 demonstrates higher potential scientific impact due to its focus on automated machine learning algorithm discovery. While Paper 1 offers a highly practical, cost-effective application for educational video generation, Paper 2 tackles a fundamental bottleneck in AI and scientific discovery. By introducing a self-evolving multi-agent framework capable of cross-domain generalization and long-horizon optimization, MLEvolve can accelerate research across numerous scientific fields. Advancing AI's ability to discover and optimize new algorithms provides a compounding technological multiplier that offers significantly broader theoretical and methodological impact than an applied video generation pipeline.

vs. ReTreVal: Reasoning Tree with Validation and Cross-Problem Memory for Large Language Models

claude-opus-4.6 · 6/5/2026

MLEvolve addresses a broader and more impactful problem—automated ML algorithm discovery—with a comprehensive multi-agent framework featuring novel components (Progressive MCGS, Retrospective Memory, adaptive coding modes). It demonstrates state-of-the-art results on MLE-Bench and outperforms AlphaEvolve on mathematical optimization, showing strong cross-domain generalization. Paper 2 (ReTreVal) presents valuable contributions to inference-time reasoning with cross-problem memory, but its scope is narrower (math/reasoning benchmarks) and its improvements, while solid, represent incremental gains on established benchmarks rather than enabling a fundamentally new capability like automated algorithm discovery.

vs. Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

gpt-5.2 · 6/5/2026

Paper 2 offers a more broadly applicable and conceptually novel contribution: a reconstructed (active) memory access paradigm with an explicit Cue-Tag-Content graph that tightly couples retrieval and reasoning. This targets a central bottleneck for LLM agents across many domains (assistants, tools, robotics, scientific workflows), with demonstrated accuracy and efficiency gains on established long-context benchmarks. Paper 1 is impactful within AutoML/algorithm discovery, but its innovations are more system-integration and benchmark-specific, with narrower cross-field reach and potentially faster obsolescence as agent frameworks evolve.

vs. TAPO: Tool-Aware Policy Optimization via Credit Transfer for Multimodal Search Agents

gpt-5.2 · 6/5/2026

Paper 2 is likely higher impact due to clearer methodological novelty and generality: it formalizes a concrete RL failure mode (credit misassignment in tool-augmented agents), quantifies it, and introduces a lightweight, theory-motivated correction (credit transfer via parameter-determinism) that is plug-and-play across multiple RL algorithms and multimodal search benchmarks. This directly targets a timely bottleneck in training tool-using agents and could broadly influence RLHF/tool-use training practices. Paper 1 is strong engineering with solid results, but impact may be narrower and more benchmark/framework-specific.

vs. Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI

claude-opus-4.6 · 6/5/2026

MLEvolve presents a novel technical framework addressing fundamental limitations in LLM-based ML engineering agents, with state-of-the-art results on established benchmarks and cross-domain generalization. It advances the rapidly growing field of automated ML/scientific discovery with concrete methodological innovations (Progressive MCGS, Retrospective Memory). Paper 2 proposes an educational competency model for AI reasoning skills—valuable but narrower in scope, validated only with simulated learners (not yet humans), and targets a less technically impactful domain. MLEvolve's breadth of applications and rigorous benchmarking suggest substantially higher scientific impact.

vs. Integrating Mechanistic and Data-Driven Models for Neurological Disorders through Differentiable Programming

claude-opus-4.6 · 6/5/2026

MLEvolve presents a concrete, implemented system with state-of-the-art empirical results on established benchmarks (MLE-Bench), outperforming notable baselines including AlphaEvolve. It addresses the timely and broadly applicable problem of automated ML algorithm discovery with novel technical contributions (Progressive MCGS, Retrospective Memory). Paper 1 is a perspective/review paper proposing hybrid modeling strategies for neurological disorders without novel experimental validation. While Paper 1 covers an important topic, Paper 2's demonstrated results, open-source code, cross-domain generalization, and positioning in the rapidly growing LLM-agent field give it higher near-term scientific impact.

vs. Learning to replenish: A hybrid deep reinforcement learning for dynamic inventory management in the pharmaceutical supply chains

gpt-5.2 · 6/5/2026

Paper 1 has higher likely scientific impact due to greater novelty (self-evolving multi-agent LLM framework with progressive graph-based search and retrospective memory), broader applicability across ML engineering and algorithm discovery, and strong timeliness given rapid growth of LLM-agent research. It reports state-of-the-art results on a community benchmark (MLE-Bench) under constrained budgets and claims cross-domain generalization beyond ML (mathematical optimization), plus open-source release, all of which can accelerate adoption and follow-on work. Paper 2 targets an important application, but the algorithmic contribution appears more incremental within established DRL for inventory management.

vs. Agents on a Tree: Pathwise Coordination for Multi-Objective Molecular Optimization

gpt-5.2 · 6/5/2026

Paper 2 likely has higher scientific impact due to broader applicability and timeliness: an LLM-based self-evolving framework for automated ML algorithm discovery can influence many domains beyond a single application area. Its methodological contributions (progressive tree/graph search with cross-branch information flow, retrospective memory, planning–coding decoupling) are general-purpose and directly target current challenges in long-horizon agentic systems. The evaluations report strong results on widely relevant benchmarks (MLE-Bench) and cross-domain tasks, suggesting wider adoption potential. Paper 1 is impactful for molecular design but is narrower in scope.

vs. An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

claude-opus-4.6 · 6/5/2026

MLEvolve presents a significantly more novel and impactful contribution: a self-evolving multi-agent framework achieving state-of-the-art on MLE-Bench while outperforming AlphaEvolve on algorithm discovery tasks. It introduces multiple technical innovations (Progressive MCGS, Retrospective Memory, adaptive coding modes) with broad applicability across ML and scientific discovery. Paper 2 applies LLMs to epidemiological simulation in a relatively incremental way, combining existing ideas (LLM-based agents, ABM, census data) without fundamental methodological advances. MLEvolve's cross-domain generalization and strong benchmarks suggest wider and deeper scientific influence.

vs. GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

claude-opus-4.6 · 6/5/2026

MLEvolve presents a significantly more novel and impactful contribution: a self-evolving multi-agent framework for automated ML algorithm discovery with several innovative components (Progressive MCGS, Retrospective Memory, adaptive coding modes). It achieves SOTA on MLE-Bench and outperforms AlphaEvolve, demonstrating broad cross-domain generalization. Paper 2 (GuardNet) addresses an important but narrower problem using relatively conventional techniques (BiLSTM ensembles) and achieves results that are acknowledged to be inferior to existing larger models, limiting its impact. MLEvolve's contributions to autonomous scientific discovery have far broader implications.

vs. A Framework for Measuring Appropriate Reliance on Set-Valued AI Advice

gpt-5.2 · 6/5/2026

Paper 2 is likely to have higher scientific impact because it introduces a first formal measurement framework for appropriate reliance on set-valued AI advice—an increasingly common uncertainty-aware interface. Its contributions (new dimensions/metrics for both classification and regression in a sequential judge–advisor setting) are broadly applicable across HCI, AI evaluation, decision science, and policy, and can standardize empirical studies. Paper 1 is innovative and practically useful for AutoML/agentic search, but it is more engineering- and benchmark-driven with impact concentrated in ML systems, whereas Paper 2 offers a generalizable conceptual/measurement foundation with wider cross-field uptake.

vs. Zero knowledge verification for frontier AI training is possible

gpt-5.2 · 6/5/2026

Paper 2 has higher potential scientific impact because it introduces a novel, timely verification primitive for frontier AI training that could underpin enforceable governance and international agreements—an application with broad cross-field consequences (cryptography, systems, ML, policy). If realized, it changes how training claims are audited and could become infrastructure-level. Paper 1 is a strong engineering contribution to LLM-based AutoML/search with clear applicability, but it is more incremental within a fast-moving area and its impact is likely narrower and less durable than a widely adopted training-verification standard.

vs. Evaluation of LLMs for Mathematical Formalization in Lean

gpt-5.2 · 6/5/2026

Paper 2 has higher impact potential due to greater novelty (self-evolving multi-agent framework with progressive graph-based search and retrospective memory), broader real-world applicability (automated ML algorithm discovery, long-horizon engineering), and wider cross-field relevance (AutoML, agent systems, optimization, software engineering). It reports strong empirical results on established benchmarks under tighter budgets and claims cross-domain generalization beyond ML. Paper 1 is valuable but primarily a comparative evaluation/benchmarking study in a narrower domain (Lean formalization) with more incremental methodological contribution.

vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

claude-opus-4.6 · 6/5/2026

MLEvolve presents a technically novel framework with strong empirical results on established benchmarks (MLE-Bench), introduces multiple methodological innovations (Progressive MCGS, Retrospective Memory), and demonstrates cross-domain generalization including outperforming AlphaEvolve. Paper 2 introduces a practical enterprise knowledge management framework with a real deployment study, but its contributions are more incremental and applied—focused on developer productivity rather than advancing fundamental capabilities. Paper 1's broader algorithmic contributions, rigorous benchmarking, and potential to advance automated ML discovery give it significantly higher scientific impact potential.

vs. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

gemini-3.1 · 6/5/2026

Paper 2 presents a foundational framework for automated machine learning algorithm discovery using LLMs, offering broad applicability across various scientific and mathematical domains. Its methodological innovations (Progressive MCGS, Retrospective Memory) have the potential to accelerate the pace of ML research itself. In contrast, Paper 1 offers a highly specific, incremental application of existing memory-augmented neural networks to maritime trajectory prediction, limiting its impact primarily to a single niche field.

vs. ToolSelf: Unifying Task Execution and Self-Reconfiguration via Tool-Driven Emergent Adaptation

gemini-3.1 · 6/5/2026

Paper 2 proposes a fundamental paradigm shift in LLM agent architecture by unifying execution and adaptation into a single policy's action space. While Paper 1 offers a strong, specialized framework for ML algorithm discovery, Paper 2's 'ToolSelf' addresses a core bottleneck (static configurations) affecting all long-horizon agentic systems. This task-agnostic approach to emergent adaptivity provides greater breadth of impact across diverse domains, advancing the theoretical foundation of autonomous AI agents beyond domain-specific optimization.

vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models

gpt-5.2 · 6/5/2026

Paper 2 (MLEvolve) likely has higher impact: it introduces a broadly applicable framework for automated ML algorithm discovery with innovations in search (Progressive MCGS), cross-branch knowledge sharing, and persistent retrospective memory. Its applications span many ML and scientific domains, and it shows strong empirical results on established benchmarks (MLE-Bench) plus cross-domain gains over specialized methods. Paper 1 is rigorous and valuable for autonomous driving safety/controllability, but its impact is more domain-specific and incremental relative to broader AutoML/agentic discovery trends. Paper 2 is also highly timely given rapid growth in LLM agents.