AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models

Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie

May 27, 2026

arXiv:2605.27873v1

cs.AI (primary)

#313 of 2682 · Artificial Intelligence

Tournament Score: 1504 ± 43 (scale: 1050 to 1800)

Win Rate: 83%

Rating: 6/10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Abstract

AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.

AI Impact Assessments (1 model)

Scientific Impact Assessment: AIBuildAI-2

1. Core Contribution

AIBuildAI-2 introduces a knowledge-enhanced autonomous agent for end-to-end AI model development, built around a hierarchical, continually evolving external knowledge system. The central novelty is the two-level knowledge architecture: ~30 top-level (L1) categories with high-level instructions, and ~1,000 bottom-level (L2) detailed knowledge documents, combined with dynamic context loading that selectively retrieves relevant knowledge based on the agent's current state and task. The knowledge system is initialized from curated web sources and evolves through two mechanisms: distilling the agent's own completed runs into structured takeaways, and incorporating newly published external content. This addresses a genuine limitation of prior agents that rely solely on LLM parametric knowledge, noisy web search, or static document corpora.
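The two-level layout and dynamic context loading described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `Category`, `KnowledgeStore`, `load_context`, and `write_back` are hypothetical, and the word-overlap score is a toy stand-in for whatever relevance mechanism the paper actually uses, which this assessment does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One hypothetical L1 category: a short instruction plus its L2 documents."""
    name: str
    instruction: str
    documents: dict[str, str] = field(default_factory=dict)  # title -> body


def _overlap(query: str, text: str) -> int:
    """Toy relevance score: number of distinct lowercase words shared."""
    return len(set(query.lower().split()) & set(text.lower().split()))


class KnowledgeStore:
    """Illustrative two-level store; not the paper's implementation."""

    def __init__(self) -> None:
        self.categories: dict[str, Category] = {}

    def add_category(self, name: str, instruction: str) -> None:
        self.categories[name] = Category(name, instruction)

    def add_document(self, category: str, title: str, body: str) -> None:
        # Raises KeyError if the category has not been created yet.
        self.categories[category].documents[title] = body

    def load_context(self, state: str, top_categories: int = 1,
                     top_docs: int = 1) -> list[str]:
        """Return only the L1 instructions and L2 bodies most relevant to `state`."""
        ranked = sorted(
            self.categories.values(),
            key=lambda c: _overlap(state, f"{c.name} {c.instruction}"),
            reverse=True,
        )
        context: list[str] = []
        for cat in ranked[:top_categories]:
            context.append(cat.instruction)
            docs = sorted(
                cat.documents.items(),
                key=lambda kv: _overlap(state, f"{kv[0]} {kv[1]}"),
                reverse=True,
            )
            context.extend(body for _, body in docs[:top_docs])
        return context

    def write_back(self, category: str, title: str, takeaway: str) -> None:
        """Store a distilled takeaway from a finished run as a new L2 document."""
        self.add_document(category, title, takeaway)


store = KnowledgeStore()
store.add_category("tabular", "For tabular data try gradient boosting before neural networks.")
store.add_document("tabular", "boosting tips", "Tune learning rate and tree depth for gradient boosting.")
store.add_document("tabular", "feature encoding", "Target-encode high-cardinality categorical features.")
store.add_category("vision", "For image tasks start from a pretrained backbone.")
store.add_document("vision", "augmentation", "Use random crops and flips for image augmentation.")

for line in store.load_context("tabular heart disease prediction with gradient boosting"):
    print(line)
```

In this sketch only one category instruction and one document reach the agent's context, which is the point of selective loading: the rest of the store stays out of the prompt until the agent's state makes it relevant.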

2. Methodological Rigor

Strengths in evaluation breadth: The paper evaluates across three distinct settings—MLE-Bench (75 Kaggle-style tasks), a live heart disease prediction competition (4,370 teams), and a blind ADMET drug discovery challenge (103 teams). This diversity provides reasonable coverage of the system's generalization ability.

Weaknesses in experimental rigor:

No ablation studies. The paper provides no controlled experiments isolating the contribution of individual components: the hierarchical knowledge structure, dynamic context loading, the L1 vs. L2 decomposition, knowledge evolution, or the builder agents. Without ablations, it is impossible to determine which design choices actually drive performance. Is it the knowledge system, or simply using Claude Opus 4.7 with a good multi-agent scaffold?

Limited baselines on live competitions. Only AIBuildAI and MLEvolve are compared on the heart disease and ADMET tasks. Other strong MLE-Bench baselines (MARS, Famou-Agent, etc.) are absent from these evaluations.

Single-run evaluation. There is no mention of variance across runs. LLM-based agents are notoriously stochastic; without confidence intervals or multiple trials, the reported improvements could be within noise.

MLE-Bench comparison fairness. Different leaderboard entries may use different backbone LLMs, compute budgets, and numbers of parallel repositories, making direct comparisons imperfect. The paper uses Claude Opus 4.7 (a very recent, powerful model), and it's unclear how much of the improvement stems from the backbone LLM versus the knowledge system.

Knowledge system analysis is absent. There is no quantitative analysis of knowledge retrieval quality, how often L2 documents are accessed, whether retrieved knowledge is actually used in final solutions, or how the knowledge base changes over time.
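The concern about noise can be made concrete with a rough calculation. The sketch below computes a Wilson score interval for a medal rate, assuming 53 medals out of 75 tasks; that count is inferred here from the reported 70.7% and is not stated in the source. It also treats tasks as independent trials and captures only task-sampling uncertainty, so run-to-run stochasticity of the agent would come on top of it.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


# Assumed count: 53 of 75 tasks medaled (inferred from 70.7%, not given in the source).
low, high = wilson_interval(53, 75)
print(f"medal rate 95% interval: {low:.3f} to {high:.3f}")
```

Under these assumptions the interval spans roughly 60% to 80%, which suggests that gaps of a few points between leaderboard entries are hard to interpret without repeated runs.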

3. Potential Impact

The paper addresses an important and timely problem: democratizing AI model development for domain scientists. The vision of an agent that accumulates and reuses engineering knowledge is compelling. Practical impacts include:

Accessibility: Natural scientists in biology, chemistry, and medicine could use such a system to build competitive models without deep ML engineering expertise. The ADMET challenge result is a concrete demonstration.

Knowledge accumulation: The idea that solved tasks produce reusable, human-readable knowledge artifacts is valuable both for agent improvement and for human learning.

Benchmark leadership: Ranking first on MLE-Bench and top 6.6% in a live competition demonstrates practical utility.

However, the impact is tempered by several factors: the system requires Claude Opus 4.7 (expensive, proprietary), 24 hours of A100 compute per task, and the knowledge system initialization requires substantial curation effort. Reproducibility is also a concern since the backbone LLM is a closed-source commercial API.

4. Timeliness & Relevance

This work is highly timely. Autonomous AI development agents are a rapidly growing area, with multiple concurrent systems (AIDE, R&D-Agent, MARS, etc.) competing on the same benchmarks. The knowledge augmentation angle is well-motivated: as these agents approach human-level performance, the bottleneck increasingly shifts from reasoning capability to practical engineering know-how, which is exactly what the knowledge system provides. The framing around AI-for-science is also timely given the growing demand for ML in scientific domains.

5. Strengths & Limitations

Key Strengths:

Clear, well-articulated system design with a principled two-level knowledge hierarchy

State-of-the-art results on the primary benchmark (MLE-Bench, 70.7% medal rate)

Evaluation on a live competition provides ecological validity beyond static benchmarks

The ADMET evaluation demonstrates cross-domain generalization to a specialized scientific task

Open-source code and solution artifacts enhance reproducibility

The continual evolution mechanism (from both self-experience and web content) is a conceptually appealing contribution

Notable Limitations:

Absence of ablations is the most critical weakness—without them, the causal contribution of the knowledge system is unestablished

No variance reporting across multiple runs undermines confidence in the results

Backbone LLM confound: Using a cutting-edge proprietary model makes it difficult to attribute improvements to the knowledge system vs. the model

Scalability concerns: The knowledge system's evolution mechanism is described but not quantitatively analyzed—how does performance scale with knowledge base size? Is there degradation from noisy or contradictory entries?

Limited novelty in agent architecture: The hierarchical manager-designer-coder-tuner structure is inherited from AIBuildAI; the primary novelty is the knowledge system, which is essentially a well-organized RAG system with a feedback loop

No failure analysis: The paper does not analyze tasks where AIBuildAI-2 fails or where the knowledge system provides unhelpful guidance

The 38.8% ranking on the ADMET challenge (103 teams) is not particularly impressive, and the paper gives it less scrutiny than the headline results

Overall Assessment

AIBuildAI-2 presents a sensible and well-engineered system that achieves strong benchmark results. The knowledge system concept is sound and addresses a real limitation of existing agents. However, the paper's scientific contribution is undermined by the lack of ablation studies, variance analysis, and detailed analysis of the knowledge system's behavior. The empirical results, while impressive at face value, do not convincingly isolate the contribution of the proposed knowledge system from the powerful backbone LLM. This is fundamentally a strong systems/engineering paper with incomplete scientific validation.

Rating: 6/10

Significance 6.5 · Rigor 4.5 · Novelty 5.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (24)

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gemini-3.1 · 5/28/2026

Paper 1 presents a highly impactful real-world application with broad cross-disciplinary reach, enabling non-experts in fields like biology and physics to automatically build state-of-the-art AI models. While Paper 2 offers valuable foundational insights into LLM reasoning compression, Paper 1's AIBuildAI-2 agent addresses a critical bottleneck in applied scientific discovery. Its novel evolving knowledge system and verifiable top-tier performance on MLE-Bench demonstrate exceptional potential to accelerate research across diverse domains, giving it a broader overall scientific impact.

vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

gemini-3.1 · 5/28/2026

Paper 2 directly addresses a critical bottleneck in interdisciplinary research by enabling non-experts to automatically build high-performing AI models. Its evolving knowledge system and state-of-the-art results demonstrate strong practical utility. While Paper 1 provides valuable evaluation insights and a novel benchmark for search agents, Paper 2's potential to accelerate AI application and scientific discovery across diverse domains (biology, physics, chemistry) gives it broader and more immediate real-world scientific impact.

vs. LACUNA: Safe Agents as Recursive Program Holes

claude-opus-4.6 · 5/28/2026

LACUNA introduces a fundamentally novel programming model that addresses a core architectural problem in LLM agents—unifying the runtime and model-written code while preserving safety through typed program holes. This has broader impact across the entire agent ecosystem, touching programming languages, safety, and agent design. AIBuildAI-2, while achieving strong benchmark results, is a more incremental contribution (knowledge-enhanced agent for AutoML) building on established retrieval-augmented generation patterns. LACUNA's theoretical framework and safety guarantees have wider applicability and deeper foundational significance.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses the broader and more impactful problem of automating AI model development with a knowledge-enhanced agent system, demonstrating state-of-the-art results on established benchmarks (MLE-Bench). Its potential to democratize AI for scientific discovery across biology, physics, and chemistry gives it wider cross-disciplinary impact. While Paper 2 presents a novel contribution to knowledge editing with the CODE framework and addresses an important problem (epistemic dissonance), its scope is narrower, focusing specifically on LLM fact updating. Paper 1's practical applications and broader accessibility implications give it higher estimated impact.

vs. ZipRL: Adaptive Multi-Turn Context Compression with Hindsight Response Replay

gemini-3.1 · 5/28/2026

AIBuildAI-2 addresses a fundamental bottleneck in AI adoption across all sciences by automating model creation. Its evolving knowledge system and strong empirical performance democratize AI access for non-experts, offering a broader and more transformative real-world impact across diverse scientific fields compared to ZipRL's more specialized, albeit valuable, technical contribution to context compression.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses a broader and more impactful problem—democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent achieves state-of-the-art results on MLE-Bench (70.7% medal rate) and demonstrates real-world competitiveness against human experts. The self-evolving knowledge system is a novel contribution with wide applicability. Paper 2, while technically interesting in revealing refusal signals in intermediate activations and offering efficiency gains for adversarial attacks, addresses a narrower problem in LLM safety/red-teaming with more limited breadth of impact across fields.

vs. Plan Before Search: Search Agents Need Plan

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses the broader and more transformative problem of democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent with an evolving knowledge system is highly novel and demonstrates strong empirical results (first on MLE-Bench, top 6.6% against human experts). The potential real-world impact spans multiple scientific fields (biology, physics, chemistry). While Paper 1 makes solid contributions to retrieval-augmented reasoning with its self-bootstrapping paradigm, its scope is narrower (multi-hop QA). Paper 2's breadth of application and potential to accelerate scientific discovery gives it higher estimated impact.

vs. DiagramRAG: A Lightweight Framework to Retrieve Scientific Diagram for Figure Generation

gpt-5.2 · 5/28/2026

Paper 1 has higher potential impact due to broader applicability and timeliness: an agentic, knowledge-enhanced system for automatically building AI models can accelerate ML development across many scientific domains. Its hierarchical, continually evolving external knowledge base is a notable innovation over static LLM capability, and the strong competitive results (SOTA on MLE-Bench; top percentile vs human teams) support methodological effectiveness. Paper 2 is valuable and well-engineered but is narrower in scope (diagram completion/generation) and likely impacts a smaller set of workflows compared to automating end-to-end model building.

vs. OpenComputer: Verifiable Software Worlds for Computer-Use Agents

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it introduces a verifier-grounded paradigm for computer-use agents with app-specific state verifiers, self-improving verification, task synthesis, and an auditable evaluation harness across 33 apps/1,000 tasks. This addresses a core, timely bottleneck—reliable, reproducible evaluation and supervision for agentic interaction with real software—enabling benchmarking, reward design, and safety auditing across many downstream methods. Paper 1 is strong and application-relevant, but primarily advances an AutoML/agent engineering system whose gains may be more incremental and domain-specific compared with a broadly enabling verification/evaluation infrastructure.

vs. Hierarchical Prompt-Domain Control and Learning for Resource-Constrained Agentic Language Models

gpt-5.2 · 5/28/2026

Paper 1 offers a more directly impactful and timely contribution: a knowledge-augmented agent that demonstrably achieves state-of-the-art performance on a widely used benchmark (MLE-Bench) and strong real-world competition results, with an evolving external knowledge system that can generalize across many AI-model-building tasks. Its applications span multiple scientific domains by lowering barriers for non-experts to build models. Paper 2 is methodologically interesting for deployment control of compact agents, but appears narrower in scope and validation, targeting reliability/cost rather than broad automation gains.

vs. Confidence-Orchestrated Self-Evolution against Uncertain LLM Feedback

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses a broader and more transformative problem—automating AI model development for non-experts, particularly scientists—with strong empirical validation (top rankings on MLE-Bench and competitive with human experts). Its knowledge-enhanced agent architecture with evolving knowledge systems has wider cross-disciplinary impact potential. While COSE makes a solid contribution to self-evolving LLMs with confidence-based learning, it represents a more incremental improvement within the LLM training paradigm. AIBuildAI-2's practical applications for democratizing AI across scientific domains give it higher potential impact.

vs. GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease

gemini-3.1 · 5/28/2026

Paper 2 has significantly broader potential impact across multiple scientific fields by democratizing AI model development for non-experts. While Paper 1 presents a valuable methodological improvement for a specific medical application (IBD detection), Paper 2 addresses a fundamental bottleneck in AI-driven scientific discovery (AutoML via LLM agents) and demonstrates state-of-the-art performance on a major benchmark (MLE-Bench), indicating a much higher ceiling for cross-disciplinary utility and transformative effect.

vs. Cultural Binding Heads in Language Models

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 demonstrates substantial practical impact by achieving state-of-the-art results on MLE-Bench (70.7% medal rate) and competitive performance against human experts. It addresses a broadly important problem—democratizing AI model development for non-experts, especially scientists—with a novel knowledge-enhanced agent architecture featuring hierarchical, evolving knowledge systems. Paper 2 provides interesting mechanistic interpretability insights about cultural binding heads, but its scope is narrower (cultural appropriation in LLMs), the improvements are modest (1-3pp), and the practical applications are more limited. Paper 1's broader applicability across scientific domains gives it higher potential impact.

vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

gemini-3.1 · 5/28/2026

Paper 1 has a significantly broader potential impact across multiple scientific disciplines by democratizing and automating AI model development for researchers without specialized AI expertise. While Paper 2 presents valuable methodological advancements in conversational agent memory, Paper 1's approach directly addresses a critical bottleneck in modern scientific discovery, offering widespread real-world utility and demonstrating impressive empirical results on challenging benchmarks.

vs. Reasoning and Planning with Dynamically Changing Norms

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses the high-impact problem of democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent with an evolving knowledge system demonstrates strong empirical results (first on MLE-Bench, top 6.6% against human experts), showing practical utility. The approach has broad cross-disciplinary impact by enabling scientists in biology, physics, and chemistry to build AI models without specialized expertise. Paper 1, while addressing an important theoretical problem of norm-guided planning, has narrower scope and less demonstrated real-world impact, with evaluation limited to a dialogue task.

vs. Gradient Step Plug-and-Play Model for Dental Cone-Beam CT Reconstruction

gpt-5.2 · 5/28/2026

Paper 1 has higher potential impact due to broader applicability and timeliness: an evolving, external-knowledge-enhanced agent for automated ML engineering can benefit many domains beyond AI (biology, chemistry, physics) and addresses a major bottleneck in practice. The reported strong benchmark/competition performance suggests practical relevance. Paper 2 is a solid, domain-specific plug-and-play reconstruction method for dental CBCT with evidence largely from synthetic data and qualitative real-image results; its impact is likely narrower (medical imaging) and less broadly transformative despite real-world utility.

vs. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

gemini-3.1 · 5/28/2026

Paper 1 presents a highly impactful tool capable of democratizing AI development for natural scientists, directly accelerating cross-disciplinary scientific discovery. Its evolving, hierarchical knowledge system addresses significant bottlenecks in automated model building, offering broader real-world applications and wider impact across multiple fields compared to the LLM-specific algorithmic improvements in Paper 2.

vs. TIGER: Text-Informed Generalized Enzyme-Reaction Retrieval

claude-opus-4.6 · 5/28/2026

AIBuildAI-2 addresses the broad challenge of automating AI model development, with demonstrated state-of-the-art results on established benchmarks (MLE-Bench) and real competitions. Its knowledge-enhanced agent framework with self-evolving knowledge systems has wide applicability across all fields using AI, potentially democratizing AI for non-experts. While TIGER makes a solid contribution to enzyme-reaction retrieval in computational biology, its scope is narrower. AIBuildAI-2's breadth of impact across scientific disciplines, practical utility, and novel self-improving knowledge architecture give it higher potential impact.

vs. Proper Scoring Rules for Agentic Uncertainty Quantification

gpt-5.2 · 5/28/2026

Paper 2 is likely to have higher scientific impact because it introduces a principled, broadly applicable evaluation framework (strictly proper trajectory-level scoring rules) with formal elicitation guarantees, addressing a foundational measurement problem in agentic uncertainty. Its methodological rigor (theorems, censored-trajectory extension) and clarity about what existing metrics do/do not elicit make it useful across many agent settings and benchmarks, improving comparability and scientific validity of UQ research. Paper 1 is practically impactful, but its contribution is more system/engineering-centric and may date faster as agent-building tools evolve.

vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

gemini-3.1 · 5/28/2026

Paper 1 presents a method to automate AI model building, which has profound implications for accelerating scientific discovery across multiple disciplines by lowering the barrier to entry for natural scientists. Its evolving, knowledge-enhanced architecture addresses a major limitation in current autonomous AI agents. While Paper 2 offers a rigorous approach to benchmarking enterprise agents, Paper 1's potential to democratize AI development and its state-of-the-art performance on MLE benchmarks give it a significantly broader and deeper scientific impact.