AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang, Pengtao Xie
Abstract
AI models underpin data-centric applications from image and text processing to scientific discovery in biology, physics, and chemistry. Yet developing them remains heavily manual, requiring practitioners to design architectures, build training pipelines, and iteratively refine solutions, making it challenging for natural scientists without specialized AI engineering expertise to build the high-performing models their research demands. To reduce this burden and broaden access to AI for scientific discovery, agents that automatically build AI models have been proposed. However, the performance of these agents is largely limited by the parametric knowledge of their underlying large language models, which is static, often outdated, and sparse on practical AI model engineering know-how. To address this limitation, we introduce AIBuildAI-2, a knowledge-enhanced agent with an external, evolving knowledge system for automatically building AI models. The knowledge system of AIBuildAI-2 is hierarchical, organizing curated AI development knowledge into high-level knowledge instructions over topical categories and low-level knowledge documents under each category, from which the agent dynamically loads only the context relevant to its current state and the AI task being solved, grounding each design and implementation decision in concrete, externally verifiable expertise. The system is initialized by collecting and cleaning AI-development-related documents from the web and organizing them into the corresponding categories, and continually evolves from the agent's own experience by distilling each completed run on an AI task into structured takeaways that are written back into the knowledge system. AIBuildAI-2 achieves state-of-the-art results, ranking first on MLE-Bench with a 70.7% medal rate and placing in the top 6.6% among 4,370 human-expert teams in a heart disease prediction competition.
AI Impact Assessments
Scientific Impact Assessment: AIBuildAI-2
1. Core Contribution
AIBuildAI-2 introduces a knowledge-enhanced autonomous agent for end-to-end AI model development, built around a hierarchical, continually evolving external knowledge system. The central novelty is the two-level knowledge architecture: ~30 top-level (L1) categories with high-level instructions, and ~1,000 bottom-level (L2) detailed knowledge documents, combined with dynamic context loading that selectively retrieves relevant knowledge based on the agent's current state and task. The knowledge system is initialized from curated web sources and evolves through two mechanisms: distilling the agent's own completed runs into structured takeaways, and incorporating newly published external content. This addresses a genuine limitation of prior agents that rely solely on LLM parametric knowledge, noisy web search, or static document corpora.
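The two-level layout and selective loading described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Category` class, the `load_context` and `write_back` helpers, and the word-overlap relevance score are all hypothetical stand-ins chosen for brevity, and the paper's actual retrieval mechanism is not specified in this assessment.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """One top-level (L1) category: a short instruction plus its L2 documents.

    Hypothetical structure for illustration only.
    """
    name: str
    instruction: str
    documents: dict = field(default_factory=dict)  # title -> document body


def _overlap(query: str, text: str) -> int:
    """Toy relevance score (an assumption): number of distinct shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def load_context(categories, query, max_categories=1, max_docs=1):
    """Return only the L1 instructions and L2 documents most relevant to the query."""
    ranked = sorted(
        categories,
        key=lambda c: _overlap(query, c.name + " " + c.instruction),
        reverse=True,
    )
    context = []
    for cat in ranked[:max_categories]:
        context.append(f"[{cat.name}] {cat.instruction}")
        docs = sorted(
            cat.documents.items(),
            key=lambda kv: _overlap(query, kv[0] + " " + kv[1]),
            reverse=True,
        )
        context.extend(f"  - {title}: {body}" for title, body in docs[:max_docs])
    return context


def write_back(category, title, takeaway):
    """Store a distilled takeaway from a completed run as a new L2 document."""
    category.documents[title] = takeaway
```

In this sketch, only the best-matching categories and documents enter the agent's context, and `write_back` stands in for the experience-distillation step by adding a new low-level document that later queries can retrieve.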
2. Methodological Rigor
Strengths in evaluation breadth: The paper evaluates across three distinct settings—MLE-Bench (75 Kaggle-style tasks), a live heart disease prediction competition (4,370 teams), and a blind ADMET drug discovery challenge (103 teams). This diversity provides reasonable coverage of the system's generalization ability.
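For a sense of scale, the headline percentages can be converted into rough counts. The snippet below is only back-of-the-envelope arithmetic: the helper names are hypothetical, and it assumes the medal rate is a simple fraction of the 75 tasks, whereas the paper may instead average over multiple runs.

```python
import math


def rank_cutoff(top_fraction: float, n_teams: int) -> int:
    """Largest rank that still falls within the given top fraction of the field."""
    return math.floor(top_fraction * n_teams)


def medal_rate(medals: int, n_tasks: int) -> float:
    """Medal rate in percent, rounded to one decimal place."""
    return round(100 * medals / n_tasks, 1)


# Top 6.6% of 4,370 teams corresponds to roughly rank 288 or better.
print(rank_cutoff(0.066, 4370))

# Under the single-run assumption, 53 medals on 75 tasks gives 70.7%.
print(medal_rate(53, 75))
```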
Weaknesses in experimental rigor:
3. Potential Impact
The paper addresses an important and timely problem: democratizing AI model development for domain scientists. The vision of an agent that accumulates and reuses engineering knowledge is compelling. Practical impacts include:
However, the impact is tempered by several factors: the system requires Claude Opus 4.7 (expensive, proprietary), 24 hours of A100 compute per task, and the knowledge system initialization requires substantial curation effort. Reproducibility is also a concern since the backbone LLM is a closed-source commercial API.
4. Timeliness & Relevance
This work is highly timely. Autonomous AI development agents are a rapidly growing area, with multiple concurrent systems (AIDE, R&D-Agent, MARS, etc.) competing on the same benchmarks. The knowledge augmentation angle is well-motivated: as these agents approach human-level performance, the bottleneck increasingly shifts from reasoning capability to practical engineering know-how, which is exactly what the knowledge system provides. The framing around AI-for-science is also timely given the growing demand for ML in scientific domains.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
AIBuildAI-2 presents a sensible and well-engineered system that achieves strong benchmark results. The knowledge system concept is sound and addresses a real limitation of existing agents. However, the paper's scientific contribution is undermined by the lack of ablation studies, variance analysis, and detailed analysis of the knowledge system's behavior. The empirical results, while impressive at face value, do not convincingly isolate the contribution of the proposed knowledge system from the powerful backbone LLM. This is fundamentally a strong systems/engineering paper with incomplete scientific validation.
Generated May 28, 2026
Comparison History (24)
Paper 1 presents a highly impactful real-world application with broad cross-disciplinary reach, enabling non-experts in fields like biology and physics to automatically build state-of-the-art AI models. While Paper 2 offers valuable foundational insights into LLM reasoning compression, Paper 1's AIBuildAI-2 agent addresses a critical bottleneck in applied scientific discovery. Its novel evolving knowledge system and verifiable top-tier performance on MLE-Bench demonstrate exceptional potential to accelerate research across diverse domains, giving it a broader overall scientific impact.
Paper 2 directly addresses a critical bottleneck in interdisciplinary research by enabling non-experts to automatically build high-performing AI models. Its evolving knowledge system and state-of-the-art results demonstrate strong practical utility. While Paper 1 provides valuable evaluation insights and a novel benchmark for search agents, Paper 2's potential to accelerate AI application and scientific discovery across diverse domains (biology, physics, chemistry) gives it broader and more immediate real-world scientific impact.
LACUNA introduces a fundamentally novel programming model that addresses a core architectural problem in LLM agents—unifying the runtime and model-written code while preserving safety through typed program holes. This has broader impact across the entire agent ecosystem, touching programming languages, safety, and agent design. AIBuildAI-2, while achieving strong benchmark results, is a more incremental contribution (knowledge-enhanced agent for AutoML) building on established retrieval-augmented generation patterns. LACUNA's theoretical framework and safety guarantees have wider applicability and deeper foundational significance.
AIBuildAI-2 addresses the broader and more impactful problem of automating AI model development with a knowledge-enhanced agent system, demonstrating state-of-the-art results on established benchmarks (MLE-Bench). Its potential to democratize AI for scientific discovery across biology, physics, and chemistry gives it wider cross-disciplinary impact. While Paper 2 presents a novel contribution to knowledge editing with the CODE framework and addresses an important problem (epistemic dissonance), its scope is narrower, focusing specifically on LLM fact updating. Paper 1's practical applications and broader accessibility implications give it higher estimated impact.
AIBuildAI-2 addresses a fundamental bottleneck in AI adoption across all sciences by automating model creation. Its evolving knowledge system and strong empirical performance democratize AI access for non-experts, offering a broader and more transformative real-world impact across diverse scientific fields compared to ZipRL's more specialized, albeit valuable, technical contribution to context compression.
AIBuildAI-2 addresses a broader and more impactful problem—democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent achieves state-of-the-art results on MLE-Bench (70.7% medal rate) and demonstrates real-world competitiveness against human experts. The self-evolving knowledge system is a novel contribution with wide applicability. Paper 2, while technically interesting in revealing refusal signals in intermediate activations and offering efficiency gains for adversarial attacks, addresses a narrower problem in LLM safety/red-teaming with more limited breadth of impact across fields.
AIBuildAI-2 addresses the broader and more transformative problem of democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent with an evolving knowledge system is highly novel and demonstrates strong empirical results (first on MLE-Bench, top 6.6% against human experts). The potential real-world impact spans multiple scientific fields (biology, physics, chemistry). While Paper 1 makes solid contributions to retrieval-augmented reasoning with its self-bootstrapping paradigm, its scope is narrower (multi-hop QA). Paper 2's breadth of application and potential to accelerate scientific discovery gives it higher estimated impact.
Paper 1 has higher potential impact due to broader applicability and timeliness: an agentic, knowledge-enhanced system for automatically building AI models can accelerate ML development across many scientific domains. Its hierarchical, continually evolving external knowledge base is a notable innovation over static LLM capability, and the strong competitive results (SOTA on MLE-Bench; top percentile vs human teams) support methodological effectiveness. Paper 2 is valuable and well-engineered but is narrower in scope (diagram completion/generation) and likely impacts a smaller set of workflows compared to automating end-to-end model building.
Paper 2 likely has higher impact: it introduces a verifier-grounded paradigm for computer-use agents with app-specific state verifiers, self-improving verification, task synthesis, and an auditable evaluation harness across 33 apps/1,000 tasks. This addresses a core, timely bottleneck—reliable, reproducible evaluation and supervision for agentic interaction with real software—enabling benchmarking, reward design, and safety auditing across many downstream methods. Paper 1 is strong and application-relevant, but primarily advances an AutoML/agent engineering system whose gains may be more incremental and domain-specific compared with a broadly enabling verification/evaluation infrastructure.
Paper 1 offers a more directly impactful and timely contribution: a knowledge-augmented agent that demonstrably achieves state-of-the-art performance on a widely used benchmark (MLE-Bench) and strong real-world competition results, with an evolving external knowledge system that can generalize across many AI-model-building tasks. Its applications span multiple scientific domains by lowering barriers for non-experts to build models. Paper 2 is methodologically interesting for deployment control of compact agents, but appears narrower in scope and validation, targeting reliability/cost rather than broad automation gains.
AIBuildAI-2 addresses a broader and more transformative problem—automating AI model development for non-experts, particularly scientists—with strong empirical validation (top rankings on MLE-Bench and competitive with human experts). Its knowledge-enhanced agent architecture with evolving knowledge systems has wider cross-disciplinary impact potential. While COSE makes a solid contribution to self-evolving LLMs with confidence-based learning, it represents a more incremental improvement within the LLM training paradigm. AIBuildAI-2's practical applications for democratizing AI across scientific domains give it higher potential impact.
Paper 2 has significantly broader potential impact across multiple scientific fields by democratizing AI model development for non-experts. While Paper 1 presents a valuable methodological improvement for a specific medical application (IBD detection), Paper 2 addresses a fundamental bottleneck in AI-driven scientific discovery (AutoML via LLM agents) and demonstrates state-of-the-art performance on a major benchmark (MLE-Bench), indicating a much higher ceiling for cross-disciplinary utility and transformative effect.
AIBuildAI-2 demonstrates substantial practical impact by achieving state-of-the-art results on MLE-Bench (70.7% medal rate) and competitive performance against human experts. It addresses a broadly important problem—democratizing AI model development for non-experts, especially scientists—with a novel knowledge-enhanced agent architecture featuring hierarchical, evolving knowledge systems. Paper 2 provides interesting mechanistic interpretability insights about cultural binding heads, but its scope is narrower (cultural appropriation in LLMs), the improvements are modest (1 to 3 percentage points), and the practical applications are more limited. Paper 1's broader applicability across scientific domains gives it higher potential impact.
Paper 1 has a significantly broader potential impact across multiple scientific disciplines by democratizing and automating AI model development for researchers without specialized AI expertise. While Paper 2 presents valuable methodological advancements in conversational agent memory, Paper 1's approach directly addresses a critical bottleneck in modern scientific discovery, offering widespread real-world utility and demonstrating impressive empirical results on challenging benchmarks.
AIBuildAI-2 addresses the high-impact problem of democratizing AI model development for non-experts, particularly scientists. Its knowledge-enhanced agent with an evolving knowledge system demonstrates strong empirical results (first on MLE-Bench, top 6.6% against human experts), showing practical utility. The approach has broad cross-disciplinary impact by enabling scientists in biology, physics, and chemistry to build AI models without specialized expertise. Paper 1, while addressing an important theoretical problem of norm-guided planning, has narrower scope and less demonstrated real-world impact, with evaluation limited to a dialogue task.
Paper 1 has higher potential impact due to broader applicability and timeliness: an evolving, external-knowledge-enhanced agent for automated ML engineering can benefit many domains beyond AI (biology, chemistry, physics) and addresses a major bottleneck in practice. The reported strong benchmark/competition performance suggests practical relevance. Paper 2 is a solid, domain-specific plug-and-play reconstruction method for dental CBCT with evidence largely from synthetic data and qualitative real-image results; its impact is likely narrower (medical imaging) and less broadly transformative despite real-world utility.
Paper 1 presents a highly impactful tool capable of democratizing AI development for natural scientists, directly accelerating cross-disciplinary scientific discovery. Its evolving, hierarchical knowledge system addresses significant bottlenecks in automated model building, offering broader real-world applications and wider impact across multiple fields compared to the LLM-specific algorithmic improvements in Paper 2.
AIBuildAI-2 addresses the broad challenge of automating AI model development, with demonstrated state-of-the-art results on established benchmarks (MLE-Bench) and real competitions. Its knowledge-enhanced agent framework with self-evolving knowledge systems has wide applicability across all fields using AI, potentially democratizing AI for non-experts. While TIGER makes a solid contribution to enzyme-reaction retrieval in computational biology, its scope is narrower. AIBuildAI-2's breadth of impact across scientific disciplines, practical utility, and novel self-improving knowledge architecture give it higher potential impact.
Paper 2 is likely to have higher scientific impact because it introduces a principled, broadly applicable evaluation framework (strictly proper trajectory-level scoring rules) with formal elicitation guarantees, addressing a foundational measurement problem in agentic uncertainty. Its methodological rigor (theorems, censored-trajectory extension) and clarity about what existing metrics do/do not elicit make it useful across many agent settings and benchmarks, improving comparability and scientific validity of UQ research. Paper 1 is practically impactful, but its contribution is more system/engineering-centric and may date faster as agent-building tools evolve.
Paper 1 presents a method to automate AI model building, which has profound implications for accelerating scientific discovery across multiple disciplines by lowering the barrier to entry for natural scientists. Its evolving, knowledge-enhanced architecture addresses a major limitation in current autonomous AI agents. While Paper 2 offers a rigorous approach to benchmarking enterprise agents, Paper 1's potential to democratize AI development and its state-of-the-art performance on MLE-Bench give it a significantly broader and deeper scientific impact.