Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design

Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach

May 15, 2026 · arXiv:2605.15871v1

cs.AI

#295 of 3672 · Artificial Intelligence

Tournament Score: 1510 ± 44

Win Rate: 88%

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 6

Abstract

Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Agentic Discovery of Neural Architectures (AIRA-Compose and AIRA-Design)

1. Core Contribution

This paper introduces a dual-framework approach for LLM agents to autonomously discover neural architectures, framed as a step toward recursive self-improvement. AIRA-Compose recasts high-level neural architecture search (NAS) as an agentic task where 11 agents arrange predefined computational primitives (Attention, MLP, Mamba) into 16-layer architectures, evaluated at small scale and extrapolated to 350M–3B parameters. AIRA-Design tasks up to 20 agents with writing novel attention mechanisms from scratch (Long Range Arena benchmark) and optimizing training scripts (Autoresearch benchmark).
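To make the AIRA-Compose search object concrete, here is a minimal sketch of how a 16-layer candidate built from the three named primitives might be encoded. The tuple encoding, the function names, and the uniform random sampler (a stand-in for an agent's proposal) are all assumptions for illustration; the paper's actual task interface is not described in this assessment.

```python
import random

# Primitive names follow the assessment; the encoding itself is an assumption.
PRIMITIVES = ("attention", "mlp", "mamba")
NUM_LAYERS = 16


def is_valid(arch):
    """Check that arch is a 16-slot sequence of known primitives."""
    return len(arch) == NUM_LAYERS and all(p in PRIMITIVES for p in arch)


def random_architecture(rng):
    """Draw one candidate uniformly at random (hypothetical stand-in for an agent)."""
    return tuple(rng.choice(PRIMITIVES) for _ in range(NUM_LAYERS))


def primitive_counts(arch):
    """Count how often each primitive appears in a candidate."""
    return {p: arch.count(p) for p in PRIMITIVES}


rng = random.Random(0)
candidate = random_architecture(rng)
print(candidate)
print(primitive_counts(candidate))
```

An LLM agent would replace the random sampler with proposals informed by earlier evaluation results, which is where its domain knowledge would enter the search.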

The key novelty lies not in any single architectural discovery but in the paradigm: demonstrating that LLM agents can serve as effective neural architecture search engines, leveraging their domain knowledge to navigate combinatorial spaces more meaningfully than traditional Bayesian or evolutionary search methods. The paper yields 14 novel architectures (AIRAformers and AIRAhybrids) that outperform Llama 3.2 and Composer-found baselines.

2. Methodological Rigor

Strengths in experimental design:

Extensive scale of experiments: 340 runs at 24-hour budgets, 300 at 60-hour budgets for AIRA-Compose; 1,680 runs for LRA; 100 runs for Autoresearch

Multi-seed evaluation (3 seeds for 1B isotoken, multiple seeds per agent configuration)

Three complementary evaluation protocols: isotoken (fixed token budget), isoFLOP (fixed compute), and downstream task accuracy

Comparison against strong baselines including Llama 3.2, approximated Nemotron-2/H, and Composer-found architectures
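The difference between the isotoken and isoFLOP protocols listed above can be illustrated with the common rule of thumb that dense Transformer training costs roughly 6 · N · D FLOPs for N parameters and D tokens. This is only a sketch: the 6ND approximation is an assumption that may not match the paper's accounting (particularly for Mamba layers), and both budgets below are hypothetical values, not figures from the paper.

```python
# Model sizes named in the assessment; the budgets further down are hypothetical.
MODEL_SIZES = {"350M": 350e6, "1B": 1e9, "3B": 3e9}


def train_flops(params, tokens):
    """Approximate training FLOPs with the 6 * N * D rule of thumb."""
    return 6.0 * params * tokens


def isotoken_flops(token_budget):
    """Fixed token budget: compute grows with model size."""
    return {name: train_flops(n, token_budget) for name, n in MODEL_SIZES.items()}


def isoflop_tokens(flop_budget):
    """Fixed compute budget: larger models see fewer tokens."""
    return {name: flop_budget / (6.0 * n) for name, n in MODEL_SIZES.items()}


HYPOTHETICAL_TOKENS = 20e9  # assumed for illustration, not from the paper
HYPOTHETICAL_FLOPS = 1e20   # assumed for illustration, not from the paper

for name, flops in isotoken_flops(HYPOTHETICAL_TOKENS).items():
    print(f"isotoken {name}: {flops:.2e} FLOPs")
for name, tokens in isoflop_tokens(HYPOTHETICAL_FLOPS).items():
    print(f"isoFLOP  {name}: {tokens:.2e} tokens")
```

Under this approximation, isotoken comparisons favor whichever architecture learns more per token, while isoFLOP comparisons trade model size against data, which is why the two protocols can rank architectures differently.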

Methodological concerns:

The aggregation and extrapolation steps (converting 16-layer proxy architectures to large scale) remain non-agentic and rely on heuristics (k-means clustering, stacking/stretching). This somewhat undermines the "autonomous discovery" narrative.

The Nemotron-2 and Nemotron-H baselines are only approximated (MoEs replaced with MLPs), making direct comparison imperfect.

For the 3-primitive experiments, results are from single seeds at 1B scale, reducing statistical confidence.

The small-scale proxy evaluation doesn't always predict large-scale performance reliably, as acknowledged by the authors.

The isoFLOP scaling analysis extrapolates frontiers from only 3 model sizes, which is limited for robust scaling law estimation.
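The first concern above names "stacking" and "stretching" as extrapolation heuristics without defining them. One plausible reading, sketched below purely as an assumption, is that stacking repeats the whole proxy pattern end to end, while stretching repeats each layer in place. The function names and the toy 4-layer pattern are hypothetical and are not taken from the paper.

```python
def stack(pattern, target_depth):
    """Repeat the whole pattern end to end until target_depth is reached."""
    if target_depth % len(pattern) != 0:
        raise ValueError("target depth must be a multiple of the pattern length")
    return list(pattern) * (target_depth // len(pattern))


def stretch(pattern, target_depth):
    """Repeat each layer in place so relative positions are preserved."""
    if target_depth % len(pattern) != 0:
        raise ValueError("target depth must be a multiple of the pattern length")
    factor = target_depth // len(pattern)
    return [layer for layer in pattern for _ in range(factor)]


# Toy 4-layer pattern standing in for a 16-layer proxy architecture.
proxy = ["attention", "mlp", "mamba", "mlp"]
print(stack(proxy, 8))
print(stretch(proxy, 8))
```

Both rules are fixed, hand-written transformations, which illustrates why this step falls outside what the agents themselves decide.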

3. Potential Impact

Direct applications:

The AIRS-Bench task formulation provides a reproducible, extensible framework for evaluating agentic architecture search capabilities

The discovered architectures themselves (particularly AIRAhybrid-D with its 3.8% accuracy gain over Llama 3.2) could inform production model design

The scaling frontier improvements (AIRAformer-C scales 54% faster than Llama 3.2 and 71% faster than Composer's best Transformer) are practically significant

Broader influence:

Establishes a concrete methodology for benchmarking recursive self-improvement capabilities

Demonstrates that LLM agents can productively explore the hybrid architecture design space, which is too large for manual exploration

The framework is LLM-agnostic and scaffold-agnostic, enabling future extension

Limitations on impact:

The AIRA-Design results are more modest: agents reached within 2.3-2.6% of human SOTA on LRA, and the authors candidly note that discovered architectures "largely recombine and adapt ideas from prior work" rather than introducing novel theoretical insights

The gap between best-explored and submitted models on LRA reveals agent selection deficiencies

One-shot agents produced zero valid submissions on LRA tasks, indicating that the results depend heavily on the iterative scaffolding

4. Timeliness & Relevance

This work addresses a timely convergence of two trends: (1) the shift toward hybrid post-Transformer architectures, and (2) the emergence of capable agentic systems. The combinatorial explosion of hybrid design spaces (43M possibilities for 3 primitives in 16 layers) makes automated search increasingly necessary. The framing around recursive self-improvement aligns with a central concern in AI safety and capabilities research, though the actual contribution is more modest—agents discover architectures but don't yet improve themselves.
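The 43M figure follows from simple counting, assuming each of the 16 layers is chosen independently from the available primitives with no further constraints (an assumption; the paper may restrict the space). The two-primitive count is included only for comparison.

```python
NUM_LAYERS = 16


def search_space_size(num_primitives, num_layers=NUM_LAYERS):
    """Number of distinct layer sequences when each slot is chosen freely."""
    return num_primitives ** num_layers


print(search_space_size(3))  # 43046721, i.e. about 43M
print(search_space_size(2))  # 65536
```

Adding a third primitive enlarges the unconstrained space by a factor of roughly 650, which is the kind of growth that makes exhaustive evaluation impractical.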

5. Strengths & Limitations

Key Strengths:

Scale and thoroughness: The experimental investment is substantial, with hundreds of GPU-hours and thousands of architecture evaluations

Complementary frameworks: AIRA-Compose (constrained search) and AIRA-Design (open-ended generation) provide different lenses on agent capabilities

Honest assessment: The authors are forthright about limitations, noting that agents excel at "engineering-level synthesis" rather than "genuine scientific innovation"

Practical utility: The discovered architectures genuinely outperform baselines, not just marginally but with meaningful improvements in scaling efficiency

Reproducibility: The AIRS-Bench task formulation and detailed appendices enable replication

Notable Weaknesses:

The RSI framing is aspirational—agents search over a pre-specified space of known primitives, which is sophisticated NAS rather than genuine self-improvement

The contribution is partly incremental over Composer (Acun et al., 2025), replacing Bayesian optimization with LLM-guided search

The paper is extremely long (~55 pages) with much of the substance buried in appendices

Limited analysis of *why* agent-discovered architectures work—what structural insights emerge from the attention-heavy patterns?

The Autoresearch comparison is complicated by hardware differences (H200 vs H100) and different interaction paradigms (full-file regeneration vs. incremental editing)

6. Additional Observations

The paper effectively reveals a capability stratification among LLMs: Opus 4.6 and Gemini 3 Pro consistently outperform other agents, while weaker models (CWM, GPT-4o) struggle significantly, especially on open-ended design tasks. This meta-finding about which LLMs are effective research agents has independent value. The detailed feature analysis in the Autoresearch appendix (Tables 19-21) provides useful insights into which architectural and training modifications matter most under constrained compute budgets.

Rating: 6.8/10

Significance 7 · Rigor 6.5 · Novelty 6.5 · Clarity 6

Generated May 18, 2026

Comparison History (25)

Won vs. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

Paper 2 presents a highly ambitious step toward AI recursive self-improvement by using agents to autonomously design neural architectures. Demonstrating that agent-discovered models can outperform strong baselines like Llama 3.2 at significant scales (up to 3B parameters) suggests a paradigm shift in how foundation models are developed. While Paper 1 offers a valuable algorithmic optimization for inference-time parallel reasoning, Paper 2 has much broader implications, greater novelty, and the potential to fundamentally disrupt standard model engineering across the entire AI field.

gemini-3.1-pro-preview·May 28, 2026

Won vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

While Paper 1 offers a strong, theoretically grounded approach for physical scientific discovery, Paper 2 demonstrates a highly impactful step toward recursive self-improvement in AI. By enabling autonomous agents to design foundation models that outperform state-of-the-art baselines like Llama 3.2, Paper 2 accelerates the fundamental engine of AI research. This could yield compounding advancements across all fields reliant on machine learning, granting it a broader and more transformative long-term scientific impact.

gemini-3.1-pro-preview·May 19, 2026

Won vs. Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

Paper 1 explores the highly impactful frontier of recursive self-improvement, demonstrating that LLM agents can autonomously design novel architectures that outperform strong baselines like Llama 3.2. This agentic discovery paradigm has profound implications for automating AI research and accelerating foundation model development, offering higher potential real-world impact and novelty compared to the specific RL optimization technique proposed in Paper 2.

gemini-3.1-pro-preview·May 19, 2026

Won vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Paper 2 has higher potential impact due to greater novelty (agentic, end-to-end architecture and mechanism discovery beyond standard Transformers), stronger real-world implications (new scalable foundation-model families and training improvements), and broader cross-field relevance (NAS, systems, scaling laws, long-context modeling, automated research). If results and comparisons hold, it could directly influence future model design and accelerate progress toward automated AI R&D. Paper 1 is timely and valuable for trustworthy evaluation, but its impact is primarily infrastructural/benchmarking and likely more incremental than a paradigm-shifting method for discovering new model architectures.

gpt-5.2·May 18, 2026

Won vs. FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Paper 1 presents a highly impactful approach to automated neural architecture search, demonstrating progress toward recursive self-improvement where AI designs better AI. Its discovery of architectures that outperform strong baselines like Llama 3.2 at the 1B scale has broad implications for foundation model development across all domains. In contrast, Paper 2 offers a valuable but narrower contribution regarding agent memory, explicitly noting its evidence is confined to a single specific benchmark (CAGE-2 B-line). Paper 1's generalizability and potential to shift the paradigm of foundation model design give it significantly higher scientific impact.

gemini-3.1-pro-preview·May 18, 2026

Won vs. RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

Paper 1 likely has higher scientific impact due to greater novelty and broader potential impact: it proposes an agentic, end-to-end framework for discovering new foundation-model architectures and mechanisms, reporting improvements over strong baselines and scaling-frontier gains—results that could influence core model design across NLP and ML systems. Its applications (architecture search, training optimization, long-context mechanisms) are widely relevant and timely. Paper 2 is valuable and practical for EDA/RTL benchmark maintenance and reproducibility, but its impact is narrower (benchmark curation) and more incremental relative to the field’s central algorithmic advances.

gpt-5.2·May 18, 2026

Won vs. Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Paper 2 has higher potential impact due to greater novelty (agentic, multi-level automated architecture and mechanism design beyond standard Transformers), broader applicability (general-purpose foundation model improvements affecting many domains), and strong real-world relevance (efficiency/scaling frontier gains and competitive downstream performance). Its paradigm could influence ML research methodology itself (automated discovery/recursive improvement). Paper 1 is timely and impressive but is primarily a training/scaling recipe for olympiad reasoning on an existing backbone, with narrower cross-field impact and less foundational methodological shift.

gpt-5.2·May 18, 2026

Won vs. An Algebraic Exposition of the Theory of Dyadic Morality

Paper 1 demonstrates significant progress toward recursive self-improvement in AI by autonomously discovering neural architectures that outperform state-of-the-art baselines like Llama 3.2. This work has immediate, broad implications for the development of future foundation models and scaling laws, impacting the entire field of AI. Paper 2, while offering an interesting formalization of moral psychology, addresses a more niche intersection of cognitive science and AI alignment, making its overall scientific impact and broad applicability less transformative than Paper 1.

gemini-3.1-pro-preview·May 18, 2026

Won vs. Position: Artificial Intelligence Needs Meta Intelligence -- the Case for Metacognitive AI

Paper 2 demonstrates concrete, empirical results showing LLM agents can autonomously discover neural architectures that outperform hand-designed baselines like Llama 3.2. It introduces a practical dual-framework (AIRA-Compose and AIRA-Design) with measurable improvements in scaling efficiency and downstream task performance. This addresses the highly timely topic of recursive self-improvement and automated ML, with immediate practical applications in foundation model design. Paper 1, while intellectually interesting as a position paper advocating metacognitive AI, is more conceptual with limited empirical validation beyond a single FL case study.

claude-opus-4-6·May 18, 2026

Won vs. See Before You Code: Learning Visual Priors for Spatially Aware Educational Animation Generation

Paper 2 has higher potential scientific impact due to broader cross-field implications (automated discovery of foundation-model architectures and training mechanisms), strong real-world applicability (improving model performance and scaling efficiency), and timeliness given rapid progress in agentic AI and next-gen architectures. It reports multi-scale evaluations (to 3B), comparative gains over strong baselines, and introduces a generalizable paradigm (agentic architecture/mechanism design) that could influence ML research, systems, and automated R&D. Paper 1 is novel and rigorous but more domain-specific (educational animation generation).

gpt-5.2·May 18, 2026
