Alberto Pepe, Chien-Yu Lin, Despoina Magka, Bilge Acun, Yannan Nellie Wu, Anton Protopopov, Carole-Jean Wu, Yoram Bachrach
Toward recursive self-improvement, we investigate LLM agents autonomously designing foundation models beyond standard Transformers. We introduce a dual-framework approach: AIRA-Compose for high-level architecture search, and AIRA-Design for low-level mechanistic implementation. AIRA-Compose uses 11 agents to explore fundamental computational primitives under a 24-hour budget. Agents evaluate million-parameter candidates, extrapolating top designs to 350M, 1B, and 3B scales. This yields 14 architectures across two families: AIRAformers (Transformer-based) and AIRAhybrids (Transformer-Mamba). Pre-trained at 1B scale, these consistently outperform Llama 3.2 and Composer-found baselines. On downstream tasks, AIRAformer-D and AIRAhybrid-D improve accuracy by 2.4% and 3.8% over Llama 3.2. Furthermore, AIRA-Compose finds models with highly efficient scaling frontiers: AIRAformer-C scales 54% and 71% faster than Llama 3.2 and Composer's best Transformer, while AIRAhybrid-C outscales Nemotron-2 by 23% and Composer's best hybrid by 37%. AIRA-Design tasks 20 agents with writing novel attention mechanisms for long-range dependencies and high-performing training scripts. On the Long Range Arena benchmark, agent-designed architectures reach within 2.3% and 2.6% of human state-of-the-art on document matching and text classification. On the Autoresearch benchmark, Greedy Opus 4.5 achieves 0.968 validation bits-per-byte under a fixed time budget, surpassing the published minimum. Together, these frameworks show AI agents can autonomously discover architectures and algorithmic optimizations matching or surpassing hand-designed baselines. This establishes a powerful paradigm for discovering next-generation foundation models, marking a clear step toward recursive self-improvement.
This paper introduces a dual-framework approach for LLM agents to autonomously discover neural architectures, framed as a step toward recursive self-improvement. AIRA-Compose recasts high-level neural architecture search (NAS) as an agentic task where 11 agents arrange predefined computational primitives (Attention, MLP, Mamba) into 16-layer architectures, evaluated at small scale and extrapolated to 350M–3B parameters. AIRA-Design tasks up to 20 agents with writing novel attention mechanisms from scratch (Long Range Arena benchmark) and optimizing training scripts (Autoresearch benchmark).
The key novelty lies not in any single architectural discovery but in the paradigm: demonstrating that LLM agents can serve as effective neural architecture search engines, leveraging their domain knowledge to navigate combinatorial spaces more meaningfully than traditional Bayesian or evolutionary search methods. The paper yields 14 novel architectures (AIRAformers and AIRAhybrids) that outperform Llama 3.2 and Composer-found baselines.
This work addresses a timely convergence of two trends: (1) the shift toward hybrid post-Transformer architectures, and (2) the emergence of capable agentic systems. The combinatorial explosion of hybrid design spaces (about 43M possibilities for 3 primitives in 16 layers) makes automated search increasingly necessary. The framing around recursive self-improvement aligns with a central concern in AI safety and capabilities research, though the actual contribution is more modest: agents discover architectures but do not yet improve themselves.
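The 43M figure quoted for the hybrid design space can be checked with a short calculation. The sketch below assumes that each of the 16 layers independently takes one of the three primitives with no further constraints, which is a reading of the summary and not necessarily the paper's exact search-space definition; the names `PRIMITIVES`, `NUM_LAYERS`, and `search_space_size` are illustrative.

```python
from itertools import product

# Primitive names as listed in the summary; the unconstrained
# one-primitive-per-layer layout is an assumption for illustration.
PRIMITIVES = ("Attention", "MLP", "Mamba")
NUM_LAYERS = 16


def search_space_size(num_primitives: int, num_layers: int) -> int:
    """Count layer-wise arrangements when every layer picks one primitive."""
    return num_primitives ** num_layers


total = search_space_size(len(PRIMITIVES), NUM_LAYERS)
print(f"{total:,} candidate architectures")  # 3**16 = 43,046,721

# Sanity check: brute-force enumeration at a small depth matches the formula.
small = sum(1 for _ in product(PRIMITIVES, repeat=4))
assert small == search_space_size(len(PRIMITIVES), 4)
```

Under this assumption the count is 3^16 = 43,046,721, which matches the rounded 43M in the text. Any placement rules on the primitives would shrink the space.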
The paper effectively reveals a capability stratification among LLMs: Opus 4.6 and Gemini 3 Pro consistently outperform other agents, while weaker models (CWM, GPT-4o) struggle significantly, especially on open-ended design tasks. This meta-finding about which LLMs are effective research agents has independent value. The detailed feature analysis in the Autoresearch appendix (Tables 19-21) provides useful insights into which architectural and training modifications matter most under constrained compute budgets.
Generated May 18, 2026
Paper 2 presents a highly ambitious step toward AI recursive self-improvement by using agents to autonomously design neural architectures. Demonstrating that agent-discovered models can outperform strong baselines like Llama 3.2 at significant scales (up to 3B parameters) suggests a paradigm shift in how foundation models are developed. While Paper 1 offers a valuable algorithmic optimization for inference-time parallel reasoning, Paper 2 has much broader implications, greater novelty, and the potential to fundamentally disrupt standard model engineering across the entire AI field.
While Paper 1 offers a strong, theoretically grounded approach for physical scientific discovery, Paper 2 demonstrates a highly impactful step toward recursive self-improvement in AI. By enabling autonomous agents to design foundation models that outperform state-of-the-art baselines like Llama 3.2, Paper 2 accelerates the fundamental engine of AI research. This could yield compounding advancements across all fields reliant on machine learning, granting it a broader and more transformative long-term scientific impact.
Paper 1 explores the highly impactful frontier of recursive self-improvement, demonstrating that LLM agents can autonomously design novel architectures that outperform strong baselines like Llama 3.2. This agentic discovery paradigm has profound implications for automating AI research and accelerating foundation model development, offering higher potential real-world impact and novelty compared to the specific RL optimization technique proposed in Paper 2.
Paper 2 has higher potential impact due to greater novelty (agentic, end-to-end architecture and mechanism discovery beyond standard Transformers), stronger real-world implications (new scalable foundation-model families and training improvements), and broader cross-field relevance (NAS, systems, scaling laws, long-context modeling, automated research). If results and comparisons hold, it could directly influence future model design and accelerate progress toward automated AI R&D. Paper 1 is timely and valuable for trustworthy evaluation, but its impact is primarily infrastructural/benchmarking and likely more incremental than a paradigm-shifting method for discovering new model architectures.
Paper 1 presents a highly impactful approach to automated neural architecture search, demonstrating progress toward recursive self-improvement where AI designs better AI. Its discovery of architectures that outperform strong baselines like Llama 3.2 at the 1B scale has broad implications for foundation model development across all domains. In contrast, Paper 2 offers a valuable but narrower contribution regarding agent memory, explicitly noting its evidence is confined to a single specific benchmark (CAGE-2 B-line). Paper 1's generalizability and potential to shift the paradigm of foundation model design give it significantly higher scientific impact.
Paper 1 likely has higher scientific impact due to greater novelty and broader potential impact: it proposes an agentic, end-to-end framework for discovering new foundation-model architectures and mechanisms, reporting improvements over strong baselines and scaling-frontier gains—results that could influence core model design across NLP and ML systems. Its applications (architecture search, training optimization, long-context mechanisms) are widely relevant and timely. Paper 2 is valuable and practical for EDA/RTL benchmark maintenance and reproducibility, but its impact is narrower (benchmark curation) and more incremental relative to the field’s central algorithmic advances.
Paper 2 has higher potential impact due to greater novelty (agentic, multi-level automated architecture and mechanism design beyond standard Transformers), broader applicability (general-purpose foundation model improvements affecting many domains), and strong real-world relevance (efficiency/scaling frontier gains and competitive downstream performance). Its paradigm could influence ML research methodology itself (automated discovery/recursive improvement). Paper 1 is timely and impressive but is primarily a training/scaling recipe for olympiad reasoning on an existing backbone, with narrower cross-field impact and less foundational methodological shift.
Paper 1 demonstrates significant progress toward recursive self-improvement in AI by autonomously discovering neural architectures that outperform state-of-the-art baselines like Llama 3.2. This work has immediate, broad implications for the development of future foundation models and scaling laws, impacting the entire field of AI. Paper 2, while offering an interesting formalization of moral psychology, addresses a more niche intersection of cognitive science and AI alignment, making its overall scientific impact and broad applicability less transformative than Paper 1.
Paper 2 demonstrates concrete, empirical results showing LLM agents can autonomously discover neural architectures that outperform hand-designed baselines like Llama 3.2. It introduces a practical dual-framework approach (AIRA-Compose and AIRA-Design) with measurable improvements in scaling efficiency and downstream task performance. This addresses the highly timely topic of recursive self-improvement and automated ML, with immediate practical applications in foundation model design. Paper 1, while intellectually interesting as a position paper advocating metacognitive AI, is more conceptual, with limited empirical validation beyond a single FL case study.
Paper 2 has higher potential scientific impact due to broader cross-field implications (automated discovery of foundation-model architectures and training mechanisms), strong real-world applicability (improving model performance and scaling efficiency), and timeliness given rapid progress in agentic AI and next-gen architectures. It reports multi-scale evaluations (to 3B), comparative gains over strong baselines, and introduces a generalizable paradigm (agentic architecture/mechanism design) that could influence ML research, systems, and automated R&D. Paper 1 is novel and rigorous but more domain-specific (educational animation generation).