AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation
Shanghua Gao, Ada Fang, Marinka Zitnik
Abstract
Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).
AI Impact Assessments
Scientific Impact Assessment: AutoScientists
1. Core Contribution
AutoScientists introduces a decentralized multi-agent framework for long-running computational scientific experimentation. The key novelty lies in replacing centralized orchestration (fixed planners, role hierarchies, or consensus-driven convergence) with a self-organizing team structure where agents independently interpret shared experimental state, form and reorganize teams around promising hypotheses, critique proposals before execution, and share both successes and failures. The system addresses a genuine limitation of existing AI-for-science agents: the inability to sustain parallel exploration, adapt to shifting productive directions, and preserve institutional knowledge across extended experimental campaigns.
The architecture has several distinct design elements: (1) dynamic team formation through structured discussion rather than predetermined decomposition; (2) analyst/experiment agent role separation; (3) a shared forum with cross-team visibility including dead-end registries; (4) noise-aware champion promotion gates; and (5) stagnation-triggered reorganization. These collectively enable what the authors frame as "conference-style" knowledge sharing among agents.
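To make two of these design elements concrete, the following is a minimal sketch of a noise-aware promotion gate and a dead-end registry. It is not the authors' implementation: the assessment does not specify the gating rule, so the threshold rule (gain must exceed k times the run-to-run standard deviation of the champion), the function name should_promote, the class DeadEndRegistry, and the parameter k are all assumptions made for illustration.

```python
import statistics


def should_promote(champion_scores, candidate_scores, k=2.0):
    """Hypothetical noise-aware gate (higher score is better).

    Assumption: the champion has been re-run at least twice, so its
    run-to-run noise can be estimated from the sample standard deviation.
    The candidate is promoted only if its mean gain exceeds k * noise.
    """
    noise = statistics.stdev(champion_scores)
    gain = statistics.mean(candidate_scores) - statistics.mean(champion_scores)
    return gain > k * noise


class DeadEndRegistry:
    """Hypothetical shared record of failed directions, keyed by a short tag."""

    def __init__(self):
        self._entries = {}

    def record(self, tag, reason):
        # Store why a direction failed so other teams can skip it.
        self._entries[tag] = reason

    def is_dead_end(self, tag):
        return tag in self._entries

    def reason(self, tag):
        return self._entries.get(tag)


if __name__ == "__main__":
    champion = [0.80, 0.81, 0.79]
    print(should_promote(champion, [0.805, 0.810]))  # gain is within noise
    print(should_promote(champion, [0.85, 0.86]))    # gain is well above noise

    registry = DeadEndRegistry()
    registry.record("lr-warmup-x10", "diverged in every run")
    print(registry.is_dead_end("lr-warmup-x10"))
```

Under this rule a candidate whose gain sits inside the champion's own variance is rejected, which is the behavior a noise-aware gate is meant to provide; the actual criterion in the paper may differ.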
2. Methodological Rigor
The evaluation is comprehensive, spanning three distinct domains: BioML-Bench (24 biomedical ML tasks), GPT nanochat training optimization, and ProteinGym protein fitness prediction. The comparison methodology is generally sound, with matched experimental compute budgets being the key controlled variable.
Strengths in evaluation design:
Weaknesses:
3. Potential Impact
Immediate applications: The framework could accelerate computational scientific workflows in drug discovery, protein engineering, and ML architecture search. The BioML-Bench drug discovery improvement (+18.4 leaderboard percentile points over Biomni) and the ProteinGym SOTA improvement (+6.5% Spearman correlation across 217 assays) suggest practical value.
Broader influence: The paper contributes to the growing understanding of how to organize multi-agent LLM systems for complex, long-horizon tasks. The self-organization principle—where agents determine their own coordination structure through discussion rather than following predetermined workflows—represents a meaningful architectural contribution that could transfer beyond scientific experimentation to other domains requiring sustained collaborative search.
Scientific discovery: The ProteinGym result is notable: AutoScientists-Kermut discovers an ensemble of three Gaussian processes (GPs) with quantile-warped targets, greedy diversity-based feature selection, and expanded zero-shot features that constitutes a genuine methodological contribution to protein fitness prediction.
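Two of the named components can be illustrated with a small sketch. The assessment only names them, so the details here are assumptions: quantile warping is shown as a rank-based map of targets onto standard-normal quantiles, and greedy diversity-based selection is shown as repeatedly adding the feature with the smallest maximum absolute Pearson correlation to those already chosen. The function names quantile_warp, pearson, and greedy_diverse_features are hypothetical, and the discovered method may define both steps differently.

```python
from statistics import NormalDist


def quantile_warp(targets):
    """Map targets to standard-normal quantiles by rank (assumed variant).

    Ties are broken by input order rather than averaged.
    """
    n = len(targets)
    order = sorted(range(n), key=lambda i: targets[i])
    std_normal = NormalDist()
    warped = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n stays strictly inside (0, 1), so inv_cdf is finite.
        warped[idx] = std_normal.inv_cdf((rank + 0.5) / n)
    return warped


def pearson(a, b):
    """Pearson correlation; assumes neither input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def greedy_diverse_features(features, k):
    """Pick up to k feature names, starting from the first one.

    At each step, add the remaining feature whose largest absolute
    correlation with the already selected features is smallest.
    """
    names = list(features)
    selected = [names[0]]
    while len(selected) < min(k, len(names)):
        remaining = [name for name in names if name not in selected]
        best = min(
            remaining,
            key=lambda name: max(
                abs(pearson(features[name], features[s])) for s in selected
            ),
        )
        selected.append(best)
    return selected


if __name__ == "__main__":
    print(quantile_warp([10.0, 0.0, 5.0]))
    toy = {
        "a": [1.0, 2.0, 3.0, 4.0],
        "b": [2.0, 4.0, 6.0, 8.1],   # nearly a rescaled copy of "a"
        "c": [1.0, -1.0, 1.0, -1.0],
    }
    print(greedy_diverse_features(toy, 2))
```

In the toy example, feature "b" is almost a rescaled copy of "a", so the greedy step prefers "c". Warping makes the targets closer to Gaussian, which is a common motivation when the downstream model is a GP with a Gaussian likelihood.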
4. Timeliness & Relevance
This work arrives at a critical juncture. AI agents for science are proliferating rapidly (the paper cites numerous 2025-2026 works), but most remain single-trajectory or rely on fixed orchestration. The challenge of long-horizon experimental search—where productive directions shift as evidence accumulates—is a genuine bottleneck that becomes more pressing as compute budgets for AI-driven experimentation grow. The paper directly addresses the scaling question: how do we make more agents productively collaborate rather than duplicating effort?
The connection to Karpathy's Autoresearch provides a timely and visible baseline, and the BioML-Bench evaluation situates the work within the emerging standardization of AI-for-science benchmarking.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional observations: The paper is thorough but extremely long (39+ pages with appendices). The algorithmic details in the appendix (Algorithms 1-4, noise-aware gating, analyst proposal protocol) demonstrate engineering depth but also reveal significant complexity that may limit adoption. The distinction between what the LLM decides versus what is hardcoded in the protocol could be clearer—the "self-organization" involves substantial scaffolding.
Generated May 28, 2026
Comparison History (27)
Paper 2 likely has higher scientific impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrates strong empirical gains across diverse, high-value domains (biomed ML, LLM training optimization, protein fitness) with clear real-world utility for accelerating computational experimentation. Its cross-domain benchmarks and sizable improvements suggest immediate adoption potential and wide spillover across fields. Paper 1 is conceptually novel with a compelling limitation theorem and a targeted workaround, but its demonstrated impact is narrower (causal discovery benchmarks) and may be less broadly deployable than an automated experimentation system.
Paper 2 has higher likely scientific impact because it identifies and formalizes a broadly applicable failure mode (reward bias substitution) in RLHF/preference optimization, provides impossibility-style results showing standard audits can’t distinguish mitigation success from substitution, and offers principled evaluation prescriptions. This targets a timely, safety-critical bottleneck for deployed LLMs and affects many mitigation/benchmarking efforts across alignment, evaluation, and ML theory. Paper 1 is strong empirically and useful, but its agentic system contribution is more incremental and may depend on engineering choices; Paper 2’s conceptual framework is more general and field-shaping.
Paper 2 presents a novel, decentralized AI agent system capable of autonomously conducting scientific research across diverse domains. By accelerating the scientific discovery process itself and demonstrating substantial empirical improvements in areas like drug discovery, biomedical imaging, and protein engineering, its potential breadth of impact and methodological breakthroughs far exceed the framework proposals in Paper 1, which, while valuable for equitable AI deployment, are narrower in scientific scope.
Paper 1 has higher estimated scientific impact due to stronger novelty (decentralized self-organizing scientific agent teams for long-running experimentation), broader cross-domain relevance (biomed ML, LLM training optimization, protein fitness), and substantially more rigorous empirical validation with matched budgets and quantitative gains across large benchmarks (BioML-Bench, ProteinGym) including new SOTA improvements. Its contributions generalize to automated scientific discovery, a timely and high-impact area. Paper 2 is promising and practical for streaming analytics, but appears more systems-integration oriented with less demonstrated methodological/benchmark rigor and narrower scientific reach.
Paper 2 likely has higher impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrably improves performance across multiple substantive domains (biomedical ML, LM training optimization, and protein fitness prediction), suggesting wide real-world utility and cross-field influence. The reported gains are sizable, evaluated on diverse benchmarks and long-running settings, and the approach is timely given growing interest in autonomous scientific discovery. Paper 1 is novel and important for AI safety evaluation/architecture, but its immediate impact is narrower (diagnostic protocol + validator) and more confined to safety-alignment research.
Paper 1 introduces a broadly applicable, decentralized AI agent framework for automating scientific discovery across diverse, high-impact domains like biomedicine, LLM training, and protein engineering. It achieves state-of-the-art results and tackles a highly timely problem (AI for science). Paper 2 offers specialized algorithmic improvements for a specific chaotic system forecasting benchmark, resulting in a much narrower potential scientific impact.
AutoScientists presents a novel, rigorously evaluated multi-agent framework for autonomous scientific experimentation with strong empirical results across multiple domains (biomedical ML, LLM training, protein fitness prediction). It demonstrates clear quantitative improvements over prior methods and addresses a timely, high-impact problem in AI-driven scientific discovery. Paper 2 is a philosophical/ethical analysis that, while raising valid points about LLM guardrails and 'reality laundering,' lacks empirical methodology, offers primarily conceptual contributions, and has narrower potential for driving follow-on research or real-world applications.
Paper 2 has higher potential scientific impact due to its direct application to accelerating scientific discovery across diverse fields, including biomedicine and machine learning. While Paper 1 presents a rigorous and novel causal framework for agent safety, Paper 2's ability to automate and improve long-running experimental research cycles offers broader, paradigm-shifting implications for how scientific research is conducted.
AutoScientists presents a novel, empirically validated system for autonomous scientific experimentation with demonstrated improvements across multiple domains (biomedical ML, language model optimization, protein fitness prediction). It addresses a timely challenge in AI-driven science with concrete quantitative results showing significant improvements over prior methods. Paper 2 is a survey/framework paper on cyberbullying governance that synthesizes existing work but lacks novel empirical contributions. AutoScientists' broad applicability to accelerating scientific discovery across fields gives it substantially higher potential impact.
Paper 2 has higher likely scientific impact due to a clearly novel, generalizable systems contribution (decentralized self-organizing agent teams) validated with strong quantitative results across multiple benchmarks and domains (biomed ML, LM training optimization, protein fitness). It is methodologically more rigorous (matched budgets, comparative baselines, broad task coverage) and offers immediate real-world applications in automating and accelerating computational science. Paper 1 is intriguing and timely for AI behavior/safety, but relies heavily on auto-ethnography and first-person AI self-report, limiting reproducibility and evidential strength, hence lower expected impact.
AutoScientists presents a more broadly impactful contribution: a general-purpose framework for autonomous scientific experimentation that demonstrates strong results across multiple diverse domains (biomedical ML, LLM training, protein fitness prediction). Its decentralized multi-agent architecture for sustained scientific exploration is highly novel, and the demonstrated improvements over state-of-the-art in protein engineering (+12.5% on ACE2-Spike binding) represent tangible scientific discoveries. Paper 1, while technically solid, addresses a narrower problem (multi-hop QA training strategies) with more incremental contributions to the retrieval-augmented reasoning community.
Paper 2 (AutoScientists) has higher potential impact due to broader cross-domain applicability (general long-running scientific experimentation vs. a specific e-commerce dispute setting), stronger real-world relevance to accelerating research workflows, and demonstrated gains across multiple substantive benchmarks (BioML-Bench, LM training optimization, ProteinGym) including improvements over state-of-the-art models. Its decentralized, self-organizing agent-team paradigm is timely and potentially reusable across many computational sciences. Paper 1 is novel and valuable, but its impact is narrower (platform dispute adjudication) and more application-specific.
While Paper 1 presents an innovative multi-agent architecture for scientific exploration, Paper 2 addresses a fundamental and critical bottleneck in AI-driven research: verifiability and hallucinations. By introducing the Chain-of-Evidence framework and robust audit mechanisms, Paper 2 ensures that autonomous scientific outputs are trustworthy, reproducible, and grounded. This methodological rigor is essential for the widespread adoption and credibility of AI researchers in the real world, giving it a broader and more foundational scientific impact.
Paper 1 introduces a decentralized AI agent framework that directly accelerates multi-disciplinary scientific discovery. Its demonstrated ability to self-organize and significantly improve state-of-the-art results across diverse and highly impactful domains—such as biomedical machine learning, language model optimization, and protein engineering—suggests a broad, transformational impact across STEM fields. While Paper 2 addresses a critical issue in AI safety, Paper 1's potential to automate and enhance the scientific method itself offers a wider scope of real-world applications and cross-domain innovation.
Paper 1 presents a framework that accelerates the scientific method itself. By successfully automating parallel hypothesis generation and experimentation, it demonstrates state-of-the-art improvements across highly impactful and diverse fields like protein engineering and ML optimization. Its potential to act as a force multiplier for broad scientific discovery gives it a higher overall scientific impact compared to the crucial, yet more narrowly focused, AI safety and sociological findings of Paper 2.
AutoScientists introduces a novel decentralized multi-agent framework for autonomous scientific discovery that demonstrates strong empirical results across diverse domains (biomedical ML, language model optimization, protein fitness prediction), including state-of-the-art improvements. It addresses a fundamental challenge in AI-driven science—sustaining parallel exploration and knowledge preservation over long experiments. Paper 1, while valuable for standardizing LLM agent evaluation, is primarily an engineering/benchmarking contribution. Paper 2's potential to accelerate scientific discovery across multiple fields gives it broader and deeper impact.
Paper 1 introduces a fundamentally new foundation model class for tabular data—the most widely used data modality in industry and science—addressing a longstanding gap in the AI stack. Its native understanding of tabular data without preprocessing is a paradigm shift with enormous breadth of application across virtually all data-driven fields. While Paper 2 presents an impressive multi-agent system for automated scientific experimentation with strong results, it represents an incremental advance in AI-for-science agent orchestration. Paper 1's potential to reshape how all tabular AI systems are built gives it broader and deeper long-term impact.
AutoScientists addresses the broadly impactful problem of automating scientific research with a novel decentralized multi-agent framework. It demonstrates strong empirical results across diverse domains (biomedical ML, LLM training, protein fitness prediction), including a state-of-the-art improvement of +12.5% on protein binding prediction. Its breadth of applicability across scientific fields, practical real-world utility in accelerating research, and timeliness given the AI-for-science trend give it higher potential impact than Paper 1, which advances game-theoretic equilibrium computation—an important but narrower contribution.
Paper 1 likely has higher impact: it introduces a novel decentralized, self-organizing multi-agent framework for long-running scientific experimentation, addressing a core limitation of current agentic science systems (single trajectory/central planner). It demonstrates broad, cross-domain applicability with strong empirical gains on large, diverse benchmarks (BioML-Bench, GPT training optimization, ProteinGym), including state-of-the-art improvements that could directly affect biomedical ML, protein engineering, and model training efficiency. This breadth of applications and demonstrated generality suggest wider downstream adoption than Paper 2’s more focused (though valuable) reasoning/RL objective for LLMs.
Paper 1 presents a broad, transformative approach to automating scientific discovery across multiple disciplines like biomedicine and protein engineering. Its potential to accelerate general scientific research gives it a vastly higher cross-disciplinary impact compared to Paper 2, which is narrowly focused on the evaluation of financial trading agents.