AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Shanghua Gao, Ada Fang, Marinka Zitnik

May 27, 2026

arXiv:2605.28655v1

cs.AI (primary)

#40 of 2682 · Artificial Intelligence

Tournament Score: 1579 ± 47 (gauge range 1050 to 1800)

Win Rate: 89%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7 · Clarity 7

Abstract

Scientific research proceeds through iterative cycles of hypothesis generation, experiment design, execution, and revision. AI agents can automate parts of this process, but existing approaches typically follow a single research trajectory or coordinate through a central planner with fixed objectives. As a result, they struggle to sustain parallel exploration, adapt as experimental evidence changes, or preserve knowledge of failed directions over long-running experiments. We introduce AutoScientists, a decentralized team of AI agents for long-running computational scientific experimentation. Agents interpret a shared experimental state, self-organize into teams around promising hypotheses, critique proposals before using experimental compute, and share successes and failures to reduce redundant exploration. Under matched experimental budgets, AutoScientists improves over prior AI agents across biomedical machine learning, language-model training optimization, and protein fitness prediction. On BioML-Bench, spanning biomedical imaging, protein engineering, single-cell omics, and drug discovery, AutoScientists achieves a mean leaderboard percentile of 74.4% across 24 tasks, improving over the strongest AI agent by +8.33%. On GPT training optimization, AutoScientists reaches a target validation bits-per-byte 1.9x faster than Autoresearch and continues discovering improvements from a starting champion where the single-agent approach finds none (7 vs. 0 accepted improvements). On ProteinGym fitness prediction, AutoScientists discovers a method for ACE2-Spike binding that improves over the current state-of-the-art model by +12.5% in Spearman correlation. Applied without modification across all 217 ProteinGym assays, the same method improves over the prior state of the art by +6.5% (Spearman correlation).

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AutoScientists

1. Core Contribution

AutoScientists introduces a decentralized multi-agent framework for long-running computational scientific experimentation. The key novelty lies in replacing centralized orchestration (fixed planners, role hierarchies, or consensus-driven convergence) with a self-organizing team structure where agents independently interpret shared experimental state, form and reorganize teams around promising hypotheses, critique proposals before execution, and share both successes and failures. The system addresses a genuine limitation of existing AI-for-science agents: the inability to sustain parallel exploration, adapt to shifting productive directions, and preserve institutional knowledge across extended experimental campaigns.

The architecture has several distinct design elements: (1) dynamic team formation through structured discussion rather than predetermined decomposition; (2) analyst/experiment agent role separation; (3) a shared forum with cross-team visibility including dead-end registries; (4) noise-aware champion promotion gates; and (5) stagnation-triggered reorganization. These collectively enable what the authors frame as "conference-style" knowledge sharing among agents.
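Design elements (4) and (5) can be illustrated with a minimal sketch. It is not the paper's implementation: the threshold rule (improvement must exceed `k` standard deviations of run-to-run noise), the `patience` counter, and all names below are assumptions chosen for illustration. Scores are treated as lower-is-better, as with a validation loss.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Champion:
    """A configuration with repeated measurements (lower score is better)."""
    name: str
    scores: list[float]


def noise_aware_promote(champion: Champion, candidate: Champion, k: float = 2.0) -> bool:
    """Return True if the candidate beats the champion by more than k noise SDs.

    Noise is estimated from repeated champion runs. The margin k is a
    hypothetical parameter, not a value taken from the paper.
    """
    noise_sd = stdev(champion.scores)
    improvement = mean(champion.scores) - mean(candidate.scores)
    return improvement > k * noise_sd


class StagnationMonitor:
    """Signals reorganization after `patience` consecutive rejected candidates."""

    def __init__(self, patience: int = 3) -> None:
        self.patience = patience
        self.rejections = 0

    def record(self, promoted: bool) -> bool:
        """Update the counter; return True when teams should reorganize."""
        self.rejections = 0 if promoted else self.rejections + 1
        return self.rejections >= self.patience
```

A gate of this kind trades sensitivity for reliability: a larger `k` rejects more real but small gains, while a smaller `k` risks promoting a candidate that only looks better because of noise.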

2. Methodological Rigor

The evaluation is comprehensive, spanning three distinct domains: BioML-Bench (24 biomedical ML tasks), GPT nanochat training optimization, and ProteinGym protein fitness prediction. The comparison methodology is generally sound, with matched experimental compute budgets being the key controlled variable.

Strengths in evaluation design:

The "from a champion" GPT experiment (Section 4.3) is particularly compelling. Starting from an already-optimized checkpoint, single-agent Autoresearch finds 0 improvements while AutoScientists finds 7, which demonstrates genuine long-horizon search capability.

Ablation studies (Table 3) systematically remove individual components and show each addresses different failure modes across different tasks, providing mechanistic understanding.

The ProteinGym transfer experiment (developing on one assay, freezing the recipe, applying to all 217) tests genuine scientific generalization rather than benchmark overfitting.

Three independent cold-start runs on GPT nanochat (Appendix C) show reasonable stability (SD = 0.0010).

Weaknesses:

Most experiments are single-run per condition due to computational cost, limiting statistical confidence. The BioML-Bench results in particular lack error bars from repeated runs (only one task was run 3 times).

The system uses substantially more LLM tokens than baselines (Table S8 shows ~4× higher costs), and while the paper frames this as acceptable given matched *experimental* compute, the total cost comparison is less favorable.

The comparison against Autoresearch on BioML-Bench involves adapting Autoresearch to their task interface, introducing potential confounds.

Team size sensitivity (Appendix B.2) shows that the optimal configuration is task-dependent and that the default n=9 is not always best, raising questions about the robustness of the default configuration.

3. Potential Impact

Immediate applications: The framework could accelerate computational scientific workflows in drug discovery, protein engineering, and ML architecture search. The BioML-Bench drug discovery improvement (+18.4 leaderboard percentile points over Biomni) and the ProteinGym SOTA improvement (+6.5% Spearman correlation across 217 assays) suggest practical value.

Broader influence: The paper contributes to the growing understanding of how to organize multi-agent LLM systems for complex, long-horizon tasks. The self-organization principle—where agents determine their own coordination structure through discussion rather than following predetermined workflows—represents a meaningful architectural contribution that could transfer beyond scientific experimentation to other domains requiring sustained collaborative search.

Scientific discovery: The ProteinGym result is notable: AutoScientists-Kermut discovers a three-GP ensemble with quantile-warped targets, greedy diversity-based feature selection, and expanded zero-shot features that constitutes a genuine methodological contribution to protein fitness prediction.
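To make "quantile-warped targets" and the Spearman metric concrete, here is a small standard-library sketch. It assumes one common reading of quantile warping, a rank-based inverse-normal transform. The assessment does not specify the paper's exact transform, so `quantile_warp` and the helper names are hypothetical.

```python
from statistics import NormalDist


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def quantile_warp(targets):
    """Map each target to the standard-normal quantile of its rank (assumed form)."""
    n = len(targets)
    normal = NormalDist()
    return [normal.inv_cdf((r - 0.5) / n) for r in ranks(targets)]
```

Because the warp is monotone, it leaves the Spearman correlation between predictions and targets unchanged. What it changes is the distribution a regression model such as a GP is asked to fit, which can matter when raw fitness values are heavily skewed.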

4. Timeliness & Relevance

This work arrives at a critical juncture. AI agents for science are proliferating rapidly (the paper cites numerous 2025-2026 works), but most remain single-trajectory or rely on fixed orchestration. The challenge of long-horizon experimental search—where productive directions shift as evidence accumulates—is a genuine bottleneck that becomes more pressing as compute budgets for AI-driven experimentation grow. The paper directly addresses the scaling question: how do we make more agents productively collaborate rather than duplicating effort?

The connection to Karpathy's Autoresearch provides a timely and visible baseline, and the BioML-Bench evaluation situates the work within the emerging standardization of AI-for-science benchmarking.

5. Strengths & Limitations

Key Strengths:

The "7 vs. 0 improvements from champion" result is the paper's strongest evidence, demonstrating that the system enables exploration beyond single-agent plateaus.

The dead-end registry and cross-team knowledge sharing address a practical problem (redundant exploration) with measurable impact shown in ablations.

The comprehensive output artifacts (model cards, research insights documents, experiment logs) represent good practice for reproducibility and transparency.

The system is LLM-agnostic by design and evaluated on genuinely diverse scientific domains.
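The dead-end registry idea mentioned among the strengths can be sketched as a shared store that any team writes to and every team consults before spending compute. This is a hypothetical illustration only: `SharedForum`, its methods, and the string-normalization rule are assumptions, and the paper's forum presumably matches ideas far less literally than exact text comparison.

```python
from dataclasses import dataclass, field


def normalize(idea: str) -> str:
    """Crude canonical form: lowercase with whitespace collapsed."""
    return " ".join(idea.lower().split())


@dataclass
class SharedForum:
    """Cross-team record of failed and successful directions."""
    dead_ends: dict[str, str] = field(default_factory=dict)    # idea -> reason
    successes: dict[str, float] = field(default_factory=dict)  # idea -> gain

    def report_failure(self, idea: str, reason: str) -> None:
        self.dead_ends[normalize(idea)] = reason

    def report_success(self, idea: str, gain: float) -> None:
        self.successes[normalize(idea)] = gain

    def filter_proposals(self, proposals: list[str]) -> list[str]:
        """Drop proposals that any team has already recorded as a dead end."""
        return [p for p in proposals if normalize(p) not in self.dead_ends]


forum = SharedForum()
forum.report_failure("Increase  Learning Rate 10x", "training diverged")
remaining = forum.filter_proposals(["increase learning rate 10x", "add warmup"])
```

Here `remaining` keeps only `"add warmup"`, since the first proposal matches a recorded failure after normalization.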

Notable Limitations:

LLM token costs are ~4× higher, making the efficiency gains somewhat ambiguous when considering total cost rather than just experimental compute.

The system's performance with different team sizes is inconsistent (n=2 wins on ProteinGym SPIKE-SARS2 while n=9 wins on TDC-hERG), and no principled method for selecting team size is provided.

Sequential GPU execution on BioML-Bench (due to single-GPU constraint) means the parallelism advantage is not fully tested in the largest benchmark.

The reliance on Claude Sonnet 4.6 as the backbone makes reproducibility contingent on API access and model stability.

The paper does not deeply analyze failure modes—when does self-organization produce worse results than centralized planning?

Additional observations: The paper is thorough but extremely long (39+ pages with appendices). The algorithmic details in the appendix (Algorithms 1-4, noise-aware gating, analyst proposal protocol) demonstrate engineering depth but also reveal significant complexity that may limit adoption. The distinction between what the LLM decides versus what is hardcoded in the protocol could be clearer—the "self-organization" involves substantial scaffolding.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.8 · Novelty 7 · Clarity 7

Generated May 28, 2026

Comparison History (27)

vs. Why LLMs Fail at Causal Discovery and How Interventional Agents Escape

gpt-5.2 · 5/28/2026

Paper 2 likely has higher scientific impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrates strong empirical gains across diverse, high-value domains (biomed ML, LLM training optimization, protein fitness) with clear real-world utility for accelerating computational experimentation. Its cross-domain benchmarks and sizable improvements suggest immediate adoption potential and wide spillover across fields. Paper 1 is conceptually novel with a compelling limitation theorem and a targeted workaround, but its demonstrated impact is narrower (causal discovery benchmarks) and may be less broadly deployable than an automated experimentation system.

vs. Reward Bias Substitution: Single-Axis Bias Mitigations Redirect Optimization Pressure

gpt-5.2 · 5/28/2026

Paper 2 has higher likely scientific impact because it identifies and formalizes a broadly applicable failure mode (reward bias substitution) in RLHF/preference optimization, provides impossibility-style results showing standard audits can’t distinguish mitigation success from substitution, and offers principled evaluation prescriptions. This targets a timely, safety-critical bottleneck for deployed LLMs and affects many mitigation/benchmarking efforts across alignment, evaluation, and ML theory. Paper 1 is strong empirically and useful, but its agentic system contribution is more incremental and may depend on engineering choices; Paper 2’s conceptual framework is more general and field-shaping.

vs. Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

gemini-3.1 · 5/28/2026

Paper 2 presents a novel, decentralized AI agent system capable of autonomously conducting scientific research across diverse domains. By accelerating the scientific discovery process itself and demonstrating substantial empirical improvements in areas like drug discovery, biomedical imaging, and protein engineering, its potential breadth of impact and methodological breakthroughs far exceed the framework proposals in Paper 1, which, while valuable for equitable AI deployment, are narrower in scientific scope.

vs. Discovery Agents for Real-Time Analytics: Toward Proactive Insight Systems

gpt-5.2 · 5/28/2026

Paper 1 has higher estimated scientific impact due to stronger novelty (decentralized self-organizing scientific agent teams for long-running experimentation), broader cross-domain relevance (biomed ML, LLM training optimization, protein fitness), and substantially more rigorous empirical validation with matched budgets and quantitative gains across large benchmarks (BioML-Bench, ProteinGym) including new SOTA improvements. Its contributions generalize to automated scientific discovery, a timely and high-impact area. Paper 2 is promising and practical for streaming analytics, but appears more systems-integration oriented with less demonstrated methodological/benchmark rigor and narrower scientific reach.

vs. When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it proposes a broadly applicable, decentralized multi-agent framework that demonstrably improves performance across multiple substantive domains (biomedical ML, LM training optimization, and protein fitness prediction), suggesting wide real-world utility and cross-field influence. The reported gains are sizable, evaluated on diverse benchmarks and long-running settings, and the approach is timely given growing interest in autonomous scientific discovery. Paper 1 is novel and important for AI safety evaluation/architecture, but its immediate impact is narrower (diagnostic protocol + validator) and more confined to safety-alignment research.

vs. Adaptive Reservoir Computing for Multi-Scenario Chaotic System Forecasting

gemini-3.1 · 5/28/2026

Paper 1 introduces a broadly applicable, decentralized AI agent framework for automating scientific discovery across diverse, high-impact domains like biomedicine, LLM training, and protein engineering. It achieves state-of-the-art results and tackles a highly timely problem (AI for science). Paper 2 offers specialized algorithmic improvements for a specific chaotic system forecasting benchmark, resulting in a much narrower potential scientific impact.

vs. The Ethics of LLM Sandbox and Persona Dynamics

claude-opus-4.6 · 5/28/2026

AutoScientists presents a novel, rigorously evaluated multi-agent framework for autonomous scientific experimentation with strong empirical results across multiple domains (biomedical ML, LLM training, protein fitness prediction). It demonstrates clear quantitative improvements over prior methods and addresses a timely, high-impact problem in AI-driven scientific discovery. Paper 2 is a philosophical/ethical analysis that, while raising valid points about LLM guardrails and 'reality laundering,' lacks empirical methodology, offers primarily conceptual contributions, and has narrower potential for driving follow-on research or real-world applications.

vs. CIVeX: Causal Intervention Verification for Language Agents

gemini-3.1 · 5/28/2026

Paper 2 has higher potential scientific impact due to its direct application to accelerating scientific discovery across diverse fields, including biomedicine and machine learning. While Paper 1 presents a rigorous and novel causal framework for agent safety, Paper 2's ability to automate and improve long-running experimental research cycles offers broader, paradigm-shifting implications for how scientific research is conducted.

vs. Cyberbullying Governance on Social Media: A Unified Framework from Content Identification to Intervention

claude-opus-4.6 · 5/28/2026

AutoScientists presents a novel, empirically validated system for autonomous scientific experimentation with demonstrated improvements across multiple domains (biomedical ML, language model optimization, protein fitness prediction). It addresses a timely challenge in AI-driven science with concrete quantitative results showing significant improvements over prior methods. Paper 2 is a survey/framework paper on cyberbullying governance that synthesizes existing work but lacks novel empirical contributions. AutoScientists' broad applicability to accelerating scientific discovery across fields gives it substantially higher potential impact.

vs. Training Stratigraphy: Persistent Behavioral Artifacts in Large Language Models Observed Through Longitudinal AI-Human Interaction

gpt-5.2 · 5/28/2026

Paper 2 has higher likely scientific impact due to a clearly novel, generalizable systems contribution (decentralized self-organizing agent teams) validated with strong quantitative results across multiple benchmarks and domains (biomed ML, LM training optimization, protein fitness). It is methodologically more rigorous (matched budgets, comparative baselines, broad task coverage) and offers immediate real-world applications in automating and accelerating computational science. Paper 1 is intriguing and timely for AI behavior/safety, but relies heavily on auto-ethnography and first-person AI self-report, limiting reproducibility and evidential strength, hence lower expected impact.

vs. Plan Before Search: Search Agents Need Plan

claude-opus-4.6 · 5/28/2026

AutoScientists presents a more broadly impactful contribution: a general-purpose framework for autonomous scientific experimentation that demonstrates strong results across multiple diverse domains (biomedical ML, LLM training, protein fitness prediction). Its decentralized multi-agent architecture for sustained scientific exploration is highly novel, and the demonstrated improvements over state-of-the-art in protein engineering (+12.5% on ACE2-Spike binding) represent tangible scientific discoveries. Paper 1, while technically solid, addresses a narrower problem (multi-hop QA training strategies) with more incremental contributions to the retrieval-augmented reasoning community.

vs. CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

gpt-5.2 · 5/28/2026

Paper 2 (AutoScientists) has higher potential impact due to broader cross-domain applicability (general long-running scientific experimentation vs. a specific e-commerce dispute setting), stronger real-world relevance to accelerating research workflows, and demonstrated gains across multiple substantive benchmarks (BioML-Bench, LM training optimization, ProteinGym) including improvements over state-of-the-art models. Its decentralized, self-organizing agent-team paradigm is timely and potentially reusable across many computational sciences. Paper 1 is novel and valuable, but its impact is narrower (platform dispute adjudication) and more application-specific.

vs. ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

gemini-3.1 · 5/28/2026

While Paper 1 presents an innovative multi-agent architecture for scientific exploration, Paper 2 addresses a fundamental and critical bottleneck in AI-driven research: verifiability and hallucinations. By introducing the Chain-of-Evidence framework and robust audit mechanisms, Paper 2 ensures that autonomous scientific outputs are trustworthy, reproducible, and grounded. This methodological rigor is essential for the widespread adoption and credibility of AI researchers in the real world, giving it a broader and more foundational scientific impact.

vs. Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

gemini-3.1 · 5/28/2026

Paper 1 introduces a decentralized AI agent framework that directly accelerates multi-disciplinary scientific discovery. Its demonstrated ability to self-organize and significantly improve state-of-the-art results across diverse and highly impactful domains—such as biomedical machine learning, language model optimization, and protein engineering—suggests a broad, transformational impact across STEM fields. While Paper 2 addresses a critical issue in AI safety, Paper 1's potential to automate and enhance the scientific method itself offers a wider scope of real-world applications and cross-domain innovation.

vs. Human-like in-group bias in instruction-tuned language model agents

gemini-3.1 · 5/28/2026

Paper 1 presents a framework that accelerates the scientific method itself. By successfully automating parallel hypothesis generation and experimentation, it demonstrates state-of-the-art improvements across highly impactful and diverse fields like protein engineering and ML optimization. Its potential to act as a force multiplier for broad scientific discovery gives it a higher overall scientific impact compared to the crucial, yet more narrowly focused, AI safety and sociological findings of Paper 2.

vs. A Unified Framework for the Evaluation of LLM Agentic Capabilities

claude-opus-4.6 · 5/28/2026

AutoScientists introduces a novel decentralized multi-agent framework for autonomous scientific discovery that demonstrates strong empirical results across diverse domains (biomedical ML, language model optimization, protein fitness prediction), including state-of-the-art improvements. It addresses a fundamental challenge in AI-driven science—sustaining parallel exploration and knowledge preservation over long experiments. Paper 1, while valuable for standardizing LLM agent evaluation, is primarily an engineering/benchmarking contribution. Paper 2's potential to accelerate scientific discovery across multiple fields gives it broader and deeper impact.

vs. Data Language Models: A New Foundation Model Class for Tabular Data

claude-opus-4.6 · 5/28/2026

Paper 1 introduces a fundamentally new foundation model class for tabular data—the most widely used data modality in industry and science—addressing a longstanding gap in the AI stack. Its native understanding of tabular data without preprocessing is a paradigm shift with enormous breadth of application across virtually all data-driven fields. While Paper 2 presents an impressive multi-agent system for automated scientific experimentation with strong results, it represents an incremental advance in AI-for-science agent orchestration. Paper 1's potential to reshape how all tabular AI systems are built gives it broader and deeper long-term impact.

vs. Global Policy-Space Response Oracles for Two-Player Zero-Sum Games

claude-opus-4.6 · 5/28/2026

AutoScientists addresses the broadly impactful problem of automating scientific research with a novel decentralized multi-agent framework. It demonstrates strong empirical results across diverse domains (biomedical ML, LLM training, protein fitness prediction), including a state-of-the-art improvement of +12.5% on protein binding prediction. Its breadth of applicability across scientific fields, practical real-world utility in accelerating research, and timeliness given the AI-for-science trend give it higher potential impact than Paper 1, which advances game-theoretic equilibrium computation—an important but narrower contribution.

vs. Understanding and Mitigating Premature Confidence for Better LLM Reasoning

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a novel decentralized, self-organizing multi-agent framework for long-running scientific experimentation, addressing a core limitation of current agentic science systems (single trajectory/central planner). It demonstrates broad, cross-domain applicability with strong empirical gains on large, diverse benchmarks (BioML-Bench, GPT training optimization, ProteinGym), including state-of-the-art improvements that could directly affect biomedical ML, protein engineering, and model training efficiency. This breadth of applications and demonstrated generality suggest wider downstream adoption than Paper 2’s more focused (though valuable) reasoning/RL objective for LLMs.

vs. From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

gemini-3.1 · 5/28/2026

Paper 1 presents a broad, transformative approach to automating scientific discovery across multiple disciplines like biomedicine and protein engineering. Its potential to accelerate general scientific research gives it a vastly higher cross-disciplinary impact compared to Paper 2, which is narrowly focused on the evaluation of financial trading agents.