Andrew Bo Liu, Samira Nedungadi, Bryce Cai, Alex Kleinman, Harmon Bhasin, Seth Donoughe
Large language models (LLMs) are rapidly acquiring capabilities relevant to biological research, from literature synthesis to interpretation of experimental data. Increasingly, LLM agents can also perform in silico biology tasks that previously required experienced human biologists. These emerging AI capabilities offer new opportunities for scientific discovery and biomedical advances, but they also shift the landscape of biosecurity risks. To address this, we introduce the Agentic Bio-Capabilities Benchmark (ABC-Bench), a suite of tasks to measure agentic biosecurity-relevant capabilities. ABC-Bench evaluates LLM agents on both benign and dual-use biology tasks: writing code to operate liquid handling robots, designing DNA fragments for in vitro assembly, and evading DNA synthesis screening. These tasks require a combination of biology and software expertise. All tested LLM agents outperformed the median expert human baseliner on all three tasks. Agents performed highly on tasks drawing on published knowledge and well-documented protocols, and more weakly on a task requiring novel bioinformatics reasoning. In three wet-lab validation experiments, we found that OpenAI's o4-mini-high produced scripts that, when run on an OpenTrons liquid handling robot, successfully assembled DNA with expected sequences.
ABC-Bench is a benchmark suite that measures LLM agents' capabilities on three biosecurity-relevant molecular biology tasks: Fragment Design (designing DNA fragments for Gibson Assembly), Screening Evasion (obfuscating DNA sequences to evade synthesis screening), and Liquid Handling Robot (writing code to operate OpenTrons robots for DNA assembly). The key novelty is the shift from Q&A-style biology benchmarks to *agentic* evaluations in which models must use tools (Python, BLAST, simulators, web search) to produce functional artifacts (code and DNA sequences) that are algorithmically graded. The benchmark is further validated through wet-lab experiments in which LLM-generated scripts successfully assembled DNA on a physical robot.
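To make "algorithmically graded" concrete for the Fragment Design task, here is a toy sketch of what an automated check could look like. It is not the benchmark's actual grader, which is not described here. The function names, the 20-base default overlap, the linear (non-circular) assembly, and the example sequences are all assumptions made for illustration.

```python
def assemble_linear(fragments, min_overlap=20):
    """Join fragments in order, requiring an exact terminal overlap.

    Toy model only: real Gibson Assembly design also involves melting
    temperatures, secondary structure, and often circular products.
    """
    if not fragments:
        raise ValueError("at least one fragment is required")
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    product = fragments[0]
    for nxt in fragments[1:]:
        max_len = min(len(product), len(nxt))
        # Try the longest possible overlap first, down to the minimum.
        for k in range(max_len, min_overlap - 1, -1):
            if product.endswith(nxt[:k]):
                product += nxt[k:]  # append only the non-overlapping tail
                break
        else:
            raise ValueError("adjacent fragments do not share a sufficient overlap")
    return product


def matches_target(fragments, target, min_overlap=20):
    """Return True if the fragments assemble into exactly `target`."""
    try:
        return assemble_linear(fragments, min_overlap) == target
    except ValueError:
        return False


# Hypothetical 30-base target split into two fragments sharing 6 bases.
target = "ATGCGTACGTTAGCCTAGGATCCGATTACA"
fragments = [target[:16], target[10:]]
ok = matches_target(fragments, target, min_overlap=6)
```

A deterministic check of this kind gives the same verdict every time it runs on a given submission, which is the property that makes algorithmic scoring attractive compared with human or model grading.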
The paper's most significant finding is that all tested frontier models outperformed the median expert human baseliner on all three tasks, with some achieving perfect scores on Fragment Design and Liquid Handling Robot. This constitutes concrete evidence that AI agents can perform practical molecular biology workflows end-to-end, not just answer knowledge questions.
Strengths in evaluation design: The benchmark uses algorithmic scoring rather than subjective human or model grading, supporting reproducibility. Each task was assessed N=10 times per model, and results include bootstrap confidence intervals. The three wet-lab validation experiments with whole-plasmid sequencing provide ground truth that the in silico evaluations translate to real-world outcomes.
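For readers unfamiliar with bootstrap confidence intervals, the following is a minimal sketch of a percentile bootstrap over ten per-run scores. The paper's exact procedure (number of resamples, interval type) is not stated here, so the function name, the defaults, and the example scores are assumptions rather than the authors' method.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(scores), resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical scores from ten runs of one model on one task.
example_scores = [1.0, 0.8, 1.0, 0.6, 1.0, 0.9, 1.0, 0.7, 1.0, 1.0]
point, low, high = bootstrap_ci(example_scores)
```

With only ten runs per model, intervals computed this way tend to be wide, so small differences between models should be read with that uncertainty in mind.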
Human baseline concerns: The paper acknowledges the limitations of its human baseline, but they still significantly weaken the comparative claims. The baseline design is problematic in four ways: (1) framing tasks as coding problems disadvantages biologists who would normally use dedicated tools like NEBuilder; (2) the $200 compensation per task and the "reasonable effort" incentive structure likely under-motivate baseliners relative to AI systems that are run to completion; (3) sample sizes are modest (9 to 13 baseliners per task); (4) baseliners used an older, more hint-heavy prompt version, while models used a refined version optimized for model performance. The paper states that model prompts were iterated to "best elicit model performance", and this asymmetry in elicitation is a meaningful confound. The claim that "all models outperformed the median expert" should therefore be interpreted cautiously.
Refusal handling: Refused samples were excluded from accuracy calculations, so models that refused most samples (e.g., Claude Opus 4 on Screening Evasion, with a 90% refusal rate) are scored on a small and potentially unrepresentative subset. This "refusal-corrected" approach is reasonable, but the paper could have explored whether refused samples systematically differ in difficulty.
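The effect of excluding refusals can be illustrated with a short sketch. The `Sample` record, the `summarize` helper, and the example numbers are hypothetical. The "uncorrected" figure, which counts each refusal as a score of zero, is included only for contrast and is not a metric the paper is stated to report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    refused: bool
    score: Optional[float]  # None when the model refused


def summarize(samples):
    """Compare accuracy with refusals excluded versus counted as zero."""
    n = len(samples)
    if n == 0:
        raise ValueError("samples must be non-empty")
    answered = [s.score for s in samples if not s.refused]
    return {
        "n_answered": len(answered),
        "refusal_rate": (n - len(answered)) / n,
        # Refusal-corrected: average over answered samples only.
        "corrected": sum(answered) / len(answered) if answered else None,
        # Assumed contrast metric: every refusal contributes a zero.
        "uncorrected": sum(answered) / n,
    }


# Hypothetical case: ten samples, nine refusals, one answer scoring 0.5.
samples = [Sample(refused=True, score=None) for _ in range(9)]
samples.append(Sample(refused=False, score=0.5))
summary = summarize(samples)
```

In this hypothetical case the corrected accuracy rests on a single answered sample, which shows why a high refusal rate makes the resulting estimate fragile.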
Wet-lab validation: Only three experiments were conducted, all with the same model (o4-mini-high), the same kit, and the same assembly. While confirming functional code generation, this is a narrow validation. The human-in-the-loop error correction (feeding compilation errors back to the model) also differs from the autonomous in silico evaluation setup.
This work has substantial policy and governance impact. The benchmark is already being used by major AI companies (Anthropic, OpenAI) for pre-release model evaluation, as evidenced by references in their model cards. This demonstrates real-world adoption even before formal publication.
The biosecurity implications are significant: the Screening Evasion task directly measures a capability relevant to circumventing DNA synthesis screening—a key safeguard against bioweapon development. The finding that open-weights models (Qwen3.5, Kimi K2.5) completed this task without refusals while closed models often refused highlights a genuine governance gap.
The "risk chain" framework—where individual tasks correspond to steps in a pathway to harm—is a valuable conceptual contribution that could guide future benchmark development across biosecurity and other dual-use domains.
For beneficial applications, the demonstration that LLMs can reliably write liquid handling robot code opens doors for democratizing laboratory automation, potentially accelerating research in resource-limited settings.
This paper addresses a critical and timely gap. As LLMs gain tool-use capabilities and biological AI systems proliferate, the field urgently needs benchmarks that go beyond knowledge testing to measure practical execution capabilities. The paper arrives at a moment when policymakers are actively debating AI biosecurity governance, and provides concrete evidence to inform those discussions. The rapid pace of model development (eight frontier models tested, including very recent releases) demonstrates the benchmark's currency.
The finding that models perform more weakly on the Screening Evasion task, which requires novel reasoning, than on the well-documented tasks is an important and nuanced insight about current AI capability boundaries. The differential refusal patterns across models also provide valuable data for the AI safety community about the effectiveness (and inconsistency) of current safety training approaches.
Generated Jun 10, 2026
Paper 2 likely has higher scientific impact due to strong novelty and timeliness in AI-biosecurity evaluation, a high-stakes real-world application area, and breadth across ML, bioengineering, and policy. ABC-Bench provides a reusable measurement framework with direct operational relevance (dual-use capability assessment) and includes wet-lab validation, strengthening rigor and credibility. Paper 1 is innovative and practically valuable for LLM systems efficiency, but its impact is more engineering-focused and narrower in cross-domain implications compared to a benchmark shaping biosecurity risk governance and research priorities.
Paper 1 bridges the critical gap between digital AI capabilities and physical biological consequences through real-world wet-lab validation. Its demonstration that AI agents can outperform human experts in dual-use biological tasks addresses urgent biosecurity and AI safety risks, giving it profound implications for policy, safety, and automated scientific discovery that edge out Paper 2's text-synthesis benchmark.
Paper 2 introduces a foundational mathematical impossibility theorem for Eliciting Latent Knowledge (ELK), a core problem in AI safety. While Paper 1 provides a highly relevant and timely empirical benchmark with immediate biosecurity applications, fundamental theoretical results (like impossibility theorems) typically have a deeper, longer-lasting scientific impact across the broader field of AI alignment, shaping future research directions and training paradigms.
While Paper 1 presents strong theoretical advancements in fundamental AI methodology, Paper 2 addresses a highly critical, timely, and real-world problem at the intersection of AI and biosecurity. The introduction of a benchmark for dual-use biological capabilities of LLMs, combined with actual wet-lab validation, has immense implications for scientific policy, safety, and the future of automated biological research, giving it broader societal and interdisciplinary scientific impact.
Paper 1 addresses a highly urgent and timely issue: the biosecurity risks and capabilities of LLM agents in real-world biology tasks. Its cross-disciplinary impact spans AI safety, computational biology, and public policy. The wet-lab validation adds strong methodological rigor. While Paper 2 offers a solid theoretical contribution to reinforcement learning and operations research, Paper 1's real-world implications, novelty in measuring agentic bio-capabilities, and broader societal relevance give it a significantly higher potential for widespread scientific and practical impact.
Paper 1 likely has higher impact due to strong timeliness and high-stakes real-world relevance: it introduces a concrete, agent-focused biosecurity benchmark with dual-use tasks and includes wet-lab validation on real liquid-handling robotics—evidence that capabilities transfer beyond text. This creates an actionable measurement tool for AI governance, biosecurity policy, and lab automation safety across multiple communities. Paper 2 is novel and methodologically interesting for agent learning via online LoRA, but its impact is primarily within ML/agent memory research and lacks the same immediate cross-domain societal and regulatory implications.
Paper 1 likely has higher impact due to stronger novelty and urgency: it benchmarks agentic LLM capabilities directly tied to biosecurity and includes wet-lab validation showing real-world execution (robotic DNA assembly), which raises immediate safety and governance implications. Its applications span AI evaluation, biotechnology automation, and biosecurity policy, giving broad cross-field relevance and timeliness. Paper 2 is methodologically solid and useful for VLM assessment in engineering, but is narrower in societal stakes and lacks comparable real-world validation beyond benchmarking and stage-wise scoring.
Paper 2 (PRISM) likely has higher scientific impact due to its broadly applicable, methodologically novel approach to interpreting and monitoring LLM behavior via activation-based instruction-set retrieval. It addresses a timely, central problem in AI safety/security (prompt injection, hidden objectives) and can generalize across domains wherever agents are deployed. Paper 1 (ABC-Bench) is important and timely for biosecurity, but as a benchmark it is narrower in scope, more domain-specific, and its impact depends on adoption and the evolving LLM/bio tooling landscape. PRISM’s technique could influence multiple subfields (interpretability, agent monitoring, alignment, security).
ABC-Bench introduces a novel, practically grounded benchmark for evaluating biosecurity-relevant AI capabilities with wet-lab validation, addressing a critical and timely policy concern as LLMs become more capable in biology. It bridges AI safety and biosecurity communities with concrete, reproducible evaluations. Paper 2 contributes useful diagnostics for multi-turn reasoning alignment failures, but addresses a narrower methodological concern within AI safety. Paper 1's broader interdisciplinary relevance, direct policy implications, and real-world experimental validation give it higher potential impact.
Paper 2 has higher likely scientific impact: it introduces a concrete, broadly usable benchmark for agentic bio-capabilities with direct biosecurity relevance, includes wet-lab validation, and yields actionable evaluation infrastructure for labs, policymakers, and model developers. Its applications span AI safety, biosecurity, and automation in biotech, making cross-field uptake likely and timely given rapid agent progress. Paper 1 is novel and methodologically interesting for alignment science (early-warning signal for reward hacking), but it appears narrower (coding RL with pytest proxies) and more dependent on interpretability tooling and specific RL setups for immediate external adoption.