ViroBench: Benchmarking Nucleotide Foundation Models on Viral Genomics Tasks

Dongxin Ye, Fang Hu, Han Hu, Shu Hu, Yang Tan, Wanli Ouyang, Stan Z. Li, Jie Cui

#403 of 5371 · cs.LG
Tournament Score: 1512±41
Win Rate: 78% (21 wins, 6 losses, 27 matches)
Rating: 7.2/10

Abstract

Nucleotide sequences constitute the fundamental genetic basis of biological systems, rendering viral genomic analysis critical for biomedical advancement. Despite progress in biological foundation models, specifically nucleotide foundation models (NFMs), the field lacks a unified standard for viral genomics to facilitate community development and enforce biosecurity constraints. To address this, we introduce ViroBench, the first comprehensive and large-scale benchmark specifically designed for NFMs in viral settings. ViroBench evaluates models across two critical dimensions: biological understanding and latent biosecurity risk, covering 18 diverse scenarios within 4 task types. Extensive evaluation of 66 NFMs across diverse architectures yields three critical conclusions. Firstly, NFMs exhibit a performance degradation in biological understanding under phylogenetic and temporal shifts, indicating weak extrapolation capabilities. Secondly, generation tasks reveal a decoupling between statistical likelihood and biological functional validity, posing latent biosecurity risks. Thirdly, controlled ablation studies reveal that taxonomic diversity in pretraining data outweighs parameter scale. Specifically, a lightweight baseline trained on diverse data achieves a 67.5% performance gain over its original model. Overall, ViroBench provides interpretable, diagnostic evaluations and a reproducible measurement framework for future research on viral nucleotide foundation models. The datasets and code are publicly available at https://github.com/QIANJINYDX/ViroBench.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ViroBench

1. Core Contribution

ViroBench introduces the first large-scale, unified benchmark specifically designed for evaluating nucleotide foundation models (NFMs) on viral genomics tasks. The benchmark spans two complementary evaluation axes, biological understanding (classification) and latent biosecurity risk (generation), across 18 scenarios within 4 task types. It is built on 58,314 curated viral sequences and evaluates 66 NFMs plus 4 conventional baselines. Three key findings emerge: (1) NFMs degrade under phylogenetic and temporal distribution shifts, (2) generative models exhibit a decoupling between statistical likelihood and biological functional validity, and (3) taxonomic diversity in pretraining data matters more than parameter scale, as shown by a lightweight model trained on diverse data that achieves a 67.5% performance gain over its original model.

2. Methodological Rigor

Strengths in design: The benchmark's data curation pipeline is thorough—starting from 273,974 TaxIDs, applying multi-stage filtering, and arriving at 58,314 high-quality samples. The use of Qwen3-235B for host label standardization is validated against manual annotations across three LLMs (96.25% accuracy), lending credibility to the label quality. The splitting strategies are biologically motivated: genus-disjoint splits enforce phylogenetic extrapolation, and temporal splits simulate real-world surveillance scenarios.
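The two splitting strategies described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual pipeline: the record fields (`genus`, `deposited`), the function names, and the toy records are all assumptions made for the example.

```python
import random
from collections import defaultdict
from datetime import date


def genus_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split records so that no genus appears in both train and test.

    Note: test_fraction is applied to the number of genera, not samples.
    """
    by_genus = defaultdict(list)
    for rec in records:
        by_genus[rec["genus"]].append(rec)
    genera = sorted(by_genus)
    random.Random(seed).shuffle(genera)
    n_test = max(1, round(len(genera) * test_fraction))
    test_genera = set(genera[:n_test])
    train = [r for g in genera if g not in test_genera for r in by_genus[g]]
    test = [r for g in genera if g in test_genera for r in by_genus[g]]
    return train, test


def temporal_split(records, cutoff):
    """Train on records deposited before the cutoff, test on the rest."""
    train = [r for r in records if r["deposited"] < cutoff]
    test = [r for r in records if r["deposited"] >= cutoff]
    return train, test


# Toy records (hypothetical values, for illustration only).
records = [
    {"id": "s1", "genus": "Alphavirus", "deposited": date(2012, 5, 1)},
    {"id": "s2", "genus": "Alphavirus", "deposited": date(2021, 2, 9)},
    {"id": "s3", "genus": "Flavivirus", "deposited": date(2016, 8, 3)},
    {"id": "s4", "genus": "Henipavirus", "deposited": date(2019, 1, 15)},
    {"id": "s5", "genus": "Henipavirus", "deposited": date(2023, 7, 30)},
    {"id": "s6", "genus": "Orthopoxvirus", "deposited": date(2018, 11, 20)},
]

train, test = genus_disjoint_split(records, test_fraction=0.25, seed=0)
old, new = temporal_split(records, cutoff=date(2020, 1, 1))
```

Splitting by genus rather than by sample is what forces phylogenetic extrapolation: every test sequence belongs to a genus the classifier has never seen.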

Evaluation protocol: The frozen-backbone approach with standardized classification heads is appropriate for fair comparison across heterogeneous architectures. The window-based evaluation with multiple configurations (512/1024/2048) and learning rate sweeps, reporting means and standard deviations, adds robustness. The CDS generation evaluation is particularly well-designed, with a tripartite assessment separating surface-level fidelity (edit distance, exact match) from biological validity (CDS success rate) and distributional properties (K-mer JSD/KS).
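To make the distinction between distributional and validity metrics concrete, here is a minimal sketch of a k-mer Jensen-Shannon divergence and a simple CDS validity check. The paper's exact definitions are not given in this assessment, so the function names and the validity rule (ATG start, length divisible by 3, a single in-frame stop codon at the end, standard genetic code) are assumptions.

```python
import math
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def kmer_distribution(seq, k=3):
    """Relative frequencies of overlapping k-mers in a sequence."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def jensen_shannon_divergence(p, q):
    """JSD with base-2 logs: 0 for identical, 1 for disjoint distributions."""
    keys = set(p) | set(q)
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}

    def kl_to_mixture(a):
        return sum(v * math.log2(v / m[key]) for key, v in a.items() if v > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def looks_like_cds(seq):
    """Assumed validity rule: ATG start, whole codons, one terminal stop."""
    seq = seq.upper()
    if len(seq) < 6 or len(seq) % 3 != 0 or not seq.startswith("ATG"):
        return False
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    has_terminal_stop = codons[-1] in STOP_CODONS
    has_internal_stop = any(c in STOP_CODONS for c in codons[:-1])
    return has_terminal_stop and not has_internal_stop


reference = "ATGGCTGCTAAAGCTTAA"
generated = "ATGGCTTAAAAAGCTTAA"  # similar composition, premature stop codon
jsd = jensen_shannon_divergence(
    kmer_distribution(reference), kmer_distribution(generated)
)
```

A generated sequence can sit close to the reference in k-mer space and still fail the validity check, which is why the assessment treats the two families of metrics separately.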

Concerns: The reliance on NCBI deposit dates rather than true emergence dates for temporal splits is acknowledged but introduces confounding between sequencing effort and evolutionary novelty. The single-label host classification is a simplification that may penalize biologically correct multi-host predictions. The ViroBland pretraining dataset is relatively small (216 MB), and the 67.5% improvement claim, while striking, should be contextualized—the baseline HyenaDNA-Large-1M performs poorly on viral tasks to begin with (mean F1 of 23.48), so large relative gains from a low baseline are expected.
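The point about low baselines can be checked with simple arithmetic. The sketch below assumes the 67.5% figure is a relative gain in the same mean F1 metric on a 0 to 100 scale; the assessment does not state this explicitly. The second baseline of 60 is hypothetical and included only for contrast.

```python
baseline_f1 = 23.48    # mean F1 reported for HyenaDNA-Large-1M
relative_gain = 0.675  # 67.5%, assumed relative to the baseline above

implied_f1 = baseline_f1 * (1 + relative_gain)
absolute_gain = implied_f1 - baseline_f1
print(f"implied F1: {implied_f1:.2f}, absolute gain: {absolute_gain:.2f}")

# Hypothetical stronger baseline: the same relative gain would exceed the scale.
strong_baseline_f1 = 60.0
strong_implied_f1 = strong_baseline_f1 * (1 + relative_gain)
print(f"from a baseline of 60: {strong_implied_f1:.1f} (above the 100 maximum)")
```

Under these assumptions the improved model would land near a mean F1 of 39, an absolute gain of roughly 16 points. The same relative gain is not even attainable from a baseline of 60, which is why relative improvements from weak baselines need to be read alongside absolute scores.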

3. Potential Impact

Immediate utility: ViroBench fills a genuine gap—there was no standardized benchmark for NFMs in viral genomics. The benchmark enables systematic comparison across model families and can accelerate development of virus-specific models. The public release of datasets and code enhances reproducibility.

Biosecurity dimension: The explicit inclusion of biosecurity-related generation evaluation is timely and important. The finding that models can achieve low perplexity while producing biologically invalid sequences (the likelihood-validity decoupling) has direct implications for dual-use risk assessment frameworks.
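A toy illustration of the likelihood-validity decoupling: perplexity depends only on the probabilities a model assigns to tokens, so a sequence can score well while containing a premature stop codon. The sequences and per-token log-probabilities below are invented for the example and do not come from the paper; the helper names are likewise hypothetical.

```python
import math

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def has_premature_stop(cds):
    """True if any in-frame codon before the last one is a stop codon."""
    internal = [cds[i:i + 3] for i in range(0, len(cds) - 3, 3)]
    return any(codon in STOP_CODONS for codon in internal)


# Invented values: one log-probability per codon-level token.
confident_but_broken = {
    "seq": "ATGTAAGCTTAA",
    "logprobs": [math.log(0.9)] * 4,
}
uncertain_but_intact = {
    "seq": "ATGGCTGCTTAA",
    "logprobs": [math.log(0.5)] * 4,
}

for name, sample in [("confident", confident_but_broken),
                     ("uncertain", uncertain_but_intact)]:
    ppl = perplexity(sample["logprobs"])
    broken = has_premature_stop(sample["seq"])
    print(f"{name}: perplexity={ppl:.2f}, premature stop={broken}")
```

In this toy case the lower-perplexity sequence is the one that would truncate on translation, which mirrors the decoupling the benchmark reports.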

Broader influence: The finding that taxonomic diversity outweighs parameter scale could reshape pretraining strategies beyond virology. The benchmark design—combining phylogenetic-aware splits with temporal drift evaluation—provides a template for other biological sequence benchmarks.

Applied virology: The Nipah virus case study demonstrates practical applicability for viral surveillance, showing that NFMs can provide taxonomy-aware classification and interpretable perplexity landscapes for unseen pathogens.

4. Timeliness & Relevance

The paper addresses a current bottleneck at the intersection of two rapidly advancing fields: genomic foundation models and computational virology. Post-COVID, there is heightened awareness of the need for robust viral genomics tools and biosecurity frameworks. The proliferation of NFMs (66 evaluated here) without standardized viral evaluation creates an urgent need for benchmarking. The biosecurity axis is particularly timely given ongoing policy discussions about dual-use AI in biology.

5. Strengths & Limitations

Key Strengths:

  • Scale and comprehensiveness: 66 NFMs, 18 scenarios, 4 task types, 10 metrics—this is among the most comprehensive NFM evaluations published.
  • Biologically grounded evaluation: The genus-disjoint and temporal splits are more realistic than random splits commonly used in prior work.
  • Actionable insights: The data composition finding (diversity > scale) is immediately actionable for practitioners.
  • AlphaFold3 structural verification: Using AF3 to validate generated CDS at the protein structure level adds a meaningful biological validation layer beyond sequence metrics.
  • Thorough ablation studies: Architecture, tokenization, window configuration, prefix length, and segmentation strategy ablations strengthen confidence in the main conclusions.

Notable Limitations:

  • The lightweight ViroHyena models are trained on only 216 MB of data with max context of 8K tokens—limiting conclusions about what diverse pretraining can achieve at larger scales.
  • The CDS success rate across all models is extremely low (<1.5%), raising questions about whether this metric is informative for model differentiation or simply reflects a universally hard task.
  • The structural verification shows only 22/1143 pairs achieving TM-like ≥0.50, making it difficult to draw strong comparative conclusions across models.
  • The biosecurity framing, while important, remains somewhat abstract—the paper identifies risks but doesn't propose concrete mitigation strategies or red-line criteria.
  • Host label validation accuracy of 96.25% means ~2,200 samples may be mislabeled, potentially affecting host prediction results.
  • The benchmark is static; viral genomics is inherently dynamic, and the benchmark will need regular updates to remain relevant.

Overall Assessment: ViroBench represents a substantial and well-executed contribution to genomic benchmarking. Its primary value lies in establishing standardized evaluation infrastructure for an underserved domain, revealing systematic failure modes of existing NFMs, and providing actionable insights about pretraining data composition. While individual components (curation, evaluation protocols, model training) are not individually revolutionary, their integration into a coherent, reproducible framework with extensive empirical validation makes this a valuable community resource.
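Two of the counts quoted in the limitations can be reproduced with a short calculation. The mislabel estimate assumes that the 96.25% validation accuracy applies uniformly to all 58,314 curated samples, which the assessment implies but does not state.

```python
total_samples = 58_314
label_accuracy = 0.9625  # host label validation accuracy

# Expected number of mislabeled samples under the uniform-accuracy assumption.
expected_mislabeled = total_samples * (1 - label_accuracy)

# Structural verification: pairs reaching TM-like >= 0.50.
tm_hits, tm_pairs = 22, 1143
tm_fraction = tm_hits / tm_pairs

print(f"expected mislabeled samples: {expected_mislabeled:.0f}")
print(f"pairs with TM-like >= 0.50: {tm_fraction:.1%}")
```

The first figure comes to about 2,187, consistent with the rounded ~2,200 quoted above, and the second to roughly 1.9% of pairs.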

    Rating: 7.2/10
    Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 7.5

    Generated May 26, 2026

    Comparison History (27)

    vs. MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
    claude-opus-4.6 · 5/27/2026

    MONA addresses a fundamental challenge in large language model training—optimizer design—with both theoretical convergence guarantees and empirical validation at scale (up to 68B parameters, 1T tokens). Its direct applicability to the rapidly growing LLM training ecosystem gives it broad, immediate impact across NLP, AI systems, and downstream applications. While ViroBench is a valuable benchmarking contribution for viral genomics NFMs, it is more niche in scope. MONA's combination of theoretical rigor, practical scalability, and SOTA results on widely-used benchmarks positions it for higher citation impact and adoption.

    vs. Innovation: An Almost Characterization of Hallucination
    gpt-5.2 · 5/27/2026

    Paper 2 likely has higher scientific impact due to its immediate real-world applicability and breadth: it establishes a large-scale, reproducible benchmark for viral genomics NFMs, evaluates 66 models, introduces biosecurity-risk evaluation, and provides public datasets/code—facilitating community standardization and accelerating progress across ML, genomics, and biosecurity. Paper 1 is novel and theoretically valuable for understanding LLM hallucination, but its impact may be narrower and more indirect (primarily ML theory) compared to an actionable benchmark that can steer model development and safety practices now.

    vs. Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning
    claude-opus-4.6 · 5/27/2026

    ViroBench addresses a critical gap by providing the first comprehensive benchmark for nucleotide foundation models in viral genomics, spanning 18 scenarios and 66 models. Its findings on biosecurity risks, extrapolation failures, and data diversity vs. scale have broad implications for genomics, AI safety, and public health. The benchmark infrastructure enables reproducible community-wide progress. While Paper 2 presents a solid RL contribution (GraphGPO), it is more incremental within the crowded LLM-RL optimization space. ViroBench's interdisciplinary relevance, timeliness given pandemic preparedness concerns, and potential to reshape NFM development give it higher impact.

    vs. Interdomain Attention: Beyond Token-Level Key-Value Memory
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher scientific impact: it proposes a new, general sequence-modeling mechanism that bridges attention and SSMs, offering fixed-state, length-flat scaling with strong empirical results up to 1.3B parameters. This can influence model architecture across NLP and other long-context domains and is timely given efficiency/long-context constraints. Paper 1 is valuable and rigorous for viral NFM evaluation and biosecurity-aware benchmarking, but its impact is more domain-specific (viral genomics + benchmarking) and less likely to reshape core ML methodology broadly.

    vs. The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth
    gpt-5.2 · 5/26/2026

    Paper 1 likely has higher scientific impact: it introduces the first large-scale, task-diverse benchmark tailored to viral nucleotide foundation models, including explicit biosecurity-risk evaluation—an urgent, societally relevant need. Benchmarks often become community standards and accelerate progress across many downstream biomedical and public-health applications. It also reports broad evaluation (66 models), shift-robustness findings, and actionable guidance (data diversity > scale), with open code/data. Paper 2 is novel for mechanistic interpretability, but its impact is more specialized and less directly tied to immediate real-world deployment.

    vs. SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a significant gap in evaluating nucleotide foundation models for viral genomics, combining biological understanding with biosecurity assessment across 66 models and 18 scenarios. Its scale, rigor, and timeliness (given ongoing pandemic preparedness concerns) give it broad impact across computational biology, genomics, and biosecurity. Paper 2 presents an interesting but early-stage pilot framework for lossy text compression with LLMs, evaluated on only five author-constructed cases, with explicitly limited claims. ViroBench's community-wide benchmarking utility and real-world relevance substantially exceed SemanticZip's niche contribution.

    vs. Small Models, Strong Priors: Architectural Inductive Bias for Parameter-Efficient Neural PDE Solvers
    gpt-5.2 · 5/26/2026

    Paper 2 is likely to have higher scientific impact due to a more novel methodological contribution (a new PDE-solver architecture with explicit inductive biases) and broad applicability across scientific computing domains (CFD, acoustics, wave physics, engineering). Its strong efficiency claim (1–10M params, single-GPU training) improves real-world deployability and timeliness given concerns about scaling costs. It also provides interpretable success/failure structure across multiple benchmarks, suggesting generalizable insight. Paper 1 is valuable infrastructure for viral NFM benchmarking and biosecurity analysis, but is primarily evaluative rather than a new modeling paradigm.

    vs. Personalized Federated Learning by Energy-Efficient UAV Communications
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a significant gap by providing the first comprehensive benchmark for nucleotide foundation models in viral genomics, spanning 66 models across 18 scenarios. Its contributions—identifying extrapolation weaknesses, biosecurity risks in generation tasks, and the importance of taxonomic diversity over parameter scale—have broad implications for genomics, AI safety, and public health. Paper 2 presents an incremental combination of personalized FL with UAV energy optimization, which is a narrower contribution in a well-explored space with limited cross-disciplinary impact.

    vs. Hermite-NGP: Gradient-Augmented Hash Encoding for Learning PDEs
    gpt-5.2 · 5/26/2026

    Paper 2 (ViroBench) is likely to have higher scientific impact due to broader cross-field relevance and real-world applicability: it introduces a large-scale, community-facing benchmark for viral nucleotide foundation models, spanning multiple task types and explicitly addressing biosecurity risk—highly timely given rapid growth in genomics AI. Benchmarks often become standard infrastructure that shapes future research and evaluation practices. Paper 1 is technically innovative and may strongly impact neural PDE solving, but its scope is narrower and more specialized, potentially limiting breadth of adoption compared to a widely usable benchmarking suite.

    vs. Mapping the Schedule x Bit-Width Boundary in Sub-100M Quantisation-Aware Training
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a significant gap in evaluating nucleotide foundation models for viral genomics, combining biological understanding with biosecurity assessment across 66 models and 18 scenarios. It establishes a comprehensive benchmark for a rapidly growing field with direct biomedical and public health relevance. Paper 2, while methodologically thorough, addresses a narrow technical question about learning-rate schedules in sub-100M quantization-aware training, with limited breadth of impact. ViroBench's broader applicability across genomics, biosecurity, and AI evaluation gives it substantially higher potential scientific impact.

    vs. Representation-Guided Discrete Molecular Graph Retrosynthesis
    gpt-5.2 · 5/26/2026

    Paper 2 is likely to have higher impact because it introduces a first-of-its-kind, large-scale benchmark for nucleotide foundation models in viral genomics, addressing an urgent gap in standardized evaluation and explicitly incorporating biosecurity risk—highly timely and broadly relevant. Benchmarking infrastructure and reproducible datasets/code typically catalyze community-wide progress across many downstream tasks and model families. Paper 1 advances retrosynthesis modeling with solid empirical gains, but its scope is narrower (single-step retrosynthesis) and more incremental relative to existing diffusion/guidance paradigms.

    vs. LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
    claude-opus-4.6 · 5/26/2026

    Paper 1 demonstrates a novel metacognitive harness framework that achieves state-of-the-art results across multiple major benchmarks (HLE, LiveCodeBench, R-Bench-V) without fine-tuning, suggesting broad applicability to LLM inference. Its grounding in cognitive psychology theory (Nelson-Narens) and practical test-time scaling improvements make it highly impactful for the massive LLM community. Paper 2 introduces a valuable but more niche benchmark for viral genomics NFMs. While important for biosecurity and virology, its audience and application scope are narrower compared to the broadly transformative potential of improved LLM reasoning control.

    vs. Private Adaptive Covariance Estimation via Gaussian Graphical Models
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a critical gap by providing the first comprehensive benchmark for nucleotide foundation models in viral genomics, evaluating 66 models across 18 scenarios. Its breadth of impact spans computational biology, genomics, biosecurity, and AI/ML benchmarking. The biosecurity dimension is particularly timely and relevant. The actionable finding that taxonomic diversity outweighs parameter scale has broad implications for foundation model training. While PACE-GGM makes a solid technical contribution to private covariance estimation, ViroBench's community resource potential, public availability, and cross-disciplinary relevance give it higher estimated impact.

    vs. A Context Augmented Multi-Play Multi-Armed Bandit Algorithm for Fast Channel Allocation in Opportunistic Spectrum Access
    gemini-3.1 · 5/26/2026

    Paper 1 offers higher potential scientific impact due to its timeliness, scale, and cross-disciplinary relevance. It introduces the first comprehensive benchmark for Nucleotide Foundation Models in viral genomics, directly addressing critical global challenges in biomedical advancement and AI biosecurity. By evaluating 66 models, it provides foundational insights that will shape future research in AI-driven biology. In contrast, while Paper 2 presents a solid methodological improvement for spectrum access in telecommunications using multi-armed bandits, its scope is much narrower and represents an incremental advance in a highly specialized subfield.

    vs. TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher impact: it proposes a generally applicable architectural solution to continual pre-training at LLM scale without replay/task IDs, addressing a core bottleneck (catastrophic forgetting) with broad relevance across NLP, continual learning, and scalable training. It reports results across multiple domains and model sizes up to ~9B, plus extensions (meta-control, planning) that could influence future LLM training pipelines. Paper 1 is valuable and timely for viral NFMs and biosecurity, but its impact is narrower (benchmark/dataset) and more domain-specific, whereas Paper 2’s method could affect many models and applications.

    vs. Invariant-Based Weight Sharing for Message Passing
    gpt-5.2 · 5/26/2026

    Paper 2 likely has higher scientific impact due to its broadly applicable methodological innovation: a new, invariant-indexed weight sharing principle for MPNNs with theoretical expressivity guarantees and demonstrated gains beyond 1-WL. This can influence graph learning across many fields (chemistry, social networks, physics, program analysis) and offers a general design knob (choice of invariants) for controlling complexity. Paper 1 is timely and valuable for viral genomics evaluation and biosecurity-aware benchmarking, but its impact is more domain-specific and centered on benchmarking rather than introducing a new general learning paradigm.

    vs. Relative Repairability: A Calibration-Based Diagnostic for High-Sparsity Post-Pruning Allocation
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a significant gap by providing the first comprehensive benchmark for nucleotide foundation models in viral genomics, spanning 66 models across 18 scenarios. Its broad applicability to biosecurity, public health, and genomic AI, combined with actionable findings (e.g., data diversity outweighing model scale, biosecurity risks from generation models), gives it wide cross-disciplinary impact. Paper 1, while technically sound, addresses a narrow niche in pruning allocation diagnostics with incremental improvements on specific architectures/datasets, limiting its broader scientific influence.

    vs. SAE-FD: Sparse Autoencoder Feature Distillation for Continual Learning of Large Language Models
    gpt-5.2 · 5/26/2026

    Paper 1 is likely to have higher scientific impact because it introduces a large-scale, first-of-its-kind benchmark for nucleotide foundation models in viral genomics, including both capability and biosecurity-risk evaluation—an urgent and broadly relevant area. Its findings (shift robustness issues, likelihood–functionality decoupling, and data-diversity importance) can influence model development, dataset curation, and policy across bioinformatics, ML, and biosecurity. Methodologically, evaluating 66 models across 18 scenarios with ablations plus released code/data strengthens reproducibility and community adoption. Paper 2 is valuable but more incremental within continual learning.

    vs. You Only Need Minimal RLVR Training: Extrapolating LLMs via Rank-1 Trajectories
    gpt-5.2 · 5/26/2026

    Paper 2 (ViroBench) is likely higher impact because it delivers a field-defining, reusable benchmark plus safety evaluation for viral genomics, enabling standardized comparison across many nucleotide foundation models and directly informing biosecurity-relevant deployment. Its applications span virology, bioinformatics, ML evaluation, and AI safety, and its findings (shift sensitivity, likelihood–function decoupling, data diversity over scale) are broadly actionable. Paper 1 is innovative and compute-reducing for RLVR in LLMs, but may generalize less beyond specific RLVR setups and has narrower cross-domain reach.

    vs. LAPLEX: The FFT of Learnable Laplace Kernels
    claude-opus-4.6 · 5/26/2026

    ViroBench addresses a critical gap in evaluating nucleotide foundation models for viral genomics, combining biological understanding with biosecurity assessment across 66 models and 18 scenarios. Its findings—that data diversity outweighs scale, and that generation models pose biosecurity risks—have immediate implications for both AI and public health. The benchmark's comprehensive nature, public availability, and timeliness (given ongoing pandemic preparedness needs) give it broader real-world impact. While LAPLEX is technically innovative in enabling efficient dense operations, its impact is more narrowly focused on computational efficiency in deep learning architectures.