scTranslation: A Comprehensive Benchmark for Single-Cell Multi-Omics Modality Translation

Jiabei Cheng, Jingbo Zhou, Jun Xia, Changkai Li, Zhen Lei, Chang Yu, Stan Z. Li

Jun 2, 2026

arXiv:2606.03906v1

cs.AI (primary)

#1324 of 3355 · Artificial Intelligence

Tournament Score: 1427±46 (scale 1050 to 1800)

Win Rate: 41%

Rating: 5/10

Significance 5.5 · Rigor 5 · Novelty 3.5 · Clarity 5.5

Abstract

Simultaneous measurement of multiple omics modalities in single cells enables researchers to gain a more comprehensive understanding of cellular states and regulatory mechanisms. However, due to high experimental costs, significant noise, and incomplete modality coverage, a variety of computational methods for modality translation have emerged in recent years. Despite the development of translation models, there is still a lack of systematic benchmark evaluation in terms of datasets, evaluation metrics, and influencing factors. To address this, we present scTranslation, a comprehensive benchmark for single-cell multi-omics modality translation tasks. It includes diverse translation datasets, integrates state-of-the-art models, and provides a comprehensive set of evaluation metrics. In addition, we assess model performance under different scenarios, such as feature selection, feature quality, and few-shot settings. These factors significantly affect model performance but have rarely been systematically studied before. Leveraging this benchmark, we conduct a large-scale study of current methods and report many insightful findings that open up new possibilities for future development. The benchmark is open-sourced to facilitate future research. The code is anonymously released at https://github.com/Bunnybeibei/scTranslation.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: scTranslation

1. Core Contribution

scTranslation proposes a comprehensive benchmarking framework for single-cell multi-omics modality translation — the computational task of predicting one molecular modality (e.g., ATAC-seq, protein) from another (e.g., RNA-seq) at single-cell resolution. The benchmark integrates eight diverse datasets spanning multiple techniques (SNARE-seq, sci-CAR, 10x Multiome, CITE-seq, scCAT-seq), species (human, mouse), organs, scales, and developmental stages (the "6M" criteria). It evaluates six representative models from three architectural families (AE-based, VAE-based, distribution-based) across eight metrics organized into clustering, regression, and distribution categories. Additionally, it systematically examines three practical influencing factors — feature selection, feature quality (sparsity/dropout), and few-shot learning — that have been underexplored in prior work.

The main novelty is not in any individual component but in the systematic assembly of a standardized evaluation ecosystem. The paper fills a genuine gap: prior methods were typically evaluated on idiosyncratic datasets with inconsistent metrics, making cross-method comparison unreliable.
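To make the three influencing factors named above concrete, here is a minimal NumPy sketch of how each perturbation could be applied to a cells-by-features matrix. It is an illustration under stated assumptions, not the benchmark's code: the function names are hypothetical, plain variance ranking stands in for whatever highly variable gene procedure the paper uses, and the few-shot split here is a simple random split rather than the paper's fold-based one.

```python
import numpy as np

def top_variable_features(x, k):
    """Indices of the k highest-variance columns (a crude stand-in for HVG selection)."""
    return np.sort(np.argsort(x.var(axis=0))[-k:])

def random_mask(x, rate, rng):
    """Zero out a random fraction `rate` of entries to mimic additional dropout."""
    keep = rng.random(x.shape) >= rate
    return x * keep

def few_shot_split(n_cells, train_fraction, rng):
    """Shuffle cells, train on a small fraction, and test on the remainder."""
    perm = rng.permutation(n_cells)
    n_train = int(round(train_fraction * n_cells))
    return perm[:n_train], perm[n_train:]

# Synthetic count matrix (200 cells x 50 features); purely illustrative data.
rng = np.random.default_rng(0)
x = rng.poisson(lam=rng.uniform(0.5, 5.0, size=50), size=(200, 50)).astype(float)

hvg_idx = top_variable_features(x, k=10)          # feature selection
x_masked = random_mask(x, rate=0.3, rng=rng)      # feature quality
train_idx, test_idx = few_shot_split(x.shape[0], train_fraction=0.2, rng=rng)  # few-shot

print("selected features:", hvg_idx)
print("zero fraction before/after masking:", np.mean(x == 0), np.mean(x_masked == 0))
print("train/test cells:", train_idx.size, test_idx.size)
```

Each perturbation is a separate function so that one factor can be varied while the others are held fixed, which is the kind of controlled comparison the benchmark is described as running.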

2. Methodological Rigor

Strengths:

The use of stratified 5-fold cross-validation with reported mean ± standard deviation across all experiments is appropriate and supports reproducibility.

The metric suite is well-motivated and multi-dimensional, covering biological structure preservation (NMI, ARI, AMI, HOM), quantitative accuracy (PCC, MSE), and distributional alignment (MMD, LISI). This avoids the trap of single-metric optimization.

All model hyperparameters are kept as originally published, ensuring fair comparison.
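The evaluation protocol credited above (stratified 5-fold cross-validation with mean ± standard deviation per metric) can be sketched as follows. This is a hedged illustration only: the splitter is a small hand-written helper, the translator is an ordinary least-squares placeholder rather than any benchmarked model, the data are synthetic, and only the two regression metrics (PCC, MSE) are shown.

```python
import numpy as np

def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep class proportions similar per fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in np.unique(labels):
        cls_idx = rng.permutation(np.flatnonzero(labels == cls))
        # Split the shuffled cells of this class into n_splits near-equal chunks.
        for i, chunk in enumerate(np.array_split(cls_idx, n_splits)):
            folds[i].extend(chunk.tolist())
    all_idx = np.arange(labels.size)
    for test in folds:
        test_idx = np.sort(np.array(test, dtype=int))
        yield np.setdiff1d(all_idx, test_idx), test_idx

def pcc(y_true, y_pred):
    """Pearson correlation computed over all flattened entries."""
    return float(np.corrcoef(y_true.ravel(), y_pred.ravel())[0, 1])

def mse(y_true, y_pred):
    """Mean squared error over all entries."""
    return float(np.mean((y_true - y_pred) ** 2))

def fit_predict_linear(x_train, y_train, x_test):
    # Placeholder translator: ordinary least squares, NOT one of the benchmarked models.
    w, *_ = np.linalg.lstsq(x_train, y_train, rcond=None)
    return x_test @ w

# Synthetic source/target modalities with three balanced "cell types".
rng = np.random.default_rng(0)
n_cells, n_src, n_tgt = 150, 10, 4
labels = np.repeat([0, 1, 2], n_cells // 3)
x = rng.normal(size=(n_cells, n_src))
y = x @ rng.normal(size=(n_src, n_tgt)) + 0.1 * rng.normal(size=(n_cells, n_tgt))

scores = {"PCC": [], "MSE": []}
for train_idx, test_idx in stratified_kfold_indices(labels, n_splits=5, seed=0):
    y_hat = fit_predict_linear(x[train_idx], y[train_idx], x[test_idx])
    scores["PCC"].append(pcc(y[test_idx], y_hat))
    scores["MSE"].append(mse(y[test_idx], y_hat))

# Report mean and standard deviation across the five folds.
summary = {name: (float(np.mean(v)), float(np.std(v))) for name, v in scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} ± {std:.3f}")
```

Keeping the per-fold scores in a list, rather than only the summary, also makes it possible to run paired comparisons between methods later.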

Weaknesses:

The paper does not introduce any new method or propose improvements to existing models. While benchmarks are valuable, the analytical depth beyond reporting numbers is somewhat limited. The "insightful findings" are relatively expected (e.g., more HVGs help then hurt; more dropout degrades performance; diffusion models benefit from implicit augmentation in few-shot settings).

Some experimental design choices limit the analysis. For instance, the few-shot setting (train on one fold, test on four) is a fixed 20/80 split rather than a systematically varied sample-size ablation, so it probes only a single level of data scarcity.

The feature quality experiments use random masking to simulate dropout, which may not faithfully capture the structured missingness patterns in real single-cell data (e.g., gene-length bias, expression-level-dependent dropout).

Several models show zero LISI scores across many datasets, suggesting either implementation issues, incompatible output formats, or a metric calculation problem. This is not adequately discussed.

The paper lacks statistical significance testing between methods, making it difficult to determine whether observed differences are meaningful given the variance.
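One way the missing significance testing could be approached is a paired test on per-fold scores. The sketch below implements an exact two-sided sign-flip permutation test in NumPy. It is an assumption-laden illustration: the per-fold PCC values are placeholders invented for the example, not results from the paper, and because cross-validation folds share training data the test is only approximate.

```python
from itertools import product

import numpy as np

def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-fold scores."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diffs.mean())
    # Enumerate all 2^n sign assignments; under the null each is equally likely.
    signs = np.array(list(product([-1.0, 1.0], repeat=diffs.size)))
    null_means = np.abs((signs * diffs).mean(axis=1))
    # Small tolerance guards against float round-off in the comparison.
    p_value = float(np.mean(null_means >= observed - 1e-12))
    return float(diffs.mean()), p_value

# Placeholder per-fold PCC values for two hypothetical methods (NOT from the paper).
method_a = [0.81, 0.79, 0.83, 0.80, 0.82]
method_b = [0.78, 0.77, 0.80, 0.79, 0.78]

mean_diff, p_value = paired_sign_flip_test(method_a, method_b)
print(f"mean difference = {mean_diff:.3f}, p = {p_value:.4f}")
```

With only five folds there are 2^5 = 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625. A 5-fold design alone therefore cannot reach the conventional 0.05 threshold with this test, which suggests repeated cross-validation or more folds if formal testing is the goal.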

3. Potential Impact

Benchmarking papers in computational biology can have outsized practical impact by standardizing evaluation and accelerating method development. scTranslation could serve this role for the multi-omics translation community, analogous to how benchmarks like the NeurIPS 2021 multimodal single-cell competition catalyzed progress. The open-source code and curated datasets lower the barrier for new method development.

However, the impact may be limited by:

The relatively narrow scope of six models (several from 2024, but missing important methods like MultiVI, totalVI, Cobolt, and others cited but not benchmarked).

The lack of downstream biological validation — the paper evaluates translation quality in statistical terms but does not assess whether translated data improve actual biological tasks (e.g., gene regulatory network inference, trajectory analysis, differential expression).

No computational cost analysis is provided, which is important for practical adoption.

4. Timeliness & Relevance

The paper addresses a timely need. Multi-omics single-cell technologies are proliferating rapidly, and the gap between experimental capacity and computational translation methods is widening. Standardized benchmarks are urgently needed. The 2024-2025 period has seen an explosion of translation methods, making systematic comparison increasingly important.

That said, the benchmark landscape is not entirely empty: Xiao et al. (2024, cited as [44]) published a related benchmark for RNA-ATAC integration, and the NeurIPS 2021 competition [31] established some standards. The paper could have done a better job differentiating itself from these prior efforts beyond the inclusion of influencing factors.

5. Strengths & Limitations

Key Strengths:

Comprehensive dataset curation following principled "6M" criteria

Multi-faceted evaluation metrics covering three complementary aspects

Systematic investigation of practical influencing factors (feature selection, quality, few-shot) that are genuinely underexplored

Bidirectional evaluation (both directions of translation) revealing asymmetric model behavior

Open-source release facilitating reproducibility

Notable Limitations:

No novel methodology or theoretical contribution; purely empirical

Analysis remains largely descriptive rather than explanatory — the paper documents what happens but rarely explains why or proposes solutions

Missing important baselines (MultiVI, totalVI, Cobolt, MOFA+) despite citing them

No downstream task evaluation to validate biological utility of translations

Inconsistent model support (BABEL cannot handle protein input; multiDGD/scDiffusion-X required modifications for RNA-to-protein) introduces confounds

The writing could be tighter; the paper is table-heavy with limited synthesis of findings into actionable guidelines

Some results are difficult to interpret (e.g., LISI = 0.00 for many model-dataset combinations)

The claim of "many insightful findings" in the abstract is somewhat oversold relative to the novelty of the observations

6. Additional Observations

The paper would benefit significantly from: (1) a ranking or recommendation table summarizing which method to use under which conditions, (2) computational efficiency comparisons, (3) analysis of failure modes and potential causes, and (4) concrete guidelines for practitioners. The current format is more of a data dump than an analytical resource. The venue (KDD) is reasonable but may not reach the primary audience for single-cell biology methods; venues like Genome Biology, Nature Methods, or Bioinformatics might yield greater domain impact.
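A ranking table of the kind suggested in point (1) could start from a simple mean-rank aggregation across metrics. The sketch below is a minimal example in plain Python. The method names and scores are hypothetical placeholders, not values from the paper, and ties are not handled.

```python
# Placeholder scores for hypothetical methods; NOT results from the paper.
scores = {
    "method_A": {"PCC": 0.80, "MSE": 0.12, "NMI": 0.61},
    "method_B": {"PCC": 0.75, "MSE": 0.10, "NMI": 0.66},
    "method_C": {"PCC": 0.70, "MSE": 0.20, "NMI": 0.50},
}
# Direction of each metric: MSE is an error, so lower is better.
higher_is_better = {"PCC": True, "MSE": False, "NMI": True}

def mean_ranks(scores, higher_is_better):
    """Return {method: mean rank}, where rank 1 is best on each metric (ties not handled)."""
    methods = list(scores)
    totals = {m: 0.0 for m in methods}
    for metric, higher in higher_is_better.items():
        ordered = sorted(methods, key=lambda m: scores[m][metric], reverse=higher)
        for rank, m in enumerate(ordered, start=1):
            totals[m] += rank
    return {m: totals[m] / len(higher_is_better) for m in methods}

ranking = mean_ranks(scores, higher_is_better)
for method, rank in sorted(ranking.items(), key=lambda kv: kv[1]):
    print(f"{method}: mean rank {rank:.2f}")
```

Rank aggregation avoids mixing metrics with different scales and directions. Computing it separately per dataset or per translation direction would give the "which method under which conditions" view that the recommendation asks for.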

Rating: 5/10

Significance 5.5 · Rigor 5 · Novelty 3.5 · Clarity 5.5

Generated Jun 3, 2026

Comparison History (17)

vs. MulFeRL: Enhancing Reinforcement Learning with Verbal Feedback in a Multi-turn Loop

gemini-3.1 · 6/5/2026

Paper 1 introduces a novel reinforcement learning framework that leverages verbal feedback to improve LLM reasoning, directly addressing the critical limitation of sparse scalar rewards. Given the explosive growth and broad applicability of LLMs, this algorithmic advancement has high potential for widespread, cross-disciplinary impact. Paper 2 presents a valuable but narrower benchmark for single-cell multi-omics translation. While useful for bioinformatics, benchmarks typically organize existing knowledge rather than introducing paradigm-shifting methodologies. Thus, Paper 1's fundamental innovation in a highly active and universally relevant field yields higher potential scientific impact.

vs. The Meta-Agent Challenge: Are Current Agents Capable of Autonomous Agent Development?

gemini-3.1 · 6/5/2026

Paper 2 explores a frontier topic with profound implications for AI: recursive self-improvement and autonomous agent development. While Paper 1 provides a valuable benchmark for single-cell biology, Paper 2 addresses a fundamental challenge in AI capabilities and safety. The framework's ability to expose emergent adversarial behaviors during agent development highlights critical alignment issues. Given the rapid advancement of LLM-based systems, Paper 2's focus on meta-agents offers broader, more transformative potential for the AI research community.

vs. Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

gpt-5.2 · 6/5/2026

Paper 1 introduces a novel, broadly applicable verification concept (constraint injection) that addresses a fundamental failure mode in LLM-generated optimization models, and integrates it into both data synthesis filtering and RL training, demonstrating large performance gains on a hard, constraint-dense domain with a new expert-verified benchmark. Its impact can extend beyond VRPs to many formal-specification/code-generation settings (OR, planning, program synthesis, safety). Paper 2 is valuable infrastructure (a comprehensive benchmark) with clear utility in single-cell multi-omics, but is primarily evaluative rather than methodologically transformative.

vs. Diagnosing Knowledge Gaps in LLM Tool Use: An Agentic Benchmark for Novel API Acquisition

gpt-5.2 · 6/3/2026

Paper 1 likely has higher impact due to greater timeliness and cross-field breadth: robust LLM tool/API use is central to current AI deployment, software engineering, agents, and safety/reliability. NovelAPIBench’s automated dynamic construction (discovering novel APIs, generating executable tasks, and fine-grained diagnostic categorization) is methodologically innovative and broadly reusable across libraries and models, enabling standardized evaluation and driving practical improvements in retrieval + tuning strategies. Paper 2 is valuable and rigorous for single-cell multi-omics, but its domain scope is narrower and benchmarking modality translation, while important, is less universally transformative than LLM tool-use generalization.

vs. EvoBrain: Continual Learning of EEG Foundation Models Across Heterogeneous BCI Tasks

gemini-3.1 · 6/3/2026

Paper 1 pioneers cross-task continual learning for EEG foundation models, addressing critical bottlenecks in BCI scalability. While Paper 2 provides a highly valuable bioinformatics benchmark, Paper 1 introduces a novel methodological framework (NSN and RAD) that drives theoretical and practical innovation toward unified brain decoding, offering broader transformative potential in neuroscience and human-computer interaction.

vs. AHD Agent: Agentic Reinforcement Learning for Automatic Heuristic Design

gemini-3.1 · 6/3/2026

Paper 1 introduces a novel Agentic RL framework to solve NP-hard combinatorial optimization problems, offering profound real-world applications across logistics, scheduling, and AI. While Paper 2 provides a valuable computational biology benchmark, Paper 1 represents a methodological breakthrough that enables compact LLMs to outperform larger models in complex reasoning and heuristic design, granting it wider cross-disciplinary impact and higher scientific significance.

vs. SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

claude-opus-4.6 · 6/3/2026

scTranslation addresses a critical need in the rapidly growing single-cell multi-omics field by providing a systematic benchmark with standardized datasets, metrics, and evaluation scenarios. This infrastructure contribution will directly accelerate method development in computational biology, serving a large and active research community. While SAGE (Paper 1) introduces an interesting evaluation framework for social agent evolution with novel findings, its scope is narrower—focused on a niche aspect of LLM agent self-improvement. Paper 2's practical utility as an open-source benchmark, combined with the broader biomedical impact of single-cell genomics, gives it higher potential impact.

vs. Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

gpt-5.2 · 6/3/2026

Paper 1 proposes a novel, general-purpose “planet-scale” runtime/protocol (EACN) for dynamically connecting heterogeneous scientific capabilities (simulations, wet labs, proof engines, AI agents) with emergent coordination, demonstrated across two very different domains. If validated and adopted, it could reshape how AI-driven discovery is orchestrated across fields, with broad, timely impact. Paper 2 is methodologically rigorous and highly useful, but as a domain-specific benchmark it is incremental and likely narrower in cross-field transformative reach.

vs. Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

gemini-3.1 · 6/3/2026

Paper 1 introduces a novel theoretical framework (Partial Information Decomposition) to solve a critical open problem in AI: understanding and interpreting modality interactions in Multimodal Large Language Models. By identifying synergy bottlenecks and offering actionable improvements, it provides deep foundational insights applicable across the rapidly expanding AI field. While Paper 2 offers a valuable benchmark for single-cell genomics, it is primarily an evaluation of existing methods. Paper 1's methodological innovation and broad relevance to the design of next-generation multimodal AI systems give it a higher potential for widespread scientific impact.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

gpt-5.2 · 6/3/2026

Paper 1 likely has higher scientific impact due to strong timeliness and broad utility: an open, comprehensive benchmark can standardize evaluation and accelerate progress across many single-cell multi-omics translation methods and downstream biological applications. Benchmarks often become field infrastructure, enabling reproducibility and fair comparison, with clear real-world relevance to costly/noisy multi-omics experiments. Paper 2 is novel and relevant for LLM-based social simulation, but its impact may be narrower and more sensitive to validity concerns about LLM agent modeling and evaluation in simulated settings.

vs. Mitigating Hallucinations in Large Language Models Via Decoder Layer Skipping

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: a comprehensive, open benchmark for single-cell multi-omics translation can standardize evaluation, accelerate method development, and be broadly adopted across computational biology and genomics. Its real-world relevance is strong given rapid growth of single-cell multi-omics, high experimental costs, and the need for reliable modality imputation/translation. While Paper 1 is novel and timely for LLM reliability, decoder-layer skipping is a narrower algorithmic contribution with uncertain long-term adoption versus a community benchmark that can shape an entire subfield’s methodology and comparability.

vs. Code-on-Graph: Iterative Programmatic Reasoning via Large Language Models on Knowledge Graphs

claude-opus-4.6 · 6/3/2026

Paper 2 (Code-on-Graph) addresses a fundamental challenge in LLM-KG integration with a novel programmatic reasoning framework that demonstrates significant performance improvements (up to 10.5%) on established benchmarks. It tackles the broadly impactful problem of LLM hallucination and knowledge limitations, which is highly timely given the widespread adoption of LLMs. Paper 1 (scTranslation) provides a valuable benchmark for single-cell multi-omics translation but is more incremental as a benchmarking study rather than introducing a fundamentally new method. Paper 2's broader applicability across AI/NLP gives it higher potential impact.

vs. Forget Attention: Importance-Aware Attention Is All You Need

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a novel architectural concept (score-level fusion of SSMs and attention) that defines a new design axis for hybrid language models, addressing a fundamental challenge in the dominant field of language modeling. Its innovation is more foundational and broadly applicable across NLP/AI. Paper 1, while valuable as a benchmark for single-cell multi-omics translation, is more incremental—systematizing existing methods rather than proposing a new paradigm. The breadth of impact for advances in language model architecture far exceeds that of a domain-specific benchmark study.

vs. BehaviorBench: Modeling Real-World User Decisions from Behavioral Traces

claude-opus-4.6 · 6/3/2026

scTranslation addresses a critical need in single-cell genomics—a rapidly growing field with broad biomedical impact. It benchmarks computational methods for multi-omics modality translation, which has direct applications in understanding cellular regulation and disease mechanisms. The systematic evaluation of factors like feature selection and few-shot settings provides actionable insights for method developers. BehaviorBench, while novel in using real-world prediction market data for personalized decision modeling, targets a narrower domain (crypto/prediction markets) with less immediate scientific breadth. scTranslation's open-source framework and relevance to biology give it broader and more lasting impact.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact because it delivers an open, comprehensive benchmark for single-cell multi-omics modality translation—a rapidly growing, high-stakes area with direct biomedical applications. Benchmarks often become community infrastructure, shaping evaluation standards, enabling fair comparison, and accelerating method development across labs. It also studies underexplored but practically critical factors (feature quality/selection, few-shot), increasing usefulness and rigor. Paper 1 is novel and timely for multimodal AI, but its impact may be narrower (specific VLM spatial reasoning setting and datasets) and more dependent on adoption within a fast-moving model landscape.

vs. Proof-Refactor: Refactoring Generated Formal Proofs into Modular Artifacts

claude-opus-4.6 · 6/3/2026

scTranslation addresses a critical need in the rapidly growing single-cell multi-omics field by providing a comprehensive benchmark with diverse datasets, standardized metrics, and systematic evaluation of factors like feature selection and few-shot settings. This infrastructure paper will likely be widely adopted by the computational biology community, influencing method development and evaluation practices. While Proof-Refactor is a solid contribution to formal verification, its impact is narrower—targeting LLM-generated proof quality in a smaller research community. The biological benchmark has broader real-world applications in understanding disease mechanisms and cellular regulation.

vs. Unveiling the Structure of Do-Calculus Reasoning via Derivation Graphs

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to strong timeliness and broad real-world applicability in single-cell multi-omics, a rapidly growing field with many labs needing standardized evaluation. A comprehensive, open-source benchmark with datasets, metrics, and scenario analyses can become community infrastructure, shaping method development across genomics, ML, and bioinformatics. Paper 1 offers novel theoretical structure for do-calculus reasoning and could influence causal inference, but its audience is narrower and downstream adoption may be slower compared to a widely usable benchmark resource.