Xiang Mei, Jordi Del Castillo, Pulkit Singh Singaria, Haoran Xi, Abdelouahab Benchikh, Tiffany Bao, Ruoyu Wang, Yan Shoshitaishvili
Achieving reproducibility, quantity, and diversity in vulnerability datasets has long been viewed as an inherent three-way trade-off, where improving one dimension often comes at the cost of the others. In practice, reproducibility has been the dimension most often neglected. This has limited what can be automatically extracted from historical bug datasets, and has reduced their utility for downstream security research. In this work, we propose a method to produce a new security dataset which ensures reproducibility for diverse vulnerabilities at scale by identifying the key obstacles to large-scale bug reproduction and addressing them with general solutions. Using this method, we introduce full reproducibility to the largest open source software vulnerability dataset (OSS-Fuzz) and construct the ARVO dataset (an Atlas of Reproducible Vulnerabilities in Open-source software). ARVO is a large-scale dataset consisting of over 6,100 real-world vulnerabilities across 311 projects. Focusing on reproducibility, ARVO differs from existing datasets by providing each vulnerability in a form that can be consistently rebuilt, triggered, and analyzed across versions. Reproducibility also enables automatic identification of the corresponding patch for each vulnerability and supports direct interaction with vulnerabilities after code changes, capabilities that existing large-scale datasets do not provide. In our evaluation, ARVO successfully reproduces 81% of vulnerabilities and achieves 89.4% accuracy on the located patches. We also discuss ARVO's influence on both upstream practices and downstream security research.
ARVO addresses a fundamental gap in security research infrastructure: the lack of reproducible, large-scale vulnerability datasets. The paper argues that reproducibility, quantity, and diversity have traditionally been a three-way trade-off, with reproducibility consistently sacrificed. ARVO introduces reproducibility into OSS-Fuzz, the largest open-source vulnerability dataset, by solving three technical challenges: incompatible dependencies, missing resources, and fragile build processes.
The core technical contributions are: (1) a minimally intrusive build instrumentation approach that preserves original build flows while enabling revision control; (2) timestamp-based dependency version selection for commit bisection; and (3) a detection-fixing loop for broken resources. The resulting dataset contains 6,138 reproducible vulnerabilities across 311 projects, each packaged as Docker images with both vulnerable and fixed versions.
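Contribution (2) and the commit bisection it supports can be illustrated with a minimal sketch. This is not ARVO's implementation: the function names, the data layout (a list of (timestamp, revision) pairs sorted by time), and the `crashes` callback are all assumptions made for illustration. The sketch picks the newest dependency revision committed at or before a target time, then binary-searches the project history for the first commit on which the PoC no longer crashes.

```python
from bisect import bisect_right
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical data layout: a dependency's history as (unix_timestamp, revision)
# pairs sorted by timestamp. None of these names come from ARVO itself.
History = Sequence[Tuple[int, str]]


def pick_dependency_revision(history: History, target_ts: int) -> Optional[str]:
    """Return the newest revision committed at or before target_ts, if any."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, target_ts)
    return history[idx - 1][1] if idx > 0 else None


def first_fixed_commit(commits: Sequence[str],
                       crashes: Callable[[str], bool]) -> Optional[str]:
    """Binary-search for the first commit where the PoC stops crashing.

    Assumes commits are ordered oldest to newest and that the outcome flips
    from crashing to not crashing exactly once along the history.
    """
    if not commits or not crashes(commits[0]) or crashes(commits[-1]):
        return None  # no crash-to-fixed transition inside this range
    lo, hi = 0, len(commits) - 1  # commits[lo] crashes, commits[hi] does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if crashes(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Toy usage with made-up revisions and a fake crash oracle.
dep_history = [(100, "dep-a"), (200, "dep-b"), (300, "dep-c")]
project_commits = ["c0", "c1", "c2", "c3", "c4"]
still_vulnerable = {"c0", "c1", "c2"}

chosen = pick_dependency_revision(dep_history, 250)
fix = first_fixed_commit(project_commits, lambda c: c in still_vulnerable)
print(chosen, fix)
```

In a real pipeline the `crashes` callback would rebuild the project at the given commit and run the PoC, which is where matching dependency revisions to the commit's timestamp matters: a build against today's dependencies may fail for reasons unrelated to the bug.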
The evaluation is reasonably thorough. The reproducibility comparison uses a controlled random sample of 100 vulnerabilities, showing 81% success versus OSS-Fuzz's 37%—a substantial improvement. The ablation study (Table 3) systematically disables features, demonstrating their complementary contributions. The patch accuracy evaluation (89.4%) uses a stratified sampling approach across agree/disagree/partially-agree groups, with manual verification of 100 samples per group.
However, several methodological concerns merit attention. The 100-sample evaluation, while controlled, is modest relative to the full dataset. The patch accuracy metric is a weighted average across groups sampled at different rates, and the disagree group, where accuracy matters most, reaches only 63% correctness. The paper acknowledges, but does not fully resolve, the concern that PoC-based reproduction does not guarantee that the reproduced crash matches the original vulnerability. The syzbot generalization experiment (78/100 success) is helpful but preliminary.
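To make the sample-size concern concrete, the sketch below computes a 95% Wilson score interval for a binomial proportion and applies it to the three 100-sample results quoted in this review (81, 37, and 78 successes). The choice of the Wilson interval is an assumption of this sketch, not a method attributed to the paper, and it treats each sample as a simple random draw.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Success counts out of 100, as reported in the review text.
samples = [("ARVO", 81), ("OSS-Fuzz", 37), ("syzbot", 78)]
for label, k in samples:
    low, high = wilson_interval(k, 100)
    print(f"{label}: {k}/100 -> [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval around 81% spans roughly 72% to 87%, and it does not overlap the interval around OSS-Fuzz's 37%. The improvement is therefore unlikely to be a sampling artifact, but the headline rate itself carries several points of uncertainty in either direction.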
One notable weakness is the conflation of the framework's capabilities with the dataset's quality. The 6,138 completed reproductions represent 69% of the filtered upstream set (8,921 issues), not the 81% headline figure, which comes from the 100-sample experiment. The paper attributes the gap to time constraints rather than methodological limitations, which is plausible but unverified.
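The arithmetic behind the two figures can be checked directly. The short sketch below uses only the counts quoted in this review; the extrapolated count is a hypothetical, since the paper is not stated to claim that the 81% sample rate holds across the whole filtered set.

```python
completed = 6138          # reproductions in the released dataset
filtered_upstream = 8921  # filtered OSS-Fuzz issues
sample_successes, sample_size = 81, 100

# Share of the filtered upstream set that was actually reproduced.
coverage = completed / filtered_upstream

# Hypothetical: how many reproductions the sample rate would imply
# if it held uniformly across the filtered set.
implied = round(filtered_upstream * sample_successes / sample_size)
shortfall = implied - completed

print(f"coverage: {coverage:.1%}")
print(f"implied at sample rate: {implied}, shortfall: {shortfall}")
```

The shortfall is the number of issues that the time-constraint explanation would need to account for.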
ARVO's impact potential is substantial and already partially realized:
Immediate practical impact: ARVO's reproducer has been merged into OSS-Fuzz, directly improving Google's infrastructure. It served as a benchmark for multiple teams in DARPA's AI Cyber Challenge (AIxCC), and CyberGym uses it as a data source. These are concrete, verified adoption signals that few academic papers achieve.
Downstream research enablement: Reproducible vulnerability datasets are critical for evaluating automated program repair (e.g., PATCHAGENT), fuzzer benchmarking, and binary analysis. By providing recompilable environments, ARVO enables instrumentation changes, sanitizer swapping, and other modifications that static datasets cannot support.
Upstream quality improvement: The discovery of 1,519 false positives and 300+ unfixed vulnerabilities in OSS-Fuzz demonstrates a valuable feedback loop. The finding that 14.5% of OSS-Fuzz records contain errors is significant for anyone using this data.
Vulnerability backporting: The automated approach to creating fuzzing benchmarks (Section 6.2) scales Magma's manual approach, though with acknowledged limitations in patch applicability and triggerability.
This work is highly timely. The explosion of LLM-based vulnerability repair systems (PATCHAGENT, etc.) has created urgent demand for large-scale, reproducible evaluation benchmarks. The community has recognized dataset quality as a bottleneck—prior work like Mu et al. (2018) documented the reproduction problem but didn't solve it at scale. The integration into DARPA's AIxCC program underscores the practical demand.
The focus on C/C++ memory safety vulnerabilities remains relevant despite the push toward memory-safe languages, as legacy C/C++ codebases will persist for decades. The methodology's demonstrated portability to kernel bugs (syzbot) suggests broader applicability.
The paper's framing of reproducibility as a "missing dimension" is compelling and well-supported by the comparative analysis in Table 1. The concrete example of issue #42486945 (a vulnerability mislabeled as fixed for two years) effectively illustrates the real-world consequences of non-reproducible datasets. The ethical handling of discovered unfixed vulnerabilities is appropriate.
The work is more an infrastructure and systems contribution than an algorithmic innovation, but its practical impact, already demonstrated through adoption, may exceed that of many papers that are technically deeper but less practically useful.
Generated Jun 17, 2026
Paper 1 (ARVO) provides a foundational, large-scale, reproducible vulnerability dataset. Overcoming the trade-off between reproducibility, scale, and diversity addresses a critical bottleneck in software security. Foundational datasets historically catalyze massive downstream advancements across multiple subfields (e.g., automated program repair, fuzzing, ML-based vulnerability detection), yielding broader and longer-lasting scientific impact than the specific, albeit highly innovative, federated learning privacy attack presented in Paper 2.
Paper 1 identifies a novel and practically significant security vulnerability in widely-used HNSW vector databases, directly relevant to the rapidly growing RAG/LLM ecosystem. It demonstrates concrete privacy risks with regulatory implications (GDPR, HIPAA), provides empirical evidence across multiple data modalities, and proposes a practical mitigation (Epoch Key Rotation) with cryptographic guarantees. Its timeliness, at the intersection of AI infrastructure security, data privacy regulation, and the LLM boom, gives it broader interdisciplinary impact. Paper 2 (ARVO) makes a solid engineering contribution to vulnerability datasets but addresses a more incremental, niche problem within the security research community.
While Paper 1 offers a highly innovative LLM-based detection methodology, Paper 2 (ARVO) provides a foundational dataset of over 6,100 reproducible vulnerabilities. High-quality, scalable benchmark datasets typically yield broader and longer-lasting scientific impact by enabling widespread downstream research, standardizing evaluations, and serving as the essential testing ground for future tools (including agents like Code-Augur). Solving the reproducibility-scale trade-off addresses a critical, long-standing bottleneck in cybersecurity research, giving ARVO the edge in overarching scientific utility.
Paper 2 addresses a fundamental bottleneck in security research by providing a large-scale, highly reproducible dataset of software vulnerabilities. Foundational resources and datasets like ARVO typically yield broad, long-lasting impact as they enable extensive downstream research in vulnerability detection, fuzzing, and automated patching. In contrast, while Paper 1 presents an interesting application of GNNs for cyber-physical systems, its emulation-based case study and highly specific context limit its generalizability and foundational scientific contribution compared to Paper 2.
ARVO addresses a fundamental, long-standing challenge in security research—reproducibility of vulnerability datasets—at unprecedented scale (6,100+ vulnerabilities across 311 projects). It enables multiple downstream research directions (patch identification, vulnerability analysis, automated repair) and serves as critical infrastructure for the entire software security community. OTRO solves an important but narrower problem (side-channel leakage in LLM tokenizers within TEEs), with impact limited to a specific deployment scenario. ARVO's breadth of impact, community utility as a reusable dataset, and influence on both upstream and downstream practices give it greater scientific impact potential.
ARVO addresses a fundamental infrastructure problem in security research—reproducibility of vulnerability datasets—at massive scale (6,100+ vulnerabilities, 311 projects). It provides a reusable resource that enables numerous downstream research directions (patch analysis, vulnerability detection, program repair, fuzzing benchmarks). Its breadth of impact across the security research community is substantial, as reproducible datasets are foundational. GRIEF is innovative and timely for LLM serving security, but targets a narrower domain. ARVO's dataset utility and methodological contribution to the reproducibility crisis give it broader and more lasting scientific impact.
Paper 2 (ARVO) offers higher scientific impact because reproducible, large-scale vulnerability datasets serve as foundational infrastructure for broader software security research. While Paper 1 presents an impressive production-tested industry architecture, its scientific reach is narrower. ARVO solves a persistent trade-off in empirical security, enabling countless downstream applications in automated vulnerability detection, fuzzing, and machine learning for code repair, which will likely generate significantly more citations and drive future academic work.
Paper 2 bridges a significant gap between theoretical cryptographic backdoor results and practical modern neural network architectures, demonstrating that undetectable backdoors are inherent to learned representations rather than requiring exotic constructions. This has profound implications for AI safety, trustworthy ML deployment, and security policy. While Paper 1 (ARVO) makes a valuable engineering contribution by improving vulnerability dataset reproducibility, Paper 2 introduces a fundamentally new conceptual framework connecting cryptographic undetectability to the geometry of latent spaces, with broader cross-disciplinary impact spanning cryptography, ML security, and AI governance.
ARVO addresses a fundamental infrastructure gap in security research by providing a large-scale, reproducible vulnerability dataset (6,100+ vulnerabilities across 311 projects). It has broader impact potential: it serves as a reusable resource for multiple downstream research areas (fuzzing, program repair, vulnerability detection), directly improves upstream practices (OSS-Fuzz), and solves a well-recognized three-way trade-off problem. Paper 1, while technically sophisticated in analyzing safety geometry of LLMs, is narrower in scope—focused on diagnostic methodology for a specific alignment setting—and its findings are more incremental and model-specific.
Paper 2 (ARVO) introduces a foundational, large-scale reproducible vulnerability dataset that addresses a critical bottleneck in security research. High-quality datasets typically drive broad downstream research in program analysis, software engineering, and machine learning, leading to a much wider impact and higher citation potential compared to the domain-specific algorithmic improvements for network intrusion detection presented in Paper 1.