RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations

Leihan Zhang, Wecheng Ye, Xianlong Ma, Haochuan Liu, Yang Li, Qianyu Zhang, Jinliang Chen, Qiang Yan

Jun 7, 2026 · arXiv:2606.08376v1

cs.LG, cs.AI

#2946 of 5669 · cs.LG

Tournament Score: 1397 ± 43

Win Rate: 50%

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 5.5 · Clarity 7

Abstract

As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and failures have grown in frequency and diversity. Although existing governance frameworks articulate high-level principles for responsible AI, large-scale empirical resources for tracking and analyzing real-world AI risk incidents remain limited. Existing incident collections are often manually curated, relatively small in scale, and insufficient for continuous, data-driven monitoring and downstream computational analysis. To address this need, we present RiskNet, a large-scale dataset of AI risk incidents constructed from large-scale multilingual news sources. RiskNet applies a structured pipeline for AI risk news identification, event-level report screening, incident alignment, and multi-dimensional incident classification. The resulting resource organizes dispersed news reports into incident-centered records and provides benchmark datasets for event classification, incident alignment, and incident-level risk labeling. In its current release, RiskNet covers hundreds of millions of source records and yields a large-scale collection of AI risk-related reports, including aligned incident clusters and annotated benchmark subsets. The dataset is also accessible through an online platform for browsing and exploration. We describe the data sources, processing workflow, taxonomy design, and technical validation of the resource. RiskNet is intended to support downstream research on AI safety, governance, risk analysis, and benchmarking, as well as longitudinal and cross-source analyses of AI-related harms. By providing a structured and reusable empirical resource, RiskNet helps bridge the gap between high-level governance principles and the documented realities of AI risk incidents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: RiskNet

1. Core Contribution

RiskNet presents a large-scale, multilingual dataset of AI risk incidents constructed from news sources, accompanied by a structured pipeline for identification, alignment, and multi-dimensional classification of incidents. The core novelty lies in three areas: (1) the scale — processing hundreds of millions of source records to yield ~777K AI risk-related reports and ~265K event-level reports organized into ~54K incident clusters; (2) the incident alignment methodology that aggregates multiple news reports about the same real-world event into unified incident records using a dual-view retrieval and DeepWide pairwise classification approach; and (3) a multi-dimensional classification framework combining EU AI Act risk levels with MIT-derived domain taxonomies and causal tags. The paper addresses a genuine gap: existing AI incident repositories (AIID, AIAAIC) are manually curated, relatively small (~5K reports each), and lack automated cross-document incident linking.
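To make the dual-view retrieval idea concrete, here is a minimal sketch of candidate recall from two embedding views. It is an illustration only, not the paper's implementation: the function names, the toy vectors, and the choice of cosine similarity with a top-k union are all assumptions. In a pipeline of the kind described, the resulting candidate pairs would then be passed to a pairwise classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def top_k(query_idx, vectors, k):
    """Indices of the k vectors most similar to vectors[query_idx]."""
    scored = [
        (cosine(vectors[query_idx], v), j)
        for j, v in enumerate(vectors)
        if j != query_idx
    ]
    scored.sort(reverse=True)
    return [j for _, j in scored[:k]]

def dual_view_candidates(full_text_vecs, element_vecs, k):
    """Union of top-k neighbours from both views, as unordered pairs."""
    pairs = set()
    for i in range(len(full_text_vecs)):
        neighbours = top_k(i, full_text_vecs, k) + top_k(i, element_vecs, k)
        for j in neighbours:
            pairs.add((min(i, j), max(i, j)))
    return pairs

# Hypothetical toy embeddings: four reports, two views each.
full_text = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
elements = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1], [0.0, 1.0]]

candidates = dual_view_candidates(full_text, elements, k=1)
```

In this toy example the two views disagree about nearest neighbours, so the union surfaces pairs that either view alone would miss. That is the presumed motivation for combining them.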

2. Methodological Rigor

The pipeline is well-structured and technically reasonable, though several concerns arise:

Strengths in methodology:

The three-stage pipeline (identification → alignment → classification) is logically sound and clearly formalized with mathematical notation.

The dual-view candidate recall (full-text + event-element embeddings) is a sensible design choice that balances semantic and structural matching signals.

The conservative complete-link clustering constraint is well-motivated, with the authors documenting how transitive closure created a giant cluster of 260K+ reports in preliminary experiments.

Inter-annotator agreement (Cohen's κ = 0.74) for classification labels is reasonable.
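The giant-cluster problem behind the complete-link constraint can be reproduced on a toy graph. The sketch below contrasts transitive closure (connected components) with a greedy complete-link rule that merges two clusters only when every cross pair is a predicted positive. The edge list is hypothetical, and the greedy merge order is an assumption; the paper's actual clustering procedure may differ.

```python
from itertools import product

def transitive_closure(n, edges):
    """Connected components via union-find: any chain of links merges."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

def complete_link(n, edges):
    """Greedy merge: two clusters join only if ALL cross pairs are linked."""
    linked = {frozenset(e) for e in edges}
    clusters = [{i} for i in range(n)]
    merged = True
    while merged:
        merged = False
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                cross = product(clusters[x], clusters[y])
                if all(frozenset((a, b)) in linked for a, b in cross):
                    clusters[x] |= clusters[y]
                    del clusters[y]
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(c) for c in clusters)

# Hypothetical positive edges: a triangle (0, 1, 2) chained to 3 and 4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]

closure_clusters = transitive_closure(5, edges)
strict_clusters = complete_link(5, edges)
```

A single bridging edge such as (2, 3) is enough for transitive closure to fuse everything, while the complete-link rule keeps the two groups apart. Note that this greedy variant is order-dependent, so it is a sketch of the constraint rather than a canonical algorithm.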

Weaknesses:

Heavy reliance on LLMs throughout the pipeline (risk identification, structured extraction, pre-annotation) introduces compounding errors that are only partially validated. The precision/recall figures for individual stages don't account for error propagation across the full pipeline.

The pairwise alignment model achieves a positive precision of only 0.506 (Table 8), meaning roughly half of predicted same-incident edges are incorrect. While the complete-link clustering mitigates this somewhat (Hungarian Macro-F1 of 0.895), this low precision is concerning for downstream analyses.

The classification baselines (Tables 10-11) show modest performance, particularly for risk level (best accuracy ~54%) and subdomain classification (Macro-F1 ~0.37). The authors chose qwen3-14B-sft to classify all incidents despite it not being the best-performing model on all metrics, without strong justification.

The validation of AI relevance filtering uses AIID/AIAAIC as a reference set, which conflates recall of known incidents with actual population-level recall. True false negative rates on the broader news corpus remain unknown.

The event classification benchmark (Table 7) shows recall of only 0.76 for event-level AI risk reports, meaning approximately 24% of genuine event-level incidents may be missed.
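The arithmetic behind the precision and recall concerns above can be spelled out in a few lines. The figures 0.506 and 0.76 are the ones cited in this review; the second-stage recall of 0.90 is a made-up placeholder used only to show how errors compound, and the multiplication assumes stage errors are independent, which may not hold in practice.

```python
def false_discovery_rate(precision):
    """Share of predicted positives that are wrong."""
    return 1.0 - precision

def miss_rate(recall):
    """Share of true positives that are not retrieved."""
    return 1.0 - recall

def chained_recall(stage_recalls):
    """End-to-end recall if each stage filters independently (assumption)."""
    result = 1.0
    for r in stage_recalls:
        result *= r
    return result

# Pairwise alignment precision cited from Table 8.
pairwise_fdr = false_discovery_rate(0.506)

# Event-level recall cited from Table 7.
event_miss = miss_rate(0.76)

# 0.90 is a HYPOTHETICAL downstream recall, not a figure from the paper.
illustrative_end_to_end = chained_recall([0.76, 0.90])
```

Under these assumptions, a second stage with 90% recall would pull end-to-end recall from 0.76 down to about 0.68, which is why per-stage metrics alone understate pipeline-level loss.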

3. Potential Impact

AI Safety and Governance: RiskNet could serve as a valuable empirical complement to high-level governance frameworks. The ability to systematically track incident trends, identify underrepresented risk domains, and conduct cross-lingual comparisons addresses real needs in the AI governance community.

NLP and Information Extraction: The benchmark subsets for event classification, incident alignment, and multi-label classification provide useful evaluation resources, though the benchmark sizes are relatively modest (2,000 for event classification, ~1,752 reference incidents for alignment, 2,285 for classification).

Practical limitations on impact: The full dataset is not publicly released due to licensing constraints — only benchmarks, code, and sample data are available, with full access requiring application through an online platform. This significantly limits reproducibility and community adoption. The paper's claim of being an "open dataset" is partially undermined by this access model.

4. Timeliness & Relevance

The paper is highly timely. AI incident reporting has become a policy priority (OECD frameworks, EU AI Act reporting requirements), and the gap between governance principles and empirical tracking is widely recognized. The inclusion of multilingual sources, particularly Chinese-language news, addresses an important gap given the language bias documented in existing AI safety resources. The 2025-era coverage including deepfake fraud and agentic system incidents captures emerging risk categories.

5. Strengths & Limitations

Key Strengths:

Unprecedented scale: ~54K aligned incident clusters from ~265K event-level reports, far exceeding existing repositories.

Multilingual coverage bridging English and Chinese sources, addressing documented language gaps in AI safety.

The incident alignment framework is a genuine methodological contribution — moving beyond simple deduplication to cross-source event linking.

Thoughtful usage notes addressing temporal bias, source composition effects, and appropriate granularity for analysis.

Power-law analysis of cluster sizes (α ≈ 2.53) provides useful characterization of media attention distribution.
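For readers who want to check a power-law exponent like the one reported for cluster sizes, a minimal sketch of the standard continuous maximum-likelihood estimator follows. The review does not state how the paper fitted α, so this estimator, the `x_min` choice, and the toy cluster sizes are all assumptions. The continuous formula is also only an approximation for integer-valued data, especially at small `x_min`.

```python
import math

def power_law_alpha(sizes, x_min=1):
    """Continuous MLE: alpha = 1 + n / sum(ln(x_i / x_min)) over x_i >= x_min."""
    tail = [s for s in sizes if s >= x_min]
    log_sum = sum(math.log(s / x_min) for s in tail)
    if log_sum == 0:
        raise ValueError("need at least one value strictly above x_min")
    return 1.0 + len(tail) / log_sum

# Hypothetical cluster sizes, NOT taken from RiskNet.
toy_sizes = [1, 1, 1, 1, 2, 2, 3, 5, 8, 20]
alpha_hat = power_law_alpha(toy_sizes, x_min=1)
```

A careful replication would also select `x_min` from the data and test goodness of fit rather than rely on a point estimate alone.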

Notable Limitations:

Restricted data access undermines openness claims and limits community adoption.

Source bias: The dataset is heavily dominated by two sources — CommonCrawl News (674M raw) and China News commercial dataset (2.98M AI-related) — creating potential geographic and topical biases that are acknowledged but not deeply analyzed.

Classification quality: The risk level classifier achieving only ~54% accuracy raises questions about the reliability of dataset-wide labels applied using this model. The gap between human annotation quality (κ = 0.74) and model performance is substantial.

Evaluation limitations: The incident alignment benchmark relies on AIID/AIAAIC as ground truth, but these repositories themselves have known quality issues (as the paper acknowledges). There is no independent human evaluation of a random sample of aligned incident clusters from the full RiskNet dataset.

Lack of comparative analysis: The paper does not systematically compare the coverage, quality, or utility of RiskNet against existing repositories beyond scale metrics.

No downstream task demonstration: The paper presents the dataset but does not demonstrate its utility through any substantive downstream analysis (e.g., trend analysis, risk prediction, governance insights).

Overall Assessment

RiskNet addresses a real and growing need for large-scale, structured AI incident data. The scale and multilingual coverage are impressive, and the incident alignment framework is a meaningful technical contribution. However, the restricted data availability, modest classification accuracy for dataset-wide labeling, and absence of demonstrated downstream utility temper the potential impact. The work is more of an infrastructure contribution than a scientific breakthrough, and its ultimate value will depend on community adoption and the quality of research it enables.

Rating: 6/10

Significance 6.5 · Rigor 5.5 · Novelty 5.5 · Clarity 7

Generated Jun 9, 2026

Comparison History (20)

Won vs. BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

Paper 1 offers a foundational, large-scale dataset addressing the critical and timely issue of AI safety and governance. While Paper 2 provides a highly valuable technical optimization for LLM inference, datasets like RiskNet typically generate broader, long-lasting cross-disciplinary impact by establishing new benchmarks and enabling extensive downstream research across AI development, policy-making, and societal studies.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

RiskNet addresses a critical infrastructure gap in AI governance by providing a large-scale, structured empirical dataset for tracking real-world AI risk incidents. Its breadth of impact spans AI safety, governance, policy, and multiple research communities. The resource nature of the contribution means it enables numerous downstream studies. Paper 2, while technically sound with a novel framework for federated graph learning with missing modalities, addresses a narrower technical problem. RiskNet's timeliness amid growing AI deployment concerns and its potential to inform policy decisions give it broader and higher estimated impact.

claude-opus-4-6·Jun 9, 2026

Lost vs. Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

Paper 1 exposes a critical gap between theoretical differential privacy guarantees and empirical vulnerabilities in LLM adaptation, specifically regarding distribution shifts between pretraining and fine-tuning data. This provides highly actionable, technical insights for deploying LLMs in sensitive domains. While Paper 2 offers a valuable dataset for AI governance, Paper 1 directly impacts core machine learning methodologies and security practices, likely driving immediate changes in how practitioners and researchers approach privacy-preserving LLM fine-tuning.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Paper 1 introduces a novel architectural contribution (GraMO) that combines state-space models with graph-based learning in a principled way, addressing fundamental challenges in modeling interacting dynamical systems. It demonstrates strong empirical results across multiple benchmarks. While Paper 2 (RiskNet) provides a valuable dataset for AI governance research, dataset papers typically have narrower methodological impact. Paper 1's technical innovation in coupling spatial-temporal dynamics within a single recurrence has broader applicability across physics simulation, robotics, and scientific computing, and advances the methodological frontier more significantly.

claude-opus-4-6·Jun 9, 2026

Lost vs. Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Paper 1 introduces a novel, broadly applicable training framework that addresses a key limitation of RL with verifiable rewards (zero group-level advantage) via trace tournaments and efficient Bradley–Terry ranking, yielding demonstrated gains in reasoning benchmarks and compute savings—likely to influence future LLM training methods across domains. Paper 2 provides a valuable dataset for AI governance and risk analysis, but its scientific impact hinges more on adoption/maintenance and may be narrower methodologically. Overall, Paper 1 is more technically innovative, timely for LLM training, and likely to propagate across multiple research areas.

gpt-5.2·Jun 9, 2026

Won vs. Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

Paper 1 likely has higher scientific impact due to broader cross-field relevance (AI safety, governance, policy, NLP/IR, incident analysis), strong timeliness, and high leverage as a large-scale, reusable dataset/platform that can enable many downstream studies and benchmarks. Its real-world applicability is immediate for monitoring and evaluating AI harms. Paper 2 is innovative and rigorous for aerial manipulation and sim-to-real meta-RL, but its impact is narrower (robotics/UAVs) and depends more on adoption and reproducibility in specific hardware settings.

gpt-5.2·Jun 9, 2026

Lost vs. Temporal Preference Concepts and their Functions in a Large Language Model

Paper 2 offers higher scientific impact due to its novel contribution to mechanistic interpretability of LLMs, specifically causally localizing temporal preference representations—a previously unexplored area. It combines multiple rigorous methods (gradient attribution, activation patching, steering vectors) to provide actionable insights for AI alignment and control. The finding that LLMs discount the future differently than humans has broad implications for AI safety and deployment in decision-making. Paper 1, while useful as a dataset resource, is primarily an infrastructure contribution with incremental novelty over existing incident databases, and its impact depends on downstream adoption.

claude-opus-4-6·Jun 9, 2026

Lost vs. Unifying Model-Free Efficiency and Model-Based Representations via Latent Dynamics

Paper 1 introduces a fundamental algorithmic breakthrough in reinforcement learning, unifying model-free and model-based methods. Its rigorous theoretical proofs, error bounds, and extensive empirical validation across 80 diverse environments demonstrate exceptional methodological rigor. While Paper 2 provides a timely and valuable dataset for AI governance, Paper 1's potential to advance core AI capabilities and inspire follow-up algorithmic research gives it a higher estimated scientific impact.

gemini-3.1-pro-preview·Jun 9, 2026

Lost vs. $α$-PFN: Fast Entropy Search via In-Context Learning

Paper 2 has higher potential scientific impact due to a novel methodological contribution (amortizing entropy-search acquisition via PFNs/in-context learning) that directly improves a widely used core tool (Bayesian optimization) with large practical gains (50x speedups) and open-source code. Its approach is broadly applicable across ML, AutoML, experimental design, robotics, and hyperparameter tuning, making cross-field uptake likely. Paper 1 provides a valuable dataset/platform for AI risk analysis, but its impact depends on sustained curation, coverage, and adoption, and is more domain-specific and sensitive to data/annotation biases.

gpt-5.2·Jun 9, 2026

Lost vs. End-to-End Subgraph Detection with GraphDETR

GraphDETR introduces a novel deep learning framework that reformulates a fundamental NP-complete graph theory problem (subgraph isomorphism) as a set prediction task, drawing an innovative analogy to object detection (DETR). It offers both exact and approximate matching capabilities with strong empirical results on molecular functional group detection. This methodological innovation has broad applicability across scientific domains (chemistry, biology, network analysis). While RiskNet is a valuable dataset contribution for AI governance, it is primarily a curation/annotation effort with narrower scope; GraphDETR's algorithmic novelty and cross-domain applicability suggest higher long-term scientific impact.

claude-opus-4-6·Jun 9, 2026
