AOP-Wiki EMOD 3.0: Data Model Expansions and Content Evaluation Framework for Using Agentic AI to Improve Integration between AOPs and New Approach Methodologies (NAMs)

Virginia K. Hench, J. Harry Caufield, Sierra A. T. Moxon, Jason M. O'Brien, Stephen W. Edwards

May 20, 2026

arXiv:2605.21645v1

cs.AI (primary), cs.DB

#1453 of 2292 · Artificial Intelligence

Tournament Score: 1378 ± 46 (scale 1050 to 1800)

Win Rate: 60%

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 4.5 · Clarity 5

Abstract

Adverse Outcome Pathways (AOPs) are logic models that causally link biological mechanisms that can be measured in a lab to adverse outcomes relevant to chemical regulatory endpoints. AOPs contextualize new approach methodologies (NAMs), in vitro and in silico methods used as alternatives to animal testing, and the sequential events in an AOP serve as multi-scale models spanning biological scales. The AOP-Wiki serves as the global repository for AOPs. While the AOP-Wiki has played a central role in AOP expansion over the past decade, constraints within the current data model and application infrastructure limit the AOP-Wiki from supporting continued AOP growth and evolution. Yet, the transformative power of agentic AI has re-invigorated AOP-Wiki data modernization efforts at a time when core AOP principles can be harnessed to inform use of AI for aggregating and structuring AOP-relevant information. Seizing upon this momentum, we present AOP-Wiki EMOD 3.0, the third in a series of evidence model prototypes, which concretely demonstrates data model expansions and our vision for how the AOP-Wiki might be transformed to better serve regulatory science and emergent use of AOPs in biomedical and One Health contexts. We aim to lay a foundation to support computationally-generated AOPs and quantitative AOPs (qAOPs) by focussing on solutions for AOP-Wiki internal quality improvement, evidence structuring to enhance AOP FAIRness and AI-readiness, and improved integration between the AOP framework and NAMs to better serve next generation risk assessment.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AOP-Wiki EMOD 3.0

Core Contribution

This paper presents AOP-Wiki EMOD 3.0, the third iteration of an evidence model prototype that proposes data model expansions and a new web application for the AOP-Wiki — the globally recognized repository for Adverse Outcome Pathways (AOPs). The core contributions are: (1) new data classes (Observation, Assay, Evidence, Citation, Experiment Type, Biological Target Family) that move the AOP-Wiki from predominantly free-text entries toward structured, ontology-annotated content; (2) content quality assessment tools including document completion scores and a novel Event Integration Score (EIS); (3) LLM-based approaches for identifying redundant Key Events (KEs); and (4) a CLI application for processing AOP-Wiki XML data. The work is fundamentally an infrastructure/data modeling contribution aimed at making the AOP-Wiki more FAIR-compliant, AI-ready, and better integrated with New Approach Methodologies (NAMs).
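
To make the move from free text to structured records concrete, the following is a minimal Python sketch of how a few of the named classes might relate. Only the class names (Citation, Assay, Observation, Evidence) come from the paper; every field, the relationships between classes, the `annotated_fraction` helper, and the example identifiers are assumptions made for illustration and are not the EMOD 3.0 schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Citation:
    # Hypothetical fields; the paper's actual Citation attributes are not listed here.
    identifier: str  # placeholder for something like a DOI or PubMed ID
    title: str


@dataclass
class Assay:
    name: str
    # Hypothetical ontology annotation stored as a CURIE-style string.
    ontology_term: Optional[str] = None


@dataclass
class Observation:
    description: str
    assay: Assay
    citation: Citation


@dataclass
class Evidence:
    # Assumed here to bundle observations bearing on one Key Event Relationship.
    ker_id: str
    observations: List[Observation] = field(default_factory=list)

    def annotated_fraction(self) -> float:
        """Share of observations whose assay carries an ontology term."""
        if not self.observations:
            return 0.0
        annotated = sum(1 for o in self.observations if o.assay.ontology_term)
        return annotated / len(self.observations)


# Illustrative records with made-up identifiers.
cite = Citation(identifier="example-id-1", title="Example study")
annotated_assay = Assay(name="Example binding assay", ontology_term="EX:0000001")
free_text_assay = Assay(name="Example viability assay")

evidence = Evidence(
    ker_id="KER-example",
    observations=[
        Observation("Receptor binding increased", annotated_assay, cite),
        Observation("Cell viability decreased", free_text_assay, cite),
    ],
)
print(evidence.annotated_fraction())  # 0.5
```

Keeping the ontology term as an optional field is one way to let annotated and free-text entries coexist during a migration, which is the kind of gradual transition the assessment describes.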

Methodological Rigor

The paper is primarily a systems design and prototype description rather than a hypothesis-driven study, which limits the extent to which traditional methodological rigor criteria apply. Several observations:

Strengths in approach: The iterative development through EMOD 1.0→2.0→3.0 shows responsiveness to community feedback (e.g., Methods2AOP stress testing revealing that Study Design entries were too detailed). The use of multiple domain-specific use cases (depression/neural networks, seizures, lung fibrosis) to validate data model design is sound practice. The KER evidence analysis using the CLI app provides concrete metrics (183/2336 KERs with tabulated entries, 52 harmonizable) that ground the discussion in reality.
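
For scale, the KER counts quoted above can be turned into proportions. This is a small arithmetic sketch, assuming the 52 harmonizable KERs are a subset of the 183 with tabulated entries, which the figures suggest but this assessment does not state outright. The variable names and the `pct` helper are illustrative.

```python
# Counts quoted in the assessment.
total_kers = 2336    # all KERs analyzed
tabulated = 183      # KERs with tabulated entries
harmonizable = 52    # KERs judged harmonizable (assumed subset of `tabulated`)


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)


print(pct(tabulated, total_kers))     # 7.8  -> share of KERs with tabulated entries
print(pct(harmonizable, tabulated))   # 28.4 -> share of tabulated KERs that are harmonizable
print(pct(harmonizable, total_kers))  # 2.2  -> share of all KERs that are harmonizable
```

Under that assumption, fewer than one in twelve KERs carries tabulated evidence and only about 2% of all KERs are readily harmonizable, which is consistent with the assessment's point about predominantly free-text content.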

Weaknesses: The LLM-based KE grouping approach lacks formal evaluation — there are no precision/recall metrics, no systematic validation against expert judgments, and no description of which LLM was used or how prompts were constructed. The Event Integration Score, while conceptually useful, appears somewhat ad hoc; the weighting scheme is not formally justified or validated. The paper acknowledges that the Evidence class has not undergone stress testing beyond a small group, and the roll-up principle remains unimplemented and undemonstrated. EMOD 2.0 was "publicly deployed for approximately two years but was never tested by users," which raises concerns about community adoption.
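
The missing precision/recall evaluation could take several forms. One common option for grouping tasks is pairwise precision and recall against an expert-curated grouping, sketched below. This is not a method from the paper: the function names, the KE identifiers, and both groupings are hypothetical and exist only to show the calculation.

```python
from itertools import combinations
from typing import Iterable, Set, Tuple


def co_grouped_pairs(groups: Iterable[Set[str]]) -> Set[Tuple[str, str]]:
    """All unordered pairs of items that share a group."""
    pairs = set()
    for members in groups:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(predicted_groups, expert_groups):
    """Precision and recall of predicted co-groupings against expert co-groupings."""
    predicted = co_grouped_pairs(predicted_groups)
    expert = co_grouped_pairs(expert_groups)
    true_positive = len(predicted & expert)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(expert) if expert else 0.0
    return precision, recall


# Hypothetical groupings of four Key Events.
llm_groups = [{"KE1", "KE2", "KE3"}, {"KE4"}]
expert_groups = [{"KE1", "KE2"}, {"KE3", "KE4"}]

precision, recall = pairwise_precision_recall(llm_groups, expert_groups)
print(precision, recall)
```

In this toy case the predicted grouping proposes three co-grouped pairs, of which one matches the expert grouping, and the expert grouping contains two pairs, so precision is 1/3 and recall is 1/2. Working at the level of pairs avoids having to match predicted groups to expert groups one to one.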

Potential Impact

The potential impact operates at several levels:

For the AOP community: If adopted, EMOD 3.0 could substantially improve AOP-Wiki usability by addressing long-standing pain points — KE redundancy, poor provenance tracking, lack of structured evidence, and barriers to content entry. The structured Observation and Assay classes could meaningfully improve integration between AOPs and NAMs, which is crucial for the ongoing transition away from animal testing in regulatory toxicology.

For regulatory science: Better structured AOPs could accelerate development of Integrated Approaches to Testing and Assessment (IATA) and support next-generation risk assessment. The connection to OECD-endorsed frameworks gives this work policy relevance.

For AI/computational approaches: Making AOP-Wiki content more structured and ontology-annotated is a necessary precondition for computational AOP development and agentic AI approaches. However, the paper describes laying groundwork rather than delivering functional AI-driven AOP generation.

Breadth of impact: The work is narrowly scoped to the AOP community and regulatory toxicology. While it references One Health and biomedical contexts, the actual demonstrated applications remain within traditional AOP use cases.

Timeliness & Relevance

The paper is highly timely on multiple fronts: (1) regulatory pressure to reduce animal testing is intensifying globally; (2) the AOP framework is increasingly used outside its original toxicology context; (3) AI/LLM capabilities create new opportunities for knowledge base management; and (4) the AOP-Wiki's aging infrastructure has been a recognized bottleneck. The reference to agentic AI in the title and abstract is somewhat aspirational — the paper demonstrates LLM-based KE clustering but doesn't implement true agentic AI workflows. A concurrent publication (Song et al., 2026) appears to address that more directly.

Strengths

1. Addresses a real infrastructure gap: The AOP-Wiki's limitations are well-documented and this work provides concrete, implementable solutions rather than abstract recommendations.

2. Community-informed design: Years of engagement with SAAOP, OECD, Methods2AOP, and EHLC workshops ground the design decisions in actual user needs.

3. Practical deliverables: A deployed web application (emod.aopwiki.org) and open-source CLI tool provide tangible outputs beyond the paper itself.

4. FAIR alignment: The systematic approach to increasing ontological annotation and structured content directly serves data interoperability goals.

5. Multiple use cases: The seizure, lung fibrosis, and depression use cases demonstrate breadth of applicability.

Limitations

1. No formal evaluation of proposed metrics: The EIS and completion scores lack validation against expert assessments of AOP quality or utility.

2. Adoption uncertainty: The history of EMOD 2.0 going untested for two years signals significant adoption risk. The paper doesn't address strategies for community migration.

3. Incomplete implementation: Key features (roll-up principle, full Evidence workflow, webform-based content entry) remain aspirational, making it difficult to evaluate their feasibility.

4. Limited AI demonstration: Despite prominent billing of "Agentic AI" in the title, the AI component is limited to LLM-based KE clustering without rigorous evaluation.

5. Heavy on description, light on evaluation: The paper reads more as a technical report/design document than a research contribution with testable claims.

6. Scalability questions: How the system will handle the full AOP-Wiki migration and ongoing community contributions is not addressed in detail.

Overall Assessment

This is a valuable infrastructure contribution to the AOP and regulatory toxicology communities, addressing genuine and well-documented needs. However, as a scientific publication, it is primarily descriptive — presenting a prototype and vision rather than validated results. The impact will ultimately depend on community adoption and whether the EMOD 3.0 data model becomes the foundation for the next AOP-Wiki, which remains uncertain. The paper would benefit substantially from formal evaluation of its proposed quality metrics and AI-based approaches, and from a clearer migration roadmap.

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 4.5 · Clarity 5

Generated May 22, 2026

Comparison History (15)

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to broader cross-domain relevance (regulatory toxicology, biomedical science, One Health, data standards/FAIR, AI-ready knowledge infrastructure) and strong real-world applicability in next-generation risk assessment and reducing animal testing via NAM integration. Its data model expansions and content evaluation framework can enable downstream computational/qAOP development and scalable community adoption. Paper 1 is timely and useful but is narrower (finance spreadsheets/LLM evaluation) and primarily benchmarking-focused, with impact concentrated in applied AI tooling rather than foundational infrastructure affecting multiple scientific and regulatory fields.

vs. A Subjective Logic-based method for runtime confidence updates in safety arguments

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to its broad applicability and timeliness: modernizing the AOP-Wiki data model and governance to support FAIR, AI-ready evidence structuring could influence regulatory toxicology, risk assessment, NAM adoption, and One Health/biomedical domains. Its platform/infrastructure nature enables downstream reuse, standardization, and integration across many stakeholders and methods (including agentic AI and qAOPs). Paper 2 is methodologically focused and valuable for safety assurance, but its impact is narrower (safety cases/runtime assurance) and demonstrated on a single simulated use case.

vs. Investigating Concept Alignment Using Implausible Category Members

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to strong real-world and regulatory applications: modernizing AOP-Wiki infrastructure can directly affect chemical risk assessment, NAM adoption, and reduction of animal testing, with broad relevance across toxicology, regulatory science, biomedicine, and One Health. Its emphasis on FAIR/AI-ready evidence structuring and support for computational/qAOPs suggests durable, community-scale infrastructure impact. Paper 2 is novel and timely for AI safety evaluation, but appears more narrowly methodological and may yield incremental diagnostic insights rather than a widely deployable platform or standards change.

vs. AtelierEval: Agentic Evaluation of Humans & LLMs as Text-to-Image Prompters

gemini-3.1 · 5/22/2026

Paper 1 addresses a critical gap in the rapidly expanding field of Generative AI by evaluating upstream prompting proficiency, an issue relevant to both human-computer interaction and multimodal LLM research. Its introduction of a unified benchmark and an agentic evaluator has broad applicability across AI disciplines. While Paper 2 offers valuable real-world regulatory impact for toxicology and alternatives to animal testing, Paper 1's generalizable methodology and relevance to a larger, highly active scientific community give it a higher potential for widespread scientific impact and citation.

vs. COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space

gpt-5.2 · 5/22/2026

Paper 1 targets a widely used, community-facing regulatory science infrastructure (AOP-Wiki) and proposes data model expansions plus an evaluation framework to enable AI-ready evidence structuring, qAOPs, and tighter integration with NAMs—directly supporting policy-relevant risk assessment and animal-testing replacement. Its potential real-world and cross-field impact spans toxicology, biomedicine, One Health, data stewardship/FAIR standards, and AI-assisted knowledge modeling, making it timely and broadly influential. Paper 2 is methodologically strong and impactful within VRP optimization, but its breadth and downstream societal/regulatory leverage are narrower.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental architectural challenge for safe LLM agent deployment with broad applicability across all AI agent systems. Its formal probabilistic framework for compositional safety guarantees is highly novel and timely given rapid LLM agent proliferation. It identifies concrete open problems that could shape an entire research agenda. Paper 2, while valuable for the AOP/toxicology community, addresses a more domain-specific data infrastructure problem with narrower impact scope. Paper 1's breadth of impact across AI safety, formal methods, and deployment practices gives it substantially higher potential scientific impact.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

gpt-5.2 · 5/22/2026

Paper 2 likely has higher scientific impact due to its broad relevance to core ML questions (generalization vs. memorization), a clear, testable evaluation framework (brittleness testing), and a practical, general approach (verifier-in-the-loop) applicable beyond chess to other constrained domains. It offers strong methodological rigor with reproducible open-source artifacts and quantitative comparisons, and it is timely amid scrutiny of LLM capabilities. Paper 1 is valuable for regulatory toxicology infrastructure, but its impact is more domain-specific and depends on downstream adoption of the proposed data model modernization.

vs. Learning to Solve Compositional Geometry Routing Problems

claude-opus-4.6 · 5/22/2026

Paper 2 addresses a critical real-world problem at the intersection of regulatory science, toxicology, and AI, with direct implications for replacing animal testing and improving chemical safety assessment. Its impact spans multiple fields (regulatory science, biomedicine, One Health, AI) and addresses timely needs for data modernization and AI-readiness in a globally used repository. While Paper 1 makes solid methodological contributions to routing problems with a novel framework, its impact is more narrowly focused on combinatorial optimization. Paper 2's broader societal relevance and cross-disciplinary reach give it higher potential impact.

vs. Learning to Solve Compositional Geometry Routing Problems

gpt-5.2 · 5/22/2026

Paper 1 targets a high-stakes, timely bottleneck in regulatory toxicology: modernizing the AOP-Wiki data/evidence model to enable FAIR, AI-ready integration of AOPs with NAMs, potentially accelerating animal-free risk assessment and broader One Health/biomedical applications. Its impact could be amplified via community adoption as shared infrastructure and standards. Paper 2 proposes a novel ML solver framework for a generalized routing abstraction; while methodologically strong and broadly relevant within optimization/ML, it is one of many algorithmic advances and its real-world uptake is less assured than a widely used regulatory-science platform upgrade.

vs. LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

gemini-3.1 · 5/22/2026

Paper 2 addresses a highly timely and critical issue—security and privacy in multi-agent LLM systems—by proposing a novel defense against sensitive information leakage in latent KV cache communication. Its methodological rigor, featuring adversarial training, and its broad applicability across the rapidly expanding field of generative AI give it a wider potential impact than Paper 1, which is highly specialized in regulatory toxicology and biomedical modeling.

vs. LACO: Adaptive Latent Communication for Collaborative Driving

claude-opus-4.6 · 5/22/2026

LACO presents a novel, concrete technical framework addressing fundamental challenges in collaborative autonomous driving—a rapidly growing field with massive real-world applications. It introduces three specific technical innovations (ILD, CHSA, SSKD) validated through closed-loop experiments, offering immediate practical impact. Paper 1 presents a vision/prototype for AOP-Wiki data modernization that, while valuable for toxicology, is more incremental (third iteration), narrower in audience, and relies on future implementation rather than demonstrated results. Paper 2's contributions to multi-agent AI communication have broader cross-disciplinary relevance.

vs. Personality Engineering with AI Agents: A New Methodology for Negotiation Research

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical real-world problem (alternatives to animal testing and chemical risk assessment) by modernizing a global repository for AI integration. Its impact spans biomedicine, toxicology, and regulatory science, offering a foundational framework for future AI-driven research. Paper 1 is innovative in social sciences but has a narrower scope compared to the broad, systemic, and regulatory implications of Paper 2.

vs. Personality Engineering with AI Agents: A New Methodology for Negotiation Research

gemini-3.1 · 5/22/2026

While Paper 1 offers a novel methodological use of AI in behavioral science, Paper 2 demonstrates significantly higher potential for broad real-world impact. By modernizing the AOP-Wiki to support agentic AI, it directly accelerates the development of alternatives to animal testing (NAMs), advances next-generation risk assessment, and informs regulatory science and toxicology. Its contribution to biological modeling and One Health contexts gives it a broader and more critical scientific footprint compared to negotiation research.

vs. KAPPS: A knowledge-based CPPS Architecture for the Circular Factory

gemini-3.1 · 5/22/2026

Paper 2 demonstrates higher potential scientific impact due to its broad cross-disciplinary applicability spanning toxicology, biomedicine, regulatory science, and AI. By leveraging agentic AI to upgrade Adverse Outcome Pathways, it directly supports the critical transition away from animal testing toward New Approach Methodologies (NAMs). This addresses a high-stakes global challenge in chemical risk assessment and One Health. While Paper 1 offers a valuable architecture for circular manufacturing, Paper 2's integration of AI-ready FAIR data principles to revolutionize regulatory biology suggests a wider and more transformative societal and scientific reach.

vs. Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to stronger methodological innovation (RL fine-tuning in a realistic Excel environment), concrete artifacts (new benchmark dataset, gym environment, scalable data collection pipeline), and clear, quantifiable performance gains on standardized tasks. Its applications (spreadsheet automation) are broad across industries and align with timely interest in LLM agents and tool use, increasing cross-field adoption. Paper 2 is important for regulatory science infrastructure and FAIR data modernization, but appears more domain-specific and systems/standards-focused with less demonstrated empirical advancement, potentially limiting broader, faster uptake.