KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

Ziwei Li, Liujun Zhu, Yuchen Liu, Yichen Zhao, Birk Li, Ruiqi Wu, Junliang Jin, Jianyun Zhang

May 18, 2026

arXiv:2605.17856v1

cs.AI (primary)

#32 of 2292 · Artificial Intelligence

Tournament Score: 1582 ± 45 (gauge range 1050 to 1800)

Win Rate: 91%

Rating: 7.8/10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 8

Abstract

Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: KISS – Knowledge Infrastructure for Scientific Simulation

1. Core Contribution

The paper introduces Knowledge Infrastructure (KI), a structured scaffold that externalizes tacit operational expertise required to run process-based simulation models into three machine-readable layers: validated modelling operators (procedural knowledge), staged domain protocols (evaluative knowledge), and diagnostic recovery mechanisms (diagnostic knowledge). The key insight is that the barrier to using scientific models is not informational but operational — knowing which sequence of decisions, checks, and debugging steps turns a model binary into a valid simulation.

The accompanying Knowledge Dissection Toolkit (KDT) automates the construction of KI packages from model source code, documentation, and examples. The paper demonstrates this at two scales: a deep 3,000-trial benchmark on coupled VIC-Lohmann hydrology, and a breadth test across 119 process-based models spanning 14 Earth-science domains. The central empirical finding — that operational expertise converges structurally across models with different underlying physics (unit-conversion and I/O format errors account for 55% of failures universally; three decision categories appear in every domain) — is arguably the paper's most important scientific contribution.
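As a rough illustration of how the three layers described above might fit together, the following Python sketch pairs each protocol stage with an operator (procedural knowledge) and a plausibility check (evaluative knowledge), and falls back to a list of remedies (diagnostic knowledge) when a check fails. Every name here (`Stage`, `Remedy`, `KnowledgePackage`, `precip_raw`, `precip_mm`) is hypothetical and is not taken from the paper, KDT, or HydroCraft. The unit-conversion remedy is a toy case, chosen only because the assessment cites unit-conversion errors as a common failure.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical shared state passed between stages


@dataclass
class Stage:
    """One protocol stage: run an operator, then apply a plausibility check."""
    name: str
    operator: Callable[[State], State]  # procedural knowledge
    check: Callable[[State], bool]      # evaluative knowledge


@dataclass
class Remedy:
    """Diagnostic knowledge: a symptom test paired with a repair step."""
    symptom: Callable[[State], bool]
    repair: Callable[[State], State]


@dataclass
class KnowledgePackage:
    stages: List[Stage]
    remedies: List[Remedy] = field(default_factory=list)

    def run(self, state: State, max_repairs: int = 3) -> State:
        """Execute stages in order, repairing failed checks when a remedy matches."""
        for stage in self.stages:
            state = stage.operator(state)
            repairs = 0
            while not stage.check(state):
                remedy = next((r for r in self.remedies if r.symptom(state)), None)
                if remedy is None or repairs >= max_repairs:
                    raise RuntimeError(f"stage {stage.name!r} failed its check")
                state = remedy.repair(state)
                repairs += 1
        return state


# Toy example (assumed values): precipitation arrives in metres but is
# treated as millimetres, so the plausibility check fails until the
# unit-conversion remedy is applied.
load = Stage(
    name="load_forcing",
    operator=lambda s: {**s, "precip_mm": s["precip_raw"]},
    check=lambda s: s["precip_mm"] >= 0.1,
)
fix_units = Remedy(
    symptom=lambda s: s["precip_mm"] < 0.1,
    repair=lambda s: {**s, "precip_mm": s["precip_mm"] * 1000.0},
)
package = KnowledgePackage(stages=[load], remedies=[fix_units])
result = package.run({"precip_raw": 0.004})
print(result)
```

The cap on repairs is a design choice in this sketch: it bounds the error looping that the assessment associates with missing diagnostics.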

2. Methodological Rigor

The experimental design is commendably thorough for an AI-for-science paper:

Scale of evaluation: 3,000 independent trials across 10 agents from 5 platforms, 3 basins, with 100 trials per agent-basin combination. The Clopper-Pearson confidence intervals are explicitly reported.

Ablation design: The clean-room ablation (KI artifacts removed) directly links the three knowledge layers to specific failure modes (fabrication ↔ missing operators, physical blindness ↔ missing protocols, error looping ↔ missing diagnostics). This is well-constructed causal evidence.

Multi-level validation: The 119-model cohort uses three validation tiers (hand-built, expert-supervised, fully autonomous), with observation-based validation where possible and runnability verification otherwise. The distinction is transparently reported.

Inter-rater reliability: Cohen's κ = 0.81–0.82 (agent vs. auditors) and Fleiss' κ = 0.86 for decision-point classification on a stratified sample of n = 100.
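Two of the statistics named in the list above, the Clopper-Pearson interval and Cohen's κ, can be computed with the Python standard library alone. The sketch below is illustrative only: the counts (84 successes in 100 trials) and the rater labels are made-up inputs, not figures from the paper's tables, and the bisection solver is just one convenient way to invert the binomial tail, not necessarily how the authors computed their intervals.

```python
from math import comb
from typing import Callable, Sequence, Tuple


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g: Callable[[float], float], iters: int = 60) -> float:
    """Bisection root of a function that increases from negative to positive on [0, 1]."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - alpha / 2
    )
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: alpha / 2 - binom_cdf(k, n, p)
    )
    return lower, upper


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    expected = sum(
        (list(rater_a).count(lab) / n) * (list(rater_b).count(lab) / n)
        for lab in labels
    )
    return (observed - expected) / (1 - expected)


# Illustrative inputs (assumed, not taken from the paper).
ci_low, ci_high = clopper_pearson(k=84, n=100)
print(f"95% Clopper-Pearson interval: [{ci_low:.3f}, {ci_high:.3f}]")

agent = ["procedural", "evaluative", "diagnostic", "procedural"]
auditor = ["procedural", "evaluative", "procedural", "procedural"]
print(f"Cohen's kappa: {cohens_kappa(agent, auditor):.3f}")
```

With 100 trials per cell, an exact interval of this kind is fairly wide, which is worth keeping in mind when comparing agents whose success rates differ by only a few points.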

Weaknesses in rigor: The success criterion (NSE ≥ 0.2) is relatively lenient — this threshold indicates the model explains only 20% of discharge variance. The paper acknowledges runs are uncalibrated, but this makes it harder to assess whether agents are producing *scientifically useful* rather than merely *physically plausible* outputs. The 33 models validated only for "runnability" (not against observations) represent a weaker evidence tier that could inflate the breadth claims. The demonstrations in Fig. 6 are explicitly proof-of-concept and lack systematic evaluation — the Vietnamese farmer and MRV officer scenarios are compelling narratively but unvalidated scientifically.
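To make the threshold concrete, here is a minimal sketch of the metric, assuming "NSE" denotes the standard Nash-Sutcliffe efficiency: one minus the ratio of squared simulation error to the variance of the observations about their mean. The discharge values are invented for illustration, and `NSE_THRESHOLD` simply restates the 0.2 criterion quoted above.

```python
from typing import Sequence


def nse(observed: Sequence[float], simulated: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    if len(observed) != len(simulated) or len(observed) == 0:
        raise ValueError("series must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    residual = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    if spread == 0:
        raise ValueError("observed series has zero variance; NSE is undefined")
    return 1.0 - residual / spread


NSE_THRESHOLD = 0.2  # success criterion quoted in the assessment

# Made-up discharge series (arbitrary units), for illustration only.
observed = [10.0, 14.0, 30.0, 22.0, 12.0]
simulated = [12.0, 13.0, 24.0, 25.0, 15.0]

score = nse(observed, simulated)
print(f"NSE = {score:.3f}, passes threshold: {score >= NSE_THRESHOLD}")
```

A simulation that always returns the observed mean scores exactly 0, so a threshold of 0.2 sits only slightly above that baseline.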

3. Potential Impact

Immediate impact: If KI packages become widely adopted, this could substantially democratize access to process-based models. The paper correctly identifies that communities most vulnerable to climate risk are least able to run these models. The practical demonstrations (carbon credit verification, multi-model ensembles assembled in single sessions vs. years of coordination) point toward genuine workflow acceleration.

Broader scientific impact: The finding that operational expertise is structurally convergent across 14 domains is a meta-scientific insight. It suggests that the reproducibility crisis in computational science may be addressable through systematic knowledge externalization rather than ad hoc documentation efforts. This has implications well beyond Earth science.

Infrastructure impact: HydroCraft (the execution platform) and KDT together constitute a significant community resource. The 119-model library with standardized interfaces could lower barriers to multi-model intercomparison, potentially accelerating projects like AgMIP and ISIMIP by orders of magnitude in setup time.

Limitations on impact: The paper relies entirely on commercial, closed-source LLM agents. The top performers (Claude Sonnet 4.5, Claude Opus 4.5) are from a single provider. This creates fragility: API changes, pricing shifts, or deprecation could undermine deployed workflows. The paper does not address this vendor lock-in concern.

4. Timeliness & Relevance

This paper is exceptionally well-timed. The convergence of capable coding agents (2025-2026 vintage), growing climate urgency, and the persistent reproducibility crisis in computational geoscience creates a clear window for this contribution. The paper explicitly benchmarks against 2026-era agents (GPT-5.x Codex, Gemini 3, Claude Opus 4.5), placing it at the frontier. The reference to coding agents achieving only 54% on adjacent reproduction tasks [17] positions KI as addressing a documented capability gap.

5. Strengths & Limitations

Key strengths:

The three-layer knowledge decomposition (procedural/evaluative/diagnostic) is elegant and empirically validated through ablation.

The convergence analysis across 119 models provides genuine insight into the structure of scientific operational knowledge — this is publishable independently of the agent framework.

The scale of the benchmark (3,000 trials, 10 agents, 5 platforms) is unprecedented for agentic science evaluation.

The paper carefully distinguishes what is demonstrated (agent-operable execution) from what is not claimed (universal scientific certification).

Key limitations:

The NSE ≥ 0.2 success threshold is generous; higher bars would likely show steeper performance drops.

Expert-supervised vs. autonomous KDT is not clearly ablated — how much does expert supervision actually matter for scientific validity?

The demonstrations (Fig. 6) mix inspiring visions with unvalidated claims — the MRV scenario implies policy-relevant accuracy without formal uncertainty quantification.

No comparison with alternative scaffolding approaches (e.g., RAG over documentation, fine-tuned domain models, traditional workflow managers like Pegasus/Snakemake).

Long-term maintenance of 119+ KI packages is hand-waved as a "community task" without addressing governance, versioning, or quality control mechanisms.

Overall Assessment

This is a high-impact systems paper that makes a compelling case for knowledge externalization as the missing layer between AI agents and scientific simulation. The empirical evidence is extensive and the conceptual framework is clean. The convergence finding across 119 models is the paper's deepest contribution. However, the work would benefit from higher success thresholds, systematic comparison with simpler baselines, and more honest engagement with the sustainability challenges of maintaining a living knowledge commons.

Rating: 7.8/10

Significance 8.5 · Rigor 7.5 · Novelty 8 · Clarity 8

Generated May 19, 2026

Comparison History (22)

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.2 · 5/22/2026

Paper 1 likely has higher scientific impact due to a broadly applicable, engineered knowledge infrastructure that measurably improves agentic use of process-based simulators, with strong real-world relevance to climate risk and resource management. Its demonstrated generalization across 119 models and 14 Earth-science domains suggests wide cross-disciplinary utility and infrastructure-like adoption potential. Paper 2 is highly novel and timely for AI+math, but its direct applicability is narrower (specialist formalization workflows) and impact may be constrained by tooling and domain formalization overhead, despite impressive open-problem results.

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.6 · 5/22/2026

Paper 1 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), demonstrating systematic scaling laws, few-shot learning across 35 health tasks, and clinical validation. Its breadth of impact spans healthcare, AI, and consumer wellness with immediate real-world applications. While Paper 2 introduces an innovative knowledge infrastructure for scientific simulation with strong results, Paper 1's massive scale, methodological rigor (clinician evaluation), integration of LLM agents, and potential to transform personalized healthcare give it broader and more transformative impact across multiple fields.

vs. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

claude-opus-4.6 · 5/19/2026

Paper 1 demonstrates broader scientific impact by creating a generalizable infrastructure (KI) applicable across 14 Earth-science domains and 119 process-based models, fundamentally democratizing access to scientific simulation. Its scaffolding approach is domain-agnostic and addresses systemic barriers for climate-vulnerable communities. Paper 2, while impressive in its real-time disease forecasting results matching CDC ensembles, addresses a narrower application domain. Paper 1's breadth of impact across fields, its novel conceptual framework (knowledge infrastructure as a scientific commons), and its potential to transform how entire scientific communities interact with simulation models give it higher long-term impact potential.

vs. Fusion-fission forecasts when AI will shift to undesirable behavior

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a broadly applicable infrastructure (KI) that democratizes access to process-based simulation models across 14 Earth-science domains with rigorous validation (3,000 trials, 119 KIs). Its practical impact on climate adaptation, resource management, and scientific accessibility for underserved communities is substantial. Paper 2 addresses an important AI safety problem with an interesting physics-inspired framework, but its fusion-fission analogy, while creative, faces questions about mechanistic validity in transformer architectures. Paper 1's demonstrated scalability across 117+ models and its potential to transform how Earth science is practiced give it broader and more immediate scientific impact.

vs. Self-supervised Hierarchical Visual Reasoning with World Model

gemini-3.1 · 5/19/2026

Paper 1 offers a broader and more highly translational scientific impact by bridging AI agents with 14 different Earth-science domains. By democratizing access to complex, process-based climate and hydrological simulations, it directly addresses critical real-world challenges like climate risk and resource scarcity. While Paper 2 presents a strong methodological advancement in reinforcement learning and world models, Paper 1's massive interdisciplinary scope, extensive benchmarking across 119 Knowledge Infrastructures, and direct societal relevance give it a significantly higher potential for widespread scientific and real-world impact.

vs. Responsible Agentic AI Requires Explicit Provenance

claude-opus-4.6 · 5/19/2026

Paper 1 demonstrates higher scientific impact through its concrete, large-scale empirical validation (3,000 trials, 119 knowledge infrastructures across 14 Earth-science domains) and addresses a critical practical barrier—democratizing access to process-based simulation models. It delivers a generalizable toolkit (KDT) with measurable performance gains (84% vs 40% success). Paper 2, while addressing an important problem (provenance for responsible AI), is primarily a position/framework paper with only preliminary experiments. Paper 1's combination of methodological rigor, immediate practical utility, and breadth across Earth sciences gives it stronger near-term and long-term impact potential.

vs. Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

claude-opus-4.6 · 5/19/2026

Paper 2 demonstrates higher scientific impact through its concrete, empirically validated system (3,000 trials, 119 knowledge infrastructures across 14 Earth-science domains) with clear real-world applications in democratizing climate and Earth science modeling for underserved communities. It combines methodological rigor with broad interdisciplinary impact and immediate practical utility. Paper 1, while intellectually rigorous in proposing a safety architecture for LLM agents, is a position paper that sketches theoretical frameworks without empirical validation, and its impact is more narrowly focused on the AI safety community.

vs. Distribution-Aware Algorithm Design with LLM Agents

claude-opus-4.6 · 5/19/2026

Paper 2 addresses a broader societal challenge (democratizing access to Earth science simulation models for climate-vulnerable communities) with demonstrated generalization across 119 models and 14 domains. Its knowledge infrastructure framework is domain-general and tackles both accessibility and integration barriers. Paper 1 is technically strong with rigorous theory and impressive empirical results on combinatorial optimization, but its impact is more specialized. Paper 2's breadth across Earth sciences, equity implications, and potential to create a 'living scientific commons' give it wider cross-disciplinary impact and real-world relevance.

vs. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

claude-opus-4.6 · 5/19/2026

Paper 2 (KISS) has higher potential scientific impact due to its broader real-world applicability and interdisciplinary reach. It addresses a critical democratization challenge—making complex Earth science simulation models accessible to communities most affected by climate risk. The empirical validation across 119 knowledge infrastructures spanning 14 Earth-science domains demonstrates remarkable generalizability. While Paper 1 offers elegant theoretical contributions (rate-distortion framework for agent memory), its impact is more narrowly scoped to the AI/agent memory community. Paper 2's potential to transform how scientific simulation knowledge is shared and operationalized across diverse communities gives it substantially wider societal and scientific impact.

vs. NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

claude-opus-4.6 · 5/19/2026

Paper 1 demonstrates broader scientific impact through its validated framework (KISS/KI) enabling democratized access to 119 process-based simulation models across 14 Earth-science domains, with rigorous benchmarking (3,000 trials). It addresses a critical real-world need—bridging climate risk knowledge gaps for underserved communities—while showing generalizable principles about extractable operational expertise. Paper 2 offers an interesting neural-network-inspired MAS framework, but its impact is more narrowly scoped to LLM multi-agent system design. Paper 1's cross-disciplinary breadth, practical toolkit (KDT), and direct societal relevance give it higher potential impact.

vs. Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

gpt-5.2 · 5/19/2026

Paper 2 has higher potential impact due to broader cross-field applicability (agentic scaffolding for 14 Earth-science domains and 117 models), strong real-world relevance (lowering barriers to climate-risk and resource-scarcity modeling), and timeliness (agentic AI + scientific simulation infrastructure). It proposes a reusable knowledge infrastructure and an automated toolkit, suggesting scalable adoption beyond a single task. Paper 1 is a solid, novel medical-imaging method with clear utility, but its impact is narrower (brain tumor segmentation) and primarily incremental within a mature subfield.

vs. The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

gpt-5.2 · 5/19/2026

Paper 1 is more novel and timely: it identifies a new multi-agent LLM vulnerability (semantic hijacking) and a counterintuitive “capability paradox,” backed by very large-scale experiments and mediation analysis, and proposes a concrete, generalizable defense (heterogeneous ensemble verification) with dramatic ASR reduction. Its implications extend across AI safety, security, and deployment of agentic systems in many domains. Paper 2 is impactful for Earth-science accessibility, but appears more domain-scoped and infrastructure-heavy, with less clear methodological detail on validation beyond benchmarks.

vs. Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a fundamental challenge in making process-based Earth science simulations accessible, demonstrating impact across 119 models in 14 domains with rigorous 3,000-trial benchmarks. Its Knowledge Infrastructure concept is highly novel—bridging AI agents with decades of scientific modeling expertise—with significant real-world applications for climate-vulnerable communities. Paper 2, while interesting, presents an incremental improvement to knowledge graph enrichment using existing tools (GPT-4o, Neo4j, Tavily) with relatively modest evaluation scope. Paper 1's breadth of impact, methodological rigor, and societal relevance substantially exceed Paper 2's contributions.

vs. GIM: Evaluating models via tasks that integrate multiple cognitive domains

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact: it proposes a reusable “knowledge infrastructure” that measurably improves agentic operation of process-based Earth system simulators, with strong real-world relevance (climate, water, hazards) and clear application pathways. The scale of evaluation (3,000 trials) plus broad generalization claims (119 models across 14 domains) suggests wide cross-disciplinary utility and timeliness for agentic scientific computing. Paper 1 is novel and methodologically rigorous for LLM evaluation, but its impact is more confined to benchmarking/AI assessment rather than enabling new scientific workflows.

vs. Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

gpt-5.2 · 5/19/2026

Paper 2 has higher potential scientific impact due to its broader cross-domain scope (119 process-based models across 14 Earth-science domains) and strong timeliness: enabling agentic use of complex climate/hydrology simulations is highly relevant. The proposed “knowledge infrastructure” and toolkit suggest a reusable methodological contribution that could generalize to other scientific modeling fields, lowering access and integration barriers. While Paper 1 is rigorous and valuable for industrial-scale retrieval, its novelty and impact are more domain-specific (music search) and primarily engineering-focused rather than broadly enabling across scientific disciplines.

vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

gpt-5.2 · 5/19/2026

Paper 1 offers a broadly applicable, infrastructure-level contribution: a validated scaffold (KI) plus an automated toolkit (KDT) that generalizes to 117 models across 14 Earth-science domains, directly lowering barriers to using decades of process-based simulation knowledge. Its real-world relevance to climate/risk decision support is immediate and timely, and the large-scale cross-domain evaluation suggests strong methodological rigor and potential to reshape how simulation expertise is shared and maintained. Paper 2 is a solid ML training innovation with likely impact within search-augmented RL, but its applications are narrower and incremental relative to Paper 1’s cross-disciplinary, societally critical scope.

vs. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

claude-opus-4.6 · 5/19/2026

Paper 1 demonstrates higher potential scientific impact due to its broad applicability across 14 Earth-science domains with 119 knowledge infrastructures, its novel framework for democratizing complex process-based simulation models, and its rigorous 3,000-trial benchmark. It addresses a critical real-world need (climate risk accessibility) with a generalizable, scalable solution. Paper 2, while innovative in applying neuroimaging to AI hallucination detection, has a smaller sample size (27 participants), narrower scope, and more incremental contribution to understanding human-AI interaction without offering a transformative solution.

vs. Revealing Interpretable Failure Modes of VLMs

gpt-5.2 · 5/19/2026

Paper 1 has higher potential impact due to its novelty in formalizing “knowledge infrastructure” that makes process-based Earth-system simulators agent-actionable, plus strong evidence of scalability (3,000-trial benchmark and 117 models across 14 domains). Its real-world relevance to climate-risk decision support and lowering barriers for non-experts is immediate, and the approach could generalize to other scientific simulation communities beyond Earth science. Paper 2 is timely and valuable for VLM safety, but its impact is more incremental within ML evaluation/safety and narrower in cross-disciplinary scientific infrastructure implications.

vs. GraphMind: From Operational Traces to Self-Evolving Workflow Automation

gemini-3.1 · 5/19/2026

Paper 2 addresses a critical global challenge (climate risk and resource scarcity) by democratizing complex Earth science simulations. Its Knowledge Infrastructure spans 14 scientific domains, offering broader cross-disciplinary impact and significant societal benefits compared to Paper 1's focus on enterprise IT workflow automation.

vs. Marrying Generative Model of Healthcare Events with Digital Twin of Social Determinants of Health for Disease Reasoning

claude-opus-4.6 · 5/19/2026

Paper 2 (KISS) has broader scientific impact potential. It addresses a fundamental infrastructure problem across 14 Earth-science domains with 119 knowledge infrastructures, democratizing access to process-based simulation models. The framework is generalizable, practically validated across 3,000 trials, and directly addresses climate equity by enabling non-specialist communities to use complex models. Paper 1, while technically sophisticated in combining generative models with SDoH for disease reasoning, is more narrowly focused on a specific biomedical application using UK Biobank data. Paper 2's cross-disciplinary scaffolding approach and its potential as a 'living scientific commons' suggest wider transformative impact.