Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou

May 28, 2026

arXiv:2605.29966v1

cs.AI (primary)

#535 of 2821 · Artificial Intelligence

Tournament Score

1478 ± 50 (scale 1050 to 1800)

Win Rate: 79%

Rating: 6.5 / 10

Significance: 6.5

Rigor: 6

Novelty: 5.5

Clarity: 7.5

Abstract

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Compass

1. Core Contribution

Compass addresses a genuine and well-defined problem: the fragmentation of marine lead (Pb) measurement data across decades of academic literature, creating "data silos" that impede global-scale oceanographic analysis. The paper's core novelty lies in its expert-guided adaptation paradigm—a Knowledge Tree co-designed with marine scientists that encodes domain knowledge (background knowledge, logical constraints, operational guidelines, and validation criteria) into a structured hierarchy that guides LLM reasoning without requiring fine-tuning. This is operationalized through a three-phase pipeline (collection, extraction, aggregation) with five specialized components.
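The structured hierarchy described above can be pictured as a tree whose nodes each carry the four knowledge dimensions. The following is only a minimal sketch under that reading: the class name `KnowledgeNode`, the helpers `render_prompt` and `walk`, and all example step names and rules are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class KnowledgeNode:
    """One verifiable step, annotated with the four knowledge dimensions."""
    name: str
    background_knowledge: List[str] = field(default_factory=list)    # BK
    logical_constraints: List[str] = field(default_factory=list)     # LC
    operational_guidelines: List[str] = field(default_factory=list)  # OG
    validation_criteria: List[str] = field(default_factory=list)     # VC
    children: List["KnowledgeNode"] = field(default_factory=list)

    def render_prompt(self) -> str:
        """Flatten this node's knowledge into a plain-text prompt fragment."""
        sections = [
            ("Background", self.background_knowledge),
            ("Constraints", self.logical_constraints),
            ("Guidelines", self.operational_guidelines),
            ("Validation", self.validation_criteria),
        ]
        lines = [f"Step: {self.name}"]
        for title, items in sections:
            for item in items:
                lines.append(f"- {title}: {item}")
        return "\n".join(lines)


def walk(node: KnowledgeNode) -> List[str]:
    """Depth-first list of step names: the order an agent would visit them."""
    order = [node.name]
    for child in node.children:
        order.extend(walk(child))
    return order


# Hypothetical example content, written for illustration only.
tree = KnowledgeNode(
    name="extract_pb_records",
    children=[
        KnowledgeNode(
            name="identify_sample_type",
            background_knowledge=["Seawater, rainwater and sediment Pb are reported differently."],
            validation_criteria=["Sample type must be seawater."],
        ),
        KnowledgeNode(
            name="normalize_units",
            logical_constraints=["Concentration must be non-negative."],
            operational_guidelines=["Convert all concentrations to a single unit."],
        ),
    ],
)

print(walk(tree))
print(tree.children[0].render_prompt())
```

Keeping the knowledge on the nodes rather than in one monolithic prompt is what would let each step be checked against its own validation criteria before the agent moves on.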

The concrete output is the largest integrated marine Pb database to date: 35,563 records including 3,751 newly extracted from over 230,000 open-access papers, representing an 86% increase over the GEOTRACES baseline. This is a tangible scientific artifact with immediate utility.

2. Methodological Rigor

The methodology is generally sound but has notable caveats:

Strengths in evaluation design: The authors construct a benchmark with 10 paper categories and 337 tables (63 containing target data, 1,397 data points), enabling quantitative comparison. They compare against both general-purpose LLMs (GPT-4o, Gemini-2.5-pro, open-source models) and domain-specific fine-tuned models (K2, OceanGPT). The ablation study demonstrates that both tree-structured logic and knowledge nodes contribute meaningfully.

Validation is multi-layered but limited in scale: Expert manual validation covers 22 of 110 identified papers (20%), yielding 92% accuracy (±1.7% CI). While this is reasonable for a deployment study, the validation was conducted by only one marine scientist with a second providing consistency checks—a thin validation layer for a claim about scientific rigor.
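To put the reported ±1.7% interval in context, the sketch below relates an observed accuracy to the half-width of a confidence interval. It assumes a 95% normal-approximation (Wald) interval over independently checked records; the review does not state how the interval was computed or how many records were checked, so both function names and the method itself are assumptions.

```python
import math


def wald_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must lie in [0, 1]")
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


def sample_size_for_half_width(p_hat: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose Wald half-width does not exceed the target."""
    if half_width <= 0.0:
        raise ValueError("half_width must be positive")
    return math.ceil(z * z * p_hat * (1.0 - p_hat) / half_width ** 2)


# Values taken from the review: 92% accuracy, +/-1.7% interval.
n_needed = sample_size_for_half_width(0.92, 0.017)
print(n_needed, wald_half_width(0.92, n_needed))
```

Under these assumptions, an interval that narrow implies a checked sample on the order of a thousand records, which is a statement about record count and not about the number of independent reviewers.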

Metric interpretation requires care: The end-to-end F1 scores of 0.465 (single-pass) and 0.619 (with rollback) are modest in absolute terms, though they exceed all baselines. The paper explains the gap between the 0.619 benchmark F1 and the 92% deployment accuracy, but the explanation is not fully satisfying: deployment conditions appear easier than the benchmark, which may indicate that the benchmark is not fully representative.
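One reason a benchmark F1 and a deployment accuracy can diverge is that they measure different things. The sketch below computes precision, recall, and F1 from counts. The counts are hypothetical, and reading the deployment figure as the share of extracted records judged correct (a precision-like quantity) is an assumption, since the review does not define it.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 (their harmonic mean) from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 92 correct extractions, 8 wrong ones, 100 missed records.
precision, recall, f1 = precision_recall_f1(tp=92, fp=8, fn=100)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

With these illustrative counts, 92% of the extracted records are correct, yet F1 stays well below that because the missed records lower recall. A check that only inspects extracted records would not see those misses.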

Reproducibility: Code and prompts are publicly available, which is commendable. However, the Knowledge Tree construction process, while documented, involves subjective expert judgment that may be difficult to replicate exactly.

3. Potential Impact

Domain-specific impact: The marine Pb database itself is a meaningful contribution to marine geochemistry. Expanded coverage in the East China Sea, Southern Ocean, and Arabian Sea addresses real data gaps. The interactive visualization platform (1,590+ visits) suggests community uptake.

Methodological generalizability: The expert-guided adaptation paradigm—encoding domain logic as a Knowledge Tree rather than fine-tuning—is potentially transferable to other scientific domains with similar data integration challenges (e.g., other trace elements, paleoclimate proxies, biogeochemical datasets). However, the paper provides no empirical evidence of such transfer, and the Knowledge Tree construction remains a manual process requiring domain expertise.

Broader AI for Science implications: The paper contributes to the growing body of work on making LLMs reliable for scientific applications. The finding that domain-specific fine-tuned models (K2, OceanGPT) actually perform worse than guided general-purpose models is an important practical insight, suggesting that preserving instruction-following capabilities may matter more than domain vocabulary for structured extraction tasks.

4. Timeliness & Relevance

The paper is well-timed, sitting at the intersection of two active trends: (1) the explosion of LLM agent frameworks, and (2) the growing need for automated scientific data integration. The specific application to GEOTRACES-adjacent data is timely given ongoing international efforts to understand ocean trace element cycling. The choice of KDD '26 as venue is appropriate given the data mining and integration focus.

5. Strengths & Limitations

Key Strengths:

Produces a concrete, usable scientific dataset rather than merely demonstrating a method

The Knowledge Tree design with four knowledge dimensions (BK, LC, OG, VC) is well-structured and interpretable

Comparison against both general-purpose and domain-specific LLMs is thorough

Open-source code, data platform, and complete paper list enhance reproducibility

Practical efficiency: ~52 GPU hours, compared with manual curation, represents an orders-of-magnitude improvement

The one-time Knowledge Tree construction cost (6-7 hours) is remarkably low

Notable Limitations:

Figure extraction excluded: The paper acknowledges but does not address data in figures, which may contain significant additional records

PDF parsing dependency: Upstream parsing errors (MinerU) propagate through the pipeline, and 10% of errors in the manual validation stem from this

Semantic confusion errors (56% of errors): Misclassification of sample types (rainwater vs. seawater, sediment vs. dissolved) represents a fundamental limitation of the current approach—these are precisely the kinds of errors domain experts would not make

Limited generalization evidence: Despite claims of transferability, no experiments demonstrate the framework on other trace elements or scientific domains

Benchmark scale: 337 tables with 63 positives and 1,397 data points is relatively small for a benchmark

Single-domain expert validation: Relying on essentially one expert reviewer introduces potential bias

Additional Observations:

The paper's framing of expert-guided adaptation as an alternative to fine-tuning and RAG is somewhat overstated—the Knowledge Tree is essentially sophisticated prompt engineering with structured task decomposition. While effective, calling it a new "paradigm" may overclaim. The approach also doesn't address how the Knowledge Tree would need updating as domain conventions evolve.

The 0.06% rollback rate across 230,000 papers suggests either remarkably clean extraction or potentially insufficient validation sensitivity—this deserves more investigation.

Summary

Compass makes a solid applied contribution at the intersection of LLM agents and scientific data integration, producing a valuable domain-specific database. The methodology is practical and well-evaluated within its scope, though the generalizability claims exceed the evidence presented. The work is most impactful as a demonstration that structured expert guidance can make general-purpose LLMs reliable enough for scientific data extraction at scale.

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 5.5 · Clarity 7.5

Generated May 29, 2026

Comparison History (14)

vs. Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

claude-opus-4.6 · 5/29/2026

Paper 1 demonstrates higher scientific impact through several factors: (1) it produces a tangible, large-scale scientific resource—the largest integrated marine Pb database—with immediate utility for oceanography and environmental science; (2) it introduces a generalizable expert-guided LLM framework applicable across geosciences; (3) it bridges AI and domain science with rigorous validation (92% expert-verified accuracy); (4) it has broad interdisciplinary impact spanning NLP, marine science, and environmental monitoring. Paper 2, while technically sound in improving LLM safety robustness via zeroth-order optimization, addresses a narrower problem within AI safety with less cross-disciplinary reach.

vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

gemini-3.1 · 5/29/2026

Paper 2 offers high potential impact by directly enabling new discoveries in Earth sciences. By bridging AI and marine science to create the largest integrated marine lead database, it resolves a critical data scarcity issue. Its expert-guided LLM framework demonstrates a scalable, highly accurate methodology for scientific data extraction that can be replicated across other high-stakes domains, yielding immediate and tangible real-world scientific benefits compared to the domain-specific theoretical AI insights of Paper 1.

vs. Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact because it delivers an immediate, field-defining real-world artifact: the largest integrated marine Pb database plus a public visualization platform, enabling new oceanographic and pollution studies. Its expert-guided, verifiable extraction workflow is methodologically grounded with large-scale deployment (230k papers) and manual validation (92% accuracy), and the approach is transferable to other high-stakes scientific domains. Paper 1 is novel for LLM-agent memory optimization, but its impact is more methodological within AI and depends on broader adoption.

vs. Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

gemini-3.1 · 5/29/2026

Paper 2 demonstrates a profound interdisciplinary impact by bridging AI and geoscience to solve a massive data-silo problem. By producing the largest integrated marine lead database, it delivers an immediate, tangible resource for oceanographic and climate research. This concrete scientific artifact and the scalable 'AI for Science' methodology offer broader real-world applications and scientific value than the incremental LLM fine-tuning optimization proposed in Paper 1.

vs. ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

claude-opus-4.6 · 5/29/2026

Paper 2 (Compass) has higher potential scientific impact because it addresses a concrete, high-value scientific problem—creating the largest integrated marine Pb database by extracting data from 230,000+ papers—with immediate real-world applications in oceanography and environmental science. It produces a tangible, reusable scientific resource (database + visualization platform) and demonstrates a generalizable expert-guided LLM framework applicable across geosciences. Paper 1 (ConMoE) offers incremental improvements in MoE compression, a narrower ML engineering problem with less broad scientific impact and limited novelty beyond existing pruning/merging approaches.

vs. LaneRoPE: Positional Encoding for Collaborative Parallel Reasoning and Generation

claude-opus-4.6 · 5/29/2026

LaneRoPE introduces a fundamentally novel architectural modification (inter-sequence attention and RoPE extension) that enables collaborative parallel reasoning in LLMs, addressing a core limitation of test-time scaling methods. Its broad applicability across LLM inference pipelines, minimal overhead, and potential to improve reasoning across many domains gives it wider impact. Paper 2, while valuable for marine geoscience, is a domain-specific application of existing LLM agent techniques to a niche data extraction problem with more limited generalizability beyond its specific scientific domain.

vs. Geometry of Human Perceptual Domains Emerges Transiently in LLM Representations

gemini-3.1 · 5/29/2026

Paper 1 delivers immediate, tangible scientific value by creating the largest marine lead database to date, directly addressing a critical bottleneck in geosciences. Its expert-guided LLM framework offers a scalable, rigorously validated methodology applicable across various scientific domains. While Paper 2 provides valuable theoretical insights into LLM interpretability, Paper 1 demonstrates broader interdisciplinary impact, practical innovation, and significant real-world applicability in environmental tracking and global oceanography.

vs. When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental challenge in AI—belief management and reasoning over long contexts—which has broad implications for the reliability and capability of LLMs across virtually all domains. While Paper 1 presents a highly valuable real-world application for geosciences, Paper 2's foundational contribution to model reasoning, supported by rigorous benchmarking and representation-level interventions, offers a significantly wider breadth of potential scientific impact.

vs. When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

gpt-5.2 · 5/29/2026

Paper 2 has higher potential impact due to its immediate real-world deliverable (largest integrated marine Pb database from 230k papers), strong methodological validation (multi-layer checks, 92% expert-verified accuracy), and broad downstream utility for oceanography, pollution studies, and climate-related circulation research. It is timely (LLM agents for scientific extraction) and provides open infrastructure (visualization platform) that can be reused across domains. Paper 1 is novel/theoretically important for multi-model self-consuming dynamics and alignment, but likely more specialized and with less direct near-term empirical or cross-disciplinary uptake.

vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

gemini-3.1 · 5/29/2026

Paper 1 exposes a fundamental flaw in evaluating LLM search agents—showing they rely on intrinsic memory rather than genuine retrieval—and introduces a dynamic benchmark to solve this. This foundational methodological contribution impacts the entire rapidly growing field of AI agent development. While Paper 2 offers a valuable domain-specific application of LLMs to marine geoscience, Paper 1's insights into AI behavior and benchmarking will likely have a broader, more pervasive impact across all disciplines building or evaluating AI search tools.

vs. Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics

claude-opus-4.6 · 5/29/2026

Paper 1 demonstrates higher scientific impact through several factors: (1) it produces a tangible, largest-ever integrated marine Pb database from 230,000+ papers with 3,751 new records; (2) its expert-guided LLM framework is broadly applicable to other scientific data extraction domains beyond geosciences; (3) it addresses a real, widely-recognized problem (data silos in scientific literature) with validated results (92% expert-verified accuracy); (4) it releases an open visualization platform enabling future discoveries. Paper 2 addresses a narrower industrial scheduling problem with evaluation limited to simulation, offering less breadth of impact and fewer immediate real-world applications.

vs. PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

claude-opus-4.6 · 5/29/2026

Paper 1 (Compass) demonstrates higher scientific impact through its novel expert-guided LLM framework for scientific data extraction, producing a tangible real-world contribution (the largest marine Pb database) with cross-disciplinary relevance spanning AI, oceanography, and environmental science. Its methodology for bridging LLMs and domain expertise is broadly generalizable to many scientific fields facing similar data silo problems. Paper 2 (PassNet) makes a solid contribution to compiler optimization but addresses a narrower, more incremental problem in systems/ML infrastructure with less breadth of impact.

vs. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

claude-opus-4.6 · 5/29/2026

Paper 2 demonstrates higher scientific impact through several factors: (1) it produces a tangible, lasting scientific resource—the largest integrated marine Pb database—that directly enables future research in oceanography and environmental science; (2) it addresses a broadly applicable problem (extracting structured data from unstructured scientific literature) relevant across many scientific domains; (3) the 92% expert-verified accuracy on 230,000+ papers demonstrates real-world scalability; (4) it has immediate practical applications for ocean circulation and pollution studies. Paper 1, while technically innovative in GPU kernel optimization, targets a narrower computational engineering audience with less cross-disciplinary impact.

vs. mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

claude-opus-4.6 · 5/29/2026

Paper 1 (Compass) demonstrates significantly higher scientific impact through multiple dimensions: it introduces a novel expert-guided LLM agent framework validated on a real large-scale scientific problem, produces a tangible and substantial scientific output (the largest integrated marine Pb database with 3,751 new records), undergoes rigorous multi-layered validation achieving 92% expert-verified accuracy, and addresses a fundamental challenge in geosciences data integration. Paper 2 describes a useful but relatively incremental software tool (an MCP server wrapper for knowledge graph querying) with a brief abstract suggesting limited methodological depth and no demonstrated scientific results or validation.