AI Scientists Are Only as Good as Their Evidence: A Stratified Ablation of Proprietary Data and Reasoning Skills in Drug-Asset Valuation

Yinan Wang

Jun 8, 2026 · arXiv:2606.09556v1

cs.AI

#3017 of 3489 · Artificial Intelligence

Tournament Score: 1284 ± 43

Win Rate: 35%

Rating: 4.5 / 10

Significance 5.5 · Rigor 4.5 · Novelty 4 · Clarity 7

Abstract

AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality × gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents a controlled three-arm ablation study examining whether the primary bottleneck for AI scientist agents in drug-asset valuation is the reasoning scaffold (prompts, playbooks, verification layers) or the evidence substrate (proprietary curated databases). The central claim is that proprietary data sets an upper bound on decision quality that no amount of reasoning improvement can overcome. The authors introduce a "completeness-aware decision utility" metric (informed-DQ = decision quality × gold coverage) and demonstrate that even a hypothetical perfect report from the non-proprietary stack would score below the observed proprietary-data-enabled system.
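The metric and its ceiling argument can be sketched in a few lines. This is a minimal illustration, assuming decision quality is on a 0-10 scale and coverage on a 0-1 scale; the function names are hypothetical and not taken from the paper. The inputs are the aggregate values quoted in the abstract. Note that the aggregate products (about 1.75 and 2.64) and the ceiling (3.80) do not exactly match the reported 1.76, 2.57 and 3.83, presumably because the paper aggregates per cell or uses unrounded coverage; that explanation is an assumption.

```python
def informed_dq(decision_quality: float, gold_coverage: float) -> float:
    """Completeness-aware utility: decision quality (0-10) scaled by gold coverage (0-1)."""
    if not 0.0 <= decision_quality <= 10.0:
        raise ValueError("decision_quality must lie in [0, 10]")
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage must lie in [0, 1]")
    return decision_quality * gold_coverage


def coverage_ceiling(gold_coverage: float, max_dq: float = 10.0) -> float:
    """Best informed-DQ reachable at a given coverage, i.e. a perfect report."""
    return informed_dq(max_dq, gold_coverage)


# Aggregate (decision quality, gold coverage) pairs quoted in the abstract.
arms = {"A": (7.01, 0.25), "B": (6.96, 0.38)}

for name, (dq, cov) in arms.items():
    print(f"Arm {name}: informed-DQ = {informed_dq(dq, cov):.2f}, "
          f"ceiling = {coverage_ceiling(cov):.2f}")
```

Because the ceiling depends only on coverage, any improvement to the reasoning scaffold that leaves coverage unchanged cannot lift an arm above it.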

The core insight—that knowledge-intensive AI agents are fundamentally evidence-limited rather than reasoning-limited—is intuitive but has not been rigorously demonstrated in this domain through controlled ablation. The paper operationalizes this clearly: Arm A (web-only LLM), Arm B (+public tools, playbooks, red-team), Arm C (+proprietary Noah AI corpus), all sharing the same backbone model (Claude Opus 4.8).

Methodological Rigor

Strengths of the experimental design:

The three-arm structure with strict environment control (empty data directory for B, isolated run directories) is clean and interpretable.

The "capability-superset accounting" is a thoughtful methodological choice—crediting higher arms with lower arms' discoveries makes the comparison deliberately generous to the non-proprietary stack, strengthening the conclusion.

The stratified benchmark with five decision archetypes (crowded me-too, white-space opportunity, false-positive trap, deal-sensitive, biological void) is well-motivated and reveals stratum-specific failure modes.

Multiple robustness views for decision utility (product, ceiling, min, geometric) prevent cherry-picking.
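The capability-superset accounting described above can be sketched as a cumulative set union. This is an illustration under assumptions: the competitor identifiers and per-arm findings are invented, and the paper's actual matching of findings to gold entries is not described in this assessment.

```python
def superset_coverage(gold, found_by_arm, arm_order):
    """Coverage per arm, crediting each arm with everything lower arms found.

    gold:         set of gold-record item identifiers
    found_by_arm: dict mapping arm name -> set of items that arm surfaced
    arm_order:    arms listed from least to most capable
    """
    if not gold:
        raise ValueError("gold set must be non-empty")
    credited = set()
    coverage = {}
    for arm in arm_order:
        # Only items in the gold record count; earlier credit is carried forward.
        credited |= found_by_arm[arm] & gold
        coverage[arm] = len(credited) / len(gold)
    return coverage


# Hypothetical data: eight gold competitors, three arms.
gold = {f"c{i}" for i in range(1, 9)}
found = {
    "A": {"c1", "c2"},
    "B": {"c2", "c3", "not_in_gold"},
    "C": {"c4", "c5", "c6", "c7"},
}

cov = superset_coverage(gold, found, ["A", "B", "C"])
print(cov)
```

Under this scheme coverage can never fall from one arm to the next, so any remaining gap between B and C cannot be blamed on B accidentally missing something A found.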

Significant weaknesses:

Sample size: 13 assets with 28-30 scored cells per condition is quite small. Some strata have as few as 1-3 cells with scored outputs (S4: A/B/C = 3/1/2), making stratum-level conclusions fragile.

Gold-set circularity: The gold answer keys are derived from the same proprietary data family available to Arm C. The authors acknowledge this but understate its implications—C is essentially being tested on its ability to retrieve from itself. This is appropriate for the narrow claim (can the agent access the curated record?) but limits generalizability.

LLM-as-judge limitations: Despite hard-blinding between B and C, Arm A produces structurally different outputs (free-form vs. scorecard), making it partially identifiable. The inter-judge std of ~0.5 on a 0-10 scale seems reasonable but isn't formally analyzed for statistical significance.

Single model backbone: All experiments use Claude Opus 4.8, limiting generalizability to other models that might have different web retrieval capabilities.

No cost analysis: The paper acknowledges incomplete cost telemetry, but for a production system comparison, cost-effectiveness is crucial.
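One way to address the small-sample concern is a percentile bootstrap on the difference in mean scores between two arms. The sketch below uses only the standard library; the score lists are invented for illustration and are not the paper's data, and it treats the two arms as independent samples, which is an assumption (a paired design over shared assets would be tighter where cells are complete).

```python
import random
import statistics


def bootstrap_mean_diff_ci(x, y, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(x) - mean(y), resampling each arm independently."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        rx = rng.choices(x, k=len(x))  # resample with replacement
        ry = rng.choices(y, k=len(y))
        diffs.append(statistics.fmean(rx) - statistics.fmean(ry))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return diffs[lo_idx], diffs[hi_idx]


# Hypothetical per-cell decision-quality scores for two arms.
scores_a = [7.2, 6.8, 7.5, 6.4, 7.1, 6.9, 7.3]
scores_b = [6.9, 7.0, 6.6, 7.4, 6.8, 7.1, 6.7]

low, high = bootstrap_mean_diff_ci(scores_a, scores_b)
print(f"95% CI for mean(A) - mean(B): [{low:.2f}, {high:.2f}]")
```

An interval that straddles zero would indicate that a gap of the size reported between A and B (7.01 vs. 6.96) is not distinguishable from noise at this sample size.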

Potential Impact

Within pharmaceutical AI: The paper provides useful practical guidance—"skills on top of the right data"—that could influence how pharma companies invest in AI agent infrastructure. The finding that reasoning scaffolds fix calibration errors (the S2 white-space finding is compelling) while data access fixes coverage is actionable.

For AI agent design broadly: The evidence-substrate bottleneck thesis generalizes beyond pharma to any knowledge-intensive domain with significant private/curated information (legal, financial, intelligence analysis). This challenges the current focus on reasoning scaffolds (chain-of-thought, tool use, etc.) as the primary lever for improving AI agents.

Commercial implications: This is fundamentally a paper that argues for the value of proprietary data products (specifically Noah AI's product). While the scientific question is legitimate, the commercial interest is transparent and should be weighed.

Timeliness & Relevance

The paper is timely. The AI agent community is rapidly building increasingly sophisticated reasoning scaffolds, and there is relatively little systematic work examining whether the evidence base rather than the reasoning procedure is the binding constraint. As pharma companies deploy AI for portfolio decisions, understanding where to invest (better prompts vs. better data) has immediate practical value. The paper also connects to broader debates about RAG systems and knowledge-intensive tasks.

Strengths & Limitations

Key strengths:

1. Clean experimental design with a clear hypothesis and falsifiable predictions per stratum

2. The completeness-aware decision utility metric is a genuinely useful conceptual contribution—it highlights that blind judges systematically overrate reports with incomplete evidence

3. The S2 calibration finding (plain LLM exhibits indiscriminate pessimism) is a specific, interesting behavioral insight

4. Honest reporting of threats to validity, including the gold-set circularity issue

Key limitations:

1. Conflict of interest: The paper is authored by a Noah AI researcher evaluating Noah AI's proprietary data product. While the experimental design is reasonable, the conclusion directly supports the commercial proposition of the author's employer.

2. Narrow benchmark: 13 assets, single-target only, single therapeutic modality focus. Generalizability is uncertain.

3. No reproducibility for the key claim: The C arm and gold keys depend on proprietary data that cannot be shared, making the central result non-reproducible.

4. The informed-DQ metric is constructed to show the result: Multiplying DQ by coverage mechanically amplifies coverage differences. While the authors provide alternative formulations, all share the property that low coverage dramatically penalizes A/B.

5. Statistical analysis is minimal: No confidence intervals, significance tests, or formal statistical modeling despite small samples and acknowledged seed-to-seed noise of ~±0.3.

6. Missing baselines: No comparison with other commercial intelligence sources (Citeline directly, Evaluate Pharma, etc.) to test whether the finding is specific to Noah AI or generalizes to any curated database.
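Limitation 4 can be made concrete by holding decision quality fixed and varying only coverage. The four formulas below are assumed forms for the product, ceiling, min and geometric views; the paper's exact definitions are not given in this assessment, and the fixed decision quality of 7.0 is a hypothetical value chosen to isolate the coverage effect.

```python
import math


def utility_views(dq: float, coverage: float, max_dq: float = 10.0) -> dict:
    """Four assumed completeness-aware views of one report's utility."""
    cap = max_dq * coverage  # best score attainable at this coverage
    return {
        "product": dq * coverage,
        "ceiling": cap,
        "min": min(dq, cap),
        "geometric": math.sqrt(dq * cap),
    }


FIXED_DQ = 7.0                    # hypothetical, identical for every arm
coverages = [0.25, 0.38, 0.96]    # aggregate gold coverage quoted for A, B, C

rows = [utility_views(FIXED_DQ, c) for c in coverages]
for c, row in zip(coverages, rows):
    print(c, {k: round(v, 2) for k, v in row.items()})
```

With decision quality held constant, every view still ranks the arms purely by coverage, which is the sense in which the alternative formulations do not provide an independent check on the headline result.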

Overall Assessment

The paper asks an important question and provides a reasonably well-designed experiment to answer it, arriving at an unsurprising but usefully quantified conclusion: proprietary curated data dramatically improves factual coverage in knowledge-intensive AI tasks. The methodological contribution (completeness-aware decision utility, stratified ablation design) has moderate value. However, the small scale, gold-set circularity, commercial conflict, and non-reproducibility of the key result significantly limit the scientific impact. The paper reads more as a rigorous product evaluation than a generalizable scientific finding.

Rating: 4.5 / 10

Significance 5.5 · Rigor 4.5 · Novelty 4 · Clarity 7

Generated Jun 9, 2026

Comparison History (20)

Won vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Paper 2 addresses a more fundamental and broadly applicable question about AI systems: whether capability improvements come primarily from model quality/reasoning or from the evidence substrate. This insight—that proprietary data sets the upper bound on AI scientist performance—has implications across all knowledge-intensive AI applications, not just drug valuation. Paper 1, while practically useful, addresses a narrow engineering automation problem with a relatively incremental contribution (multi-agent framework for barrier design). Paper 2's rigorous ablation methodology and its challenge to prevailing assumptions about AI capability scaling make it more likely to influence future research directions across multiple fields.

claude-opus-4-6·Jun 11, 2026

Won vs. Instruction Finetuning DeepSeek-R1-8B Model Using LoRA and NEFTune

Paper 2 addresses a fundamental question about AI agents by demonstrating that data access, rather than reasoning scaffolds, limits performance in knowledge-intensive tasks. It introduces novel metrics and its insights have broad implications for AI evaluation across scientific domains. In contrast, Paper 1 is an incremental study applying established fine-tuning techniques (LoRA, NEFTune) to a specific, narrow application (Financial NER).

gemini-3.1-pro-preview·Jun 10, 2026

Lost vs. Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Paper 1 introduces a novel, reproducible framework (MCPS) combining Monte Carlo search with trajectory generation for football analytics, releasing code and checkpoints publicly. It adapts methods across domains (autonomous driving to sports) and enables counterfactual reasoning with distribution-aware attribution. Paper 2, while raising a valid point about proprietary data importance for AI agents, is more of an empirical ablation study on a narrow commercial application (drug-asset valuation) with results that are somewhat intuitive (better data yields better decisions). Paper 1 has broader methodological novelty, cross-domain applicability, and open-science contributions.

claude-opus-4-6·Jun 10, 2026

Lost vs. Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

Paper 2 likely has higher scientific impact due to broader, cross-domain relevance: it reframes RAG reliability as an architectural/ontological mismatch and proposes a general diagnostic taxonomy (structural, temporal, causal) with design commitments that could influence legal informatics, IR, knowledge representation, and AI governance. Its contributions are more field-shaping and timely given ongoing public/legal failures of LLM systems. Paper 1 is methodologically concrete and application-relevant, but its impact is narrower (drug-asset valuation) and depends heavily on proprietary data, limiting reproducibility and generalization.

gpt-5.2·Jun 9, 2026

Lost vs. SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Paper 1 introduces a comprehensive, unified benchmark for multimodal agents, addressing a critical bottleneck in AI research (interactive spatial reasoning). Its broad applicability across vision, NLP, and robotics, combined with open-source potential, promises significant methodological impact. In contrast, Paper 2 focuses on a niche domain (drug-asset valuation) and emphasizes the role of proprietary data, which inherently limits its reproducibility and broader scientific adoption.

gemini-3.1-pro-preview·Jun 9, 2026

Won vs. Agent Economics: An Entropy-Controlled Pluralistic Alignment Framework for Preventing Artificial Hivemind in Autonomous Agents

Paper 2 presents empirical results from a controlled ablation study on a real production system, providing concrete quantitative evidence for an important finding: that proprietary data access, not reasoning scaffolds, is the binding constraint for AI scientist agents in knowledge-intensive tasks. This has immediate practical implications for AI-driven drug discovery and broader AI agent design. Paper 1 proposes a theoretical framework (BPF) for agent economies but presents no actual results—only anticipated outcomes from planned experiments—making it a position/proposal paper with unvalidated claims and significantly lower demonstrated impact.

claude-opus-4-6·Jun 9, 2026

Lost vs. Framing, Judging, Steering: An Assessable Competency Model for Teaching Students to Reason With Generative AI

Paper 2 has higher likely scientific impact due to broader applicability, openness, and field-wide relevance. Its CoRe-3 competency model and accompanying assessment platform target a timely, scalable need in education and human-AI interaction, with testable propositions and initial validity evidence across model backends—supporting methodological rigor and replicability via released instruments/data. Paper 1 is a valuable, quantitative ablation highlighting the importance of proprietary evidence in drug-asset valuation, but its impact is narrower (domain- and dataset-proprietary), limiting reproducibility and cross-field uptake despite clear real-world relevance.

gpt-5.2·Jun 9, 2026

Won vs. RunAgent SuperBrowser: A Theory of Autonomous Web Navigation Grounded in Human Browsing Behaviour

Paper 2 has higher potential impact because it isolates a broadly relevant bottleneck for “AI scientist” performance—evidence access—via a controlled, stratified ablation with clear, decision-centric metrics (including a novel completeness-aware utility). Its conclusions generalize across high-stakes scientific/industrial domains (biomed, finance, policy) and directly inform evaluation methodology and system design. Paper 1 is a strong engineering contribution with impressive benchmark results, but its advances are more domain-specific (web navigation) and may generalize less broadly than Paper 2’s framing about data substrates limiting scientific reasoning agents.

gpt-5.2·Jun 9, 2026

Lost vs. LLM-Orchestrated Conformance Checking in Stroke Care Without Computer-Interpretable Guidelines

Paper 2 introduces a novel, generalizable framework for healthcare conformance checking that eliminates the need for Computer-Interpretable Guidelines—a significant practical barrier. It has broader applicability across medical domains, stronger methodological contribution (modular LLM orchestration architecture), and addresses a well-recognized gap in healthcare process mining. Paper 1, while rigorous in its ablation study, is narrower in scope (drug-asset valuation), heavily tied to a proprietary commercial product (Noah AI), and its core finding—that proprietary data matters—is somewhat intuitive. Paper 2's potential to impact clinical quality assessment across many healthcare settings gives it greater breadth of impact.

claude-opus-4-6·Jun 9, 2026

Won vs. A Scoping Review of the Ethical Perspectives on Anthropomorphising Large Language Model-Based Conversational Agents

Paper 2 presents a novel empirical finding with broad implications: that proprietary data access, not model capability or reasoning scaffolds, is the binding constraint for AI scientist agents in knowledge-intensive domains. This challenges prevailing assumptions in the AI agent community and provides actionable, quantitative evidence (controlled ablation study with specific metrics). Its impact extends across AI agent design, drug discovery, and scientific automation. Paper 1, while timely and thorough as a scoping review, synthesizes existing fragmented literature on AI anthropomorphism ethics without generating new empirical findings, limiting its transformative potential.

claude-opus-4-6·Jun 9, 2026
