Yinan Wang
AI Scientist agents are often evaluated as if capability were mainly a function of model quality, prompting, or reasoning scaffolds. We test a different hypothesis in drug-asset valuation: for knowledge-intensive scientific decisions, the limiting factor is often the evidence substrate the agent can access. We run a controlled three-arm ablation on a production valuation agent: A is a plain web-only LLM analyst, B adds public structured tools plus a 14-dimension valuation playbook, verifier, objectivity policy and red-team, and C adds the proprietary Noah AI corpus of curated pipeline, trial and deal intelligence. Across a 13-asset stratified benchmark, B improves calibration and audit discipline: tier-in-range accuracy rises from 0.80 to 0.89 and objectivity from 3.16 to 3.30. But B does not remove the factual ceiling. Under capability-superset accounting, A and B recover only 0.25 and 0.38 of the curated gold competitive record, while C recovers 0.96; on the curated long-tail subset, C reaches 0.93 vs. 0.26/0.30. Raw blind-panel decision quality is similar for A and B (7.01 vs. 6.96), so we introduce completeness-aware decision utility: informed decision-quality = decision-quality × gold-coverage. On this metric, C reaches 7.43 vs. 1.76/2.57 for A/B. Even a perfect non-proprietary-data report would be capped at 3.83 by B's coverage. The result is not that reasoning scaffolds are unimportant; they improve calibration and discipline. Rather, proprietary evidence sets the upper bound of what the AI Scientist can know and therefore decide.
This paper presents a controlled three-arm ablation study examining whether the primary bottleneck for AI scientist agents in drug-asset valuation is the reasoning scaffold (prompts, playbooks, verification layers) or the evidence substrate (proprietary curated databases). The central claim is that proprietary data sets an upper bound on decision quality that no amount of reasoning improvement can overcome. The authors introduce a "completeness-aware decision utility" metric (informed-DQ = decision quality × gold coverage) and argue that even a hypothetical perfect report from the non-proprietary stack would be capped at 3.83 by B's coverage, well below the 7.43 observed for the proprietary-data arm.
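The metric described above can be sketched in a few lines. This is only an illustration built from the aggregate figures quoted in the abstract, not the authors' implementation: the function names are hypothetical, and the maximum decision-quality score of 10 is an assumption (it is what the quoted 3.83 cap implies). Products of the rounded aggregates only approximate the reported 1.76 and 2.57, presumably because the paper averages per asset or uses unrounded coverage.

```python
def informed_dq(decision_quality: float, gold_coverage: float) -> float:
    """Completeness-aware utility: raw decision quality scaled by gold coverage."""
    if not 0.0 <= gold_coverage <= 1.0:
        raise ValueError("gold_coverage must lie in [0, 1]")
    return decision_quality * gold_coverage


def coverage_ceiling(gold_coverage: float, max_dq: float = 10.0) -> float:
    """Best informed-DQ reachable at a given coverage.

    max_dq=10.0 is an assumed scale maximum, not stated in the review.
    """
    return informed_dq(max_dq, gold_coverage)


# Aggregate (raw DQ, gold coverage) pairs as quoted in the abstract.
# Arm C is omitted because its raw DQ is not quoted there.
arms = {"A": (7.01, 0.25), "B": (6.96, 0.38)}

for name, (dq, cov) in arms.items():
    print(
        f"arm {name}: informed-DQ ~ {informed_dq(dq, cov):.2f}, "
        f"ceiling at this coverage ~ {coverage_ceiling(cov):.2f}"
    )
```

Under these assumptions, no improvement in raw decision quality can lift an arm above its ceiling, which is the structural point the metric is designed to make.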
The core insight, that knowledge-intensive AI agents are fundamentally evidence-limited rather than reasoning-limited, is intuitive but has not been rigorously demonstrated in this domain through controlled ablation. The paper operationalizes this clearly: Arm A (web-only LLM), Arm B (+public tools, playbooks, red-team), Arm C (+proprietary Noah AI corpus), all sharing the same backbone model (Claude Opus 4.8).
Within pharmaceutical AI: The paper provides useful practical guidance—"skills on top of the right data"—that could influence how pharma companies invest in AI agent infrastructure. The finding that reasoning scaffolds fix calibration errors (the S2 white-space finding is compelling) while data access fixes coverage is actionable.
For AI agent design broadly: The evidence-substrate bottleneck thesis generalizes beyond pharma to any knowledge-intensive domain with significant private/curated information (legal, financial, intelligence analysis). This challenges the current focus on reasoning scaffolds (chain-of-thought, tool use, etc.) as the primary lever for improving AI agents.
Commercial implications: This is fundamentally a paper that argues for the value of proprietary data products (specifically Noah AI's product). While the scientific question is legitimate, the commercial interest is transparent and should be weighed.
The paper is timely. The AI agent community is rapidly building increasingly sophisticated reasoning scaffolds, and there is relatively little systematic work examining whether the evidence base rather than the reasoning procedure is the binding constraint. As pharma companies deploy AI for portfolio decisions, understanding where to invest (better prompts vs. better data) has immediate practical value. The paper also connects to broader debates about RAG systems and knowledge-intensive tasks.
Strengths:
1. Clean experimental design with a clear hypothesis and falsifiable predictions per stratum
2. The completeness-aware decision utility metric is a genuinely useful conceptual contribution: it highlights that blind judges systematically overrate reports with incomplete evidence
3. The S2 calibration finding (plain LLM exhibits indiscriminate pessimism) is a specific, interesting behavioral insight
4. Honest reporting of threats to validity, including the gold-set circularity issue
Weaknesses:
1. Conflict of interest: The paper is authored by a Noah AI researcher evaluating Noah AI's proprietary data product. While the experimental design is reasonable, the conclusion directly supports the commercial proposition of the author's employer.
2. Narrow benchmark: 13 assets, single-target only, single therapeutic modality focus. Generalizability is uncertain.
3. No reproducibility for the key claim: The C arm and gold keys depend on proprietary data that cannot be shared, making the central result non-reproducible.
4. The informed-DQ metric is constructed to show the result: Multiplying DQ by coverage mechanically amplifies coverage differences. While the authors provide alternative formulations, all share the property that low coverage dramatically penalizes A/B.
5. Statistical analysis is minimal: No confidence intervals, significance tests, or formal statistical modeling despite small samples and acknowledged seed-to-seed noise of ~±0.3.
6. Missing baselines: No comparison with other commercial intelligence sources (Citeline directly, Evaluate Pharma, etc.) to test whether the finding is specific to Noah AI or generalizes to any curated database.
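The missing interval estimates noted in the point on statistical analysis could be supplied with a paired bootstrap over assets. The sketch below is one possible approach, not something the paper reports: the function name is hypothetical, and the per-asset scores are synthetic placeholders generated in the code, since the paper's per-asset data are not available here.

```python
import random


def paired_bootstrap_ci(a, b, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(b - a), resampling assets with replacement."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and the same length")
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    rng = random.Random(seed)
    # Each resample draws n paired differences and records their mean.
    means = sorted(
        sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lo, hi


# SYNTHETIC per-asset scores for 13 assets (placeholders, not the paper's data).
gen = random.Random(42)
arm_a = [gen.uniform(5.5, 8.5) for _ in range(13)]
# Arm B differs from arm A only by noise of roughly the acknowledged ~0.3 size.
arm_b = [x + gen.gauss(0.0, 0.3) for x in arm_a]

point, lo, hi = paired_bootstrap_ci(arm_a, arm_b)
print(f"mean difference {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

With only 13 assets the interval will be wide, which is itself informative: a reported difference smaller than the seed-to-seed noise should not be read as an effect.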
The paper asks an important question and provides a reasonably well-designed experiment to answer it, arriving at an unsurprising but usefully quantified conclusion: proprietary curated data dramatically improves factual coverage in knowledge-intensive AI tasks. The methodological contribution (completeness-aware decision utility, stratified ablation design) has moderate value. However, the small scale, gold-set circularity, commercial conflict, and non-reproducibility of the key result significantly limit the scientific impact. The paper reads more as a rigorous product evaluation than a generalizable scientific finding.
Generated Jun 9, 2026
Paper 2 addresses a more fundamental and broadly applicable question about AI systems: whether capability improvements come primarily from model quality/reasoning or from the evidence substrate. This insight—that proprietary data sets the upper bound on AI scientist performance—has implications across all knowledge-intensive AI applications, not just drug valuation. Paper 1, while practically useful, addresses a narrow engineering automation problem with a relatively incremental contribution (multi-agent framework for barrier design). Paper 2's rigorous ablation methodology and its challenge to prevailing assumptions about AI capability scaling make it more likely to influence future research directions across multiple fields.
Paper 2 addresses a fundamental question about AI agents by demonstrating that data access, rather than reasoning scaffolds, limits performance in knowledge-intensive tasks. It introduces novel metrics and its insights have broad implications for AI evaluation across scientific domains. In contrast, Paper 1 is an incremental study applying established fine-tuning techniques (LoRA, NEFTune) to a specific, narrow application (Financial NER).
Paper 1 introduces a novel, reproducible framework (MCPS) combining Monte Carlo search with trajectory generation for football analytics, releasing code and checkpoints publicly. It adapts methods across domains (autonomous driving to sports) and enables counterfactual reasoning with distribution-aware attribution. Paper 2, while raising a valid point about proprietary data importance for AI agents, is more of an empirical ablation study on a narrow commercial application (drug-asset valuation) with results that are somewhat intuitive (better data yields better decisions). Paper 1 has broader methodological novelty, cross-domain applicability, and open-science contributions.
Paper 2 likely has higher scientific impact due to broader, cross-domain relevance: it reframes RAG reliability as an architectural/ontological mismatch and proposes a general diagnostic taxonomy (structural, temporal, causal) with design commitments that could influence legal informatics, IR, knowledge representation, and AI governance. Its contributions are more field-shaping and timely given ongoing public/legal failures of LLM systems. Paper 1 is methodologically concrete and application-relevant, but its impact is narrower (drug-asset valuation) and depends heavily on proprietary data, limiting reproducibility and generalization.
Paper 1 introduces a comprehensive, unified benchmark for multimodal agents, addressing a critical bottleneck in AI research (interactive spatial reasoning). Its broad applicability across vision, NLP, and robotics, combined with open-source potential, promises significant methodological impact. In contrast, Paper 2 focuses on a niche domain (drug-asset valuation) and emphasizes the role of proprietary data, which inherently limits its reproducibility and broader scientific adoption.
Paper 2 presents empirical results from a controlled ablation study on a real production system, providing concrete quantitative evidence for an important finding: that proprietary data access, not reasoning scaffolds, is the binding constraint for AI scientist agents in knowledge-intensive tasks. This has immediate practical implications for AI-driven drug discovery and broader AI agent design. Paper 1 proposes a theoretical framework (BPF) for agent economies but presents no actual results—only anticipated outcomes from planned experiments—making it a position/proposal paper with unvalidated claims and significantly lower demonstrated impact.
Paper 2 has higher likely scientific impact due to broader applicability, openness, and field-wide relevance. Its CoRe-3 competency model and accompanying assessment platform target a timely, scalable need in education and human-AI interaction, with testable propositions and initial validity evidence across model backends—supporting methodological rigor and replicability via released instruments/data. Paper 1 is a valuable, quantitative ablation highlighting the importance of proprietary evidence in drug-asset valuation, but its impact is narrower (domain- and dataset-proprietary), limiting reproducibility and cross-field uptake despite clear real-world relevance.
Paper 2 has higher potential impact because it isolates a broadly relevant bottleneck for “AI scientist” performance—evidence access—via a controlled, stratified ablation with clear, decision-centric metrics (including a novel completeness-aware utility). Its conclusions generalize across high-stakes scientific/industrial domains (biomed, finance, policy) and directly inform evaluation methodology and system design. Paper 1 is a strong engineering contribution with impressive benchmark results, but its advances are more domain-specific (web navigation) and may generalize less broadly than Paper 2’s framing about data substrates limiting scientific reasoning agents.
Paper 2 introduces a novel, generalizable framework for healthcare conformance checking that eliminates the need for Computer-Interpretable Guidelines—a significant practical barrier. It has broader applicability across medical domains, stronger methodological contribution (modular LLM orchestration architecture), and addresses a well-recognized gap in healthcare process mining. Paper 1, while rigorous in its ablation study, is narrower in scope (drug-asset valuation), heavily tied to a proprietary commercial product (Noah AI), and its core finding—that proprietary data matters—is somewhat intuitive. Paper 2's potential to impact clinical quality assessment across many healthcare settings gives it greater breadth of impact.
Paper 2 presents a novel empirical finding with broad implications: that proprietary data access, not model capability or reasoning scaffolds, is the binding constraint for AI scientist agents in knowledge-intensive domains. This challenges prevailing assumptions in the AI agent community and provides actionable, quantitative evidence (controlled ablation study with specific metrics). Its impact extends across AI agent design, drug discovery, and scientific automation. Paper 1, while timely and thorough as a scoping review, synthesizes existing fragmented literature on AI anthropomorphism ethics without generating new empirical findings, limiting its transformative potential.