Can AI Agents Synthesize Scientific Conclusions?

Hayoung Jung, Pedro Viana Diniz, José Reinaldo Corrêa Roveda, Abner Fernandes da Silva, Haeun Jung, Enoch Tsai, Aleksandra Korolova, Manoel Horta Ribeiro

Jun 9, 2026 · arXiv:2606.11337v1

cs.AI · cs.CL · cs.CY

#176 of 3489 · Artificial Intelligence

Tournament Score: 1527±47

Win Rate: 86%

Rating: 8/10

Significance 8.5 · Rigor 8.5 · Novelty 7.5 · Clarity 8

Abstract

Scientific AI agents increasingly retrieve evidence, reason across sources, and synthesize conclusions used in consequential decisions. Yet, their ability to do so in high-stakes domains such as health remains unclear. We introduce SciConBench, a large-scale live benchmark of 9.11K questions and expert-written conclusions from systematic reviews to evaluate open-domain scientific conclusion synthesis. The benchmark draws on an expert-validated automated evaluation pipeline that decomposes conclusions into atomic facts and measures correctness and comprehensiveness via factual precision and recall. To mitigate data leakage, we further introduce SciConHarness, a clean-room evaluation harness that equips agents with controlled web interaction to ensure valid measurement. Evaluating 8 frontier models and deep research agents, we find that factual quality remains low: under clean-room settings, the best agent achieves only a factual F1 of 0.337. Our clean-room setting consistently reduces performance relative to unconstrained evaluation, suggesting that leakage inflates estimates of models' true synthesis capabilities. Finally, we audit consumer-facing agents (e.g., Google AI Overview, OpenEvidence) and find they frequently generate incomplete and sometimes contradictory conclusions, even when the ground-truth answer is available. Overall, our results show that reliable synthesis of scientific conclusions remains an open challenge, and that clean-room evaluation is essential for assessing open-domain AI agents.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Can AI Agents Synthesize Scientific Conclusions?"

1. Core Contribution

This paper introduces SciConBench, a large-scale live benchmark of 9,107 questions derived from the Cochrane Database of Systematic Reviews (CDSR), designed to evaluate AI agents' ability to synthesize scientific conclusions from open-web evidence. The paper makes four interconnected contributions: (1) the benchmark dataset itself, (2) SciConHarness, a clean-room evaluation harness that prevents models from retrieving ground-truth artifacts via controlled web interaction, (3) an expert-validated factual evaluation pipeline based on atomic fact decomposition measuring precision, recall, and F1, and (4) empirical evaluation of 8 frontier models/agents plus audits of consumer-facing systems (Google AI Overview, OpenEvidence).

The core problem addressed is the gap between evaluating intermediate capabilities (retrieval, citation, summarization) and the end-to-end task of scientific conclusion synthesis, a long-horizon task requiring evidence retrieval, filtering, quality assessment, and integration. The clean-room innovation is particularly important: it reveals that apparent model performance is substantially inflated by ground-truth leakage, with F1 dropping by 0.02–0.172 across all systems under controlled conditions.
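To make the metric concrete, here is a minimal sketch of how factual precision, recall, and F1 could be computed once a judge has labeled each atomic fact. It assumes precision is the share of generated facts judged supported and recall is the share of reference facts judged covered; the function name, the boolean inputs, and the zero-score convention for empty inputs are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass
class FactualScores:
    precision: float
    recall: float
    f1: float


def factual_scores(generated_supported: list[bool],
                   reference_covered: list[bool]) -> FactualScores:
    """Score one generated conclusion from per-fact judge verdicts.

    generated_supported[i]: is the i-th atomic fact of the generated
        conclusion supported by the reference source text?
    reference_covered[j]: is the j-th atomic fact of the expert-written
        conclusion present in the generated conclusion?
    Both lists are assumed to come from an upstream judge (hypothetical here).
    """
    # Assumed convention: an empty fact list scores 0 rather than raising.
    precision = (sum(generated_supported) / len(generated_supported)
                 if generated_supported else 0.0)
    recall = (sum(reference_covered) / len(reference_covered)
              if reference_covered else 0.0)
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return FactualScores(precision, recall, f1)


if __name__ == "__main__":
    # Two of four generated facts supported; one of five reference facts covered.
    print(factual_scores([True, True, False, False],
                         [True, False, False, False, False]))
```

Because F1 is a harmonic mean, a conclusion that is accurate but omits most reference facts still scores low, which is consistent with the review's point that incompleteness is a major failure mode.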

2. Methodological Rigor

The methodology is exceptionally thorough across multiple dimensions:

Benchmark construction: Questions are generated from CDSR Objectives using GPT and validated by medical students across faithfulness (92%), PICO completeness (92%), and clarity (96%), with high inter-annotator agreement (AC1: 0.756–1.00).

Atomic fact decomposition: A six-step modular pipeline (decomposition, decontextualization, rewriting, relevance/redundancy filtering) is validated by two medical doctors showing 96.4% faithfulness, 96.0% completeness, and 98.0% comprehensiveness.

LLM judge validation: The paper constructs a gold-standard dataset (N=129 for precision, N=119 for recall) annotated by medical doctors, with a third adjudicating disagreements. The chosen judge (gpt-5.4-mini) achieves a macro F1 of 0.837 for precision and 0.868 for recall and passes the Alternative Annotator Test. Agreement between the LLM judge and individual experts matches or exceeds inter-expert agreement, which is strong validation.

Clean-room validation: Manual annotation of 150 tool outputs confirms high filtering precision (0.933) and recall (0.972), with 100% of ground-truth CDSR articles successfully removed.

Statistical rigor: Power analysis confirms the sample size (N=268) can detect F1 differences of Δ≈0.037 at α=0.05 with power 0.8.
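For readers who want to see how a sample size relates to a detectable F1 difference, the following is a rough sketch of a minimum-detectable-difference calculation. It assumes a two-sided paired comparison under a normal approximation; the paper's actual power analysis may use a different test, and the standard deviation of per-question F1 differences (`sd_diff`) is a hypothetical input that the review does not report.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_difference(n: int, sd_diff: float,
                              alpha: float = 0.05,
                              power: float = 0.8) -> float:
    """Smallest mean paired difference detectable with the given power.

    Normal approximation for a two-sided paired test:
        delta = (z_{1 - alpha/2} + z_{power}) * sd_diff / sqrt(n)
    sd_diff is the standard deviation of per-question score differences
    (an assumed input, not a value taken from the paper).
    """
    if n <= 0 or sd_diff < 0:
        raise ValueError("n must be positive and sd_diff non-negative")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for target power
    return (z_alpha + z_power) * sd_diff / sqrt(n)


if __name__ == "__main__":
    # sd_diff=0.216 is an illustrative placeholder, not a reported statistic.
    print(round(min_detectable_difference(268, sd_diff=0.216), 4))
```

The detectable difference shrinks with the square root of the sample size, so halving it would require roughly four times as many evaluated questions under these assumptions.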

One notable weakness is that the source text for factual precision evaluation uses only abstracts and plain-language summaries (not full review text) due to copyright constraints, potentially missing nuances in full reviews.
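The inter-annotator agreement figures cited for question validation use Gwet's AC1. As a reminder of what that statistic measures, here is a small sketch of the two-rater form, where chance agreement is estimated from the average category prevalence across both raters. The function name and inputs are illustrative, and the paper's exact computation (for example, its handling of more than two raters) is not specified in this review.

```python
from collections import Counter
from typing import Hashable, Sequence


def gwet_ac1(rater_a: Sequence[Hashable],
             rater_b: Sequence[Hashable],
             categories: Sequence[Hashable]) -> float:
    """Gwet's AC1 for two raters labeling the same items.

    AC1 = (p_a - p_e) / (1 - p_e), where
      p_a = observed proportion of items with identical labels
      p_e = (1 / (q - 1)) * sum_k pi_k * (1 - pi_k)
      pi_k = average share of category k across the two raters
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    q = len(categories)
    if q < 2:
        raise ValueError("at least two categories are required")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    chance = 0.0
    for k in categories:
        pi_k = (counts_a[k] + counts_b[k]) / (2 * n)
        chance += pi_k * (1 - pi_k)
    chance /= (q - 1)
    return (observed - chance) / (1 - chance)


if __name__ == "__main__":
    # Toy labels: 1 = criterion met, 0 = not met (made-up data).
    print(gwet_ac1([1, 1, 1, 1, 0], [1, 1, 1, 1, 1], categories=[0, 1]))
```

Unlike Cohen's kappa, AC1 stays stable when one category dominates, which is a plausible reason to prefer it for validation tasks where most items pass.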

3. Potential Impact

Immediate impact: The benchmark fills a clear gap for evaluating agentic AI systems on realistic, long-horizon scientific synthesis tasks. The finding that the best agent achieves only F1=0.337 under clean-room conditions provides a sobering calibration for the field.

Methodological impact: The clean-room evaluation paradigm is generalizable beyond health/science to any domain where web-enabled agents might retrieve rather than synthesize answers. The consistent 0.02–0.172 F1 reduction under clean-room conditions is a crucial finding that challenges performance claims of deep research agents more broadly.

Policy and safety impact: The audit of consumer-facing agents is directly relevant to public health. Finding that 50.8–59% of conclusions from Google AI Overview/Mode and OpenEvidence contain at least one contradictory fact, despite these systems having access to the ground truth, raises serious concerns given the hundreds of millions of health consultations they handle.

Practical deployment: The benchmark's live, continuously-updated design and open-source harness lower barriers for ongoing evaluation as new models and agents emerge.

4. Timeliness & Relevance

This work addresses a critical and timely need. OpenAI reports billions of weekly healthcare messages on ChatGPT; OpenEvidence reports over 200 million AI-powered health consultations. Deep research agents from OpenAI, Anthropic, Google, and Perplexity are rapidly proliferating. Yet rigorous evaluation of their synthesis capabilities—as opposed to retrieval or QA—has been lacking. The paper arrives at precisely the moment when the gap between deployment scale and evaluation rigor is most dangerous.

The benchmark leakage problem is also increasingly recognized but poorly addressed in existing benchmarks (Table 3 comparison shows no prior benchmark offers both live updates and clean-room evaluation).

5. Strengths & Limitations

Key strengths:

Scale (9.1K samples) far exceeds comparable agentic benchmarks (typically 65–200 samples)

Multi-layered validation at every pipeline stage with domain experts

The clean-room innovation provides a transferable methodological contribution

Failure mode analysis reveals clinically meaningful error categories (direction-of-effect inversions, evidence quality mischaracterization)

Comprehensive cost analysis ($3,336 total) and reproducibility details

The consumer-facing agent audit bridges benchmark evaluation to real-world deployment concerns

Notable limitations:

Domain specificity to health/clinical evidence (acknowledged by authors)

Residual indirect leakage through derivative content remains possible

Perplexity agents use "best-effort" clean-room via provider-side filters rather than direct SciConHarness integration, limiting comparability

The evaluation pipeline itself relies on LLMs at multiple stages, creating potential error cascades

The benchmark treats CDSR conclusions as gold standard, but these can become outdated (though French et al. suggest most remain stable)

No evaluation of intermediate reasoning quality or evidence selection

Missing analysis: The paper does not examine whether errors cluster around particular medical domains, question types, or evidence quality levels, which could help target improvements.

Overall Assessment

This is a high-quality, comprehensive benchmark paper that addresses a genuinely important and timely problem. Its primary scientific contribution—demonstrating that clean-room evaluation is essential and that current agents are far from reliable for scientific synthesis—is well-supported by rigorous methodology. The scale, validation depth, and practical relevance distinguish it from the numerous smaller, static benchmarks in the comparison table. The work should have substantial influence on how the community evaluates agentic AI systems, particularly in high-stakes domains.

Rating: 8/10

Significance 8.5 · Rigor 8.5 · Novelty 7.5 · Clarity 8

Generated Jun 11, 2026

Comparison History (22)

Won vs. Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

Paper 1 addresses a critical and timely problem—evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. SciConBench introduces a large-scale benchmark with a novel clean-room evaluation methodology that reveals significant limitations of frontier models and consumer-facing AI products. Its findings about data leakage inflating performance and the unreliability of deployed systems (Google AI Overview, OpenEvidence) have broad implications for AI safety, policy, and scientific practice. Paper 2 presents a solid engineering contribution for agent memory management, but its impact is more incremental and narrower in scope compared to Paper 1's foundational evaluation framework for scientific AI reliability.

claude-opus-4-6 · Jun 11, 2026

Won vs. Cross-Modal Knowledge Distillation without Paired Data: Theoretical Foundation and Algorithm

Paper 2 addresses a timely and broadly impactful question about AI agents' ability to synthesize scientific conclusions, introducing a large-scale benchmark (SciConBench) with a novel clean-room evaluation methodology. Its findings that frontier models achieve only 0.337 F1 and that data leakage inflates performance estimates have immediate implications for AI safety, healthcare, and policy. The audit of consumer-facing tools adds real-world relevance. While Paper 1 makes solid methodological contributions to cross-modal knowledge distillation, Paper 2's broader societal implications, timeliness given rapid AI agent deployment, and cross-disciplinary relevance give it higher potential impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

Paper 2 addresses a critical and timely issue: the reliability of AI agents in synthesizing high-stakes scientific information. By introducing a robust benchmark and clean-room evaluation methodology that exposes significant flaws in frontier models, it is likely to drive substantial future research in AI safety, reasoning, and scientific discovery across multiple disciplines. While Paper 1 offers a strong technical systems-ML contribution, Paper 2 has broader societal and cross-disciplinary impact.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Paper 1 demonstrates higher potential impact due to its critical focus on the reliability of AI agents in high-stakes scientific and healthcare domains. Its large-scale benchmark and novel clean-room evaluation harness address a major flaw in current LLM evaluation: data leakage. The findings audit widely used consumer-facing agents, revealing severe factual shortcomings. This provides urgent, timely, and broad implications across AI safety, medical informatics, and public health. In contrast, Paper 2, while methodologically sound, targets a narrower application in educational creativity assessment with a smaller scale of validation.

gemini-3.1-pro-preview · Jun 11, 2026

Won vs. One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Paper 2 addresses a more fundamental and broadly impactful question—whether AI agents can reliably synthesize scientific conclusions—with implications across health, policy, and all evidence-based domains. Its introduction of a large-scale benchmark (SciConBench) with clean-room evaluation methodology tackles the critical issue of data leakage in LLM evaluation, which has wide relevance. The finding that even frontier models achieve only 0.337 F1 and that consumer-facing tools produce incomplete/contradictory conclusions has immediate real-world safety implications. Paper 1, while technically solid in compressing memory tokens for resource-constrained QA, addresses a narrower efficiency optimization problem with less transformative potential.

claude-opus-4-6 · Jun 11, 2026

Won vs. Mind the Perspective: Let's Reason Recursively for Theory of Mind

SciConBench addresses a critical problem—evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. It introduces a large-scale benchmark (9.11K questions), a clean-room evaluation harness to prevent data leakage, and audits consumer-facing systems, revealing significant shortcomings. This has broad impact across AI safety, scientific integrity, and public health policy. Paper 2, while technically sound with its recursive ToM framework, addresses a narrower problem in LLM reasoning with incremental improvements on existing benchmarks. Paper 1's methodological contributions and real-world implications give it greater potential impact.

claude-opus-4-6 · Jun 11, 2026

Won vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Paper 1 likely has higher scientific impact due to broader relevance and novelty: it introduces a large-scale, live benchmark (SciConBench) plus a clean-room evaluation harness to address leakage—an important methodological contribution for assessing agentic scientific reasoning across domains. Its findings expose a general reliability gap in frontier and consumer agents, affecting health and other high-stakes uses, and are timely for AI evaluation and policy. Paper 2 has strong applied value for a specific engineering task, but its scope and cross-field influence are narrower and methodology may be more domain-bounded.

gpt-5.2 · Jun 11, 2026

Won vs. Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline

Paper 1 is likely higher impact due to its broad, timely contribution: a large-scale benchmark (9.11K) and a clean-room harness addressing a critical, cross-domain evaluation problem (data leakage and factuality in open-domain scientific synthesis), with direct implications for high-stakes health use. The methodology is rigorous and reusable across models/agents, and its findings can reshape how the community evaluates and deploys research agents. Paper 2 is applied and promising, but is narrower in scope (negotiation pre-mediation) and more sensitive to scenario/prompt specifics, limiting breadth and generalizability.

gpt-5.2 · Jun 11, 2026

Won vs. MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

SciConBench addresses a critical gap in evaluating AI agents' ability to synthesize scientific conclusions in high-stakes domains like health. Its large-scale benchmark (9.11K questions), clean-room evaluation methodology to counter data leakage, and audit of consumer-facing tools (Google AI Overview, OpenEvidence) have broad implications for AI safety, scientific integrity, and policy. The finding that even the best agents achieve only 0.337 F1 highlights a fundamental limitation with wide-reaching consequences. Paper 2 presents an incremental multi-agent framework for social intelligence reasoning with narrower scope and less transformative potential.

claude-opus-4-6 · Jun 11, 2026

Won vs. When Do Data-Driven Systems Exhibit the Capability to Infer?

Paper 1 addresses a critical and highly relevant challenge across all sciences: the reliability of AI agents in synthesizing scientific literature. By introducing a rigorous benchmark and clean-room evaluation to prevent data leakage, it exposes significant flaws in frontier models. Its impact spans AI development, healthcare, and general scientific methodology. Paper 2, while important for policy, is narrowly focused on regulatory compliance and interpreting the EU AI Act, limiting its broader scientific innovation compared to Paper 1.

gemini-3.1-pro-preview · Jun 11, 2026
