AI scientists produce results without reasoning scientifically

Martiño Ríos-García, Nawaf Alampara, Chandan Gupta, Indrajeet Mandal, Sajid Mannan, Ali Asghar Aghajani, N. M. Anoop Krishnan, Kevin Maik Jablonka

Apr 20, 2026

arXiv:2604.18805v1 PDF

cs.AI(primary)cond-mat.mtrl-sci cs.LG

#6of 2292·Artificial Intelligence

Gold · Week 17, 2026 Share

Tournament Score

1653±21

10501800

90%

Win Rate

205

Wins

Losses

227

Matches

Rating

8.5/ 10

Significance9

Rigor8.5

Novelty8.5

Clarity8.5

Tournament Score

1653±21

10501800

90%

Win Rate

205

Wins

Losses

227

Matches

Rating

8.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Large language model (LLM)-based systems are increasingly deployed to conduct scientific research autonomously, yet whether their reasoning adheres to the epistemic norms that make scientific inquiry self-correcting is poorly understood. Here, we evaluate LLM-based scientific agents across eight domains, spanning workflow execution to hypothesis-driven inquiry, through more than 25,000 agent runs and two complementary lenses: (i) a systematic performance analysis that decomposes the contributions of the base model and the agent scaffold, and (ii) a behavioral analysis of the epistemological structure of agent reasoning. We observe that the base model is the primary determinant of both performance and behavior, accounting for 41.4% of explained variance versus 1.5% for the scaffold. Across all configurations, evidence is ignored in 68% of traces, refutation-driven belief revision occurs in 26%, and convergent multi-test evidence is rare. The same reasoning pattern appears whether the agent executes a computational workflow or conducts hypothesis-driven inquiry. They persist even when agents receive near-complete successful reasoning trajectories as context, and the resulting unreliability compounds across repeated trials in epistemically demanding domains. Thus, current LLM-based agents execute scientific workflows but do not exhibit the epistemic patterns that characterize scientific reasoning. Outcome-based evaluation cannot detect these failures, and scaffold engineering alone cannot repair them. Until reasoning itself becomes a training target, the scientific knowledge produced by such agents cannot be justified by the process that generated it.

AI Impact Assessments

(3 models)

Scientific Impact Assessment: "AI scientists produce results without reasoning scientifically"

1. Core Contribution

This paper introduces a dual-lens evaluation framework for LLM-based scientific agents that goes beyond outcome-based metrics to assess the *epistemic quality* of reasoning processes. The central novelty is the combination of: (a) a systematic performance decomposition separating base-model vs. scaffold contributions, and (b) an epistemological behavioral analysis that annotates reasoning traces as directed graphs of epistemic operations (hypothesis, test, evidence, judgment, update, commitment). Through 25,000+ agent runs across eight scientific domains, three frontier models, and two scaffold architectures, the authors demonstrate that agents routinely violate fundamental norms of scientific reasoning—ignoring evidence (68% of traces), failing to revise beliefs under contradiction (only 26% show refutation-driven revision), and rarely assembling convergent multi-test evidence (7%).

The key finding that the base model accounts for 41.4% of explained variance versus 1.5% for the scaffold is consequential: it redirects attention from scaffold engineering to base-model training as the lever for improvement.

2. Methodological Rigor

The methodology is exceptionally thorough and multi-layered:

Performance analysis uses Item Response Theory (IRT) to separate knowledge from reasoning ability, followed by Bayesian latent factor models (eight candidate specifications compared via PSIS-LOO cross-validation). The variance decomposition is well-calibrated (posterior predictive checks show R² > 0.95 at task level).

Epistemological analysis employs a two-stage LLM annotation pipeline validated against human expert annotations. Inter-annotator agreement is strong (overall 92.6% human-human PABAK, 95.7% human-LLM), lending credibility to the automated pipeline. The pattern taxonomy—distinguishing productive motifs (Popperian falsification, convergent evidence) from anti-patterns (untested claims, evidence non-uptake)—is well-grounded in philosophy of science.

Trace interventions provide a clever experimental control: injecting partial successful/failed trajectories into conversation history to test whether accumulated context (rather than intrinsic reasoning capability) drives performance. The finding that near-complete trajectories are needed for hypothesis-driven tasks, while early steps suffice for workflow tasks, is methodologically illuminating.

Limitations acknowledged: Only three models tested, two simple scaffolds (ReAct and tool-calling), no multi-agent or advanced planning architectures, fixed compute budgets. The restriction to episodic tasks (no cross-task learning) limits ecological validity relative to real scientific practice.

3. Potential Impact

Immediate impact on AI-for-science: This paper fundamentally challenges the growing deployment of LLM agents as "AI scientists." By demonstrating that outcome-based evaluation masks systematic reasoning failures, it argues compellingly that current benchmarks are insufficient. The framework (Corral) is open-source with standardized environments, tools, and scoring functions—positioned as community infrastructure.

Training signal implications: The insight that reasoning itself must become a training target, rather than just task completion, could redirect base-model development. Each environment provides reproducible tasks with scoring over trajectories, sufficient to define process-based training signals.

Philosophy of science intersection: The paper bridges computer science, philosophy of science, and metascience in a way that is rare and productive. The connection between justified true belief and AI-generated knowledge is not merely philosophical decoration—it has practical implications for trust, reproducibility, and deployment decisions.

Broader AI safety/alignment: The finding that agents apply a "single reasoning mode across the full epistemic range" regardless of task demands resonates with alignment concerns about brittle generalization.

4. Timeliness & Relevance

This paper arrives at a critical moment. The authors document (Figure A.1) that AI-scientist publications are growing from ~1% to ~5% of AI-for-chemistry literature. High-profile systems (Sakana's AI Scientist, Coscientist, AlphaEvolve) are being deployed with increasing autonomy. The paper directly addresses the gap between capability claims and epistemic reliability, providing the evaluation machinery the field currently lacks.

The concern that AI tools narrow scientific hypotheses (Evans et al., 2026) is complemented here with evidence about *how* this narrowing occurs at the reasoning level—agents don't adapt their epistemic strategies to problem demands.

5. Strengths & Limitations

Key strengths:

Scale and systematicity: 25,000+ runs across 8 domains with controlled experimental design

The epistemological graph framework is genuinely novel—no prior work annotates agent reasoning at this granularity with validated taxonomies

Multiple converging analyses (raw performance, IRT, manual annotation, latent factor modeling, trace interventions) all point to the same conclusion

Open infrastructure (Corral framework, datasets, interactive trace browser) enables reproduction and extension

The variance decomposition cleanly attributes performance to base model vs. scaffold, resolving a question that pervades the agent-building community

Notable limitations:

Only two simple scaffolds tested; advanced architectures (tree-of-thought, multi-agent debate, plan-and-execute) might yield different epistemological profiles

Three models is a narrow sample; reasoning-focused models (e.g., o1-style) are absent

The epistemological annotation relies on Claude Sonnet 4.5, introducing potential systematic bias (though validated against human judgment)

The claim that "reasoning itself must become a training target" is prescriptive but not experimentally demonstrated

Eight domains, while diverse, may not capture all forms of scientific reasoning (e.g., theory construction, analogy-based discovery)

Additional Observations

The paper's framing—connecting DENDRAL's transparent rule-based reasoning to modern LLMs' opaque statistical inference—is historically apt and rhetorically effective. The distinction between "workflow execution" and "hypothesis-driven inquiry" as an axis of epistemic demand is a useful conceptual contribution that could organize future benchmark design.

The Pass∧k analysis (probability all k trials succeed) is particularly sobering: in hypothesis-driven domains, reliability drops below 5% by k=4-6, meaning agents cannot be trusted for consistent performance even on problems they sometimes solve.

Rating:8.5/ 10

Significance 9Rigor 8.5Novelty 8.5Clarity 8.5

Generated Apr 22, 2026

Comparison History (227)

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.65/22/2026

Paper 2 addresses a fundamental and timely question about whether AI agents can truly reason scientifically, finding critical epistemic failures across 25,000+ runs. This has broader impact because it challenges the rapidly growing field of autonomous AI-driven science, providing evidence that current LLM agents lack self-correcting reasoning. Its findings affect every domain deploying AI scientists and will likely influence AI training paradigms, evaluation standards, and policy. Paper 1, while impressive in scale and clinical utility, is more narrowly focused on wearable health and represents incremental (though significant) progress in foundation models for a specific domain.

vs. Towards a General Intelligence and Interface for Wearable Health Data

gemini-3.15/22/2026

Paper 1 introduces a foundation model pretrained on an unprecedented scale of wearable data (5 million participants, 1 trillion minutes), demonstrating high impact through immediate real-world applications in predictive healthcare across 35 tasks. While Paper 2 provides a valuable critical evaluation of AI reasoning, Paper 1 represents a massive technological leap with direct, measurable benefits to public health, personalized medicine, and the broader application of AI in clinical settings.

vs. Forecasting Scientific Progress with Artificial Intelligence

claude-opus-4.65/22/2026

Paper 1 addresses a more fundamental and urgent question about AI-driven science: whether LLM agents actually reason scientifically. With 25,000+ agent runs across 8 domains, it provides rigorous evidence that current AI agents ignore evidence 68% of the time and lack epistemic self-correction. This has immediate implications for the rapidly growing field of autonomous AI research agents, challenging a core assumption underlying billions in investment. Its finding that scaffold engineering cannot fix reasoning deficits and that reasoning must become a training target provides actionable direction. Paper 2 offers a valuable benchmark but addresses a narrower, less immediately consequential question about forecasting scientific progress.

vs. Forecasting Scientific Progress with Artificial Intelligence

gemini-3.15/22/2026

Paper 2 addresses a more fundamental and widely applicable issue: the epistemic validity of autonomous 'AI Scientists.' By demonstrating that these agents fail to engage in actual scientific reasoning (e.g., ignoring evidence, lacking belief revision), it challenges the foundational reliability of a highly hyped and rapidly growing field. Paper 1's focus on forecasting scientific progress, while novel, represents a narrower use case compared to the broader implications of AI systems generating unjustified scientific knowledge.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

gpt-5.25/21/2026

Paper 2 is likely higher impact: it delivers a clear theoretical contribution (conditional—not universal—equivalence of DPO and RLHF), identifies concrete failure modes with formal characterization, and proposes a new method (CPO) with provable alignment and benchmarked SOTA results—high methodological rigor and immediate applicability to widely used LLM alignment pipelines. Paper 1 is timely and broadly relevant as an empirical/behavioral critique of “AI scientist” agents, but it is primarily diagnostic and may be less directly actionable than a new alignment objective with proofs and code.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

gpt-5.25/21/2026

Paper 2 likely has higher scientific impact: it tackles a timely, cross-cutting concern—whether autonomous LLM “scientists” meet epistemic norms—using large-scale empirical evaluation (25k+ runs) across eight domains and introduces behavioral/epistemic diagnostics that could reshape how agentic systems are evaluated and trained. Its conclusions affect AI safety, ML, scientific automation, and research policy. Paper 1 is rigorous and valuable for alignment theory/practice (DPO vs RLHF, CPO), but its scope is narrower (preference optimization methods) and primarily impacts a subarea of alignment/finetuning rather than broad scientific practice and evaluation.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.25/21/2026

Paper 2 has higher potential impact: it addresses a timely, high-stakes question (validity of autonomous AI science) with broad relevance across AI, metascience, and research policy. Its large-scale empirical design (25,000+ runs, eight domains) and decomposition of model vs scaffold effects provide actionable conclusions (outcome metrics miss epistemic failures; scaffold tweaks insufficient; reasoning must be trained). Paper 1 is valuable as a harder deep-research benchmark, but its impact is more scoped to evaluation of web-based LLM research agents rather than the foundational epistemic reliability of AI-generated science.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gpt-5.25/21/2026

Paper 1 has higher potential impact: it tackles a foundational question about whether LLM scientific agents satisfy epistemic norms, using large-scale evaluation (25k+ runs) across eight domains and introducing process-level behavioral diagnostics that expose failure modes invisible to outcome-only metrics. Its conclusions directly affect how autonomous science agents should be trained, evaluated, and trusted, with broad implications for AI, philosophy of science, and scientific practice. Paper 2 is valuable and timely as an auditing-friendly benchmark, but its impact is more incremental and primarily methodological within evaluation.

vs. How Far Are We From True Auto-Research?

claude-opus-4.65/20/2026

Paper 1 offers deeper scientific insight by identifying fundamental epistemological failures in LLM-based scientific agents through 25,000+ runs across 8 domains, demonstrating that base models—not scaffolds—determine reasoning quality, and that evidence is ignored in 68% of traces. This mechanistic understanding of *why* AI scientists fail (lack of epistemic reasoning patterns) has broader implications across all scientific domains and provides actionable direction (reasoning as a training target). Paper 2, while valuable in benchmarking auto-research quality, is more descriptive and narrower in scope (CS papers only, 117 papers, 3 agents). Paper 1's findings are more fundamental and generalizable.

vs. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

gpt-5.25/20/2026

Paper 2 has higher potential impact due to a clearer technical innovation (evidence-carrying authorization via typed certificates + deterministic gating), strong real-world applicability to safety-critical multimodal agent deployments, and rigorous evaluation under an explicit adversarial threat model with red-teaming and quantified risk reduction. Its approach is broadly relevant across agent security, HCI, and trustworthy AI, and is timely as multimodal agents increasingly perform privileged actions. Paper 1 is important diagnostically but offers fewer actionable remedies and may have narrower immediate practical uptake.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

claude-opus-4.65/19/2026

Paper 1 addresses a fundamental epistemological question about AI-driven science with broad implications across all fields using LLM agents for research. Its finding that LLM agents fail to reason scientifically despite producing correct outputs challenges the growing trend of autonomous AI research and has deep implications for AI safety, scientific integrity, and policy. The scale (25,000+ runs, 8 domains) and the dual analytical framework are rigorous. Paper 2, while technically strong and practically useful, addresses a narrower technical problem (hallucination reduction) with an inference-time correction method. Paper 1's impact spans scientific methodology, AI governance, and epistemology, giving it broader and more transformative potential.

vs. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

gpt-5.25/18/2026

Paper 2 has higher likely impact because it addresses a timely, broadly relevant question—whether AI “scientists” follow epistemic norms—using large-scale, cross-domain evaluation (25k+ runs) and yields actionable conclusions (outcome metrics miss failures; scaffolds contribute little; reasoning must be a training target). This can influence AI evaluation standards, agent design, and scientific reliability practices across fields. Paper 1 is methodologically strong and novel (SMC framing, finite-sample complexity) with practical gains, but its impact is more specialized to LLM-driven program search frameworks.

vs. Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

claude-opus-4.65/16/2026

Paper 2 addresses a fundamental question about the epistemic reliability of LLM-based scientific agents—a topic of immense and growing importance as AI is increasingly deployed in research. Its findings (evidence ignored in 68% of traces, scaffold contributing only 1.5% of variance) provide concrete, actionable insights with broad implications across all fields using AI for science. Paper 1 is a valuable benchmark contribution, but Paper 2's critical evaluation of AI reasoning limitations has broader cross-disciplinary impact and is more timely given the rush to deploy autonomous AI scientists.

vs. Process Matters more than Output for Distinguishing Humans from Machines

gpt-5.25/16/2026

Paper 1 has higher likely scientific impact due to broader relevance and timeliness: it directly interrogates the reliability of autonomous “AI scientist” systems across eight scientific domains with large-scale empirical evaluation (25,000+ runs) and identifies a fundamental limitation—lack of epistemically normative reasoning—that affects trustworthiness of AI-generated science. Its claims (scaffolds contribute little; outcome-only evaluation fails; reasoning must be a training target) can reshape evaluation practices and model-training priorities across AI-for-science, agent design, and research governance. Paper 2 is valuable but narrower (human–machine discrimination via cognitive tasks) and more application-specific.

vs. Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces

gemini-3.15/15/2026

Paper 2 addresses a critical, timely issue with broad implications: the epistemic reliability of autonomous AI scientists. By demonstrating that LLMs fail to exhibit true scientific reasoning despite producing results, it challenges current hype and highlights a fundamental limitation in AI-driven science. This will likely spark significant debate and influence future training paradigms across multiple disciplines, giving it a broader and more profound impact than Paper 1's technical analysis of reasoning trace compression.

vs. Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection

gpt-5.25/15/2026

Paper 2 likely has higher scientific impact: it provides a large-scale, cross-domain empirical evaluation (25k+ runs, eight domains) diagnosing a foundational limitation of LLM-based scientific agents—systematic failures of epistemic/scientific reasoning—and argues that outcome-based metrics and scaffold tweaks are insufficient. This is timely and broadly relevant across AI, ML evaluation, agent design, and scientific automation, with direct implications for deployment and research priorities (e.g., training reasoning as a target). Paper 1 is useful and practical but more incremental and narrower to synthetic data efficiency.

vs. Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

gpt-5.25/11/2026

Paper 2 likely has higher impact due to broader scope and timeliness: it evaluates autonomous “AI scientist” agents across eight domains with >25,000 runs and identifies systematic epistemic failures (evidence neglect, weak refutation, rare convergent testing) that directly affect reliability of machine-generated science. Its conclusions inform evaluation standards, training objectives, and deployment policy across AI, scientific automation, and metascience. Paper 1 is novel and methodologically interesting but is narrower (a single game domain) and mainly advances interpretability of LLM planning rather than addressing cross-domain scientific validity.

vs. AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

claude-opus-4.65/7/2026

Paper 2 addresses a fundamental question about whether AI agents can truly reason scientifically, with broad implications across all scientific domains using LLMs. Its large-scale empirical study (25,000+ runs, 8 domains) reveals that LLM-based agents lack epistemic reasoning patterns critical for science, and that scaffold engineering cannot fix this—only training changes can. This finding challenges the growing trend of autonomous AI science and has profound implications for AI development, scientific methodology, and trust in AI-generated knowledge. Paper 1, while practically useful, is a narrower engineering contribution focused on runtime safety for tool-calling agents.

vs. Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective

claude-opus-4.65/6/2026

Paper 2 addresses the timely and critical question of whether LLM-based AI agents can truly perform scientific reasoning, finding fundamental epistemic failures across 25,000+ runs. Given the rapid deployment of AI in science, this work has enormous breadth of impact—affecting AI safety, scientific methodology, and policy. Its findings that scaffold engineering cannot fix reasoning deficits and that outcome-based evaluation misses failures are actionable and consequential. Paper 1, while theoretically interesting in applying evolutionary game theory to shortcut learning, addresses a narrower technical question with less immediate broad impact.

vs. ScrapMem: A Bio-inspired Framework for On-device Personalized Agent Memory via Optical Forgetting

gemini-35/6/2026

Paper 2 addresses a fundamental and highly timely issue regarding the capabilities and limitations of autonomous AI scientists. By exposing deep epistemological flaws in how LLMs conduct research, it challenges current paradigms and sets a clear direction for future AI training. Its broad implications across all fields attempting to use AI for scientific discovery give it significantly higher scientific impact compared to Paper 1, which offers a valuable but more narrowly focused engineering solution for on-device memory optimization.