Read the Paper, Write the Code: Agentic Reproduction of Social-Science Results

Benjamin Kohler, David Zollikofer, Johanna Einsiedler, Alexander Hoyle, Elliott Ash

Apr 23, 2026

arXiv:2604.21965v1

cs.AI (primary)

#111 of 2292 · Artificial Intelligence

Tournament Score

1539±32 (scale 1050 to 1800)

Win Rate: 57%

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Abstract

Recent work has used LLM agents to reproduce empirical social science results with access to both the data and code. We broaden this scope by asking: Can they reproduce results given only a paper's methods description and original data? We develop an agentic reproduction system that extracts structured methods descriptions from papers, runs reimplementations under strict information isolation -- agents never see the original code, results, or paper -- and enables deterministic, cell-level comparison of reproduced outputs to the original results. An error attribution step traces discrepancies through the system chain to identify root causes. Evaluating four agent scaffolds and four LLMs on 48 papers with human-verified reproducibility, we find that agents can largely recover published results, but performance varies substantially between models, scaffolds, and papers. Root cause analysis reveals that failures stem both from agent errors and from underspecification in the papers themselves.

AI Impact Assessments

(3 models)

Scientific Impact Assessment

Core Contribution

This paper addresses a meaningful gap in the automated reproducibility literature: whether LLM agents can reproduce empirical social science results from a paper's methods description and data alone, without access to the original analysis code. Prior work (Hu et al. 2025; Shah et al. 2026; Xu & Yang 2026) provided agents with both code and data, essentially testing code execution and debugging capability. By removing code access, this work tests something fundamentally different — whether scientific papers communicate methods with sufficient precision to enable faithful reimplementation.

The pipeline has four well-defined stages: (1) structured extraction of methods from papers with result blinding, (2) autonomous reimplementation by sandboxed agents, (3) deterministic cell-level evaluation, and (4) root-cause error attribution. The information isolation design — where agents never see original code, results, or the paper itself — is a critical methodological choice that prevents trivial shortcuts like hardcoding or iterating until matching known outputs.

Methodological Rigor

The experimental design is careful and well-controlled. Several aspects stand out:

Information isolation: The two-stage audit pipeline (regex scan + LLM review for guardrail violations, plus a separate hardcoding audit) provides credible assurance that agents are genuinely reimplementing rather than retrieving memorized outputs. The finding that hardcoded runs actually perform *worse* than clean runs strengthens this claim.
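
The regex stage of a hardcoding audit like the one described above could look roughly like the sketch below. It is an illustration only, not the authors' implementation: the helper name `find_hardcoded_values`, the `min_decimals` cutoff, and the idea of matching numeric literals in agent-written code against published values are all assumptions made for this example.

```python
import re

# Matches standalone numeric literals such as 3, 0.125, or -1.5e-3.
NUMERIC_LITERAL = re.compile(
    r"(?<![\w.])-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?(?![\w.])"
)


def find_hardcoded_values(source_code, published_values, min_decimals=3):
    """Return (line number, literal) pairs that equal a published result.

    Hypothetical helper: only literals with at least `min_decimals` decimal
    places are checked, because short numbers (0, 1, 0.5) match by chance.
    """
    published = {round(v, min_decimals) for v in published_values}
    hits = []
    for line_no, line in enumerate(source_code.splitlines(), start=1):
        for match in NUMERIC_LITERAL.finditer(line):
            literal = match.group()
            mantissa = literal.lower().split("e")[0]
            decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
            if decimals < min_decimals:
                continue
            if round(float(literal), min_decimals) in published:
                hits.append((line_no, literal))
    return hits
```

A scan like this only flags candidates; a second review pass (an LLM review in the paper's pipeline) would still have to decide whether a flagged literal is a genuine shortcut or a coincidence.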

Deterministic evaluation: Unlike much prior work relying on LLM-as-judge, this paper uses deterministic cell-level comparison with a well-defined grading rubric (A-F based on percentage deviation). This is more transparent and reproducible than subjective evaluation, though the grading thresholds are somewhat arbitrary.
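
To make the idea of deterministic cell-level grading concrete, here is a minimal sketch. The review does not state the paper's thresholds, so the cutoffs in `GRADE_THRESHOLDS`, the five-letter scale (A, B, C, D, F), and the function names are assumptions chosen purely for illustration.

```python
# Hypothetical cutoffs: maximum absolute percentage deviation for each grade.
GRADE_THRESHOLDS = [("A", 1.0), ("B", 5.0), ("C", 10.0), ("D", 25.0)]


def percent_deviation(reproduced, original):
    """Absolute deviation of the reproduced cell, in percent of the original."""
    if original == 0:
        # Relative deviation is undefined at zero; an exact match counts as 0%.
        return 0.0 if reproduced == 0 else float("inf")
    return abs(reproduced - original) / abs(original) * 100.0


def grade_cell(reproduced, original):
    """Map one table cell to a letter grade; a missing cell gets F."""
    if reproduced is None:
        return "F"
    deviation = percent_deviation(reproduced, original)
    for grade, limit in GRADE_THRESHOLDS:
        if deviation <= limit:
            return grade
    return "F"


def grade_table(reproduced_cells, original_cells):
    """Grade every original cell, keyed by (row label, column label)."""
    return {
        key: grade_cell(reproduced_cells.get(key), original)
        for key, original in original_cells.items()
    }
```

Because the mapping from deviation to grade is a pure function, two evaluators given the same tables always produce the same grades, which is the property that separates this approach from LLM-as-judge scoring. The choice of cutoffs remains a judgment call, as the review notes.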

Leakage analysis: The pre/post knowledge-cutoff comparison using Economic Journal papers is a thoughtful robustness check, though with only 10 papers (5 pre, 5 post) the statistical power is limited. The authors appropriately acknowledge this caveat.

Stability analysis: Re-running 20 papers three times reveals that table-level grades are reasonably stable (80%+ within one grade step), but coefficient-level estimates vary considerably: about half of the coefficients differ significantly across repeated runs of the same paper. This is an important finding that the authors commendably report rather than suppress.
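
One way to compute a "within one grade step" stability figure is sketched below. The review does not give the paper's exact definition, so this assumes a five-step scale and counts a table as stable when its best and worst grades across runs are at most one step apart; `GRADE_ORDER` and `share_within_one_step` are hypothetical names.

```python
# Assumed five-step scale; the paper's actual rubric may differ.
GRADE_ORDER = {"A": 0, "B": 1, "C": 2, "D": 3, "F": 4}


def share_within_one_step(grades_by_table):
    """Fraction of tables whose grades across runs span at most one step."""
    if not grades_by_table:
        raise ValueError("no tables to compare")
    stable = 0
    for grades in grades_by_table.values():
        steps = [GRADE_ORDER[g] for g in grades]
        if max(steps) - min(steps) <= 1:
            stable += 1
    return stable / len(grades_by_table)
```

For example, a table graded A, A, B over three runs counts as stable, while one graded B, D, B does not.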

Sample selection: The reliance on I4Replication-verified papers ensures ground truth, but introduces selection bias — these are papers with functional, verified reproduction packages, likely representing better-than-average documentation quality. The 48-paper sample, while not small, limits statistical power for subgroup analyses.

Key Findings and Their Significance

The headline results are striking: the best agent (OpenCode GPT-5.4) recovers correct coefficient signs 91% of the time and places 80%+ of reproduced coefficients within the 95% confidence interval. This demonstrates substantial capability but also reveals clear limitations — roughly 1 in 10 coefficients has the wrong sign even with the best system.
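
The two headline metrics can be sketched as follows. This assumes the confidence interval in question is the original estimate's 95% interval, built from the published standard error with a normal approximation (z = 1.96); the review does not specify this, and the function names are hypothetical.

```python
def _sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_agreement(reproduced, original):
    """Share of coefficient pairs whose signs agree."""
    if len(reproduced) != len(original) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(_sign(r) == _sign(o) for r, o in zip(reproduced, original))
    return hits / len(original)


def ci_coverage(reproduced, original, std_errors, z=1.96):
    """Share of reproduced coefficients inside original +/- z * SE."""
    if not (len(reproduced) == len(original) == len(std_errors)) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        abs(r - o) <= z * se
        for r, o, se in zip(reproduced, original, std_errors)
    )
    return hits / len(original)
```

The two metrics capture different failure modes: a coefficient can carry the right sign yet fall far outside the interval, or sit inside a wide interval around zero with the wrong sign.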

The most impactful finding is arguably the error attribution analysis. The dominant source of failures is not agent errors but paper underspecification — methods descriptions in published papers are frequently insufficient for faithful reimplementation. This finding has direct implications for scientific publishing norms and editorial standards. The illustrative examples in Table 3 (party coding ambiguity, F-statistic type omission) are concrete and compelling.

The scaffold-model interaction effects are noteworthy: GPT-5.4 on OpenCode dramatically outperforms GPT-5.4 on mini-SWE-Agent, demonstrating that agentic infrastructure matters as much as model capability. The effort-performance tradeoff (OpenCode GPT-5.4 consuming 2.4x more tokens than the next system) suggests current performance is partly a function of computational budget rather than pure capability.

Potential Impact

For reproducibility science: This work provides empirical evidence that methods sections are systematically underspecified — a long-suspected but rarely quantified problem. The finding could influence editorial policies and reporting standards.

For automated science: The pipeline represents a step toward automated scientific auditing at scale. If refined, such systems could become standard tools for journals, funders, or meta-research organizations.

For AI evaluation: The benchmark of 48 papers with deterministic evaluation contributes a reusable testbed, though the proprietary nature of some components (specific model versions, scaffold configurations) may limit exact reproducibility of the benchmark itself.

Limitations on impact: The system only handles tables (not figures) for quantitative evaluation. It requires data availability (6 papers dropped for missing data). The 48-paper sample from economics and political science limits generalizability to other social sciences or natural sciences.

Timeliness & Relevance

This work arrives at a moment when both automated reproducibility and AI-assisted research are receiving intense attention. The concurrent Nature publications on mass reproducibility (Brodeur et al. 2026; Miske et al. 2026) create natural demand for scalable alternatives to labor-intensive human reproduction efforts. The paper positions itself well within this ecosystem.

Strengths & Limitations

Strengths: Rigorous information isolation; deterministic evaluation avoiding LLM-as-judge issues; comprehensive error attribution; honest reporting of limitations including run instability; practical comparison across 7 model-scaffold combinations.

Limitations: Non-random sample of well-documented papers likely overestimates real-world performance; limited generalizability beyond economics/political science; the extraction step itself uses LLMs (GPT-5-mini) whose errors propagate; no human baseline for the reimplementation task; the error attribution pipeline itself relies on LLMs and is not independently validated; some model versions (GPT-5.4, Claude Opus 4.6) are very recent and may not be available for replication.

Overall Assessment

This is a well-executed study addressing a timely and important question. Its primary contribution is conceptual — shifting the automated reproducibility paradigm from "run existing code" to "reimplement from methods descriptions" — with the secondary finding about paper underspecification being potentially more impactful for scientific practice than the AI capability demonstration itself. The work is thorough but would benefit from a human baseline comparison and larger sample size.

Rating: 7.2 / 10

Significance 7.5 · Rigor 7.5 · Novelty 7 · Clarity 8

Generated Apr 27, 2026

Comparison History (53)

vs. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

gemini-3.1 · 5/16/2026

Paper 1 addresses a fundamental and broad limitation in current AI models—causal reasoning—by introducing a highly novel, scalable framework for generating causal data via structural causal models. This methodological innovation has significant implications for advancing AI capabilities across numerous fields. In contrast, Paper 2, while valuable for addressing the reproducibility crisis in social sciences, is primarily an evaluative application of existing LLM agents to a narrower domain, resulting in a lower ceiling for widespread scientific impact.

vs. Evaluating Strategic Reasoning in Forecasting Agents

gpt-5.2 · 4/30/2026

Paper 1 is more novel and broadly impactful: it operationalizes end-to-end, information-isolated reproduction of published social-science results from methods text plus data, adds deterministic cell-level comparisons, and performs root-cause attribution that can surface paper underspecification—directly advancing scientific reliability. Its applications span computational social science, meta-science, reproducibility tooling, and agent evaluation, with clear real-world value for journals, reviewers, and researchers. Paper 2 is timely and useful for forecasting-agent diagnostics, but is narrower in domain and primarily benchmark/analysis oriented rather than altering core scientific practice across fields.

vs. MarketBench: Evaluating AI Agents as Market Participants

gpt-5.2 · 4/28/2026

Paper 2 is likely higher impact due to stronger novelty and broader cross-field relevance: it operationalizes “methods-only” computational reproducibility with strict information isolation, deterministic cell-level output comparison, and systematic error attribution. This directly addresses a central, timely problem in science (reproducibility) across many domains beyond social science, with clear real-world applications for journals, reviewers, and research labs. Methodologically, the evaluation on 48 papers with human-verified reproducibility and multiple models/scaffolds is rigorous and generalizable. Paper 1 is valuable but more niche (market-based agent coordination benchmarks).

vs. Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU

claude-opus-4.6 · 4/28/2026

Paper 1 addresses a fundamental question about AI's ability to reproduce scientific research from natural language descriptions alone, which has broad implications across all empirical sciences for reproducibility, verification, and scientific workflow automation. Its systematic evaluation framework, error attribution methodology, and findings about underspecification in published methods are highly relevant to the reproducibility crisis. Paper 2, while technically solid, addresses a narrower NLU retrieval optimization problem with incremental improvements. Paper 1's cross-disciplinary relevance and timeliness regarding LLM agents in science give it substantially higher impact potential.

vs. Certified geometric robustness -- Super-DeepG

gemini-3 · 4/28/2026

Paper 1 addresses the critical and widespread reproducibility crisis in science by automating the replication of studies using LLM agents. This has profound implications for peer review, methodological transparency, and the integrity of empirical research across multiple disciplines. Paper 2, while offering valuable technical improvements in neural network verification, addresses a narrower subfield. Paper 1's innovative approach to evaluating scientific literature itself offers a broader, more transformative potential impact on how research is validated.

vs. Adaptive ToR: Complexity-Aware Tree-Based Retrieval for Pareto-Optimal Multi-Intent NLU

gemini-3 · 4/28/2026

Paper 1 addresses the critical and widespread issue of scientific reproducibility by introducing a novel LLM-based agentic system. Its potential to automate the validation of research findings offers profound, cross-disciplinary impact on meta-science and the peer-review process. While Paper 2 provides a strong, Pareto-optimal technical advancement in natural language understanding, its impact is primarily confined to NLP and retrieval optimization, making Paper 1's broader implications for the integrity of the scientific method more significant.

vs. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

claude-opus-4.6 · 4/28/2026

Paper 2 addresses the fundamental and widely relevant problem of scientific reproducibility using LLM agents, with broader cross-disciplinary impact. It introduces a rigorous evaluation framework (48 papers, multiple models/scaffolds, error attribution) that can generalize across social sciences and beyond. The methodology of reproducing results from methods descriptions alone (without code) is highly novel and timely, with direct implications for meta-science, peer review, and research integrity. Paper 1, while solid, offers incremental improvements on a specific benchmark in a narrower domain.

vs. PhysNote: Self-Knowledge Notes for Evolvable Physical Reasoning in Vision-Language Model

gpt-5.2 · 4/28/2026

Paper 2 likely has higher scientific impact: it targets a broad, timely problem—computational reproducibility—across many social-science domains, with clear real-world applications for journals, reviewers, and researchers. Its methodology emphasizes strict information isolation, deterministic comparisons, and error attribution on a sizable benchmark (48 papers), suggesting strong rigor and utility. Paper 1 is innovative for VLM physical reasoning and may matter for embodied/robotics settings, but its impact is narrower (specific to spatio-temporal physics reasoning in VLMs) and the reported gain is incremental over a baseline.

vs. Certified geometric robustness -- Super-DeepG

gpt-5.2 · 4/28/2026

Paper 1 likely has higher impact due to stronger novelty and broader cross-field relevance: it introduces an end-to-end, information-isolated agentic pipeline to reproduce social-science results from methods text alone, plus deterministic cell-level comparison and error attribution that can directly inform better scientific reporting standards. Its applications span computational social science, reproducibility/meta-science, ML agents, and research infrastructure, aligning with a timely community priority. Paper 2 is valuable and rigorous for safety/verification, but robustness certification is a more specialized subfield with incremental improvements over existing relaxation/Lipschitz methods.

vs. Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

gemini-3 · 4/27/2026

Paper 2 offers a fundamental breakthrough in mechanistic interpretability for Mixture-of-Experts (MoE) architectures, shifting the paradigm from analyzing individual experts to understanding expert trajectories. Given the dominance of MoEs in state-of-the-art AI, this insight has profound implications for AI safety, alignment, and model design. While Paper 1 provides a valuable tool for addressing the replication crisis, Paper 2's theoretical and architectural insights have a deeper, more immediate structural impact on the rapidly advancing field of AI development.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-3 · 4/27/2026

Paper 1 addresses a fundamental and universal challenge in AI agent reliability: uncertainty detection and help-seeking. By introducing a novel metric (Ask-F1) and demonstrating that this judgment is trainable via RL across domains, it offers a foundational contribution to AI alignment and agent design. Paper 2 is highly valuable for metascience and reproducibility, but its scope is narrower and represents an applied use-case of existing agent capabilities rather than addressing a core architectural bottleneck of autonomous systems.

vs. Auditable Agents

claude-opus-4.6 · 4/27/2026

Paper 1 presents a novel empirical system for automated reproduction of social science results using LLM agents with strict information isolation, addressing the critical reproducibility crisis with concrete benchmarks across 48 papers. It offers immediate practical applications, rigorous methodology with deterministic comparison and error attribution, and generates actionable insights about both LLM capabilities and paper underspecification. Paper 2 provides a valuable conceptual framework for agent auditability but is more of a position/framework paper with limited empirical validation. Paper 1's concrete empirical contributions and direct relevance to reproducibility give it broader near-term impact.

vs. Hodoscope: Unsupervised Monitoring for AI Misbehaviors

gpt-5.2 · 4/27/2026

Paper 2 has higher likely impact: it introduces a broadly applicable, novel paradigm (unsupervised monitoring for agent misbehavior) with immediate relevance to AI safety and evaluation, and demonstrates real-world utility by discovering a new benchmark vulnerability plus recovering known exploits. Its group-wise distributional approach can generalize across domains/benchmarks and can feed into improved supervised monitors, amplifying downstream impact. Paper 1 is methodologically strong and valuable for computational social science reproducibility, but its application scope is narrower and depends heavily on availability/quality of data and method descriptions.

vs. Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

gemini-3 · 4/27/2026

Paper 2 offers a fundamental breakthrough in AI interpretability by redefining how Mixture-of-Experts (MoEs) process information, shifting the focus from individual experts to monosemantic trajectories. This theoretical advancement is likely to broadly influence future MoE architecture design and safety research. Paper 1, while highly valuable for addressing the reproducibility crisis in social sciences, represents an applied use of LLMs rather than a foundational algorithmic or theoretical advancement.

vs. Auditable Agents

claude-opus-4.6 · 4/27/2026

Paper 1 addresses a fundamental and increasingly critical problem—auditability and accountability of LLM agent systems—that will become more important as agents are deployed at scale. It provides a novel conceptual framework (five dimensions, three mechanism classes), empirical evidence across multiple layers, and practical contributions (Auditability Card, open research agenda). Its breadth of impact spans AI safety, policy, governance, and software engineering. Paper 2, while valuable for meta-science and reproducibility, addresses a narrower application domain. Paper 1's timeliness and relevance to the rapidly growing agent ecosystem gives it higher potential impact.

vs. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

gemini-3 · 4/27/2026

Paper 1 addresses a fundamental bottleneck in autonomous AI systems—knowing when to ask for help—which spans across all agentic applications. By introducing a novel metric (Ask-F1) and demonstrating that this judgment is trainable via RL, it provides a foundational contribution to AI safety and reliability. Paper 2 offers a valuable application for scientific reproducibility, but its impact is more narrowly focused on computational social science and meta-science, making Paper 1's broader methodological advancements more impactful.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

claude-opus-4.6 · 4/27/2026

The AAAI-26 AI Review Pilot represents the first large-scale real-world deployment of AI-assisted peer review across all 22,977 submissions at a major conference, demonstrating immediate practical impact on a critical scientific infrastructure problem. Its scale, real-world validation through author/PC surveys, and potential to reshape how scientific evaluation works give it broader cross-disciplinary impact. While Paper 1 makes a meaningful contribution to automated reproducibility, Paper 2 addresses a more pressing systemic challenge affecting all of science and provides actionable evidence for institutional adoption.

vs. IatroBench: Pre-Registered Evidence of Iatrogenic Harm from AI Safety Measures

claude-opus-4.6 · 4/27/2026

IatroBench addresses a critically important and timely problem—AI safety measures causing iatrogenic harm—with a rigorous pre-registered methodology. It reveals a striking, policy-relevant finding (identity-contingent withholding) with clear real-world consequences for healthcare. The paper challenges fundamental assumptions in AI safety design, has broad implications across AI policy, medical AI, and ethics, and its findings about evaluation blind spots (LLM judges missing omission harms) have methodological significance for the entire field. Paper 1 is solid engineering work but is more incremental, extending prior LLM-based reproduction systems.

vs. AI scientists produce results without reasoning scientifically

gemini-3 · 4/27/2026

Paper 2 provides a broader, cross-domain evaluation of AI scientists, uncovering fundamental epistemological flaws in LLM reasoning. Its call to shift training targets away from outcome-based evaluation to reasoning itself has profound implications for the future development of AI in science, giving it higher potential impact than Paper 1's narrower focus on reproducing social science results.

vs. Epistemic Blinding: An Inference-Time Protocol for Auditing Prior Contamination in LLM-Assisted Analysis

gpt-5.2 · 4/27/2026

Paper 2 introduces a simple, general inference-time protocol (epistemic blinding) to audit and quantify prior/memorization contamination in LLM-assisted analyses—an increasingly urgent, cross-domain reliability problem. It demonstrates effects in high-stakes biology and finance, provides an actionable metric (blinded vs unblinded divergence), and ships tooling for adoption, boosting real-world impact and timeliness. Paper 1 is valuable for computational social science reproducibility, but its impact is narrower and more field-specific; it is more an evaluation/systematization of agentic reproduction than a broadly applicable auditing primitive.