Rollout Cards: A Reproducibility Standard for Agent Research

Charlie Masters, Ziyuan Liu, Stefano V. Albrecht

May 12, 2026

arXiv:2605.12131v1

cs.AI (primary)

#106 of 2292 · Artificial Intelligence

Tournament Score: 1541±47 (scale 1050 to 1800)

Win Rate: 92%

Rating: 7.2 / 10

Significance 7.5 · Rigor 8 · Novelty 6.5 · Clarity 7.5

Abstract

Reproducibility problems that have long affected machine learning and reinforcement learning are now surfacing in agent research: papers compare systems by reported scores while leaving the rollout records behind those scores difficult to inspect. For agentic tasks, this matters because the same behaviour can receive different reported scores when evaluations select different parts of a rollout or apply different reporting rules. In a structured audit of 50 popular training and evaluation repositories, we find that none report how many runs failed, errored, or were skipped alongside headline scores. We also document 37 cases where reporting rules can change task-success rates, cost/token accounting, or timing measurements for fixed evidence, sometimes dramatically. We treat rollout records, not reported scores, as the unit of reproducibility for agent research. We introduce rollout cards: publication bundles that preserve the rollout record and declare the views, reporting rules, and drops manifests behind reported scores. We validate rollout cards in two settings. First, four partial public releases in tool safety, multi-agent systems, theorem proving, and search let us compute analyses their original reports did not include. Second, re-grading preserved benchmark outputs across short-answer, code-generation, and tool-use tasks shows that changing only the reporting rule can change reported scores by 20.9 absolute percentage points and, in some cases, invert rankings of frontier models. We release a reference implementation integrated into Ergon, an open-source reinforcement learning gym, and publicly publish Ergon-produced rollout-card exports for benchmarks spanning tool use, software engineering, web interaction, multi-agent coordination, safety, and search to support future research.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Rollout Cards: A Reproducibility Standard for Agent Research

Core Contribution

This paper identifies and formalizes two specific failure modes in agent research reproducibility: the recording problem (rollout evidence is discarded after scoring) and the reporting problem (reporting rules vary silently across implementations). The core proposal is rollout cards — publication bundles that preserve the full rollout record alongside declared views, reporting rules, and "drops manifests" that document what information each reported score uses or omits. This follows the lineage of Datasheets for Datasets and Model Cards, but targets a distinct artifact: completed agent-environment interaction episodes.

The key intellectual move is reframing the unit of reproducibility from the reported score to the rollout record. This is a simple but consequential shift: it enables post-hoc re-analysis under different reporting rules and makes convention-driven score differences inspectable.
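To make the shift from reported score to rollout record concrete, the following is a minimal sketch, not the paper's format or the Ergon implementation. The `Rollout` and `ReportedScore` classes, their fields, and both reporting rules are hypothetical names chosen for illustration. It shows one preserved record yielding two different scores, each carrying a declaration of what it dropped.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Rollout:
    """One preserved episode (hypothetical, heavily simplified schema)."""
    task_id: str
    status: str   # "completed", "errored", or "skipped"
    passed: bool  # grader verdict; only meaningful when status == "completed"
    tokens: int


@dataclass
class ReportedScore:
    """A score plus a declaration of what the reporting rule ignored."""
    name: str
    value: float
    drops: dict = field(default_factory=dict)


def strict_success(rollouts):
    """Errored and skipped runs count as failures; no run is dropped."""
    wins = sum(r.passed and r.status == "completed" for r in rollouts)
    return ReportedScore("success_strict", wins / len(rollouts), {"dropped_runs": 0})


def completed_only_success(rollouts):
    """Only completed runs enter the denominator; the drop is declared."""
    kept = [r for r in rollouts if r.status == "completed"]
    dropped = len(rollouts) - len(kept)
    value = sum(r.passed for r in kept) / len(kept)
    return ReportedScore("success_completed_only", value, {"dropped_runs": dropped})


# Toy record, invented for this example.
record = [
    Rollout("t1", "completed", True, 1200),
    Rollout("t2", "completed", False, 900),
    Rollout("t3", "errored", False, 300),
    Rollout("t4", "skipped", False, 0),
]

strict = strict_success(record)
lenient = completed_only_success(record)
print(strict)
print(lenient)
```

Because the record is kept, either score can be recomputed later, and the `drops` field makes the difference between them inspectable rather than silent.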

Methodological Rigor

The paper's empirical foundation rests on three pillars:

1. The 50-repository audit (§2, Appendix B) is impressively thorough. Each repository is pinned to a specific commit SHA, audited against a seven-pattern severity rubric (P1–P7), and classified with code-level evidence. The finding that 0/50 repositories report failure/error/skip counts alongside headline scores is striking. The methodology is transparent — inclusion criteria, scope decisions, and exclusions (e.g., OpenHands) are documented. The severity rubric is well-constructed, ranging from catastrophic silent absorption (score 3) to logged-but-not-surfaced (score 1). However, the sampling is purposive rather than random, limiting generalizability claims.

2. The 37-pair variance catalogue (§2.2, Appendix C) documents concrete reporting-rule discrepancies with a four-tier evidence hierarchy. The flagship examples are compelling: LLaMA-65B scoring 63.7 vs 48.8 on MMLU under different evaluation code, and τ-bench grader choice changing scores by 16.9pp while inverting model rankings. The catalogue spans task success (22 pairs), cost/tokens (9 pairs), and latency/timing (6 pairs), with careful de-duplication.

3. The experimental validation (§4) addresses two research questions effectively. RQ1 demonstrates that preserved rollout records from four public releases (GAP, MAESTRO, COPRA miniF2F, Tree-of-Thought) support analyses their original papers didn't report — including the notable finding that 20.6% of GAP text-safe responses made forbidden tool calls. RQ2 shows reporting rules can change scores by up to 20.9pp (MLE-Bench medal definitions) and invert frontier model rankings (τ-bench). The experimental design is sound: everything is retrospective, deterministic, and operates on fixed public artifacts.
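The two effects described in the list above, missing failure/error counts and rank inversion under a changed reporting rule, can be illustrated with a toy calculation. The data, model names, and rule names below are invented for illustration and are not taken from the paper or from any benchmark it audits.

```python
from collections import Counter

# Toy outcomes per model (illustrative only, not from the paper).
runs = {
    "model_a": ["pass"] * 5 + ["fail"] * 1 + ["error"] * 4,
    "model_b": ["pass"] * 7 + ["fail"] * 3,
}


def summarize(outcomes):
    """Return outcome counts next to the score under two reporting rules."""
    counts = Counter(outcomes)
    total = len(outcomes)
    graded = counts["pass"] + counts["fail"]
    return {
        "counts": dict(counts),                          # surfaced, not hidden
        "errors_as_failures": counts["pass"] / total,    # errors stay in the denominator
        "errors_excluded": counts["pass"] / graded,      # errors silently leave it
    }


def ranking(summaries, rule):
    """Order model names from best to worst under the given rule."""
    return sorted(summaries, key=lambda m: summaries[m][rule], reverse=True)


summaries = {model: summarize(outcomes) for model, outcomes in runs.items()}
for rule in ("errors_as_failures", "errors_excluded"):
    print(rule, ranking(summaries, rule), summaries["model_a"]["counts"])
```

In this toy case `model_b` leads when errors count as failures (0.70 vs 0.50), while `model_a` leads when errored runs are excluded (about 0.83 vs 0.70). Publishing the counts alongside the headline score is what lets a reader see why.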

Potential Impact

Immediate field impact: The paper addresses a genuine and growing problem. As agent benchmarks proliferate and evaluation costs rise (the paper cites 3–18× annual cost increases), the inability to re-analyze expensive rollouts becomes increasingly wasteful. If adopted, rollout cards could significantly reduce duplicated evaluation effort.

Infrastructure contribution: The release of 21 card exports across diverse domains (tool use, software engineering, web interaction, multi-agent coordination, safety, search) and the ERGON reference implementation provide concrete adoption infrastructure. The format specification (Appendix E) is carefully designed with portability invariants and extensibility mechanisms.

Cross-community synthesis: Perhaps the most underappreciated impact is enabling cross-community analysis. The paper demonstrates this concretely: safety researchers can examine tool-call behavior in benchmarks designed for capability measurement; process-level reasoning researchers can study proof-search dynamics from theorem-proving logs. This could accelerate knowledge transfer across agent research subcommunities.

Standards adoption: The paper positions itself in the Datasheets/Model Cards lineage, which has achieved significant community adoption. The analogous need for agent evaluation documentation is well-motivated.

Timeliness & Relevance

This paper is exceptionally well-timed. Agent evaluation is experiencing rapid fragmentation as capabilities advance, and the community lacks shared infrastructure for preserving and comparing evaluation evidence. The doubling of task horizons every seven months means that today's expensive rollouts will be irreplaceable tomorrow. The paper's observation that important evaluation questions often emerge after the rollouts that could answer them have been discarded is particularly acute.

Strengths

Extraordinary thoroughness: The 50-repo audit with pinned SHAs, the 37-pair catalogue with tiered evidence, and the four smoking-gun benchmarks (Appendix C.5) where 3-4 implementations of the same benchmark produce different numbers are compelling.

Concrete, actionable proposal: The format specification is detailed enough to implement, yet framework-agnostic.

Strong demonstration of value: The RQ1 findings (especially GAP's 20.6% tool-call safety gap and MAESTRO's coordination-overhead-predicts-failure finding) show real scientific value recovered from existing records.

Calibration cases: Including near-null results (HumanEval 0.6pp, GPQA 1.0pp) alongside dramatic ones shows intellectual honesty and helps readers understand when reporting rules matter most.

Limitations & Weaknesses

Adoption barrier: The paper acknowledges but underemphasizes the social/incentive challenge. Standards succeed through community adoption, and the paper provides limited evidence of willingness to adopt. No benchmark maintainers or evaluation framework authors were consulted.

Scale questions: The storage profile (Table 13) shows MAESTRO at 4.25 GB for ~1,000 runs. For frontier evaluations with millions of tokens per rollout, storage and hosting costs could be substantial.

No causal evidence for reproducibility improvement: The paper demonstrates that rollout cards *enable* re-analysis and make reporting-rule effects *visible*, but doesn't demonstrate that adoption actually improves reproducibility outcomes in practice.

RQ1 case selection: While the inclusion criteria are documented, four cases is a small demonstration set, and the paper acknowledges most public releases don't preserve enough for reanalysis — somewhat undermining the practical impact.

The drops manifest relies on author honesty: Semantic loss declarations are "inspectable declarations" rather than verifiable properties, leaving room for incomplete disclosure.
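As a rough back-of-the-envelope check on the "Scale questions" point above, the cited MAESTRO figure works out to a few megabytes per run. The sketch below assumes decimal units (1 GB = 1000 MB), treats "~1,000 runs" as exactly 1,000, and scales linearly to hypothetical run counts. Linear scaling is an assumption and is likely optimistic for frontier rollouts with much longer traces.

```python
REPORTED_TOTAL_GB = 4.25  # MAESTRO export size cited in the assessment
REPORTED_RUNS = 1_000     # "~1,000 runs", treated as exact here (assumption)


def per_run_mb(total_gb, runs):
    """Average size per run in MB, using decimal GB -> MB conversion."""
    return total_gb * 1000 / runs


def projected_gb(runs, mb_per_run):
    """Total size in GB if every run had the same average size."""
    return runs * mb_per_run / 1000


mb = per_run_mb(REPORTED_TOTAL_GB, REPORTED_RUNS)
print(f"{mb:.2f} MB per run")

# Hypothetical evaluation sizes, chosen only to show the scaling.
for hypothetical_runs in (10_000, 100_000):
    print(hypothetical_runs, f"{projected_gb(hypothetical_runs, mb):.1f} GB")
```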

Overall Assessment

This is a well-executed infrastructure/standards paper that identifies a real and growing problem, provides thorough evidence of its prevalence, and proposes a concrete solution with reference implementation. The audit work alone represents a significant empirical contribution. The paper's main limitation is uncertainty about adoption, but the technical foundation is sound and the need is genuine.

Rating: 7.2 / 10

Significance 7.5 · Rigor 8 · Novelty 6.5 · Clarity 7.5

Generated May 13, 2026

Comparison History (26)

vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

gemini-3.1 · 5/19/2026

Paper 1 addresses a systemic methodological crisis in AI agent research by proposing 'rollout cards,' a standardization akin to Model Cards. By exposing how hidden reporting rules can invert model rankings and providing an open-source framework, it has the potential to fundamentally change how agent research is evaluated and published. While Paper 2 offers deep technical insights into multimodal safety, Paper 1's structural impact on scientific rigor and reproducibility gives it broader, longer-lasting significance across the AI community.

vs. To Whom Do Language Models Align? Measuring Principal Hierarchies Under High-Stakes Competing Demands

gpt-5.2 · 5/16/2026

Paper 1 introduces a broadly applicable reproducibility standard (rollout cards) for agent research, backed by a large repository audit, demonstrated score sensitivity (including rank inversions), and an open-source reference implementation with published exports—positioning it to change community practice and evaluation norms across many agentic benchmarks. Its methodological contribution is timely as agent research scales and depends on complex rollouts. Paper 2 provides important measurements in medical/legal alignment, but its impact is more domain-scoped and primarily diagnostic rather than a field-wide infrastructure shift.

vs. Structured Role-Aware Policy Optimization for Multimodal Reasoning

gemini-3.1 · 5/16/2026

Paper 1 proposes a foundational evaluation standard for the rapidly growing field of AI agent research, addressing a critical reproducibility crisis. While Paper 2 offers a strong algorithmic improvement for multimodal RL, Paper 1's 'Rollout Cards' have the potential to shift community-wide practices, similar to 'Model Cards.' By demonstrating how arbitrary reporting rules can drastically alter benchmark scores and invert frontier model rankings, Paper 1 highlights a systemic flaw and provides an actionable, open-source solution. This methodological contribution offers a broader, longer-lasting impact across the entire AI and machine learning ecosystem.

vs. Log analysis is necessary for credible evaluation of AI agents

claude-opus-4.6 · 5/16/2026

Both papers address the same core problem—agent evaluation credibility requires examining rollout/log data beyond final scores. Paper 1 has higher impact because it goes beyond argumentation to deliver a concrete reproducibility standard (rollout cards) with a reference implementation, conducts a systematic audit of 50 repositories, provides quantitative evidence (20.9pp score changes), and releases tooling integrated into an open-source framework. Paper 2 provides valuable taxonomy and principles but is more of a position/guidelines paper. Paper 1's actionable artifacts and empirical validation make it more likely to drive adoption and change research practices.

vs. Identifying Culprits Through Deep Deterministic Policy Gradient Deep Learning Investigation

gemini-3.1 · 5/16/2026

Paper 1 addresses a critical, timely issue in AI (reproducibility in agent research) with high methodological rigor, extensive audits, and concrete open-source solutions. It has broad implications for AI evaluation. Paper 2 suffers from poor grammar and applies DDPG (an RL algorithm for continuous control) to a criminal classification task, indicating severe methodological flaws and lack of scientific rigor.

vs. TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization

claude-opus-4.6 · 5/16/2026

Rollout Cards addresses a fundamental infrastructure problem in the rapidly growing field of agent research—reproducibility—backed by a systematic audit of 50 repositories and empirical evidence that reporting rules alone can shift scores by 20+ percentage points and invert model rankings. This has broad, field-shaping impact: it proposes a new reproducibility standard with open-source tooling applicable across all agent benchmarks. TUR-DPO, while a solid methodological contribution to DPO, is more incremental—one of many DPO variants—and its impact is narrower, likely to be superseded as alignment methods evolve.

vs. Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

gemini-3.1 · 5/16/2026

Paper 1 tackles a fundamental methodological crisis in AI agent research. By exposing severe flaws in current evaluation practices and providing a concrete, validated standard for reproducibility (Rollout Cards), it has the potential to fundamentally change how agent research is conducted, reported, and compared across the discipline. While Paper 2 offers a valuable framework for AI safety, Paper 1's foundational impact on rigorous scientific practice and broad applicability gives it a higher potential for widespread scientific impact.

vs. Large Language Models Exhibit Normative Conformity

gemini-3.1 · 5/16/2026

Paper 2 addresses a critical and systemic issue in the rapidly growing field of agent research: reproducibility. By proposing a concrete standard ('rollout cards'), auditing existing repositories, and demonstrating significant ranking variations due to reporting rules, it provides foundational infrastructure that could be widely adopted. Methodological and standardization papers often have massive cross-domain impact and high citation counts. While Paper 1 offers intriguing insights into LLM behavior, Paper 2 provides actionable, field-wide methodological improvements that will affect how all future agent research is evaluated.

vs. Interval POMDP Shielding for Imperfect-Perception Agents

claude-opus-4.6 · 5/13/2026

Paper 1 addresses a fundamental and timely problem in the rapidly growing field of AI agent research—reproducibility—which affects the entire community. It introduces a concrete, practical standard (rollout cards) backed by a structured audit of 50 repositories, demonstrates that reporting rule changes can shift scores by >20 percentage points and invert model rankings, and provides an open-source implementation. Its breadth of impact across all agent benchmarking makes it highly influential. Paper 2 makes a solid but more incremental contribution to a narrower subfield (safe POMDPs with perception uncertainty), with impact limited to safety-critical autonomous systems.

vs. Hierarchical LLM-Driven Control for HAPS-Assisted UAV Networks: Joint Optimization of Flight and Connectivity

claude-opus-4.6 · 5/13/2026

Paper 2 addresses a fundamental, cross-cutting problem in AI research: reproducibility in agent evaluations. By documenting that reporting rules alone can shift scores by 20.9 percentage points and even invert model rankings, it exposes a systemic issue affecting the entire field. The proposed rollout cards standard, backed by empirical audits and open-source tooling, has broad applicability across all agent research. Paper 1, while technically interesting, is a relatively incremental application of LLMs to UAV network optimization in a narrow domain. Paper 2's potential to reshape evaluation practices gives it wider and more lasting impact.

vs. Emergent social transmission of model-based representations without inference

claude-opus-4.6 · 5/13/2026

Paper 1 addresses a critical infrastructure problem in the rapidly growing field of AI agent research—reproducibility—with concrete empirical evidence (audit of 50 repositories, 37 documented cases of score variability up to 20.9 pp). It introduces a practical standard (rollout cards) with open-source tooling, potentially reshaping how agent benchmarks are reported across the field. Its timeliness is exceptional given the explosion of agent/LLM research. Paper 2 offers an interesting theoretical contribution connecting cultural evolution with reinforcement learning, but its scope and practical impact are narrower, primarily advancing understanding within cognitive science without immediate broad methodological consequences.

vs. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

claude-opus-4.6 · 5/13/2026

Rollout Cards addresses a fundamental infrastructure problem in the rapidly growing field of agent research—reproducibility—with a concrete, validated standard backed by systematic audits of 50 repositories, quantified evidence that reporting rules can shift scores by >20 percentage points, and open-source tooling. This has broad, lasting impact across the entire agent research community by establishing reproducibility norms. SPARD, while solid, offers an incremental training improvement (curriculum learning for RLHF) in a crowded space of LLM alignment methods. Rollout Cards' breadth of impact and foundational nature give it higher potential.

vs. GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization

gemini-3.1 · 5/13/2026

Paper 1 addresses a foundational crisis in the rapidly growing field of agent research: reproducibility and evaluation standardization. By exposing significant flaws in current reporting practices and proposing a validated, open-source standard, it has the potential to fundamentally shift how agent research is conducted, evaluated, and trusted. Paper 2 presents a valuable operational optimization for sustainable LLM inference, but its scientific impact is narrower compared to the broad, paradigm-shifting implications of establishing robust reproducibility standards for an entire subfield.

vs. U-CECE: A Universal Multi-Resolution Framework for Conceptual Counterfactual Explanations

claude-opus-4.6 · 5/13/2026

Rollout Cards addresses a fundamental reproducibility crisis in the rapidly growing field of AI agent research, proposing a new standard with broad community-level implications. The paper's audit of 50 repositories revealing universal reporting gaps, demonstration that reporting rules can shift scores by 20.9 percentage points and invert model rankings, and release of a reference implementation make it highly actionable. Its impact spans all agentic AI research, a field of enormous current interest. Paper 2, while technically solid, addresses a narrower XAI subproblem with more incremental contributions to concept-based counterfactual explanations.

vs. Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

gemini-3.1 · 5/13/2026

Paper 1 addresses a critical, systemic issue—reproducibility and evaluation standards—in the rapidly growing field of agent research. By proposing 'rollout cards' and auditing existing repositories, it offers a foundational methodological improvement that could shape how future research is reported across the entire discipline. Paper 2 introduces an interesting but more narrowly focused benchmark for VLMs in a specific game context, limiting its potential breadth of impact compared to the broad, field-wide standardization proposed in Paper 1.

vs. Dual-Temporal LSTM with Hybrid Attention for Airline Passenger Load Factor Forecasting: Integrating Intra-Flight and Inter-Flight Booking Dynamics

gemini-3.1 · 5/13/2026

Paper 1 addresses a critical reproducibility crisis in the rapidly growing field of AI agent research. By proposing 'Rollout Cards' as a new foundational standard, its impact spans across the entire machine learning community, influencing how models are evaluated globally. In contrast, Paper 2 presents a solid but highly specific application of LSTMs to airline passenger forecasting, which has a much narrower scientific scope and industry-specific impact.

vs. Template-as-Ontology: Configurable Synthetic Data Infrastructure for Cross-Domain Manufacturing AI Validation

claude-opus-4.6 · 5/13/2026

Paper 1 addresses a fundamental, field-wide reproducibility problem in the rapidly growing area of AI agent research. Its findings—that reporting rules alone can shift scores by 20+ percentage points and invert model rankings—have immediate, broad implications for how the entire community evaluates and compares agents. The rollout cards standard, if adopted, would affect benchmarking across all agentic AI subfields. Paper 2 solves a real but narrower problem (synthetic data for manufacturing AI validation) with a domain-specific solution. Paper 1's breadth of impact, timeliness given the explosion of agent research, and its potential to become a community standard give it substantially higher scientific impact.

vs. Constraint-Data-Value-Maximization: Utilizing Data Attribution for Effective Data Pruning in Low-Data Environments

gemini-3.1 · 5/13/2026

Paper 1 addresses a critical reproducibility crisis in the rapidly growing field of AI agent research. By proposing a new reporting standard (Rollout Cards) and providing a comprehensive audit demonstrating the flaws in current evaluation practices, it has the potential to fundamentally shift methodological standards across the entire AI community. Paper 2 offers a valuable but much narrower technical contribution to data pruning, limiting its breadth of impact compared to a field-wide reproducibility standard.

vs. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

claude-opus-4.6 · 5/13/2026

Paper 2 addresses a fundamental infrastructure problem in the rapidly growing field of AI agent research—reproducibility. By documenting systematic issues (37 cases where reporting rules change results, 20.9pp score swings from re-grading), proposing a concrete standard (rollout cards), and releasing open-source tooling, it has broad cross-field impact. Paper 1 offers a solid but incremental contribution to LLM alignment with moderate improvements. Paper 2's potential to reshape evaluation practices across agent research, similar to how model cards transformed ML reporting, gives it higher long-term scientific impact.

vs. Why Users Go There: World Knowledge-Augmented Generative Next POI Recommendation

claude-opus-4.6 · 5/13/2026

Rollout Cards addresses a fundamental reproducibility crisis in the rapidly growing field of AI agent research, proposing a community-wide standard rather than an incremental improvement to a specific task. The finding that reporting rules alone can shift scores by 20.9 percentage points and invert model rankings is striking and broadly impactful. It affects how all agent benchmarks are conducted and reported, spanning tool use, software engineering, safety, and more. Paper 1, while solid, offers an incremental improvement (12.4%) to POI recommendation—a narrower application domain with less field-wide influence.