Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat

May 28, 2026

arXiv:2605.29786v1

cs.AI (primary)

#742 of 2821 · Artificial Intelligence

Tournament Score: 1458±49 (scale 1050 to 1800)

Win Rate: 57%

Rating: 6/10

Significance 6.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Abstract

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Croissant Tasks

1. Core Contribution

Croissant Tasks proposes a declarative, JSON-LD-based metadata format for representing ML evaluation benchmarks as structured, machine-actionable specifications. The key intellectual contribution is the formalization of conceptual reproducibility: rather than replicating brittle source code in identical environments, the format enables independent agents (LLM-based or otherwise) to generate fresh implementations from high-level task descriptions. The paper makes three concrete contributions: (1) the vocabulary/schema itself, extending schema.org and the existing Croissant Datasets standard; (2) an LLM-based pipeline (`pdf2ct`) that extracts Croissant Task descriptions from research papers; and (3) empirical validation showing agents can implement benchmarks from these specifications alone.

The problem-solution decoupling (TaskProblem vs. TaskSolution) is a clean design choice that mirrors real scientific practice — problem definitions are stable while solutions evolve — and enables structured comparisons across methods.
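To make the problem/solution split concrete, the following is a minimal Python sketch that builds two JSON-LD-style records. Only the type names TaskProblem and TaskSolution come from the paper as summarized here. The `ct:` and `ex:` prefixes, the namespace URLs, every property name (`ct:dataset`, `ct:metric`, `ct:solves`, `ct:model`, `ct:reportedScore`), and all identifiers and scores are hypothetical placeholders, not the published Croissant Tasks vocabulary.

```python
import json

# Hypothetical namespaces; the real Croissant Tasks vocabulary may differ.
CONTEXT = {
    "@vocab": "https://schema.org/",
    "ct": "https://example.org/croissant-tasks#",
    "ex": "https://example.org/",
}

def make_problem(problem_id, dataset, metric):
    """Describe WHAT is evaluated (data and metric), not how it is solved."""
    return {
        "@context": CONTEXT,
        "@type": "ct:TaskProblem",
        "@id": problem_id,
        "ct:dataset": dataset,  # hypothetical property
        "ct:metric": metric,    # hypothetical property
    }

def make_solution(solution_id, problem_id, model, score):
    """Describe ONE attempt at a problem; it refers to the problem by @id only."""
    return {
        "@context": CONTEXT,
        "@type": "ct:TaskSolution",
        "@id": solution_id,
        "ct:solves": {"@id": problem_id},  # hypothetical property
        "ct:model": model,                 # hypothetical property
        "ct:reportedScore": score,         # hypothetical property
    }

problem = make_problem("ex:toy-classification", "ex:toy-dataset", "accuracy")
solutions = [
    make_solution("ex:run-a", problem["@id"], "baseline-a", 0.81),
    make_solution("ex:run-b", problem["@id"], "baseline-b", 0.86),
]

# Solutions only point at the problem, so methods can be compared
# without touching the problem definition.
best = max(solutions, key=lambda s: s["ct:reportedScore"])
print(json.dumps(problem, indent=2))
print("best solution:", best["@id"])
```

In this sketch, adding a third solution leaves the problem record unchanged, which is the property the review credits the decoupling with.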

2. Methodological Rigor

The empirical validation is preliminary but demonstrates feasibility. Five NeurIPS 2025 D&B track papers were used for evaluation, covering diverse modalities (vision, language, code reasoning, medical imaging, safety). The two-stage evaluation — (1) can LLMs extract Croissant Tasks from papers? (97.4% field coverage) and (2) can agents implement benchmarks from those files? (97.1% metric implementation accuracy) — is well-structured.

However, several methodological concerns limit confidence:

Small sample size: Five benchmarks is insufficient to claim broad expressivity. The authors acknowledge this explicitly, but conclusions about format adequacy rest on thin evidence.

Human guidance: Up to 3 human prompts were allowed during code generation, which somewhat undermines the "autonomous" framing. The NOVA benchmark required 2 guidance prompts and still only achieved 85.7% correctness from CT files.

Comparison fairness: The PDF-only baseline and CT-only condition are not perfectly controlled — the PDF contains far more information (context, rationale, implementation hints), yet CT-only sometimes outperforms it. This likely reflects context window limitations rather than information sufficiency, complicating interpretation.

Metric verification: "Correct implementation" was determined by expert human review, introducing subjectivity. The detailed appendices reveal substantial numerical discrepancies (e.g., NOVA Binary F1: reported 5.3% vs. obtained 89.11%), some attributed to model version differences, which makes it hard to assess true reproduction fidelity.

No formal expressivity analysis: There is no theoretical argument about what classes of evaluations the format can or cannot represent, only empirical coverage on five examples.
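The field-coverage figure cited in this section depends on how coverage is defined, which this assessment does not spell out. The sketch below shows one plausible definition, assumed here purely for illustration: the share of expected fields that are present and non-empty, computed per file and pooled across files. The function names, field names, and sample records are all hypothetical and do not reproduce the paper's data.

```python
def field_coverage(extracted, expected_fields):
    """Fraction of expected fields that are present and non-empty."""
    if not expected_fields:
        raise ValueError("expected_fields must not be empty")
    empty_values = (None, "", [], {})
    filled = [f for f in expected_fields if extracted.get(f) not in empty_values]
    return len(filled) / len(expected_fields)

def pooled_coverage(files, expected_fields):
    """Coverage over all files at once (every field slot weighted equally)."""
    if not files:
        raise ValueError("files must not be empty")
    total = len(files) * len(expected_fields)
    filled = sum(
        round(field_coverage(f, expected_fields) * len(expected_fields))
        for f in files
    )
    return filled / total

# Hypothetical schema and extraction results.
expected = ["name", "dataset", "metric", "split"]
file_a = {"name": "toy", "dataset": "ex:toy", "metric": "", "split": "test"}
file_b = {"name": "toy2", "dataset": "ex:toy2", "metric": "f1", "split": "test"}

print(field_coverage(file_a, expected))             # 3 of 4 fields filled
print(pooled_coverage([file_a, file_b], expected))  # 7 of 8 field slots filled
```

Note that a field can be filled and still wrong, so coverage of this kind measures completeness of extraction, not its correctness.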

3. Potential Impact

The paper addresses a genuine pain point. ML reproducibility is widely acknowledged as problematic, and existing solutions (checklists, model cards, evaluation harnesses) each address only part of the problem. Croissant Tasks' positioning as a "glue layer" between high-level claims and execution infrastructure is architecturally sound.

High-impact scenarios include:

Standardized benchmark ingestion across platforms (Hugging Face, Codabench, Kaggle)

Automated retroactive formalization of the benchmark literature

Enabling cross-platform benchmark portability

Supporting the emerging ecosystem of autonomous coding agents

The practical impact depends critically on adoption, which the authors candidly discuss. The Croissant Datasets format has achieved meaningful adoption (Hugging Face, Kaggle), and this extension leverages that ecosystem. The MLCommons affiliation adds institutional credibility. However, the format must compete with the inertia of existing evaluation harnesses (lm-eval-harness, HELM) that are deeply embedded in current workflows.

The conceptual shift from technical replication to conceptual reproducibility is intellectually interesting but potentially controversial — it implicitly accepts that exact numerical reproduction is often unattainable, which some researchers may view as lowering the reproducibility bar rather than raising it.

4. Timeliness & Relevance

The timing is excellent. Three converging trends make this work relevant: (1) the proliferation of LLM benchmarks, which creates an acute need for standardization; (2) the maturation of autonomous coding agents (for example OpenHands and Devin) that can consume structured specifications; and (3) growing institutional attention to reproducibility (NeurIPS checklists, IEEE badges). The paper explicitly positions itself at the intersection of these trends.

The reliance on LLM agents for both extraction and implementation is both a strength (demonstrating practical feasibility) and a risk (tying the approach to rapidly evolving capabilities that may shift unpredictably).

5. Strengths & Limitations

Key Strengths:

Clean separation of concerns: The TaskProblem/TaskSolution duality is well-motivated and technically elegant.

Builds on existing infrastructure: Extending Croissant Datasets and schema.org rather than starting from scratch increases adoption likelihood.

Detailed skill files and prompts: The appendices (B, C) provide unusually thorough documentation of the agentic pipelines, aiding reproducibility of the paper itself.

Honest reporting: The detailed appendix reveals failures and discrepancies transparently (e.g., NOVA's 50% PDF-only success, github_prs domain failures in AbsenceBench).

Practical tooling: SHACL validation, Python validator, and GitHub repository lower the barrier to engagement.

Notable Limitations:

Evaluation scope: Five benchmarks, each with a single baseline, are insufficient to support strong claims. Complex benchmarks such as NOVA already show degraded performance.

Semantic gap: The format captures *what* to evaluate but often lacks sufficient detail about *how* (e.g., exact prompt templates, coordinate conventions), leading agents to make incorrect implementation choices.

Circular dependency on LLMs: Using LLMs to extract task descriptions and implement solutions introduces the very reproducibility concerns the format aims to address — different LLM versions may extract or implement differently.

No formal validation framework: The paper lacks automated verification that a TaskSolution actually satisfies a TaskProblem's constraints, which would be the strongest demonstration of the format's utility.

Limited novelty in the vocabulary itself: The schema is relatively straightforward; the novelty lies more in the *application* (agent-driven conceptual reproducibility) than in the technical format design.
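The "no formal validation framework" limitation listed above can be illustrated with a small sketch of what an automated check might look like. This is not something the paper provides. The record layout (`id`, `metrics`, `solves`, `results`), the function name `check_solution`, and the sample values are assumptions made for this example only.

```python
def check_solution(problem, solution):
    """Return a list of human-readable violations (an empty list means OK)."""
    violations = []
    # The solution must point at the problem it claims to solve.
    if solution.get("solves") != problem.get("id"):
        violations.append("solution does not reference this problem")
    results = solution.get("results", {})
    # Every metric the problem declares must be reported, within its range.
    for metric, (low, high) in problem.get("metrics", {}).items():
        if metric not in results:
            violations.append(f"missing metric: {metric}")
        elif not low <= results[metric] <= high:
            violations.append(f"{metric}={results[metric]} outside [{low}, {high}]")
    return violations

# Hypothetical problem: two metrics, both declared on a 0-to-1 scale.
problem = {"id": "ex:toy", "metrics": {"accuracy": (0.0, 1.0), "f1": (0.0, 1.0)}}
ok = {"solves": "ex:toy", "results": {"accuracy": 0.9, "f1": 0.8}}
bad = {"solves": "ex:toy", "results": {"accuracy": 90.0}}  # wrong scale, no f1

print(check_solution(problem, ok))
print(check_solution(problem, bad))
```

A check of this kind only catches structural mismatches (missing metrics, out-of-range values). It cannot confirm that a metric was computed correctly, which is the harder part of the gap described above.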

Overall Assessment

Croissant Tasks presents a sensible, well-timed infrastructure contribution that addresses a real problem. The conceptual framework (declarative specifications enabling agent-driven conceptual reproducibility) is compelling and forward-looking. However, the empirical validation remains preliminary, and the practical impact hinges entirely on community adoption — a social rather than technical challenge. The paper is strongest as a position/systems contribution and weakest as an empirical study.

Rating: 6/10

Significance 6.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Generated May 29, 2026

Comparison History (14)

vs. Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

gemini-3.1 · 5/29/2026

While Paper 1 offers rigorous theoretical contributions to multi-agent LLM coherence, Paper 2 tackles the foundational and widespread issue of reproducibility in machine learning. By proposing a scalable, machine-actionable metadata format that enables automated reproduction, Paper 2 has the potential to fundamentally improve scientific methodology and evaluation standards across the entire ML community.

vs. PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

gemini-3.1 · 5/29/2026

While Paper 1 addresses the critical issue of ML reproducibility, its impact relies heavily on widespread community adoption of a new metadata standard. Paper 2 presents a direct, highly performant solution to a significant bottleneck in materials science. By vastly reducing computational overhead and achieving a 4x reduction in error for targeted material discovery, PRISMat has immediate, measurable potential to accelerate the real-world discovery of novel materials (e.g., catalysts, batteries), leading to broader tangible scientific and technological advancements.

vs. Formalizing Mathematics at Scale

gpt-5.2 · 5/29/2026

Paper 1 likely has higher impact due to strong novelty (scaling autoformalization via multi-agent LLMs plus formal verification), a large, concrete artifact (45k declarations/500k LOC across 26 textbooks), and broad cross-field implications for mathematics, theorem proving, and reliable AI-generated research. It is timely given rapid advances in LLM-assisted proof and could reshape how mathematical knowledge is validated and reused. Paper 2 is important and practical for ML reproducibility, but metadata standards tend to see slower, adoption-dependent impact and may be narrower in scope than large-scale formalized mathematics infrastructure.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gemini-3.1 · 5/29/2026

Paper 2 addresses a highly critical and timely bottleneck in LLM development: the efficiency and token cost of Chain-of-Thought reasoning. By systematically analyzing how compressed reasoning data affects supervised fine-tuning and reinforcement learning, it provides actionable empirical insights that can immediately influence the post-training pipelines of frontier models. While Paper 1 tackles an important foundational issue (reproducibility), Paper 2's findings have broader, immediate applications for scaling LLM reasoning capabilities in a rapidly advancing field.

vs. MemCog: From Memory-as-Tool to Memory-as-Cognition in Conversational Agents

gemini-3.1 · 5/29/2026

Paper 2 addresses a fundamental scientific challenge across all of machine learning—reproducibility. By introducing a machine-actionable metadata format that enables conceptual reproducibility via autonomous agents, it has the potential to systematically improve how ML research is validated. While Paper 1 presents a strong, innovative approach to conversational agent memory, Paper 2's scope and potential to transform evaluation standards across multiple ML subfields grant it a higher breadth of impact.

vs. Xetrieval: Mechanistically Explaining Dense Retrieval

gpt-5.2 · 5/29/2026

Paper 2 has higher likely impact: it proposes a general, extensible metadata standard plus an automated conversion pipeline and agent-based validation, directly addressing ML reproducibility—an urgent, cross-cutting problem affecting many subfields and benchmarks. If adopted, Croissant Tasks could become infrastructure used broadly across academia and industry, enabling scalable, tool-driven “conceptual reproducibility.” Paper 1 is novel and methodologically interesting for interpretability of dense retrieval, with concrete applications in IR, but its scope is narrower and adoption depends on specific retriever setups rather than ecosystem-wide standardization.

vs. A Unified Framework for the Evaluation of LLM Agentic Capabilities

gemini-3.1 · 5/29/2026

Paper 2 addresses a critical bottleneck in the rapidly growing field of LLM agents by providing a unified, large-scale evaluation framework. Its massive empirical validation (400K rollouts, 15 models) and standardization of volatile environments offer immediate, widespread utility for researchers benchmarking new models. While Paper 1's focus on reproducibility is important, Paper 2's highly timely contribution and comprehensive methodology are more likely to drive broad adoption and immediate scientific impact.

vs. PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft

gemini-3.1 · 5/29/2026

Paper 1 addresses a foundational challenge across the entire machine learning field—reproducibility—by introducing a declarative metadata format and automated verification pipeline. If widely adopted, it could fundamentally standardize how ML evaluations are published and verified. While Paper 2 offers innovative techniques for embodied AI memory and continual learning, its immediate applications and impact are largely confined to the specific subfield of embodied agents in virtual environments, making Paper 1 much broader in scope and potential scientific impact.

vs. Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

claude-opus-4.6 · 5/29/2026

Battery-Sim-Agent introduces a novel paradigm of using LLM agents as reasoning-based optimizers for scientific inverse problems, with concrete real-world applications in battery technology—a critical area for energy transition. It demonstrates tangible performance gains over established baselines (Bayesian optimization) on practical tasks including real-world datasets. Paper 1 addresses important reproducibility infrastructure but is more incremental (extending existing Croissant format) and serves as tooling rather than opening a new methodological direction. Paper 2's approach of LLM-driven scientific reasoning has broader transferability to other inverse problems across science and engineering.

vs. Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

gpt-5.2 · 5/29/2026

Paper 2 has higher estimated impact: it proposes a general, machine-actionable metadata standard for ML evaluation reproducibility, a cross-cutting bottleneck affecting many subfields. The approach is novel (conceptual reproducibility via declarative task specs + agent-generated reimplementations), broadly applicable (benchmarks, leaderboards, papers, audits), timely amid LLM-driven automation, and includes an explicit specification plus empirical validation. Paper 1 addresses an important education problem, but is more domain-specific and appears primarily conceptual/architectural without clear empirical rigor or standardization leverage comparable to a reproducibility format.

vs. CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

gpt-5.2 · 5/29/2026

Paper 2 (CORE) likely has higher scientific impact: it proposes a novel, efficient non-parametric self-improvement algorithm for LLM reasoning that reduces rollout/sample requirements and shows competitive gains across multiple tasks, making it timely and broadly applicable to many agentic and reasoning settings. Its methodological contribution (contrastive reflection into compact, interpretable insights) could influence both training-free adaptation and interpretability. Paper 1 addresses an important reproducibility problem with a metadata standard and LLM retrofitting pipeline, but adoption/friction and standards competition may limit near-term impact relative to a performance-enabling algorithm.

vs. PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

gpt-5.2 · 5/29/2026

Paper 1 targets a core, cross-cutting bottleneck in ML science—reproducible evaluation—via a declarative, machine-actionable metadata standard plus automated conversion and agent-based validation. If adopted, it can scale conceptual reproducibility across benchmarks, models, and institutions, influencing research practice, tooling, and governance beyond any single domain. Paper 2 is novel and timely for LLM-based game play, but its impact is narrower (poker-centric), depends on proprietary model performance, and relies heavily on expert-crafted skill libraries, limiting generality. Overall, Paper 1 has broader, longer-term scientific and infrastructural impact potential.

vs. Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

gemini-3.1 · 5/29/2026

Paper 1 addresses the foundational and systemic crisis of reproducibility in machine learning. By proposing a scalable, declarative metadata format that enables agent-driven conceptual reproducibility, it has the potential to become a universal standard for ML benchmarking. While Paper 2 offers a highly timely and rigorous algorithmic improvement for multimodal model alignment, Paper 1's contribution transcends specific model architectures. It promises broader, longer-lasting impact across the entire ML community by fundamentally improving how scientific claims are verified and evaluated.

vs. BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

gpt-5.2 · 5/29/2026

Paper 2 likely has higher scientific impact because it proposes a broadly applicable infrastructure contribution (a declarative metadata standard plus an automated conversion and validation pipeline) that can affect many ML subfields by improving reproducibility at scale. Its real-world adoption potential is high (benchmarks, leaderboards, audits, tooling) and it addresses a timely, widely recognized bottleneck. Paper 1 is novel and important for AI biosecurity auditing, but the evidence is preliminary (small prompt set, limited SAE coverage, within-sample calibration) and its impact is narrower and more dependent on specialized interpretability tooling.