CT Open: An Open-Access, Uncontaminated, Live Platform for the Open Challenge of Clinical Trial Outcome Prediction

Jianyou Wang, Youze Zheng, Longtian Bao, Hanyuan Zhang, Qirui Zheng, Yuhan Chen, Yang Zhang, Matthew Feng

Apr 17, 2026

arXiv:2604.16742v1

cs.AI (primary), cs.CL

#144 of 2292 · Artificial Intelligence

Tournament Score: 1533 ± 40

Win Rate: 77%

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

Scientists have long sought to accurately predict outcomes of real-world events before they happen. Can AI systems do so more reliably? We study this question through clinical trial outcome prediction, a high-stakes open challenge even for domain experts. We introduce CT Open, an open-access, live platform that will run four challenges every year. Anyone can submit predictions for each challenge. CT Open evaluates those submissions on trials whose outcomes were not yet public at the time of submission but were made public afterwards. Determining whether a trial's outcome was public on the internet before a certain date is surprisingly difficult. Outcomes posted on official registries may lag by years, while the first mention may appear in obscure articles. To address this, we propose a novel, fully automated decontamination pipeline that uses iterative LLM-powered web search to identify the earliest mention of trial outcomes. We validate the pipeline's quality and accuracy against human expert annotations. Since CT Open's pipeline ensures that every evaluated trial had no publicly reported outcome when the prediction was made, it allows participants to use any methodology and any data source. In this paper, we release a training set and two time-stamped test benchmarks, Winter 2025 and Summer 2025. We believe CT Open can serve as a central hub for advancing AI research on forecasting real-world outcomes before they occur, while also informing biomedical research and improving clinical trial design. The CT Open platform is hosted at https://ct-open.net/

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CT Open

1. Core Contribution

CT Open introduces a live, quarterly-updated benchmarking platform for clinical trial outcome prediction — a genuinely difficult forecasting task where even domain experts struggle. The paper makes three intertwined contributions:

A dynamic benchmark infrastructure: Rather than a one-time dataset release, CT Open commits to quarterly challenges (Winter/Spring/Summer/Fall) with new time-stamped benchmarks, creating a persistent evaluation framework.

A novel decontamination pipeline: This is arguably the paper's most technically interesting contribution. The pipeline combines LLM-powered web search (GPT-5.2, Gemini 3), Brave Search API with custom scraping, and database queries (PubMed, PMC, BioRxiv, MedRxiv) to determine whether any trial results were publicly available before a cutoff date. This is a non-trivial problem — results can leak through conference posters, LinkedIn posts, financial disclosures, patient blogs, and obscure press releases long before formal registry updates.

Baseline evaluations: The paper benchmarks prompt-based LLMs, RAG-augmented LLMs, agentic LLMs, and traditional ML/NN methods, revealing that simple baselines remain competitive with frontier LLMs on this prediction task.
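The core decision the decontamination pipeline has to make can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Mention` record, its fields, and the rule that a mention dated on the cutoff counts as contamination are all assumptions made for the example. In the real pipeline the hard part is collecting and verifying the mentions, which is assumed here to have already happened.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical record for one web document found by the search stage."""
    source: str            # e.g. "registry", "conference poster", "press release"
    published: date        # date the document became publicly visible
    reports_outcome: bool  # verifier's judgment: does it report this trial's results?


def earliest_outcome_mention(mentions: list[Mention]) -> Optional[Mention]:
    """Return the earliest mention that reports results, or None if there is none."""
    hits = [m for m in mentions if m.reports_outcome]
    return min(hits, key=lambda m: m.published, default=None)


def is_clean(mentions: list[Mention], cutoff: date) -> bool:
    """A trial is clean if no results-bearing mention exists on or before the cutoff.

    Treating a mention dated exactly on the cutoff as contamination is a
    conservative assumption of this sketch.
    """
    first = earliest_outcome_mention(mentions)
    return first is None or first.published > cutoff
```

Under these assumptions, a trial whose registry entry is updated in 2026 but whose results appeared on a conference poster in January 2025 is clean for a cutoff of 1 January 2025 and contaminated for a cutoff of 1 February 2025, which is the registry-lag problem the abstract describes.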

2. Methodological Rigor

The decontamination pipeline is well-designed, with multiple redundant search strategies and a two-round GPT-5 verification process (trial matching, then results verification). The validation is reasonable: in a manual review of 50 trials (2-4 hours each), 49 were confirmed clean, and external validation against Elsevier's proprietary Embase database found no missed contamination. The one ambiguous case (a blurry LinkedIn photo of a conference poster) was conservatively excluded.

The answer generation pipeline achieves 99% accuracy on 100 expert-annotated samples, validated by a physician-scientist.
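To put the two validation figures (49/50 and 99/100) in perspective, one can attach a confidence interval to each proportion. The sketch below uses the Wilson score interval, which is a choice made for this illustration and not something the paper is stated to report; the function name and the default z value of 1.96 (roughly a 95% interval) are likewise assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Validation figures quoted in the assessment above.
decontamination_ci = wilson_interval(49, 50)
answer_generation_ci = wilson_interval(99, 100)
```

With these inputs the lower bounds come out near 0.90 for the decontamination check and near 0.95 for answer generation, so the point estimates of 98% and 99% are compatible with noticeably lower true rates.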

However, there are methodological concerns:

Sample sizes for validation are modest: 50 trials for decontamination and 100 questions for answer generation. Given that the benchmarks contain 605 and 857 questions respectively, these sample sizes provide reasonable but not comprehensive coverage.

The benchmark sizes themselves are small: Winter 2025 has 314 trials (605 questions) and Summer 2025 has 240 trials (857 questions). Some question-class subsets (e.g., Comparative Effect in Winter 2025 with only 9% of questions) are quite small, making statistical comparisons fragile.

Statistical testing is limited: Most performance differences are described as "nominal" increases/decreases, with only a few achieving statistical significance via hierarchical bootstrap. This is honest reporting but undermines strong claims about method comparisons.

The agent evaluation was only run with o3-mini due to cost ($1000+), limiting generalizability of findings about agentic approaches.
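The hierarchical bootstrap mentioned above matters because several questions come from the same trial and are therefore not independent. A minimal sketch of one common two-level variant follows: resample trials with replacement, then resample questions within each drawn trial. The function name, the record layout, and the percentile interval are assumptions for illustration; the paper's exact procedure may differ.

```python
import random
from collections import defaultdict


def hierarchical_bootstrap_diff(records, n_boot=2000, seed=0):
    """Bootstrap accuracy(A) - accuracy(B) with questions nested in trials.

    records: iterable of (trial_id, correct_a, correct_b), one per question.
    Returns (observed difference, 2.5th percentile, 97.5th percentile).
    """
    rng = random.Random(seed)
    by_trial = defaultdict(list)
    for trial_id, correct_a, correct_b in records:
        by_trial[trial_id].append((int(correct_a), int(correct_b)))
    trials = list(by_trial.values())
    if not trials:
        raise ValueError("records must not be empty")

    n_questions = sum(len(q) for q in trials)
    observed = sum(a - b for q in trials for a, b in q) / n_questions

    diffs = []
    for _ in range(n_boot):
        diff_sum, total = 0, 0
        for _ in range(len(trials)):
            questions = rng.choice(trials)        # level 1: resample a trial
            for _ in range(len(questions)):
                a, b = rng.choice(questions)      # level 2: resample a question
                diff_sum += a - b
                total += 1
        diffs.append(diff_sum / total)

    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[min(n_boot - 1, int(0.975 * n_boot))]
    return observed, lower, upper
```

Under this sketch a difference would be called significant when the interval excludes zero. With only a few hundred questions, and far fewer in the smaller question classes, such intervals tend to be wide, which is consistent with most reported differences being only nominal.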

3. Potential Impact

For AI/ML research: CT Open fills a genuine gap. Most LLM benchmarks test knowledge retrieval or reasoning on problems with known answers. CT Open tests *prospective prediction* — whether AI can forecast outcomes that haven't happened yet. This is fundamentally different and more aligned with real-world scientific utility. The open-method design also enables evaluation of agentic systems with unrestricted web access, which most benchmarks cannot support.

For biomedical research: If prediction accuracy improves meaningfully, this could inform clinical trial design, resource allocation, and drug development strategy. The pharmaceutical industry's ~90% failure rate in clinical development makes even marginal improvements in trial outcome forecasting valuable.

For benchmarking methodology: The decontamination pipeline addresses a real and growing problem. As LLMs are trained on increasingly comprehensive web corpora, any static benchmark with web-accessible answers becomes progressively compromised. The pipeline's approach — iterative multi-source search with LLM verification — could be adapted to other domains requiring temporal decontamination.

4. Timeliness & Relevance

This paper is highly timely. Benchmark contamination is one of the most pressing problems in LLM evaluation, and the Opus 4.5 analysis in Section 6 directly demonstrates why: a model whose knowledge cutoff postdates the benchmark timestamp shows suspiciously strong performance that drops on the uncontaminated benchmark. The quarterly refresh cycle is well-suited to the rapid pace of model releases.

The choice of clinical trial prediction is also apt — it's a domain where AI could provide substantial societal value, the ground truth is objectively determined, and the temporal structure naturally supports contamination-resistant evaluation.

5. Strengths & Limitations

Key Strengths:

The decontamination pipeline is a genuine technical contribution with broad applicability

The open-method design is philosophically sound — it doesn't artificially constrain what participants can do

The finding that traditional ML baselines compete with or outperform LLMs is valuable and humbling

The platform design (quarterly cadence, automated pipelines) supports long-term sustainability

Thorough documentation of prompts, features, and implementation details supports reproducibility

Key Limitations:

Scale: The benchmarks are relatively small (hundreds of questions), which limits statistical power for comparing methods, especially within question subtypes

Question design: Questions are formulated around statistical significance at the 95% level, which reduces clinical trial outcomes to binary/ternary classifications. This loses nuance about effect sizes, clinical significance, and dose-response relationships

Training set mismatch: The training set has very different temporal distribution (91.2% pre-2020) compared to test sets (88-89% post-2020), which could systematically disadvantage ML methods that rely on training data

Sustainability concerns: The platform depends on continued funding for LLM API calls, web scraping infrastructure, and expert validation. The paper doesn't address long-term sustainability

Limited baseline diversity: No ensemble or hybrid methods combining ML features with LLM reasoning were tested, despite this being an obvious direction

The "live" aspect is aspirational: At submission time, the platform hasn't yet run a live challenge; Winter 2025 and Summer 2025 are retrospectively constructed
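The question-design limitation can be made concrete with a small sketch. The labels and the function below are hypothetical and are not taken from the benchmark; they only illustrate how a significance-based question collapses a reported effect and p-value into one of three classes.

```python
def outcome_label(effect: float, p_value: float, alpha: float = 0.05) -> str:
    """Collapse a reported effect and p-value into a ternary class.

    alpha = 0.05 corresponds to the 95% significance level; the label
    strings are placeholders invented for this example.
    """
    if p_value >= alpha:
        return "no significant difference"
    return "significantly better" if effect > 0 else "significantly worse"


# Two hypothetical trials with very different effect sizes.
small_effect = outcome_label(effect=0.02, p_value=0.01)
large_effect = outcome_label(effect=0.40, p_value=0.01)
```

Both hypothetical trials receive the same label even though one effect is twenty times larger than the other, which is the loss of nuance about effect size and clinical significance noted above.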

Additional Observations

The paper demonstrates a thoughtful understanding of what makes benchmarks fail over time and proposes practical solutions. The insight that determining "when something first became public on the internet" is itself a hard AI problem is interesting and could spawn its own research direction. The dataset contribution (training set of 7,292 trials with 15,444 questions) has standalone value even apart from the platform.

Rating: 6.8 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 5, 2026

Comparison History (30)

vs. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

gemini-3 · 5/5/2026

Paper 2 addresses a highly impactful, real-world problem (clinical trial outcome prediction) with profound implications for medicine and economics. It also solves a critical methodological issue in AI evaluation—data contamination—by creating a live, prospectively evaluated platform. While Paper 1 introduces a rigorous benchmark for LLM agents, Paper 2's cross-disciplinary application, live forecasting design, and potential to directly influence life-saving biomedical research give it a significantly higher potential for broad scientific and real-world impact.

vs. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact due to its live, uncontaminated evaluation platform for forecasting real-world clinical trial outcomes—a high-stakes, broadly relevant application spanning AI, biomedicine, and decision science. The automated decontamination pipeline addressing outcome leakage is a notable methodological contribution and enables sustained community benchmarking (recurring challenges, time-stamped tests), which can catalyze field-wide progress. Paper 1 is a rigorous and useful benchmark for LLM agent commitment integrity, but its impact is more specialized to LLM evaluation and less directly tied to consequential real-world outcomes.

vs. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

gemini-3 · 5/5/2026

BioMiner demonstrates immediate, measurable impact on drug discovery by solving a critical bottleneck in multi-modal bioactivity extraction. It introduces a novel framework and a massive benchmark (BioVista), proving its real-world utility by directly improving downstream QSAR models and identifying novel hit candidates. While CT Open offers a valuable benchmarking platform for future clinical forecasting challenges, BioMiner provides a completed, rigorously validated system with immediate, transformative applications in cheminformatics and pharmaceutical research.

vs. BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

gpt-5.2 · 5/5/2026

Paper 2 has higher potential impact due to its broad, cross-domain relevance: a live, prospective evaluation platform for forecasting clinical trial outcomes with strict decontamination tackles a central reproducibility/contamination problem in modern ML. It enables ongoing community benchmarking, drives methodological innovation, and can influence both AI evaluation practices and biomedical decision-making. While Paper 1 is novel and useful for drug discovery data curation, its current extraction performance (F1=0.32) suggests maturity limitations and a narrower scope. CT Open’s timeliness and infrastructure-like role likely yield wider, longer-lasting impact.

vs. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact due to strong real-world relevance and breadth: it establishes a live, open, uncontaminated evaluation platform for prospective prediction in high-stakes biomedicine, with a novel automated decontamination pipeline and time-stamped benchmarks. This infrastructure can become a community standard, enabling reproducible longitudinal comparisons and influencing both ML forecasting research and clinical trial planning. Paper 1 is innovative for agent training via self-evolving environments, but impact may be narrower (agent benchmarks) and more sensitive to implementation details and fast-moving competitive baselines.

vs. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

gpt-5.2 · 5/5/2026

Paper 2 has higher likely impact: it creates a live, open, uncontaminated evaluation platform in a high-stakes biomedical domain, with a novel automated decontamination pipeline and time-stamped benchmarks. This directly addresses a major reproducibility/contamination barrier in forecasting research and can become a community standard, enabling broad participation and longitudinal progress. Its real-world applications (trial design, R&D prioritization) are substantial and timely. Paper 1 is innovative for agent training, but ecosystem/tool changes and dependence on benchmark design may limit durability versus a sustained public challenge platform.

vs. A phenotype-driven and evidence-governed framework for knowledge graph enrichment and hypotheses discovery in population data

gpt-5.2 · 5/5/2026

Paper 1 has higher potential impact due to its novel, live, decontaminated evaluation platform for prospective clinical trial outcome prediction—an unusually rigorous setup that can become a community standard benchmark. Its automated LLM-powered decontamination pipeline addresses a key methodological barrier (information leakage) and enables broad participation with real-world, time-stamped tests. The platform nature supports sustained, cross-field relevance (forecasting, ML evaluation, biomedicine) and strong real-world applications in trial design and decision-making. Paper 2 is methodologically rich but closer to incremental integration of existing components.

vs. SynHAT: A Two-stage Coarse-to-Fine Diffusion Framework for Synthesizing Human Activity Traces

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a live, open-access, time-stamped evaluation platform for clinical trial outcome prediction with an automated decontamination pipeline, addressing a core reproducibility/contamination problem in forecasting benchmarks. This enables broad, ongoing community participation, standardized comparisons, and real-world deployment relevance in a high-stakes biomedical domain, with potential cross-domain influence on evaluation of predictive AI. Paper 1 is technically novel and useful for privacy-preserving mobility data, but its impact is more domain-specific and incremental relative to existing diffusion-based synthesis work.

vs. Reinforcement Learning Trained Observer Control for Bearings-Only Tracking

claude-opus-4.6 · 5/5/2026

CT Open introduces a novel open-access benchmarking platform for clinical trial outcome prediction with a creative automated decontamination pipeline, addressing a high-stakes real-world problem. Its broader impact spans AI forecasting, biomedicine, and clinical trial design, and it provides a reusable community resource (live challenges, benchmarks). While Paper 1 makes a solid technical contribution to bearings-only tracking using DRL, it represents an incremental advance in a narrower domain. Paper 2's infrastructure contribution, interdisciplinary relevance, and potential to become a standard benchmark give it higher estimated scientific impact.

vs. Runtime Evaluation of Procedural Content Generation in an Endless Runner Game Using Autonomous Agents

gpt-5.2 · 5/5/2026

Paper 1 has higher likely scientific impact due to stronger novelty (a live, uncontaminated forecasting benchmark with an automated decontamination pipeline), high real-world stakes and applicability (clinical trial outcome prediction affects biomedical R&D and policy), broader cross-field relevance (ML evaluation, information retrieval, LLM-assisted web mining, forecasting, and biomedicine), and timeliness (urgent interest in reliable AI forecasting and contamination-resistant benchmarks). Paper 2 is solid and rigorous within game AI/PCG, but its applications and cross-disciplinary reach are narrower.

vs. Using large language models for embodied planning introduces systematic safety risks

gpt-5.2 · 5/5/2026

Paper 2 likely has higher scientific impact due to its timely focus on LLM-driven robotics safety, a high-stakes deployment area with broad relevance to AI, robotics, and safety policy. It contributes a large, deterministic benchmark (DESPITE), evaluates many models, and reports a clear, generalizable empirical finding (planning ability can saturate while safety remains flat), plus an interpretable relationship between capabilities. This combination of rigorous measurement, actionable insights, and cross-field implications is likely to drive follow-on work more than Paper 1’s valuable but narrower platform/benchmark contribution in clinical trial forecasting.

vs. AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories

gpt-5.2 · 5/5/2026

Paper 2 likely has higher impact: it introduces a live, open-access, time-stamped evaluation platform with automated decontamination to prevent label leakage—addressing a core bottleneck in forecasting research. This infrastructure can become a community benchmark with recurring challenges, broad applicability across AI forecasting, biomedicine, and research methodology, and strong real-world relevance (clinical trials). Paper 1 is novel and rigorous for repository-grounded reproducibility/ablation in virtual cell codebases, but its scope is narrower and more tool-like within a subfield.

vs. Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

claude-opus-4.6 · 5/5/2026

CT Open introduces a novel, continuously-running benchmark platform for clinical trial outcome prediction with a sophisticated decontamination pipeline, addressing a high-stakes real-world problem. Its impact spans AI forecasting methodology, biomedical research, and clinical trial design, with broad community engagement potential. Paper 2 presents an incremental improvement (COSPLAY) for LLM game-playing agents using skill banks—a more narrow contribution to an already crowded space of LLM agent frameworks, with limited real-world applicability beyond game environments.

vs. Context Over Content: Exposing Evaluation Faking in Automated Judges

claude-opus-4.6 · 5/5/2026

Paper 1 exposes a fundamental and previously unmeasured vulnerability in the widely-adopted LLM-as-a-judge paradigm, showing that contextual framing systematically corrupts evaluations while remaining invisible to chain-of-thought inspection. This has immediate, broad implications for AI safety evaluation pipelines, alignment research, and the trustworthiness of automated benchmarking—areas of intense current focus. Paper 2 introduces a valuable benchmark platform for clinical trial prediction, but its impact is more niche and incremental. Paper 1's finding that evaluation faking is implicit and undetectable by standard methods challenges core assumptions across the entire LLM evaluation ecosystem.

vs. On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length

gemini-3 · 5/5/2026

Paper 2 addresses a high-stakes, real-world problem (clinical trial prediction) with a novel, live evaluation platform that solves the critical issue of data contamination. Its interdisciplinary reach across AI and biomedicine, combined with an ongoing open challenge format, promises broader and more tangible real-world impacts than Paper 1's empirical study on agent training dynamics.

vs. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

gemini-3 · 5/5/2026

Paper 2 introduces a foundation model for Multi-Agent Reinforcement Learning, addressing a core AI challenge by unifying diverse environments under a single architecture. This methodological breakthrough has broad implications for autonomous systems and general AI. While Paper 1 offers a highly valuable real-world benchmark for clinical trials, Paper 2's foundational approach is likely to drive wider adoption, extensive follow-up research, and broader theoretical impact across the AI community.

vs. First-Order Efficiency for Probabilistic Value Estimation via A Statistical Viewpoint

gpt-5.2 · 5/5/2026

Paper 1 likely has higher impact: it introduces a live, open-access benchmark/platform with a novel automated decontamination pipeline addressing a major confound (information leakage) in clinical trial outcome prediction. This enables rigorous, prospective evaluation with broad real-world relevance (biomedicine, forecasting, ML benchmarking) and can become shared infrastructure adopted by many researchers over time. Paper 2 is methodologically rigorous and valuable for efficient Shapley/semivalue estimation, but is a narrower technical contribution with less immediate cross-domain societal application than a vetted, live clinical prediction platform.

vs. The Compliance Trap: How Structural Constraints Degrade Frontier AI Metacognition Under Adversarial Pressure

gemini-3 · 5/5/2026

Paper 2 offers profound insights into frontier AI safety, a critical and rapidly evolving field. By rigorously evaluating 11 models across 67k records, it uncovers a fundamental failure mode—metacognitive collapse driven by compliance-forcing instructions rather than adversarial threats. This actionable finding directly impacts AI alignment methodologies and safe deployment in high-stakes environments. While Paper 1 presents a valuable biomedical forecasting platform, Paper 2's theoretical contribution to understanding and mitigating AI cognitive degradation promises broader, more immediate impact across the entire AI research, safety, and policy landscape.

vs. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

gemini-3 · 5/5/2026

Paper 2 demonstrates higher potential scientific impact due to its unprecedented scale and breadth of application. While Paper 1 introduces a valuable platform for clinical trial prediction, Paper 2 tackles a systemic crisis in all of science: the breaking point of peer review. By successfully deploying AI reviews across nearly 23,000 AAAI submissions and proving author preference over human reviews for technical accuracy, it establishes a paradigm shift for the entire scientific publishing ecosystem. This promises immediate, cross-disciplinary impact on how research is evaluated globally.

vs. Counterfactual Reasoning in Automated Planning

gpt-5.2 · 5/5/2026

Paper 2 has higher potential impact: it introduces a live, open-access platform with time-stamped, uncontaminated evaluation—addressing a major methodological gap in forecasting and clinical trial outcome prediction. The decontamination pipeline (LLM-powered web search + validation) is a concrete innovation enabling rigorous, ongoing benchmarking. Applications are high-stakes and immediate (drug development, trial design, biomedical decision-making), and the platform can influence multiple fields (ML evaluation, forecasting, biomedical informatics, evidence synthesis). Paper 1 is a useful survey and agenda-setter but is less novel and typically yields narrower direct impact than an enabling infrastructure benchmark.