OncoTraj: a public benchmark for longitudinal resistance prediction in EGFR-mutant non-small-cell lung cancer on osimertinib

Abhijoy Sarkar, Aarchi Singh Thakur

Jun 9, 2026 · arXiv:2606.11144v1

cs.LG · q-bio.GN · q-bio.QM · stat.AP

#2842 of 5669 · cs.LG

Tournament Score: 1401 ± 41 (score axis: 1050 to 1750)

Win Rate: 59%

Rating: 4/10

Significance 4.5 · Rigor 7 · Novelty 3.5 · Clarity 7.5

Abstract

Resistance to first-line osimertinib in EGFR-mutant non-small-cell lung cancer (NSCLC) is the canonical example of predictable clonal evolution under therapeutic pressure, yet no public benchmark exists for training or evaluating computational models on the corresponding longitudinal patient trajectories. We introduce OncoTraj, a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three real-world clinical-genomic sources: MSK-CHORD (672 patients), AACR Project GENIE BPC NSCLC (34 patients), and the FLAURA molecular-resistance supplement (107 patients). OncoTraj defines three locked tasks: (A) binary classification of progression by a fixed 12-month landmark, (B) regression of time-to-first-progression in days, and (C) six-class classification of the dominant resistance mechanism. We release the harmonized dataset, patient-level train/validation/test splits with an audited no-leakage guarantee, an open-source evaluation harness, and six reference baselines spanning a majority-class predictor, logistic regression, random forest, XGBoost, an LSTM, and a multi-task transformer. With v1's single-timepoint snapshot features, no task clears chance on clean within-source evaluation: the uniformity of this ceiling across every model class localizes the limit to the input modality (single-snapshot tissue NGS rather than serial ctDNA), not the algorithm. The benchmark does recover a reproducible literature-consistent association: TP53 co-mutation raises the 12-month progression rate from 29% to 59% cohort-wide. OncoTraj establishes a reproducible, leakage-audited baseline and converts the modality limit into concrete design requirements for a serial-ctDNA-enriched v2.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: OncoTraj

1. Core Contribution

OncoTraj introduces a public benchmark of 813 EGFR-mutant NSCLC patients receiving first-line osimertinib, harmonized from three clinical-genomic sources (MSK-CHORD, GENIE BPC, FLAURA). It defines three locked prediction tasks—12-month landmark progression (binary), time-to-progression (regression), and resistance mechanism classification (6-class)—alongside patient-level splits, an evaluation harness, and six reference baselines. The paper's central thesis is not that these tasks are solved, but rather that they are *unsolvable* with the current input modality (single-snapshot tissue NGS), thereby converting a negative result into a concrete specification for future data collection (serial ctDNA).

This is a benchmark-infrastructure paper, not a methods paper. Its contribution lies in standardizing the problem formulation, providing leakage-audited splits, and documenting exactly where and why current data fails—rather than in algorithmic novelty.
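
For illustration, the 12-month landmark label of Task A could be derived along the following lines. This is a minimal sketch, not the authors' labeling code: the 365-day cutoff, the function name `landmark_label`, and the rule that patients censored before the landmark receive no label are all assumptions made here.

```python
from typing import Optional

LANDMARK_DAYS = 365  # assumed day count for the "12-month" landmark


def landmark_label(days_to_progression: Optional[int],
                   days_of_followup: int,
                   landmark: int = LANDMARK_DAYS) -> Optional[int]:
    """Return 1 if progression occurred by the landmark, 0 if the patient
    was followed past the landmark without progressing by then, and None
    if the patient was censored before the landmark (label undefined)."""
    if days_to_progression is not None and days_to_progression <= landmark:
        return 1
    if days_of_followup >= landmark:
        return 0
    return None


# Hypothetical patients: (days to progression or None, days of follow-up)
patients = [(200, 200), (500, 500), (None, 700), (None, 150)]
labels = [landmark_label(p, f) for p, f in patients]
print(labels)  # [1, 0, 0, None]
```

Under this assumed rule, a patient who progresses after the landmark counts as a non-progressor for Task A, which is what distinguishes a fixed-landmark label from an "ever-progresses" label.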

2. Methodological Rigor

The paper demonstrates unusual honesty and rigor in several respects:

Strengths in methodology:

The three-stage leakage/confound audit trail (§6.1) is exemplary. The authors document how an initial AUC of 0.92 collapsed to chance after fixing temporal leakage, how a degenerate "ever-progresses" label was reformulated to a balanced 12-month landmark, and how source-flag confounds inflated the headline number from 0.680 to 0.716. This transparency is rare and valuable.

The within-source vs. mixed-source reporting discipline is commendable. The authors consistently present the conservative MSK-CHORD-only estimate (AUC 0.596, CI includes 0.50) alongside the mixed-source figure, preventing readers from over-interpreting cross-source structure as genuine signal.

The distribution-shift analysis (v1.1) with Cox PH baseline adds robustness evidence.

Bootstrap confidence intervals on all key metrics with explicit test-set sizes.
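
As a rough illustration of the bootstrap reporting mentioned above, the sketch below computes a percentile bootstrap interval for AUC using only the standard library. It is not the benchmark's evaluation harness: the helper names `auc` and `bootstrap_auc_ci`, the 1000 resamples, and the synthetic 91-patient data are assumptions chosen for the example.

```python
import random


def auc(labels, scores):
    """Rank-based AUC: probability that a random positive outscores a
    random negative, counting ties as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC, resampling patients with replacement."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:  # AUC is undefined without both classes
            continue
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Synthetic example: 91 patients with scores unrelated to the labels.
rng = random.Random(1)
y = [rng.randint(0, 1) for _ in range(91)]
s = [rng.random() for _ in range(91)]
point = auc(y, s)
lo, hi = bootstrap_auc_ci(y, s)
print(f"AUC {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval that contains 0.50 is the usual reading of "does not clear chance"; resampling at the patient level keeps each patient's label and score together.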

Weaknesses in methodology:

The cohort is dominated by MSK-CHORD (672/813 = 83%), making "three-source" diversity somewhat nominal. GENIE BPC contributes only 34 patients.

The FLAURA subset lacks individual progression dates, requiring a constant pseudo-target that creates the very confound the authors then spend considerable effort dissecting. Including FLAURA in Tasks A and B is questionable—the authors themselves recommend dropping it in v2.

The "813 patients" headline is misleading for practical modeling: effective test sizes are 110 (Task A), 85 (Tasks B, C), and within-source MSK-CHORD test is only 91 patients. These are small for reliable benchmark evaluation.

Feature engineering is rudimentary (9 binary co-mutation flags, VAF statistics, clinical covariates). While this is partly the point—demonstrating the modality ceiling—it also means the benchmark doesn't test whether richer feature engineering from the same data could help.
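
To give a sense of how much uncertainty a 91-patient test set carries, the sketch below applies the Hanley and McNeil (1982) approximation for the standard error of an AUC. The class split of 45 progressors and 46 non-progressors is hypothetical (the per-class counts are not given here), and the normal-approximation interval is not the paper's bootstrap interval.

```python
import math


def hanley_mcneil_se(auc, n_pos, n_neg):
    """Approximate standard error of an AUC (Hanley & McNeil, 1982)."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)


auc = 0.596            # within-source MSK-CHORD AUC reported in the review
n_pos, n_neg = 45, 46  # hypothetical class split of the 91 test patients
se = hanley_mcneil_se(auc, n_pos, n_neg)
lo, hi = auc - 1.96 * se, auc + 1.96 * se
print(f"SE {se:.3f}, approx. 95% CI [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the interval spans roughly 0.48 to 0.71, so an AUC near 0.60 cannot be separated from chance at this sample size.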

3. Potential Impact

Positive potential:

The benchmark fills a genuine gap: there is no standardized public evaluation framework for computational resistance prediction in EGFR-mutant NSCLC. Even as a "floor," this enables reproducible comparison of future methods.

The conversion of a negative result into design specifications for v2 (≥3 ctDNA timepoints per patient, serial sampling) is a useful contribution to the field's data-collection priorities.

The TP53 co-mutation association (29% vs. 59% 12-month progression rate) is a reproducible, literature-consistent finding that validates the benchmark captures real biology.

The open-source evaluation harness and leakage audit infrastructure are reusable.
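
The size of the TP53 association can be made concrete from the two reported rates alone. The sketch below is illustrative arithmetic, not an analysis from the paper: the function name `effect_sizes` is an assumption, and because group sizes are not given here, no confidence interval is attempted.

```python
def effect_sizes(rate_without, rate_with):
    """Risk difference, relative risk, and odds ratio from two rates."""
    risk_difference = rate_with - rate_without
    relative_risk = rate_with / rate_without
    odds_with = rate_with / (1 - rate_with)
    odds_without = rate_without / (1 - rate_without)
    return risk_difference, relative_risk, odds_with / odds_without


# 12-month progression rates quoted in the abstract
rd, rr, odds_ratio = effect_sizes(rate_without=0.29, rate_with=0.59)
print(f"RD {rd:.2f}, RR {rr:.2f}, OR {odds_ratio:.2f}")
# RD 0.30, RR 2.03, OR 3.52
```

A cohort-level association of this size does not by itself yield patient-level discrimination, which is consistent with the near-chance AUCs reported for Task A.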

Limiting factors:

The practical utility for methods development is currently near-zero: no task is solvable above chance on clean within-source evaluation, meaning researchers cannot meaningfully iterate on algorithms using v1. The benchmark is essentially a placeholder awaiting v2's serial ctDNA data.

The clinical oncology community may view this as premature: the "benchmark" demonstrates primarily that publicly available data is insufficient, which oncologists already know from the ctDNA literature.

The 813-patient size, while reasonable for a clinical cohort, is small for ML benchmarking. The effective sizes per task and per source are even smaller.

4. Timeliness & Relevance

The paper addresses a real need: resistance prediction on osimertinib is clinically important, and the lack of standardized evaluation frameworks hampers computational oncology research. The timing is appropriate—serial ctDNA monitoring is becoming routine, and computational methods are being developed that will need benchmarks. However, v1 arrives too early to be useful for actual method development, making it more of a position statement than a functional benchmark.

5. Strengths & Limitations

Key strengths:

Exceptional transparency about limitations, confounds, and negative results

Rigorous leakage audit with automated tests and full audit trail

Open-source code, data, and evaluation infrastructure

Clinically grounded task definitions with careful labeling guidelines

The TP53 finding validates biological signal in the data

Key limitations:

No task achieves above-chance performance within-source, making this a benchmark where nothing can currently be benchmarked

Heavy reliance on a single source (MSK-CHORD)

The "longitudinal" framing is aspirational—inputs are mostly single-timepoint snapshots

The FLAURA subset introduces more confusion than value

Small effective sample sizes undermine statistical power

v2 is contingent on data partnerships that may or may not materialize

The paper is extremely long for what amounts to a dataset description paper with null results

Authors have a potential conflict of interest as co-founders of a precision oncology company, though they disclose this

Summary

OncoTraj is a well-intentioned and transparently documented benchmark that currently serves more as a detailed negative result and specification document than as a functional evaluation platform. Its primary value is in formalizing the problem, documenting exactly why single-snapshot tissue NGS is insufficient for resistance prediction, and establishing infrastructure for a future v2 with serial ctDNA. The exceptional honesty about limitations is laudable but also reveals that the benchmark is premature as a practical tool. The impact will depend entirely on whether v2 materializes with adequate serial molecular data.

Rating: 4/10

Significance 4.5 · Rigor 7 · Novelty 3.5 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (17)

Lost vs. Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Paper 1 provides foundational theoretical guarantees for distributed machine learning, addressing a critical bottleneck (stragglers in ASGD) with broad applicability across large-scale AI. Its high-probability convergence proofs under heavy-tailed noise represent a significant methodological advance. While Paper 2 offers a valuable clinical benchmark, its narrow focus on a specific cancer subtype and its baseline negative results limit its cross-disciplinary reach compared to the universal utility of robust distributed optimization algorithms.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Towards More General Control of Diffusion Models Using Jeffrey Guidance

Paper 1 introduces a broadly applicable, principled mathematical framework for controlling diffusion models, a highly active and widely impactful area of AI research. Its ability to improve sample quality and enforce fairness constraints gives it immense cross-disciplinary potential. While Paper 2 provides a highly valuable medical benchmark, its impact is constrained to a specific subfield of oncology, whereas Paper 1's methodological innovation will likely influence a wider array of domains and generate broader scientific interest.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

OncoTraj addresses a critical unmet need in precision oncology by creating the first public benchmark for longitudinal resistance prediction in EGFR-mutant NSCLC. It provides a harmonized dataset, standardized tasks, evaluation infrastructure, and baselines that can catalyze an entire research community. Its identification of specific data modality limitations (single-snapshot vs. serial ctDNA) provides actionable design requirements. Paper 1 presents an interesting adversarial robustness finding, but RL-based gradient disruption is incremental within an already crowded adversarial ML field, and the practical utility remains uncertain given potential adaptive attacks. OncoTraj's benchmark infrastructure has broader, more durable impact across ML and oncology.

claude-opus-4-6 · Jun 11, 2026

Won vs. MODIP: Efficient Model-Based Optimization for Diffusion Policies

Paper 2 likely has higher scientific impact due to releasing a large, harmonized, leakage-audited public clinical-genomic benchmark with locked tasks and an evaluation harness. Such resources can catalyze broad, long-term work across ML, oncology, bioinformatics, and regulatory/clinical translation, and its explicit identification of a modality ceiling informs future data collection (serial ctDNA) and study design. Paper 1 is a solid, timely algorithmic contribution for diffusion-policy fine-tuning in robotics, but within the fast-moving RL and model-based control literature its impact is narrower and more incremental than that of a new public benchmark in precision oncology.

gpt-5.2 · Jun 10, 2026

Won vs. EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

OncoTraj addresses a critical unmet need in precision oncology by providing the first public benchmark for longitudinal resistance prediction in EGFR-mutant NSCLC. It creates reusable infrastructure (harmonized dataset, evaluation harness, leakage-audited splits) that can catalyze an entire research community around a clinically important problem. Its honest reporting of negative results (no model beats chance with current features) provides actionable insight directing future data collection (serial ctDNA). While Paper 1 offers incremental improvements in prompt learning for LLM agents, Paper 2 has broader cross-disciplinary impact spanning oncology, genomics, and ML, with direct translational potential.

claude-opus-4-6 · Jun 10, 2026

Lost vs. OPRD: On-Policy Representation Distillation

Paper 1 addresses a major bottleneck in LLM distillation with a novel hidden-state approach, offering immediate, broad impact through significant gains in reasoning performance, training speed, and memory efficiency. Paper 2 introduces a valuable oncology benchmark, but its immediate impact is constrained by the negative results of its initial single-snapshot data modality.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Geometrically Averaged Hard Target Updates for Linear Q-Learning

OncoTraj addresses a critical gap in computational oncology by providing the first public benchmark for longitudinal resistance prediction in EGFR-mutant NSCLC, enabling reproducible model development for a clinically important problem. It has strong real-world medical applications, promotes open science with released data/code, and identifies concrete modality limitations guiding future work. Paper 2 makes a narrower theoretical contribution analyzing a target update variant in linear Q-learning—a well-studied area with incremental novelty. OncoTraj's potential to catalyze precision oncology research gives it broader and more timely impact across clinical and ML communities.

claude-opus-4-6 · Jun 10, 2026

Won vs. Unifying Local Communications and Local Updates for LLM Pretraining

Paper 2 likely has higher scientific impact: it creates a large, public, leakage-audited clinical-genomic benchmark with locked tasks, splits, and an evaluation harness, an enabling resource that can standardize and accelerate work across oncology, ML, and biomarker development. It also clearly identifies a modality ceiling and sets concrete requirements for improved data collection (serial ctDNA), which can steer the field. Paper 1 is technically novel and timely for distributed LLM training, but its impact may be confined to systems/ML training infrastructure, where it may also compete with fast-moving proprietary implementations.

gpt-5.2 · Jun 10, 2026

Lost vs. Harpoon: Generalised Manifold Guidance for Conditional Tabular Diffusion

Paper 2 presents a general methodological advancement in machine learning (conditional tabular diffusion) with broad applicability across numerous domains relying on tabular data. In contrast, Paper 1 introduces a highly specialized medical benchmark where current data modalities fail to exceed chance performance. The fundamental algorithmic innovation in Paper 2 offers significantly broader impact and immediate practical utility across diverse fields compared to the domain-specific stepping-stone dataset in Paper 1.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Encoding the Euler Characteristic Transform

While Paper 2 offers a valuable methodological advancement in topological data analysis, Paper 1 is likely to have higher scientific impact due to its profound clinical relevance. By introducing a harmonized, leakage-audited public benchmark for longitudinal cancer resistance prediction, Paper 1 addresses a critical bottleneck in computational oncology. Although it highlights the limitations of current single-snapshot data, establishing this standardized evaluation framework will catalyze future algorithmic development and guide next-generation data collection efforts, directly paving the way for predictive models that could improve patient outcomes in cancer treatment.

gemini-3.1-pro-preview · Jun 10, 2026
