Jannik Lübberstedt, Krischan Braitsch, Jacqueline Lammert, Christof Winter, Florian Gabriel, Tristan Lemke, Christopher Zirn, Markus Graf
Routine laboratory panels drawn during cancer treatment constitute longitudinal physiological recordings of organ function, yet their temporal structure is discarded by single-timepoint prognostic tools. A transformer trained on 2,777,595 laboratory measurements from 3,905 patients with multiple myeloma or ovarian cancer predicted the two-year onset of 162 treatment-associated complications, including therapy-related myelodysplastic syndromes, spanning eight clinical categories, achieving 1.5- to 6.1-fold enrichment above prevalence at the group level. It matched or outperformed non-sequential baselines across grouped endpoints (AUROC gains up to +0.11), demonstrating that longitudinal laboratory trajectories capture evolving complication-specific physiology inaccessible from isolated measurements. Predictions generalised across both cancers, divergence concentrating in disease-specific complications, and biomarker masking recovered signatures consistent with established pathophysiology. External validation on MIMIC-IV and MMRF CoMMpass confirmed transferability across independent healthcare systems (AUROC up to 0.85). Routine oncological laboratory data encode organ deterioration weeks to months before clinical onset, enabling complication-specific surveillance without additional testing infrastructure.
This paper presents a transformer-based model trained on longitudinal routine laboratory trajectories (2.78M measurements from 3,905 patients with multiple myeloma or ovarian cancer) to predict the two-year onset of 162 treatment-associated complications across eight clinical categories. The central thesis is that temporal patterns in routine lab panels—already collected in standard oncological care—encode organ-level deterioration before clinical manifestation, and that this temporal structure is lost by existing single-timepoint prognostic tools (e.g., CARG, CRASH scores). The key novelty lies in the combination of (1) complication-resolved multi-label prediction (162 endpoints vs. aggregate toxicity scores), (2) exploitation of temporal trajectories rather than cross-sectional snapshots, and (3) the claim that no new testing infrastructure is required.
Strengths in design: The study employs several rigorous design choices: patient-level splits preventing leakage, 5-fold cross-validation with logit ensembling, non-overlapping temporal windows (10–30 days), per-diagnosis exclusion of patients with pre-existing conditions, and bootstrap confidence intervals (1,000 resamples). The comparison against non-sequential baselines (XGBoost and logistic regression following the CoMET framework) appropriately isolates the temporal modeling contribution. The biomarker masking analysis—both single-feature and pairwise—provides interpretability without requiring attention-based explanations.
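Two of the design choices named above, patient-level splitting and logit ensembling across folds, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`patient_level_folds`, `ensemble_logits`), the round-robin fold assignment, and the clipping constant `eps` are all hypothetical choices made for the example.

```python
import math
import random


def patient_level_folds(patient_ids, n_folds=5, seed=0):
    """Assign every sample a fold index such that all samples of one
    patient share the same fold (no patient straddles train and test)."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    # Round-robin assignment of shuffled patients to folds (assumption).
    fold_of = {pid: i % n_folds for i, pid in enumerate(unique)}
    return [fold_of[pid] for pid in patient_ids]


def logit(p, eps=1e-6):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit transform."""
    return 1.0 / (1.0 + math.exp(-z))


def ensemble_logits(fold_probs):
    """Average the per-fold model predictions in logit space and map the
    mean back to a probability."""
    mean_logit = sum(logit(p) for p in fold_probs) / len(fold_probs)
    return sigmoid(mean_logit)
```

Averaging in logit space rather than probability space gives confident fold models more weight near 0 and 1; whether the paper clips or weights folds differently is not stated in this review.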
Concerns: Several methodological limitations temper enthusiasm. First, the model is trained and internally tested at a single German tertiary center, which introduces a significant risk of selection bias. Second, the roughly 25-year observation window (2000–2026) spans substantial changes in treatment standards, lab assays, and coding practices; the authors acknowledge these potential temporal confounders but do not address them analytically. Third, the model receives no treatment information, so it cannot disentangle disease progression from treatment toxicity, a fundamental limitation for a tool purportedly targeting "treatment-associated" complications. Fourth, some key results rest on very small case counts (e.g., 21 myelodysplastic syndrome (MDS) cases, yielding a striking but unstable AUROC of 0.93 in ovarian cancer (OC) vs. 0.50 in multiple myeloma (MM)). Fifth, 56.9% of values are missing and are imputed by a transformer-based model; this fraction is substantial, and although the ablation shows minimal internal impact, imputation at this scale could mask important biases.
The external validation on MIMIC-IV and MMRF CoMMpass is valuable but reveals notable degradation for several endpoints (bacterial infections dropping to an AUROC of 0.56 on MIMIC-IV, fungal infections to 0.56), and the authors correctly identify that differences between the German Modification (GM) and Clinical Modification (CM) variants of ICD coding confound interpretation of cross-system performance.
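The concern about small case counts can be made concrete with a percentile bootstrap of the AUROC. The sketch below is a generic illustration, not the paper's procedure: it resamples individual predictions rather than patients, and the names `auroc` and `bootstrap_auroc_ci` are hypothetical. With only a few dozen positive cases, the resulting interval is typically wide.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive scores higher
    than a random negative; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auroc_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point estimate plus percentile bootstrap interval for the AUROC."""
    point = auroc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost one class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper
```

Resampling at the patient level instead of the prediction level would be the more conservative choice when one patient contributes several windows; that variant is omitted here for brevity.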
Clinical utility: The promise of complication-specific surveillance from already-collected data is compelling. The 1.5- to 6.1-fold enrichment above prevalence for grouped endpoints suggests potential for risk stratification, though the moderate AUROCs (0.65–0.75) are insufficient for individual-level clinical decision-making without prospective calibration. The connection to emerging prevention strategies (e.g., CDK4/6 inhibition for TP53-mutant clonal hematopoiesis expansion) is forward-looking but speculative at this stage.
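One plausible way to read "enrichment above prevalence" is the complication rate among the highest-risk patients divided by the overall rate. The sketch below computes that ratio; the definition, the `top_fraction` cutoff, and the function name `fold_enrichment` are assumptions for illustration, since this review does not state how the paper defines the metric.

```python
def fold_enrichment(labels, scores, top_fraction=0.1):
    """Event rate among the top-scoring fraction of patients divided by
    the overall event rate (prevalence)."""
    n = len(labels)
    if n == 0:
        raise ValueError("empty input")
    prevalence = sum(labels) / n
    if prevalence == 0:
        raise ValueError("no positive cases; enrichment is undefined")
    k = max(1, round(n * top_fraction))  # size of the flagged group
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    rate_in_top = sum(y for _, y in ranked[:k]) / k
    return rate_in_top / prevalence
```

Under this reading, a 6.1-fold enrichment at 2% prevalence would correspond to roughly 12% of flagged patients developing the complication, which shows why enrichment alone does not establish individual-level utility.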
Methodological influence: The paper could influence how longitudinal laboratory data are modeled in oncology more broadly. The demonstration that temporal trajectories outperform cross-sectional summaries for specific endpoints (AUROC gains of +0.11 for MDS and +0.09 for fungal infections) provides evidence for investing in sequence-aware architectures. The biomarker masking framework offers a practical interpretability tool applicable beyond this specific application.
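The general idea behind single-feature biomarker masking can be sketched in a model-agnostic way: hide one laboratory parameter at a time and record how much a performance metric drops. This is a hedged illustration only; `masking_importance`, the dictionary-per-row input format, and the `mask_value` convention are hypothetical and do not reflect the paper's actual masking mechanism, which this review does not describe.

```python
def masking_importance(predict, rows, labels, feature_names, metric,
                       mask_value=None):
    """Return, per feature, the drop in `metric` when that feature is
    replaced by `mask_value` in every row.

    predict: callable mapping one row (dict of feature -> value) to a score
    metric:  callable (labels, scores) -> float, higher is better
    """
    baseline = metric(labels, [predict(r) for r in rows])
    drops = {}
    for name in feature_names:
        # Copy each row with only the current feature masked.
        masked_rows = [{**r, name: mask_value} for r in rows]
        masked_scores = [predict(r) for r in masked_rows]
        drops[name] = baseline - metric(labels, masked_scores)
    return drops
```

A large drop for a biomarker suggests the model relies on it for that endpoint. The pairwise analysis mentioned in the review would mask two features at once and compare the joint drop with the sum of the individual drops.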
Limitations on impact: The absence of prospective validation is a significant barrier to clinical translation. The paper does not demonstrate actionability—no decision thresholds are proposed, no clinical workflow integration is described, and no cost-effectiveness analysis is provided. The code and weights are promised but not yet available.
The work addresses a genuine gap: existing oncology risk tools are cross-sectional, aggregate, and poorly validated externally. The transformer architecture choice is timely, and the focus on repurposing existing data infrastructure ("cheapest dense longitudinal monitoring channel") is practically appealing in an era of cost-conscious healthcare. The connection to digital twin concepts positions the work within an active research trajectory.
The LLM-based clinical data extraction pipeline (Llama 3.3-70B) for myeloma cohort characterization is a noteworthy secondary contribution, though it is important that its outputs remain separate from the prediction model's inputs. The paper is well-written and transparent about limitations. The supplementary materials are comprehensive, with full per-endpoint results enabling independent assessment.
The claim that "routine laboratory data encode organ deterioration weeks to months before clinical onset" is supported for some endpoints but overstated for others where the temporal advantage is negligible. The work would benefit from clearer delineation of where temporal modeling adds value versus where static features suffice.
Generated Jun 9, 2026
Paper 1 has higher likely scientific impact due to clear clinical relevance and near-term real-world deployment potential using routinely collected labs, plus strong methodological rigor (large longitudinal dataset, multiple endpoints, cross-cancer generalization, mechanistic probing via masking, and external validation across independent health systems). Its breadth spans oncology, clinical informatics, and organ-specific physiology, and it addresses a timely need for early complication surveillance. Paper 2 is novel and useful within LLM training, but its impact is more concentrated in ML engineering and depends on broader adoption of OPD pipelines.
Paper 1 offers profound real-world clinical impact by utilizing universally available routine laboratory data to predict severe cancer complications. Its robust methodology, large-scale validation across independent healthcare systems, and potential to improve patient outcomes without requiring new testing infrastructure give it exceptional translational value, surpassing the algorithmic efficiency gains presented in Paper 2.
Paper 2 has higher likely scientific impact due to its large-scale, clinically grounded dataset, strong methodological validation (cross-cancer generalization plus external validation on independent health systems), and direct real-world applicability for early complication surveillance using existing lab infrastructure. Its approach is timely and broadly relevant across oncology, clinical informatics, and healthcare operations, with potential to change monitoring practices. Paper 1 is novel for geometry-generalizing GNN surrogates in FEA, but the scope (2D plates, limited geometries/loads) and narrower immediate translational pathway suggest comparatively smaller near-term impact.
Paper 2 likely has higher impact: it applies a timely, rigorous transformer approach to large-scale longitudinal clinical data, predicts many clinically relevant complications, and includes cross-cancer generalization plus external validation on independent datasets—key for real-world adoption. Its findings can influence oncology care, surveillance, and ML-for-health broadly. Paper 1 targets an important area (quantum error correction) and reports strong training improvements, but impact may be narrower and harder to translate without clear benchmarking against standard QEC codes/hardware constraints and broader validation.
Paper 2 presents a highly impactful real-world medical application, leveraging existing routine laboratory data to predict severe cancer treatment complications. Its external validation on independent datasets demonstrates strong methodological rigor and immediate potential to improve patient outcomes without additional infrastructure. While Paper 1 provides valuable insights into LLM behavior, Paper 2's direct life-saving implications and broad applicability in clinical oncology give it a higher potential for tangible scientific and societal impact.
Paper 1 demonstrates significantly higher scientific impact potential. It addresses a critical clinical problem, the early prediction of treatment complications in cancer, using a large-scale dataset of about 2.78 million measurements, validates across multiple independent datasets (MIMIC-IV, MMRF CoMMpass), and shows practical utility without requiring new infrastructure. The breadth of 162 complications across 8 clinical categories, combined with interpretable biomarker masking analysis, offers both methodological rigor and direct clinical applicability. Paper 2, while technically interesting, addresses a narrower optimization problem (tensor compression for LLMs) with incremental improvements over baselines and limited demonstrated real-world impact.
Paper 2 has higher potential scientific impact due to its direct clinical applicability—predicting organ-level complications weeks to months before onset using existing routine lab data, requiring no additional infrastructure. It demonstrates broad generalizability across cancers and healthcare systems (external validation on MIMIC-IV and CoMMpass), addresses 162 complications across eight clinical categories, and could transform oncological surveillance. Paper 1 makes an important methodological contribution to interpretability of deep learning on physiological signals, but its impact is narrower, primarily serving as an auditing tool for the ML/neuroscience community rather than enabling new clinical capabilities.
Paper 1 likely has higher scientific impact due to greater cross-domain novelty and breadth: it leverages massive longitudinal routine lab data with a transformer to predict a wide range of organ-level complications, includes mechanistic interpretability (masking), and shows external validation across independent health systems—key for clinical translation. The potential real-world benefit is substantial (earlier detection/surveillance without new infrastructure) and timely for ML in healthcare. Paper 2 is practically valuable for recommender-system RL robustness with production A/B wins, but its methodological contribution is narrower and less broadly generalizable beyond industrial recommendation settings.
Paper 1 demonstrates higher scientific impact potential due to its direct clinical applicability: it uses routine laboratory data already collected during cancer treatment to predict 162 complications weeks to months before onset, without requiring additional infrastructure. The large-scale validation (3,905 patients, 2.78M measurements) across multiple cancers and external datasets (MIMIC-IV, MMRF CoMMpass) shows robustness and generalizability. This addresses a pressing real-world clinical need in oncology. Paper 2, while theoretically rigorous in advancing multi-objective bandit theory with proactive queries, addresses a narrower algorithmic problem with less immediate broad-field impact.
Paper 2 likely has higher scientific impact due to strong real-world applicability (early detection of diverse treatment complications using routine labs), large-scale longitudinal dataset, external validation across independent systems (MIMIC-IV, CoMMpass), and breadth across clinical categories and cancers. The transformer-based trajectory modeling is timely and directly translatable to clinical surveillance without new infrastructure, increasing adoption potential. Paper 1 is novel methodologically, but impact may be narrower and dependent on broader acceptance/validation of fractal metrics for generalization beyond the evaluated benchmarks.