Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

Jun 10, 2026 · arXiv:2606.11794v1

cs.LG · cs.AI

#4682 of 5669 · cs.LG

Tournament Score: 1308 ± 44

Win Rate: 30%

Rating: 4.5/10

Significance 4.5 · Rigor 5 · Novelty 3.5 · Clarity 6.5

Abstract

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents a multimodal deep learning framework that combines T1-weighted structural MRI with demographic/genetic tabular data for automated staging of Alzheimer's disease severity according to the Clinical Dementia Rating (CDR) scale. The key methodological contributions are: (1) attention-based fusion of imaging and tabular modalities, (2) ordinal regression heads (CORAL/CORN) that respect the ordered nature of CDR stages, and (3) explainability analyses via Grad-CAM++ and SHAP. The framework is evaluated across three cohorts (ADNI, AIBL, NIFD) with a strictly held-out test set.
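To make the ordinal-head idea concrete, the following is a minimal sketch of a CORAL-style formulation in plain Python, assuming the three CDR stages (0, 0.5, 1) are mapped to stage indices 0, 1, 2. It is not the authors' implementation: the function names, the shared score, and the bias values are hypothetical, and CORN (which models conditional probabilities instead) is not shown.

```python
import math

NUM_STAGES = 3  # assumption: CDR 0, 0.5, 1 mapped to stage indices 0, 1, 2


def encode_ordinal(stage, num_stages=NUM_STAGES):
    """Turn a stage index into K-1 binary targets answering 'is the stage > k?'."""
    return [1 if stage > k else 0 for k in range(num_stages - 1)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def coral_probs(shared_score, biases):
    """CORAL-style head: one shared score plus one bias per threshold.

    Sharing the score across thresholds means that, with decreasing biases,
    P(stage > 0) >= P(stage > 1), so the threshold outputs stay consistent.
    """
    return [sigmoid(shared_score + b) for b in biases]


def decode_ordinal(threshold_probs):
    """Predicted stage = number of thresholds passed with probability > 0.5."""
    return sum(1 for p in threshold_probs if p > 0.5)


def coral_loss(threshold_probs, targets):
    """Sum of binary cross-entropies over the K-1 threshold tasks."""
    eps = 1e-12
    return -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(threshold_probs, targets)
    )


# Hypothetical bias values, chosen only for illustration.
biases = [1.0, -1.0]
for score in (-2.0, 0.0, 2.0):
    probs = coral_probs(score, biases)
    print(score, [round(p, 3) for p in probs], decode_ordinal(probs))
```

In contrast to a softmax head, a prediction that is off by two stages here violates two threshold targets rather than one, which is one way an ordinal loss can penalize distant errors more heavily.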

Methodological Rigor

Strengths in experimental design: The paper demonstrates reasonable methodological care in several areas. Subject-level splitting to prevent data leakage, cohort-stratified train/validation splits, and a strictly held-out test set are all important design choices. Imputation was restricted to training data, and bootstrap confidence intervals (N=1000) were computed for key metrics.
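Subject-level splitting can be sketched as follows. This is an illustration only, assuming each scan record carries a cohort name and a subject identifier; the keys `cohort`, `subject_id`, and `visit`, the function name, and the toy data are hypothetical and do not come from the paper.

```python
import random
from collections import defaultdict


def subject_level_split(records, test_fraction=0.2, seed=0):
    """Split scan records so that every subject lands in exactly one partition.

    Subjects are shuffled within each cohort, so each cohort contributes
    subjects to both the training side and the held-out side.
    """
    rng = random.Random(seed)
    subjects_by_cohort = defaultdict(set)
    for rec in records:
        subjects_by_cohort[rec["cohort"]].add(rec["subject_id"])

    held_out = set()
    for cohort in sorted(subjects_by_cohort):
        subjects = sorted(subjects_by_cohort[cohort])
        rng.shuffle(subjects)
        n_test = max(1, round(len(subjects) * test_fraction))
        # Key on (cohort, subject) in case identifiers repeat across cohorts.
        held_out.update((cohort, s) for s in subjects[:n_test])

    train, test = [], []
    for rec in records:
        key = (rec["cohort"], rec["subject_id"])
        (test if key in held_out else train).append(rec)
    return train, test


# Toy data: 3 cohorts x 10 subjects x 2 visits (entirely synthetic).
records = [
    {"cohort": cohort, "subject_id": f"S{idx:03d}", "visit": visit}
    for cohort in ("ADNI", "AIBL", "NIFD")
    for idx in range(10)
    for visit in (1, 2)
]
train, test = subject_level_split(records)
train_ids = {(r["cohort"], r["subject_id"]) for r in train}
test_ids = {(r["cohort"], r["subject_id"]) for r in test}
print(len(train), len(test), "overlap:", train_ids & test_ids)
```

Splitting by scan rather than by subject would let repeat visits of the same person appear on both sides, which is the leakage the design choice above is meant to prevent. Fitting imputation statistics on `train` only follows the same principle.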

Concerns about rigor:

1. Performance levels are modest. The best QWK of 0.549 represents only moderate agreement with clinical staging. The overall accuracy of 0.667 for a 3-class problem (CDR 0, 0.5, 1) is not substantially above what simple baselines might achieve, particularly given the class distribution (50.2% CDR 0). The F1-scores for CDR 1 (0.477) and CDR 0.5 (0.557) suggest the model struggles precisely where clinical utility matters most.

2. Limited severity range. Excluding CDR 2 and 3 reduces the problem to distinguishing cognitively normal, very mild, and mild dementia. These are the most clinically challenging distinctions, but the restriction also weakens the claim of "severity staging." The practical utility of the ordinal formulation is somewhat diminished when only three ordered categories remain.

3. Ordinal vs. non-ordinal comparison is ambiguous. The ordinal model achieves higher QWK (0.549 vs. 0.477) but worse MAE (0.363 vs. 0.340). The authors argue QWK better captures ordinal structure, which is reasonable, but the improvements are modest and the confidence intervals overlap substantially (QWK ordinal: 0.493–0.604 vs. non-ordinal: 0.413–0.528).

4. No comparison to established baselines. The paper lacks comparison with existing published methods for CDR prediction, volumetric/FreeSurfer-based approaches, or simpler ML baselines (e.g., random forest on extracted brain volumes). Without these, it is difficult to contextualize performance.

5. TabPFN integration is awkward. The authors acknowledge that TabPFN does not natively support ordinal regression, with ordinal constraints applied only at the fusion stage. This means the tabular branch is not truly ordinal, potentially undermining the theoretical motivation.

6. Class imbalance handling. CDR 1 comprises only 11.7% of data. While weighted sampling was used during training, the severe drop in CDR 1 recall on the test set (0.50 for ordinal, 0.67 for non-ordinal) suggests this remains inadequately addressed. Notably, the non-ordinal model actually achieves better CDR 1 recall.
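As a rough reference point for concerns 1 and 2, the metrics of a constant predictor can be worked out from the class shares quoted above. The sketch below rests on several assumptions: the three stages are exhaustive (so the CDR 0.5 share is the remainder, about 38.1%), the quoted shares also hold on the test set, and errors are measured in stage-index units (0, 1, 2), which may differ from the units behind the paper's MAE.

```python
# Class shares quoted in the review; keys are stage indices (0 = CDR 0, 1 = CDR 0.5, 2 = CDR 1).
share = {0: 0.502, 2: 0.117}
share[1] = 1.0 - share[0] - share[2]  # inferred remainder, assuming three exhaustive stages


def constant_predictor_metrics(share, predicted_stage):
    """Accuracy, MAE (in stage-index units) and adjacent-stage accuracy
    for a model that always outputs `predicted_stage`."""
    accuracy = share[predicted_stage]
    mae = sum(p * abs(stage - predicted_stage) for stage, p in share.items())
    adjacent = sum(p for stage, p in share.items() if abs(stage - predicted_stage) <= 1)
    return accuracy, mae, adjacent


for stage in (0, 1):
    acc, mae, adj = constant_predictor_metrics(share, stage)
    print(f"always stage {stage}: accuracy={acc:.3f} MAE={mae:.3f} adjacent={adj:.3f}")
```

Under these assumptions, always predicting CDR 0 gives an accuracy of about 0.50 and an adjacent-stage accuracy of about 0.88, and always predicting CDR 0.5 gives an adjacent-stage accuracy of 1.0, because with three stages the only non-adjacent error is confusing CDR 0 with CDR 1. Adjacent-stage accuracy is therefore weakly informative here, whereas QWK is 0 for any constant predictor by construction.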

Potential Impact

The clinical motivation—automating CDR staging to reduce clinician burden and inter-rater variability—is sound and practically relevant. However, the current performance levels (67% accuracy, QWK 0.55) would be insufficient for clinical deployment. The framework could serve as a research tool or screening aid, but the paper does not discuss operational thresholds for clinical acceptability.

The use of routinely acquired T1w MRI and basic demographics is a practical strength, as it avoids dependence on PET or CSF biomarkers. The multi-cohort evaluation adds some credibility regarding generalizability, though all three cohorts are research datasets with relatively similar populations.

The interpretability analyses (Grad-CAM++, SHAP) are presented at a surface level—single example visualizations without systematic validation. Showing that the model highlights the hippocampus is expected and does not constitute rigorous interpretability validation.

Timeliness & Relevance

The topic is timely given the approval of disease-modifying therapies (lecanemab, donanemab) that require early and accurate staging. Ordinal regression for clinical scales is a relevant methodological direction that deserves more attention in the neuroimaging community. However, the specific combination of components (3D ResNet + attention fusion + ordinal heads) is relatively incremental rather than representing a paradigm shift.

Strengths

Well-motivated clinical problem with clear practical relevance

Multi-cohort evaluation with proper data leakage prevention

Systematic comparison of unimodal vs. multimodal and ordinal vs. non-ordinal approaches

Bootstrap confidence intervals for robustness assessment

Use of routinely available clinical data

Limitations

Modest absolute performance that limits clinical translatability

No comparison with published state-of-the-art methods or simpler baselines

Shallow interpretability analysis without quantitative validation

Only three CDR categories, limiting ordinal modeling benefits

Overlapping confidence intervals between ordinal and non-ordinal approaches weaken the central claim

No external validation on truly independent cohorts (test set drawn from same source cohorts)

Missing important methodological details: which fusion strategy was ultimately selected, specific hyperparameter values chosen, computational requirements

The attention mechanism's contribution is not ablated independently from the ordinal head
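On the overlapping-interval limitation listed above: overlap between two separately computed confidence intervals does not by itself settle whether the difference is significant, because both models are scored on the same test subjects. A paired bootstrap on the QWK difference is one common way to check. The sketch below is an illustration with synthetic labels, not a reanalysis of the paper; the function names and the noise model are hypothetical.

```python
import random


def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for integer labels 0..num_classes-1."""
    n = len(y_true)
    observed = [[0.0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]
    num = den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2  # the normalising constant cancels in the ratio
            num += w * observed[i][j]
            den += w * true_hist[i] * pred_hist[j] / n
    # Convention chosen for this sketch: return 0.0 when kappa is undefined.
    return 1.0 - num / den if den > 0 else 0.0


def paired_bootstrap_diff(y_true, pred_a, pred_b, num_classes, n_boot=1000, seed=0):
    """Approximate 95% percentile interval for QWK(a) - QWK(b).

    The same resampled indices are used for both models. With several scans
    per subject, subjects rather than scans should be resampled.
    """
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        t = [y_true[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(
            quadratic_weighted_kappa(t, a, num_classes)
            - quadratic_weighted_kappa(t, b, num_classes)
        )
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


def corrupt(labels, error_rate, seed):
    """Synthetic 'model': shift a fraction of labels by one stage, clipped to 0..2."""
    r = random.Random(seed)
    return [
        min(2, max(0, y + r.choice([-1, 1]))) if r.random() < error_rate else y
        for y in labels
    ]


label_rng = random.Random(1)
y_true = [label_rng.choice([0, 0, 0, 1, 1, 2]) for _ in range(300)]
pred_a = corrupt(y_true, 0.3, seed=2)
pred_b = corrupt(y_true, 0.4, seed=3)
low, high = paired_bootstrap_diff(y_true, pred_a, pred_b, num_classes=3)
print(f"95% interval for QWK(a) - QWK(b): [{low:.3f}, {high:.3f}]")
```

If the resulting interval excludes zero, the paired difference is distinguishable from noise at roughly the 5% level even when the two marginal intervals overlap; if it includes zero, the ordinal advantage remains unconfirmed.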

Overall Assessment

This paper addresses a clinically relevant problem with a reasonable methodological framework, but the contributions are primarily integrative rather than novel. The performance gains from ordinal modeling, while directionally correct, are modest and not clearly statistically significant given overlapping confidence intervals. The lack of comparison to established baselines, limited interpretability validation, and moderate absolute performance weaken the impact. The paper reads as a competent application study rather than a methodological advance, and would benefit from stronger baselines, ablation studies, and more rigorous interpretability evaluation.

Rating: 4.5/10

Significance 4.5 · Rigor 5 · Novelty 3.5 · Clarity 6.5

Generated Jun 11, 2026

Comparison History (23)

Lost vs. Different Layers, Different Manifolds: Module-Wise Weight-Space Geometry in Transformer Optimization

Paper 1 addresses a fundamental question in deep learning optimization—whether different transformer modules benefit from different manifold geometries—providing novel insights into weight-space geometry that could influence how large language models are trained. This has broad implications across all transformer-based architectures. Paper 2 presents a solid but incremental contribution combining known techniques (ordinal regression, multimodal fusion, attention mechanisms) for AD staging, with moderate performance improvements. Paper 1's novelty in module-specific geometric optimization and its potential to reshape training practices for widely-used architectures gives it higher impact potential.

claude-opus-4-6 · Jun 12, 2026

Won vs. MP3: Multi-Period Pattern Pre-training for Spatio-Temporal Forecasting

Paper 2 targets a critical, high-impact clinical challenge (Alzheimer's staging) with immediate real-world relevance. It demonstrates high methodological rigor through multi-cohort validation, strict held-out testing, and the integration of explainability tools crucial for medical AI. While Paper 1 offers a strong, broadly applicable algorithmic plugin, Paper 2's direct potential to improve clinical decision support and patient outcomes in neurodegenerative diseases gives it a higher potential for significant scientific and societal impact.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Emotional regulation improves deep learning-based image classification

Paper 2 has higher likely scientific impact due to clear clinical relevance (Alzheimer’s severity staging), strong methodological rigor (multi-cohort data, subject-level splits, strictly held-out test set, leakage controls), and actionable real-world deployment potential with interpretability (Grad-CAM++/SHAP). Its ordinal modeling aligns with clinically ordered outcomes and demonstrates measurable gains in agreement with staging, supporting adoption. Paper 1 is more speculative: “artificial subjective experience” is conceptually novel but less clearly grounded, with impact currently limited to CIFAR-scale image classification and weaker immediate applicability beyond ML benchmarks.

gpt-5.2 · Jun 12, 2026

Won vs. Detecting Explanatory Insufficiency in Learned Representations: A Framework for Representational Vigilance

Paper 2 presents a concrete, empirically validated multimodal framework for Alzheimer's disease severity staging with clear real-world clinical applications, rigorous methodology (multiple datasets, held-out test sets, explainability analyses), and quantitative results. Paper 1 introduces a purely conceptual/theoretical diagnostic framework (VER) without empirical validation, making it speculative. While Paper 1 addresses an interesting gap in representation evaluation, its lack of experimental evidence significantly limits its near-term scientific impact compared to Paper 2's immediately applicable contributions to clinical AI and neurodegenerative disease assessment.

claude-opus-4-6 · Jun 12, 2026

Won vs. Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Paper 1 addresses a critical healthcare challenge (Alzheimer's staging) using a robust, interpretable multimodal framework. Advancing automated clinical decision support for neurodegenerative diseases offers profound, long-term societal and medical benefits. Conversely, Paper 2 presents valuable engineering optimizations for deploying a specific generative AI model on consumer hardware. While highly practical, its scientific impact is narrower and more transient compared to fundamental improvements in medical diagnostics.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

Paper 1 addresses a novel and underexplored problem—continual anomaly detection across heterogeneous tabular datasets with varying feature schemas—introducing multiple innovative components (AGF model, TaskFusion augmentation, dataset distillation for replay). This tackles a fundamental challenge in continual learning with broader applicability across domains. Paper 2, while methodologically sound, applies relatively established techniques (attention mechanisms, ordinal regression, multimodal fusion) to AD staging, representing an incremental advance in a well-studied area. Paper 1's novelty, breadth of evaluation (21 datasets), and generalizability across domains give it higher potential impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Redesign Mixture-of-Experts Routers with Manifold Power Iteration

Paper 2 likely has higher impact: it proposes a novel, general router-design principle for Mixture-of-Experts with a theoretically motivated algorithm (Manifold Power Iteration) and validation across large-scale (1B–11B) pretraining, aligning with timely, high-interest foundation-model scaling. Its contributions could broadly affect NLP and systems/optimization communities and improve widely used MoE architectures. Paper 1 is rigorous and clinically relevant, but is more incremental within established multimodal AD staging and may have narrower cross-field influence and higher translational barriers.

gpt-5.2 · Jun 11, 2026

Lost vs. MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

Paper 2 identifies a fundamental pathology in state-of-the-art transformer models for proteomics and proposes a highly novel, training-free solution with massive performance improvements (up to 39.1%). Its foundational contribution to computational mass spectrometry offers broader applicability and higher methodological innovation compared to Paper 1, which, while rigorous and clinically relevant, represents a more standard application of existing multimodal machine learning techniques.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Pretraining Recurrent Networks without Recurrence

Paper 2 proposes a fundamentally novel training paradigm (SMT) that addresses core limitations of RNN training—sequential computation and vanishing gradients—by decoupling memory learning from recurrent credit assignment. This has broad implications across all sequence modeling domains and could unlock scalable RNN pretraining, a long-standing challenge. Paper 1, while methodologically sound, represents an incremental application of existing techniques (attention, ordinal regression, multimodal fusion) to AD staging. Paper 2's potential to reshape how recurrent models are trained gives it significantly broader and deeper scientific impact.

claude-opus-4-6 · Jun 11, 2026

Lost vs. Implicit Neural Representations of Individual Behavior

Paper 2 introduces a highly novel methodological advancement by adapting Implicit Neural Representations (INRs) from vision to behavioral modeling. This creates a versatile framework for policy representation learning applicable across diverse domains like robotics, autonomous racing, and gaming. While Paper 1 offers a rigorous and clinically valuable application of multimodal learning for Alzheimer's staging, Paper 2 provides a foundational AI innovation with broader cross-disciplinary impact, addressing complex challenges like varying episode lengths and novel out-of-distribution shifts. Consequently, Paper 2 has a higher potential for widespread theoretical and practical adoption across the broader AI community.

gemini-3.1-pro-preview · Jun 11, 2026
