Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Jun 10, 2026 · arXiv:2606.12252v1

cs.LG · cs.AI

#4682 of 5669 · cs.LG

Tournament Score: 1308 ± 44

Win Rate: 30%

Rating: 4.5 / 10

Significance 4.5 · Rigor 4 · Novelty 5 · Clarity 6

Abstract

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ERTS — Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

1. Core Contribution

ERTS proposes augmenting Progressive Data Dropout (PDD) with a Grad-CAM-based "focus score" that serves as a training-time reliability signal. The core idea is that among uncertain samples (those not yet confidently classified), some are uncertain because they contain meaningful but under-learned patterns, while others are uncertain due to noise, label ambiguity, or artifacts. ERTS distinguishes between these two cases by computing Grad-CAM attention maps and measuring their spatial concentration (via the mean intensity of the top-10% salient regions). Samples with diffuse attention are filtered out; those with focused attention are retained for gradient updates.

The conceptual contribution—using explanation quality as a training-time data selection criterion rather than purely as a post-hoc interpretability tool—is genuinely interesting and represents a meaningful shift in how explainability methods can be employed. However, the technical novelty is relatively modest: the method is a two-stage filter combining an existing PDD framework with a simple Grad-CAM concentration metric.
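The two-stage filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each Grad-CAM map arrives as a non-negative array already scaled to [0, 1], and the names `focus_score`, `select_for_update`, `conf_threshold`, and `focus_threshold` are hypothetical. How the paper normalises its maps and how its thresholds map onto these two parameters is not restated in this assessment.

```python
import numpy as np


def focus_score(cam, top_fraction=0.10):
    """Mean of the largest `top_fraction` of values in an attention map.

    Assumption: `cam` is non-negative and already scaled to [0, 1] by the caller.
    """
    cam = np.asarray(cam, dtype=float).ravel()
    k = max(1, int(round(top_fraction * cam.size)))
    # np.partition places the k largest values in the last k positions.
    top_values = np.partition(cam, -k)[-k:]
    return float(top_values.mean())


def select_for_update(confidences, cams, conf_threshold=0.7, focus_threshold=0.5):
    """Return indices of samples kept for gradient updates.

    Stage 1 (PDD-style, assumed): drop samples the model already predicts confidently.
    Stage 2 (focus filter): among the remaining samples, keep those whose
    attention map has a focus score at or above `focus_threshold`.
    """
    kept = []
    for i, (p, cam) in enumerate(zip(confidences, cams)):
        if p >= conf_threshold:
            continue  # treated as learned, excluded from updates
        if focus_score(cam) >= focus_threshold:
            kept.append(i)
    return kept


# Synthetic maps for illustration only (not data from the paper).
cam_focused = np.zeros(100)
cam_focused[40:50] = 1.0          # one localised high-attention segment
cam_diffuse = np.full(100, 0.2)   # weak attention spread over the whole signal

kept = select_for_update(
    confidences=[0.95, 0.40, 0.40],
    cams=[cam_focused, cam_focused, cam_diffuse],
)
```

In this toy setup the first sample is dropped as already learned, the second is kept because it is uncertain but focused, and the third is dropped because it is uncertain and diffuse.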

2. Methodological Rigor

Experimental breadth is a strength. The paper evaluates across three ECG datasets (PTB-XL, CPSC 2018, Georgia 2020), three backbone architectures (EfficientNetV2-S, ResNet-18, MobileNetV2), and multiple PDD variants (DBPD, SMRD, SRD) with various threshold settings. This combinatorial evaluation (9 dataset-backbone pairs, dozens of configurations) lends credibility to the generalization claims.

However, several concerns arise:

Marginal improvements. The macro-F1 gains are very small—typically 0.002–0.01 in absolute terms. For instance, on CPSC 2018 with EfficientNetV2-S, the improvement is 0.7166→0.7188 (+0.0022). Without confidence intervals, statistical significance tests, or multiple-run variance reporting, it is impossible to determine whether these differences are meaningful or within noise. This is a critical omission for a paper claiming "consistent improvements."

No statistical testing. None of the results include standard deviations, confidence intervals, or significance tests. Given the small effect sizes, this is a major weakness.
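One way to address this gap is a paired bootstrap over repeated runs. The sketch below is an illustration of that kind of check, not something the paper reports. It assumes macro-F1 is available per random seed for both the PDD baseline and ERTS, with matching seeds; the function name `paired_bootstrap_ci` and its parameters are hypothetical.

```python
import numpy as np


def paired_bootstrap_ci(baseline, method, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap confidence interval for the mean paired difference (method - baseline).

    `baseline` and `method` hold one score per run, paired by random seed.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    baseline = np.asarray(baseline, dtype=float)
    method = np.asarray(method, dtype=float)
    if baseline.shape != method.shape or baseline.ndim != 1:
        raise ValueError("expected two 1-D arrays of equal length")

    diffs = method - baseline
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, n_boot times.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return float(diffs.mean()), float(lower), float(upper)
```

A call such as `paired_bootstrap_ci(f1_pdd_per_seed, f1_erts_per_seed)` would then show whether the interval for the gain excludes zero. With only a handful of seeds the interval is coarse, so this is a sanity check rather than a substitute for a properly powered test.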

Focus score simplicity. The focus score (Equations 3-4) is simply the mean of the top-10% Grad-CAM activations. No justification is provided for why the 90th percentile is the right cutoff, nor is there an ablation over this percentile threshold. The measure also conflates several properties: a high score could reflect a single spike or broadly elevated attention.
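Both concerns can be illustrated on synthetic maps. The sketch below assumes the focus score is the plain mean of the largest fraction of map values, with maps scaled to [0, 1]; the helper `top_fraction_mean` and all arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def top_fraction_mean(cam, top_fraction):
    """Mean of the largest `top_fraction` of values in `cam`."""
    cam = np.asarray(cam, dtype=float).ravel()
    k = max(1, int(round(top_fraction * cam.size)))
    return float(np.partition(cam, -k)[-k:].mean())


n = 200

# Conflation: a localised burst and a uniformly saturated map get the same score.
burst = np.zeros(n)
burst[90:110] = 1.0            # high attention on 10% of positions
saturated = np.ones(n)         # high attention everywhere
score_burst = top_fraction_mean(burst, 0.10)
score_saturated = top_fraction_mean(saturated, 0.10)

# Percentile sensitivity: a narrow spike covering 2% of positions.
narrow = np.zeros(n)
narrow[100:104] = 1.0
fractions = [0.02, 0.05, 0.10, 0.20]
sweep = {f: top_fraction_mean(narrow, f) for f in fractions}
```

Under these assumptions `score_burst` and `score_saturated` are identical, and the score of the narrow spike falls steadily as the fraction widens from 2% to 20%. An ablation over the percentile on real Grad-CAM maps would show whether the paper's rankings are similarly sensitive.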

Threshold sensitivity. The paper acknowledges that φ is dataset- and model-dependent, effectively adding another hyperparameter that must be tuned. The paper shows that φ=0.5 is too aggressive and that φ=0.9 works best for some settings while φ=0.7 works best for others, but it provides no principled method for selection.

Computational overhead not quantified. The paper claims Grad-CAM overhead is "small" but never reports wall-clock times or GPU-hour comparisons. Effective Epochs is a proxy metric that ignores the cost of computing Grad-CAM maps, which require backward passes through the network for each candidate sample.
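A rough way to fold the Grad-CAM cost into the efficiency metric is sketched below. It assumes Effective Epochs counts sample-level gradient updates divided by dataset size, and that scoring one candidate with Grad-CAM costs a fixed fraction `cam_cost_ratio` of one training update. Both the function `adjusted_effective_epochs` and that ratio are hypothetical; the real ratio would have to be measured in wall-clock time or FLOPs.

```python
def adjusted_effective_epochs(trained_per_epoch, scored_per_epoch,
                              dataset_size, cam_cost_ratio=1.0):
    """Effective epochs including an estimated Grad-CAM scoring cost.

    trained_per_epoch: number of samples used for gradient updates in each epoch.
    scored_per_epoch:  number of candidate samples scored with Grad-CAM in each epoch.
    cam_cost_ratio:    assumed cost of one Grad-CAM pass relative to one training update.
    """
    if dataset_size <= 0:
        raise ValueError("dataset_size must be positive")
    if len(trained_per_epoch) != len(scored_per_epoch):
        raise ValueError("per-epoch lists must have the same length")

    training_units = sum(trained_per_epoch)
    scoring_units = cam_cost_ratio * sum(scored_per_epoch)
    return (training_units + scoring_units) / dataset_size
```

Setting `cam_cost_ratio=0` recovers the unadjusted proxy, so reporting the metric at a few plausible ratios would show how much of the claimed saving survives once explanation cost is counted.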

3. Potential Impact

The paper addresses a real need: training efficiency in clinical ML settings where computational resources are constrained. The idea of using explanation quality during training (rather than only post-hoc) has broader applicability beyond ECG classification—potentially extending to medical imaging, EHR analysis, and other clinical time-series domains.

However, the practical impact may be limited by:

The improvements are marginal and may not justify the added complexity of computing Grad-CAM during training.

The method is tightly coupled to Grad-CAM, which works well for convolutional architectures but is less naturally applicable to transformers or other modern architectures increasingly used in ECG analysis.

No comparison is made against other data selection methods (e.g., influence functions, forgetting events, dataset cartography) that could serve as baselines.

4. Timeliness & Relevance

The paper is timely in two respects: (1) growing interest in green/efficient AI and (2) increasing emphasis on trustworthy clinical AI. The intersection of explainability and training efficiency is underexplored, making the conceptual framing relevant. ECG classification is a well-motivated application domain given the scale of cardiac data and the resource constraints in many healthcare settings.

However, the paper does not engage with recent curriculum learning advances, data influence methods, or active learning literature that also addresses the question of which samples to prioritize during training.

5. Strengths & Limitations

Strengths:

Novel conceptual framing: using explainability as a training-time signal rather than post-hoc tool

Comprehensive experimental grid across datasets, architectures, and PDD variants

Useful qualitative analysis (Figures 5-8) showing the filtering behavior and class-level effects

The observation that NORM samples are preferentially filtered (producing diffuse attention) is clinically sensible

The method is architecturally non-invasive—no model modifications required

Limitations:

No error bars or statistical significance testing despite very small effect sizes

No wall-clock time or actual computational cost measurements

Missing comparison to alternative data selection baselines (influence functions, dataset cartography, etc.)

The focus score is simplistic and may not capture distributed but clinically meaningful attention patterns

φ requires tuning per dataset/model, undermining the "practical signal" narrative

The paper is quite verbose relative to its technical contribution—the method section is short while results tables are extensive

No ablation on the percentile used in the focus score

Code is promised but not yet available

6. Additional Observations

The paper reads more as an empirical study than a methodological contribution. The extensive tables (taking ~8 pages) report many configurations but the core insight could be conveyed more concisely. The writing is clear but repetitive. The related work adequately contextualizes the contribution but undersells the connection to active learning and data valuation literatures.

The class-level analysis (Section 4.8) is the most compelling part of the paper, showing that ERTS preferentially removes NORM samples with diffuse attention while preserving pathological classes—this provides mechanistic insight into why the method works.

Rating: 4.5 / 10

Significance 4.5 · Rigor 4 · Novelty 5 · Clarity 6

Generated Jun 11, 2026

Comparison History (23)

Won vs. Distributional Loss for Robust Classification

Paper 1 has higher potential impact because it introduces a training-time mechanism that leverages explainability (Grad-CAM focus) as a reliability/efficiency signal, addressing a concrete bottleneck in clinical ECG deployment (compute constraints, noisy/ambiguous samples). It is timely (trustworthy/efficient medical AI), has clear real-world applicability, and could generalize to other clinical time-series tasks. Paper 2’s distributional loss is broadly applicable, but the abstract is less specific about rigor/validation and similar “soft target/label smoothing/ambiguity-aware” losses exist, reducing perceived novelty.

gpt-5.2 · Jun 12, 2026

Won vs. How Much Memory Do We Need? Adaptive Memory Gate for Neural Operators

Paper 2 demonstrates higher potential scientific impact due to its broad applicability and conceptual novelty. While Paper 1 introduces a valuable, domain-specific architectural improvement for neural operators solving PDEs, Paper 2 innovatively bridges explainable AI (XAI) and efficient training by using explanation quality as an active data-pruning signal. This approach addresses critical real-world challenges in clinical machine learning—computational constraints and noisy datasets—offering immediate translational value for healthcare applications while contributing a novel methodology to the broader fields of trustworthy and efficient deep learning.

gemini-3.1-pro-preview · Jun 12, 2026

Won vs. Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Paper 2 presents a highly innovative approach by utilizing explainability (XAI) not just for post-hoc analysis, but as an active training signal to improve data efficiency and model reliability. Its application to ECG classification addresses a critical real-world bottleneck in healthcare AI—computational constraints and noisy clinical data. While Paper 1 offers solid theoretical improvements to multimodal VAEs, Paper 2 demonstrates broader translational impact, directly benefiting clinical ML deployment and advancing the integration of explainability into the model optimization process.

gemini-3.1-pro-preview · Jun 12, 2026

Lost vs. Simplex-Constrained Sparse Bagging: Transitioning from Uniform Priors to Sparse Posteriors in Ensemble Learning

Paper 2 addresses a fundamental and broadly applicable problem in ensemble learning with a mathematically rigorous solution. The identification and resolution of the 'L1-simplex paradox' is a genuine theoretical contribution. SCSB is model-agnostic, applicable across Random Forests, Bagged SVMs, and Neural Networks, giving it broad impact across machine learning. The 96% compression with maintained accuracy has significant practical implications. Paper 1, while useful, is more incremental—combining existing techniques (Grad-CAM, progressive data dropout) in a specific domain (ECG classification) with relatively narrow applicability.

claude-opus-4-6 · Jun 12, 2026

Lost vs. TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

Paper 2 addresses a fundamental and broadly applicable challenge—continual anomaly detection across heterogeneous tabular data with varying schemas. Its comprehensive approach, combining alignment, augmentation, and distillation, evaluated across 21 diverse datasets, offers significantly broader cross-disciplinary impact than Paper 1, which focuses on a domain-specific efficiency improvement for ECG classification.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. The Confidence Trap: Calibration Attacks for Graph Neural Networks

Paper 2 addresses a novel and underexplored security vulnerability in GNN calibration, combining adversarial robustness with calibration—two critical topics in trustworthy AI. It provides theoretical insights linking generalization and calibration vulnerability, introduces a comprehensive framework (UGCA) with multiple technical innovations, and has broader implications for safety-critical AI deployment. Paper 1, while practically useful for ECG efficiency, is more incremental—combining existing techniques (Grad-CAM, progressive data dropout) in a narrower clinical domain. Paper 2's findings about fundamental model vulnerabilities have wider cross-domain relevance.

claude-opus-4-6 · Jun 11, 2026

Lost vs. ATLAS: Active Theory Learning for Automated Science

Paper 1 proposes a highly innovative framework for automated science that automates hypothesis generation and experimental design. Its potential to accelerate mechanistic modeling across various scientific domains gives it a much larger breadth of impact and transformative potential compared to Paper 2, which focuses on a narrower methodological improvement for training efficiency in ECG classification.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. A Riemannian Approach to Low-Rank Optimal Transport

Paper 1 likely has higher scientific impact due to greater methodological novelty (Riemannian manifold formulation with Fisher–Rao metric, projectors/retractions/HVPs, rank-sufficiency certificate) and broader applicability across OT variants (linear OT, GW, fused GW, balanced/unbalanced). Its contributions are foundational and can influence multiple fields (optimization, geometry, ML, graphics, computational biology). Paper 2 is timely and practically relevant for clinical ECG efficiency, but its core idea (using Grad-CAM-based filtering as a training signal) is more incremental and domain-specific, with narrower cross-field impact.

gpt-5.2 · Jun 11, 2026

Won vs. What Uncertainties Do We Need for Dynamical Systems?

Paper 2 likely has higher scientific impact due to a concrete, novel training-time mechanism (explainability-derived reliability signal) with demonstrated empirical gains across multiple ECG datasets/architectures and clear real-world relevance in resource-constrained clinical ML. It combines efficiency, reliability, and interpretability—timely needs in healthcare AI—and offers deployable methodology plus code release. Paper 1 is valuable as a conceptual framework for uncertainty in dynamical systems, but appears more survey/position-oriented with less immediate methodological or application impact unless it introduces new formalism or validated tools.

gpt-5.2 · Jun 11, 2026

Lost vs. On Subquadratic Architectures: From Applications to Principles

Paper 1 addresses a fundamental bottleneck in modern AI—the quadratic scaling of Transformers—by evaluating and theoretically unifying subquadratic alternatives like xLSTM and Mamba-2. Its findings on state tracking and memory dynamics have broad implications across multiple domains, including NLP, code generation, and time-series analysis. Paper 2 offers a valuable but more niche application of explainability for efficient ECG training. Because Paper 1 tackles a core architectural challenge with field-wide relevance and high timeliness, it possesses significantly higher potential scientific impact.

gemini-3.1-pro-preview · Jun 11, 2026
