Boosting ECG Classification Performance by Pre-training with Synthesized Data

Naoki Nonaka, Jun Seita

Jun 9, 2026 · arXiv:2606.10802v1

cs.LG, cs.AI

#4565 of 5669 · cs.LG

Tournament Score: 1315 ± 44 (scale 1050 to 1750)

Win Rate: 32%

Rating: 4/10

Significance 4 · Rigor 4.5 · Novelty 3.5 · Clarity 6.5

Abstract

Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of $33.2\%$ observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper proposes a knowledge-driven Gaussian-composition simulator for generating synthetic single-lead II ECG signals to pre-train deep neural networks for abnormal ECG classification. The simulator represents each heartbeat as a sum of five Gaussian-shaped components (P, Q, R, S, T waves) and introduces class-specific modifications for four abnormalities: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). The key finding is that a "Syn→Real" transfer learning approach—pre-training on synthetic data then fine-tuning on real data—improves classification for three of four abnormality classes, with the largest gains observed when real-world data is most scarce.
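As a rough illustration of the Gaussian-composition idea summarized above, the following minimal sketch builds one beat as a sum of five Gaussian bumps and concatenates beats into a noisy record. The wave centers, widths, amplitudes, sampling rate, and the function names `synth_beat` and `synth_record` are all assumptions made for this example; they are not the parameters or code from the paper. The `drop` argument shows how a class-specific rule might be expressed (for instance, omitting the P wave for an AF-like variant), and is likewise hypothetical.

```python
import numpy as np

# Hypothetical (center_s, width_s, amplitude_mV) per wave.
# Illustrative values only, not the parameters from the paper's appendix.
WAVES = {
    "P": (0.20, 0.025, 0.15),
    "Q": (0.35, 0.010, -0.10),
    "R": (0.38, 0.012, 1.00),
    "S": (0.41, 0.010, -0.20),
    "T": (0.60, 0.040, 0.30),
}


def synth_beat(fs=500, duration=0.8, waves=WAVES, drop=()):
    """Return (t, beat): one beat as the sum of one Gaussian per wave."""
    t = np.arange(int(round(fs * duration))) / fs
    beat = np.zeros_like(t)
    for name, (mu, sigma, amp) in waves.items():
        if name in drop:  # e.g. drop=("P",) removes the P wave
            continue
        beat += amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    return t, beat


def synth_record(n_beats=10, fs=500, noise_sd=0.01, seed=0, drop=()):
    """Concatenate identical beats and add Gaussian noise (assumed noise model)."""
    rng = np.random.default_rng(seed)
    beats = [synth_beat(fs=fs, drop=drop)[1] for _ in range(n_beats)]
    signal = np.concatenate(beats)
    return signal + rng.normal(0.0, noise_sd, size=signal.shape)
```

A real simulator would also need to vary beat-to-beat intervals and wave parameters; this sketch keeps every beat identical to stay short.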

The core idea of using simulator-based synthetic data for pre-training is not new (domain randomization, FractalDB, and prior ECG simulators like McSharry et al. 2003 all exist), but the specific application to class-specific abnormal ECG synthesis with systematic evaluation across ten architectures provides a useful empirical contribution.

Methodological Rigor

The experimental design has several commendable elements: evaluation across ten diverse DNN architectures (CNNs, RNNs, Transformers, and variants), five independent training runs per configuration, and systematic downsizing experiments to characterize the relationship between data scarcity and synthetic data benefit. The use of PTB-XL as the real-world benchmark is appropriate given its public availability and established use.

However, there are notable methodological concerns:

1. Single dataset evaluation: All experiments use only PTB-XL. Generalizability to other ECG datasets, recording devices, or patient populations is unknown.

2. Binary classification only: Each experiment is a binary normal-vs-abnormal task rather than a multi-class setting, limiting practical applicability.

3. Fixed threshold of 0.5 for F1: This is suboptimal and may not reflect true model capability; threshold-agnostic metrics like AUPRC (provided in the appendix) would be more informative as the primary metric.

4. No statistical significance testing: Although standard deviations are reported, no formal statistical tests are conducted to determine whether the improvements are significant. This is problematic given the high variance in some results (e.g., ResNet34 AFLT "Real" has an SD of 0.3334).

5. Limited baseline comparisons: The paper does not compare against other data augmentation or synthesis approaches (e.g., GAN-based synthesis, time-series augmentation methods, or the McSharry ODE-based simulator they cite).

6. AF results are negative: The Syn→Real approach actually degrades performance for AF (−2.02% on average), the most common abnormality with the most training data. This is discussed but the mechanism is not well understood.
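For the significance-testing concern in point 4, one option that suits small samples such as five runs per configuration is an exact permutation test on the difference in mean F1. The sketch below is only one possible choice of test, not something the paper uses; the function name `permutation_test` and the example scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def permutation_test(a, b):
    """Exact two-sided permutation test on the difference of means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    n, k = len(pooled), len(a)
    hits = total = 0
    # Enumerate every way to relabel k of the n pooled scores as group "a".
    for idx in combinations(range(n), k):
        chosen = set(idx)
        left = [pooled[i] for i in idx]
        right = [pooled[i] for i in range(n) if i not in chosen]
        # Small tolerance guards against floating-point ties.
        if abs(mean(left) - mean(right)) >= observed - 1e-12:
            hits += 1
        total += 1
    return hits / total


# Hypothetical F1 scores from five runs each; NOT values from the paper.
real_only = [0.41, 0.05, 0.62, 0.10, 0.55]
syn_to_real = [0.58, 0.61, 0.55, 0.63, 0.60]
p_value = permutation_test(real_only, syn_to_real)
```

With five runs per condition there are only 252 relabelings, so the smallest attainable two-sided p-value is 2/252 (about 0.008). That floor is worth keeping in mind when correcting for multiple architectures and classes.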

Potential Impact

The practical impact is moderate. The approach is most useful in rare disease classification scenarios where fewer than ~100 positive samples are available, which is a genuine clinical need. The method's simplicity—requiring no learned generative model and no access to patient data for synthesis—is advantageous from a privacy perspective.

However, the impact is limited by several factors:

The simulator requires manual specification of class-specific rules for each abnormality, which the authors acknowledge does not scale well.

Only four abnormalities are demonstrated, and the method's applicability to the remaining ~67 ECG statement types in PTB-XL is unclear.

The simulator operates on single-lead II ECG only, whereas clinical ECG typically involves 12 leads.

The WPW results, while showing relative improvement, remain at very low absolute F1 scores (0.15→0.19 on average), suggesting the approach is insufficient on its own for practical deployment.
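To make the difference between relative and absolute improvement concrete, the rounded WPW averages quoted above work out to roughly:

$$\frac{0.19 - 0.15}{0.15} \approx 0.27 \quad \text{(about a 27\% relative gain)}, \qquad 0.19 - 0.15 = 0.04 \quad \text{(absolute F1 gain)}.$$

This uses the two-decimal averages from this assessment, so it may differ slightly from the figure reported in the paper.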

Timeliness & Relevance

The paper addresses a legitimate and timely concern: data scarcity in medical AI, particularly for rare conditions. The intersection of domain-knowledge-based synthesis and deep learning pre-training is relevant given increasing interest in foundation models and self-supervised learning for medical signals. However, the approach feels somewhat dated given recent advances in generative modeling (diffusion models, large-scale pre-trained models) that might achieve similar or better results with less manual engineering. The paper does not position itself relative to modern self-supervised pre-training approaches for ECG (e.g., contrastive learning on unlabeled ECG data), which represent a competing paradigm for addressing data scarcity.

Strengths

1. Comprehensive architecture evaluation: Testing across 10 architectures spanning CNNs, RNNs, and Transformers provides robust evidence that findings are not architecture-specific.

2. Practical privacy advantage: Synthetic data generation requires no patient data, sidestepping privacy concerns entirely.

3. Clear trend with data scarcity: The downsizing experiments convincingly demonstrate that synthetic pre-training benefits increase as real data decreases.

4. Reproducibility: The method is simple and fully described with explicit parameters in the appendix.

5. Honest reporting: The paper transparently reports negative results for AF classification.
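A downsizing experiment of the kind mentioned in point 3 can be sketched as a stratified subsample of the real training set at several fractions. This assessment does not describe how the paper performs the downsizing, so the stratified scheme, the function name `stratified_subsample`, and the fractions below are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Return sorted indices keeping about `fraction` of each class (at least one)."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    keep = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))
        keep.extend(rng.sample(idxs, k))
    return sorted(keep)


# Hypothetical labels: 90 normal (0) and 10 abnormal (1) recordings.
labels = [0] * 90 + [1] * 10
subsets = {f: stratified_subsample(labels, f) for f in (0.5, 0.1, 0.01)}
```

Each subset would then be used to train a "Real" model and to fine-tune a "Syn→Real" model, so the two can be compared at matched data sizes.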

Limitations

1. Simplistic synthesis: Five-Gaussian composition produces waveforms that are recognizably artificial (visible in Figures 1-4), and the gap between synthetic and real data is large (Syn-only performance is very poor).

2. Limited scalability: Manual rule design per abnormality class is labor-intensive and requires expert knowledge.

3. No ablation studies: The relative contributions of different synthesis components (noise addition, parameter perturbation, class-specific rules) are not disentangled.

4. Narrow evaluation scope: Single dataset, single lead, binary classification, four conditions.

5. Missing comparisons: No comparison with GAN-based ECG synthesis, standard augmentation techniques, or self-supervised pre-training methods.

Overall Assessment

This paper makes an incremental contribution to the ECG classification literature by systematically demonstrating that even simple, knowledge-driven synthetic ECG data can improve DNN pre-training under data scarcity. The finding that benefits scale inversely with real-world data availability is intuitive but well-documented here. However, the simplicity of the synthesis approach, limited evaluation scope, lack of comparison with competing methods, and the negative result for AF (the most common condition) constrain the paper's impact. The work is best characterized as a useful proof-of-concept that domain-knowledge-based synthetic ECGs have value for pre-training, with significant room for improvement in both the synthesis methodology and experimental validation.

Rating: 4/10

Significance 4 · Rigor 4.5 · Novelty 3.5 · Clarity 6.5

Generated Jun 10, 2026

Comparison History (22)

Won vs. Finding Multiple Interpretations in Datasets

Paper 2 addresses a critical bottleneck in medical AI (data scarcity) with a highly practical, domain-knowledge-driven approach. Its rigorous methodology, testing across ten DNN architectures, and clear quantifiable results demonstrate immediate and significant real-world applicability. While Paper 1 offers valuable methodological insights for model interpretability, Paper 2 provides a direct, highly timely solution to a pervasive problem in healthcare machine learning, indicating a higher potential for immediate scientific and clinical impact.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

Paper 2 introduces a more novel methodological contribution by combining population-aware Deep Sets representations with physics-informed neural particle flows for Bayesian inference—a broadly applicable framework across many scientific domains. Paper 1, while practical, applies a relatively straightforward synthetic data pre-training strategy to ECG classification with Gaussian-composition models, which is incremental over existing work. Paper 2's approach has broader potential impact across signal processing, robotics, and probabilistic inference, and its methodological innovation (conditioning transport on population-level physics features) is more transferable.

claude-opus-4-6 · Jun 10, 2026

Won vs. Geometrically Averaged Hard Target Updates for Linear Q-Learning

Paper 2 addresses a critical real-world bottleneck (data scarcity and privacy in medical AI) with clear, immediate applications in healthcare diagnostics. Its knowledge-driven synthetic data approach demonstrates significant empirical performance gains across multiple DNN architectures. While Paper 1 offers a valuable theoretical refinement for Q-learning stability, Paper 2 has a much broader potential for immediate societal impact and cross-disciplinary relevance in both AI and medicine.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. A prism hierarchy of learning regimes in large linear autoencoders

Paper 1 provides a comprehensive theoretical framework for understanding learning regimes in autoencoders, offering a novel geometric (prism) characterization with analytical solutions. This has broader impact across machine learning theory by providing a systematic methodology applicable beyond the specific model studied. Paper 2, while practically useful for ECG classification, applies relatively standard techniques (synthetic data pre-training) to a specific application domain with incremental contributions. Paper 1's theoretical depth and generalizability give it higher potential for lasting scientific impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Multi-Agent Lipschitz Bandits

Paper 1 makes fundamental theoretical contributions to multi-agent bandits over continuous spaces, introducing novel coordination mechanisms with matching upper and lower bounds. This advances core bandit theory with broad implications for decentralized learning, resource allocation, and multi-agent systems. Paper 2 presents a practical but incremental contribution—using knowledge-driven synthetic ECG data for pre-training—building on well-established ideas (data augmentation, synthetic pre-training) with limited novelty in methodology. Paper 1's theoretical depth, generalizability, and rigorous framework give it substantially higher long-term scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Inverse Probability Weighting and Age-of-Information Aggregation for Decentralized Federated Learning under Partial Reception

Paper 1 addresses a broadly impactful problem—data scarcity in medical AI—with a practical, domain-knowledge-driven synthetic data generation approach for ECG classification. Its applicability across ten DNN architectures and demonstration of significant gains (33.2% for AFLT) with limited real data makes it highly relevant to healthcare AI, a rapidly growing field. Paper 2 tackles a more niche problem in decentralized federated learning over lossy networks. While technically rigorous, its scope is narrower, targeting a specific wireless FL scenario with fewer potential real-world applications and a smaller research community.

claude-opus-4-6 · Jun 10, 2026

Won vs. Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Paper 2 addresses a critical and universal challenge in medical machine learning—data scarcity and privacy—by demonstrating significant improvements (up to 33.2%) using domain-knowledge-driven synthetic data. Its approach has broad implications for healthcare AI and addresses real-world bottlenecks, whereas Paper 1 presents a more specialized architectural adaptation for speech emotion recognition.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. XtrAIn: Training-Guided Occlusion for Feature Attribution

Paper 1 offers a more novel and broadly relevant methodological contribution: reframing occlusion attribution via training-trajectory/parameter-space updates to address baseline bias and “attribution shift.” This could impact many domains using explainable AI, not just a single modality, and provides conceptual tools for studying evidence formation during training. Paper 2 has clear real-world medical applicability, but synthetic ECG generation and synthetic-to-real pretraining are more incremental and narrower in scope. Paper 1’s potential cross-field impact and timeliness in XAI likely yield higher scientific impact.

gpt-5.2 · Jun 10, 2026

Lost vs. CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference

Paper 2 addresses a fundamental bottleneck in LLM inference—autoregressive decoding speed—which is a critical challenge affecting the entire AI community. It identifies a novel root cause (head-backbone competition), proposes an elegant and minimal solution (CLP with only ~5K parameters vs 1M), and demonstrates practical speedups with zero quality loss. The breadth of impact is larger given the ubiquity of LLMs. Paper 1, while useful for medical ECG classification with limited data, applies a relatively established concept (synthetic data pre-training) to a narrower domain with incremental gains.

claude-opus-4-6 · Jun 10, 2026

Lost vs. TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Paper 1 has higher likely impact due to greater methodological novelty (turn/prefix-level rollout budget allocation in tree-structured agentic RL with a shared success-probability predictor) and broad relevance to rapidly growing RL-for-LLMs/agentic reasoning. Its contributions can generalize across tasks, models, and domains where outcome-only rewards and sampling cost are bottlenecks, making it timely and widely applicable. Paper 2 addresses an important applied problem, but the approach (knowledge-based synthetic ECG generation for pretraining) is more domain-specific and resembles established simulation/data-augmentation paradigms, limiting breadth despite clear clinical utility.

gpt-5.2 · Jun 10, 2026
