A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment from Routine Electrocardiography

Ziqing Yu, Yuhui Tao, Jiayu Huo, Lei Pan, Zilong Xiao, Juecheng Chen, Xiao Li, Jianxuan Li

May 25, 2026

arXiv:2605.25446v1

cs.AI (primary), cs.LG

#27 of 2453 · Artificial Intelligence

Tournament Score: 1587 ± 46 (displayed scale: 1050 to 1800)

Win Rate: 97%

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 5 · Clarity 5.5

Abstract

Electrocardiography (ECG) is central to cardiovascular care, but conventional AI models are often restricted to common arrhythmias and may generalize poorly across populations or clinically subtle diseases. We developed ECG Contrastive Language-Image Pre-training (ECGCLIP), a signal-language contrastive learning framework that aligns ECG waveforms with expert diagnostic reports. ECGCLIP was pre-trained on 2,837,962 ECG studies from 1,324,856 patients and evaluated on a held-out internal test set plus nine independent external cohorts comprising about 1.5 million ECGs. Evaluation covered 89 downstream tasks, including 45 ECG diagnoses, 39 echocardiographic targets, and 5 rare cardiac diseases, using PRAUC as the primary metric. ECGCLIP consistently improved performance over random initialization and Merl-R18 baselines. On the internal test set, ECGCLIP-R34 achieved strong performance for atrial fibrillation (PRAUC 0.900) and ST-segment elevation myocardial infarction (PRAUC 0.383), with robust generalization across all external cohorts. It also improved low-prevalence and diagnostically elusive diseases, including Ebstein anomaly, constrictive pericarditis, dextrocardia, and cardiac amyloidosis, with internal PRAUC values of 0.253, 0.175, 0.121, and 0.201, respectively. ECGCLIP was data efficient, matching or exceeding full-dataset baseline performance with only 10% of training data. Feature visualization and saliency analysis suggested clinically meaningful representations aligned with established electrocardiographic criteria. These findings indicate that large-scale ECG-report contrastive pre-training can expand routine ECG interpretation beyond common arrhythmias toward broad cardiovascular assessment and opportunistic screening of echocardiographic and rare conditions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: ECGCLIP — A Signal-Language Foundation Model for Broad-Spectrum Cardiovascular Assessment

1. Core Contribution

ECGCLIP adapts the CLIP (Contrastive Language-Image Pre-training) framework to align 12-lead ECG waveforms with expert-authored diagnostic reports using a dual-objective contrastive learning strategy (cross-modal alignment + uni-modal alignment). The core claim is that this approach, trained on ~2.8 million ECG-report pairs from a single large Chinese hospital, can serve as a foundation model enabling broad-spectrum cardiovascular assessment across 89 downstream tasks, including standard ECG diagnoses (45), echocardiographic phenotypes (39), and rare cardiac diseases (5).

The primary novelty lies in the scale of expert-curated data (substantially larger than prior work like Merl's ~800K pairs from MIMIC-IV) and the breadth of downstream evaluation, particularly extending into echocardiographic screening and rare disease detection from ECG alone. The paper positions ECGCLIP as moving beyond single-disease classifiers toward a "panoramic screening" paradigm.
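The cross-modal alignment objective mentioned above follows the CLIP recipe. The following is a minimal sketch in plain Python, assuming a symmetric InfoNCE loss over a batch of paired ECG and report embeddings. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions, not details taken from the paper, and the uni-modal alignment term is left out because the assessment does not specify its form.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_style_loss(ecg_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss: ECG-to-report plus report-to-ECG, averaged.

    Assumption: row i of ecg_emb and row i of text_emb come from the same study.
    """
    ecg = [l2_normalize(v) for v in ecg_emb]
    txt = [l2_normalize(v) for v in text_emb]
    logits = [
        [sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
        for e in ecg
    ]
    logits_t = [list(col) for col in zip(*logits)]  # report-to-ECG direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy batch: matched pairs score a far lower loss than swapped pairs.
matched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

When every embedding in the batch is identical, the loss reduces to the log of the batch size, which is a convenient sanity check for an implementation of this kind.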

2. Methodological Rigor

Strengths:

The patient-level data splitting is rigorously described, preventing leakage between pre-training and evaluation sets.

Evaluation across nine external cohorts spanning multiple countries (China, US, UK, Germany) provides meaningful evidence for generalizability.

The use of PRAUC as primary metric is appropriate given extreme class imbalance.

Bootstrap confidence intervals (1,000 iterations) and permutation tests for statistical comparison are sound.

Ablation analysis systematically isolates contributions of data scale (ECGCLIP-R18 vs. Merl-R18) and model depth (R18→R34→R50).
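Two of the points above, PRAUC as the primary metric and bootstrap confidence intervals, can be sketched together. This is a hedged illustration in plain Python: it estimates PRAUC with the average-precision sum and builds a percentile interval from resampled ECGs. The helper names, the percentile method, and the handling of resamples without positives are assumptions; the paper's exact procedure is not described in this assessment.

```python
import random


def average_precision(y_true, y_score):
    """Estimate the area under the precision-recall curve (average precision).

    Ties in y_score are broken by input order, which is a simplification.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    order = sorted(range(len(y_true)), key=lambda i: y_score[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / n_pos


def bootstrap_ci(y_true, y_score, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for average precision."""
    if sum(y_true) == 0:
        raise ValueError("need at least one positive label")
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        labels = [y_true[i] for i in idx]
        if sum(labels) == 0:
            continue  # resample drew no positives; draw again
        stats.append(average_precision(labels, [y_score[i] for i in idx]))
    stats.sort()
    k = int(round(alpha / 2 * n_boot))
    return stats[k], stats[n_boot - 1 - k]


# Toy labels and scores (hypothetical, not data from the paper).
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.35, 0.2, 0.1]
point = average_precision(labels, scores)
low, high = bootstrap_ci(labels, scores)
```

With very few positives, as in the rare-disease cohorts, many resamples contain only one or two positive cases, so intervals computed this way tend to be wide.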

Weaknesses:

The framework is fundamentally an application of Merl's established architecture to a larger dataset. The methodological novelty is incremental — the CMA+UMA dual-objective framework is borrowed directly from prior work.

The baseline comparisons are limited. The paper only compares against random initialization and Merl-R18. No comparison against other ECG foundation models (e.g., KED, DeepECG, or supervised baselines with equivalent data scale) is provided. This makes it difficult to disentangle the contribution of the contrastive learning framework from simply having more labeled data.

For rare diseases, absolute PRAUC values remain very low (e.g., 0.201 for cardiac amyloidosis, 0.019 for arrhythmogenic right ventricular cardiomyopathy, ARVC). While the relative improvements over baselines are substantial, the clinical utility of such models remains questionable; the paper acknowledges that this could lead to high false-positive rates.

The echocardiographic ground truth relies on temporal pairing within 30 days, which introduces noise, particularly for progressive conditions.

The translation of Chinese reports to English via GPT-4o introduces an unquantified source of error.
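The 30-day temporal pairing noted among the weaknesses can be made concrete with a small sketch. It assumes a symmetric window and nearest-date matching; both choices, along with the function name and the example dates, are hypothetical, since the assessment only states that ECGs and echocardiograms were paired within 30 days.

```python
from datetime import date


def pair_ecg_with_echo(ecg_dates, echo_dates, max_gap_days=30):
    """For each ECG date, return the nearest echo date within the window, else None."""
    pairs = []
    for ecg in ecg_dates:
        candidates = [
            echo for echo in echo_dates
            if abs((echo - ecg).days) <= max_gap_days
        ]
        nearest = (
            min(candidates, key=lambda echo: abs((echo - ecg).days))
            if candidates
            else None
        )
        pairs.append((ecg, nearest))
    return pairs


# Hypothetical patient timeline: one ECG has an echo 9 days earlier, one has none nearby.
echos = [date(2024, 1, 1), date(2024, 2, 5), date(2024, 3, 1)]
paired = pair_ecg_with_echo([date(2024, 1, 10), date(2024, 6, 1)], echos)
```

Any pairing of this kind labels an ECG with a measurement taken up to a month away, which is where the label noise for progressive conditions enters.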

3. Potential Impact

The clinical vision is compelling: transforming routine ECGs into opportunistic screening tools for structural heart disease and rare cardiomyopathies. If validated prospectively, this could:

Enable earlier detection of conditions like cardiac amyloidosis in primary care settings

Optimize echocardiography referral by enriching pre-test probability

Democratize cardiovascular diagnostics in resource-limited settings

However, the gap between demonstrated discriminative performance and clinical deployment remains large. For rare diseases, the performance levels would likely generate unacceptable false-positive rates in real-world screening. The paper's code and model weights are publicly available, which is commendable for reproducibility.
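The false-positive concern can be illustrated with Bayes' rule. The sketch below computes positive predictive value from sensitivity, specificity, and prevalence; the operating point and prevalence used in the example are hypothetical numbers chosen for illustration and are not reported in the paper.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive screen is a true case, by Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive predictions at this operating point")
    return true_pos / (true_pos + false_pos)


# Hypothetical operating point: 80% sensitivity, 95% specificity.
rare = positive_predictive_value(0.80, 0.95, prevalence=0.001)
common = positive_predictive_value(0.80, 0.95, prevalence=0.5)
```

Under these assumed values, a condition affecting 1 in 1,000 patients yields a positive predictive value of roughly 1.6%, so most flagged patients would be false positives even though the same classifier looks strong on a balanced sample.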

4. Timeliness & Relevance

The paper addresses a genuinely important bottleneck: most AI-ECG models remain narrow, single-task classifiers. The foundation model paradigm for ECG is timely, with concurrent efforts from multiple groups (KED, DeepECG, Zhou et al.). ECGCLIP's contribution of scaling expert-curated multimodal pre-training is relevant, though the field is rapidly evolving.

The emphasis on rare disease detection and echocardiographic screening fills a genuine gap — most prior work focuses on common arrhythmias. However, the paper somewhat overstates readiness for clinical deployment given the modest absolute performance on these challenging tasks.

5. Strengths & Limitations

Key Strengths:

Massive pre-training dataset with expert-authored reports (not automated labels)

Comprehensive evaluation across 89 tasks and 10 cohorts (one internal test set plus nine external cohorts totaling ~1.5M ECGs)

Strong data efficiency: matching baseline performance with 10% of training data

Interpretability analysis (Integrated Gradients, t-SNE) shows clinically meaningful representations

Open-source code and weights

Notable Limitations:

Training data overwhelmingly from a single Chinese institution — claims of demographic robustness are overstated given the lack of African/Latin American validation

Limited baseline comparisons; no head-to-head with contemporary foundation models

The ResNet backbone is relatively dated; transformer-based architectures are not explored

Rare disease cohorts are extremely small (e.g., 4 ARVC cases, 6 AC cases in test set), making statistical conclusions unreliable

No prospective validation or clinical outcome assessment

The paper is extremely long, with extensive supplementary tables, and would benefit from a more focused presentation of key findings

Some claims in the Discussion ("redefines the ECG as a highly scalable clinical gatekeeper") are not fully supported by the evidence

Additional Observations:

The PTB-XL exception (where ECGCLIP underperforms baselines for simple classifications) reveals a meaningful limitation of the approach — semantic alignment may bias toward complex pathologies at the expense of simple deterministic patterns.

The 8-lead input design (dropping redundant leads) is practical but limits direct comparison with 12-lead models.

Summary

ECGCLIP represents a solid engineering contribution demonstrating that scaling expert-curated ECG-report contrastive learning improves downstream performance across a broad diagnostic spectrum. The evaluation is thorough and the clinical vision is important. However, the methodological novelty is modest (scaling an existing framework), baseline comparisons are insufficient, and absolute performance on the most clinically interesting tasks (rare diseases, structural screening) remains far from clinical utility. The paper's impact will depend heavily on whether the community can build upon this foundation to close the gap to clinical deployment.

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 5 · Clarity 5.5

Generated May 26, 2026

Comparison History (34)

vs. KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

gpt-5.2 · 5/26/2026

Paper 2 has higher potential impact due to broader cross-field reach and timeliness: it proposes a general knowledge-infrastructure scaffold enabling agentic use of process-based simulators across 14 Earth-science domains and 117+ models, addressing major barriers to climate-risk and resource decision support. If robust, it could change how simulation models are accessed, integrated, and maintained (a “living commons”), affecting many disciplines and user communities. Paper 1 is methodologically strong and clinically valuable, but its impact is largely confined to cardiovascular ECG interpretation and adjacent tasks.

vs. ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

claude-opus-4.6 · 5/26/2026

ECGCLIP presents a foundation model for cardiovascular assessment trained on ~2.8M ECGs, evaluated across 89 downstream tasks and 9 external cohorts. It addresses a critical clinical need by expanding ECG interpretation beyond common arrhythmias to rare diseases and echocardiographic targets, with strong data efficiency. Its potential real-world clinical impact—enabling broad cardiovascular screening from routine ECGs—far exceeds ACE-Bench, which is an incremental contribution to agent evaluation benchmarks. Paper 2 demonstrates greater novelty, broader cross-disciplinary impact (AI + cardiology), and immediate translational potential.

vs. Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

claude-opus-4.6 · 5/26/2026

ECGCLIP demonstrates higher scientific impact due to: (1) massive scale of training (2.8M ECGs) and evaluation (1.5M ECGs across 9 external cohorts), providing strong evidence of generalizability; (2) direct clinical applicability for cardiovascular screening using routine ECGs, addressing a real healthcare need; (3) novel signal-language contrastive learning framework extending CLIP to ECG interpretation across 89 tasks including rare diseases; (4) data efficiency findings enabling deployment in resource-limited settings; (5) broader cross-disciplinary impact spanning AI, cardiology, and public health. Paper 2, while valuable for AI agent evaluation, addresses a more niche benchmarking problem with narrower real-world impact.

vs. SkillEvolBench: Benchmarking the Evolution from Episodic Experience to Procedural Skills

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to major real-world clinical applicability (broad cardiovascular screening from routine ECG), strong methodological rigor (large-scale pretraining on ~2.8M studies, evaluation on nine external cohorts totaling ~1.5M ECGs, 89 downstream tasks), and clear performance/generalization gains including rare diseases and data efficiency. Its foundation-model signal-language alignment is timely and broadly relevant across medicine and ML. Paper 1 is novel and valuable for agent evaluation, but as a benchmark its immediate real-world impact and cross-field uptake are less certain than a clinically validated foundation model.

vs. Breaking the Chains of Probability: Neutrosophic Logic as a New Framework for Epistemic Uncertainty in Large Language Models

claude-opus-4.6 · 5/26/2026

Paper 1 presents a large-scale foundation model (ECGCLIP) trained on ~2.8M ECGs with extensive validation across 9 external cohorts and 89 downstream tasks, demonstrating clear clinical utility for cardiovascular assessment including rare diseases. Its methodological rigor, massive scale, practical clinical applications, and data efficiency make it highly impactful. Paper 2 applies neutrosophic logic to LLM outputs but relies on prompting strategies over only 4 GPT models with limited experimental scope, offers primarily conceptual contributions, and lacks integration into actual model architectures, limiting its practical impact.

vs. EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

gpt-5.2 · 5/26/2026

Paper 1 likely has higher scientific impact due to its large-scale, clinically grounded foundation-model approach with strong external validation across ~1.5M ECGs and 89 tasks, including rare diseases and echo targets. It is methodologically rigorous (contrastive pretraining with reports, multi-cohort generalization, data-efficiency, interpretability analyses) and has clear, high-stakes real-world applications in cardiovascular screening and decision support. Its breadth spans ML, cardiology, and health systems. Paper 2 is timely and valuable for evaluating coding agents, but its primary contribution is a benchmark (26 tasks), with narrower immediate societal impact than clinical deployment potential.

vs. Inference Time Context Sparsity: Illusion or Opportunity?

claude-opus-4.6 · 5/26/2026

ECGCLIP represents a more impactful contribution: it introduces a novel foundation model for cardiovascular assessment trained on ~2.8M ECGs, demonstrates strong performance across 89 downstream tasks including rare diseases, and shows robust generalization across 9 external cohorts. Its clinical applications—screening for rare cardiac conditions and echocardiographic abnormalities from routine ECGs—could directly impact patient care at scale. Paper 2 provides valuable empirical analysis of context sparsity in LLMs but is more incremental, confirming and systematizing known observations about attention sparsity rather than introducing a fundamentally new paradigm. Paper 1's breadth of validation and direct medical applicability give it higher potential impact.

vs. Summoning the Oracle to Slay It: Mitigating Look-Ahead Bias in Financial Backtesting with Large Language Models

gemini-3.1 · 5/26/2026

Paper 2 presents a massive-scale foundation model for healthcare, evaluated on millions of patients across numerous independent cohorts. Its potential to enable broad-spectrum cardiovascular assessment and screen for rare, life-threatening diseases gives it profound real-world clinical implications. While Paper 1 introduces a clever methodological fix for financial AI backtesting, Paper 2's scale, rigorous multi-cohort external validation, and direct impact on human health and medical AI represent a significantly broader and more critical scientific advancement.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

claude-opus-4.6 · 5/26/2026

Paper 1 demonstrates higher scientific impact due to its massive scale (2.8M ECGs, 1.3M patients), rigorous evaluation across 9 external cohorts and 89 downstream tasks, and direct clinical applicability to routine ECG interpretation. The ECGCLIP framework addresses a concrete clinical need—expanding ECG utility beyond common arrhythmias to rare diseases and echocardiographic screening—with strong generalization evidence. Paper 2 presents an interesting neuro-symbolic framework but offers only benchmark-comparable performance without clear advantages over existing LLMs, and lacks the clinical validation scale and real-world deployment potential of Paper 1.

vs. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

claude-opus-4.6 · 5/26/2026

Paper 2 presents ECGCLIP, a foundation model for cardiovascular assessment trained on ~2.8M ECGs and validated across 9 external cohorts on 89 tasks including rare diseases. Its breadth of clinical applicability (arrhythmias, echocardiographic targets, rare cardiac diseases), data efficiency, and robust external validation give it enormous real-world impact potential in cardiology—a massive clinical field. While Paper 1 is innovative in applying LLM-guided search to disease forecasting, its impact is narrower (US respiratory season forecasting) and the autonomous model discovery paradigm, though promising, is less immediately translatable to broad clinical practice.

vs. Toward Enactive Artificial Intelligence

gemini-3.1 · 5/26/2026

Paper 2 presents a highly rigorous, data-intensive foundation model with immediate, life-saving clinical applications across broad cardiovascular assessments. Its massive scale (2.8M ECGs, 9 external validation cohorts) and empirical success in detecting rare and elusive diseases demonstrate profound methodological rigor and real-world utility. While Paper 1 offers valuable theoretical insights for AI, Paper 2's tangible technological advancement and transformative potential in healthcare give it a significantly higher estimated scientific and societal impact.

vs. Hallucination as Exploit: Evidence-Carrying Multimodal Agents

gemini-3.1 · 5/26/2026

Paper 2 has higher scientific impact due to its massive scale and profound life-saving potential in clinical medicine. Training on nearly 3 million ECGs and validating across 1.5 million external studies demonstrates exceptional methodological rigor and generalizability. While Paper 1 addresses an important AI safety issue, Paper 2 establishes a foundational medical AI framework capable of detecting 89 conditions, including rare diseases, from routine, low-cost tests. The direct translation to improving broad-spectrum cardiovascular care gives it significantly higher societal and real-world impact.

vs. HeartBeatAI: An Interpretable and Robust Deep Learning Framework for Multi-Label ECG Arrhythmia Detection

gemini-3.1 · 5/26/2026

Paper 2 introduces a foundation model paradigm (ECGCLIP) trained on a massive scale of nearly 3 million ECGs, demonstrating exceptional generalization across 9 external cohorts and 89 downstream tasks. While Paper 1 presents a solid supervised framework, it admits significant performance degradation in cross-institutional deployment for rare anomalies. Paper 2 successfully addresses this exact gap, showing robust detection of rare diseases and high data efficiency, indicating a much broader and transformative potential impact on cardiovascular care.

vs. CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to its scale (2.8M ECGs; 9 external cohorts), methodological rigor via broad external validation across 89 clinically relevant tasks, and direct real-world applicability to cardiovascular screening and diagnosis, including rare diseases. Its signal-language foundation approach is timely and broadly extensible in medical AI, with potential immediate translational value in healthcare systems. Paper 1 is novel and important for agent training infrastructure, but its impact is more concentrated within CUA/RL communities and depends on downstream adoption and robustness in real environments.

vs. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to strong real-world clinical applicability, massive-scale pretraining (2.8M ECGs) with extensive external validation (~1.5M ECGs across nine cohorts), and broad downstream utility (89 tasks including rare diseases and echocardiographic targets). Its signal-language foundation-model framing is timely and could affect cardiology practice, screening, and multimodal representation learning. Paper 1 is methodologically novel with useful theory for LLM-driven discovery, but its immediate translational impact and validation breadth appear narrower than Paper 2’s potential to change routine cardiovascular assessment.

vs. RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

gemini-3.1 · 5/26/2026

Paper 2 presents a foundation model trained and validated on millions of patient records across multiple independent cohorts, offering massive scale and methodological rigor. Its ability to generalize to 89 clinical tasks, including rare diseases and opportunistic screening, presents a transformative, life-saving impact in healthcare. While Paper 1 offers a practical safety improvement for autonomous driving, Paper 2's sheer scale, cross-disciplinary relevance (AI and cardiology), and potential to fundamentally change routine cardiovascular assessment give it significantly higher scientific and real-world impact.

vs. Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to strong novelty (signal-language contrastive foundation model), very large-scale training (2.8M ECGs) and extensive external validation (~1.5M ECGs across nine cohorts) over 89 tasks, supporting methodological rigor and generalizability. Its real-world clinical applications are substantial (broad cardiovascular assessment, opportunistic screening, rare disease detection) with clear timeliness in foundation models for healthcare. Paper 1 is timely and useful for safer agentic LLM deployment, but its domain-specific dataset/evaluation framework and narrower application scope suggest comparatively less immediate cross-field and societal impact.

vs. AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to its large-scale foundation-model contribution (2.8M studies, extensive external validation), strong methodological rigor across 89 tasks and nine independent cohorts, and clear real-world clinical applicability (broad cardiovascular assessment, rare disease screening, data efficiency). Its signal-language contrastive approach can generalize across healthcare AI and representation learning. Paper 1 is timely and useful for AV generative model evaluation, but benchmarks/evaluators may have narrower cross-domain impact and adoption depends on community uptake and standardization.

vs. When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

gemini-3.1 · 5/26/2026

Paper 1 presents a massive-scale foundation model for ECGs with immediate, life-saving clinical applications. Its evaluation across 1.5 million external ECGs and 89 downstream tasks demonstrates exceptional methodological rigor and generalization. While Paper 2 offers valuable methodological insights into synthetic data for NLP, Paper 1's breakthrough in medical AI, particularly its ability to detect rare cardiac diseases and enable opportunistic screening, represents a significantly higher potential for broad scientific and real-world impact.

vs. A governance horizon for ethical-use constraints in open-weight AI models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it introduces a scalable foundation-model paradigm (signal–language contrastive pretraining) trained on very large clinical data, with extensive multi-cohort external validation (~1.5M ECGs) across 89 tasks, suggesting strong methodological rigor and broad applicability to diagnosis, screening, and representation learning in healthcare. Its real-world translational potential is immediate (routine ECG workflows, rare disease detection, echocardiography proxy targets) and timely given foundation-model momentum. Paper 1 is novel and relevant for AI governance, but its impact is more policy/infrastructure-focused and less directly transformative across multiple scientific/clinical domains.