Jasmeet Singh Bindra, Siddharth Panwar, Shubhajit Roy Chowdhury
Deep learning on physiological time series is interpreted through domain-specific features -- oscillatory rhythms in EEG, morphological complexes in ECG -- yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32--0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.
This paper introduces a spectral audit framework that systematically quantifies how much deep learning models on physiological time series rely on broadband aperiodic (1/f-like) spectral structure versus the oscillatory or morphological features typically invoked to interpret them. The framework combines: (1) aperiodic/periodic spectral decomposition (via SpecParam and IRASA), (2) phase-preserving Fourier interventions on raw signals, (3) sham controls, and (4) simulation validation. The key insight is that the 1/f spectral envelope—which covaries with age, arousal, and pathology—can be a dominant predictive feature exploited by models, even when researchers interpret model decisions through oscillatory biomarkers.
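To make the second component concrete, here is a minimal sketch of what a phase-preserving flattening intervention could look like. It is not the authors' implementation: the paper is described as using SpecParam and IRASA for the decomposition, whereas this sketch swaps in a plain least-squares line fit in log-log space. The function name `flatten_aperiodic`, the 1 to 40 Hz band, and the power-preserving rescaling are assumptions made for illustration.

```python
import numpy as np

def flatten_aperiodic(x, fs, fmin=1.0, fmax=40.0):
    """Divide out a fitted 1/f trend from the amplitude spectrum, keeping every phase.

    Assumption: the aperiodic component is approximated by a straight line in
    log-log space (a stand-in for SpecParam/IRASA, not a replacement for them).
    Returns the flattened signal and the fitted aperiodic exponent.
    """
    n = len(x)
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    amp, phase = np.abs(spec), np.angle(spec)

    band = (freqs >= fmin) & (freqs <= fmax)
    log_f = np.log10(freqs[band])
    log_power = np.log10(amp[band] ** 2 + 1e-20)

    # Least-squares fit: log10(power) ~ offset + slope * log10(f)
    slope, offset = np.polyfit(log_f, log_power, 1)
    exponent = -slope

    # Divide the in-band amplitude by the fitted aperiodic amplitude.
    aperiodic_amp = np.sqrt(10.0 ** (offset + slope * log_f))
    new_amp = amp.copy()
    new_amp[band] = amp[band] / aperiodic_amp

    # Rescale so the band keeps its original total power (illustrative choice).
    new_amp[band] *= np.sqrt(np.sum(amp[band] ** 2) / np.sum(new_amp[band] ** 2))

    # Recombine the modified amplitudes with the untouched phases.
    flat = np.fft.irfft(new_amp * np.exp(1j * phase), n=n)
    return flat, exponent
```

Because only the amplitudes inside the band are rescaled, the timing information carried by the phase spectrum is left intact, which is what lets the intervention isolate the contribution of the broadband slope.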
The paper demonstrates this across three EEG domains (sleep staging, clinical abnormality detection, motor imagery), seven EEG foundation models, and six standard architectures, and it extends the framework to PTB-XL ECG as a cross-modality proof of principle. The central finding is that aperiodic reliance is task-dependent: dominant for sleep-wake classification (~0.43 drop in balanced accuracy, BA), intermediate for clinical EEG abnormality (~0.07–0.13), minimal for motor imagery, and substantial for ECG abnormality (~0.32–0.36).
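For readers unfamiliar with the metric, the drops quoted above are differences in balanced accuracy, which is the mean of per-class recall. The sketch below shows the arithmetic with a small made-up example. The helper names `balanced_accuracy` and `flattening_drop` are hypothetical, and whether the paper measures drops against the original or the sham condition is not stated here, so the baseline used is an assumption.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)

def flattening_drop(y_true, pred_baseline, pred_flattened):
    """Balanced-accuracy drop from a baseline condition to the flattened condition."""
    return balanced_accuracy(y_true, pred_baseline) - balanced_accuracy(y_true, pred_flattened)

# Toy example (made-up labels): a model that predicts only the majority class
# once the aperiodic component is flattened.
y_true = [0, 0, 0, 0, 1, 1]
pred_baseline = [0, 0, 0, 0, 1, 1]
pred_flattened = [0, 0, 0, 0, 0, 0]
drop = flattening_drop(y_true, pred_baseline, pred_flattened)
```

In this toy case the model falls from perfect balanced accuracy to chance level for two classes, so a drop of 0.5 is the largest a binary classifier can show without performing below chance.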
The experimental design is notably thorough. Several aspects stand out:
Validation stack: The sham control is critical—it demonstrates that Fourier round-tripping itself doesn't damage performance (except for BENDR, which is transparently flagged). Simulation validation across four synthetic families (pure aperiodic, pure periodic, mixed, shortcut-confound) confirms the framework recovers known ground truth. IRASA-SpecParam agreement (median correlation 0.966) shows decomposition robustness.
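The logic of the sham control can be illustrated in a few lines: the signal goes through the same forward and inverse transform as the real intervention, but the spectrum is left unmodified. This is a sketch under assumptions, not the paper's code; `sham_round_trip` and `sham_deviation` are hypothetical names.

```python
import numpy as np

def sham_round_trip(x):
    """Forward and inverse real FFT with no spectral modification."""
    return np.fft.irfft(np.fft.rfft(x), n=len(x))

def sham_deviation(x):
    """Largest absolute sample difference introduced by the round trip."""
    return float(np.max(np.abs(sham_round_trip(x) - x)))
```

Numerically the round trip changes a signal only at floating-point precision, so a model whose accuracy collapses under the sham condition is reacting to something other than the spectral edit, which is why such a case is better read as model fragility than as aperiodic reliance.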
Controls for confounds: The age/sex matching for TUAB and PTB-XL is well-executed, using same-sex pairs with five-year calipers within official splits. The temporal acquisition-proxy audit on TUAB (early vs. late recording eras) addresses a non-obvious confound. Importantly, effects persist after these controls—reduced but not eliminated—supporting the "simultaneous confound and biomarker" interpretation.
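A rough sketch of how same-sex pairs with a five-year age caliper might be formed is shown below. The greedy nearest-age strategy, the record layout (`id`, `sex`, `age`), and the function name `match_pairs` are assumptions for illustration. The review does not describe the matching algorithm itself, and the function would be applied separately within each official split.

```python
def match_pairs(cases, controls, caliper_years=5.0):
    """Greedy 1:1 matching: same sex, age gap within the caliper, no control reused.

    `cases` and `controls` are lists of dicts with hypothetical keys
    "id", "sex", and "age". Cases with no eligible control are dropped.
    """
    available = list(controls)
    pairs = []
    for case in sorted(cases, key=lambda r: r["age"]):
        candidates = [
            c for c in available
            if c["sex"] == case["sex"]
            and abs(c["age"] - case["age"]) <= caliper_years
        ]
        if not candidates:
            continue  # no same-sex control within the caliper
        best = min(candidates, key=lambda c: abs(c["age"] - case["age"]))
        available.remove(best)
        pairs.append((case["id"], best["id"]))
    return pairs

# Made-up records for illustration only.
cases = [
    {"id": "a1", "sex": "F", "age": 40},
    {"id": "a2", "sex": "M", "age": 70},
    {"id": "a3", "sex": "F", "age": 20},
]
controls = [
    {"id": "n1", "sex": "F", "age": 43},
    {"id": "n2", "sex": "M", "age": 40},
    {"id": "n3", "sex": "M", "age": 74},
    {"id": "n4", "sex": "F", "age": 30},
]
pairs = match_pairs(cases, controls)
```

Re-running the intervention on only the matched recordings then shows how much of the flattening drop survives once age and sex can no longer explain the label.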
Statistical rigor: Subject-level bootstrapping, hierarchical seed/subject aggregation for multi-seed experiments, and Benjamini-Hochberg FDR correction across 31 primary tests are appropriate. Reporting that 25 of the 31 tests survive FDR correction gives a clear picture of how much of the evidence holds up.
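The two statistical tools named here can be sketched as follows. This is a simplified illustration that uses only the standard library. It shows a plain subject-level percentile bootstrap and the standard Benjamini-Hochberg step-up rule, not the hierarchical seed/subject aggregation, and the function names, the 0.05 level, and the number of resamples are assumptions.

```python
import random

def subject_bootstrap_ci(per_subject_values, n_boot=2000, ci=0.95, seed=0):
    """Percentile CI for the mean, resampling whole subjects with replacement.

    `per_subject_values` maps a subject id to one summary value (for example
    that subject's accuracy drop), so epochs from one subject stay together.
    """
    rng = random.Random(seed)
    subjects = list(per_subject_values)
    n = len(subjects)
    means = []
    for _ in range(n_boot):
        sample = [per_subject_values[rng.choice(subjects)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((1 - ci) / 2 * n_boot)]
    upper = means[int((1 + ci) / 2 * n_boot) - 1]
    return lower, upper

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the test survives BH correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected
```

Note that the step-up rule rejects every hypothesis ranked at or below the cutoff, including any whose own p-value missed its individual threshold, which is what distinguishes it from a simple per-test comparison.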
Limitations acknowledged well: The BENDR sham collapse is transparently reported as a fragility case rather than swept under the rug. ECG SpecParam fit quality (median R²=0.273) is honestly reported, and the PTB-XL analysis is correctly framed as proof-of-principle rather than definitive.
One weakness is the absence of a causal or mechanistic training-time solution. The paper diagnoses the problem but doesn't prescribe fixes (acknowledged by the authors). Additionally, the framework is evaluation-time only, meaning it cannot directly reveal *which* aperiodic features models use internally.
Immediate practical impact: The framework is computationally cheap (no retraining required), model-agnostic, and applicable to any pretrained checkpoint. This makes adoption realistic. The concrete recommendation—adding an "aperiodic intervention column" to foundation-model leaderboards—is actionable.
Interpretability of clinical AI: For clinical EEG/ECG systems approaching deployment, knowing whether performance rests on broadband shortcuts or on genuine biomarkers is critical for regulatory and clinical trust. The TUAB finding that aperiodic structure is simultaneously a demographic confounder and a potential biomarker poses an interpretive challenge that is more nuanced, and more realistic, than simple "shortcut learning."
Cross-domain generalizability: The extension to ECG elevates this from an EEG-specific contribution to a general principle for 1/f-bearing time series. The authors plausibly suggest EMG, HRV, speech, and financial time series as future targets.
Foundation model evaluation: With the rapid proliferation of EEG foundation models (seven audited here), demonstrating that 6/7 show significant aperiodic reliance challenges the assumption that scale solves representation quality. This could reshape how the community evaluates pretrained physiological encoders.
This work addresses a genuine and growing concern. EEG/ECG deep learning is accelerating toward clinical deployment, and foundation models are being positioned as general-purpose representations. Simultaneously, the neuroscience community has increasingly recognized aperiodic activity as physiologically meaningful (Donoghue et al., 2020; Brake et al., 2024). The paper bridges these two communities at exactly the right time—before aperiodic confounds become baked into deployed clinical systems.
The "spectral audit" framing is also timely given broader ML concerns about shortcut learning, spurious correlations, and interpretability. This paper provides a domain-specific instantiation of these concerns with concrete, testable methodology.
Strengths:
1. Comprehensive scope: Six architectures × three EEG domains + seven foundation models + one ECG domain, with systematic controls at each level
2. Elegant dissociation: The N2-vs-N3 robustness to flattening (vs. wake-vs-sleep collapse) is a compelling internal control that demonstrates specificity
3. Nuanced interpretation: The "duality" framing—aperiodic structure as both confound and biomarker—is more scientifically honest than a binary shortcut/non-shortcut narrative
4. Reproducibility design: Public datasets, clearly described preprocessing, and planned code release
5. Domain-agnostic design: The framework requires only a decomposable power spectrum
Weaknesses:
1. No training-time solution: The paper identifies the problem but leaves mitigation (e.g., adversarial regularization, spectral augmentation) to future work
2. ECG fit quality: Low SpecParam R² for ECG limits the precision of the ECG conclusions
3. Short-trial challenges: PhysioNet MI uses short cue-locked trials where spectral fitting is inherently noisier—this may partly explain the "minimal reliance" finding
4. No investigation of what models learn instead: When flattening drops are small (MI), what spectral/temporal features remain? The paper doesn't deeply investigate the positive case
5. Single ECG dataset: Cross-modality generalization rests on one ECG benchmark
This is a well-designed methodological contribution that addresses a real and underappreciated problem in physiological time-series deep learning. The framework is practical, the experimental coverage is broad, and the results carry clear implications for how the field reports and interprets model performance. The paper's primary limitation—lack of a training-time fix—is offset by the diagnostic value of the framework itself. This work should influence reporting standards in EEG/ECG deep learning and prompt foundation-model developers to include spectral robustness in their evaluation protocols.
Generated Jun 9, 2026
Paper 2 likely has higher impact due to broader and more immediate real-world relevance: it identifies and quantifies a pervasive confound (aperiodic 1/f components) affecting many physiological DL tasks across EEG and ECG, proposes a general, validated audit methodology, and motivates a community-wide best practice for interpretability and clinical robustness. Its cross-architecture, cross-dataset, cross-modality evidence suggests wide applicability in biomedicine and ML. Paper 1 is novel and rigorous with strong tooling/benchmark contributions, but its domain is narrower (geometric synthesis) and impact may concentrate within constrained-generation/verification subfields.
Paper 1 presents a highly innovative approach to optimizing LLM inference by using an agent-driven, statically-checked auto-research loop to synthesize CUDA megakernels. Given the massive global computation costs of LLM inference, this work offers immense real-world value and broad impact across AI infrastructure. While Paper 2 provides an important methodological audit for physiological ML, Paper 1's contribution to automated systems optimization and deep learning deployment represents a more timely and transformative leap in the highly active field of AI systems engineering.
Paper 1 addresses a critical methodological flaw in physiological deep learning by exposing widespread reliance on aperiodic confounds across EEG and ECG models. Its proposed spectral audit framework has profound implications for the safety, interpretability, and scientific validity of medical AI. While Paper 2 provides a valuable engineering benchmark for mobile AI agents, Paper 1 offers deeper fundamental scientific insights and methodological corrections that will directly impact the rigor of future biomedical machine learning research.
Paper 1 addresses a fundamental problem in representation learning—concept alignment—with broad implications across ML/AI. It provides a unifying theoretical framework, introduces a benchmark, identifies failures in existing methods, and proposes a practical solution (CoSAE). Its breadth of impact spans multiple fields (interpretability, multimodal learning, representation learning). Paper 2 makes a valuable methodological contribution to physiological signal processing, but its scope is narrower, primarily impacting EEG/ECG deep learning communities. Paper 1's framework-level contribution has wider applicability and longer-term influence potential.
Paper 2 addresses a fundamental problem in deep learning (plasticity loss in continual learning) with a theoretically grounded framework connecting dynamical isometry to the Neural Tangent Kernel. It introduces a novel optimizer (AdamO), provides theoretical unification of prior methods, and demonstrates broad applicability across supervised and reinforcement learning. Paper 1 offers a valuable diagnostic framework for physiological time-series models but targets a more specialized audience. Paper 2's broader applicability across ML, theoretical depth, and practical tools (optimizer, regularization) give it higher potential impact across the field.
Paper 2 addresses a fundamental and widespread methodological concern—aperiodic spectral confounds—affecting deep learning across multiple physiological signal domains (EEG, ECG). Its findings impact a large community (clinical ML, neuroscience, cardiology) and propose a reusable audit framework that could become standard practice. The breadth of validation (six architectures, seven foundation models, two signal modalities) strengthens its generalizability. Paper 1 addresses the important but narrower problem of LLM-agent failure attribution with promising but primarily synthetic validation, limiting its demonstrated real-world impact at this stage.
Paper 2 introduces a novel spectral audit framework applicable across multiple physiological signal domains (EEG, ECG) and deep learning architectures, addressing a fundamental confound (aperiodic 1/f components) that affects interpretability of deep learning models broadly used in clinical and neuroscience research. It has far greater breadth of impact, methodological novelty, and practical implications—potentially changing standard practices in physiological time-series deep learning. Paper 1 is a replication/validation study of a specific airline clustering analysis with limited generalizability and narrower audience.
Paper 2 has higher potential scientific impact due to its direct clinical applicability—predicting organ-level complications weeks to months before onset using existing routine lab data, requiring no additional infrastructure. It demonstrates broad generalizability across cancers and healthcare systems (external validation on MIMIC-IV and CoMMpass), addresses 162 complications across eight clinical categories, and could transform oncological surveillance. Paper 1 makes an important methodological contribution to interpretability of deep learning on physiological signals, but its impact is narrower, primarily serving as an auditing tool for the ML/neuroscience community rather than enabling new clinical capabilities.
Paper 2 likely has higher impact due to broader, modality-agnostic applicability (images, 3D, climate) and a clear scalability advance (removing inner-loop meta-learning; large memory/batch gains). Its locality+hierarchy priors for neural field tokenization can influence representation learning, generative modeling, and scientific ML across domains. Paper 1 is timely and methodologically careful, with important implications for physiological DL interpretability, but its scope is narrower (EEG/ECG) and more focused on auditing/confounds than enabling new general-purpose modeling capabilities.
Paper 2 introduces a broadly applicable spectral audit framework that reveals a previously underappreciated confound (aperiodic 1/f components) affecting deep learning across multiple physiological signal types (EEG, ECG), architectures, and tasks. Its implications are far-reaching: it could change standard practices for interpretable physiological AI, affecting clinical deployment and research methodology across neuroscience, cardiology, and ML. Paper 1 addresses tensor compression for LLMs—useful but incremental in a crowded field. Paper 2's cross-domain relevance, methodological rigor, and potential to reshape evaluation standards give it higher impact.