CLEF: EEG Foundation Model for Learning Clinical Semantics

Peng Cao, Ali Mirzazadeh, Jong Woo Lee, Aleksandar Videnovic, Dina Katabi

May 11, 2026

arXiv:2605.10817v1

cs.AI (primary)

#146 of 2292 · Artificial Intelligence

Tournament Score: 1532 ± 46 (scale 1050 to 1800)

Win Rate: 85%

Rating: 8/10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Abstract

Clinical EEG interpretation requires reasoning over full EEG sessions and integrating signal patterns with clinical context. Existing EEG foundation models are largely designed for short-window decoding and do not incorporate clinical context. We introduce CLEF, a clinically grounded long-context EEG foundation model. CLEF represents EEG sessions as 3D multitaper spectrogram tokens, enabling tractable Transformer modeling at session scale, and aligns embeddings with neurologist reports and structured EHR data through contrastive objectives. We evaluate CLEF on a new 234-task benchmark spanning disease phenotypes, medication exposures, and EEG findings, with more than 260k EEG sessions from over 108k patients. CLEF outperforms prior EEG foundation models on 229 of 234 tasks, improving mean AUROC from 0.65 to 0.74. Reconstruction-only pretraining surpasses prior EEG foundation models, while report and EHR alignment yields further gains. Held-out concept and external-cohort experiments suggest that these representations transfer beyond observed alignment targets. These results support session-scale, clinically grounded representation learning as a promising foundation-model paradigm for clinical EEG.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CLEF — EEG Foundation Model for Learning Clinical Semantics

1. Core Contribution

CLEF addresses a fundamental paradigm mismatch in EEG foundation models: existing models are designed for short-window BCI-style decoding (5–30 seconds), whereas clinical EEG interpretation requires reasoning over entire sessions (20+ minutes) in the context of patient history. The paper makes four interlinked contributions:

1. Session-scale tokenization via 3D multitaper spectrogram VQGAN that compresses a 19-channel, 1280-second recording into 2,048 discrete tokens — reducing ~5 million raw samples to a tractable Transformer sequence.

2. Clinical grounding through contrastive alignment with LLM-summarized neurologist reports and structured EHR data (demographics, medications, diagnoses encoded as learnable code embeddings).

3. A 234-task clinical EEG benchmark spanning disease phenotypes (120), medication exposures (99), and EEG findings (15), constructed from more than 260k sessions across more than 108k patients.

4. Systematic evaluation demonstrating CLEF outperforms five recent EEG foundation models on 229/234 tasks, with mean AUROC improving from 0.65 to 0.74.
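The compression figures in the first contribution can be checked with simple arithmetic. The sketch below is only a back-of-the-envelope calculation: the 200 Hz sampling rate is an assumption (it is not stated in this assessment), chosen because it reproduces the quoted figure of roughly 5 million raw samples.

```python
# Back-of-the-envelope check of the session-scale compression figures.
# ASSUMPTION: a 200 Hz sampling rate; this value is not given in the review.
CHANNELS = 19
DURATION_S = 1280
SAMPLING_RATE_HZ = 200  # assumed, not from the source
NUM_TOKENS = 2048

# Total raw samples in one session across all channels.
raw_samples = CHANNELS * DURATION_S * SAMPLING_RATE_HZ

# Average number of raw samples summarized by each discrete token.
samples_per_token = raw_samples / NUM_TOKENS

print(raw_samples)
print(samples_per_token)
```

Under that assumed rate, each token stands in for a few thousand raw samples, which is why a standard Transformer sequence length becomes feasible at session scale.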

The paradigm shift from segment-level decoding to patient-level clinical representation learning is the paper's most important conceptual contribution. This reframes what an EEG foundation model should optimize for in clinical settings.

2. Methodological Rigor

Tokenization design. The choice of multitaper spectrograms is well-justified — EEG is a non-stationary stochastic process where phase is uninformative across recordings while spectral structure is stable and clinically meaningful. The inter-channel-emphasized reconstruction loss (Eq. 2) is a thoughtful detail: vanilla ℓ₁ collapses inter-channel variation because the shared mean dominates the loss across C channels, while the decomposed loss explicitly upweights differential components critical for lateralization and focal findings. Ablations in Table 10 confirm this matters.
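To illustrate why a decomposed loss can protect inter-channel variation, here is a minimal sketch. It is not the paper's Eq. 2, which is not reproduced in this assessment. The function names, the split into a shared mean plus per-channel deviations, and the weight `diff_weight` are all hypothetical choices made for illustration.

```python
def channel_mean(x):
    """Mean across channels at each position; x is [channels][positions]."""
    n_ch = len(x)
    return [sum(ch[i] for ch in x) / n_ch for i in range(len(x[0]))]


def plain_l1(pred, target):
    """Ordinary mean absolute error over all channels and positions."""
    total = sum(
        abs(p - t)
        for p_ch, t_ch in zip(pred, target)
        for p, t in zip(p_ch, t_ch)
    )
    return total / (len(pred) * len(pred[0]))


def decomposed_l1(pred, target, diff_weight=4.0):
    """Hypothetical decomposed loss: an L1 term on the channel mean plus an
    upweighted L1 term on each channel's deviation from that mean."""
    mp, mt = channel_mean(pred), channel_mean(target)
    n_pos = len(mt)
    shared = sum(abs(a - b) for a, b in zip(mp, mt)) / n_pos
    diff = 0.0
    for p_ch, t_ch in zip(pred, target):
        for i in range(n_pos):
            diff += abs((p_ch[i] - mp[i]) - (t_ch[i] - mt[i]))
    diff /= len(pred) * n_pos
    return shared + diff_weight * diff


# Two channels that differ; the prediction collapses both to their mean.
target = [[1.0, 1.0], [3.0, 3.0]]
collapsed = [[2.0, 2.0], [2.0, 2.0]]

loss_plain = plain_l1(collapsed, target)
loss_decomposed = decomposed_l1(collapsed, target)
print(loss_plain, loss_decomposed)
```

In this toy case the collapsed prediction gets the shared mean exactly right, so all of its error sits in the differential term, and the decomposed loss penalizes it more heavily than plain ℓ₁ does.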

Two-stage training. The sequential Recon→Align pipeline is validated against joint training (Table 14), showing the two objectives interfere when optimized simultaneously. This is a non-obvious finding — the dense position-level reconstruction signal provides a better initialization for the sparse sequence-level contrastive objective.
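For readers unfamiliar with sequence-level contrastive objectives, the sketch below shows a symmetric InfoNCE-style loss over paired embeddings. The assessment only says "contrastive objectives", so the specific loss form, the cosine similarity, and the temperature value are assumptions. The function name `info_nce` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(eeg_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair i of eeg_embs and text_embs is the
    positive, and every other pairing in the batch acts as a negative."""
    logits = [[cosine(e, t) / temperature for t in text_embs] for e in eeg_embs]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))


eeg = [[1.0, 0.0], [0.0, 1.0]]
reports_aligned = [[1.0, 0.0], [0.0, 1.0]]
reports_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(eeg, reports_aligned)
loss_swapped = info_nce(eeg, reports_swapped)
print(loss_aligned, loss_swapped)
```

Each session contributes only one positive pair per batch, which is the sense in which this signal is sparse compared with a reconstruction loss applied at every masked position.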

Evaluation protocol. The frozen-encoder probing protocol with matched sliding-window aggregation across all baselines is fair. The age/sex/site/setting-matched control design for each benchmark task reduces obvious confounders. The held-out concept experiment (Table 1) is particularly rigorous: removing 20% of clinical concepts from alignment targets shows zero degradation, providing evidence that CLEF learns generalizable representations rather than memorizing alignment targets.
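A minimal sketch of the two evaluation ingredients mentioned above: aggregating sliding-window embeddings into one session embedding, and scoring a probe with AUROC. Mean pooling is an assumed aggregator (the assessment does not name one), the probe itself is omitted, and the scores and labels are made-up toy values.

```python
def mean_pool(window_embeddings):
    """Average per-window embeddings into one session-level embedding.
    ASSUMPTION: mean pooling; the aggregator is not specified in the review."""
    n = len(window_embeddings)
    dim = len(window_embeddings[0])
    return [sum(w[d] for w in window_embeddings) / n for d in range(dim)]


def auroc(scores, labels):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as half (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(per_task_results):
    """Unweighted average of per-task AUROC values."""
    values = [auroc(scores, labels) for scores, labels in per_task_results]
    return sum(values) / len(values)


# Toy data: two windows pooled into one embedding, and one task's probe scores.
session_embedding = mean_pool([[1.0, 2.0], [3.0, 4.0]])
task_auroc = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
print(session_embedding, task_auroc)
```

Applying the same aggregation to every baseline is what keeps the comparison fair: models with short native windows are given the same session-level view as the long-context model.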

Potential concerns. The benchmark draws labels from the same hospital system (HEEDB) used for pretraining, creating possible distributional coupling even with patient-level splits. While the external cohort results (TUAB, TUEP, HSP) partially address this, all three are relatively narrow tasks. The baselines have an in-distribution advantage on TUAB/TUEP (pretrained on TUH), making CLEF's gains there more impressive, but this asymmetry complicates interpretation. The paper also lacks statistical significance testing across the 234 tasks — reporting only per-task standard deviations across seeds.
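As one example of the kind of aggregate test that is missing, the sketch below computes an exact one-sided sign test on the reported 229 wins out of 234 tasks. It assumes the tasks are independent and ignores ties, which is unlikely to hold here because tasks share patients and correlated labels, so the resulting p-value only illustrates the procedure and is not a result from the paper.

```python
from math import comb


def sign_test_p_value(wins, n):
    """One-sided exact p-value P(X >= wins) for X ~ Binomial(n, 0.5).
    ASSUMPTION: tasks are independent and there are no ties."""
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2**n


# Win count taken from the abstract: CLEF is better on 229 of 234 tasks.
p_value = sign_test_p_value(229, 234)
print(p_value)
```

A paired test on per-task AUROC differences, with correction for task correlation, would be more informative than this win count alone.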

3. Potential Impact

Clinical translation. With 75% of clinical EEGs interpreted by non-neurophysiologists, CLEF's patient-level embeddings could serve as a clinical decision support backbone. The ability to detect medication exposures, disease phenotypes, and EEG abnormalities from a single frozen representation is practically valuable — it enables a single model to support hundreds of downstream clinical queries.

Benchmark contribution. The 234-task benchmark fills a genuine void. Prior EEG benchmarks focus on BCI tasks (emotion recognition, motor imagery, sleep staging). A large-scale clinical benchmark with matched controls enables systematic comparison of future clinical EEG models.

Methodological influence. The spectrogram-tokenization approach could influence other biosignal domains (ECG, EMG, polysomnography). The EHR encoder design — learnable code embeddings with self-attention over variable-length sets — is a clean alternative to text serialization for tabular medical data.

Scaling signals. Tables 3–4 show monotonic improvement with both model size (17M→148M) and data size, with no saturation. This suggests the paradigm has room to grow, which is important for adoption.

4. Timeliness & Relevance

The paper arrives at a moment when foundation models are rapidly being adopted in medical imaging (radiology, pathology, ophthalmology) but clinical EEG lags behind. The BCI-focused EEG foundation model literature has matured enough (LaBraM, CBraMod, REVE, NeuroLM) that the clinical gap is now conspicuous. The PhysioNet Challenge 2026 on cognitive impairment from sleep EEG further signals community interest in clinical EEG AI. CLEF directly addresses this emerging need.

5. Strengths & Limitations

Key strengths:

The problem formulation (session-scale clinical representation) is more important than any single technical component, and it is convincingly argued.

Extraordinarily thorough ablations (Appendix D) validate nearly every design choice, from input representation to masking hyperparameters to training scheme.

The scale of evaluation (234 tasks, 260k sessions, 108k patients, 3 external datasets) is unprecedented for clinical EEG.

The held-out concept and external cohort experiments provide credible evidence of generalization.

Notable limitations:

No real-time capability limits ICU monitoring applications.

The 20-minute context window excludes sleep architecture and long-term monitoring patterns.

No interpretability mechanisms — a significant barrier to clinical adoption.

The benchmark is derived from a single hospital network (Harvard), potentially embedding institutional biases in documentation and coding practices.

Five of 234 tasks show CLEF underperforming baselines, with no analysis of failure modes.

Reproducibility is limited: HEEDB requires credentialed access, and the full pipeline (VQGAN + encoder + LLM summarizer + EHR encoder) has substantial complexity.

Overall Assessment

CLEF represents a well-executed paradigm shift for clinical EEG foundation models. The combination of session-scale modeling, clinical grounding, and comprehensive evaluation sets a new standard. The 234-task benchmark alone is a lasting contribution. While clinical deployment barriers remain (interpretability, bias auditing, real-time support), the foundational representation learning problem is convincingly advanced.

Rating: 8/10

Significance 8.5 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated May 12, 2026

Comparison History (23)

vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

claude-opus-4.6 · 5/18/2026

CLEF introduces a novel foundation model paradigm for clinical EEG that addresses fundamental limitations (short-window decoding, lack of clinical context) with a comprehensive approach combining 3D spectral tokenization, session-scale modeling, and multimodal alignment with clinical reports/EHR data. Its evaluation on a massive 234-task benchmark with 260k+ sessions demonstrates strong empirical results and broad clinical applicability. Paper 2 makes a solid contribution applying formal methods (LTL) to LLM monitoring, but addresses a more incremental problem in AI governance. Paper 1's scale, methodological innovation, new benchmark, and direct clinical impact give it higher potential impact.

vs. Imperfect World Models are Exploitable

claude-opus-4.6 · 5/18/2026

CLEF addresses a significant practical gap in clinical EEG interpretation with a comprehensive foundation model evaluated on a massive benchmark (234 tasks, 260k sessions). Its immediate clinical applicability, large-scale evaluation, and strong empirical results give it broad impact across neurology, clinical AI, and EEG research. While Paper 1 makes important theoretical contributions connecting reward hacking and model exploitation in RL, its impact is more niche and theoretical. Paper 2's combination of methodological innovation, scale, and direct clinical relevance positions it for broader and more immediate scientific impact.

vs. Agentic Discovery of Exchange-Correlation Density Functionals

claude-opus-4.6 · 5/16/2026

Paper 1 addresses a fundamental challenge in computational chemistry/physics (XC functional design) using a novel agentic LLM-based approach, achieving a ~9% improvement over a gold-standard functional. Its impact spans chemistry, materials science, and AI-for-science broadly. The cautionary insight about AI exploiting benchmarks is widely relevant. While Paper 2 makes strong contributions to clinical EEG with impressive scale and practical utility, Paper 1's novelty in automating scientific discovery of physical laws, combined with its cross-disciplinary implications, gives it higher potential impact.

vs. Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge

gpt-5.2 · 5/16/2026

Paper 2 (CLEF) likely has higher scientific impact due to a substantial methodological advance (session-scale EEG transformer with clinical-context alignment), large-scale real-world dataset (260k sessions, 108k patients), and broad applicability to clinical phenotyping, decision support, and multimodal health AI. It demonstrates strong empirical gains across a wide 234-task benchmark and includes transfer/external-cohort validation, indicating rigor and generalization. Paper 1 is timely and important for AI evaluation reliability, but its scope is narrower and its downstream applications are more indirect than a clinically deployable foundation model.

vs. When Reasoning Traces Become Performative: Step-Level Evidence that Chain-of-Thought Is an Imperfect Oversight Channel

claude-opus-4.6 · 5/16/2026

Paper 2 addresses a fundamental assumption underlying the rapidly growing use of chain-of-thought reasoning in LLMs—that visible reasoning traces faithfully reflect internal computation. This finding has broad implications for AI safety, interpretability, and oversight across the entire field of language model research. Its cross-model, cross-benchmark methodology and the introduction of a principled step-level framework make it highly generalizable. While Paper 1 is a strong clinical AI contribution with impressive scale, Paper 2's impact spans a wider research community and challenges core assumptions in one of AI's most active areas.

vs. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

claude-opus-4.6 · 5/16/2026

CLEF introduces a novel EEG foundation model with significant clinical utility, evaluated on a massive 234-task benchmark with 260k+ sessions. It advances the state of the art in clinical EEG interpretation with clear methodological contributions (3D multitaper spectrograms, clinical context alignment) and demonstrates strong empirical results. While Paper 2 provides valuable meta-analysis of AI benchmarking culture, its impact is more observational and policy-oriented. Paper 1's direct contributions to clinical neuroscience, its large-scale evaluation, and its potential to improve neurological diagnosis give it broader and deeper scientific impact.

vs. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact: it introduces a clinically grounded, session-scale EEG foundation model with a large real-world dataset (260k sessions/108k patients) and a broad 234-task benchmark, showing substantial performance gains and transfer. Its methodological contribution (long-context tokenization + multimodal alignment to reports/EHR) is novel within clinical EEG and has direct translational potential across neurology, critical care, and EHR-linked phenotyping. Paper 2 is timely and valuable for evaluation rigor in agent benchmarks, but is more incremental/meta-methodological and its immediate real-world impact may be narrower.

vs. TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

claude-opus-4.6 · 5/16/2026

CLEF addresses a fundamental gap in clinical EEG modeling by introducing a foundation model that operates at session scale and integrates clinical semantics. Its evaluation on a massive 234-task benchmark with 260k+ sessions demonstrates broad clinical applicability across disease phenotypes, medications, and EEG findings. The paradigm of clinically grounded, long-context EEG foundation models opens new directions in clinical neuroscience and has direct real-world healthcare applications. While TACT presents a clever mechanistic intervention for LLM agent drift, its impact is narrower—focused on coding agents—and activation steering is an established technique. CLEF's cross-disciplinary impact (ML + clinical neurology) and novel benchmark give it higher potential.

vs. Workspace-Bench 1.0: Benchmarking AI Agents on Workspace Tasks with Large-Scale File Dependencies

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact due to substantial methodological innovation (session-scale EEG Transformers via spectrogram tokenization plus multimodal contrastive alignment with reports/EHR), strong rigor and scale (260k sessions, 108k patients, 234-task benchmark, external/held-out transfer), and clear real-world clinical applications (phenotyping, medication exposure inference, EEG finding prediction). It advances foundation modeling in healthcare with potential downstream effects across neurology and clinical ML. Paper 2 is timely and useful for agent evaluation, but is primarily a benchmark contribution with narrower immediate real-world deployment impact.

vs. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

claude-opus-4.6 · 5/16/2026

CLEF addresses a critical gap in clinical EEG interpretation with a novel foundation model paradigm that integrates long-context sessions with clinical semantics. Its evaluation on a massive 234-task benchmark with 260k+ sessions demonstrates broad clinical applicability across disease phenotypes, medications, and EEG findings. The work bridges neuroscience, clinical medicine, and deep learning, with immediate real-world clinical utility. While QuantumQA makes solid contributions to scientific reasoning in LLMs, its scope is narrower (quantum mechanics QA), and the approach builds more incrementally on existing RLVR paradigms.

vs. Towards Conversational Medical AI with Eyes, Ears and a Voice

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a fundamentally new paradigm for medical AI—a multimodal conversational system that processes live audio-visual streams for real-time clinical decision-making. Its novelty spans multiple fields (AI, telemedicine, HCI), addresses a transformative real-world application, and introduces a rigorous evaluation framework (TelePACES) with a crossover study design. While Paper 1 makes strong contributions to EEG foundation models with impressive scale and benchmarking, it operates within a more specialized domain. Paper 2's broader interdisciplinary impact, paradigm-shifting concept of AI co-clinician, and implications for healthcare delivery give it higher potential impact.

vs. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

gemini-3.1 · 5/12/2026

Paper 1 addresses a critical challenge in clinical medicine by introducing a large-scale, long-context EEG foundation model. Its rigorous evaluation across 234 tasks using over 260k EEG sessions demonstrates massive real-world potential for improving neurological diagnostics. While Paper 2 offers an innovative approach to agentic data science, Paper 1's direct clinical applicability, vast dataset scale, and substantial performance improvements give it a broader and more immediate scientific and societal impact.

vs. Mental Health AI Safety Claims Must Preserve Temporal Evidence

gemini-3.1 · 5/12/2026

Paper 2 presents a massive-scale, clinically grounded foundation model with extensive empirical validation across 234 tasks and 260k EEG sessions. Its rigorous methodology, concrete clinical applications, and significant performance improvements over existing models demonstrate a higher potential for immediate, broad impact in medical AI compared to the more conceptual and proof-of-concept approach of Paper 1.

vs. PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

claude-opus-4.6 · 5/12/2026

CLEF presents a substantially more rigorous and impactful contribution. It introduces a clinically grounded EEG foundation model evaluated on a massive 234-task benchmark with 260k+ sessions, demonstrating clear quantitative improvements over prior models. The medical AI domain has high real-world impact potential. The methodology—combining spectral tokenization, contrastive alignment with clinical reports/EHR, and comprehensive evaluation including external cohorts—is thorough. Paper 2, while addressing an interesting proactive agent paradigm, is more conceptual, lacks comparable rigor in evaluation, and targets a less well-defined problem space with less convincing empirical evidence.

vs. Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

claude-opus-4.6 · 5/12/2026

CLEF introduces a novel foundation model paradigm for clinical EEG that addresses fundamental limitations of existing approaches—short-window decoding and lack of clinical context integration. It demonstrates substantial improvements across a massive 234-task benchmark with 260k+ sessions, showing strong methodological rigor and clear clinical applicability. The scale of evaluation, novel 3D spectrogram tokenization, and multimodal alignment with clinical reports/EHR represent significant technical contributions. Paper 2, while addressing an important topic in LLM safety evaluation, primarily identifies existing benchmark weaknesses without proposing transformative solutions, limiting its long-term impact.

vs. Mitigating Distribution Sharpening in Math RLVR via Distribution-Aligned Hint Synthesis and Backward Hint Annealing

claude-opus-4.6 · 5/12/2026

CLEF introduces a fundamentally new foundation model paradigm for clinical EEG that addresses critical gaps in neurology: session-scale modeling and clinical context integration. Its evaluation on a massive 234-task benchmark with 260k EEG sessions demonstrates broad clinical applicability across disease phenotypes, medications, and EEG findings. The breadth of impact spans neurology, clinical AI, and EHR integration. Paper 1, while technically sound, addresses an incremental optimization issue (distribution sharpening) in math RLVR with limited model sizes and narrow benchmarks, offering less transformative potential.

vs. Evidence Over Plans: Online Trajectory Verification for Skill Distillation

gemini-3.1 · 5/12/2026

Paper 1 presents a massive-scale, clinically grounded foundation model for EEG, evaluated on over 260k sessions and 234 tasks. Its methodological innovation in handling session-scale data and integrating multimodal clinical context offers profound potential impact for real-world healthcare and neurology. While Paper 2 provides valuable advancements in AI agent skill distillation, Paper 1's unprecedented scale, rigorous evaluation, and direct life-saving applications give it a higher overall scientific and societal impact.

vs. A Reflective Storytelling Agent for Older Adults: Integrating Argumentation Schemes and Argument Mining in LLM-Based Personalised Narratives

gemini-3.1 · 5/12/2026

Paper 1 presents a large-scale foundation model for clinical EEG, evaluated on a massive dataset of over 260k sessions across 234 clinical tasks. Its methodological rigor and potential to broadly transform neurological diagnostics and medical AI give it a significantly higher potential scientific impact compared to Paper 2, which focuses on a more niche HCI application with a much smaller evaluation scale.

vs. LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

gemini-3.1 · 5/12/2026

Paper 2 presents a novel, large-scale foundation model for clinical EEG analysis, evaluated on a massive dataset of 260k sessions across 234 tasks. Its methodological rigor and potential to significantly advance AI in healthcare give it profound real-world applicability and breadth of impact. In contrast, Paper 1 offers a practical but incremental software tool for LLM workflows, with a very small-scale qualitative evaluation (9 participants), making its scientific impact considerably lower.

vs. Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

gpt-5.2 · 5/12/2026

Paper 2 (CLEF) has higher likely scientific impact due to strong real-world clinical applicability, large-scale evidence (260k sessions, 108k patients), and broad utility across many downstream tasks (234-task benchmark) with substantial performance gains (AUROC 0.65→0.74). Its long-context session-scale modeling plus multimodal alignment to reports/EHR is timely and broadly relevant to ML-for-health and neuroscience. Paper 1 is novel infrastructure for agent meta-control with formalization and systems advances, but its impact is more specialized to agent runtime research and depends on ecosystem adoption.