CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su

May 17, 2026

arXiv:2605.17370v1

cs.AI (primary)

#1548 of 2292 · Artificial Intelligence

Tournament Score: 1368±43 (displayed range 1050 to 1800)

Win Rate: 47%

Rating: 5.8/10

Significance 6 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Abstract

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text-based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text-based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an expert-annotated subset. We evaluate 10 open-source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: CBT-Audio

1. Core Contribution

CBT-Audio addresses a genuine gap at the intersection of AI for mental health and audio language model evaluation. The paper's central argument is well-motivated: CBT is fundamentally a spoken interaction, yet nearly all AI-CBT research operates on text alone, missing paralinguistic cues (tone, hesitation, pace, tremor) that therapists routinely use to assess patient distress. The paper contributes: (1) a dataset of 1,802 patient turns from 96 publicly available CBT recordings with turn-level distress labels on a 1–5 scale; (2) a systematic evaluation of 10 open-source ALMs under three input conditions (audio-only, transcript-only, audio+transcript); and (3) evidence that audio provides complementary information to transcripts, with audio+transcript improving over transcript-only in 8 of 10 model families.

The key finding—that audio doesn't reliably beat text alone but consistently helps *when combined* with text—is nuanced and practically important. It reframes the value of audio not as a replacement for text but as a complementary signal, particularly when verbal content and vocal delivery diverge.

2. Methodological Rigor

Strengths in design: The controlled three-condition evaluation (AO, TO, AT) on identical patient turns is well-designed, enabling clean modality comparisons. The use of paired Wilcoxon signed-rank tests with bootstrap confidence intervals is appropriate for this ordinal prediction task.
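To make the paired design concrete, here is a minimal sketch of a percentile bootstrap confidence interval for the per-turn difference between two input conditions. It assumes absolute error against the reference label as the per-turn metric, which the assessment does not specify; the function name, toy numbers, and resampling settings are illustrative and not taken from the paper.

```python
import random
from statistics import mean


def paired_bootstrap_ci(errors_a, errors_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (errors_a - errors_b).

    Both lists must hold per-turn errors for the SAME turns, in the same order.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("paired inputs must have equal length")
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample turns with replacement and record the mean difference each time.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Toy absolute errors on the same eight turns (hypothetical values).
to_err = [1, 2, 1, 0, 2, 1, 3, 1]  # transcript-only
at_err = [0, 1, 1, 0, 1, 1, 2, 0]  # audio+transcript
point, lo, hi = paired_bootstrap_ci(to_err, at_err)
print(f"mean error reduction: {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

A positive mean difference here means the audio+transcript condition has lower error on the toy data. The significance test itself could be run on the same paired values with `scipy.stats.wilcoxon`.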

Labeling pipeline concerns: The reference labels are generated via GPT-audio-1.5 using a semantic similarity rating (SSR) approach rather than direct human annotation. While the SSR method (generating descriptions, embedding them, matching to anchors) is creative and avoids single-model scale bias, it means the ground truth is fundamentally model-generated. The expert validation on 194 clips (81.4% within ±1 agreement) provides some reassurance, but this is a lenient criterion on a 5-point scale—chance agreement within ±1 would already be substantial. The inter-rater reliability among human experts is only moderate (Krippendorff's α = 0.439), which both validates that distress rating is subjective and raises questions about the ceiling for any evaluation system.
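The point about leniency can be checked with a short calculation. The sketch below computes within ±1 agreement for two label lists and the agreement expected by chance for two independent raters with given rating distributions. The uniform distribution is an assumption chosen for illustration; the actual label marginals in CBT-Audio are not reproduced here.

```python
from fractions import Fraction
from itertools import product


def within_one_agreement(labels_a, labels_b):
    """Fraction of paired labels that differ by at most one scale point."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have equal length")
    hits = sum(abs(a - b) <= 1 for a, b in zip(labels_a, labels_b))
    return hits / len(labels_a)


def chance_within_one(dist_a, dist_b):
    """Expected within ±1 agreement for two independent raters.

    dist_a and dist_b map each rating to its probability.
    """
    return sum(
        pa * pb
        for (ra, pa), (rb, pb) in product(dist_a.items(), dist_b.items())
        if abs(ra - rb) <= 1
    )


# Assumption: both raters pick each of the five ratings with equal probability.
uniform = {rating: Fraction(1, 5) for rating in range(1, 6)}
chance_uniform = chance_within_one(uniform, uniform)
print(chance_uniform)  # 13/25, i.e. 52%

# Toy paired labels (hypothetical values).
observed = within_one_agreement([1, 2, 3, 4, 5], [2, 2, 5, 4, 3])
print(observed)  # 0.6
```

Even under uniform random ratings, 13 of the 25 possible rating pairs fall within ±1. When both raters' labels cluster around the same mid-scale values, the chance level is typically higher still, which is why the 81.4% figure is best read against a baseline of this kind.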

Data limitations: The recordings are educational role-plays and case-study walkthroughs, not real therapy sessions. The authors acknowledge this but it significantly limits ecological validity. Actors and trainees may display more stereotypical or exaggerated emotional patterns than actual patients, potentially inflating the apparent utility of audio cues. The English-only constraint further limits generalizability.

Label distribution: The comparison between SSR and direct numeric prompting (Figure 5) shows SSR produces a broader distribution, but one could argue this is an artifact rather than a virtue—the highly concentrated distribution from direct prompting might reflect genuine base rates in educational recordings where extreme distress is rare.
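For readers unfamiliar with the SSR idea discussed in this section (describe, embed, match to anchors), the matching step can be illustrated with a toy sketch. Everything here is an assumption for illustration: the two-dimensional vectors stand in for real text embeddings, nearest-anchor selection by cosine similarity is one plausible matching rule, and the paper's actual embedding model and aggregation are not described in this assessment.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_anchor(description_vec, anchor_vecs):
    """Return the rating whose anchor embedding is most similar to the description."""
    return max(
        anchor_vecs,
        key=lambda rating: cosine(description_vec, anchor_vecs[rating]),
    )


# Hypothetical anchor embeddings for ratings 1..5, spread evenly from 0 to 90 degrees.
anchors = {
    rating: [
        math.cos(math.radians(22.5 * (rating - 1))),
        math.sin(math.radians(22.5 * (rating - 1))),
    ]
    for rating in range(1, 6)
}

# Hypothetical embedding of a model-written description of one patient turn.
description = [0.3, 0.9]
predicted = nearest_anchor(description, anchors)
print(predicted)  # 4
```

Because the label comes from a similarity match rather than a number the model states directly, the resulting distribution depends on how the anchors are worded and embedded, which is relevant to the question of whether the broader SSR distribution is signal or artifact.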

3. Potential Impact

Immediate utility: CBT-Audio fills a practical need for benchmarking ALMs on clinical-adjacent speech tasks. The mental health AI community has been constrained by text-only datasets, and even an imperfect audio benchmark creates new evaluation possibilities.

Broader implications: The finding that audio+transcript outperforms transcript-only supports the development of multimodal therapy support tools. This could influence how future therapy AI systems are designed—arguing for audio processing capabilities even when transcripts are available.

Clinical translation: The paper is careful not to overclaim clinical applicability, which is appropriate. However, the distance from educational role-plays to real clinical settings is substantial, and the work should be viewed as a proof-of-concept for evaluation methodology rather than evidence for clinical deployment.

Dataset contribution: The release of metadata (URLs, timestamps, labels, code) without redistributing audio is a pragmatic approach to reproducibility under copyright constraints, though it introduces fragility—YouTube videos can be removed.

4. Timeliness & Relevance

The paper is well-timed. ALMs have proliferated rapidly (the 10 evaluated models span 2023–2026), and there is active interest in applying these models to mental health applications. The paper correctly identifies that existing speech emotion recognition benchmarks (IEMOCAP, MELD, RAVDESS) use general-domain categorical emotions rather than clinical constructs like distress intensity in therapeutic contexts. The focus on open-source models is also timely given privacy concerns around sending sensitive clinical audio to commercial APIs.

5. Strengths & Limitations

Key strengths:

Well-articulated motivation grounded in clinical practice (therapists attend to voice-content mismatches)

Clean experimental design enabling direct modality comparisons

Comprehensive evaluation across 10 diverse ALM architectures

Thoughtful case studies (Figure 3) that provide interpretable examples of when and why audio helps

Responsible framing—explicitly disclaims clinical diagnostic use

Expert validation panel with relevant clinical backgrounds

Notable limitations:

Model-generated reference labels create a circularity concern: the benchmark measures how well ALMs agree with another LM's assessment, not necessarily clinical ground truth

Educational role-play data may not generalize to real therapy

The 1–5 ordinal scale is coarse, and the "within ±1" agreement criterion is lenient

No fine-tuning experiments—all models are evaluated zero-shot, which limits understanding of what's achievable with task-specific training

The dataset is relatively small (1,802 turns) and English-only

No analysis of acoustic features that drive model decisions (e.g., pitch, pause duration), which would strengthen interpretability

URL-based distribution is fragile for long-term reproducibility

Missing comparisons: The paper doesn't compare against traditional speech emotion recognition systems or acoustic feature extractors (e.g., OpenSMILE + classifier), which would help contextualize whether ALMs offer advantages over simpler approaches.

Summary

CBT-Audio makes a meaningful contribution by introducing an evaluation framework for audio language models in a clinically-motivated task. The controlled experimental design is a strength, and the core finding about audio's complementary value is both believable and useful. However, the reliance on model-generated labels, educational role-play data, and the absence of traditional baselines temper the impact. This is best viewed as foundational work that opens a new evaluation direction rather than definitive evidence about ALM capabilities in clinical settings.

Rating: 5.8/10

Significance 6 · Rigor 5.5 · Novelty 6.5 · Clarity 7.5

Generated May 19, 2026

Comparison History (19)

vs. Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

gpt-5.2 · 5/20/2026

Paper 1 offers a novel, general formalization of trust calibration for agentic tool use as preference learning, connecting it to Preferential Bayesian Optimization with a clear uncertainty-driven querying strategy. This has broad, timely applicability across autonomous agents, human-in-the-loop governance, safety, and deployment policy (allow/block/ask). Paper 2 provides a valuable dataset and evaluation for audio LMs in CBT, with concrete application relevance, but its impact is narrower to mental-health NLP and constrained by dataset size and privacy-driven limits on generalization. Overall, Paper 1 is likely to influence more methods and domains.

vs. Interference-Aware Multi-Task Unlearning

gpt-5.2 · 5/20/2026

Paper 1 is likely to have higher scientific impact due to its methodological novelty and broad relevance: it extends machine unlearning to realistic multi-task shared-backbone settings, identifies concrete interference mechanisms, and proposes a generally applicable optimization framework (task-aware projection + instance-level orthogonalization) with clear quantitative gains. This advances a timely area (unlearning/privacy/compliance) with potential deployment implications across many multi-task domains. Paper 2 provides a valuable evaluation dataset and benchmark for audio-based distress estimation in CBT, but its impact is narrower (mental-health audio evaluation) and more incremental on methodology.

vs. Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

claude-opus-4.6 · 5/19/2026

Paper 2 (CBT-Audio) addresses a critical gap in mental health AI by introducing a novel, ethically-grounded dataset for evaluating audio language models in CBT. It has broader real-world applicability in clinical mental health settings, fills an important data scarcity problem, and provides a reusable benchmark. Paper 1, while intellectually interesting in probing MLLM spatial reasoning and Theory of Mind, addresses a more niche problem with a complex framework whose practical applicability is less immediate. Paper 2's contribution of a shareable spoken CBT dataset and systematic evaluation across 10 models is more likely to catalyze follow-up research across NLP, clinical AI, and mental health communities.

vs. WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

claude-opus-4.6 · 5/19/2026

CBT-Audio addresses a significant gap at the intersection of AI, mental health, and audio language models. It introduces a novel dataset for an underexplored modality (audio) in CBT evaluation, with clear clinical relevance. The finding that audio provides complementary information to text—especially when verbal content and vocal delivery diverge—has meaningful implications for therapeutic AI systems. WebGameBench, while well-constructed, addresses a narrower problem in coding agent evaluation with less cross-disciplinary impact. CBT-Audio's contributions to mental health AI, multimodal evaluation, and clinical applications give it broader and more lasting scientific impact.

vs. Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to broader relevance and timeliness: it introduces a rigorous, decision-grade benchmark for deep research agents used in enterprise settings, with multi-layer evaluation (deterministic verifiers + SME rubrics) and adversarial “cognitive traps,” enabling more reliable measurement across models and deployments. Its methodology and scoring framework are generalizable across domains and can shape evaluation standards for agentic systems. Paper 1 is novel and valuable for mental-health audio modeling, but its impact is narrower due to dataset size, domain specificity, and data availability constraints.

vs. Skim: Speculative Execution for Fast and Efficient Web Agents

gemini-3.1 · 5/19/2026

Skim addresses critical bottlenecks—cost and latency—in the rapidly growing field of AI web agents. By introducing speculative execution to bypass heavyweight components without accuracy loss, it offers broad, immediate utility for deploying practical AI agents across numerous domains. While Paper 2 provides a valuable dataset for mental health AI, Paper 1's system-level optimization has wider cross-domain applicability and directly accelerates the broader adoption of autonomous agents.

vs. TopoEvo: A Topology-Aware Self-Evolving Multi-Agent Framework for Root Cause Analysis in Microservices

claude-opus-4.6 · 5/19/2026

CBT-Audio addresses a significant gap at the intersection of AI and mental health by introducing a novel dataset and evaluation framework for audio-based distress estimation in therapy. It tackles real ethical/privacy challenges, enables multimodal (audio+text) evaluation of language models in clinical settings, and has broad applicability across mental health AI research. Paper 2, while technically sophisticated, addresses a narrower DevOps/microservices problem with an incremental multi-agent framework. CBT-Audio's unique dataset contribution, clinical relevance, and potential to catalyze research in audio language models for mental health give it broader and more lasting impact.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel dataset (CBT-Audio) addressing a significant gap in mental health AI research—the lack of audio-based evaluation for CBT. It bridges audio and text modalities, demonstrating that vocal cues add value beyond transcripts for distress estimation. This has broad implications for clinical AI, multimodal language models, and mental health applications. Paper 2 makes valuable contributions to legal AI by examining contamination and neuro-symbolic methods, but addresses a narrower domain (tax law) with less novelty in its core findings. Paper 1's new resource and cross-disciplinary relevance give it higher impact potential.

vs. Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

gemini-3.1 · 5/19/2026

Paper 1 addresses a fundamental limitation in LLM agents—long-term memory retrieval—with a novel causal intervention framework. This foundational methodology offers significant breadth of impact across numerous domains relying on AI agents. While Paper 2 provides a valuable dataset and insights for mental health applications, its impact is primarily confined to the specialized subfield of clinical audio-NLP rather than advancing foundational AI capabilities.

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to its creation of a new, ethically sourced benchmark dataset (CBT-Audio) addressing a major data bottleneck in mental-health AI, with clear real-world clinical relevance and potential to shape evaluation across audio-language modeling. It offers broadly reusable resources (data, protocol, baselines) and is timely given rapid growth in multimodal/audio LMs. Paper 1 is methodologically innovative for RL in diffusion multimodal LLMs, but impact may be narrower (primarily generative model training) and more incremental within a crowded image-generation optimization space.

vs. $π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

gemini-3.1 · 5/19/2026

Paper 2 addresses a fundamental bottleneck in the rapidly expanding field of autonomous AI agents: proactive intent resolution in long-horizon tasks. While Paper 1 provides a highly valuable multimodal dataset for a crucial niche (mental health), Paper 2 introduces a benchmark that is broadly applicable across virtually all domains involving AI assistants. As AI research aggressively shifts from reactive chatbots to proactive, agentic workflows, Paper 2's methodology and focus offer significantly broader cross-disciplinary impact and relevance to the current frontier of AI.

vs. Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

gpt-5.2 · 5/19/2026

Paper 1 likely has higher scientific impact due to broader relevance and stronger methodological contribution. It provides a large, controlled, token-costed ablation study across multiple model families and agent design axes, yielding actionable design principles (e.g., state abstraction boosts return-per-token; “deliberation cascade” harms performance). These insights generalize to many LLM-agent deployments in partially observable/adversarial settings (cyber, robotics, operations), making it timely for agentic AI. Paper 2 is valuable and novel as a dataset/evaluation benchmark for CBT audio distress, but its scope is narrower and constrained by dataset size and domain-specific applicability.

vs. Learning Bilevel Policies over Symbolic World Models for Long-Horizon Planning

claude-opus-4.6 · 5/19/2026

Paper 1 addresses a fundamental challenge in embodied AI—long-horizon planning—with a novel bilevel approach combining symbolic reasoning and imitation learning. Its demonstrated scalability (10,000 objects) and generalization capabilities represent significant advances with broad applications in robotics and AI planning. Paper 2, while addressing a meaningful gap in mental health AI by introducing audio modality for CBT distress estimation, is more niche in scope, presents an evaluation benchmark rather than a methodological breakthrough, and has a smaller dataset (1,802 turns). Paper 1's broader applicability across robotics, planning, and AI gives it higher potential impact.

vs. How Much is Brain Data Worth for Machine Learning?

gemini-3.1 · 5/19/2026

While Paper 1 offers a valuable dataset with immediate practical applications in mental health, Paper 2 addresses a fundamental theoretical question in the emerging field of NeuroAI. By establishing mathematical scaling laws and formalizing the value of brain data for machine learning, Paper 2 provides foundational insights that could influence a much broader range of future research bridging neuroscience and artificial intelligence, leading to a higher potential for widespread scientific impact.

vs. DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

gemini-3.1 · 5/19/2026

Paper 1 offers higher scientific impact by addressing a critical bottleneck in AI for mental health: the lack of audio-inclusive datasets for CBT. By introducing a novel dataset capturing acoustic cues of distress—often missed by text-only models—it opens new interdisciplinary research pathways in multimodal clinical AI. While Paper 2 presents a highly successful, large-scale industrial deployment of LLMs, its contributions are primarily engineering and commercial optimizations. Paper 1 provides a foundational resource for a socially vital, technically challenging domain, granting it broader potential impact across the scientific community.

vs. PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

gpt-5.2 · 5/19/2026

Paper 2 likely has higher scientific impact due to its methodological novelty (prefix-aware internal reward modeling to enable dense step-level signals), broad applicability to multi-turn agent optimization across domains, and strong timeliness given current interest in RL for LLM agents and GRPO-like methods. It targets a core bottleneck—credit assignment without expensive rollouts, judges, or ground truth—offering practical scalability. Paper 1 is valuable and application-relevant, but its impact is narrower (CBT distress estimation) and more dataset/evaluation-centric with constrained generalization due to data privacy and domain specificity.

vs. Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

claude-opus-4.6 · 5/19/2026

CBT-Audio addresses a significant gap at the intersection of AI, mental health, and speech processing by introducing a novel benchmark dataset and evaluation framework for audio language models in clinical settings. It has broader real-world applications in mental healthcare, introduces a reusable resource (1,802 labeled turns), and opens a new research direction combining multimodal AI with psychotherapy. Paper 2 provides useful insights into LLM negotiation limitations but is more narrowly focused on documenting a known class of LLM shortcomings (reasoning-to-action gaps) in a controlled setting with less immediate practical impact.

vs. How do Humans Process AI-generated Hallucination Contents: a Neuroimaging Study

gpt-5.2 · 5/19/2026

Paper 1 is likely to have higher impact because it releases a scarce, ethically difficult-to-obtain evaluation dataset for spoken CBT, enabling reproducible benchmarking of audio language models in mental health—an area with clear real-world application. It tests multiple models and input modalities with expert-validated labels, offering actionable evidence that audio adds value beyond transcripts. Paper 2 is novel and timely but has narrower generalizability (EEG, n=27, specific MLLM task) and yields mainly descriptive neurocognitive findings, which may translate less directly into broadly adopted methods or resources.

vs. Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

gemini-3.1 · 5/19/2026

Paper 1 introduces a highly novel dataset and multimodal evaluation framework in mental health AI, a domain notoriously bottlenecked by privacy constraints. By bridging the gap between text and audio analysis in clinical settings, it enables significant future research in affective computing and healthcare AI. Paper 2, while showing strong practical results for enterprise SRE workflows, focuses on a narrower commercial application with less fundamental scientific breadth compared to clinical psychological modeling.