Emotional intelligence in large language models is fragmented across perception, cognition, and interaction
Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han
Abstract
As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.
AI Impact Assessments (1 model)
Scientific Impact Assessment of FACET: Emotional Intelligence in LLMs is Fragmented
1. Core Contribution
This paper introduces FACET (Functional Affective Competence and Empathy Test), a 480-item bilingual (English/Chinese) benchmark for evaluating emotional intelligence (EI) in LLMs, grounded in the Mayer-Salovey-Caruso four-branch ability model. The central finding is that EI in LLMs is fragmented across perception, cognition, and interaction dimensions—models that excel at objective emotion recognition often fail at interactive emotional engagement. The paper categorizes models into three performance profiles (cognitive-dominant, interactive-dominant, context-dependent), identifies hidden emotion recognition as a universal bottleneck, and introduces the concept of "stochastic empathy"—statistical mimicry of emotional syntax without integrated affective reasoning.
The key novelty lies in the structured decomposition of EI into objective (perception/cognition) and subjective (interaction) components with a dual-metric evaluation approach (accuracy + Elo ratings), revealing that these capabilities are dissociated rather than correlated. This is a meaningful conceptual advance over prior work like EQ-Bench and EmoBench, which treat EI more monolithically.
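One way to check the "dissociated rather than correlated" claim is to rank-correlate per-model objective accuracy against per-model interaction Elo. The following is a minimal Python sketch of a Spearman rank correlation, written without external libraries. The function names are hypothetical, and nothing here reproduces the paper's own analysis, which may use a different statistic.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(accuracies, elo_ratings):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(accuracies), average_ranks(elo_ratings))
```

Calling `spearman` on one accuracy value and one Elo value per model gives a coefficient between -1 and 1. A value near zero would be consistent with dissociation, though with only nine models any such estimate carries wide uncertainty.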
2. Methodological Rigor
Strengths: The benchmark is well-structured with 10 diagnostically distinct task types and expert-crafted items developed through a multi-stage pipeline involving senior psychologists and licensed psychotherapists. The use of Elo ratings for subjective interaction evaluation is appropriate, and bias mitigation strategies (position bias, verbosity bias, named participant bias) are thoughtfully implemented. The bilingual design enables cross-cultural analysis grounded in Hall's High-Context/Low-Context framework.
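To make the Elo-based interaction score concrete, here is a minimal Python sketch of a standard Elo update driven by pairwise judge verdicts. Each pair is judged twice with the presentation order swapped, which is one common way to damp position bias. The starting rating of 1000, the K-factor of 32, the verdict labels, and all function names are assumptions for illustration; the paper's actual rating procedure and bias controls may differ.

```python
from collections import defaultdict

START_RATING = 1000.0  # assumed starting rating, not taken from the paper
K_FACTOR = 32.0        # assumed update step size, not taken from the paper


def new_ratings():
    """Create a rating table where unseen models start at START_RATING."""
    return defaultdict(lambda: START_RATING)


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def debiased_score(verdict_a_first, verdict_b_first):
    """Average the judge's verdicts from both presentation orders.

    Each verdict names the preferred model ("A" or "B") or "tie",
    already mapped back from on-screen position to model identity.
    Returns A's score in [0, 1]; split verdicts count as a draw.
    """
    points = {"A": 1.0, "tie": 0.5, "B": 0.0}
    return (points[verdict_a_first] + points[verdict_b_first]) / 2.0


def update_elo(ratings, model_a, model_b, score_a, k=K_FACTOR):
    """Apply one zero-sum Elo update in place, given A's score in [0, 1]."""
    exp_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (score_a - exp_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta
```

A typical loop would call `debiased_score` for each pair of responses to an item and then `update_elo` with the result. Because sequential Elo depends on match order, repeating the loop over several shuffled orderings and averaging the ratings is a reasonable precaution.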
Concerns: Several methodological issues warrant scrutiny:
3. Potential Impact
The paper addresses a genuinely important problem: as LLMs are deployed in mental health support, companionship, and clinical settings, understanding the structural integrity of their emotional capabilities is critical for safety. The findings have several practical implications:
The benchmark itself could become a standard evaluation tool, though its 480-item size and expert-dependent construction may limit scalability.
4. Timeliness & Relevance
The paper is highly timely. The rapid deployment of LLMs in emotionally sensitive applications (therapy chatbots, crisis hotlines, companion AI) creates urgent need for nuanced EI evaluation beyond surface-level politeness metrics. The inclusion of very recent models (GPT-5, Claude-Sonnet-4, Grok-4) makes the evaluation immediately relevant. The growing concern about AI safety in emotional contexts—particularly the risk of models providing inappropriate responses in crisis situations—makes this work directly applicable to current regulatory and deployment discussions.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
FACET represents a meaningful advance in LLM emotional intelligence evaluation, offering both a well-designed benchmark and genuinely informative findings about the fragmented nature of machine emotional competence. The perception-interaction dissociation is the paper's most important contribution, with clear implications for alignment research. However, the psychometric validation claims exceed the evidence provided, and the reliance on LLM-as-judge for the most novel dimensions (interaction) introduces circularity that somewhat undermines the findings' robustness.
Generated May 26, 2026
Comparison History (23)
Paper 1 is likely to have higher impact: it introduces a substantial new psychometrically grounded benchmark (FACET, 480 expert items) tied to a canonical EI theory, reports empirical results across multiple frontier models, and surfaces actionable failure modes (e.g., hidden emotion recognition bottleneck, “stochastic empathy” from RLHF). This has immediate safety/clinical/HCI applications and could reshape evaluation and alignment practices. Paper 2 provides a useful unifying perspective and taxonomy, but is more synthesis than new empirical/methodological contribution, so its incremental impact is likely smaller.
Paper 1 addresses a critical safety gap in the high-stakes domain of clinical AI by evaluating how models adapt to changing patient contexts. Its focus on actual treatment decisions over static medical QA has immediate, life-saving implications for healthcare deployment, giving it higher potential impact than the broader, but less critical, evaluation of emotional intelligence in Paper 2.
Paper 2 addresses a more fundamental question about LLM capabilities (introspection/metacognition) with rigorous methodological critiques of existing work, introducing clever control conditions that expose confounds. Its findings challenge premature conclusions in the field and have broad implications for AI interpretability and alignment research. Paper 1, while introducing a useful benchmark (FACET) for emotional intelligence evaluation, is more incremental—primarily a benchmarking study. Paper 2's methodological contributions (distinguishing genuine introspection from pattern matching, demonstrating insufficiency of behavioral evidence) provide deeper, more generalizable insights for the field.
Paper 1 addresses a fundamental gap in AI safety by distinguishing controllability from alignment—a timely and critical issue as agentic AI systems proliferate. It introduces a concrete benchmark (ControlBench), proposes actionable architectural principles, and targets a problem with immediate real-world safety implications. Paper 2 offers a valuable psychometric framework for evaluating LLM emotional intelligence, but its domain is narrower and less urgent. The controllability framework has broader cross-field impact (safety, governance, deployment) and addresses a more pressing need as autonomous AI agents become widespread.
Paper 1 addresses a critical frontier in AI alignment by bridging psychometrics and LLM evaluation. Its finding that RLHF optimizes for 'stochastic empathy' challenges current training paradigms and has broad interdisciplinary implications across AI safety, psychology, and clinical applications. While Paper 2 offers a valuable technical solution for LLM agent efficiency, Paper 1's conceptual novelty and broader impact on how we understand and evaluate affective reasoning in foundational models give it a higher potential scientific impact.
Paper 2 likely has higher impact: it provides a first large-scale empirical characterization of an emerging socio-technical infrastructure (A2A collaboration networks) using substantial real-world data (1.5M assets, 128K agents) and identifies concrete, actionable failure modes (misaligned incentives, manipulable ranking, unverifiable testing) with clear design implications for secure, auditable AI ecosystems. Its findings generalize to platform design, economics, security, and AI governance. Paper 1 introduces a valuable EI benchmark and conceptual insight, but impact may be narrower (evaluation/psychometrics of LLM affect) and more sensitive to benchmark adoption and construct validity debates.
Paper 2 likely has higher scientific impact due to stronger novelty and cross-disciplinary breadth: it introduces a psychometrically grounded EI benchmark (FACET) anchored in an established theory, reveals a conceptually important finding (EI fragmentation across perception/cognition/interaction), and links results to alignment mechanisms (RLHF) with clear safety implications. Its applications span AI safety, HCI, mental health, and policy. Paper 1 is valuable and rigorous (large dataset + toolkit) but is more domain-specific (mobile GUI navigation) and its core insights (scaling + RL helps) are less broadly paradigm-shifting.
Paper 1 addresses a foundational challenge in AI alignment by bridging psychology and LLM evaluation. Its rigorous, theoretically anchored approach to deconstructing emotional intelligence has broad implications for deploying AI in sensitive domains like healthcare and customer service. Paper 2 is highly practical and addresses the critical issue of hallucinations, but its scope is more narrowly focused on a technical optimization (layer selection) compared to Paper 1's broader interdisciplinary impact and conceptual innovation regarding 'stochastic empathy'.
Paper 2 likely has higher scientific impact due to broader cross-field relevance (AI alignment, HCI, psychology/psychometrics, clinical and education applications) and timeliness as LLMs enter sensitive social domains. FACET’s theoretically grounded, psychometrically structured benchmark and its finding of fragmented EI (plus identifiable bottlenecks like hidden emotion recognition) offers a general diagnostic framework that can shape evaluation and training paradigms across many models and settings. Paper 1 is valuable and practical for autonomous driving, but is narrower in scope and relies on proxy/open-loop evaluation and an oracle applicability assumption, limiting immediate generalizability.
Paper 2 has higher potential impact due to its broad relevance across AI alignment, HCI, and psychology. While Paper 1 offers a strong technical contribution for embodied AI, Paper 2 addresses a critical, ubiquitous issue in foundational LLMs: the superficiality of RLHF-driven emotional responses ('stochastic empathy'). By bridging psychometric theory with LLM evaluation, Paper 2 provides a foundational benchmark that challenges current scaling assumptions and provides a roadmap for AI safety, making it highly influential for the future deployment of AI in human-facing and clinical applications.
Paper 1 addresses a fundamental, high-level capability of LLMs (Emotional Intelligence) by bridging AI alignment with established psychological theory. Its identification of 'stochastic empathy' and the fragmentation of EI challenges current scaling laws and RLHF paradigms, offering deep theoretical novelty. Paper 2, while highly practical, focuses on a narrower engineering challenge (robustness to UI noise). Paper 1 has broader interdisciplinary impact, influencing human-AI interaction, clinical applications, and AI safety methodologies.
Paper 2 is more likely to have higher scientific impact due to stronger novelty (using EEG-to-image generation as a proxy “visual translator” for grounding), clearer real-world clinical applications (EEG interpretation), and broader cross-field reach (BCI, neuroimaging, generative modeling, multimodal foundation models). The approach is timely amid brain-foundation-model efforts and data scarcity. Paper 1 introduces a rigorous EI benchmark and important safety insights, but its primary contribution is evaluative/diagnostic and may have narrower downstream utility compared to a method enabling new capabilities for neural-signal understanding.
Paper 1 introduces FACET, a novel psychometrically grounded benchmark for evaluating emotional intelligence in LLMs, addressing a critical gap in AI safety and alignment. Its findings—that EI is fragmented, not monolithic, and that RLHF produces 'stochastic empathy'—have broad implications across AI development, clinical applications, and alignment research. The identification of hidden emotion recognition as a universal bottleneck and distinct performance profiles provides actionable insights for the field. Paper 2 makes a solid contribution to LLM-assisted qualitative analysis but addresses a narrower methodological niche with more incremental improvements.
Paper 2 likely has higher scientific impact: it proposes a concrete, generally applicable PEFT method (PALoRA) addressing a widely felt practical problem—updating LLMs without degrading reasoning—grounded in spectral analysis with a clear algorithmic contribution and strong empirical evidence across models and benchmarks. Its applicability spans continual learning, model editing, deployment, and efficient adaptation, making it timely and broadly useful. Paper 1 offers a valuable benchmark and insight into EI fragmentation, but its impact is more niche and depends on adoption of the FACET test and on contested constructs/measurement validity in affective evaluation.
Paper 1 addresses a highly timely issue by introducing a novel, psychometrically grounded framework for evaluating emotional intelligence in frontier LLMs. Its findings challenge prevailing assumptions about model scaling and RLHF, potentially reshaping future AI alignment strategies. Paper 2, while clinically valuable, represents a more incremental application of existing ML techniques to speech analysis. Consequently, Paper 1 promises a broader, paradigm-shifting impact across AI safety, cognitive science, and human-computer interaction.
Paper 2 likely has higher scientific impact due to a concrete, training-free, broadly deployable method that directly improves reliability in vision-language systems—a major, timely bottleneck with immediate real-world applications. Its region-aware attention recalibration is actionable, computationally efficient, and validated on widely used benchmarks with state-of-the-art results, increasing adoption potential. Paper 1 offers a valuable psychometrically grounded EI benchmark and conceptual framing, but its impact may be narrower and more evaluative than method-enabling, with less direct downstream integration into production systems.
Paper 2 bridges AI, psychology, and safety, addressing a critical frontier in AI alignment. Its psychometrically grounded framework and insights into 'stochastic empathy' have broad, multi-disciplinary implications for human-computer interaction and clinical applications. Conversely, Paper 1 addresses an important but narrower systems engineering and performance measurement issue.
Paper 2 addresses a critical, timely issue with broad implications across AI safety, HCI, and psychology. By introducing a psychometrically grounded framework to evaluate LLMs and challenging current alignment paradigms (RLHF), it offers significantly wider cross-disciplinary impact and real-world relevance compared to Paper 1, which focuses on a more narrowly scoped algorithmic advancement for imperfect-information games.
Paper 1 addresses a foundational issue in AI alignment and human-AI interaction by evaluating emotional intelligence in frontier LLMs. Its introduction of a psychometrically grounded framework and discovery of 'stochastic empathy' have broad implications across psychology, HCI, and AI safety. In contrast, Paper 2 focuses on a narrower, albeit rigorous, application of proactive task-oriented dialogue for specific domains like outbound sales, making Paper 1's potential breadth of impact and relevance significantly higher.
Paper 1 has higher impact potential due to stronger novelty and methodological contribution: it introduces a new task (Grounded Personality Reasoning), a sizable multimodal dataset with timestamped evidence, and a multi-tier evaluation with concrete failure-mode metrics that can become standard for auditing “right answer for the right reason.” Its breadth spans multimodal grounding, social cognition, bias/prejudice analysis, and benchmark design, with clear real-world relevance for human-facing agents. Paper 2 is timely and psychometrically motivated, but is smaller in scale and less methodologically actionable beyond a fixed-item test.