Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

Minghao Lv, Lu Chen, Enchang Zhang, Anji Zhou, Xiaoran Xue, Hanyi Zhang, Fenghua Tang, Zhuo Rachel Han

May 23, 2026

arXiv:2605.24686v1 PDF

cs.AI(primary)

#782 of 2682 · Artificial Intelligence

Tournament Score: 1454 ± 41 (scale 1050 to 1800)

Win Rate: 61%

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

As large language models (LLMs) are increasingly integrated into emotionally sensitive domains, the structural integrity of their emotional intelligence (EI) becomes a critical frontier for safety and alignment. Current benchmarks often conflate superficial politeness with deep affective reasoning, failing to distinguish between perceptual accuracy and interactive efficacy. Here, we introduce FACET (Functional Affective Competence and Empathy Test), a psychometrically grounded framework comprising 480 expert-crafted items. Unlike previous metrics, FACET is theoretically anchored in the Mayer-Salovey-Caruso four-branch ability model, operationalizing EI through perception, facilitation, understanding, and management of emotions. Through an evaluation of nine frontier models (including GPT-5, Claude-Sonnet-4), we demonstrate that emotional intelligence is not a monolithic capability but is fragmented across cognitive and interactive dimensions. While frontier models demonstrate robust proficiency in objective emotion recognition and social reasoning, this does not consistently translate to interactive success. We categorize these discrepancies into three distinct performance profiles: cognitive-dominant, interactive-dominant, and context-dependent. These typologies indicate that emotional skills do not scale uniformly with general intelligence or model size; rather, they are shaped by specific alignment paradigms. Notably, we identify hidden emotion recognition as a universal performance bottleneck across all architectures. Our results suggest that current RLHF processes may optimize for "stochastic empathy", a statistical mimicry of emotional syntax, at the expense of integrated affective reasoning. These findings challenge the assumption of linear emotional scaling and provide a rigorous roadmap for developing socially aware agents capable of genuine clinical resonance.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: FACET — Emotional Intelligence in LLMs is Fragmented

1. Core Contribution

This paper introduces FACET (Functional Affective Competence and Empathy Test), a 480-item bilingual (English/Chinese) benchmark for evaluating emotional intelligence (EI) in LLMs, grounded in the Mayer-Salovey-Caruso four-branch ability model. The central finding is that EI in LLMs is fragmented across perception, cognition, and interaction dimensions—models that excel at objective emotion recognition often fail at interactive emotional engagement. The paper categorizes models into three performance profiles (cognitive-dominant, interactive-dominant, context-dependent), identifies hidden emotion recognition as a universal bottleneck, and introduces the concept of "stochastic empathy"—statistical mimicry of emotional syntax without integrated affective reasoning.

The key novelty lies in the structured decomposition of EI into objective (perception/cognition) and subjective (interaction) components with a dual-metric evaluation approach (accuracy + Elo ratings), revealing that these capabilities are dissociated rather than correlated. This is a meaningful conceptual advance over prior work like EQ-Bench and EmoBench, which treat EI more monolithically.

2. Methodological Rigor

Strengths: The benchmark is well-structured with 10 diagnostically distinct task types and expert-crafted items developed through a multi-stage pipeline involving senior psychologists and licensed psychotherapists. The use of Elo ratings for subjective interaction evaluation is appropriate, and bias mitigation strategies (position bias, verbosity bias, named participant bias) are thoughtfully implemented. The bilingual design enables cross-cultural analysis grounded in Hall's High-Context/Low-Context framework.
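The combination of Elo ratings and position-bias mitigation mentioned above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `judge` callable, the K-factor of 16, the starting rating of 1500, and the order-swapping scheme are all assumptions chosen for illustration. The idea is that each pair of responses is shown to the judge in both orders and the two verdicts are averaged, so a judge that simply prefers whichever response appears first contributes a tie.

```python
from itertools import combinations


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, a, b, score_a, k=16.0):
    """Apply one zero-sum Elo update in place; score_a is 1.0, 0.5, or 0.0."""
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta


def debiased_score(judge, item, a, b):
    """Ask the (hypothetical) judge in both presentation orders and average.

    judge(item, first, second) is assumed to return the score of the
    first-shown response: 1.0 for a win, 0.0 for a loss, 0.5 for a tie.
    """
    a_shown_first = judge(item, a, b)
    a_shown_second = 1.0 - judge(item, b, a)  # convert to A's perspective
    return (a_shown_first + a_shown_second) / 2.0


def run_tournament(models, items, judge, k=16.0, start=1500.0):
    """Round-robin over all model pairs for every item."""
    ratings = {m: start for m in models}
    for item in items:
        for a, b in combinations(models, 2):
            update_elo(ratings, a, b, debiased_score(judge, item, a, b), k)
    return ratings
```

Averaging over both orders doubles the number of judge calls, which is the usual cost of this kind of mitigation. Verbosity bias and named-participant bias would need separate controls that this sketch does not attempt.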

Concerns: Several methodological issues warrant scrutiny:

LLM-as-a-Judge limitations: The subjective evaluation relies on Gemini-2.5-Pro as the primary judge, yet Gemini is simultaneously one of the nine evaluated models. While the authors report calibration against human expert annotations (75 questions, 405 instances), the Kappa values for Expression Naturalness are notably low across all candidate judges (0.33–0.57), suggesting this critical dimension may not be reliably evaluated. The circular dependency—using one frontier model to judge others—introduces systematic bias that is only partially addressed.

Limited psychometric validation evidence: The paper claims "psychometric validation" but provides limited evidence of classical psychometric properties (e.g., internal consistency, test-retest reliability, factor analysis confirming the proposed dimensional structure). Content validity is claimed through expert involvement, but construct validity evidence is largely asserted rather than demonstrated statistically.

Temperature and sampling: Objective evaluations use "default temperature" without specifying exact values per model, which could introduce uncontrolled variance.

Cultural proxy assumption: Treating English and Chinese as proxies for Low-Context and High-Context cultures is a significant simplification. Both languages encompass diverse cultural contexts, and the dichotomy, while theoretically grounded, may oversimplify cultural variation.
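To make the judge-reliability concern concrete, the sketch below computes an agreement statistic between an LLM judge and a human expert on the same set of pairwise verdicts. It assumes Cohen's kappa; the review does not say which kappa variant the paper reports, and the function name and label format are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same instances.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Because kappa discounts chance agreement, raw agreement can look respectable while kappa stays low. Values in the 0.33 to 0.57 range, as reported for Expression Naturalness, are usually described as only fair to moderate agreement.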

3. Potential Impact

The paper addresses a genuinely important problem: as LLMs are deployed in mental health support, companionship, and clinical settings, understanding the structural integrity of their emotional capabilities is critical for safety. The findings have several practical implications:

Alignment research: The identification that RLHF may produce "stochastic empathy" provides actionable insight for alignment researchers to move beyond template-based emotional responses.

Clinical deployment: The crisis recognition analysis, showing conservative bias in risk assessment, has direct implications for models used in mental health triage.

Model development: The three-profile typology (cognitive-dominant, interactive-dominant, context-dependent) offers developers a diagnostic framework for understanding their models' emotional strengths and weaknesses.

Cross-cultural AI: The bilingual analysis reveals directional linguistic asymmetries (e.g., GPT-5 shows a cognitive advantage in Chinese but an interactive advantage in English), which is practically valuable for localization efforts.

The benchmark itself could become a standard evaluation tool, though its 480-item size and expert-dependent construction may limit scalability.

4. Timeliness & Relevance

The paper is highly timely. The rapid deployment of LLMs in emotionally sensitive applications (therapy chatbots, crisis hotlines, companion AI) creates an urgent need for nuanced EI evaluation beyond surface-level politeness metrics. The inclusion of very recent models (GPT-5, Claude-Sonnet-4, Grok-4) makes the evaluation immediately relevant. Growing concern about AI safety in emotional contexts, particularly the risk of models responding inappropriately in crisis situations, makes this work directly applicable to current regulatory and deployment discussions.

5. Strengths & Limitations

Key Strengths:

Theoretically grounded: Anchoring in established psychological theory (MSCEIT, Relevance Theory, Social Constructivism) provides principled evaluation dimensions rather than ad hoc task design.

Actionable findings: The hidden emotion bottleneck, conservative crisis bias, and template dependency findings are specific and addressable.

Rich qualitative analysis: The paper provides illustrative examples (candlelit dinner scenario, parents arguing) that concretely demonstrate the differences between formulaic and resonant responses.

Comprehensive model coverage: Nine frontier models across both Western and Chinese AI ecosystems provide broad comparative analysis.

Notable Limitations:

Text-only modality: The authors acknowledge this, but it significantly limits ecological validity, as real emotional intelligence is multimodal.

Single-turn evaluation: Most real emotional support occurs over multi-turn conversations with evolving emotional dynamics.

Reproducibility concerns: The expert-crafted nature of the items, while ensuring quality, limits reproducibility and extensibility. Availability of the full dataset is not explicitly confirmed.

Statistical analysis depth: While correlations are reported, more sophisticated analyses (e.g., structural equation modeling of the proposed three-component structure, bootstrapped confidence intervals for Elo ratings) would strengthen claims.

Loosely defined construct: The "stochastic empathy" concept, while evocative, is more of a descriptive label than a formally defined or empirically validated construct.
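The bootstrapped confidence intervals for Elo ratings suggested under statistical analysis depth could look roughly like the sketch below. It is an illustration under stated assumptions, not a procedure from the paper: matches are hypothetical `(model_a, model_b, score_a)` tuples, ratings use a sequential Elo pass with K = 16 from a 1500 start, and the interval is a plain percentile bootstrap over resampled matches.

```python
import random


def elo_from_matches(matches, k=16.0, start=1500.0):
    """Sequential Elo pass over (model_a, model_b, score_a) tuples."""
    ratings = {}
    for a, b, score_a in matches:
        r_a = ratings.setdefault(a, start)
        r_b = ratings.setdefault(b, start)
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        delta = k * (score_a - expected_a)
        ratings[a] = r_a + delta
        ratings[b] = r_b - delta
    return ratings


def bootstrap_elo_ci(matches, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap interval for each model's Elo rating."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_boot):
        # Resample matches with replacement; this also randomizes order,
        # which matters because sequential Elo is order-dependent.
        resampled = [rng.choice(matches) for _ in matches]
        for model, rating in elo_from_matches(resampled).items():
            samples.setdefault(model, []).append(rating)
    intervals = {}
    for model, values in samples.items():
        values.sort()
        low = values[int((alpha / 2) * (len(values) - 1))]
        high = values[int((1 - alpha / 2) * (len(values) - 1))]
        intervals[model] = (low, high)
    return intervals
```

If two models' intervals overlap heavily, their ordering in the interaction ranking should be read as tentative. The same applies to any profile label, such as cognitive-dominant versus interactive-dominant, that depends on that ordering.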

Overall Assessment

FACET represents a meaningful advance in LLM emotional intelligence evaluation, offering both a well-designed benchmark and genuinely informative findings about the fragmented nature of machine emotional competence. The perception-interaction dissociation is the paper's most important contribution, with clear implications for alignment research. However, the psychometric validation claims exceed the evidence provided, and the reliance on LLM-as-judge for the most novel dimensions (interaction) introduces circularity that somewhat undermines the findings' robustness.

Rating: 7.2 / 10

Significance 7.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 26, 2026

Comparison History (23)

vs. Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns

gpt-5.2 · 5/28/2026

Paper 1 is likely to have higher impact: it introduces a substantial new psychometrically grounded benchmark (FACET, 480 expert items) tied to a canonical EI theory, reports empirical results across multiple frontier models, and surfaces actionable failure modes (e.g., hidden emotion recognition bottleneck, “stochastic empathy” from RLHF). This has immediate safety/clinical/HCI applications and could reshape evaluation and alignment practices. Paper 2 provides a useful unifying perspective and taxonomy, but is more synthesis than new empirical/methodological contribution, so its incremental impact is likely smaller.

vs. Do Clinical Models Change Treatment Decisions?

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical safety gap in the high-stakes domain of clinical AI by evaluating how models adapt to changing patient contexts. Its focus on actual treatment decisions over static medical QA has immediate, life-saving implications for healthcare deployment, giving it higher potential impact than the broader, but less critical, evaluation of emotional intelligence in Paper 2.

vs. Can LLMs Introspect? A Reality Check

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a more fundamental question about LLM capabilities (introspection/metacognition) with rigorous methodological critiques of existing work, introducing clever control conditions that expose confounds. Its findings challenge premature conclusions in the field and have broad implications for AI interpretability and alignment research. Paper 1, while introducing a useful benchmark (FACET) for emotional intelligence evaluation, is more incremental—primarily a benchmarking study. Paper 2's methodological contributions (distinguishing genuine introspection from pattern matching, demonstrating insufficiency of behavioral evidence) provide deeper, more generalizable insights for the field.

vs. Position: AI Safety Requires Effective Controllability

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental gap in AI safety by distinguishing controllability from alignment—a timely and critical issue as agentic AI systems proliferate. It introduces a concrete benchmark (ControlBench), proposes actionable architectural principles, and targets a problem with immediate real-world safety implications. Paper 2 offers a valuable psychometric framework for evaluating LLM emotional intelligence, but its domain is narrower and less urgent. The controllability framework has broader cross-field impact (safety, governance, deployment) and addresses a more pressing need as autonomous AI agents become widespread.

vs. AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

gemini-3.1 · 5/27/2026

Paper 1 addresses a critical frontier in AI alignment by bridging psychometrics and LLM evaluation. Its finding that RLHF optimizes for 'stochastic empathy' challenges current training paradigms and has broad interdisciplinary implications across AI safety, psychology, and clinical applications. While Paper 2 offers a valuable technical solution for LLM agent efficiency, Paper 1's conceptual novelty and broader impact on how we understand and evaluate affective reasoning in foundational models give it a higher potential scientific impact.

vs. Behind EvoMap: Characterizing a Self-Evolving Agent-to-Agent Collaboration Network

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact: it provides a first large-scale empirical characterization of an emerging socio-technical infrastructure (A2A collaboration networks) using substantial real-world data (1.5M assets, 128K agents) and identifies concrete, actionable failure modes (misaligned incentives, manipulable ranking, unverifiable testing) with clear design implications for secure, auditable AI ecosystems. Its findings generalize to platform design, economics, security, and AI governance. Paper 1 introduces a valuable EI benchmark and conceptual insight, but impact may be narrower (evaluation/psychometrics of LLM affect) and more sensitive to benchmark adoption and construct validity debates.

vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to stronger novelty and cross-disciplinary breadth: it introduces a psychometrically grounded EI benchmark (FACET) anchored in an established theory, reveals a conceptually important finding (EI fragmentation across perception/cognition/interaction), and links results to alignment mechanisms (RLHF) with clear safety implications. Its applications span AI safety, HCI, mental health, and policy. Paper 1 is valuable and rigorous (large dataset + toolkit) but is more domain-specific (mobile GUI navigation) and its core insights (scaling + RL helps) are less broadly paradigm-shifting.

vs. Automatic Layer Selection for Hallucination Detection

gemini-3.1 · 5/27/2026

Paper 1 addresses a foundational challenge in AI alignment by bridging psychology and LLM evaluation. Its rigorous, theoretically anchored approach to deconstructing emotional intelligence has broad implications for deploying AI in sensitive domains like healthcare and customer service. Paper 2 is highly practical and addresses the critical issue of hallucinations, but its scope is more narrowly focused on a technical optimization (layer selection) compared to Paper 1's broader interdisciplinary impact and conceptual innovation regarding 'stochastic empathy'.

vs. RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to broader cross-field relevance (AI alignment, HCI, psychology/psychometrics, clinical and education applications) and timeliness as LLMs enter sensitive social domains. FACET’s theoretically grounded, psychometrically structured benchmark and its finding of fragmented EI (plus identifiable bottlenecks like hidden emotion recognition) offers a general diagnostic framework that can shape evaluation and training paradigms across many models and settings. Paper 1 is valuable and practical for autonomous driving, but is narrower in scope and relies on proxy/open-loop evaluation and an oracle applicability assumption, limiting immediate generalizability.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

gemini-3.1 · 5/26/2026

Paper 2 has higher potential impact due to its broad relevance across AI alignment, HCI, and psychology. While Paper 1 offers a strong technical contribution for embodied AI, Paper 2 addresses a critical, ubiquitous issue in foundational LLMs: the superficiality of RLHF-driven emotional responses ('stochastic empathy'). By bridging psychometric theory with LLM evaluation, Paper 2 provides a foundational benchmark that challenges current scaling assumptions and provides a roadmap for AI safety, making it highly influential for the future deployment of AI in human-facing and clinical applications.

vs. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

gemini-3.1 · 5/26/2026

Paper 1 addresses a fundamental, high-level capability of LLMs (Emotional Intelligence) by bridging AI alignment with established psychological theory. Its identification of 'stochastic empathy' and the fragmentation of EI challenges current scaling laws and RLHF paradigms, offering deep theoretical novelty. Paper 2, while highly practical, focuses on a narrower engineering challenge (robustness to UI noise). Paper 1 has broader interdisciplinary impact, influencing human-AI interaction, clinical applications, and AI safety methodologies.

vs. Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs

gpt-5.2 · 5/26/2026

Paper 2 is more likely to have higher scientific impact due to stronger novelty (using EEG-to-image generation as a proxy “visual translator” for grounding), clearer real-world clinical applications (EEG interpretation), and broader cross-field reach (BCI, neuroimaging, generative modeling, multimodal foundation models). The approach is timely amid brain-foundation-model efforts and data scarcity. Paper 1 introduces a rigorous EI benchmark and important safety insights, but its primary contribution is evaluative/diagnostic and may have narrower downstream utility compared to a method enabling new capabilities for neural-signal understanding.

vs. Agent-as-Peer-Debriefer: A Multi-Agent Framework with Perspective-Based Refinement for Qualitative Analysis

claude-opus-4.6 · 5/26/2026

Paper 1 introduces FACET, a novel psychometrically grounded benchmark for evaluating emotional intelligence in LLMs, addressing a critical gap in AI safety and alignment. Its findings—that EI is fragmented, not monolithic, and that RLHF produces 'stochastic empathy'—have broad implications across AI development, clinical applications, and alignment research. The identification of hidden emotion recognition as a universal bottleneck and distinct performance profiles provides actionable insights for the field. Paper 2 makes a solid contribution to LLM-assisted qualitative analysis but addresses a narrower methodological niche with more incremental improvements.

vs. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it proposes a concrete, generally applicable PEFT method (PALoRA) addressing a widely felt practical problem—updating LLMs without degrading reasoning—grounded in spectral analysis with a clear algorithmic contribution and strong empirical evidence across models and benchmarks. Its applicability spans continual learning, model editing, deployment, and efficient adaptation, making it timely and broadly useful. Paper 1 offers a valuable benchmark and insight into EI fragmentation, but its impact is more niche and depends on adoption of the FACET test and on contested constructs/measurement validity in affective evaluation.

vs. Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care

gemini-3.1 · 5/26/2026

Paper 1 addresses a highly timely issue by introducing a novel, psychometrically grounded framework for evaluating emotional intelligence in frontier LLMs. Its findings challenge prevailing assumptions about model scaling and RLHF, potentially reshaping future AI alignment strategies. Paper 2, while clinically valuable, represents a more incremental application of existing ML techniques to speech analysis. Consequently, Paper 1 promises a broader, paradigm-shifting impact across AI safety, cognitive science, and human-computer interaction.

vs. Mitigating Object Hallucinations in Vision-Language Models through Region-Aware Attention Recalibration

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact due to a concrete, training-free, broadly deployable method that directly improves reliability in vision-language systems—a major, timely bottleneck with immediate real-world applications. Its region-aware attention recalibration is actionable, computationally efficient, and validated on widely used benchmarks with state-of-the-art results, increasing adoption potential. Paper 1 offers a valuable psychometrically grounded EI benchmark and conceptual framing, but its impact may be narrower and more evaluative than method-enabling, with less direct downstream integration into production systems.

vs. Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

gemini-3.1 · 5/26/2026

Paper 2 bridges AI, psychology, and safety, addressing a critical frontier in AI alignment. Its psychometrically grounded framework and insights into 'stochastic empathy' have broad, multi-disciplinary implications for human-computer interaction and clinical applications. Conversely, Paper 1 addresses an important but narrower systems engineering and performance measurement issue.

vs. MAPLE: Multi-State Aggregated Policy Evaluation for AlphaZero in Imperfect-Information Games

gemini-3.1 · 5/26/2026

Paper 2 addresses a critical, timely issue with broad implications across AI safety, HCI, and psychology. By introducing a psychometrically grounded framework to evaluate LLMs and challenging current alignment paradigms (RLHF), it offers significantly wider cross-disciplinary impact and real-world relevance compared to Paper 1, which focuses on a more narrowly scoped algorithmic advancement for imperfect-information games.

vs. Unlocking Proactivity in Task-Oriented Dialogue

gemini-3.1 · 5/26/2026

Paper 1 addresses a foundational issue in AI alignment and human-AI interaction by evaluating emotional intelligence in frontier LLMs. Its introduction of a psychometrically grounded framework and discovery of 'stochastic empathy' have broad implications across psychology, HCI, and AI safety. In contrast, Paper 2 focuses on a narrower, albeit rigorous, application of proactive task-oriented dialogue for specific domains like outbound sales, making Paper 1's potential breadth of impact and relevance significantly higher.

vs. Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

gpt-5.2 · 5/26/2026

Paper 1 has higher impact potential due to stronger novelty and methodological contribution: it introduces a new task (Grounded Personality Reasoning), a sizable multimodal dataset with timestamped evidence, and a multi-tier evaluation with concrete failure-mode metrics that can become standard for auditing “right answer for the right reason.” Its breadth spans multimodal grounding, social cognition, bias/prejudice analysis, and benchmark design, with clear real-world relevance for human-facing agents. Paper 2 is timely and psychometrically motivated, but is smaller in scale and less methodologically actionable beyond a fixed-item test.