Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?

Caixin Kang, Tianyu Yan, Sitong Gong, Mingfang Zhang, Liangyang Ouyang, Ruicong Liu, Bo Zheng, Huchuan Lu

May 21, 2026

arXiv:2605.22109v1

cs.AI (primary), cs.CV, cs.CY

#1007 of 2292 · Artificial Intelligence

Tournament Score: 1427 ± 50 (scale 1050 to 1800)

Win Rate: 60%

Rating: 7.5 / 10

Significance 8 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Abstract

Multimodal Large Language Models (MLLMs) are increasingly deployed in human-facing roles where personality perception is critical, yet existing benchmarks evaluate this capability solely on numerical Big Five score prediction, leaving open whether models truly perceive personality through behavioral understanding or merely prejudge through superficial pattern matching. We address this gap with three contributions. (i) A new task: we formalize Grounded Personality Reasoning (GPR), which requires MLLMs to anchor each Big Five rating in observable evidence through a chain of rating, reasoning, and grounding. (ii) A new dataset: we release MM-OCEAN (1,104 videos, 5,320 MCQs), produced by a multi-agent pipeline with human verification, with timestamped behavioral observations, evidence-grounded trait analyses, and seven categories of cue-grounding MCQs. (iii) Benchmark and analysis: we design a three-tier evaluation (rating, reasoning, grounding) plus four sample-level failure-mode metrics: Prejudice Rate (PR), Confabulation Rate (CR), Integration-failure Rate (IR), and Holistic-grounding Rate (HR), and benchmark 27 MLLMs (13 closed, 14 open). The analysis uncovers a striking Prejudice Gap: across the field, 51% of correct ratings are not grounded in retrieved cues, and the Holistic-Grounding Rate spans only 0-33.5%. These findings expose a disconnect between getting the right score and reasoning for the right reason, charting a roadmap for grounded social cognition in MLLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Perception or Prejudice: Can MLLMs Go Beyond First Impressions of Personality?"

1. Core Contribution

This paper makes three interconnected contributions addressing a significant gap in how we evaluate MLLMs' understanding of human personality. First, it formalizes Grounded Personality Reasoning (GPR), a task requiring models to not only predict Big Five personality ratings but anchor those ratings in observable behavioral evidence through a chain of rating → reasoning → grounding. Second, it introduces MM-OCEAN, a benchmark of 1,104 videos with ~13.5K human-verified behavioral observations and 5,320 cue-grounding MCQs spanning seven cognitive categories. Third, it proposes a three-tier evaluation framework with four interpretable failure-mode metrics (Prejudice Rate, Confabulation Rate, Integration-failure Rate, Holistic-Grounding Rate) and benchmarks 27 MLLMs.

The central finding—the "Prejudice Gap" where 51% of correct personality ratings lack grounded evidence and Holistic-Grounding Rate spans only 0–33.5%—is a striking and practically important result. It demonstrates that traditional rating-only evaluations systematically overestimate model competence.
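To make the failure-mode idea concrete, here is a minimal sketch of how two of the sample-level metrics might be computed. The exact definitions are not given in this summary, so the conditions below are assumptions: Prejudice Rate is taken as the share of correctly rated samples whose grounding tier failed, and Holistic-Grounding Rate as the share of all samples that pass all three tiers. The `SampleResult` class, its field names, and the toy data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Per-sample pass/fail flags for the three tiers (hypothetical layout)."""
    rating_ok: bool     # Tier 1: predicted rating counted as correct
    reasoning_ok: bool  # Tier 2: reasoning judged acceptable
    grounding_ok: bool  # Tier 3: cue-grounding MCQs passed


def prejudice_rate(samples):
    """Assumed definition: share of correct ratings with failed grounding."""
    correct = [s for s in samples if s.rating_ok]
    if not correct:
        return 0.0
    return sum(not s.grounding_ok for s in correct) / len(correct)


def holistic_grounding_rate(samples):
    """Assumed definition: share of samples passing all three tiers."""
    if not samples:
        return 0.0
    passed = sum(
        s.rating_ok and s.reasoning_ok and s.grounding_ok for s in samples
    )
    return passed / len(samples)


# Toy data, not taken from the paper.
samples = [
    SampleResult(True, True, True),
    SampleResult(True, True, False),
    SampleResult(True, False, False),
    SampleResult(False, True, True),
]

print(f"PR = {prejudice_rate(samples):.3f}")
print(f"HR = {holistic_grounding_rate(samples):.3f}")
```

Under these assumed definitions, a model can score well on rating accuracy while still showing a high PR and a low HR, which is the pattern the review describes as the Prejudice Gap.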

2. Methodological Rigor

The paper demonstrates strong methodological rigor across multiple dimensions:

Annotation pipeline: The five-stage multi-agent pipeline with human verification is well-designed. The 78.2% acceptance rate for Observer drafts, 77% inter-annotator agreement, and the involvement of 24 trained annotators across 45,609 cue judgments lend credibility. The text-leakage filter (Stage 5a) using two text-only LLMs to ensure MCQs require genuine multimodal grounding is a thoughtful design choice.

Evaluation robustness: Multiple robustness checks strengthen the findings: threshold sensitivity sweeps across 27 combinations showing HR ranking stability (ρ ≥ 0.92), cross-judge validation with Claude Haiku 4.5 and Gemini 2.5 Flash-Lite confirming T2 rankings (ρ ≥ 0.92), and the confidently-wrong consistency check showing the AI judge tracks correctness rather than style (σ_Δ = 0.27).
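The ranking-stability checks above are reported as Spearman rank correlations (ρ). As a rough illustration, the sketch below computes ρ between model rankings under two threshold settings. It uses the textbook no-ties formula; the score lists are made up, and the helper names are hypothetical rather than anything from the paper's code.

```python
def ranks(scores):
    """Rank scores from highest (rank 1) to lowest; assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def spearman_rho(a, b):
    """Spearman rank correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(a)
    d_squared = sum((x - y) ** 2 for x, y in zip(ranks(a), ranks(b)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical HR scores for five models under two threshold settings.
hr_default = [0.30, 0.22, 0.15, 0.08, 0.01]
hr_stricter = [0.25, 0.12, 0.14, 0.05, 0.00]

rho = spearman_rho(hr_default, hr_stricter)
print(f"rho = {rho:.2f}")
```

A value of ρ close to 1 means the two settings order the models almost identically, so conclusions drawn from the ranking do not hinge on the specific threshold chosen.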

Scale: Benchmarking 27 models from 12 families (13 proprietary, 14 open-source) provides comprehensive coverage of the current MLLM landscape.

One methodological limitation is the reliance on a single AI judge (GPT-4o-mini) for Task 2, though the cross-judge checks partially mitigate this concern. The operationalization of "grounding" as MCQ performance is a reasonable proxy but may not fully capture a model's grounding ability.

3. Potential Impact

Immediate impact on MLLM evaluation: The paper's most direct contribution is demonstrating that accuracy-only metrics are insufficient for personality-aware AI systems. The HR metric and failure-mode taxonomy offer the community a more nuanced evaluation vocabulary.

Regulatory alignment: The explicit connection to the EU AI Act's requirement for explainable evidence trails in personality-based systems gives this work immediate practical relevance. As AI systems increasingly enter hiring, mental health, and education domains, benchmarks that distinguish genuine understanding from superficial pattern matching become essential.

Broader methodological influence: The rating-reasoning-grounding evaluation chain and the failure-mode taxonomy (PR/CR/IR/HR) could generalize beyond personality to other social cognition tasks—emotion recognition, deception detection, intent inference. The "right answer for the wrong reason" problem the paper addresses is pervasive across AI evaluation.

Open-source ecosystem insights: The finding that the closed-vs-open gap is narrow for rating (Δ = -5.6%) and reasoning (Δ = -3.6%) but wide for cue retrieval (Δ = -26.6%) provides actionable guidance for open-source model development priorities.

4. Timeliness & Relevance

The paper addresses a timely convergence of needs: MLLMs are being deployed in personality-sensitive applications (AI interviews, companion systems, mental health triage) while regulatory frameworks increasingly demand explainability. The gap between "getting the right score" and "reasoning for the right reason" is exactly the kind of distinction that both researchers and regulators need to make. The benchmark is also timely in covering the latest generation of models (GPT-5.5, Gemini 3/3.1, Claude Opus 4.6, Qwen3.5).

5. Strengths & Limitations

Key Strengths:

The conceptual framing of "perception vs. prejudice" is compelling and clearly articulated, drawing on established person-perception psychology (Funder's Realistic Accuracy Model).

The seven-category MCQ taxonomy (reasoning cluster + visual grounding cluster) provides fine-grained diagnostic power, revealing that spatial localization and micro-expression detection are systemic bottlenecks.

The RGM metric cleanly identifies two model archetypes (Confident Raters vs. Cautious Reasoners), offering actionable development guidance.

The paper is exceptionally thorough in its appendices, providing extensive robustness checks, worked examples, and supplementary analyses.

HR's coefficient of variation (0.93) being far larger than any single-task metric demonstrates its discriminative value.
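For readers unfamiliar with the coefficient of variation mentioned in the last strength, it is the standard deviation divided by the mean, so it measures spread relative to the typical value. The sketch below compares a widely spread metric with a tightly clustered one. The numbers are invented for illustration, and using the population standard deviation is an assumption, since the summary does not say which variant the paper uses.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("coefficient of variation is undefined for mean 0")
    return pstdev(values) / m


# Hypothetical per-model scores, not taken from the paper.
hr_scores = [0.0, 0.1, 0.2, 0.3]          # widely spread relative to the mean
rating_scores = [0.50, 0.55, 0.60, 0.65]  # tightly clustered

print(f"CV(HR)     = {coefficient_of_variation(hr_scores):.3f}")
print(f"CV(rating) = {coefficient_of_variation(rating_scores):.3f}")
```

A metric with a higher coefficient of variation separates models more sharply, which is the sense in which the review calls HR more discriminative than the single-task metrics.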

Notable Limitations:

The dataset inherits ChaLearn First Impressions V2's cultural/linguistic biases (predominantly Western English speakers), limiting cross-cultural generalizability.

15-second clips are quite short for personality inference, and "apparent personality" from crowd-sourced labels is a construct with known limitations.

The causal claims about reasoning capability (Appendix Q) are appropriately caveated but the observational comparison is still potentially misleading given the many confounds.

The benchmark size (1,104 videos, 5,320 MCQs) is moderate; scaling to more diverse scenarios would strengthen claims.

The paper doesn't explore whether models can be improved through targeted fine-tuning on grounded personality reasoning, which would strengthen the practical roadmap.

Overall Assessment

This is a well-executed benchmark paper that identifies an important and under-examined gap in MLLM evaluation. The "Prejudice Gap" finding is both novel and consequential. The methodology is sound with extensive robustness validation. The work sits at a productive intersection of multimodal AI evaluation, personality psychology, and AI safety/regulation. Its primary limitation is scope (short English-only clips, single construct), but within that scope, the contribution is thorough and convincing.

Rating: 7.5 / 10

Significance 8 · Rigor 8 · Novelty 7.5 · Clarity 8.5

Generated May 22, 2026

Comparison History (15)

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gemini-3.1 · 5/22/2026

Paper 2 identifies a critical, counter-intuitive inverse scaling phenomenon in high-stakes forecasting domains like epidemiology and finance. Discovering that increased model capabilities degrade performance on tail-risk predictions addresses urgent AI safety and reliability concerns. Its profound implications for real-world decision-making and LLM evaluation methodology give it a broader, more significant potential scientific impact than Paper 1's narrower focus on personality perception.

vs. WorkstreamBench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a novel conceptual framework (Grounded Personality Reasoning) that addresses a fundamental question about whether MLLMs truly understand or merely pattern-match—a question with broad implications across AI safety, fairness, and deployment. Its contributions (new task formalization, large dataset, comprehensive evaluation with novel failure-mode metrics, and benchmarking 27 models) are methodologically rigorous and reveal a striking 'Prejudice Gap' finding. This has broader cross-field impact (social AI, psychology, fairness/bias) compared to Paper 1, which provides a valuable but more domain-specific benchmark for spreadsheet tasks in finance.

vs. Towards a General Intelligence and Interface for Wearable Health Data

gemini-3.1 · 5/22/2026

Paper 1 introduces a foundation model trained on an unprecedented scale (5 million participants) for wearable health data, addressing critical challenges in personalized medicine. Its broad evaluation across 35 health tasks and integration into a clinician-validated Personal Health Agent suggest immense real-world applications and transformative potential in healthcare. Paper 2, while valuable for MLLM evaluation and AI fairness, is a benchmarking study with a narrower scope and less direct societal impact compared to the population-scale health insights enabled by Paper 1.

vs. AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems

gemini-3.1 · 5/22/2026

Paper 2 introduces a novel task, a new dataset (MM-OCEAN), and comprehensive benchmarks evaluating a critical issue in contemporary AI: whether Multimodal LLMs genuinely reason or rely on superficial prejudice. Benchmark datasets and evaluations of AI reasoning/bias typically garner high citation rates and spur broad downstream research. In contrast, Paper 1 is a synthesis chapter reviewing AI applications in serious games, which, while valuable, lacks the empirical novelty, rigorous benchmarking, and broad applicability across the rapidly moving fields of AI safety and multimodal model development seen in Paper 2.

vs. Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

gemini-3.1 · 5/22/2026

Paper 2 offers significantly higher scientific impact by introducing a novel dataset (MM-OCEAN), formalizing a new task, and comprehensively benchmarking 27 models. It addresses a critical flaw in AI evaluation—whether models truly reason or just rely on superficial prejudices. In contrast, Paper 1 presents an interesting approach to emotion recognition but relies on a highly limited case study of a single 245-second speech. Paper 2's robust methodology, new failure-mode metrics, and large-scale analysis give it much broader relevance and citation potential across AI alignment and computational social science.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

gemini-3.1 · 5/22/2026

While Paper 1 provides a valuable infrastructure benchmark for agentic workflows, Paper 2 tackles a fundamental scientific question regarding how multimodal models reason about human attributes. By exposing the 'Prejudice Gap' and shifting evaluation from superficial score prediction to grounded behavioral reasoning, it addresses critical issues in AI safety, fairness, and interpretability, promising broader interdisciplinary impact across AI, psychology, and HCI.

vs. DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

claude-opus-4.6 · 5/22/2026

Paper 1 addresses a fundamental gap in MLLM evaluation—whether models truly understand personality or rely on superficial cues—introducing a novel task (GPR), a comprehensive dataset (MM-OCEAN), and revealing the striking 'Prejudice Gap' across 27 models. This exposes a deep issue (right answer, wrong reasoning) relevant to AI safety, trustworthiness, and social cognition broadly. Paper 2 contributes a useful benchmark for agentic delegation but addresses a narrower, more engineering-focused problem with less surprising findings. Paper 1's conceptual contribution and implications for responsible AI deployment give it broader and more lasting impact.

vs. CLORE: Content-Level Optimization for Reasoning Efficiency

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a fundamentally new evaluation paradigm for MLLMs in social cognition, exposing a critical 'Prejudice Gap' where models appear to succeed without genuine understanding. It contributes a novel task formalization (GPR), a large-scale dataset (MM-OCEAN), and benchmarks 27 models with new failure-mode metrics. This has broader impact across AI safety, fairness, and human-AI interaction. Paper 1, while solid engineering work on reasoning efficiency, is more incremental—combining known techniques (DPO, content editing) for a well-studied problem of verbose reasoning in LLMs.

vs. Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

claude-opus-4.6 · 5/22/2026

Paper 2 introduces a novel evaluation framework (Grounded Personality Reasoning) with a new dataset, benchmark, and failure-mode metrics that expose fundamental limitations in MLLMs' social reasoning. It evaluates 27 models and reveals a striking 'Prejudice Gap' with broad implications for AI safety and deployment in human-facing applications. While Paper 1 makes valuable contributions questioning chess LLM claims, its scope is narrower (chess domain) and its core finding (pattern-matching over understanding) is less surprising. Paper 2's contributions span AI evaluation methodology, social cognition, and responsible AI deployment, giving it broader cross-field impact.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact due to timeliness and broad applicability: privacy-preserving LLM agents are an immediate deployment concern across consumer, enterprise, and regulated domains. POLAR-Bench offers a clear, diagnostic evaluation framework (policy axes × attack strategies) with deterministic scoring and adversarial interaction, directly informing safety/privacy alignment and model selection—especially for widely used smaller open-weight models. Paper 1 is novel and valuable for grounded social cognition and bias diagnosis in MLLMs, but its application scope is narrower (personality inference) and less universally critical than privacy-utility trade-offs in agentic systems.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical, universal bottleneck in the deployment of autonomous LLM agents: data privacy and adversarial robustness. While Paper 1 provides excellent insights into multimodal social cognition and bias, Paper 2's focus on privacy-utility trade-offs impacts virtually every domain where AI agents interact with third-party systems. Furthermore, its specific finding that smaller, on-device models are highly vulnerable to data leakage provides immediate, highly actionable value for the open-source and AI security communities, granting it broader and more urgent real-world impact.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gpt-5.2 · 5/22/2026

Paper 1 has higher potential impact due to its broader, timelier framing of frontier AI evaluation: it challenges benchmark-centric paradigms and proposes open-world evaluations applicable across domains (agents, safety, governance, deployment readiness). The CRUX proposal and real deployment case study (shipping an iOS app) are likely to influence evaluation practice and policy beyond a single subfield. Paper 2 is methodologically strong with a valuable dataset/metrics for grounded personality perception in MLLMs, but its scope is narrower (social cognition/personality) and impacts fewer adjacent fields.

vs. Open-World Evaluations for Measuring Frontier AI Capabilities

gemini-3.1 · 5/22/2026

While Paper 1 provides a rigorous and valuable benchmark for a specific domain (personality perception), Paper 2 addresses a fundamental limitation in how frontier AI capabilities are currently measured across all domains. By advocating for and demonstrating 'open-world evaluations,' Paper 2 has a broader potential impact on AI safety, policy, and general capability tracking, shaping future methodologies for evaluating autonomous agents in real-world scenarios.

vs. Interference-Aware Multi-Task Unlearning

gemini-3.1 · 5/22/2026

Paper 2 addresses a critical flaw in Multimodal Large Language Models—the 'Clever Hans' effect in personality perception—by introducing a novel task, dataset, and benchmark. Its exposure of the 'Prejudice Gap' has broad implications for AI safety, fairness, and human-computer interaction, likely driving significant future research. While Paper 1 provides a strong technical solution for a specific subfield (multi-task unlearning), Paper 2's focus on foundational MLLM evaluation offers wider immediate relevance and broader interdisciplinary impact.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

claude-opus-4.6 · 5/22/2026

Paper 2 has higher potential scientific impact due to its broader relevance across AI safety, fairness, and social cognition research. It introduces a novel conceptual framework (Grounded Personality Reasoning), a reusable benchmark dataset (MM-OCEAN), and reveals a fundamental limitation ('Prejudice Gap') in 27 MLLMs that has implications for any human-facing AI deployment. The findings challenge assumptions about MLLM reasoning capabilities broadly. Paper 1, while technically rigorous, addresses a narrower domain (AV scenario generation) with incremental improvements over existing methods.