AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety

Yanjing Ren, Reza Ebrahimi, TengTeng Ma

Jun 3, 2026

arXiv:2606.04867v1

cs.AI (primary)

#2375 of 3355 · Artificial Intelligence

Tournament Score: 1351±41 (axis range 1050 to 1800)

Win Rate: 40%

Rating: 4.5/10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 6

Abstract

As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx

AI Impact Assessments

(1 model)

Scientific Impact Assessment: AICompanionBench

1. Core Contribution

AICompanionBench introduces what the authors claim is the first publicly available benchmark dataset of human–AI companion conversations annotated with fine-grained safety risk categories. The dataset comprises 2,123 real-world Replika conversations sourced from Reddit, annotated across nine categories (eight unsafe + one no-harm). The secondary contribution is a systematic evaluation of 20 LLMs (open- and closed-source) under an LLM-as-judge framework for detecting unsafe interactions. The paper addresses a genuine gap: despite growing concerns about AI companion safety—amplified by teen suicide lawsuits against Character.AI and OpenAI—no publicly labeled dataset previously existed for this specific domain.

2. Methodological Rigor

Strengths in data collection: The pipeline for scraping Reddit screenshots, applying OCR, and distinguishing speakers by bubble position is practical and reproducible, though potentially noisy. The funnel-style filtering using six LLMs to identify potentially unsafe conversations is a reasonable approach for managing 43,851 conversations.

Significant weaknesses in annotation:

The ground truth relies on a single human annotator, a critical limitation for a benchmark paper. With no second annotator, inter-annotator agreement cannot be computed, so the reliability of the gold labels is questionable, particularly for subjective categories like manipulation, control, and verbal aggression.

The annotation process is circular: LLMs are used to filter and pre-screen conversations, a single annotator labels them, Cohen's kappa between machine predictions and initial human labels is reported (0.59, only moderate agreement), and then the annotator revises labels using model predictions as reference. This creates a risk of anchoring bias—the human annotator may be influenced by model consensus, undermining the independence of the ground truth.

The dataset is heavily skewed (~48% sexual behavior), with very few instances of manipulation, self-harm, and substance abuse. This class imbalance is acknowledged but not adequately addressed in evaluation metrics.
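For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of how Cohen's kappa is computed for two label sequences. The function name and the toy labels are illustrative assumptions; they are not taken from the paper or its dataset, and the paper's own computation may differ in detail.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Hypothetical toy labels, NOT drawn from AICompanionBench.
human = ["sexual", "sexual", "no-harm", "control", "no-harm", "sexual"]
model = ["sexual", "sexual", "sexual", "control", "no-harm", "no-harm"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

Because kappa discounts the agreement expected by chance, a heavily skewed label distribution inflates chance agreement and can leave kappa well below the raw agreement rate.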

Evaluation concerns:

The paper primarily uses accuracy and precision as metrics, which are problematic given severe class imbalance. F1-score, macro-averaged metrics, and confusion matrices would provide substantially more informative evaluations.

The prompt used for all 20 models includes one-shot examples but is not varied or ablated, making it unclear whether performance differences stem from model capability or prompt sensitivity.

The false positive rate analysis is informative but incomplete—false negative rates and per-category recall are not systematically reported.
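The metric concerns above can be illustrated with a small sketch: on a skewed label set, a judge that always predicts the majority class scores high accuracy while macro-F1 and per-category recall expose the failure. All function names and the toy data are hypothetical, chosen only to mirror the kind of imbalance described; none of the numbers come from the paper.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every gold label."""
    pairs = Counter(zip(y_true, y_pred))      # (gold, predicted) -> count
    gold, pred = Counter(y_true), Counter(y_pred)
    scores = {}
    for label in gold:
        tp = pairs[(label, label)]
        precision = tp / pred[label] if pred[label] else 0.0
        recall = tp / gold[label]
        f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
        scores[label] = (precision, recall, f1)
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)

def benign_false_positive_rate(y_true, y_pred, benign="no-harm"):
    """Share of benign conversations flagged with any unsafe label."""
    flagged = [p != benign for t, p in zip(y_true, y_pred) if t == benign]
    if not flagged:
        raise ValueError("no benign examples in y_true")
    return sum(flagged) / len(flagged)

# Hypothetical skewed toy set: 8 "sexual", 1 "manipulation", 1 "no-harm".
y_true = ["sexual"] * 8 + ["manipulation", "no-harm"]
# A degenerate judge that always answers with the majority class.
y_pred = ["sexual"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
recall_by_class = {c: r for c, (_, r, _) in per_class_scores(y_true, y_pred).items()}
fpr = benign_false_positive_rate(y_true, y_pred)
print(acc, round(mf1, 3), recall_by_class, fpr)
```

In this toy case accuracy is 0.8 while recall on the rare "manipulation" class is zero and every benign conversation is flagged, which is the kind of gap that accuracy-only reporting would hide.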

3. Potential Impact

The paper addresses a timely and societally important problem. AI companion safety is receiving increasing regulatory and media attention, and a public benchmark could catalyze research in this area. The dataset, if improved, could serve as a foundation for:

Developing safety classifiers for AI companion platforms

Informing policy and platform governance decisions

Training content moderation systems

However, the dataset's current size (2,123 conversations) and annotation-quality limitations constrain its immediate utility as a definitive benchmark. The evaluation of 20 models provides a useful snapshot, but the analysis remains largely descriptive rather than offering deep insights into *why* models fail on certain categories.

4. Timeliness & Relevance

The paper is highly timely. The AI companion market is growing rapidly, regulatory scrutiny is intensifying (particularly around minors), and there is genuine need for systematic safety evaluation tools. The paper directly addresses a current bottleneck—the absence of labeled datasets for this specific interaction type. The inclusion of recent models (GPT-5.4, Claude-opus-4.6, Qwen3, DeepSeek-v4) demonstrates currency.

5. Strengths & Limitations

Key Strengths:

First publicly available labeled dataset for AI companion safety conversations

Comprehensive model coverage (20 models across 6 families)

Practical, real-world data source (actual user-shared conversations)

Clearly identified finding that models struggle with implicit/nuanced harm categories (manipulation) and over-flag benign content

The finding that reasoning-enhanced models don't consistently outperform base models is interesting and actionable

Notable Limitations:

Single annotator fundamentally undermines benchmark credibility. Benchmark papers typically require multiple annotators with reported inter-annotator agreement.

Circular annotation process where model outputs influence final human labels creates methodological concerns about label independence.

Evaluation metrics are inadequate: accuracy and precision alone are misleading with such imbalanced classes. No macro/micro F1, no weighted metrics, no confusion matrices.

Selection bias: conversations shared on Reddit are likely more extreme/noteworthy than typical interactions, and the LLM-based filtering further biases toward content that models recognize as potentially unsafe.

Taxonomy is borrowed entirely from Zhang et al. [9] without validation or adaptation—the paper doesn't contribute to defining what constitutes harm.

Limited analytical depth: the paper reports performance numbers but offers minimal analysis of failure modes, error patterns, or systematic biases.

Reproducibility concerns: the GitHub link uses "anonymousresearcher2026," suggesting the dataset URL may change.

No baseline comparison with traditional ML/DL classifiers, which would contextualize LLM-as-judge performance.

The paper does not discuss ethical considerations of releasing real user conversations, even if publicly shared on Reddit.

Additional Observations

The paper's framing as a "benchmark" sets high expectations that the methodological rigor doesn't fully meet. Strong benchmarks (e.g., SafetyBench, R-Judge) typically feature multiple annotators, rigorous validation, and comprehensive evaluation protocols. The contribution is better characterized as an exploratory dataset with preliminary model evaluation rather than a definitive benchmark. The writing is generally clear but the related work section is somewhat formulaic. The key findings, while intuitive, are empirically validated here for the first time in this specific domain.

Rating: 4.5/10

Significance 5.5 · Rigor 3.5 · Novelty 5 · Clarity 6

Generated Jun 5, 2026

Comparison History (20)

vs. An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

claude-opus-4.6 · 6/6/2026

Paper 2 presents a more novel and broadly impactful framework combining LLMs with agent-based epidemiological modeling, integrating spatial/demographic heterogeneity for public health applications. It addresses a timely intersection of AI and infectious disease modeling with clear real-world policy implications. Paper 1, while valuable as a benchmark dataset for AI companion safety, is more narrowly focused on evaluating LLMs as safety judges for a specific application domain. Paper 2's interdisciplinary contribution spanning computational social science, epidemiology, and AI gives it broader potential impact across multiple fields.

vs. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation

gemini-3.1 · 6/6/2026

Paper 2 presents a novel methodological framework with massive cross-industry applications. While Paper 1 provides a timely dataset for AI safety, Paper 2's ability to reliably simulate human UI/UX evaluations has the potential to fundamentally transform software development and HCI workflows, offering broader economic and scientific impact.

vs. Individual Gain, Collective Loss: Metacognitive Adaptation in AI-Assisted Creativity

claude-opus-4.6 · 6/6/2026

Paper 2 offers a novel theoretical framework explaining the paradox of AI-enhanced individual creativity vs. reduced collective diversity through selective metacognitive adaptation. This has broader interdisciplinary impact across cognitive science, HCI, creativity research, and AI design. Its conceptual contribution—identifying mechanisms rather than symptoms—provides generative theoretical infrastructure with testable predictions and design principles. Paper 1, while valuable as a benchmark dataset for AI companion safety, is more narrowly scoped as an evaluation resource. Paper 2's framework addresses a fundamental tension in human-AI collaboration that will grow increasingly important.

vs. PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

gpt-5.2 · 6/5/2026

Paper 1 is likely higher impact due to stronger methodological innovation (clause cards, anchor-driven instantiation, closed-loop verification) yielding auditable, by-construction ground truth and an agentic evaluation setting, addressing key evaluation gaps (evidence-grounded reasoning, information seeking, abstention). It targets a high-stakes clinical workflow with clear real-world applicability and regulatory relevance. While Paper 2 is timely and useful, it relies on scraped conversations and LLM-as-judge evaluation with more potential confounds and narrower generalizability; its primary contribution is a labeled dataset rather than a broadly reusable benchmark construction methodology.

vs. Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

claude-opus-4.6 · 6/5/2026

AICompanionBench addresses a rapidly growing societal concern (AI companion safety) with broad relevance across AI safety, policy, and HCI communities. It introduces the first public benchmark for a timely problem affecting millions of users, evaluates 20 LLMs comprehensively, and has immediate real-world applications for platform safety monitoring. Paper 2, while rigorous, addresses a narrower domain (TLA+ specification generation) with a smaller potential audience. The AI safety topic has broader interdisciplinary impact and higher urgency given the rapid deployment of AI companion systems.

vs. From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

gpt-5.2 · 6/5/2026

Paper 2 has higher impact potential because it proposes a novel, actionable guardrail-agent integration (TRIAD) that moves beyond risk labeling to iterative remediation, directly improving downstream agent behavior and enabling safer task completion. It evaluates on established agent safety benchmarks (ASB, AgentHarm) with clear safety-utility trade-offs, suggesting stronger methodological relevance and broader applicability to real-world LLM agents and tool-using systems. Paper 1’s benchmark is valuable and timely, but its impact is narrower (AI companions) and primarily evaluative rather than intervention-oriented.

vs. Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

gemini-3.1 · 6/5/2026

Paper 2 addresses a critical data bottleneck in neuroscience and brain-computer interfaces. Demonstrating that synthetic fMRI augmentation can significantly boost decoding performance, and even enable zero-shot decoding, offers transformative potential across cognitive science and medical imaging. This represents a more profound methodological breakthrough than the domain-specific AI safety benchmark presented in Paper 1.

vs. AIS-Based Vessel Trajectory Prediction Using Memory-Augmented Neural Networks

claude-opus-4.6 · 6/5/2026

AICompanionBench addresses a timely and rapidly growing concern about AI companion safety, introduces the first public benchmark dataset in this space, and evaluates 20 state-of-the-art LLMs. Its novelty (new benchmark + LLM-as-judge framework for AI safety), broad relevance across AI safety, NLP, and policy communities, and public dataset release give it higher impact potential. Paper 2 applies existing memory-augmented neural networks to vessel trajectory prediction—a more incremental contribution in a narrower domain with limited novelty beyond the application context.

vs. DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

gpt-5.2 · 6/5/2026

Paper 2 likely has higher impact due to broader applicability and scale: a large (286K screenshots, 3.5M tasks) benchmark/training set for drag-based GUI interactions addresses a clear, widely relevant bottleneck for GUI agents, enabling progress across web/mobile/desktop automation and HCI. It is timely with rapid interest in computer-use agents and can catalyze model development via both evaluation and training. Paper 1 is novel and socially important, but smaller (2,123 conversations), narrower and domain-specific (Replika), and more sensitive to annotation/LLM-judge validity, limiting generalizability.

vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development

claude-opus-4.6 · 6/5/2026

AICompanionBench addresses a timely and growing concern about AI companion safety with a publicly available benchmark dataset, rigorous evaluation of 20 LLMs, and clear methodological contributions that the broader research community can build upon. Paper 1, while practically useful, reads more as an industry framework/case study from Yahoo with limited generalizability and less methodological rigor (survey-based evaluation of 67 engineers). Paper 2's benchmark dataset, reproducible evaluation framework, and focus on the critical area of AI safety give it broader scientific impact potential across multiple research communities.

vs. What Should Agents Say? Action-state Communication for Efficient Multi-Agent Systems

claude-opus-4.6 · 6/5/2026

Paper 2 addresses a fundamental and broadly applicable problem in multi-agent LLM systems—efficient inter-agent communication—proposing a principled framework (PACT) that demonstrates concrete improvements across multiple topologies and production systems. Its impact spans the rapidly growing field of LLM-based multi-agent systems with immediate practical applications (reduced cost, improved performance). Paper 1, while timely and valuable for AI safety benchmarking in companion systems, is more niche in scope, serving primarily as a dataset contribution for a specific safety evaluation domain. Paper 2's methodological contribution has broader applicability and addresses a more fundamental architectural challenge.

vs. Characterizing initial human-AI proof formalization workflows

gemini-3.1 · 6/5/2026

AI safety is a critical, high-impact field with immediate real-world consequences given the explosive growth of AI companions. Paper 2 introduces the first public benchmark dataset for this domain, providing a foundational resource for NLP, HCI, and AI ethics researchers. While Paper 1 is rigorous and novel, its impact is largely confined to the niche domain of mathematical formalization, whereas Paper 2 addresses a widespread societal concern affecting millions of users.

vs. From Out-of-Distribution Detection to Hallucination Detection: A Geometric View

claude-opus-4.6 · 6/5/2026

Paper 1 offers a novel theoretical framework connecting two important research areas (OOD detection and hallucination detection) with a geometric perspective, providing training-free methods applicable to reasoning tasks where existing approaches fail. This has broader methodological impact across the LLM safety field. Paper 2 contributes a useful benchmark dataset for AI companion safety, but is more narrowly scoped as an empirical evaluation of existing models on a specific application domain, with less generalizable methodological innovation.

vs. Doing What They Say, Not What They Reason: Locating the Faithfulness Gap in LLM Agents

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact: it introduces a publicly available, real-world dataset with fine-grained safety taxonomy for a rapidly growing and societally sensitive application area (AI companions), enabling broad follow-on work, benchmarking, and policy-relevant evaluation. Its methodological contribution (annotated benchmark + multi-model LLM-as-judge evaluation across 20 models) is directly actionable and timely. Paper 1 is novel in isolating faithfulness gap components in a controlled agent setting, but its domain-specific simulator setup may limit immediate real-world applicability and breadth compared to a reusable safety benchmark.

vs. Hedge-Bench: Benchmarking Agents on Hard, Realistic Tasks Pertaining to Financial Reasoning

gpt-5.2 · 6/5/2026

Paper 2 is likely higher impact due to timeliness (AI companion safety is a rapidly growing, high-stakes area), broader cross-field relevance (AI safety, HCI, social computing, content moderation, policy), and clearer real-world applicability for monitoring deployed systems. Its dataset is larger and directly addresses safety risk taxonomy and evaluation of LLMs-as-judges, a widely used paradigm. Paper 1 is novel and rigorous (expert-trace, deterministic grading) but is narrower in domain (hedge-fund financial reasoning) and thus may have more limited breadth despite strong benchmark design.

vs. MindClaw: Closed-Loop Embodied Mental-State Reasoning for Precision Intervention

gpt-5.2 · 6/5/2026

Paper 2 (MindClaw) has higher potential impact due to its more novel closed-loop embodied Theory-of-Mind setting, integrating perception, belief memory, triggering, reasoning, and action with “intervene vs stay silent” calibration—an important step beyond offline QA benchmarks. It targets broad real-world applications in robotics and assistive agents and may generalize across embodied AI, HRI, planning, and multimodal learning. Paper 1 provides a valuable safety dataset/benchmark for AI companions, but its contribution is narrower (content moderation/judging) and methodologically depends heavily on LLM-as-judge evaluation, limiting breadth and innovation relative to MindClaw’s systems contribution.

vs. PieArena: Ranking and Profiling Language Agents in Realistic Negotiation Scenarios

claude-opus-4.6 · 6/5/2026

PieArena demonstrates higher potential scientific impact due to its broader methodological contributions (novel ranking model for continuous payoffs, multi-dimensional behavioral profiling, human-LM comparisons with trained negotiators) and wider applicability across AI evaluation, economics, and strategic reasoning. It addresses fundamental questions about LLM capabilities in complex multi-agent settings with real-world business relevance. While AICompanionBench addresses an important safety concern, it is more narrowly focused on content classification. PieArena's methodological innovations—order-invariant leaderboards, agentic scaffolding analysis, and cross-play evaluation—offer transferable frameworks for the broader AI evaluation community.

vs. Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

claude-opus-4.6 · 6/5/2026

AICompanionBench addresses a timely, practical problem (AI companion safety) with a concrete, publicly available benchmark dataset and empirical evaluation of 20 LLMs. It has clear real-world applications given growing concerns about AI companion platforms, and benchmarks are high-impact resources that drive community progress. Paper 2 (Trivium) proposes an interesting theoretical framework for temporal regret in causal-memory controllers, but its contributions are more speculative, with limited empirical validation (pilot studies only), and the framework's practical adoption remains uncertain. The narrower audience and preliminary nature reduce its near-term impact.

vs. Large AI Models in Dental Healthcare: From General-Purpose Systems to Domain-Specific Foundation Models

gemini-3.1 · 6/5/2026

Paper 1 introduces a novel, first-of-its-kind benchmark dataset for AI companion safety, a highly critical and rapidly growing area of AI alignment and ethics. Benchmark datasets typically yield high scientific impact by establishing standardized evaluation metrics that drive future model development across the broader AI community. In contrast, Paper 2 is a scoping review limited to a specific niche (dental AI). While useful for clinical applications, it synthesizes existing literature rather than providing a new dataset, model, or methodology, making its broader scientific impact comparatively lower.

vs. When to Re-Plan: Subgoal Persistence in Hierarchical Latent Reasoning

gpt-5.2 · 6/5/2026

Paper 2 has higher potential impact due to a more novel algorithmic contribution (subgoal persistence as a controllable stability–adaptivity knob for latent hierarchical reasoning) with clear ablations, quantified optima, and replication over seeds, suggesting stronger methodological rigor. The idea generalizes beyond specific tasks (planning, agentic RL, reasoning architectures) and is timely for long-horizon/agentic LLM research. Paper 1 is valuable and timely for AI safety, but its main contribution is a relatively small benchmark dataset with potential sampling/labeling biases and narrower methodological novelty; impact may be more domain-specific.