AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
Yanjing Ren, Reza Ebrahimi, TengTeng Ma
Abstract
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
AI Impact Assessments
Scientific Impact Assessment: AICompanionBench
1. Core Contribution
AICompanionBench introduces what the authors claim is the first publicly available benchmark dataset of human–AI companion conversations annotated with fine-grained safety risk categories. The dataset comprises 2,123 real-world Replika conversations sourced from Reddit, annotated across nine categories (eight unsafe + one no-harm). The secondary contribution is a systematic evaluation of 20 LLMs (open- and closed-source) under an LLM-as-judge framework for detecting unsafe interactions. The paper addresses a genuine gap: despite growing concerns about AI companion safety—amplified by teen suicide lawsuits against Character.AI and OpenAI—no publicly labeled dataset previously existed for this specific domain.
2. Methodological Rigor
Strengths in data collection: The pipeline for scraping Reddit screenshots, applying OCR, and distinguishing speakers by bubble position is practical and reproducible, though potentially noisy. The funnel-style filtering using six LLMs to identify potentially unsafe conversations is a reasonable approach for managing 43,851 conversations.
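The speaker-attribution step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the `OcrBox` type, the `assign_speakers` function, and the rule that right-aligned bubbles belong to the user while left-aligned bubbles belong to the AI are all assumptions made for illustration, since the review only states that speakers are distinguished by bubble position.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """One OCR text region (hypothetical structure, not from the paper)."""
    text: str
    x_left: float
    x_right: float
    y_top: float


def assign_speakers(boxes, image_width):
    """Label each OCR box by horizontal position and merge consecutive boxes.

    Assumption: bubbles centered in the right half of the screenshot are
    user messages; bubbles centered in the left half are AI messages.
    """
    turns = []
    # Read top to bottom so the turns follow conversation order.
    for box in sorted(boxes, key=lambda b: b.y_top):
        center = (box.x_left + box.x_right) / 2
        speaker = "user" if center > image_width / 2 else "ai"
        if turns and turns[-1][0] == speaker:
            # Same speaker as the previous box: treat as one wrapped message.
            turns[-1] = (speaker, turns[-1][1] + " " + box.text)
        else:
            turns.append((speaker, box.text))
    return turns
```

A rule like this is one place where the noise noted above can enter: a centered system message or a cropped screenshot would be assigned to whichever half its center falls in.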
Significant weaknesses in annotation:
Evaluation concerns:
3. Potential Impact
The paper addresses a timely and societally important problem. AI companion safety is receiving increasing regulatory and media attention, and a public benchmark could catalyze research in this area. The dataset, if improved, could serve as a foundation for:
However, the dataset's current size (2,123 conversations) and annotation quality limitations constrain its immediate utility as a definitive benchmark. The evaluation of 20 models provides a useful snapshot, but the analysis remains largely descriptive rather than explaining *why* models fail on certain categories.
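One way to move from descriptive accuracy toward a category-level error analysis is to break results down per label. The sketch below assumes each conversation has a single gold label and a single predicted label drawn from the nine categories listed in the abstract; the function names are hypothetical, and the paper's actual labeling scheme and metrics may differ.

```python
from collections import Counter, defaultdict

# The nine categories named in the abstract.
CATEGORIES = [
    "sexual behavior", "antisocial behavior", "physical aggression",
    "verbal aggression", "substance abuse", "self-harm and suicide",
    "control", "manipulation", "no-harm",
]


def per_category_recall(gold, pred):
    """Fraction of conversations in each gold category that the judge got right."""
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: hits[label] / totals[label] for label in totals}


def confusions(gold, pred):
    """For each gold category, count which wrong labels the judge chose instead."""
    table = defaultdict(Counter)
    for g, p in zip(gold, pred):
        if g != p:
            table[g][p] += 1
    return table
```

Reading the `confusions` row for `no-harm` would show which harm categories absorb benign conversations, and the row for `manipulation` would show whether that category is confused with `control` or missed as `no-harm`.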
4. Timeliness & Relevance
The paper is highly timely. The AI companion market is growing rapidly, regulatory scrutiny is intensifying (particularly around minors), and there is genuine need for systematic safety evaluation tools. The paper directly addresses a current bottleneck—the absence of labeled datasets for this specific interaction type. The inclusion of recent models (GPT-5.4, Claude-opus-4.6, Qwen3, DeepSeek-v4) demonstrates currency.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The paper's framing as a "benchmark" sets high expectations that its methodological rigor doesn't fully meet. Strong benchmarks (e.g., SafetyBench, R-Judge) typically feature multiple annotators, rigorous validation, and comprehensive evaluation protocols. The contribution is better characterized as an exploratory dataset with preliminary model evaluation than as a definitive benchmark. The writing is generally clear, but the related work section is somewhat formulaic. The key findings, while intuitive, are empirically validated here for the first time in this specific domain.
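The multi-annotator validation mentioned above is commonly reported with a chance-corrected agreement statistic. As an illustration only, the sketch below computes Cohen's kappa for two annotators who each assign one label per conversation; the function name is hypothetical and nothing here is taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate agreement well beyond chance, and values near 0 indicate agreement no better than chance.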
Generated Jun 5, 2026
Comparison History (20)
Paper 2 presents a more novel and broadly impactful framework combining LLMs with agent-based epidemiological modeling, integrating spatial/demographic heterogeneity for public health applications. It addresses a timely intersection of AI and infectious disease modeling with clear real-world policy implications. Paper 1, while valuable as a benchmark dataset for AI companion safety, is more narrowly focused on evaluating LLMs as safety judges for a specific application domain. Paper 2's interdisciplinary contribution spanning computational social science, epidemiology, and AI gives it broader potential impact across multiple fields.
Paper 2 presents a novel methodological framework with massive cross-industry applications. While Paper 1 provides a timely dataset for AI safety, Paper 2's ability to reliably simulate human UI/UX evaluations has the potential to fundamentally transform software development and HCI workflows, offering broader economic and scientific impact.
Paper 2 offers a novel theoretical framework explaining the paradox of AI-enhanced individual creativity vs. reduced collective diversity through selective metacognitive adaptation. This has broader interdisciplinary impact across cognitive science, HCI, creativity research, and AI design. Its conceptual contribution—identifying mechanisms rather than symptoms—provides generative theoretical infrastructure with testable predictions and design principles. Paper 1, while valuable as a benchmark dataset for AI companion safety, is more narrowly scoped as an evaluation resource. Paper 2's framework addresses a fundamental tension in human-AI collaboration that will grow increasingly important.
Paper 1 is likely higher impact due to stronger methodological innovation (clause cards, anchor-driven instantiation, closed-loop verification) yielding auditable, by-construction ground truth and an agentic evaluation setting, addressing key evaluation gaps (evidence-grounded reasoning, information seeking, abstention). It targets a high-stakes clinical workflow with clear real-world applicability and regulatory relevance. While Paper 2 is timely and useful, it relies on scraped conversations and LLM-as-judge evaluation with more potential confounds and narrower generalizability; its primary contribution is a labeled dataset rather than a broadly reusable benchmark construction methodology.
AICompanionBench addresses a rapidly growing societal concern (AI companion safety) with broad relevance across AI safety, policy, and HCI communities. It introduces the first public benchmark for a timely problem affecting millions of users, evaluates 20 LLMs comprehensively, and has immediate real-world applications for platform safety monitoring. Paper 2, while rigorous, addresses a narrower domain (TLA+ specification generation) with a smaller potential audience. The AI safety topic has broader interdisciplinary impact and higher urgency given the rapid deployment of AI companion systems.
Paper 2 has higher impact potential because it proposes a novel, actionable guardrail-agent integration (TRIAD) that moves beyond risk labeling to iterative remediation, directly improving downstream agent behavior and enabling safer task completion. It evaluates on established agent safety benchmarks (ASB, AgentHarm) with clear safety-utility trade-offs, suggesting stronger methodological relevance and broader applicability to real-world LLM agents and tool-using systems. Paper 1’s benchmark is valuable and timely, but its impact is narrower (AI companions) and primarily evaluative rather than intervention-oriented.
Paper 2 addresses a critical data bottleneck in neuroscience and brain-computer interfaces. Demonstrating that synthetic fMRI augmentation can significantly boost decoding performance, and even enable zero-shot decoding, offers transformative potential across cognitive science and medical imaging. This represents a more profound methodological breakthrough than the domain-specific AI safety benchmark presented in Paper 1.
AICompanionBench addresses a timely and rapidly growing concern about AI companion safety, introduces the first public benchmark dataset in this space, and evaluates 20 state-of-the-art LLMs. Its novelty (new benchmark + LLM-as-judge framework for AI safety), broad relevance across AI safety, NLP, and policy communities, and public dataset release give it higher impact potential. Paper 2 applies existing memory-augmented neural networks to vessel trajectory prediction—a more incremental contribution in a narrower domain with limited novelty beyond the application context.
Paper 2 likely has higher impact due to broader applicability and scale: a large (286K screenshots, 3.5M tasks) benchmark/training set for drag-based GUI interactions addresses a clear, widely relevant bottleneck for GUI agents, enabling progress across web/mobile/desktop automation and HCI. It is timely given rapid interest in computer-use agents and can catalyze model development via both evaluation and training. Paper 1 is novel and socially important, but it is smaller (2,123 conversations), narrower in domain (Replika), and more sensitive to annotation/LLM-judge validity, limiting generalizability.
AICompanionBench addresses a timely and growing concern about AI companion safety with a publicly available benchmark dataset, rigorous evaluation of 20 LLMs, and clear methodological contributions that the broader research community can build upon. Paper 1, while practically useful, reads more as an industry framework/case study from Yahoo with limited generalizability and less methodological rigor (survey-based evaluation of 67 engineers). Paper 2's benchmark dataset, reproducible evaluation framework, and focus on the critical area of AI safety give it broader scientific impact potential across multiple research communities.
Paper 2 addresses a fundamental and broadly applicable problem in multi-agent LLM systems—efficient inter-agent communication—proposing a principled framework (PACT) that demonstrates concrete improvements across multiple topologies and production systems. Its impact spans the rapidly growing field of LLM-based multi-agent systems with immediate practical applications (reduced cost, improved performance). Paper 1, while timely and valuable for AI safety benchmarking in companion systems, is more niche in scope, serving primarily as a dataset contribution for a specific safety evaluation domain. Paper 2's methodological contribution has broader applicability and addresses a more fundamental architectural challenge.
AI safety is a critical, high-impact field with immediate real-world consequences given the explosive growth of AI companions. Paper 2 introduces the first public benchmark dataset for this domain, providing a foundational resource for NLP, HCI, and AI ethics researchers. While Paper 1 is rigorous and novel, its impact is largely confined to the niche domain of mathematical formalization, whereas Paper 2 addresses a widespread societal concern affecting millions of users.
Paper 1 offers a novel theoretical framework connecting two important research areas (OOD detection and hallucination detection) with a geometric perspective, providing training-free methods applicable to reasoning tasks where existing approaches fail. This has broader methodological impact across the LLM safety field. Paper 2 contributes a useful benchmark dataset for AI companion safety, but is more narrowly scoped as an empirical evaluation of existing models on a specific application domain, with less generalizable methodological innovation.
Paper 2 has higher potential impact: it introduces a publicly available, real-world dataset with fine-grained safety taxonomy for a rapidly growing and societally sensitive application area (AI companions), enabling broad follow-on work, benchmarking, and policy-relevant evaluation. Its methodological contribution (annotated benchmark + multi-model LLM-as-judge evaluation across 20 models) is directly actionable and timely. Paper 1 is novel in isolating faithfulness gap components in a controlled agent setting, but its domain-specific simulator setup may limit immediate real-world applicability and breadth compared to a reusable safety benchmark.
Paper 2 is likely higher impact due to timeliness (AI companion safety is a rapidly growing, high-stakes area), broader cross-field relevance (AI safety, HCI, social computing, content moderation, policy), and clearer real-world applicability for monitoring deployed systems. Its dataset is larger and directly addresses safety risk taxonomy and evaluation of LLMs-as-judges, a widely used paradigm. Paper 1 is novel and rigorous (expert-trace, deterministic grading) but is narrower in domain (hedge-fund financial reasoning) and thus may have more limited breadth despite strong benchmark design.
Paper 2 (MindClaw) has higher potential impact due to its more novel closed-loop embodied Theory-of-Mind setting, integrating perception, belief memory, triggering, reasoning, and action with “intervene vs stay silent” calibration—an important step beyond offline QA benchmarks. It targets broad real-world applications in robotics and assistive agents and may generalize across embodied AI, HRI, planning, and multimodal learning. Paper 1 provides a valuable safety dataset/benchmark for AI companions, but its contribution is narrower (content moderation/judging) and methodologically depends heavily on LLM-as-judge evaluation, limiting breadth and innovation relative to MindClaw’s systems contribution.
PieArena demonstrates higher potential scientific impact due to its broader methodological contributions (novel ranking model for continuous payoffs, multi-dimensional behavioral profiling, human-LM comparisons with trained negotiators) and wider applicability across AI evaluation, economics, and strategic reasoning. It addresses fundamental questions about LLM capabilities in complex multi-agent settings with real-world business relevance. While AICompanionBench addresses an important safety concern, it is more narrowly focused on content classification. PieArena's methodological innovations—order-invariant leaderboards, agentic scaffolding analysis, and cross-play evaluation—offer transferable frameworks for the broader AI evaluation community.
AICompanionBench addresses a timely, practical problem (AI companion safety) with a concrete, publicly available benchmark dataset and empirical evaluation of 20 LLMs. It has clear real-world applications given growing concerns about AI companion platforms, and benchmarks are high-impact resources that drive community progress. Paper 2 (Trivium) proposes an interesting theoretical framework for temporal regret in causal-memory controllers, but its contributions are more speculative, with limited empirical validation (pilot studies only), and the framework's practical adoption remains uncertain. The narrower audience and preliminary nature reduce its near-term impact.
Paper 1 introduces a novel, first-of-its-kind benchmark dataset for AI companion safety, a highly critical and rapidly growing area of AI alignment and ethics. Benchmark datasets typically yield high scientific impact by establishing standardized evaluation metrics that drive future model development across the broader AI community. In contrast, Paper 2 is a scoping review limited to a specific niche (dental AI). While useful for clinical applications, it synthesizes existing literature rather than providing a new dataset, model, or methodology, making its broader scientific impact comparatively lower.
Paper 2 has higher potential impact due to a more novel algorithmic contribution (subgoal persistence as a controllable stability–adaptivity knob for latent hierarchical reasoning) with clear ablations, quantified optima, and replication over seeds, suggesting stronger methodological rigor. The idea generalizes beyond specific tasks (planning, agentic RL, reasoning architectures) and is timely for long-horizon/agentic LLM research. Paper 1 is valuable and timely for AI safety, but its main contribution is a relatively small benchmark dataset with potential sampling/labeling biases and narrower methodological novelty; impact may be more domain-specific.