The DeepSpeak-Agentic Dataset

Sarah Barrington, Maty Bohacek, Hany Farid

Jun 2, 2026

arXiv:2606.03686v1

cs.AI (primary)

#2596 of 3404 · Artificial Intelligence

Tournament Score: 1331 ± 44 (scale 1050 to 1800)

Win Rate: 33%

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 7

Abstract

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large-language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: The DeepSpeak-Agentic Dataset

1. Core Contribution

The paper introduces DeepSpeak-Agentic, a dataset of 200 semi-structured video conversations (~37 hours) between humans and embodied AI agents. The key novelty lies in the shift from static deepfake detection datasets (manipulated images/audio) to real-time, interactive, multi-modal synthetic media—where an LLM, synthetic voice, and visual avatar are jointly deployed in live conversation with an unsuspecting human participant. The paper also contributes a scalable data-capture pipeline (agent creation, participant pairing, recording, speaker separation) and provides baseline forensic evaluation across text, audio, and video modalities.

The problem addressed is timely: as commercial platforms (Tavus, HeyGen) enable real-time embodied agents, the forensic and HCI communities lack datasets capturing this interactive setting. Prior deepfake datasets focus on offline manipulation of existing recordings, not live agentic interactions.

2. Methodological Rigor

Dataset construction is well-documented. The use of 143 distinct agent configurations (varying LLMs, voices, visual personas, and scenarios) provides reasonable diversity. The four scenario types (conversational, professional, collaborative planning, creative) offer varied interaction contexts. The IRB-approved mild deception—not telling participants they were interacting with AI—is a sound methodological choice for eliciting naturalistic behavior.

Speaker isolation combines Pyannote diarization with MediaPipe lip-tracking, which is a practical and clever approach for separating interleaved streams. The post-processing pipeline (merging, padding, fade application) shows attention to audio quality.
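The merging and padding steps mentioned above can be illustrated with a minimal sketch. It assumes the diarization output for one speaker is already available as (start, end) pairs in seconds. The function name `merge_and_pad` and the `max_gap` and `pad` defaults are hypothetical and are not taken from the paper. Fades and the lip-tracking cross-check are omitted.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec) for a single speaker


def merge_and_pad(
    segments: List[Segment],
    max_gap: float = 0.5,            # hypothetical: merge segments closer than this
    pad: float = 0.1,                # hypothetical: seconds added on each side
    duration: float = float("inf"),  # total recording length, used for clamping
) -> List[Segment]:
    """Merge segments separated by short gaps, then pad each merged segment."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap to the previous segment is small: extend that segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Pad both sides and clamp to the bounds of the recording.
    # Padded segments stay disjoint as long as 2 * pad <= max_gap.
    return [(max(0.0, s - pad), min(duration, e + pad)) for s, e in merged]
```

Merging before padding keeps short pauses inside one utterance rather than producing many tiny clips. The thresholds would need tuning against real recordings.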

Limitations in rigor are notable:

The moderation pipeline initially rejected 131 of 263 recordings, and 68 of those were reinstated after manual review, meaning roughly 52% of the automated rejections (68/131) were false positives. This high error rate raises questions about consistency and reproducibility.

The forensic evaluation (Table 1) uses only off-the-shelf detectors without fine-tuning on the new domain. While this demonstrates a gap, it limits interpretive value—we cannot distinguish whether poor performance stems from domain shift or fundamental detector limitations. No cross-validation or confidence intervals are reported.

The human discriminability study is informative but rudimentary: 80.5% of participants detected the AI within 10 seconds, which somewhat undermines the "realism" narrative. The qualitative coding via LLM-assisted codebook (Table 3) lacks inter-rater reliability metrics.

The demographic pool, while gender-balanced, is heavily skewed toward White/Caucasian participants (75%), limiting generalizability of interaction patterns and perceptual findings.
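The moderation figures cited above can be checked with a few lines of arithmetic. This sketch uses only the counts quoted in this review (263 recordings, 131 initial rejections, 68 reinstatements). The function name and the "strict" rate are illustrative, and the strict rate rests on the labeled assumption that manual review is ground truth and that no initially accepted recording should have been rejected.

```python
def moderation_summary(total: int, rejected: int, reinstated: int) -> dict:
    """Summarize an automated moderation pass followed by manual review."""
    accepted_initially = total - rejected
    final_kept = accepted_initially + reinstated
    # Share of automated rejections that manual review overturned.
    overturned_share = reinstated / rejected
    # ASSUMPTION: every finally kept recording is truly acceptable, so the
    # strict false positive rate is wrongly flagged / all acceptable recordings.
    strict_fpr = reinstated / final_kept
    return {
        "accepted_initially": accepted_initially,
        "final_kept": final_kept,
        "overturned_share": overturned_share,
        "strict_fpr": strict_fpr,
    }


summary = moderation_summary(total=263, rejected=131, reinstated=68)
# final_kept is 200, matching the 200 conversations in the released dataset.
# overturned_share is about 0.52; strict_fpr is 0.34 under the assumption above.
```

The roughly 52% figure is therefore the share of rejections that were overturned. Under the stated assumption, the conventional false positive rate would be closer to 34%.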

3. Potential Impact

Forensic applications: The finding that current audio and video deepfake detectors perform poorly (best video EER: 33%, best audio EER: 23%) on agentic content is an important signal to the media forensics community. It demonstrates that real-time interactive agents represent a distinct challenge from pre-recorded deepfakes. The text detection result (Desklib EER: 8%) is encouraging and suggests LLM text detection may transfer to conversational settings.

HCI and AI safety: The conversational dynamics data (turn-taking, latency, word counts, speaking fractions) provide useful baselines for studying human-AI interaction patterns. The 3.79s mean agent latency, compared to ~250ms in natural conversation, highlights a key realism gap.

Benchmarking: As agent technology improves rapidly, having a temporal benchmark is valuable—though the paper correctly notes this is a snapshot, not a permanent standard.

Broader influence: The dataset could serve researchers in deepfake detection, conversational AI evaluation, human factors, trust/deception studies, and AI governance. The public release on HuggingFace with code enhances accessibility.
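For readers less familiar with the equal error rate (EER) quoted under forensic applications above: it is the error rate at the decision threshold where the false positive rate and false negative rate are equal, so lower is better and 50% corresponds to chance for a balanced detector. The sketch below approximates it with a simple threshold sweep. The function name and the label convention (1 = AI-generated, 0 = real, higher score = more likely AI) are assumptions. The review does not state how the paper computed its EER values.

```python
from typing import Sequence


def equal_error_rate(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Approximate the EER by sweeping thresholds over the observed scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # AI-generated
    neg = [s for s, y in zip(scores, labels) if y == 0]  # real
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    best_gap, eer = float("inf"), 1.0
    # Candidate thresholds: every observed score, plus +inf (flag nothing).
    for t in sorted(set(scores)) + [float("inf")]:
        fpr = sum(s >= t for s in neg) / len(neg)  # real flagged as AI
        fnr = sum(s < t for s in pos) / len(pos)   # AI missed
        if abs(fpr - fnr) < best_gap:
            # Closest crossing so far: report the midpoint of the two rates.
            best_gap, eer = abs(fpr - fnr), (fpr + fnr) / 2
    return eer
```

With finite data the two rates rarely cross exactly, which is why the sketch returns the midpoint at the closest threshold. Library implementations typically interpolate along the ROC curve instead.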

4. Timeliness & Relevance

The paper is highly timely. The opening anecdote about Zoom's CEO using an AI clone for earnings calls effectively frames the practical urgency. Commercial embodied agents are proliferating, and there is a clear gap between existing forensic datasets (static manipulations) and the interactive agent paradigm. The paper fills this gap at a moment when regulatory and safety communities are actively grappling with agentic AI governance.

5. Strengths & Limitations

Key Strengths:

First large-scale dataset of real-time human-embodied-agent video conversations

Scalable, automated collection pipeline that could be replicated and extended

Multi-modal analysis (text, audio, video) with both human and machine discriminability evaluation

Public release of data, metadata, and code

Well-designed experimental protocol with IRB approval and ethical considerations

Notable Weaknesses:

The dataset is relatively small (200 conversations, 37 hours) compared to major deepfake benchmarks

Agent technology is limited to two commercial providers, reducing diversity

The 80.5% instant detection rate suggests current agents are far from convincing, which may limit the dataset's forensic challenge value in the near term

No fine-tuned detection baselines are provided—only off-the-shelf evaluation, making the forensic contribution somewhat shallow

English-only, limiting cross-linguistic applicability

The paper is primarily descriptive rather than analytically deep; the "insights" section (Section 5) reports statistics but offers limited novel analysis of interaction patterns

No comparison with human-human conversation baselines for the reported metrics

Missing elements:

Statistical tests for claimed differences (e.g., word count disparities, latency differences across platforms)

Analysis of how different LLM/voice/avatar combinations affect detection or realism

Longitudinal considerations—how quickly will this dataset become obsolete?

Summary

DeepSpeak-Agentic makes a timely and useful contribution by introducing the first substantial dataset of live human-embodied-agent video interactions. Its primary value is as a community resource and benchmark rather than as a source of deep analytical insights. The forensic findings—particularly the failure of existing detectors—serve as a valuable call to action. However, the relatively small scale, limited agent diversity, descriptive analysis, and lack of rigorous statistical evaluation temper the immediate scientific impact. The dataset's longevity will depend on how quickly the field evolves and whether the authors deliver on promised extensions.

Rating: 5.8 / 10

Significance 6.5 · Rigor 5 · Novelty 6.5 · Clarity 7

Generated Jun 3, 2026

Comparison History (21)

vs. Rashomon Memory: Towards Argumentation-Driven Retrieval for Multi-Perspective Agent Memory

claude-opus-4.6 · 6/5/2026

Paper 2 introduces a novel theoretical architecture (Rashomon Memory) that addresses a fundamental limitation in AI agent memory systems—handling conflicting interpretations through argumentation semantics. This combines formal argumentation theory with practical multi-agent memory, offering broad applicability across AI planning, negotiation, and decision-making. Paper 1, while valuable as a dataset contribution for deepfake/AI detection research, is more incremental and narrowly scoped. Paper 2's conceptual framework has greater potential to influence multiple research directions in agent architectures, explainability, and multi-perspective reasoning.

vs. Uncertainty Aware Functional Behavior Prediction and Material Fatigue Assessment for Circular Factory

gemini-3.1 · 6/5/2026

Paper 1 addresses the highly timely and critical challenge of AI forensics and human-agent interactions. Datasets and benchmarks in generative AI security typically achieve widespread adoption and high citation rates across multiple disciplines like computer vision, NLP, and cybersecurity. While Paper 2 demonstrates strong methodological rigor, its focus on predictive maintenance for circular factories is a niche industrial application, resulting in a narrower overall scientific impact.

vs. Distilling Answer-Set Programming Rules from LLMs for Neurosymbolic Visual Question Answering

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact due to broader applicability and timeliness: a large multimodal dataset (37+ hours) and scalable capture pipeline can become shared infrastructure for multiple communities (forensics/deepfake detection, human–AI interaction, embodied agents, multimodal ML). It directly targets urgent real-world needs around AI agent identification and evaluation. Paper 1 is novel and methodologically interesting (LLM-to-ASP rule distillation with solver feedback) but is more specialized to neurosymbolic VQA and may have narrower downstream adoption compared to a widely usable benchmark dataset.

vs. Uncertainty-Aware Clarification in LLM Agents with Information Gain

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact: it introduces a principled, generalizable clarification framework for LLM agents using an information-gain reward grounded in Bayesian belief updates, with cross-backbone evaluation and measurable task-success improvements. This is timely for agent reliability and applies broadly across tool-using agents, HCI, and decision-making under uncertainty. Paper 1 provides a valuable dataset and capture pipeline, but its impact may be narrower (forensics/embodied-agent interaction) and more dependent on downstream adoption, whereas Paper 2 offers a reusable method that can be integrated across many agent systems.

vs. NBQ: Next-Best-Question for Dynamic Profiling

gpt-5.2 · 6/3/2026

Paper 2 likely has higher impact because it releases a sizable, multimodal human–embodied-agent conversation dataset plus a scalable capture pipeline, enabling broad downstream research (forensics/AI attribution, human–agent interaction, multimodal LLMs, speech/face synthesis). This is timely given deepfake proliferation and agent deployment, and can become a standard benchmark. Paper 1 is a solid systems/algorithm contribution with clear applications (profiling/matchmaking) and efficiency gains, but its impact is narrower and more domain-specific, and depends more on task framing and evaluation choices.

vs. TrafficRAG: A Multimodal RAG Framework for Traffic Accident Liability Determination

claude-opus-4.6 · 6/3/2026

DeepSpeak-Agentic introduces a novel, publicly available dataset addressing the increasingly critical problem of AI-generated media detection and human-agent interaction analysis. Its contributions—a scalable data-capture pipeline and a multimodal benchmark spanning audio, video, and text forensics—have broader interdisciplinary impact across deepfake detection, HCI, and AI safety. Paper 1 (TrafficRAG) applies existing techniques (RAG, VLMs, hybrid retrieval) to a narrower domain. While competent, it represents incremental engineering rather than foundational contribution. Paper 2's dataset and benchmark are more likely to enable future research across multiple fields.

vs. CAREAgent: Clinical Agent with Structured Reasoning and Tool-Integrated for Order Generation

gpt-5.2 · 6/3/2026

Paper 2 (CAREAgent) has higher estimated scientific impact due to its direct, high-stakes real-world application in clinical order generation, a clear methodological contribution (tool-integrated structured reasoning, verifiable trajectories, filtering, SFT+RL with multi-dimensional rewards), and demonstrated performance gains on multiple benchmarks including an unseen test set. Its approach is timely with growing interest in clinical AI agents and safety/validity constraints, and it can influence medical NLP, agentic tool use, and reinforcement learning for constrained decision-making. Paper 1 is valuable but primarily a dataset/benchmark with narrower immediate application.

vs. What Benchmarks Don't Measure: The Case for Evaluating Abstention Competence in Autonomous Agents

gemini-3.1 · 6/3/2026

Paper 2 addresses a critical flaw in AI agent training and evaluation (compliance bias) with high relevance to AI safety and alignment. By introducing a novel taxonomy and evaluation protocols for abstention competence, it has broad implications for how future autonomous agents are benchmarked and deployed safely in real-world scenarios. While Paper 1 provides a useful dataset, Paper 2 offers a fundamental conceptual and methodological shift with wider potential impact across the AI field.

vs. CORE: Conflict-Oriented Reasoning for General Multimodal Manipulation Detection

gpt-5.2 · 6/3/2026

Paper 2 likely has higher scientific impact due to a novel, generalizable detection paradigm (conflict-oriented reasoning) that targets the core weakness of manipulation detectors—poor transfer to unseen manipulations—while leveraging MLLMs for few/zero-shot adaptation. Its methodological contribution spans a new framework plus a fine-grained annotated corpus (CAC) and is immediately relevant to urgent real-world misinformation threats, with broad applicability across multimodal AI safety, media forensics, and NLP/Vision. Paper 1 is valuable as a dataset/benchmark, but its scope is narrower (embodied-agent conversations/forensics) and may have more limited cross-domain uptake.

vs. S3TS: Stochastic Scenario-Structured Tree Search for Advanced Planning Under Uncertainty

gemini-3.1 · 6/3/2026

Paper 2 demonstrates higher potential scientific impact due to its broad applicability and timeliness. While Paper 1 offers a valuable algorithmic contribution to energy grid planning, Paper 2 addresses the highly critical, rapidly growing fields of embodied AI and AI forensics. By providing a multimodal dataset and a scalable capture system, Paper 2 serves as a foundational benchmark likely to be utilized across diverse disciplines—including human-computer interaction, cybersecurity, and machine learning—leading to higher citation potential and broader real-world relevance.

vs. Enhancing Operational Safety via Agentic Dialogue Hazard Identification Analysis

gemini-3.1 · 6/3/2026

Paper 1 introduces a foundational dataset and scalable pipeline for AI forensics and human-AI interaction. Benchmark datasets in AI detection and deepfakes typically achieve broad, high-impact usage across computer vision, audio, NLP, and security fields. While Paper 2 presents an innovative multi-agent approach for hazard analysis, its impact is more confined to the niche intersection of industrial safety and LLM applications, making Paper 1's general-purpose security benchmark more likely to yield widespread scientific impact.

vs. Think-Before-Speak: From Internal Evaluation to Public Expression in Multi-Agent Social Simulation

claude-opus-4.6 · 6/3/2026

Paper 2 introduces a novel theoretical framework (TBS) that bridges internal cognitive processes and public expression in multi-agent social simulation, offering mechanistic insights into opinion dynamics, silence, and deliberation. This has broader interdisciplinary impact across computational social science, psychology, and AI. Paper 1 contributes a valuable dataset for deepfake detection but is more incremental in nature—extending existing forensic evaluation to agentic conversations. Paper 2's framework for modeling internal states like dissonance appraisal and spiral of silence dynamics opens new research directions with greater theoretical depth.

vs. EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

claude-opus-4.6 · 6/3/2026

EvoTrainer introduces a fundamentally novel paradigm for autonomous LLM training—co-evolving both policies and training harnesses—addressing a core limitation in agentic RL. It demonstrates strong results across multiple challenging domains (math reasoning, code generation, software engineering) and proposes a generalizable framework that could reshape how LLM training is conducted. Paper 1 contributes a valuable dataset for deepfake detection and human-agent interaction research, but is more incremental in scope. Paper 2's methodological innovation and broad applicability across the rapidly growing LLM/RL field give it higher potential impact.

vs. Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

gemini-3.1 · 6/3/2026

Paper 2 introduces a novel methodological advancement (Imaginative Perception Tokens) that addresses a fundamental limitation in Multimodal Large Language Models regarding spatial reasoning. Demonstrating that visual intermediate representations outperform textual Chain-of-Thought provides critical architectural insights for a highly active field. While Paper 1 offers a timely dataset for forensics and human-agent interactions, its relatively small scale (37 hours) likely limits its foundational, long-term scientific impact compared to the algorithmic innovations presented in Paper 2.

vs. Do Real-World Datasets Contain Natural Experiments? An Empirical Study Using Causal Feature Selection

gemini-3.1 · 6/3/2026

Paper 1 introduces a conceptually novel approach to identifying and leveraging natural experiments in standard observational datasets using causal discovery. This offers fundamental methodological advancements with broad impact across any discipline relying on machine learning and observational data. In contrast, while Paper 2 provides a timely and useful dataset for AI forensics and human-AI interaction, its scientific scope is narrower and primarily serves as a benchmark rather than proposing a foundational methodological shift. Thus, Paper 1 has higher potential for widespread, cross-disciplinary scientific impact.

vs. From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

claude-opus-4.6 · 6/3/2026

Paper 1 addresses a critical gap in LLM evaluation for chemistry—process-level reasoning verification rather than just final-answer correctness. It introduces a novel, scalable benchmark (ChemCoTBench-V2) with deterministic chemical rule checking, avoiding costly LLM judges. This has broad impact across AI safety, scientific reasoning evaluation, and chemistry AI applications. Paper 2 contributes a useful but more niche dataset for deepfake detection in human-agent interactions. While timely, its methodological contribution (data collection pipeline + benchmark) is more incremental compared to Paper 1's novel evaluation paradigm for scientific reasoning verification.

vs. SAGE: A Quantitative Evaluation of Socialized Evolution in Agent Ecosystems

claude-opus-4.6 · 6/3/2026

SAGE introduces a novel evaluation framework addressing an under-studied question about social vs. self-improvement in language agents, with rigorous methodology (compute-matched conditions, counterfactual controls) across diverse domains. Its findings about when and how peer experience helps agents have broad implications for multi-agent system design, a rapidly growing field. Paper 2 contributes a useful dataset for deepfake detection and human-agent interaction, but is more incremental—primarily a resource contribution rather than generating new scientific insights or methodology with broad theoretical implications.

vs. ChatHealthAI: Aligning Electronic Health Record Representations with Large Language Models for Grounded Clinical Reasoning

gpt-5.2 · 6/3/2026

Paper 1 has higher potential impact due to its novel, timely integration of EHR foundation models with LLMs to enable grounded, interpretable clinical reasoning—an area with clear, high-stakes real-world applications in healthcare. It proposes a concrete methodological contribution (task-aware resampler aligning representations) and demonstrates benefits on established predictive tasks, suggesting stronger rigor and immediate utility. Paper 2 provides a useful dataset and infrastructure for agent forensics and interaction studies, but its impact is more indirect and may be narrower unless broadly adopted as a standard benchmark.

vs. RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

gemini-3.1 · 6/3/2026

Paper 1 offers a timely and novel dataset addressing the critical and rapidly growing field of AI forensics and human-agent interaction. Its potential impact spans AI safety, security, and HCI, providing foundational resources for detecting deepfakes and studying embodied AI. Paper 2 presents a valuable but more incremental optimization technique for cost-saving in multi-hop QA systems, which, while highly practical, has a narrower scientific scope compared to the broader societal and interdisciplinary implications of Paper 1.

vs. Tracking the Behavioral Trajectories of Adapting Agents

gpt-5.2 · 6/3/2026

Paper 2 is more methodologically novel and broadly applicable: it introduces a general framework to quantify and track behavioral trait shifts from agent configuration/skill edits, with a concrete evaluation protocol (including intermediary-based assessment) relevant to safety, governance, and continual agent development. Its approach can transfer across agent platforms and traits, and addresses a timely problem (monitoring adapting agents). Paper 1 provides a useful dataset and capture system, but its impact is narrower (embodied conversation forensics/interaction study) and largely infrastructural rather than a new analytical paradigm.