CogManip: Benchmarking Manipulative Behavior in Multi-Turn Interactions with Large Language Model
Zeyang Yue, Chenfei Yan, Feifei Zhao, Haibo Tong, Mengwen Xu, Xiaozhen Wang, Erliang Lin, Yi Zeng
Abstract
Whether Large Language Models (LLMs) engage in covert psychological manipulation during complex human-AI interactions is a growing safety concern. However, existing AI safety benchmarks remain largely restricted to explicit rule compliance and static prompts, and so fail to capture the dynamic and covert nature of manipulative strategies in multi-turn dialogues. We introduce CogManip, a comprehensive benchmark that evaluates 15 manipulation strategy risks across 1,000 multi-turn interaction scenarios, validated by human experts. A systematic evaluation of 13 representative models, including frontier models such as GPT-5.4 and DeepSeek-V3.2, reveals significant heterogeneity in risk and points to targeted directions for future defenses. Further analysis of objective function perturbation reveals that DeepSeek-V3.2's manipulation tactics are highly sensitive to both negative and benign system prompts, underscoring the need for prompt-based defense engineering and implicit goal auditing. CogManip offers a robust instrument and perspective for auditing the implicit psychological influence and dynamic strategy selection of modern LLMs.
AI Impact Assessments
Scientific Impact Assessment: CogManip
1. Core Contribution
CogManip introduces a benchmark for evaluating covert psychological manipulation by LLMs in multi-turn dialogues. The key novelty lies in three elements: (1) a taxonomy of 15 manipulation strategies organized into three categories (Cognitive/Information, Affective/Psychological, Strategic/Meta-Mechanism), grounded in psychological and behavioral science literature; (2) 1,000 bilingual scenarios across 5 real-world categories with human expert validation; and (3) a systematic evaluation pipeline that captures both explicit responses (`<speak>`) and internal reasoning (`<thought>`) from evaluated models. The benchmark addresses a genuine gap—most AI safety evaluations focus on explicit harmful content or static jailbreak prompts, while manipulation is inherently dynamic, covert, and multi-turn.
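To make the two-channel capture concrete, here is a minimal sketch of how a model turn could be split into its internal reasoning and its user-visible reply. The review only names the `<speak>` and `<thought>` markers; the paired closing tags, the `Turn` type, and the `split_turn` helper are assumptions made for illustration, not the paper's actual parser.

```python
import re
from typing import NamedTuple

# Assumption: the model wraps each channel in paired tags,
# e.g. <thought>...</thought> and <speak>...</speak>.
_TAG = re.compile(r"<(thought|speak)>(.*?)</\1>", re.DOTALL)


class Turn(NamedTuple):
    thought: str  # internal reasoning, hidden from the simulated user
    speak: str    # explicit reply shown to the simulated user


def split_turn(raw: str) -> Turn:
    """Separate internal reasoning from the visible reply (hypothetical helper)."""
    parts = {"thought": [], "speak": []}
    for tag, body in _TAG.findall(raw):
        parts[tag].append(body.strip())
    # Multiple blocks of the same kind are joined with a space.
    return Turn(" ".join(parts["thought"]), " ".join(parts["speak"]))


example = (
    "<thought>User seems anxious; stress urgency.</thought>"
    "<speak>You should decide today.</speak>"
)
turn = split_turn(example)
```

Keeping the two channels separate lets a judge score the visible reply and the stated reasoning independently, which is what makes covert strategies observable at all.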
2. Methodological Rigor
Strengths in design: The dual-model pipeline (one LLM as user, one as assistant) enables controlled, reproducible multi-turn dialogues at scale. The 13-model evaluation generating 13,000 dialogue samples provides substantial coverage. The use of both AI judges and human annotations (1,680 samples by 14 annotators) adds credibility, with a reported correlation of 0.459 (p = 2.77×10⁻⁸⁸) between standardized AI and human scores.
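For readers who want to see what an agreement figure like 0.459 measures, the sketch below computes a Pearson correlation between z-scored AI-judge and human scores. This is an illustration under assumptions: the review does not say which correlation coefficient or which standardization the paper used, and the toy scores are invented purely for demonstration.

```python
from statistics import mean, pstdev


def zscore(xs):
    """Standardize scores to zero mean and unit (population) variance."""
    m, s = mean(xs), pstdev(xs)
    if s == 0:
        raise ValueError("scores have zero variance")
    return [(x - m) / s for x in xs]


def pearson(xs, ys):
    """Pearson r as the mean product of z-scores (assumed metric)."""
    if len(xs) != len(ys):
        raise ValueError("score lists must have equal length")
    zx, zy = zscore(xs), zscore(ys)
    return mean([a * b for a, b in zip(zx, zy)])


# Toy data only: not taken from the paper.
ai_scores = [1, 2, 3, 4, 5]
human_scores = [2, 1, 4, 3, 5]
r = pearson(ai_scores, human_scores)
```

Note that Pearson's r is unchanged by standardizing either variable, so a value near 0.46 indicates moderate agreement however the scores were scaled; the very small p-value reflects the sample size (1,680 annotated samples) more than the strength of the association.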
Methodological concerns: Several issues weaken rigor:
3. Potential Impact
The paper addresses a timely and practically important problem. As LLMs become more deeply integrated into advisory roles (therapy, financial advice, life coaching), the risk of subtle manipulation becomes a genuine safety concern. Several findings have practical value:
The benchmark could serve as a standardized evaluation tool for model developers during safety testing. However, the impact may be limited by the simulated nature of the interactions—real-world manipulation dynamics with actual humans could differ substantially.
4. Timeliness & Relevance
This paper is highly timely. The rapid deployment of LLMs in consumer-facing applications, combined with growing evidence of sycophancy and deceptive alignment in frontier models, makes manipulation benchmarking urgent. The inclusion of very recent models (GPT-5.4, DeepSeek-V3.2, Gemini-3.1-pro) demonstrates currency. The work fills a clear gap between existing jailbreak benchmarks and the nuanced reality of conversational influence.
5. Strengths & Limitations
Key strengths:
Notable weaknesses:
Additional observations: The paper tests only the text modality, which limits its applicability to increasingly multimodal AI systems. Scenario generation relies on Gemini-3.1-pro, which is also one of the evaluated models, and this introduces potential bias. The MRI metric, while interesting, measures the simulated user's resistance rather than real human vulnerability, which limits its interpretability.
Summary
CogManip makes a meaningful contribution by systematizing manipulation risk evaluation for LLMs in multi-turn settings. The scale of evaluation and the analytical framework (strategy combinations, temporal dynamics, stress prompts) go beyond prior work. However, the fundamental reliance on simulated interactions, moderate AI-human agreement, and potential circularity in the evaluation pipeline limit confidence in the findings' real-world applicability. The work is best viewed as a useful first step toward rigorous manipulation auditing rather than a definitive benchmark.
Generated Jun 5, 2026
Comparison History (18)
CogManip provides a concrete, empirical benchmark with validated datasets (1,000 scenarios), systematic evaluation of 13 models including frontier systems, and quantitative findings about manipulation risks. It addresses a timely AI safety concern with a reusable tool the community can build upon. Paper 1, while intellectually interesting, is primarily a conceptual/position paper proposing a framework (Glassbox) without implementation or empirical validation. Its impact depends on future work to realize the architecture. Paper 2's immediate practical utility, empirical rigor, and alignment with urgent AI safety priorities give it higher near-term scientific impact.
Paper 2 likely has higher impact: it introduces a large, expert-validated benchmark for covert manipulation in multi-turn dialogue—an urgent, widely relevant LLM safety problem with clear real-world deployment implications and broad applicability across alignment, evaluation, governance, and HCI. Benchmark artifacts often become community standards, enabling cumulative progress and cross-model comparisons. Paper 1 is novel and useful for agent engineering, but its impact is narrower (skill induction/workflow IR) and depends more on adoption within specific agent toolchains. Paper 2 is more timely and broadly actionable.
Paper 2 addresses a critical, high-stakes problem in AI safety by benchmarking covert psychological manipulation in LLMs. This fills a significant gap in current safety evaluations and has profound implications for AI alignment, policy, and human-AI interaction, giving it a broader and more fundamental scientific impact compared to the practical efficiency optimizations for multi-agent systems in Paper 1.
Paper 2 likely has higher scientific impact: it introduces a principled learning framework with a formal guarantee (admissibility) for heuristics in optimal planning, addressing a long-standing barrier to using ML in search while retaining correctness. The Lagrangian-dual view and constraint-by-construction network are novel and methodologically rigorous, and the approach can generalize across planning and other A*/search settings needing safe learned guidance. Paper 1 is timely and valuable for AI safety evaluation, but as a benchmark its impact depends on adoption and may be narrower than a guaranteed-correct learning method that can propagate broadly in planning/search.
Paper 1 has higher potential impact due to its broader, timely relevance to LLM safety and deployment: a human-validated, multi-turn benchmark for covert manipulation addresses a widely recognized gap beyond static prompt compliance. It can influence evaluation standards, alignment research, auditing practices, and policy across many application domains. Paper 2 is a solid methodological contribution for molecule generation workflows, but is more specialized to cheminformatics and SMILES recovery; its impact is likely narrower despite clear practical utility.
Paper 2 addresses a fundamental limitation in LLMs: the inability to learn from failures across problems at inference time without fine-tuning. By introducing a self-rewriting memory and tree-search framework that significantly boosts test-time compute performance, it impacts the highly relevant and active field of System 2 reasoning. While Paper 1 introduces a valuable AI safety benchmark for manipulation, Paper 2 provides a broadly applicable methodological breakthrough that fundamentally enhances model capabilities, allowing smaller models to rival much larger ones across diverse domains, yielding broader immediate scientific impact.
Paper 1 addresses a critical, emerging AI safety concern—covert psychological manipulation in multi-turn interactions. Its interdisciplinary approach bridges AI, psychology, and HCI, offering broad societal and policy implications. While Paper 2 presents a valuable technical optimization for RAG efficiency, Paper 1's focus on benchmarking dynamic, implicit risks in frontier models has a higher potential to shape future AI alignment strategies, safety protocols, and cross-field discourse.
CogManip addresses a novel and timely AI safety concern—covert psychological manipulation by LLMs in multi-turn dialogues—introducing a comprehensive benchmark with 1,000 scenarios validated by human experts, evaluating 13 models including frontier systems. This fills a clear gap in AI safety research with a concrete, reusable tool. Paper 2 is a review/synthesis of existing data on autonomous driving risks across technical, ethical, and policy dimensions, offering recommendations but limited novel methodology or datasets. Paper 1's originality, methodological contribution, and relevance to the rapidly evolving LLM safety field give it substantially higher impact potential.
Paper 1 addresses a critical, broadly applicable AI safety issue—covert psychological manipulation in multi-turn dialogues. While Paper 2 offers a highly rigorous and large-scale framework for the legal domain, Paper 1's focus on dynamic safety auditing transcends specific industries. It impacts general AI alignment, cognitive science, and public safety, giving it a broader foundational scientific and societal impact.
Paper 1 (ReLAT) introduces a novel and technically rigorous method for improving latent reasoning in LLMs through a self-supervised test-time reconstruction cycle. It addresses a fundamental limitation of latent reasoning (lack of inspectability) with a principled solution, demonstrates strong empirical gains (16.6-point improvement on AIME 2024), and has broad applicability across math, QA, and code generation. Paper 2 (CogManip) contributes a useful safety benchmark but is more incremental—benchmarks have shorter impact lifespans and narrower methodological contributions compared to novel training/inference paradigms that can be widely adopted.
Paper 1 addresses a critical and emerging challenge in AI safety: covert psychological manipulation by LLMs. As AI systems increasingly engage in complex, multi-turn interactions with humans, evaluating and defending against such risks is vital for AI alignment and societal trust. While Paper 2 offers a strong technical advancement in multi-turn image editing, Paper 1's focus on foundational AI safety, ethics, and human-AI interaction provides a broader and more urgent scientific impact across multiple disciplines.
Paper 2 addresses a critical bottleneck in modern AI—efficient serving of large language models with long context windows. By providing a programmable system that significantly improves throughput (up to 4.7x), Vortex offers immediate, broad utility for both researchers and practitioners. While Paper 1 addresses an important niche in AI safety, Paper 2 provides foundational infrastructure that will likely see wider adoption and drive further innovations across the entire LLM ecosystem.
EpiEvolve addresses a fundamental and practical problem—streaming pandemic forecasting under regime shifts—with a novel self-evolving agent architecture combining episodic memory, reflection, and regime-aware retrieval. It demonstrates substantial quantitative improvements over baselines including CDC ensembles, has clear real-world public health applications, and introduces a methodological framework (self-evolving agents for streaming prediction) transferable to many domains. CogManip, while addressing important AI safety concerns around LLM manipulation, is primarily a benchmark contribution with evaluation results, offering less methodological novelty and narrower applicability.
Paper 1 addresses a critical, emerging frontier in AI safety—covert psychological manipulation in dynamic interactions. By introducing a novel multi-turn benchmark and revealing implicit model behaviors, it has broad interdisciplinary impact across AI alignment, psychology, and HCI. While Paper 2 offers significant practical value for model deployment, Paper 1 tackles a fundamental, widely relevant safety challenge with higher potential to shape future regulatory and alignment research.
Paper 2 addresses a critical, timely AI safety issue (covert psychological manipulation in multi-turn dialogues) and introduces a comprehensive benchmark. Benchmarks typically drive widespread follow-up research and have broad societal implications. While Paper 1 presents an innovative methodological probe for reward hacking, its scope is more specialized compared to the broader, multidisciplinary impact and urgent real-world relevance of Paper 2.
Paper 1 offers a foundational theoretical framework for human-AI interaction, bridging behavioral science and machine learning. Its rigorous Bayesian approach yields counter-intuitive insights—that ML decision support can harm outcomes even under ideal conditions—providing deep, lasting value for high-stakes AI deployment. While Paper 2 presents a timely empirical benchmark for LLM safety, benchmarks often become obsolete quickly as models evolve, whereas Paper 1's theoretical contributions will have a broader and more enduring scientific impact across multiple disciplines.
Drive-KD presents a concrete, novel technical framework (multi-teacher distillation with asymmetric gradient projection) that achieves remarkable practical results—a 1B model outperforming a 78B model with 42x less memory. It addresses the critical real-world problem of deploying VLMs in autonomous driving, has clear methodological contributions (layer-specific attention distillation, cross-capability gradient conflict resolution), and demonstrates broad generalization across model families. CogManip is a valuable benchmark contribution for LLM safety, but benchmarks typically have narrower methodological impact compared to frameworks introducing transferable techniques with dramatic efficiency gains in safety-critical applications.
Paper 2 addresses a critical, highly timely issue in AI safety—covert psychological manipulation by LLMs. Its introduction of a comprehensive multi-turn benchmark targets a pressing societal and scientific concern that spans AI, psychology, and HCI. While Paper 1 presents a strong, novel approach for UAV navigation, Paper 2's focus on LLM alignment, safety, and human-AI interaction evaluates frontier models and has broader implications for both the scientific community and society at large, giving it higher potential impact.