What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct
Meryl Ye, Lujain Ibrahim, Jessica Y. Bo, Myra Cheng, Ida Mattsson, Daniel Vennemeyer, Robert Kraut, Steve Rathje
Abstract
AI sycophancy has become a prominent concern in large language model (LLM) research. Yet the term lacks a consistent definition and has been applied to behaviors ranging from agreeing with a user's false claim to excessively praising the user to withholding corrective feedback. When researchers, companies, and policymakers use the same term to describe different behaviors, evaluation results become difficult to compare, mitigation strategies fail to transfer, and systems that are resistant to one form of sycophancy continue exhibiting other forms. To address this, we make two contributions. First, we reviewed 70 papers on AI sycophancy to develop a taxonomy of how the behavior has been defined and measured. The taxonomy distinguishes (1) whether a model is sycophantic toward a user's positions and beliefs, or toward the user's broader personal traits and emotions, and (2) whether this occurs through explicit, direct language or more implicit, subtle behaviors such as framing, omission, or tone. Mapping existing literature to our taxonomy reveals that current research has focused on overt forms of sycophancy toward users' beliefs, leaving more subtle and person-directed behaviors relatively understudied. Second, we surveyed 106 experts in AI sycophancy and related fields to examine whether researchers agree on which model behaviors are sycophantic. While experts are nearly unanimous in believing that sycophancy is a significant problem in current AI systems (94.3% agree), they disagree substantially on which specific behaviors qualify. Together, these findings demonstrate that AI sycophancy is a broad family of behaviors with different measurement challenges, intervention requirements, and governance implications. Our taxonomy provides a shared vocabulary for understanding and addressing these behaviors.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper tackles the conceptual fragmentation surrounding "AI sycophancy" — a term used inconsistently across research, industry, and policy. The authors make two contributions: (1) a taxonomy of sycophantic behaviors organized along two dimensions (Referent: Position vs. Person; Explicitness: Explicit vs. Implicit), derived from a review of 70 papers, and (2) an expert survey (N=106) demonstrating that while researchers nearly universally agree sycophancy is a significant problem (94.3%), they disagree substantially on which specific behaviors qualify. The key empirical finding is a significant Referent×Explicitness interaction: Position-directed sycophancy is recognized regardless of how it's expressed, while Person-directed sycophancy is only recognized when explicit (e.g., flattery), with implicit forms (e.g., deferential tone, softened feedback) receiving near-neutral ratings.
This is fundamentally a construct clarification paper — a "meta-scientific" contribution that aims to organize and discipline a rapidly expanding research area. Such papers can be highly impactful when they arrive at the right moment and provide vocabulary that subsequent work adopts.
Methodological Rigor
The literature review covers 70 papers with a clear inclusion criterion and systematic annotation procedure. Inter-rater reliability for taxonomy assignment was substantial (88.3% agreement), with disagreements adjudicated via an LLM third coder — a pragmatic if somewhat novel approach. The annotation of survey items by four authors across seven dimensions showed acceptable to good ICC values (.47–.86).
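To make the agreement figure concrete, the sketch below computes raw percent agreement between two coders, plus Cohen's kappa as a chance-corrected companion. Kappa is added here for illustration only; the review reports percent agreement and ICC, not kappa. The labels and cell abbreviations are hypothetical and are not the paper's annotations.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which two coders assigned the same label."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Observed agreement corrected for agreement expected by chance."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    # Chance agreement: product of each coder's marginal label proportions.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical taxonomy-cell labels for four papers (illustrative only).
coder_a = ["position_explicit", "position_explicit", "person_implicit", "person_explicit"]
coder_b = ["position_explicit", "person_implicit", "person_implicit", "person_explicit"]

print(percent_agreement(coder_a, coder_b))       # 0.75
print(round(cohens_kappa(coder_a, coder_b), 3))  # 0.636
```

Raw agreement can look high when one cell dominates the labels, which is why a chance-corrected statistic is often reported alongside it.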
The expert survey design is competent: 24 items on a bipolar 7-point scale, randomized blocks, multilevel regression with crossed random effects for respondents and items. The statistical analysis is thorough, with appropriate model comparisons via likelihood ratio tests and extensive robustness checks (dichotomized DV, collapsed negative pole, moderator analyses by demographics and research area).
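The model family described above can be written out as a rough sketch. The notation is an assumption made for illustration, since the review does not give the exact specification: $y_{ij}$ is respondent $i$'s rating of item $j$, $R_j$ and $E_j$ are indicator codes for the item's Referent and Explicitness, and $u_i$ and $v_j$ are the crossed random intercepts for respondents and items.

```latex
\begin{aligned}
y_{ij} &= \beta_0 + \beta_1 R_j + \beta_2 E_j + \beta_3 \,(R_j \times E_j) + u_i + v_j + \varepsilon_{ij}, \\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
v_j \sim \mathcal{N}(0, \sigma_v^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{aligned}
```

Under this reading, a likelihood ratio test of the interaction would compare the model above against the nested model with $\beta_3 = 0$.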
Some methodological limitations deserve note. The bipolar scale created interpretive ambiguity: five participants reported post-survey confusion about the negative pole. The robustness analysis showed that the primary interaction becomes non-significant when negative ratings are collapsed to zero (p=.108), suggesting the effect partially depends on how non-sycophantic exemplars are scored. The CFA models showed uniformly poor fit (CFI<.65, RMSEA>.10), indicating that the taxonomy dimensions don't correspond to stable latent factors in expert judgments. The authors acknowledge this, but it somewhat undermines claims that the taxonomy reflects perceptual structure in expert ratings.
The expert sample, while reasonable in size (N=106), is heavily skewed toward US-based academics (67.9% US, 84% academia), limiting the generalizability of "expert consensus" claims. That 47 of the 106 respondents authored reviewed papers creates potential circularity: they may rate behaviors in line with their own published operationalizations.
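The two recodings behind the robustness checks can be illustrated with a small sketch. It assumes the bipolar 7-point scale is coded from -3 to +3 with 0 as the neutral midpoint, and that the dichotomized outcome splits at "any positive rating". Neither the coding nor the threshold is stated in the review, so both are assumptions.

```python
def check_rating(rating):
    """Validate a rating on the assumed -3..+3 bipolar coding."""
    if rating not in range(-3, 4):
        raise ValueError(f"rating {rating!r} is outside the assumed -3..+3 scale")
    return rating


def collapse_negative_pole(rating):
    """Treat every rating on the negative pole as 0."""
    return max(check_rating(rating), 0)


def dichotomize(rating):
    """1 if the rating is on the positive pole, else 0 (assumed cut-point)."""
    return 1 if check_rating(rating) > 0 else 0


# Hypothetical ratings from one respondent (illustrative only).
ratings = [-3, -1, 0, 2, 3]
collapsed = [collapse_negative_pole(r) for r in ratings]
binary = [dichotomize(r) for r in ratings]

print(collapsed)  # [0, 0, 0, 2, 3]
print(binary)     # [0, 0, 0, 1, 1]
```

Collapsing removes the spread among negative ratings, so any part of an effect that comes from that spread disappears after the recoding.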
Potential Impact
The paper's impact potential lies in several domains:
Research standardization: The taxonomy provides a shared vocabulary that could discipline future benchmark development. The demonstration that SycEval and ELEPHANT produce inverted model rankings because they measure different taxonomy cells is a compelling illustration of why this matters. If adopted, this framework could reduce the frequency of apples-to-oranges comparisons in model evaluation.
Industry governance: The analysis of OpenAI and Anthropic model specifications through the taxonomy lens reveals specific gaps (particularly implicit behaviors). This provides actionable vocabulary for companies drafting safety policies. The mapping of evaluation paradigms to taxonomy cells (Table 6) is a practical resource.
Legislative clarity: The analysis of California SB 1119 and similar legislation, which define sycophancy by downstream consequences without specifying behaviors, highlights a gap that the taxonomy could help fill.
Mitigation strategy design: The finding from Vennemeyer et al. (2025) that sycophantic agreement and praise are independently steerable aligns with the taxonomy's cell distinctions and suggests mitigation must be cell-specific.
Timeliness & Relevance
This paper is exceptionally well-timed. AI sycophancy has become a prominent concern in both academic and public discourse (documented by the Google Trends data in Figure 1), with companies making public commitments to reduce it and legislators writing it into bills. The field is at precisely the stage where construct clarification can have maximal impact — early enough that vocabulary isn't yet ossified, but late enough that fragmentation is creating real problems for comparison and accountability. The reference to the April 2025 GPT-4o sycophancy incident and GPT-5 evaluation discrepancies demonstrates immediate practical relevance.
Strengths
1. Comprehensive literature mapping: The 70-paper review with per-paper taxonomy cell assignments (Table 5) is itself a valuable reference resource, alongside the evaluation paradigm mapping (Table 6).
2. Empirical grounding of disagreement: Rather than simply proposing a taxonomy, the authors demonstrate empirically that experts disagree along its dimensions, lending the framework evidential weight.
3. Cross-domain analysis: The paper connects academic, corporate, and legislative uses of the term, demonstrating fragmentation's practical consequences across all three.
4. Actionable structure: The taxonomy is simple enough (2×2 with sub-referents) to be readily adopted while capturing meaningful distinctions.
5. Strong supplementary materials: The appendices are unusually thorough, including the full survey instrument, all 70 paper annotations, and extensive robustness analyses.
Limitations
1. Taxonomy validation gap: The poor CFA fit and low within-quadrant reliability suggest the taxonomy captures researcher intuitions imperfectly. The dimensions may organize the literature better than they organize expert judgments.
2. Sample representativeness: The US-academic skew and the overlap with reviewed-paper authors limit confidence in generalizability.
3. Descriptive rather than prescriptive: The paper deliberately avoids defining what sycophancy *should* mean, which limits its power to resolve fragmentation rather than merely describe it.
4. No behavioral data: All findings rest on expert ratings of behavior descriptions rather than responses to actual model outputs, which introduces abstraction that may not map to real evaluation contexts.
5. Implicit behaviors under-theorized: The paper identifies that implicit person-directed sycophancy is under-recognized but doesn't fully explain why — is this because these behaviors are genuinely less harmful, harder to detect, or simply less familiar to the research community?
Overall Assessment
This is a well-executed construct clarification paper arriving at an optimal moment for a rapidly growing but conceptually disorganized research area. Its primary value lies in organization rather than empirical novelty: it provides shared vocabulary and demonstrates that the absence of such vocabulary creates real problems. The expert survey adds empirical texture but is methodologically bounded. Impact will depend on whether the community adopts the taxonomy's vocabulary in subsequent benchmarking and policy documents.
Generated May 22, 2026
Comparison History (18)
Paper 2 addresses a fundamental conceptual gap in LLM research by providing a taxonomy and shared vocabulary for AI sycophancy—a cross-cutting concern affecting alignment, safety, and governance. Its breadth of impact spans multiple fields (AI safety, policy, HCI), and its contributions (taxonomy from 70 papers, survey of 106 experts) offer a foundational framework that will shape future research directions. Paper 1 makes a valuable but narrower contribution to clinical LLM evaluation. While important, its findings (that interactive settings reduce LLM accuracy) are somewhat expected, whereas Paper 2's conceptual clarification has broader, more lasting influence on the field.
Paper 2 demonstrates a groundbreaking application of LLMs to solve open mathematical problems (including 9 Erdős problems), representing a concrete, verifiable advance in mathematical research. Its impact spans multiple mathematical subfields and establishes a new paradigm for AI-assisted theorem proving. While Paper 1 provides a useful taxonomy and survey of AI sycophancy—an important conceptual contribution—Paper 2's demonstrated ability to solve previously unsolved problems represents a more transformative scientific contribution with immediate, measurable real-world impact on mathematics research.
Paper 2 provides a tangible technological advancement by solving visual-semantic bottlenecks in LLM chemistry applications. It introduces a novel framework (ChemVA) and dataset (OCRD-Bench) that directly accelerate automated chemical research and discovery, bridging AI and hard sciences. While Paper 1 offers valuable conceptual clarity for AI alignment, Paper 2's methodological rigor, quantifiable performance gains, and direct utility for applied scientific innovation give it a higher potential for broad, transformative impact.
Paper 2 addresses AI sycophancy, a widely recognized and timely concern across LLM research, policy, and industry. Its taxonomy based on 70 papers and survey of 106 experts provides a foundational conceptual framework that can shape future research directions, evaluation standards, and governance. It has broad cross-disciplinary relevance (AI safety, alignment, HCI, policy). Paper 1 addresses a narrower, more technical issue (KG construction from CSV tables) with useful but domain-specific contributions. Paper 2's potential to standardize terminology and redirect research efforts gives it significantly broader scientific impact.
Paper 1 offers a novel, technical evaluation metric (Synergistic Faithfulness) grounded in Shapley interactions to fix a concrete failure mode in VLM explainability evaluation, with strong quantitative evidence (high surrogate correlation, large speedup) and broad applicability to auditing multimodal systems in high-stakes settings. Its methodological contribution is directly actionable for benchmarking and improving XAI methods across architectures/datasets, likely influencing subsequent empirical work. Paper 2 is timely and useful for conceptual clarity and governance, but is primarily taxonomic/survey-based with less direct methodological leverage for model development and evaluation.
Paper 2 likely has higher impact due to a concrete, novel algorithmic “hygiene” loop for self-evolving LLM agents, strong quantitative gains on established benchmarks (MBPP+, SWE-bench Verified), multi-seed runs, extensive ablations, and a supporting non-divergence proposition. It is timely for agentic systems and could transfer broadly to tool/skill-learning, continual learning without finetuning, and autonomous software engineering. Paper 1 provides valuable conceptual clarification (taxonomy + expert survey) for evaluation/governance, but is less likely to directly shift performance capabilities or spawn immediate downstream methods.
Paper 2 addresses AI sycophancy in LLMs—a timely, high-visibility topic with broad cross-disciplinary relevance spanning AI safety, policy, and governance. Its taxonomy and expert survey (106 experts, 70 papers reviewed) provide foundational conceptual infrastructure that will likely be widely cited as LLM alignment research accelerates. Paper 1, while technically solid, addresses a more niche intersection of circular manufacturing and CPPS architectures with a narrower audience. The timeliness and breadth of impact of Paper 2, touching research methodology, AI evaluation, and policy, give it higher estimated scientific impact.
Paper 2 addresses a critical conceptual bottleneck in AI alignment by providing a much-needed taxonomy and shared vocabulary for AI sycophancy. Foundational taxonomy papers in rapidly growing fields like AI safety tend to accrue high citations and broadly influence future research, evaluations, and policy. While Paper 1 offers a strong methodological improvement in model training, Paper 2's broader relevance across AI research, governance, and HCI suggests a higher overarching scientific impact.
Paper 2 has higher potential scientific impact due to its broader relevance and cross-field applicability: it clarifies a fragmented, high-salience construct in LLM alignment/safety, provides a literature-grounded taxonomy (70 papers), and adds empirical evidence via an expert survey (n=106). This enables more comparable evaluations, better-targeted mitigations, and clearer policy/governance discussions—benefits that propagate across research, industry, and regulation. Paper 1 is a useful engineering framework with immediate developer utility, but its impact is narrower, more tool-ecosystem-specific, and less likely to generalize as a lasting scientific contribution.
Paper 2 introduces a novel architectural framework (SR²AM) that decomposes agentic reasoning into three systems with concrete empirical results showing competitive performance at dramatically reduced computational cost. It addresses a pressing efficiency problem in LLM reasoning with broad applicability across diverse tasks. While Paper 1 provides a valuable taxonomy and expert survey on AI sycophancy—an important conceptual contribution—it is primarily a definitional/organizational work. Paper 2's methodological innovation, strong empirical results, and generalizable principles for self-regulated reasoning are likely to drive more follow-on research and real-world impact.
Paper 1 has higher likely impact due to timeliness and breadth: it tackles an active, high-stakes LLM alignment problem with direct implications for evaluation, mitigation, and governance. Its taxonomy plus expert survey can standardize terminology and measurement across many downstream studies and products, potentially shaping community norms and policy discussions. Paper 2 proposes an interesting causality+argumentation XAI method, but similar hybrid explainability approaches exist, and the contribution appears narrower with limited validation (two datasets), making its near-term influence more field-specific.
Paper 2 addresses a fundamental conceptual problem in LLM research—the lack of a consistent definition of AI sycophancy—that affects the entire field. By providing a taxonomy validated by 106 experts and systematically reviewing 70 papers, it creates shared vocabulary that can standardize future research, evaluation, and policy. Its breadth of impact spans AI safety, alignment, governance, and evaluation methodology. Paper 1, while impressive in production deployment, is more narrowly focused on operational workflow automation in cloud database services, limiting its cross-field influence.
Paper 2 likely has higher impact: it introduces a large, standardized, multi-turn benchmark (502 tasks, 102 targets) for evaluating LLM agents on real-world small-molecule drug design—an application area with major scientific and commercial relevance. The benchmark + public leaderboard can catalyze measurable progress, enable cross-model comparisons, and influence both ML and cheminformatics workflows. Paper 1 is valuable for conceptual clarification and governance (taxonomy + expert survey), but its contributions are primarily definitional/organizational and less likely to directly drive technical advances or broad downstream adoption than a widely used drug-design benchmark.
Paper 1 introduces a benchmark for LLMs in small molecule drug design, directly bridging AI with a high-impact scientific domain (pharmacology/medicine). Its potential real-world applications in accelerating drug discovery offer profound societal and economic value. While Paper 2 provides valuable insights for AI alignment through its taxonomy of sycophancy, Paper 1's contribution is more likely to drive tangible scientific breakthroughs and tool development across the broader scientific community.
SciCore-Mol addresses a fundamental technical challenge in integrating molecular science with LLMs through a novel modular architecture with concrete performance gains. It has direct applications in drug design, chemical synthesis, and scientific discovery—fields with enormous practical impact. While Paper 2 provides a valuable conceptual contribution (taxonomy and expert survey on sycophancy), it is primarily organizational rather than technically innovative. Paper 1's methodological contributions (topology-aware perception, latent diffusion generation, reaction-aware reasoning modules) are more likely to spawn follow-up research and real-world applications across multiple scientific domains.
Paper 1 introduces a novel, scalable self-play framework for geospatial reasoning, eliminating the need for massive human-annotated datasets. Its use of verifiable rewards and executable programs advances fundamental VLM reasoning capabilities. While Paper 2 provides a valuable taxonomy for AI alignment, Paper 1 offers a technical breakthrough with broad real-world applications in remote sensing and autonomous systems, along with a new benchmark, likely leading to higher broader scientific impact.
Paper 2 addresses a timely and broadly relevant problem in LLM research—AI sycophancy—that spans technical AI, policy, and governance communities. Its taxonomy and expert survey (n=106) provide foundational infrastructure that many subsequent studies will reference and build upon. The breadth of impact across fields (ML, HCI, AI safety, policy) and the current urgency of LLM alignment issues give it wider reach. Paper 1, while methodologically rigorous and useful, addresses a more niche evaluation methodology problem with a narrower audience of uncertainty quantification researchers.
Paper 1 addresses a highly timely and critical issue in AI safety by providing a foundational taxonomy for AI sycophancy. Standardizing terminology and highlighting research gaps in a rapidly growing field typically leads to widespread adoption, high citation counts, and significant impact on future evaluations and policies. Paper 2 offers a rigorous methodological contribution, but its scope is narrower compared to the foundational and broadly applicable insights of Paper 1.