Can LLMs Introspect? A Reality Check

Shashwat Singh, Tal Linzen, Shauli Ravfogel

May 25, 2026

arXiv:2605.26242v1

cs.AI (primary)

#690 of 2682 · Artificial Intelligence

Tournament Score: 1461 ± 43 (scale 1050 to 1800)

Win Rate: 65%

Rating: 7.4 / 10

Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Abstract

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Can LLMs Introspect? A Reality Check"

1. Core Contribution

This paper provides a systematic critique of two prominent paradigms that have been used to argue LLMs possess metacognitive monitoring capabilities. The authors make two complementary arguments:

Empirical: In the "biofeedback" paradigm (Ji-An et al., 2025; Steinmetz Yalon et al., 2026), models appear to predict labels derived from their own hidden states, but the authors demonstrate that simple classifiers operating only on input embeddings (layer-0 representations) achieve equivalent performance. A relabeling control—where semantic correlates between input and probe labels are broken—reduces model performance to near chance. In the steering-awareness paradigm (Lindsey, 2025), the authors introduce a three-way classification design (adding "gaslight" input-level interventions) and show models cannot distinguish activation-level from input-level perturbations, suggesting detection reflects general anomaly sensitivity rather than introspection.

Principled: The authors argue that even privileged access to hidden states is insufficient for introspection. Drawing on cognitive science, they contend that introspection requires evidence of a dissociable *second-order* process, which no purely behavioral paradigm can establish.

2. Methodological Rigor

The experimental controls are well-designed and appropriately targeted:

The random relabeling control for the biofeedback paradigm is methodologically clean: it preserves a valid linear direction in representation space while breaking semantic correlations, directly testing whether models need introspective access or can rely on input semantics. The information-theoretic framing (showing that I(t; y(t)) is high, which violates the privileged access condition) is sound.

Layer-0 probing as a baseline is a simple but powerful diagnostic. Showing that uncontextualized embeddings predict PCA-derived labels as well as the model's own ICL predictions is a compelling deflation of the original claims.

The three-way steering design is a clever extension that disambiguates general anomaly detection from introspection. The addition of input-level interventions ("gaslight" prompts) that are matched in effect but differ in mechanism provides the needed separation.
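The logic of the input-only baseline and the relabeling control can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' setup: the "hidden state" is synthetic (input features concatenated with an input-independent component), the probe is a simple centroid-difference classifier, and all names (`run_toy`, `fit_centroid_probe`, `semantic_dir`, `control_dir`) are hypothetical. In a real LLM the hidden state is a function of the input, so the input-independent component only stands in for structure that an input-only classifier cannot recover.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_centroid_probe(xs, ys):
    """Linear probe: direction between class centroids, threshold at their midpoint."""
    mu1 = mean_vec([x for x, y in zip(xs, ys) if y == 1])
    mu0 = mean_vec([x for x, y in zip(xs, ys) if y == 0])
    w = [a - b for a, b in zip(mu1, mu0)]
    bias = dot(w, [(a + b) / 2 for a, b in zip(mu1, mu0)])
    return lambda x: 1 if dot(w, x) > bias else 0

def accuracy(predict, xs, ys):
    return sum(predict(x) == y for x, y in zip(xs, ys)) / len(ys)

def run_toy(n_train=400, n_test=400, dim=8, seed=0):
    rng = random.Random(seed)
    n = n_train + n_test
    # Assumption: stand-in for layer-0 (input-only) features.
    inputs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    # Assumption: toy hidden state = input features + input-independent part.
    extra = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    hidden = [x + e for x, e in zip(inputs, extra)]

    # Both are valid linear directions in hidden-state space.
    semantic_dir = [rng.gauss(0, 1) for _ in range(dim)] + [0.0] * dim
    control_dir = [0.0] * dim + [rng.gauss(0, 1) for _ in range(dim)]

    results = {}
    for name, direction in [("semantic", semantic_dir), ("relabeled", control_dir)]:
        # Labels are always derived from the hidden state ...
        labels = [1 if dot(h, direction) > 0 else 0 for h in hidden]
        # ... but the probe only ever sees the input features.
        probe = fit_centroid_probe(inputs[:n_train], labels[:n_train])
        results[name] = accuracy(probe, inputs[n_train:], labels[n_train:])
    return results

results = run_toy()
print(results)
```

In this sketch, high input-only accuracy on the "semantic" labels shows that predicting hidden-state-derived labels need not require access to the hidden state, while near-chance accuracy on the "relabeled" labels marks the condition under which success would be more informative.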

However, there are notable methodological limitations. The authors cannot replicate Lindsey (2025) directly because Claude is inaccessible, so results are on open-weight models (Llama, Qwen, Gemma). While they reproduce Lindsey's two-way results on some of these models, the generalizability gap is real. Additionally, the steering experiments show high inter-concept variability and sensitivity to prompt wording, which somewhat weakens the conclusions. The sample sizes for vector-steering experiments (500 total across concepts) are modest, and the authors acknowledge compute constraints.

The Belief Dominance experiments (Section 5.2) are convincing in their simplicity: probes on concatenated entity embeddings match or exceed ICL performance, strongly suggesting the labels are input-predictable. The use of 15 train-test splits with reported standard deviations adds reliability.
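Reporting a mean and standard deviation over repeated random splits can be sketched as follows. This is an illustrative example only: the one-feature threshold classifier, the synthetic data, and the 80/20 split fraction are assumptions made here, and only the count of 15 splits comes from the assessment above.

```python
import random
import statistics

def repeated_split_scores(xs, ys, fit, n_splits=15, test_frac=0.2, seed=0):
    """Accuracy of `fit` over several random train/test splits."""
    rng = random.Random(seed)
    idx = list(range(len(xs)))
    n_test = max(1, int(len(xs) * test_frac))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        correct = sum(predict(xs[i]) == ys[i] for i in test)
        scores.append(correct / n_test)
    return scores

def fit_threshold(train_x, train_y):
    """Hypothetical one-feature classifier: threshold halfway between class means."""
    pos = [x for x, y in zip(train_x, train_y) if y == 1]
    neg = [x for x, y in zip(train_x, train_y) if y == 0]
    cut = (statistics.mean(pos) + statistics.mean(neg)) / 2
    return lambda x: 1 if x > cut else 0

# Synthetic, balanced data: class 1 centered at +1, class 0 at -1.
data_rng = random.Random(1)
ys = [i % 2 for i in range(200)]
xs = [data_rng.gauss(1.0 if y else -1.0, 1.0) for y in ys]

scores = repeated_split_scores(xs, ys, fit_threshold)
mean_acc = statistics.mean(scores)
std_acc = statistics.stdev(scores)
print(f"accuracy: {mean_acc:.3f} +/- {std_acc:.3f} over {len(scores)} splits")
```

The spread across splits gives a rough sense of how much of a gap between two methods (for example, a probe and in-context prediction) could be due to the particular split rather than a real difference.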

3. Potential Impact

This paper has significant potential to shape the discourse around LLM metacognition and AI consciousness/self-awareness claims. Its contributions operate at multiple levels:

Immediate methodological impact: It provides a template for control experiments that should become standard practice in metacognition research on LLMs. The relabeling control and input-only baselines are straightforward to implement and should be adopted widely.

Conceptual clarification: The distinction between privileged access (necessary) and second-order processing (required for strong introspection) is an important conceptual contribution that raises the evidentiary bar for future claims. The connection to the human metacognition literature, particularly the cautionary lessons from Nisbett & Wilson (1977) and Koriat (1997), provides valuable grounding.

AI safety relevance: Claims about LLM introspection have implications for AI alignment and safety (e.g., whether models can reliably report their own biases or deceptive tendencies). Deflating premature claims is practically important.

4. Timeliness & Relevance

This paper is exceptionally timely. Lindsey (2025) generated significant attention, and the broader question of LLM self-awareness is increasingly prominent in both research and public discourse. The paper directly addresses a current trend of anthropomorphizing LLM capabilities, providing needed skepticism grounded in principled analysis. The rapid proliferation of claims about LLM metacognition (several papers from 2025-2026 are cited) makes this critical examination particularly valuable.

5. Strengths & Limitations

Key Strengths:

Clean, well-motivated experimental controls that directly test the confound of input-driven pattern matching

Strong conceptual grounding in cognitive science literature on metacognition

The principled argument about the insufficiency of behavioral evidence for second-order processing claims is important and well-articulated

The paper is constructive, identifying what *would* be needed (mechanistic evidence of dissociable second-order computation) rather than merely debunking

Notable Limitations:

The paper cannot directly test Claude (the model in Lindsey, 2025), limiting the strength of the replication critique

The principled argument, while important, is somewhat abstract—the paper does not provide a concrete operationalization of what mechanistic evidence for introspection would look like (beyond pointing to Macar et al., 2026)

The gaslight prompts are a specific operationalization of input-level intervention; it's unclear whether the models' failure reflects a fundamental limitation versus the particular implementation

Results are sensitive to prompt wording, which introduces some ambiguity about robustness

The paper focuses on pretrained models, explicitly excluding finetuned models, which limits scope

Overall Assessment: This is a valuable contribution that provides much-needed critical scrutiny of claims about LLM introspection. The experimental controls are simple yet effective, the conceptual framework is well-grounded, and the timing is excellent. The paper does not definitively rule out LLM introspection but convincingly demonstrates that current evidence is insufficient to support it—a measured and scientifically appropriate conclusion. It is likely to influence how future metacognition studies in LLMs are designed and evaluated.

Rating: 7.4 / 10

Significance 7.5 · Rigor 7 · Novelty 6.5 · Clarity 8

Generated May 27, 2026

Comparison History (20)

vs. Localizing Input Uncertainty Quantification for Large Language Models via Shapley Values

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental, highly debated question regarding LLM capabilities (introspection and metacognition). By rigorously debunking flawed evaluation paradigms and introducing better controls, it prevents the field from pursuing scientifically unsound directions. This foundational 'reality check' is likely to broadly influence how researchers evaluate and interpret LLM internal states, generating broader theoretical impact than Paper 1's specific (though practically useful) methodological contribution to uncertainty quantification.

vs. Refusal Before Decoding: Detecting and Exploiting Refusal Signals in Intermediate LLM Activations

gemini-3.1 · 5/28/2026

Paper 1 offers a highly actionable, novel methodology (Mechanistic AutoDAN) that significantly improves the efficiency of LLM red-teaming and provides concrete insights into mechanistic interpretability. While Paper 2 provides an important critical analysis of LLM metacognition, Paper 1 introduces a practical tool and optimization technique that will likely see broader immediate adoption and follow-up work in the rapidly growing field of AI safety evaluation.

vs. Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact: it introduces a new, reproducible benchmark targeting an under-measured but practically critical variable (agent harness configuration) with substantial empirical scale (106 tasks, 5,194 trajectories) and rich logged artifacts/traces enabling broad downstream research (reliability, tool use, eval methodology, systems). Its applications are immediate for deployed agent stacks and could standardize reporting at the model–harness level. Paper 2 is timely and conceptually important as a critique with stronger controls, but its scope is narrower (specific introspection paradigms) and is less directly enabling for engineering and cross-field benchmarking.

vs. LiveBrowseComp: Are Search Agents Searching, or Just Verifying What They Already Know?

gemini-3.1 · 5/28/2026

While both papers critically evaluate current LLM capabilities, Paper 2 introduces a novel benchmark (LiveBrowseComp) that addresses a critical flaw in how search agents are evaluated. Given the immense current focus on RAG and autonomous web agents, a rigorous dataset to separate intrinsic knowledge from actual search capabilities offers immediate, highly practical utility for researchers. This benchmark is likely to see broader adoption and citation than the more theoretical cognitive arguments presented in Paper 1.

vs. Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

gpt-5.2 · 5/28/2026

Paper 2 proposes a novel, actionable framework (cognitive scheduling of visual evidence acquisition) that changes the multimodal reasoning pipeline and shows consistent zero-shot gains across benchmarks, suggesting clearer methodological contribution and nearer-term applicability to real systems needing faithful visual grounding. Its modular “invoke perception on demand” idea is broadly relevant across VLMs, agents, and interactive perception. Paper 1 is a valuable critique with careful controls that improves evaluation rigor for LLM metacognition, but it is primarily corrective/diagnostic and may have narrower immediate downstream application than a new performant architecture-level approach.

vs. Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

gemini-3.1 · 5/27/2026

Paper 1 tackles fundamental theoretical questions about LLM capabilities (introspection and metacognition) and rigorously challenges flawed assumptions in existing evaluation paradigms. By redirecting future research and preventing the field from building on invalid behavioral evidence, it promises a broader and more profound scientific impact across AI, cognitive science, and alignment compared to the valuable but more domain-specific benchmarking methodology presented in Paper 2.

vs. Emotional intelligence in large language models is fragmented across perception, cognition, and interaction

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a more fundamental question about LLM capabilities (introspection/metacognition) with rigorous methodological critiques of existing work, introducing clever control conditions that expose confounds. Its findings challenge premature conclusions in the field and have broad implications for AI interpretability and alignment research. Paper 1, while introducing a useful benchmark (FACET) for emotional intelligence evaluation, is more incremental—primarily a benchmarking study. Paper 2's methodological contributions (distinguishing genuine introspection from pattern matching, demonstrating insufficiency of behavioral evidence) provide deeper, more generalizable insights for the field.

vs. Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

gemini-3.1 · 5/27/2026

Paper 1 challenges fundamental assumptions about LLM metacognition, offering a critical re-evaluation that impacts AI safety, alignment, and cognitive science. By highlighting flaws in current evaluation paradigms, it has the potential to redirect future theoretical and empirical research across the broader AI community. While Paper 2 offers significant practical advancements in medical AI, Paper 1 addresses a more foundational scientific question with broader interdisciplinary implications.

vs. ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental question about LLM introspection with rigorous methodology, introducing controlled experimental paradigms that challenge premature claims in a rapidly growing field. It bridges cognitive science and AI, offering broad implications for how we evaluate LLM capabilities. Its critical examination of existing evaluation methods will likely influence future research standards. Paper 2, while practically useful, is primarily an engineering contribution (a copilot tool) with narrower methodological novelty and less potential to shift research directions across fields.

vs. Clustering as Reasoning: A $k$-Means Interpretation of Chain-of-Thought Graph Learning

gemini-3.1 · 5/27/2026

Paper 1 addresses a foundational and highly debated question regarding LLM introspection and metacognition. By rigorously challenging existing claims and introducing better-controlled evaluation paradigms, it provides critical insights into AI interpretability, safety, and our fundamental understanding of LLMs. Its findings have broad implications across the entire AI community. Paper 2, while theoretically novel in linking Transformers to k-means, focuses on a narrower niche (CoT on text-attributed graphs), giving it a more specialized and limited potential impact compared to Paper 1.

vs. Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

gpt-5.2 · 5/27/2026

Paper 1 proposes a novel data-management abstraction (GEM) for long-term agent memory, with formal operators and correctness conditions plus a prototype (MemState), offering clear methodological structure and immediate systems implications for a timely, fast-growing agent ecosystem. Its impact could span databases, AI systems, and safety/auditing by reframing memory as a state-trajectory workload. Paper 2 is a valuable corrective that tightens evaluation rigor for LLM “introspection,” but is primarily a critique of existing paradigms with narrower direct application. Overall, Paper 1 has higher potential for foundational and cross-field impact.

vs. Experiments in Agentic AI for Science

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental question about LLM metacognition with rigorous experimental methodology, introducing important controls that challenge prior claims. It provides a conceptual framework (distinguishing introspection from pattern matching) with broad implications for AI interpretability and alignment research. Paper 2 describes practical engineering systems for scientific workflows but is more incremental, describing domain-specific tools without deep methodological novelty. Paper 1's findings will likely influence a wider research community working on LLM understanding, evaluation, and safety.

vs. From Model Scaling to System Scaling: Scaling the Harness in Agentic AI

gpt-5.2 · 5/27/2026

Paper 2 has higher likely scientific impact due to its methodological rigor and timeliness: it directly challenges prominent claims about LLM introspection with stronger controls, alternative explanations, and negative/clarifying results that can reshape evaluation standards. Its conclusions affect interpretability, safety, benchmarking, and cognitive-science-inspired ML, giving broad cross-field relevance. Paper 1 is a valuable systems agenda and tooling contribution, but it is more conceptual/engineering-oriented and its impact depends on adoption of a specific harness and benchmarks, whereas Paper 2 provides sharper falsification and widely applicable evaluation guidance.

vs. DarkForest: Less Talk, Higher Accuracy for Multi-Agent LLMs

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact: it challenges a prominent and timely claim (LLM introspection/metacognition) with tighter controls, reframing evaluation methodology and setting stronger evidentiary standards. This kind of corrective, conceptual-plus-empirical critique can influence many subsequent papers across interpretability, alignment, cognitive science, and evaluation. Paper 1 is practically valuable and novel for multi-agent coordination efficiency, but its impact is more engineering-focused within a narrower subarea and may be superseded by rapid system-level iteration.

vs. Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental and broadly relevant question about LLM introspection/metacognition that impacts the entire AI/ML community. It provides rigorous methodological critique of existing paradigms, introduces better-controlled experimental settings, and draws important connections to human metacognition research. Its findings—that current evidence for LLM introspection is insufficient—have significant implications for interpretability, alignment, and our understanding of LLM capabilities. Paper 2, while practically useful for supply chain applications, addresses a narrower domain-specific problem with an engineering-focused contribution that is less likely to influence broader scientific discourse.

vs. Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental question about LLM introspection/metacognition that is highly timely given the explosive growth of LLM research. It provides important methodological corrections to prior claims, introduces better-controlled experimental paradigms, and draws meaningful connections to human metacognition research. Its findings have broad implications across AI safety, interpretability, and philosophy of mind. Paper 1, while solid, addresses a more niche problem in multi-agent RL with incremental methodological contributions. Paper 2's potential to reshape how the community evaluates LLM self-knowledge gives it broader and more lasting impact.

vs. It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

gpt-5.2 · 5/27/2026

Paper 2 likely has higher impact due to its direct applicability to real-world LLM agent deployment: it identifies non-monotone harness effects, provides controlled multi-model/multi-harness evidence, a failure taxonomy, and actionable harness-selection guidance. This is timely for agent reliability and evaluation, and its findings can influence tooling, benchmarks, and operational practices across software engineering and AI safety. Paper 1 is methodologically important and conceptually clarifying, but primarily constrains claims about LLM introspection; its immediate applied leverage and cross-industry uptake are likely smaller.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

gemini-3.1 · 5/27/2026

Paper 2 identifies a fundamental flaw in how multi-hop reasoning is currently evaluated and introduces a novel, robust methodology (the double-gate protocol) to isolate true compositionality from atomic knowledge. Because post-training evaluation and reasoning capabilities are central to current LLM development, this framework is likely to see broad adoption and directly influence model optimization. While Paper 1 provides a rigorous and necessary critique of LLM metacognition claims, its impact is more narrowly focused on the interpretability and cognitive science discourse rather than core model training and evaluation pipelines.

vs. UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

claude-opus-4.6 · 5/27/2026

Paper 1 addresses a fundamental question about LLM introspection/metacognition with rigorous methodology, challenging premature claims in the field. It introduces important conceptual distinctions (genuine introspection vs. pattern matching) and controlled experimental paradigms that will influence how future research evaluates LLM self-knowledge. Its broad epistemological implications affect AI safety, interpretability, and cognitive science. Paper 2 presents a useful engineering framework for multi-agent RL optimization, but is more incremental—extending existing infrastructure (verl) with multi-agent support. While practically valuable, it has narrower conceptual impact compared to Paper 1's foundational critique.

vs. PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a fundamental question about LLM capabilities (introspection/metacognition) that has broad implications across AI safety, interpretability, and cognitive science. Its methodological contribution—introducing controlled baselines to distinguish genuine introspection from pattern matching—provides a rigorous framework applicable to many future claims about LLM self-knowledge. While Paper 1 is a strong applied contribution to polymer science, Paper 2's findings are more likely to influence a wider research community, reshape ongoing debates about LLM understanding, and establish important methodological standards for evaluating emergent AI capabilities.