Can LLMs Introspect? A Reality Check
Shashwat Singh, Tal Linzen, Shauli Ravfogel
Abstract
Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.
AI Impact Assessments
Scientific Impact Assessment: "Can LLMs Introspect? A Reality Check"
1. Core Contribution
This paper provides a systematic critique of two prominent paradigms that have been used to argue LLMs possess metacognitive monitoring capabilities. The authors make two complementary arguments:
Empirical: In the "biofeedback" paradigm (Ji-An et al., 2025; Steinmetz Yalon et al., 2026), models appear to predict labels derived from their own hidden states, but the authors demonstrate that simple classifiers operating only on input embeddings (layer-0 representations) achieve equivalent performance. A relabeling control—where semantic correlates between input and probe labels are broken—reduces model performance to near chance. In the steering-awareness paradigm (Lindsey, 2025), the authors introduce a three-way classification design (adding "gaslight" input-level interventions) and show models cannot distinguish activation-level from input-level perturbations, suggesting detection reflects general anomaly sensitivity rather than introspection.
Principled: The authors argue that even privileged access to hidden states is insufficient for introspection. Drawing on cognitive science, they contend that introspection requires evidence of a dissociable *second-order* process, which no purely behavioral paradigm can establish.
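The three-way design described above can be scored in two separate ways: whether a perturbation is noticed at all, and whether a noticed perturbation is attributed to the right source. The Python sketch below is illustrative only and is not taken from the paper. The condition names, function names, and trial counts are all hypothetical, and the counts are invented toy numbers, not reported results.

```python
from collections import Counter

# Hypothetical labels for the three conditions: no intervention,
# activation-level steering, and an input-level ("gaslight") manipulation.
CONDITIONS = ("none", "activation", "input")

# Invented toy counts of (true_condition, model_answer) pairs.
# These are NOT results from the paper; they only exercise the functions.
TOY_COUNTS = Counter({
    ("none", "none"): 45,
    ("none", "activation"): 5,
    ("activation", "activation"): 20,
    ("activation", "input"): 20,
    ("activation", "none"): 10,
    ("input", "activation"): 22,
    ("input", "input"): 18,
    ("input", "none"): 10,
})


def detection_rate(counts):
    """Share of perturbed trials (either kind) flagged as perturbed at all."""
    perturbed = [k for k in counts if k[0] != "none"]
    total = sum(counts[k] for k in perturbed)
    flagged = sum(counts[k] for k in perturbed if k[1] != "none")
    return flagged / total if total else float("nan")


def source_accuracy(counts):
    """Among flagged perturbed trials, share attributed to the correct source."""
    flagged = [k for k in counts if k[0] != "none" and k[1] != "none"]
    total = sum(counts[k] for k in flagged)
    correct = sum(counts[k] for k in flagged if k[0] == k[1])
    return correct / total if total else float("nan")


DETECTION = detection_rate(TOY_COUNTS)
SOURCE = source_accuracy(TOY_COUNTS)
print(f"detection rate:  {DETECTION:.3f}")
print(f"source accuracy: {SOURCE:.3f}")
```

On these invented counts, detection is high (0.8) while source attribution sits near the two-way chance level (0.475). A two-way design only measures the first score, which is why it cannot separate introspective access from general anomaly sensitivity.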
2. Methodological Rigor
The experimental controls are well designed and appropriately targeted.
However, there are notable methodological limitations. The authors cannot replicate Lindsey (2025) directly because Claude is inaccessible, so results are on open-weight models (Llama, Qwen, Gemma). While they reproduce Lindsey's two-way results on some of these models, the generalizability gap is real. Additionally, the steering experiments show high inter-concept variability and sensitivity to prompt wording, which somewhat weakens the conclusions. The sample sizes for vector-steering experiments (500 total across concepts) are modest, and the authors acknowledge compute constraints.
The Belief Dominance experiments (Section 5.2) are convincing in their simplicity: probes on concatenated entity embeddings match or exceed ICL performance, strongly suggesting the labels are input-predictable. The use of 15 train-test splits with reported standard deviations adds reliability.
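The evaluation protocol mentioned here, an input-only probe scored over repeated train-test splits and reported as mean and standard deviation, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. The synthetic 2-D features stand in for input embeddings, the nearest-centroid classifier stands in for whatever probe the paper uses, and the split fraction and all function names are hypothetical.

```python
import random
import statistics


def make_toy_data(n_per_class, rng):
    """Synthetic 2-D 'input embeddings' whose label is predictable from the input."""
    data = []
    for label, center in ((0, -2.0), (1, 2.0)):
        for _ in range(n_per_class):
            x = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
            data.append((x, label))
    return data


def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_nearest_centroid(train):
    """Return one centroid per class; this stands in for the probe."""
    labels = {y for _, y in train}
    return {lab: centroid([x for x, y in train if y == lab]) for lab in labels}


def predict(centroids, x):
    def sq_dist(c):
        return (x[0] - c[0]) ** 2 + (x[1] - c[1]) ** 2
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def accuracy(centroids, test):
    return sum(predict(centroids, x) == y for x, y in test) / len(test)


def repeated_split_scores(data, n_splits=15, test_fraction=0.3, seed=0):
    """Score the probe on n_splits independent random train-test splits."""
    rng = random.Random(seed)
    n_test = int(len(data) * test_fraction)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:n_test], shuffled[n_test:]
        scores.append(accuracy(fit_nearest_centroid(train), test))
    return scores


DATA = make_toy_data(100, random.Random(0))
SCORES = repeated_split_scores(DATA)
MEAN_ACC = statistics.mean(SCORES)
STD_ACC = statistics.stdev(SCORES)
print(f"probe accuracy: {MEAN_ACC:.3f} +/- {STD_ACC:.3f} over {len(SCORES)} splits")
```

If a probe of this kind, which sees only the input, matches the model's in-context accuracy, then the labels are recoverable without any access to the model's later hidden states. Reporting the spread across splits shows whether a gap between the probe and the model exceeds split-to-split noise.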
3. Potential Impact
This paper has significant potential to shape the discourse around LLM metacognition and claims about AI consciousness and self-awareness. Its contributions operate at multiple levels.
4. Timeliness & Relevance
This paper is exceptionally timely. Lindsey (2025) generated significant attention, and the broader question of LLM self-awareness is increasingly prominent in both research and public discourse. The paper directly addresses a current trend of anthropomorphizing LLM capabilities, providing needed skepticism grounded in principled analysis. The rapid proliferation of claims about LLM metacognition (several papers from 2025-2026 are cited) makes this critical examination particularly valuable.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment: This is a valuable contribution that provides much-needed critical scrutiny of claims about LLM introspection. The experimental controls are simple yet effective, the conceptual framework is well-grounded, and the timing is excellent. The paper does not definitively rule out LLM introspection but convincingly demonstrates that current evidence is insufficient to support it—a measured and scientifically appropriate conclusion. It is likely to influence how future metacognition studies in LLMs are designed and evaluated.
Generated May 27, 2026
Comparison History (20)
Paper 2 addresses a fundamental, highly debated question regarding LLM capabilities (introspection and metacognition). By rigorously debunking flawed evaluation paradigms and introducing better controls, it prevents the field from pursuing scientifically unsound directions. This foundational 'reality check' is likely to broadly influence how researchers evaluate and interpret LLM internal states, generating broader theoretical impact than Paper 1's specific (though practically useful) methodological contribution to uncertainty quantification.
Paper 1 offers a highly actionable, novel methodology (Mechanistic AutoDAN) that significantly improves the efficiency of LLM red-teaming and provides concrete insights into mechanistic interpretability. While Paper 2 provides an important critical analysis of LLM metacognition, Paper 1 introduces a practical tool and optimization technique that will likely see broader immediate adoption and follow-up work in the rapidly growing field of AI safety evaluation.
Paper 1 likely has higher impact: it introduces a new, reproducible benchmark targeting an under-measured but practically critical variable (agent harness configuration) with substantial empirical scale (106 tasks, 5,194 trajectories) and rich logged artifacts/traces enabling broad downstream research (reliability, tool use, eval methodology, systems). Its applications are immediate for deployed agent stacks and could standardize reporting at the model–harness level. Paper 2 is timely and conceptually important as a critique with stronger controls, but its scope is narrower (specific introspection paradigms) and is less directly enabling for engineering and cross-field benchmarking.
While both papers critically evaluate current LLM capabilities, Paper 2 introduces a novel benchmark (LiveBrowseComp) that addresses a critical flaw in how search agents are evaluated. Given the immense current focus on RAG and autonomous web agents, a rigorous dataset to separate intrinsic knowledge from actual search capabilities offers immediate, highly practical utility for researchers. This benchmark is likely to see broader adoption and citation than the more theoretical cognitive arguments presented in Paper 1.
Paper 2 proposes a novel, actionable framework (cognitive scheduling of visual evidence acquisition) that changes the multimodal reasoning pipeline and shows consistent zero-shot gains across benchmarks, suggesting clearer methodological contribution and nearer-term applicability to real systems needing faithful visual grounding. Its modular “invoke perception on demand” idea is broadly relevant across VLMs, agents, and interactive perception. Paper 1 is a valuable critique with careful controls that improves evaluation rigor for LLM metacognition, but it is primarily corrective/diagnostic and may have narrower immediate downstream application than a new performant architecture-level approach.
Paper 1 tackles fundamental theoretical questions about LLM capabilities (introspection and metacognition) and rigorously challenges flawed assumptions in existing evaluation paradigms. By redirecting future research and preventing the field from building on invalid behavioral evidence, it promises a broader and more profound scientific impact across AI, cognitive science, and alignment compared to the valuable but more domain-specific benchmarking methodology presented in Paper 2.
Paper 2 addresses a more fundamental question about LLM capabilities (introspection/metacognition) with rigorous methodological critiques of existing work, introducing clever control conditions that expose confounds. Its findings challenge premature conclusions in the field and have broad implications for AI interpretability and alignment research. Paper 1, while introducing a useful benchmark (FACET) for emotional intelligence evaluation, is more incremental—primarily a benchmarking study. Paper 2's methodological contributions (distinguishing genuine introspection from pattern matching, demonstrating insufficiency of behavioral evidence) provide deeper, more generalizable insights for the field.
Paper 1 challenges fundamental assumptions about LLM metacognition, offering a critical re-evaluation that impacts AI safety, alignment, and cognitive science. By highlighting flaws in current evaluation paradigms, it has the potential to redirect future theoretical and empirical research across the broader AI community. While Paper 2 offers significant practical advancements in medical AI, Paper 1 addresses a more foundational scientific question with broader interdisciplinary implications.
Paper 1 addresses a fundamental question about LLM introspection with rigorous methodology, introducing controlled experimental paradigms that challenge premature claims in a rapidly growing field. It bridges cognitive science and AI, offering broad implications for how we evaluate LLM capabilities. Its critical examination of existing evaluation methods will likely influence future research standards. Paper 2, while practically useful, is primarily an engineering contribution (a copilot tool) with narrower methodological novelty and less potential to shift research directions across fields.
Paper 1 addresses a foundational and highly debated question regarding LLM introspection and metacognition. By rigorously challenging existing claims and introducing better-controlled evaluation paradigms, it provides critical insights into AI interpretability, safety, and our fundamental understanding of LLMs. Its findings have broad implications across the entire AI community. Paper 2, while theoretically novel in linking Transformers to k-means, focuses on a narrower niche (CoT on text-attributed graphs), giving it a more specialized and limited potential impact compared to Paper 1.
Paper 1 proposes a novel data-management abstraction (GEM) for long-term agent memory, with formal operators and correctness conditions plus a prototype (MemState), offering clear methodological structure and immediate systems implications for a timely, fast-growing agent ecosystem. Its impact could span databases, AI systems, and safety/auditing by reframing memory as a state-trajectory workload. Paper 2 is a valuable corrective that tightens evaluation rigor for LLM “introspection,” but is primarily a critique of existing paradigms with narrower direct application. Overall, Paper 1 has higher potential for foundational and cross-field impact.
Paper 1 addresses a fundamental question about LLM metacognition with rigorous experimental methodology, introducing important controls that challenge prior claims. It provides a conceptual framework (distinguishing introspection from pattern matching) with broad implications for AI interpretability and alignment research. Paper 2 describes practical engineering systems for scientific workflows but is more incremental, describing domain-specific tools without deep methodological novelty. Paper 1's findings will likely influence a wider research community working on LLM understanding, evaluation, and safety.
Paper 2 has higher likely scientific impact due to its methodological rigor and timeliness: it directly challenges prominent claims about LLM introspection with stronger controls, alternative explanations, and negative/clarifying results that can reshape evaluation standards. Its conclusions affect interpretability, safety, benchmarking, and cognitive-science-inspired ML, giving broad cross-field relevance. Paper 1 is a valuable systems agenda and tooling contribution, but it is more conceptual/engineering-oriented and its impact depends on adoption of a specific harness and benchmarks, whereas Paper 2 provides sharper falsification and widely applicable evaluation guidance.
Paper 2 likely has higher scientific impact: it challenges a prominent and timely claim (LLM introspection/metacognition) with tighter controls, reframing evaluation methodology and setting stronger evidentiary standards. This kind of corrective, conceptual-plus-empirical critique can influence many subsequent papers across interpretability, alignment, cognitive science, and evaluation. Paper 1 is practically valuable and novel for multi-agent coordination efficiency, but its impact is more engineering-focused within a narrower subarea and may be superseded by rapid system-level iteration.
Paper 1 addresses a fundamental and broadly relevant question about LLM introspection/metacognition that impacts the entire AI/ML community. It provides rigorous methodological critique of existing paradigms, introduces better-controlled experimental settings, and draws important connections to human metacognition research. Its findings—that current evidence for LLM introspection is insufficient—have significant implications for interpretability, alignment, and our understanding of LLM capabilities. Paper 2, while practically useful for supply chain applications, addresses a narrower domain-specific problem with an engineering-focused contribution that is less likely to influence broader scientific discourse.
Paper 2 addresses a fundamental question about LLM introspection/metacognition that is highly timely given the explosive growth of LLM research. It provides important methodological corrections to prior claims, introduces better-controlled experimental paradigms, and draws meaningful connections to human metacognition research. Its findings have broad implications across AI safety, interpretability, and philosophy of mind. Paper 1, while solid, addresses a more niche problem in multi-agent RL with incremental methodological contributions. Paper 2's potential to reshape how the community evaluates LLM self-knowledge gives it broader and more lasting impact.
Paper 2 likely has higher impact due to its direct applicability to real-world LLM agent deployment: it identifies non-monotone harness effects, provides controlled multi-model/multi-harness evidence, a failure taxonomy, and actionable harness-selection guidance. This is timely for agent reliability and evaluation, and its findings can influence tooling, benchmarks, and operational practices across software engineering and AI safety. Paper 1 is methodologically important and conceptually clarifying, but primarily constrains claims about LLM introspection; its immediate applied leverage and cross-industry uptake are likely smaller.
Paper 2 identifies a fundamental flaw in how multi-hop reasoning is currently evaluated and introduces a novel, robust methodology (the double-gate protocol) to isolate true compositionality from atomic knowledge. Because post-training evaluation and reasoning capabilities are central to current LLM development, this framework is likely to see broad adoption and directly influence model optimization. While Paper 1 provides a rigorous and necessary critique of LLM metacognition claims, its impact is more narrowly focused on the interpretability and cognitive science discourse rather than core model training and evaluation pipelines.
Paper 1 addresses a fundamental question about LLM introspection/metacognition with rigorous methodology, challenging premature claims in the field. It introduces important conceptual distinctions (genuine introspection vs. pattern matching) and controlled experimental paradigms that will influence how future research evaluates LLM self-knowledge. Its broad epistemological implications affect AI safety, interpretability, and cognitive science. Paper 2 presents a useful engineering framework for multi-agent RL optimization, but is more incremental—extending existing infrastructure (verl) with multi-agent support. While practically valuable, it has narrower conceptual impact compared to Paper 1's foundational critique.
Paper 2 addresses a fundamental question about LLM capabilities (introspection/metacognition) that has broad implications across AI safety, interpretability, and cognitive science. Its methodological contribution—introducing controlled baselines to distinguish genuine introspection from pattern matching—provides a rigorous framework applicable to many future claims about LLM self-knowledge. While Paper 1 is a strong applied contribution to polymer science, Paper 2's findings are more likely to influence a wider research community, reshape ongoing debates about LLM understanding, and establish important methodological standards for evaluating emergent AI capabilities.