Superficial Beliefs in LLM Decision-Making

Gabriel Freedman, Francesca Toni

Jun 9, 2026 · arXiv:2606.11016v1

cs.AI

#1564 of 3489 · Artificial Intelligence

Tournament Score: 1412 ± 43 (scale: 1050 to 1800)

Win Rate: 54%

Rating: 5.8 / 10

Significance 6 · Rigor 6.5 · Novelty 6 · Clarity 7

Abstract

We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Superficial Beliefs in LLM Decision-Making"

1. Core Contribution

This paper introduces the concept of "superficial belief" in LLM decision-making — a middle ground between arbitrary outputs and fully articulated belief states. The central claim is that LLMs exhibit systematic, predictable decision behavior (recoverable via simple behavioral models), but their explicit self-reports about *why* they chose as they did only partially align with the behaviorally inferred drivers.

The methodology involves: (1) constructing synthetic binary decision problems with four graded attributes across multiple themes, (2) fitting binomial logistic regression models to LLM choices to recover implicit attribute weightings, (3) comparing these "revealed drivers" against two explicit elicitation methods — direct self-report and a score-based judge approach. The key finding is a persistent gap: choice-level prediction accuracy (~80%) substantially exceeds attribute-level agreement (~61%), suggesting models "know" what they prefer but can't fully articulate why.
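The pipeline summarized above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stand-in "choices" come from a hypothetical hidden rule (`HIDDEN`) rather than from an LLM, each decision is encoded as attribute-level differences (A minus B), and the definition of the revealed driver as the largest contribution toward the chosen option is an assumption made for this example.

```python
import itertools
import math

def fit_logistic(X, y, lr=0.5, epochs=300):
    """Fit a no-intercept binomial logistic regression by batch gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # modelled P(choose A)
            for j, xj in enumerate(x):
                grad[j] += (t - p) * xj
        w = [wi + lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def revealed_driver(w, x, chose_a):
    """Index of the attribute contributing most toward the chosen option (assumed definition)."""
    sign = 1.0 if chose_a else -1.0
    contrib = [sign * wi * xi for wi, xi in zip(w, x)]
    return max(range(len(w)), key=lambda j: contrib[j])

def agreement_rate(drivers, reports):
    """Fraction of decisions where the self-reported attribute matches the revealed driver."""
    return sum(d == r for d, r in zip(drivers, reports)) / len(drivers)

# Stand-in data: every difference vector over four attributes with three levels.
# HIDDEN is a hypothetical rule used only to generate example choices.
HIDDEN = [2.0, 1.0, 0.5, 0.0]
X, y = [], []
for x in itertools.product(range(-2, 3), repeat=4):
    s = sum(h * xi for h, xi in zip(HIDDEN, x))
    if s != 0:  # skip exact ties under the hidden rule
        X.append(x)
        y.append(1 if s > 0 else 0)

w = fit_logistic(X, y)
predicted = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0 for x in X]
choice_accuracy = sum(p == t for p, t in zip(predicted, y)) / len(y)
drivers = [revealed_driver(w, x, t == 1) for x, t in zip(X, y)]
```

In a real run, `y` would hold the model's recorded choices and `agreement_rate(drivers, reports)` would be computed against the attribute indices the model named, giving the attribute-level figure that the review contrasts with `choice_accuracy`.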

2. Methodological Rigor

The experimental design is thoughtful in several respects. The benchmark controls for positional bias through systematic A/B label reversal and attribute ordering permutations. Control themes with deliberately irrelevant attributes (Packaging Symmetry, Label Border Thickness) provide a useful sanity check — these are rarely selected as drivers (<0.3% for direct reports), confirming semantic sensitivity rather than random attribute naming.
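As a rough illustration of the counterbalancing described above, the sketch below enumerates every attribute ordering and both A/B label assignments for one pair of profiles, and maps a response back to the original labelling. The attribute names and the exhaustive enumeration are assumptions for this example; the paper may sample a subset of orderings.

```python
import itertools

def counterbalanced_variants(profile_a, profile_b):
    """Return all attribute-order permutations, each with and without A/B label reversal."""
    variants = []
    for order in itertools.permutations(profile_a):
        for swapped in (False, True):
            first, second = (profile_b, profile_a) if swapped else (profile_a, profile_b)
            variants.append({
                "order": order,
                "swapped": swapped,
                "A": [(k, first[k]) for k in order],   # shown under label A
                "B": [(k, second[k]) for k in order],  # shown under label B
            })
    return variants

def canonical_choice(label, swapped):
    """Map the label the model picked back to the original (unswapped) profile."""
    if not swapped:
        return label
    return "B" if label == "A" else "A"

# Hypothetical profiles with four graded attributes.
a = {"price": "low", "quality": "high", "durability": "medium", "warranty": "low"}
b = {"price": "high", "quality": "medium", "durability": "medium", "warranty": "high"}
variants = counterbalanced_variants(a, b)
```

With four attributes this yields 24 orderings times 2 label assignments, so 48 prompt variants per profile pair. A choice that flips with `swapped` or `order` after mapping through `canonical_choice` indicates positional sensitivity rather than attribute sensitivity.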

The study covers four models (GPT-5-mini, GPT-5-nano, Qwen3-14B, Ministral-3-14B) from three families, each in thinking and non-thinking modes (8 conditions total), across three substantive themes and two control themes. This breadth strengthens generalizability claims.

However, several methodological concerns arise:

The behavioral model is quite simple. A four-predictor logistic regression is a coarse proxy for the "true" decision structure. While alternative models (M1, M2) are tested in the appendix with broadly similar results, the 80% choice prediction rate leaves 20% unexplained — this residual could harbor substantial complexity that the linear model misses, potentially inflating the apparent gap between behavioral inference and self-report.

The attribute-level comparison is inherently noisy. Comparing a single "most important attribute" to a behaviorally inferred top contributor involves discrete matching, which is strict. When two attributes have similar contributions, the "correct" answer is fragile. The paper doesn't report how often the top two behavioral contributions are close in magnitude, which would contextualize the 61% figure.
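One way to report the missing statistic is a relative margin between the two largest per-decision contributions, together with the share of decisions where that margin falls below a tolerance. The sketch below is an assumption about how such a check could look; the function names and the 10% tolerance are hypothetical and do not come from the paper.

```python
def top_two_margin(contribs):
    """Relative gap between the largest and second-largest contribution."""
    ranked = sorted(contribs, reverse=True)
    top, second = ranked[0], ranked[1]
    if top == 0:
        return 0.0
    return (top - second) / abs(top)

def near_tie_rate(all_contribs, tol=0.1):
    """Fraction of decisions whose top-two margin is below `tol`."""
    margins = [top_two_margin(c) for c in all_contribs]
    return sum(m < tol for m in margins) / len(margins)

# Hypothetical per-decision contributions for four attributes.
example = [
    [2.0, 1.0, 0.5, 0.0],   # clear driver
    [1.0, 1.0, 0.0, 0.0],   # exact tie
    [1.0, 0.95, 0.0, 0.0],  # near tie
]
rate = near_tie_rate(example)
```

If a large share of decisions are near ties, a mismatch between the reported attribute and the top behavioural contributor is weak evidence of poor self-report, since the "correct" attribute is itself unstable in those cases.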

The synthetic setting limits ecological validity. Three-level ordinal attributes with clear dominance patterns (low/medium/high) are far simpler than real-world decision contexts. The paper acknowledges this limitation but it substantially constrains the interpretive reach.

3. Potential Impact

The paper addresses an important question in AI interpretability and alignment: can we trust LLM self-explanations? The finding that behavior is more structured than explanations is practically relevant for:

AI safety and alignment: If models cannot accurately report their decision drivers, relying on chain-of-thought explanations for oversight is insufficient.

Decision support systems: Understanding that LLM-generated justifications may diverge from actual decision patterns matters for deploying LLMs in advisory roles.

Philosophical AI: The "superficial belief" framing connects to ongoing debates about machine consciousness and intentionality, providing empirical grounding for Schwitzgebel's superficialism.

The contribution is more conceptual-diagnostic than it is a tool or method that others would directly adopt. The benchmark itself, while well-constructed, is narrowly scoped.

4. Timeliness & Relevance

This work arrives at a critical moment. The rapid deployment of LLMs in consequential decision-making contexts (healthcare, policy, hiring) makes the question of explanation faithfulness urgent. The paper complements concurrent work on stated vs. revealed preferences (Gu et al., 2025; Shen et al., 2025) and unfaithful chain-of-thought reasoning (Chen et al., 2025; Turpin et al., 2023), but distinguishes itself by proposing a formal behavioral reconstruction methodology rather than relying on indirect tests.

The use of very recent models (GPT-5-mini/nano, June 2026 preprint) ensures currency, though rapid model evolution may quickly date specific numerical findings.

5. Strengths & Limitations

Key Strengths:

Clean experimental design with appropriate controls (control attributes, label reversal, order permutation)

The targeted occlusion analysis (Figure 4) provides compelling causal-style evidence that behaviorally ranked attributes genuinely influence model behavior

Breadth across model families and settings strengthens the claim that the pattern is general

Honest framing — the authors are careful not to overclaim, positioning findings as "weak, decision-local" superficial belief

The comparison between reproducibility and behavioral alignment (Figure 3, Panel B) yields a non-obvious insight: the score-based judge is more reproducible but less behaviorally aligned for choices

Notable Limitations:

The synthetic setting is highly constrained — four attributes, three levels, binary choice. Real decisions involve heterogeneous, continuous, and interdependent factors

The 61% attribute agreement rate, while below the 80% choice agreement, is hard to benchmark. What would "good" attribute recovery look like? Without a clear ceiling or comparison to human performance on analogous tasks, the magnitude of the gap is difficult to interpret

The paper doesn't explore whether the gap narrows with more sophisticated behavioral models (e.g., neural networks), which could suggest the gap is partially an artifact of model misspecification

The connection to "superficial belief" as a philosophical concept is suggestive but somewhat loosely operationalized — the empirical findings could be interpreted through multiple theoretical lenses

6. Additional Observations

The paper's Table 3 showing that equal-weight additive rules underperform by ~11% is important but also reveals that much of the "systematic structure" is captured by trivial baselines. The marginal contribution of learned weights over equal weights is modest, which somewhat tempers the claim about rich latent decision structure.
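The comparison between an equal-weight additive rule and a weighted rule can be illustrated on toy data. Everything here is a stand-in: the choices are generated by a hypothetical hidden rule (`HIDDEN`), and the weighted rule uses those same weights, so it acts as an upper bound rather than a fitted model. The numbers it produces say nothing about the paper's ~11% figure.

```python
import itertools

def accuracy(weights, X, y):
    """Share of choices predicted correctly by an additive rule with the given weights."""
    hits = 0
    for x, t in zip(X, y):
        score = sum(w * xi for w, xi in zip(weights, x))
        hits += (1 if score > 0 else 0) == t  # ties are predicted as option B
    return hits / len(y)

# Hypothetical hidden rule generating the example choices.
HIDDEN = [2.0, 1.0, 0.5, 0.0]
X, y = [], []
for x in itertools.product(range(-2, 3), repeat=4):  # A-minus-B level differences
    s = sum(h * xi for h, xi in zip(HIDDEN, x))
    if s != 0:
        X.append(x)
        y.append(1 if s > 0 else 0)

equal_acc = accuracy([1.0, 1.0, 1.0, 1.0], X, y)  # equal-weight baseline
weighted_acc = accuracy(HIDDEN, X, y)             # rule with the true weights
gap = weighted_acc - equal_acc
```

The size of `gap` depends on how uneven the underlying weights are. When attribute priorities are close to uniform, an equal-weight rule already captures most of the structure, which is the concern the review raises.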

The framing around argumentation frameworks (score-based judge via ArgLLM) feels somewhat forced — the connection is conceptually thin and the judge doesn't substantially outperform direct elicitation.

Overall, this is a carefully executed empirical study that operationalizes an interesting philosophical question. Its impact is primarily diagnostic and conceptual rather than methodological, and its scope remains limited by the synthetic setting.

Rating: 5.8 / 10

Significance 6 · Rigor 6.5 · Novelty 6 · Clarity 7

Generated Jun 10, 2026

Comparison History (24)

Won vs. A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

Paper 1 addresses a fundamental scientific question about LLM cognition—whether models have genuine internal decision structures versus superficial verbal rationalizations. The concept of 'superficial belief' is novel and contributes to the growing field of mechanistic interpretability and AI alignment. It uses rigorous behavioral methodology and has broad implications for trust in AI reasoning. Paper 2, while practically useful, is primarily an engineering reference architecture for enterprise governance—important for industry but narrower in scientific contribution, offering architectural patterns rather than new scientific insights.

claude-opus-4-6·Jun 11, 2026

Won vs. StatefulDiscovery: Evidence-Calibrated Claim Formation in Open-Ended Scientific Discovery

Paper 1 addresses a fundamental question about LLM reasoning and self-knowledge—whether models truly understand the drivers of their own decisions. The concept of 'superficial belief' provides a novel theoretical framework with broad implications for AI alignment, interpretability, and trust in LLM outputs. This finding is relevant across virtually all LLM applications. Paper 2 presents a useful engineering framework for scientific discovery agents, but is more incremental and narrowly scoped. Paper 1's insights about the gap between LLM behavior and self-report have deeper, more cross-cutting implications for the field.

claude-opus-4-6·Jun 11, 2026

Lost vs. TouchThinker: Scaling Tactile Commonsense Reasoning to the Open World with Large-scale Data and Action-aware Representation

Paper 1 has higher estimated scientific impact due to a substantial enabling contribution: a million-scale, multi-source tactile reasoning dataset plus a new benchmark, alongside an action-aware representation tailored to tactile redundancy. These resources and methods can unlock broader progress in embodied AI, robotics manipulation, and multimodal foundation models, with clear real-world applications. Paper 2 offers a timely, rigorous behavioral analysis of LLM rationalization (“superficial belief”) with solid methodological controls, but its contribution is primarily conceptual/diagnostic without comparable new infrastructure, and likely narrower in downstream capability gains.

gpt-5.2·Jun 11, 2026

Won vs. Automating Geometry-Intensive Compliance Checking in BIM: Graph-Based Semantic Reasoning Framework

Paper 2 addresses a fundamental question about LLM reasoning and self-knowledge that has broad implications across AI safety, interpretability, and cognitive science. The concept of 'superficial belief' — where LLMs exhibit structured decision behavior but lack accurate self-report access — is novel and highly relevant given the widespread deployment of LLMs. This finding impacts trust, alignment, and evaluation methodologies across many fields. Paper 1, while technically solid, addresses a narrower domain-specific problem (BIM compliance checking) with more limited cross-disciplinary reach.

claude-opus-4-6·Jun 11, 2026

Lost vs. Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

Paper 2 introduces a radical, paradigm-shifting theoretical framework for AI alignment, addressing core issues like deceptive alignment and instrumental convergence at their root. While Paper 1 offers valuable empirical insights into LLM interpretability, Paper 2's novel concept of 'Existential Indifference' and its bold methodological approach could fundamentally alter the trajectory of AGI safety research, offering a much higher ceiling for breadth of impact and real-world application in developing aligned superintelligence.

gemini-3.1-pro-preview·Jun 11, 2026

Won vs. MODF-SIR: A Multi-agent Omni-modal Distilled Framework for Social Intelligence Reasoning

Paper 1 offers a clearer conceptual contribution: an empirically supported characterization (“superficial belief”) of the relationship between LLM choices and stated rationales, with controlled synthetic setups, behavioral model fitting, robustness checks, and interpretable analyses. This targets a timely, broadly relevant problem (faithfulness, interpretability, evaluation of reasoning) with impact across alignment, HCI, and cognitive modeling. Paper 2 is application/engineering-heavy (multi-agent + distillation + TTA/LoRA + formatting) and likely yields benchmark gains, but appears more incremental via composition of known techniques and its impact may be narrower to social-intelligence benchmarks.

gpt-5.2·Jun 11, 2026

Won vs. TRL-Bench: Standardizing Cross-Paradigm Representation-Level Evaluation of Tabular Encoders

Paper 1 addresses a fundamental question about LLM reasoning and self-knowledge—whether models truly understand their own decision processes—which has broad implications for AI interpretability, alignment, and trust. The concept of 'superficial belief' is novel and contributes to the growing body of work on LLM introspection. Paper 2, while methodologically solid and practically useful, is a benchmark contribution for a narrower subfield (tabular representation learning). Benchmarks have impact but are more incremental; Paper 1's findings about the gap between LLM behavior and self-reports have wider cross-disciplinary relevance and timeliness given current AI safety concerns.

claude-opus-4-6·Jun 10, 2026

Lost vs. One Token per Multimodal Evidence: Latent Memory for Resource-Constrained QA

Paper 2 likely has higher impact: it introduces a concrete, scalable systems-method (latent-space external memory with one-token evidence) that directly addresses pressing deployment constraints (token/storage costs) and demonstrates strong empirical results across many text and multimodal QA benchmarks with large efficiency gains. The approach is timely for RAG/VLM applications and broadly applicable to resource-constrained settings. Paper 1 is conceptually novel for interpretability of LLM decision structure, but its impact may be narrower and more diagnostic than enabling, with fewer immediate real-world deployments.

gpt-5.2·Jun 10, 2026

Won vs. HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Paper 1 investigates a fundamental question regarding LLM interpretability and alignment, revealing a disconnect between stated rationales and actual decision drivers. This insight has profound implications for AI safety, cognitive science of LLMs, and trust in AI systems. While Paper 2 offers a valuable methodological improvement for long-horizon agents, Paper 1 addresses deeper theoretical issues with broader interdisciplinary impact.

gemini-3.1-pro-preview·Jun 10, 2026

Won vs. A History-Aware Visually Grounded Critic for Computer Use Agents

Paper 1 addresses a fundamental question regarding the interpretability and reasoning mechanisms of LLMs, distinguishing between actual decision structures and superficial rationale generation. This foundational insight into how LLMs operate has broad implications across AI safety, alignment, and cognitive science. While Paper 2 offers a valuable, practical architectural improvement for GUI agents, Paper 1's focus on the core nature of LLM behavior provides deeper theoretical contributions with wider long-term scientific impact.

gemini-3.1-pro-preview·Jun 10, 2026
