Gabriel Freedman, Francesca Toni
We ask whether large language models (LLMs) merely imitate rationales when choosing between two options, or whether their choices reflect a systematic underlying decision structure. Using synthetic binary decision settings in which models choose between profiles defined by graded attributes, we compare the attribute a model says mattered most with the attribute that best explains its choice under a behavioural model fit to prior decisions. The behavioural model predicts held-out choices well, showing that model behaviour is systematically related to the visible attributes rather than being random. However, direct self-reports and a separate score-based judge recover the behaviourally inferred driver only partially. The resulting picture is neither one of arbitrary behaviour nor one of fully articulated belief: outputs are structured enough to support prediction, but explicit reasons track the recovered driver only imperfectly. This qualitative pattern persists across prompt-order and sampling perturbations, alternative behavioural models, targeted occlusion analyses, and structurally varied decision settings. We interpret this as evidence for "superficial belief" in LLM decision-making: models behave as if guided by probabilistic local priorities over attributes, while having only limited verbal access to the attributes that drive their decisions.
This paper introduces the concept of "superficial belief" in LLM decision-making — a middle ground between arbitrary outputs and fully articulated belief states. The central claim is that LLMs exhibit systematic, predictable decision behavior (recoverable via simple behavioral models), but their explicit self-reports about *why* they chose as they did only partially align with the behaviorally inferred drivers.
The methodology involves: (1) constructing synthetic binary decision problems with four graded attributes across multiple themes, (2) fitting binomial logistic regression models to LLM choices to recover implicit attribute weightings, (3) comparing these "revealed drivers" against two explicit elicitation methods — direct self-report and a score-based judge approach. The key finding is a persistent gap: choice-level prediction accuracy (~80%) substantially exceeds attribute-level agreement (~61%), suggesting models "know" what they prefer but can't fully articulate why.
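The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names, the toy weights in `TRUE_W`, the intercept-free model fit by plain gradient ascent, and the definition of the "revealed driver" as the attribute with the largest weighted difference in favour of the chosen option are all assumptions made here for illustration, and the data are simulated rather than taken from the paper.

```python
import math
import random

ATTRIBUTES = ["attr_1", "attr_2", "attr_3", "attr_4"]  # hypothetical names


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit intercept-free logistic weights by batch gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wk * xk for wk, xk in zip(w, xi)))
            for k, xk in enumerate(xi):
                grad[k] += err * xk
        w = [wk + lr * g / len(X) for wk, g in zip(w, grad)]
    return w


def predict(w, x):
    """Return 1 if the fitted model favours option A, else 0."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) >= 0 else 0


def revealed_driver(w, x, choice):
    """Attribute whose weighted difference pushes hardest toward the choice."""
    sign = 1.0 if choice == 1 else -1.0
    contributions = [sign * wk * xk for wk, xk in zip(w, x)]
    return ATTRIBUTES[contributions.index(max(contributions))]


def agreement_rate(revealed, stated):
    """Share of decisions where the stated driver matches the revealed one."""
    return sum(r == s for r, s in zip(revealed, stated)) / len(revealed)


# Simulated decisions (toy data, not the paper's): each row holds the
# attribute-level differences A minus B; y is 1 when option A was chosen.
rng = random.Random(0)
TRUE_W = [2.0, 1.0, 0.5, 0.0]  # assumed latent priorities of the toy chooser


def simulate(n):
    X, y = [], []
    for _ in range(n):
        x = [rng.randint(1, 5) - rng.randint(1, 5) for _ in ATTRIBUTES]
        p_a = sigmoid(sum(t * v for t, v in zip(TRUE_W, x)))
        X.append(x)
        y.append(1 if rng.random() < p_a else 0)
    return X, y


X_train, y_train = simulate(400)
X_test, y_test = simulate(200)
w = fit_logistic(X_train, y_train)
accuracy = sum(predict(w, x) == c for x, c in zip(X_test, y_test)) / len(X_test)
drivers = [revealed_driver(w, x, c) for x, c in zip(X_test, y_test)]
```

In this framing, `accuracy` plays the role of choice-level prediction, while `agreement_rate(drivers, stated)` would give attribute-level agreement once a list of self-reported drivers (`stated`, not simulated here) is available.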
The experimental design is thoughtful in several respects. The benchmark controls for positional bias through systematic A/B label reversal and attribute ordering permutations. Control themes with deliberately irrelevant attributes (Packaging Symmetry, Label Border Thickness) provide a useful sanity check — these are rarely selected as drivers (<0.3% for direct reports), confirming semantic sensitivity rather than random attribute naming.
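One way to picture the positional controls is the sketch below, which enumerates every attribute ordering crossed with both label assignments for a single pair of profiles. The full crossing, the function names, and the example attribute names and levels are assumptions for illustration; the paper may sample orderings rather than enumerate them all.

```python
from itertools import permutations


def counterbalanced_variants(profile_x, profile_y):
    """Build all attribute orderings crossed with both A/B label assignments."""
    attrs = list(profile_x)
    variants = []
    for order in permutations(attrs):
        for swapped in (False, True):
            # When swapped, profile y is shown under label A and x under B.
            shown_a, shown_b = (profile_y, profile_x) if swapped else (profile_x, profile_y)
            variants.append({
                "order": order,
                "swapped": swapped,
                "A": [(name, shown_a[name]) for name in order],
                "B": [(name, shown_b[name]) for name in order],
            })
    return variants


def underlying_choice(variant, label):
    """Map a chosen label ('A' or 'B') back to the underlying profile."""
    if label not in ("A", "B"):
        raise ValueError("label must be 'A' or 'B'")
    picked_first = label == "A"
    return "x" if picked_first != variant["swapped"] else "y"


# Hypothetical attribute names and graded levels.
profile_x = {"Price": 4, "Durability": 2, "Warranty": 5, "Energy Use": 3}
profile_y = {"Price": 2, "Durability": 5, "Warranty": 3, "Energy Use": 4}
variants = counterbalanced_variants(profile_x, profile_y)
```

Mapping each response back through `underlying_choice` lets choices be compared across variants: a model that always picks label "A" would split evenly between the two underlying profiles, which separates positional bias from attribute-driven preference.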
The study covers four models (GPT-5-mini, GPT-5-nano, Qwen3-14B, Ministral-3-14B), each in thinking and non-thinking modes (8 conditions total), across three substantive themes and two control themes. This breadth strengthens the generalizability claims.
However, several methodological concerns arise:
The paper addresses an important question in AI interpretability and alignment: can we trust LLM self-explanations? The finding that behavior is more structured than explanations is practically relevant for:
The contribution is more conceptual-diagnostic than it is a tool or method that others would directly adopt. The benchmark itself, while well-constructed, is narrowly scoped.
This work arrives at a critical moment. The rapid deployment of LLMs in consequential decision-making contexts (healthcare, policy, hiring) makes the question of explanation faithfulness urgent. The paper complements concurrent work on stated vs. revealed preferences (Gu et al., 2025; Shen et al., 2025) and unfaithful chain-of-thought reasoning (Chen et al., 2025; Turpin et al., 2023), but distinguishes itself by proposing a formal behavioral reconstruction methodology rather than relying on indirect tests.
The use of very recent models (GPT-5-mini and GPT-5-nano, in a June 2026 preprint) keeps the study current, though rapid model evolution may quickly date its specific numerical findings.
The paper's Table 3 showing that equal-weight additive rules underperform by ~11% is important but also reveals that much of the "systematic structure" is captured by trivial baselines. The marginal contribution of learned weights over equal weights is modest, which somewhat tempers the claim about rich latent decision structure.
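To make the baseline comparison concrete, the sketch below contrasts an equal-weight additive rule with a weighted one on four hand-made decisions. The decisions, the weights, and the tie-breaking toward option A are invented for illustration and are not drawn from the paper's Table 3.

```python
def equal_weight_choice(profile_a, profile_b):
    """Pick the option with the larger unweighted attribute sum; ties go to A."""
    return "A" if sum(profile_a) >= sum(profile_b) else "B"


def weighted_choice(profile_a, profile_b, weights):
    """Pick the option favoured by a weighted sum of attribute differences."""
    score = sum(w * (a - b) for w, a, b in zip(weights, profile_a, profile_b))
    return "A" if score >= 0 else "B"


def rule_accuracy(rule, decisions):
    """Share of observed choices that a rule reproduces."""
    hits = sum(rule(a, b) == observed for a, b, observed in decisions)
    return hits / len(decisions)


# Toy decisions: (levels of option A, levels of option B, observed choice).
decisions = [
    ((5, 1, 1, 1), (1, 3, 3, 3), "A"),
    ((2, 5, 4, 4), (4, 2, 2, 2), "A"),
    ((1, 4, 4, 2), (4, 3, 3, 3), "B"),
    ((3, 3, 3, 3), (2, 2, 2, 2), "A"),
]
weights = [3.0, 1.0, 1.0, 1.0]  # assumed weights favouring the first attribute

equal_acc = rule_accuracy(equal_weight_choice, decisions)
weighted_acc = rule_accuracy(lambda a, b: weighted_choice(a, b, weights), decisions)
```

The two rules disagree only on the first decision, where one attribute outweighs the combined advantage on the other three; the gap between `equal_acc` and `weighted_acc` is the quantity that a comparison like the one in Table 3 would measure on real choices.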
The framing around argumentation frameworks (score-based judge via ArgLLM) feels somewhat forced — the connection is conceptually thin and the judge doesn't substantially outperform direct elicitation.
Overall, this is a carefully executed empirical study that operationalizes an interesting philosophical question. Its impact is primarily diagnostic and conceptual rather than methodological, and its scope remains limited by the synthetic setting.
Generated Jun 10, 2026
Paper 1 addresses a fundamental scientific question about LLM cognition—whether models have genuine internal decision structures versus superficial verbal rationalizations. The concept of 'superficial belief' is novel and contributes to the growing field of mechanistic interpretability and AI alignment. It uses rigorous behavioral methodology and has broad implications for trust in AI reasoning. Paper 2, while practically useful, is primarily an engineering reference architecture for enterprise governance—important for industry but narrower in scientific contribution, offering architectural patterns rather than new scientific insights.
Paper 1 addresses a fundamental question about LLM reasoning and self-knowledge—whether models truly understand the drivers of their own decisions. The concept of 'superficial belief' provides a novel theoretical framework with broad implications for AI alignment, interpretability, and trust in LLM outputs. This finding is relevant across virtually all LLM applications. Paper 2 presents a useful engineering framework for scientific discovery agents, but is more incremental and narrowly scoped. Paper 1's insights about the gap between LLM behavior and self-report have deeper, more cross-cutting implications for the field.
Paper 1 has higher estimated scientific impact due to a substantial enabling contribution: a million-scale, multi-source tactile reasoning dataset plus a new benchmark, alongside an action-aware representation tailored to tactile redundancy. These resources and methods can unlock broader progress in embodied AI, robotics manipulation, and multimodal foundation models, with clear real-world applications. Paper 2 offers a timely, rigorous behavioral analysis of LLM rationalization (“superficial belief”) with solid methodological controls, but its contribution is primarily conceptual/diagnostic without comparable new infrastructure, and likely narrower in downstream capability gains.
Paper 2 addresses a fundamental question about LLM reasoning and self-knowledge that has broad implications across AI safety, interpretability, and cognitive science. The concept of 'superficial belief' — where LLMs exhibit structured decision behavior but lack accurate self-report access — is novel and highly relevant given the widespread deployment of LLMs. This finding impacts trust, alignment, and evaluation methodologies across many fields. Paper 1, while technically solid, addresses a narrower domain-specific problem (BIM compliance checking) with more limited cross-disciplinary reach.
Paper 2 introduces a radical, paradigm-shifting theoretical framework for AI alignment, addressing core issues like deceptive alignment and instrumental convergence at their root. While Paper 1 offers valuable empirical insights into LLM interpretability, Paper 2's novel concept of 'Existential Indifference' and its bold methodological approach could fundamentally alter the trajectory of AGI safety research, offering a much higher ceiling for breadth of impact and real-world application in developing aligned superintelligence.
Paper 1 offers a clearer conceptual contribution: an empirically supported characterization (“superficial belief”) of the relationship between LLM choices and stated rationales, with controlled synthetic setups, behavioral model fitting, robustness checks, and interpretable analyses. This targets a timely, broadly relevant problem (faithfulness, interpretability, evaluation of reasoning) with impact across alignment, HCI, and cognitive modeling. Paper 2 is application/engineering-heavy (multi-agent + distillation + TTA/LoRA + formatting) and likely yields benchmark gains, but appears more incremental via composition of known techniques and its impact may be narrower to social-intelligence benchmarks.
Paper 1 addresses a fundamental question about LLM reasoning and self-knowledge—whether models truly understand their own decision processes—which has broad implications for AI interpretability, alignment, and trust. The concept of 'superficial belief' is novel and contributes to the growing body of work on LLM introspection. Paper 2, while methodologically solid and practically useful, is a benchmark contribution for a narrower subfield (tabular representation learning). Benchmarks have impact but are more incremental; Paper 1's findings about the gap between LLM behavior and self-reports have wider cross-disciplinary relevance and timeliness given current AI safety concerns.
Paper 2 likely has higher impact: it introduces a concrete, scalable systems-method (latent-space external memory with one-token evidence) that directly addresses pressing deployment constraints (token/storage costs) and demonstrates strong empirical results across many text and multimodal QA benchmarks with large efficiency gains. The approach is timely for RAG/VLM applications and broadly applicable to resource-constrained settings. Paper 1 is conceptually novel for interpretability of LLM decision structure, but its impact may be narrower and more diagnostic than enabling, with fewer immediate real-world deployments.
Paper 1 investigates a fundamental question regarding LLM interpretability and alignment, revealing a disconnect between stated rationales and actual decision drivers. This insight has profound implications for AI safety, cognitive science of LLMs, and trust in AI systems. While Paper 2 offers a valuable methodological improvement for long-horizon agents, Paper 1 addresses deeper theoretical issues with broader interdisciplinary impact.
Paper 1 addresses a fundamental question regarding the interpretability and reasoning mechanisms of LLMs, distinguishing between actual decision structures and superficial rationale generation. This foundational insight into how LLMs operate has broad implications across AI safety, alignment, and cognitive science. While Paper 2 offers a valuable, practical architectural improvement for GUI agents, Paper 1's focus on the core nature of LLM behavior provides deeper theoretical contributions with wider long-term scientific impact.