Representation Without Control: Testing the Realization Effect in Language Models
Ciarán Walsh, Emilio Barkett
Abstract
Large language models are increasingly used as behavioral simulators, but it remains unclear when their outputs reflect human-like cognitive mechanisms rather than prompt-sensitive surface patterns. We study this question through the realization effect, a well-characterized finding in behavioral economics in which risk-taking differs systematically after paper versus realized gains and losses. We evaluate LLM behavior at three levels: prompt-only behavioral sensitivity, linear readout of internal representations, and causal control via activation steering. Prompt-only results show systematic condition sensitivity, but the directional pattern does not reproduce human realization-effect predictions. Gemma's residual stream contains a linearly decodable realization-status signal at layer 18 that generalizes to held-out prompts. Steering along this direction does not, however, reliably shift downstream risk choices, a null result that holds across positive scales and in a negative sign-symmetry run. Behavioral sensitivity, latent readout, and causal control are three distinct properties that do not automatically co-occur, and successful latent readout is insufficient evidence that a model behaviorally relies on a representation during downstream decision-making.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper investigates a conceptually important question: when LLMs produce outputs that appear to reflect human cognitive phenomena, do they actually implement the underlying mechanism? The authors study this through the "realization effect" — a behavioral economics finding where risk-taking differs depending on whether prior gains/losses are "paper" (open) or "realized" (closed). The core contribution is demonstrating a dissociation between linear readout and causal control: Gemma 3 4B encodes a linearly decodable realization-status signal in its residual stream (layer 18), but steering along this direction fails to shift downstream risk behavior. The authors frame this as evidence that behavioral sensitivity, latent readout, and causal control are three distinct evidential standards that do not automatically co-occur.
This conceptual decomposition — distinguishing representation from behavioral relevance — is the paper's most valuable contribution. The argument that linear decodability is insufficient evidence for mechanistic implementation echoes and extends prior theoretical arguments (Park et al., 2024) with concrete empirical demonstration.
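The readout-without-control pattern described above can be illustrated with a toy sketch. This is not the authors' pipeline and does not touch Gemma: it uses synthetic Gaussian "activations" in which a condition signal lives on one axis while a hypothetical downstream choice readout (`choice_weights`) depends only on a different axis. All names, dimensions, and scales here are assumptions chosen for illustration. Under that construction, a mean-difference direction decodes the condition on held-out samples, yet adding that direction to the activations barely moves the choice logit, while a positive-control direction aligned with the choice readout does move it.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    norm = dot(v, v) ** 0.5
    return [x / norm for x in v]

def make_activations(rng, n, dim, signal_axis, strength):
    """Synthetic stand-in for residual-stream vectors: unit Gaussian noise
    plus a label-dependent shift on a single axis (an assumption of this toy)."""
    data = []
    for i in range(n):
        label = i % 2  # 0 = "paper", 1 = "realized" (balanced classes)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[signal_axis] += strength if label else -strength
        data.append((h, label))
    return data

def mean_difference_direction(data):
    """Unit vector from the class-0 mean to the class-1 mean."""
    pos = [h for h, y in data if y == 1]
    neg = [h for h, y in data if y == 0]
    return unit([p - q for p, q in zip(mean_vec(pos), mean_vec(neg))])

def readout_accuracy(data, direction, threshold):
    """Linear readout: classify by thresholding the projection onto the direction."""
    hits = sum((dot(h, direction) > threshold) == (y == 1) for h, y in data)
    return hits / len(data)

def steering_shift(data, direction, scale, choice_weights):
    """Change in the mean choice logit after adding scale * direction to every vector."""
    base = sum(dot(h, choice_weights) for h, _ in data) / len(data)
    steered = sum(
        dot([x + scale * d for x, d in zip(h, direction)], choice_weights)
        for h, _ in data
    ) / len(data)
    return steered - base

rng = random.Random(0)
DIM = 16
train = make_activations(rng, 400, DIM, signal_axis=0, strength=2.0)
held_out = make_activations(rng, 200, DIM, signal_axis=0, strength=2.0)

# Hypothetical downstream readout: depends only on axis 1, orthogonal to the signal.
choice_weights = [0.0] * DIM
choice_weights[1] = 1.0

direction = mean_difference_direction(train)
threshold = dot(mean_vec([h for h, _ in train]), direction)

accuracy = readout_accuracy(held_out, direction, threshold)
probe_shift_pos = steering_shift(held_out, direction, 4.0, choice_weights)
probe_shift_neg = steering_shift(held_out, direction, -4.0, choice_weights)
control_shift = steering_shift(held_out, unit(choice_weights), 4.0, choice_weights)

print("held-out readout accuracy:", accuracy)
print("choice shift, probe direction (+4, -4):", probe_shift_pos, probe_shift_neg)
print("choice shift, positive-control direction (+4):", control_shift)
```

The dissociation here is built in by making the choice readout orthogonal to the signal axis, so the sketch only shows that the two properties can come apart; it says nothing about whether they do in a real model. The positive-control line mirrors the reviewer's request: it confirms that the steering procedure itself can shift the choice measure when given a direction the readout depends on.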
Methodological Rigor
The experimental design is logical and well-structured, but the execution has several notable limitations that weaken confidence in the conclusions:
Strengths:
Weaknesses:
Potential Impact
The paper contributes to two active research conversations:
1. LLMs as behavioral surrogates: The finding that prompt-only behavioral patterns don't match human realization-effect predictions adds to a growing body of evidence cautioning against naïve use of LLMs as human subject replacements. However, this negative behavioral result alone is incremental — similar warnings have been sounded repeatedly.
2. Mechanistic interpretability: The readout-without-control dissociation is the more impactful finding. It provides a concrete example supporting the theoretical concern that linear probes may detect epiphenomenal rather than causally relevant representations. This has practical implications for the interpretability community, which frequently uses probing accuracy as evidence that models "understand" or "represent" particular concepts. The result suggests that such claims require causal validation.
The practical impact is moderate. The paper doesn't introduce new methods — it applies existing techniques (mean-difference directions, activation steering) to a new domain. The conceptual message, while important, is not entirely new (Park et al. made the theoretical argument; representation engineering papers have noted similar gaps). The empirical demonstration adds value, but the narrow experimental scope (one model, one layer, one construct) limits generalizability.
Timeliness & Relevance
The paper is well-timed. The use of LLMs as simulated human subjects is accelerating, and the interpretability community is actively debating what probing results actually mean. The paper contributes to the "causal turn" in interpretability, arguing that correlational methods are insufficient — a message that is gaining traction but still needs empirical grounding. The choice of the realization effect as a test case is clever: it's specific enough to have clear directional predictions, yet complex enough that surface-level sensitivity could mimic genuine understanding.
Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall Assessment
This is a conceptually sound paper that makes a valid point (readout ≠ causal relevance), but the empirical evidence, while consistent with the claim, is narrowly scoped and does not conclusively establish it. The null steering result could reflect methodological limitations rather than a genuine dissociation. The paper would be substantially strengthened by multi-layer sweeps, alternative steering methods, cross-model replication, and a convincing positive control demonstrating that *some* direction can steer risk behavior through the same pipeline. As it stands, this is a useful contribution to ongoing debates in interpretability and LLM-as-subject research, but it falls short of the definitive demonstration the framing suggests.
Generated May 26, 2026
Comparison History (19)
Paper 1 challenges a fundamental assumption in LLM interpretability by demonstrating that decodable internal representations do not necessarily exert causal control over behavior. This mechanistic insight has profound implications for AI alignment, steering, and the use of LLMs as behavioral simulators, offering a deep methodological contribution likely to broadly influence future research. While Paper 2 presents a valuable framework for AI ethics, Paper 1's rigorous decoupling of representation and control addresses a critical, foundational gap in our understanding of model mechanics.
Paper 1 addresses the timely and practically important problem of AI safety/control in deployed coding agents, providing actionable findings on retrying vs resampling strategies with concrete empirical results that contradict prior work. Its direct relevance to AI safety infrastructure (Claude Code, Codex) gives it broad real-world applicability. Paper 2 offers a methodologically careful mechanistic interpretability study but yields primarily null/negative results about activation steering, with narrower scope (one cognitive bias in one model). Paper 1's findings are more likely to influence deployed AI safety practices and future AI control research.
Paper 2 introduces a concrete, reusable pipeline (Anchor) addressing a recognized practical problem (artifact drift in agent benchmarks) and releases a substantial benchmark (ERP-Bench) for economically valuable enterprise tasks. This has broader impact: it enables reproducible evaluation of AI agents on real-world business operations, serves multiple research communities, and provides infrastructure others can build upon. Paper 1 offers a thoughtful mechanistic interpretability study with an important null result (representation without causal control), but its scope is narrower, focused on one cognitive effect in one model, limiting its broader influence.
Paper 1 likely has higher impact: it proposes a concrete RL framework (Calibrated Interactive RL) addressing a practical, widely encountered failure mode (compounding multi-turn distribution shift), offers theoretical analysis, and demonstrates state-of-the-art improvements across dialogue tasks—clear real-world deployment relevance and broad applicability to interactive LLM systems. Paper 2 is methodologically careful and conceptually important for interpretability/behavioral validity (decoding vs causal control), but is more diagnostic than enabling and its primary contribution is a null/clarifying result with narrower immediate application scope.
Paper 1 addresses a critical, widespread problem (object hallucination in LVLMs) with a practical, training-free solution demonstrating state-of-the-art results on standard benchmarks. Its immediate applicability to deployed vision-language systems gives it broad impact. Paper 2 offers valuable methodological insights about LLM interpretability—distinguishing representation from causal control—but its scope is narrower (one specific cognitive phenomenon) and its primary contribution is a null/cautionary result, which, while important, typically generates less follow-on work and adoption than a demonstrably effective new method.
Paper 2 likely has higher scientific impact due to stronger conceptual novelty and broader cross-field relevance: it offers a clear methodological framework separating behavioral sensitivity, representation, and causal dependence, with rigorous negative/causal results (steering nulls, controls, generalization). This directly informs how to interpret mechanistic evidence in LLMs and affects alignment, interpretability, and computational social science. Paper 1 is practically valuable for multi-agent LLM coordination and efficiency, but is more incremental within an active engineering space and may age with rapidly changing agentic baselines.
Paper 2 has higher broader scientific impact: it introduces a clear evaluation framework separating behavioral sensitivity, linear decodability, and causal control, with a timely and generalizable negative result relevant to interpretability, cognitive modeling, and LLM evaluation. Its methodology (readout + activation steering + generalization tests) targets a foundational question across many LLM applications. Paper 1 is highly impactful industrially and methodologically solid, but it is more domain-specific (livestreaming recommendation) and its novelty is mainly in system design/engineering rather than a broadly transferable scientific claim.
Paper 1 offers higher potential impact due to its immediate real-world applicability in deploying resource-intensive Vision-Language Models. By introducing a structured pruning method that preserves Chain-of-Thought reasoning, it solves a major bottleneck in AI efficiency. While Paper 2 provides valuable theoretical insights into LLM interpretability and behavioral simulation, Paper 1 addresses a more pressing, widespread engineering challenge. The ability to compress state-of-the-art VLMs by up to 30-50% without losing reasoning consistency will directly benefit researchers and industry practitioners, ensuring broader and more immediate technological adoption.
Paper 2 is more likely to have higher impact: it introduces a self-improving loop combining weighted A* search with a learned relational GNN heuristic trained via Q-learning, showing striking zero-shot combinatorial generalization (e.g., Blocksworld 30→488 blocks) across multiple planning benchmarks. This targets a central, timely RL/planning challenge with clear downstream applications (automated planning, robotics, optimization) and broad relevance across DRL, classical planning, and graph learning. Paper 1 is methodologically careful and valuable for interpretability, but its main contribution is a negative result with narrower immediate application scope.
Paper 1 makes a methodologically rigorous and novel contribution by demonstrating that representational readout and causal control are dissociable in LLMs—a finding with broad implications for mechanistic interpretability research. It introduces a clear three-level evaluation framework (behavioral sensitivity, linear readout, causal steering) and provides a concrete cautionary result against over-interpreting probe-based evidence. Paper 2 offers useful benchmark design guidance but is more of a framework/taxonomy contribution with less empirical novelty. Paper 1's findings are more likely to influence ongoing interpretability and AI behavioral simulation research directions.
Paper 2 has higher potential impact due to its clear, broadly relevant conceptual contribution: it separates prompt-level behavior, decodable representations, and causal control, showing that linear readout does not imply behavioral reliance. This directly informs interpretability, behavioral evaluation, and the use of LLMs as human simulators across many domains (cog sci, econ, AI safety, mechanistic interpretability). The methodology (prompt tests, generalizing readouts, activation steering with sign/scale controls) is rigorous and timely. Paper 1 is useful but more incremental and narrower to symbolic regression workflows.
Paper 1 (SkillWeave) addresses a highly practical and timely challenge in LLM deployment—efficient multi-domain specialization under memory constraints—with concrete, quantifiable improvements (outperforming a 32B model with a 9B model, 4x speedup). This has broad applicability across the LLM industry and research community. Paper 2 makes a valuable methodological contribution to interpretability by distinguishing behavioral sensitivity, latent readout, and causal control, but its scope is narrower (one specific cognitive effect in one model), and its primary finding is a null result on causal steering, which, while informative, limits its immediate downstream impact.
Paper 2 addresses a fundamental question about whether LLMs genuinely implement human-like cognitive mechanisms versus producing surface-level behavioral mimicry. Its key finding—that representational readout and causal control are dissociable—has broad methodological implications for the rapidly growing field of mechanistic interpretability and LLM-as-simulator research. This insight will influence how researchers validate claims about model cognition. Paper 1, while technically detailed, proposes a niche actuarial framework for AI agent control that, despite rigor, addresses a narrower audience and relies on domain-specific constructs with less generalizable scientific insight.
Paper 1 likely has higher scientific impact due to greater novelty and broader conceptual relevance: it disentangles prompt sensitivity, representational decodability, and causal use via activation steering, challenging common interpretability assumptions. This contributes to mechanistic understanding of LLM cognition and evaluation methodology, with implications across interpretability, behavioral simulation, and safety. Paper 2 is practically useful and timely for SLM deployment, but its contribution is a prompting heuristic in a narrower setting (MCQA abstention), with more incremental scientific novelty and narrower cross-field impact.
Paper 1 offers a highly practical and novel methodological breakthrough by adapting LLMs for censoring-aware survival analysis, a foundational and notoriously challenging task in clinical machine learning. By demonstrating superior performance over established clinical scores and deep learning models on real-world medical datasets, it has immediate, high-impact applications in healthcare. Paper 2 provides a valuable cautionary finding for mechanistic interpretability, but Paper 1's direct improvements to clinical predictive modeling give it broader, more tangible cross-disciplinary impact.
LipoAgent addresses a critical bottleneck in drug delivery design by combining fine-tuned LLMs with a safety-aware multi-agent framework for lipid nanoparticle discovery. It demonstrates strong quantitative improvements (32% relative improvement) and wet-lab validation, bridging computational prediction and real biological outcomes. This has direct translational applications in mRNA therapeutics. Paper 2, while methodologically interesting in probing LLM cognitive mechanisms, yields primarily null results regarding causal control and has narrower impact, contributing mainly to the interpretability/AI psychology niche without clear practical applications.
Paper 2 makes a methodologically rigorous empirical contribution with broad implications for AI interpretability and mechanistic understanding of LLMs. Its key finding—that linear readout of internal representations does not imply causal control over behavior—challenges a widespread assumption in the growing mechanistic interpretability field. This distinction between representation and functional reliance has immediate methodological implications across AI safety, interpretability, and computational cognitive science. Paper 1, while addressing a timely topic (agentic AI governance), is primarily a conceptual/theoretical framework with illustrative cases rather than empirical validation, limiting its immediate scientific impact.
Paper 2 has higher likely impact: it introduces a novel, quantifiable concept ("governance horizon") and validates it at ecosystem scale (2.1M repos) with strong empirical fit, plus policy-relevant counterfactual interventions. Its applications are immediate for AI governance, licensing, compliance, and supply-chain provenance across open-weight model ecosystems, with relevance to platforms and regulators. Breadth spans ML, policy, security, and software supply-chain research. Paper 1 is methodologically interesting for interpretability/causal representation claims, but its narrower scope and largely null causal-control result likely limit broad downstream impact.
Paper 1 presents a novel operational framework integrating virtual water accounting into data center power dispatch optimization, combining differentiable optimization layers with deep learning—a technically innovative approach addressing the critical water-energy nexus. It has clear real-world applications for sustainable data center operations, an increasingly urgent topic. Paper 2 offers valuable mechanistic insights into LLM representations but is more narrowly focused on one cognitive effect and yields primarily null results on causal control. Paper 1's cross-disciplinary impact (energy, water, computing) and practical applicability give it higher potential impact.