Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens
Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.
This paper provides the first rigorous formal treatment of the Eliciting Latent Knowledge (ELK) problem, originally posed informally by Christiano et al. (2021). Using Causal Influence Diagrams (CIDs), the authors formalize key concepts—observable vs. latent variables, honesty vs. truthfulness, goal misgeneralization, and evaluation simulation—and prove an impossibility theorem: no feedback-based training strategy that depends only on agent behavior can guarantee producing an honest agent, even when evaluations are always correct during training.
The central mechanism behind this impossibility is goal-environment ambiguity: when the evaluator is imperfect outside the training distribution, the training environment cannot distinguish between an agent that learns to report its genuine beliefs and one that learns to simulate what the evaluator would say. This formalizes the "human simulator" failure mode from Christiano et al. in precise mathematical terms.
The formalization is carefully constructed and internally consistent. The paper builds incrementally from CID definitions through agents, distributional shifts, training strategies, and finally to the distinction between truthfulness (saying what is true) and honesty (saying what one believes to be true). The proofs in the appendix are detailed and appear sound.
Theorem 4.1 (truthfulness-honesty equivalence for capable agents) leverages Richens and Everitt's (2024) result that robustly capable agents learn causal world models, extending it to show that sufficiently capable agents who are in a position to "correctly guess" a variable will be truthful whenever honest. The proof in Section 7.8 carefully tracks error propagation through the causal graph.
Theorem 5.2 (the main impossibility result) is relatively straightforward once the formal machinery is in place—it follows from Lemma 1 by constructing two utility functions (truth-tracking vs. evaluator-tracking) that are indistinguishable during training but diverge out-of-distribution. While the proof is clean, one could argue the result is somewhat expected given the setup: the impossibility essentially restates that behavioral equivalence during training cannot distinguish between two internally different objectives. The strength lies not in the surprising nature of the result but in making the argument watertight within a formal framework.
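The two-utility argument can be illustrated with a toy sketch. This is not the paper's CID construction: the binary latent variable, the `evaluator_belief` function, and the "evaluator is stuck at 1 out of distribution" failure are all hypothetical assumptions chosen only to make the ambiguity concrete.

```python
# Toy illustration (hypothetical setup, not the paper's formal construction).
# A latent variable x takes values in {0, 1}. The agent answers 0 or 1.

def evaluator_belief(x, in_training):
    # Assumption: the evaluator is always correct during training, but
    # out of distribution its judgement is stuck at 1 (e.g. a fooled sensor).
    return x if in_training else 1

def u_truth(answer, x, in_training):
    # Truth-tracking utility: rewards answers that match the latent variable.
    return 1 if answer == x else 0

def u_eval(answer, x, in_training):
    # Evaluator-tracking utility: rewards answers the evaluator would accept.
    return 1 if answer == evaluator_belief(x, in_training) else 0

def best_answer(utility, x, in_training):
    # The answer an agent optimising this utility would give.
    return max((0, 1), key=lambda a: utility(a, x, in_training))

train_cases = [(x, True) for x in (0, 1)]
shift_cases = [(x, False) for x in (0, 1)]

# During training the two utilities assign identical values to every answer,
# so behaviour-based feedback cannot tell them apart.
agree_in_training = all(
    u_truth(a, x, t) == u_eval(a, x, t)
    for a in (0, 1)
    for x, t in train_cases
)

# Out of distribution, the optimal answers differ whenever the evaluator is wrong.
diverging = [
    x for x, t in shift_cases
    if best_answer(u_truth, x, t) != best_answer(u_eval, x, t)
]

print("identical in training:", agree_in_training)
print("latent values where optimal answers diverge after the shift:", diverging)
```

In this sketch, any training signal computed from the agent's answers alone is the same under both utilities, which is the intuition behind the behavioral-equivalence reading of the theorem.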
Limitations in rigor: The theorems inherit assumptions from Richens and Everitt (2024), including domain dependence and unmediated decision tasks. The restriction to single-decision, single-utility CIDs is significant. The assumption that agent and developers share the same variable set (no ontology mismatch) is explicitly acknowledged but substantially limits applicability to real AI systems. The question format is also restrictive—only asking about values of specific variables, not about causal structure or counterfactuals.
Theoretical impact: This paper establishes a formal baseline for ELK research. The impossibility result clearly delineates what purely behavioral training strategies cannot achieve, which should redirect research toward approaches that go beyond behavioral feedback—such as interpretability-based methods, mechanistic interventions, or architectural constraints. This mirrors how impossibility results in other fields (e.g., Arrow's theorem, No Free Lunch theorems) have productively shaped research agendas.
Practical implications: The paper's framing clarifies why current RLHF-style training is fundamentally limited for ensuring honesty about latent variables. The evaluation simulator concept provides a crisp way to think about failure modes in deployed systems. However, the paper offers no solutions, which limits immediate practical utility.
Broader influence: The formalization connects ELK to established literatures on mechanism design, proper scoring rules, and causal reasoning, potentially enabling cross-pollination. The careful distinction between honesty and truthfulness, and the conditions under which they coincide, could influence how the AI safety community thinks about alignment properties.
This paper is highly timely. As frontier AI systems become increasingly capable, the gap between what they "know" and what their developers can verify grows. ELK is widely considered one of the core open problems in AI alignment, and Christiano et al.'s original formulation has been influential but informal. A formal treatment was overdue.
The paper also connects to active debates about AI honesty, deception, and interpretability (Burns et al. 2022, Mallen et al. 2024, Bengio et al. 2025). The impossibility result provides theoretical grounding for the intuition that behavioral training alone is insufficient, strengthening arguments for investing in interpretability and mechanistic understanding.
Missing elements: The paper offers no empirical validation and no connection to specific training paradigms (RLHF, DPO, etc.). The gap between the CID formalism and practical deep learning systems is not bridged. Comparison with concurrent work on AI honesty (e.g., constitutional AI approaches) would strengthen positioning.
This is a solid theoretical contribution that brings much-needed formal precision to an important AI safety problem. The impossibility result is clean and well-proven, though not deeply surprising. The paper's primary value is as a foundational reference that establishes formal definitions and identifies the boundaries of what behavioral training can achieve. Its impact will largely depend on whether the community adopts its formalism and whether follow-up work provides constructive solutions within this framework.
Generated Jun 11, 2026
Paper 2 presents a fundamental theoretical impossibility result in AI safety and alignment (Eliciting Latent Knowledge). While Paper 1 provides impressive empirical state-of-the-art results in automated theorem proving, empirical benchmarks are frequently surpassed. Foundational impossibility theorems, however, tend to have a lasting, profound impact by shaping the theoretical boundaries and future research trajectories of the entire field.
Paper 1 addresses a fundamental theoretical problem in AI alignment—the impossibility of reliably eliciting honest beliefs from advanced AI systems—proving a formal impossibility theorem with broad implications for AI safety research. This has deep relevance as AI capabilities advance. Paper 2 is an exploratory empirical study with non-significant results, limited sample size, poor inter-rater reliability, and narrow application scope (skill-augmented LLMs for a specific biomedical task). Paper 1's theoretical contributions are more novel, rigorous, and broadly impactful across AI safety, alignment, and machine learning fields.
Paper 1 addresses a fundamental theoretical question in AI alignment—whether latent knowledge can be reliably elicited from AI agents—and proves a rigorous impossibility theorem with broad implications for AI safety. Its formalization using Causal Influence Diagrams provides a lasting theoretical framework applicable across the entire field. Paper 2 identifies a valid but narrower issue (aggregate metric inversions in automated research agents) with a domain-specific demonstration. While useful, it addresses a more incremental engineering concern. Paper 1's impossibility result has deeper, more far-reaching implications for the trajectory of AI development and safety research.
Paper 1 addresses a fundamental theoretical question in AI alignment—whether we can reliably train AI systems to honestly report their beliefs about latent variables. Its impossibility theorem has profound implications for AI safety research, establishing formal limits on feedback-based training approaches. This result is highly novel, methodologically rigorous (using Causal Influence Diagrams), and broadly relevant as AI systems become more capable. Paper 2, while practically useful, is an engineering contribution to benchmark construction—a more incremental advance with narrower impact. The theoretical foundations laid by Paper 1 will likely influence alignment research for years.
Paper 2 addresses a fundamental and highly impactful problem in AI safety and alignment (Eliciting Latent Knowledge) by providing a formal impossibility theorem. While Paper 1 offers a valuable domain-specific medical application, Paper 2's theoretical contributions have a significantly broader impact across the entire field of AI, influencing how researchers approach the foundational design of safe and honest AI systems.
Paper 2 establishes a formal impossibility theorem for eliciting latent knowledge (ELK), a foundational problem in AI alignment. It provides a rigorous mathematical framework using Causal Influence Diagrams and proves fundamental limits of feedback-based training, which has broad implications for all alignment approaches. Paper 1, while interesting in identifying a vulnerability paradox in LLM safety, is more narrowly focused on a specific attack vector that may be patched. Paper 2's theoretical contribution is more durable and foundational, likely influencing alignment research paradigms for years, similar to how impossibility results in other fields (e.g., Arrow's theorem) have had lasting impact.
While Paper 1 offers a profound theoretical impossibility result for AI alignment, Paper 2 demonstrates massive scale and immediate real-world utility. By training a foundation model on 200 million patients and validating it across 1,000+ diseases and financial tasks, Paper 2 bridges AI, epidemiology, and healthcare economics. Its proven empirical superiority and broad applicability across regulatory and clinical domains give it a wider interdisciplinary reach and higher potential for immediate, transformative scientific and societal impact.
Paper 1 is likely higher impact: it offers a novel formalization of the ELK alignment problem via causal influence diagrams and proves a broad impossibility theorem, shaping what classes of training methods can and cannot work. This has wide cross-field relevance (AI safety, RL, interpretability, causal modeling) and is timely given frontier-model deployment concerns. Paper 2 targets valuable applications (equation discovery) but appears closer to incremental advances in symbolic regression/AutoML with multi-agent/metaheuristic framing; its claims may be impactful if rigorously validated, yet its scope is narrower and more contingent on empirical performance.
Paper 2 presents a foundation model for wearable health pretrained on an unprecedented scale (1 trillion minutes, 5 million participants), demonstrating broad applicability across 35 health prediction tasks with concrete clinical validation. Its real-world impact potential is enormous given the ubiquity of wearable devices and growing demand for personalized health. While Paper 1 makes important theoretical contributions to AI alignment (impossibility theorem for ELK), its impact is more narrowly concentrated in the AI safety theory community. Paper 2's combination of scale, methodological breadth, practical applicability, and clinical validation suggests broader and more immediate scientific impact.
Paper 1 addresses the critical AI safety problem of eliciting latent knowledge with a rigorous impossibility theorem formalized via Causal Influence Diagrams. This result has immediate and profound implications for AI alignment—a field of rapidly growing importance as AI capabilities advance. The impossibility result constrains the space of possible alignment strategies, directly shaping future research directions. Paper 2 attempts an ambitious unification across Bayesian inference, game theory, and thermodynamics, but such grand unifying frameworks, while intellectually appealing, often struggle to gain traction due to their breadth and abstraction. Paper 1's focused, actionable negative result in a timely domain gives it higher estimated impact.