Imperfect World Models are Exploitable
Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy
Abstract
We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.
AI Impact Assessments
Scientific Impact Assessment: "Imperfect World Models are Exploitable"
1. Core Contribution
This paper introduces a formal, ordinal definition of model exploitation in reinforcement learning: two transition models are exploitable if they strictly rank some pair of policies in opposite orders. This is directly analogous to the reward hacking definition of Skalse et al. (2022), but applied to transition dynamics rather than reward functions. The paper's central message is that imperfect world models almost inevitably invert policy orderings on sufficiently rich policy sets, a sobering result for model-based RL.
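To make the ordinal definition concrete, the following is a minimal sketch of an exploitability check on a tabular MDP. The two-state environment, the helper names (`policy_value`, `is_exploitable`, `make_P`), and the numbers are illustrative assumptions and are not taken from the paper; the check compares a finite list of policies only.

```python
import numpy as np

def policy_value(P, R, pi, gamma, mu):
    """Expected discounted return of policy pi under transition model P.

    P: (S, A, S) transitions, R: (S, A) rewards,
    pi: (S, A) action probabilities, mu: (S,) initial-state distribution.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)  # state-to-state kernel under pi
    r_pi = np.einsum("sa,sa->s", pi, R)    # expected one-step reward under pi
    v = np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)
    return float(mu @ v)

def is_exploitable(P_model, P_true, R, policies, gamma, mu, tol=1e-9):
    """True if the model strictly prefers some policy i to j while the
    true dynamics strictly prefer j to i."""
    v_model = [policy_value(P_model, R, pi, gamma, mu) for pi in policies]
    v_true = [policy_value(P_true, R, pi, gamma, mu) for pi in policies]
    n = len(policies)
    return any(
        v_model[i] > v_model[j] + tol and v_true[i] < v_true[j] - tol
        for i in range(n) for j in range(n)
    )

def make_P(p):
    """Hypothetical toy dynamics: the risky action reaches state 1 w.p. p."""
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0        # safe action: stay in state 0
    P[0, 1, 1] = p          # risky action: move to state 1
    P[0, 1, 0] = 1.0 - p    # risky action: otherwise stay in state 0
    P[1, :, 1] = 1.0        # state 1 is absorbing
    return P

R = np.array([[0.5, 0.0], [1.0, 1.0]])  # rewards indexed by (state, action)
mu = np.array([1.0, 0.0])               # always start in state 0
gamma = 0.9
P_true, P_model = make_P(0.1), make_P(0.9)  # the model is overoptimistic
always_safe = np.array([[1.0, 0.0], [1.0, 0.0]])
always_risky = np.array([[0.0, 1.0], [1.0, 0.0]])
policies = [always_safe, always_risky]

print(is_exploitable(P_model, P_true, R, policies, gamma, mu))
```

In this toy instance the safe policy is worth 5.0 under both models, while the risky policy is worth roughly 8.90 under the model and roughly 4.74 under the true dynamics, so the two models invert the ordering.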
The key contributions are: (1) a formal definition of exploitation as value inversion between transition models; (2) a unified theory connecting reward hacking and model exploitation via the rationality of value functions in policy space; (3) a proof that exploitation is unavoidable on any policy set containing an open subset (Theorem 1); (4) a demonstration that finite-policy-set guarantees available for reward hacking do not transfer to exploitation; and (5) a relaxed ε-exploitability notion with a tight safe horizon bound (Theorem 2).
2. Methodological Rigor
The mathematical development is rigorous and well-structured. The proofs leverage the rationality of the value function in policy parameters (Proposition 4), which is a clean and underappreciated observation that enables real-analytic machinery. The gradient-based argument (Lemmas 1 and 2) provides an elegant geometric characterization: at any point in policy space, the gradients of two value functions are either linearly independent (yielding exploitation via Lemma 1), antiparallel (likewise yielding exploitation), or positively proportional (forcing equivalence via Lemma 2). This trichotomy is complete and intuitive.
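The trichotomy can be illustrated numerically. The sketch below classifies a pair of nonzero gradient vectors and, when they are not positively proportional, returns a direction that increases one value to first order while decreasing the other. The function names and the cosine-based test are assumptions made for illustration; this conveys first-order intuition only and is not the paper's proof of Lemma 1 or Lemma 2.

```python
import numpy as np

def classify_gradients(g1, g2, tol=1e-9):
    """Classify two nonzero gradients by the angle between them."""
    g1, g2 = np.asarray(g1, dtype=float), np.asarray(g2, dtype=float)
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    if n1 <= tol or n2 <= tol:
        raise ValueError("this sketch assumes both gradients are nonzero")
    cos = float(g1 @ g2) / (n1 * n2)
    if cos >= 1.0 - tol:
        return "positively proportional"
    if cos <= -1.0 + tol:
        return "antiparallel"
    return "linearly independent"

def disagreement_direction(g1, g2):
    """Difference of unit gradients, d = g1/|g1| - g2/|g2|.

    Whenever the gradients are not positively proportional,
    g1 . d = |g1| (1 - cos) > 0 and g2 . d = |g2| (cos - 1) < 0.
    """
    g1, g2 = np.asarray(g1, dtype=float), np.asarray(g2, dtype=float)
    return g1 / np.linalg.norm(g1) - g2 / np.linalg.norm(g2)

g_model, g_true = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(classify_gradients(g_model, g_true))
d = disagreement_direction(g_model, g_true)
print(g_model @ d > 0, g_true @ d < 0)
```

A small step along `d` raises the first value and lowers the second, which is the local picture behind an ordering inversion between nearby policies.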
The paper is careful about edge cases. The authors identify that the proof strategy from Skalse et al. (2022) does not transfer due to the nonlinearity of value in transition parameters and the constrained geometry of probability simplices (Section 3.2.1). They also catch a minor gap in Skalse et al.'s Theorem 2 (Proposition 2, requiring non-collinearity of visit counts), which demonstrates careful reading of prior work.
The safe horizon result (Theorem 2) is proven tight via an explicit two-state construction, which strengthens its utility. The connection to the tight simulation lemma of Lobel and Parr (2024) is natural and well-executed.
3. Potential Impact
Theoretical impact: The paper establishes model exploitation as a formal concept parallel to reward hacking, creating a bridge between two previously separate safety concerns in RL. The unified framework via value function rationality is a contribution that may find applications beyond the specific results here. The observation that exploitation is structurally harder to avoid than hacking (no finite-policy-set guarantees) is a meaningful negative result.
Practical relevance: The ε-exploitability framework and safe horizon bound (Theorem 2, Corollary 4) offer practitioners actionable guidance: given a model error bound δ and tolerance ε, one can compute the maximum effective horizon for safe planning. The "square-root heuristic" (1/(1−γ) < √(ε/δ)) is memorable and practically useful. This connects to and corroborates prior work by Jiang et al. (2015) on effective planning horizons.
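As a rough illustration of how the square-root heuristic might be used, the sketch below checks whether a discount factor satisfies 1/(1−γ) < √(ε/δ) and solves for the largest admissible discount. It encodes only the heuristic quoted above, not the exact statement or constants of Theorem 2, and the function names and the choice to return `None` when no discount qualifies are assumptions.

```python
import math

def effective_horizon(gamma):
    """Effective planning horizon 1 / (1 - gamma) for 0 <= gamma < 1."""
    return 1.0 / (1.0 - gamma)

def satisfies_sqrt_heuristic(gamma, delta, eps):
    """Check 1 / (1 - gamma) < sqrt(eps / delta)."""
    return effective_horizon(gamma) < math.sqrt(eps / delta)

def max_discount(delta, eps):
    """Supremum of discount factors that satisfy the heuristic.

    Rearranging gives gamma < 1 - sqrt(delta / eps). Returns None when
    even gamma = 0 (horizon 1) fails the inequality.
    """
    bound = math.sqrt(eps / delta)
    if bound <= 1.0:
        return None
    return 1.0 - 1.0 / bound

delta, eps = 0.01, 1.0  # hypothetical model error bound and tolerance
print(max_discount(delta, eps))
print(satisfies_sqrt_heuristic(0.8, delta, eps))
print(satisfies_sqrt_heuristic(0.95, delta, eps))
```

With these hypothetical values the horizon bound is √(1.0/0.01) = 10, so a discount of 0.8 (horizon 5) passes while 0.95 (horizon 20) does not.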
Breadth of influence: The results apply in principle to any model-based decision-making system—Dreamer, MuZero, JEPA-based models, sim-to-real transfer, etc. The paper's framing connects to economics (Lucas critique), robust control, and evolutionary biology, suggesting broad interdisciplinary relevance, though concrete applications to these domains remain to be demonstrated.
4. Timeliness & Relevance
This work is timely given the explosion of interest in world models (Dreamer V3, JEPA, etc.) and the increasing reliance on learned dynamics models for policy synthesis. As world models scale to richer observation spaces (pixels, language), understanding when planning inside such models is safe becomes increasingly critical. The paper addresses this directly from a theoretical angle. The AI safety community's growing concern about reward hacking and specification gaming makes the parallel treatment of model exploitation particularly relevant.
5. Strengths & Limitations
Strengths:
Limitations:
Overall Assessment: This is a solid theoretical contribution that formalizes an important but previously informal concept in model-based RL. The unified treatment of reward hacking and model exploitation via value function rationality is the paper's most lasting contribution. The results are primarily negative (exploitation is unavoidable) with one constructive result (safe horizon). The work would benefit from extensions to continuous/latent-state settings and empirical grounding, but as a foundational theoretical paper, it establishes the right definitions and baseline results for future work.
Generated May 18, 2026
Comparison History (23)
Paper 1 addresses a highly timely and universally relevant topic: the productivity impact of GenAI on knowledge workers. By introducing and empirically validating the concept of 'AI Interaction Competence', it provides a highly citable framework applicable across economics, management, education, and HCI. While Paper 2 offers rigorous theoretical contributions to AI safety, Paper 1's immediate real-world applicability, randomized controlled experimental design, and broad cross-disciplinary appeal give it a higher potential for widespread scientific and societal impact.
Paper 2 establishes fundamental theoretical bridges between reward hacking and model exploitation in reinforcement learning, addressing core AI safety challenges. Its general theory and proofs are likely to impact a broad range of AI and RL research. Paper 1, while highly relevant for autonomous driving safety, is a narrower empirical study focused on a specific class of VLA models, resulting in a more localized scientific impact.
Paper 1 makes fundamental theoretical contributions by formally defining model exploitation in RL, proving its essential unavoidability, and establishing a formal bridge between reward hacking and model exploitation. These results have broad implications for AI safety, world model-based planning, and alignment research—areas of growing importance. Paper 2 presents a useful engineering contribution (latent action reparameterization for LLM agents) that improves inference efficiency, but it is more incremental and narrower in scope. Paper 1's theoretical framework is likely to influence multiple research directions and serve as a foundational reference for safe planning under imperfect models.
Paper 2 likely has higher impact due to timely relevance to AI governance and LLM deployment, clear real-world applicability (auditing/monitoring/intervention for compliance), and breadth across ML, formal methods, and policy. It proposes concrete techniques (LTL-based offline/online monitoring, predictive and intervening monitors) with empirical results and comparative baselines, supporting methodological rigor and adoption potential. Paper 1 offers strong theoretical novelty in RL/world-model exploitation, but its impact may be narrower and more conceptual, with less immediate deployment leverage than compliance tooling for widely used LLM systems.
Paper 1 provides a fully open and auditable pipeline for clinical LLMs, directly addressing the critical need for transparency and reproducibility in medical AI. Its practical utility, open-source release, and application to a high-stakes domain like healthcare suggest it will have widespread adoption and substantial real-world impact. While Paper 2 offers valuable theoretical insights into RL safety, Paper 1's immediate relevance to clinical decision support systems gives it broader cross-disciplinary potential.
Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and drawing formal connections to reward hacking—a broadly relevant safety concern. These results have wide implications for any system using learned world models (model-based RL, planning, AI safety). Paper 2 presents a solid engineering contribution combining LLMs with RL for crystal generation, but is more domain-specific and incremental (combining existing techniques: chain-of-thought reasoning, RL alignment, LLMs). Paper 1's theoretical framework is more likely to influence multiple research communities and shape foundational understanding of safe AI planning.
CLEF addresses a significant practical gap in clinical EEG interpretation with a comprehensive foundation model evaluated on a massive benchmark (234 tasks, 260k sessions). Its immediate clinical applicability, large-scale evaluation, and strong empirical results give it broad impact across neurology, clinical AI, and EEG research. While Paper 1 makes important theoretical contributions connecting reward hacking and model exploitation in RL, its impact is more niche and theoretical. Paper 2's combination of methodological innovation, scale, and direct clinical relevance positions it for broader and more immediate scientific impact.
Paper 2 addresses a fundamental theoretical question about the reliability of world models in RL, establishing formal connections between reward hacking and model exploitation with proofs of inevitability. This has broad implications across all model-based RL and AI safety research. Its theoretical contributions (novel definitions, impossibility results, safe horizons) provide foundational tools applicable across many domains. Paper 1, while solid applied work on online strategy optimization for LLM social agents, addresses a narrower problem with incremental methodological contributions (combining bandits with neural surrogates) on a specific benchmark.
Paper 1 establishes a fundamental theoretical framework for model exploitation and reward hacking in reinforcement learning. By proving the inevitability of exploitation under certain conditions, it provides crucial, long-lasting insights into AI safety and the limits of world models. While Paper 2 offers a valuable empirical benchmark and system for GUI control, Paper 1's theoretical contributions have broader, more foundational implications across multiple subfields of AI, leading to higher potential scientific impact.
Paper 2 has higher estimated impact: it introduces a principled, general framework (SMC-based) for LLM-driven program evolution with explicit convergence control and finite-sample complexity bounds, plus broad empirical validation across diverse scientific-discovery tasks and practical efficiency gains (fewer LLM calls). This combination of methodological rigor, real-world applicability, and cross-domain relevance is likely to influence both automated discovery and ML systems research. Paper 1 is conceptually novel and important for RL safety theory, but its impact may be narrower and more theoretical with less immediate tooling/benchmark-driven adoption.
Paper 2 introduces an open-source framework and state-of-the-art models for LLM agents, addressing a critical bottleneck in empirical AI research. Its practical utility, scalable infrastructure, and strong benchmark results across multiple domains will likely drive widespread adoption and accelerate applied research, leading to a broader and faster scientific impact than the theoretical contributions of Paper 1.
Paper 1 develops fundamental theoretical results connecting reward hacking and model exploitation in reinforcement learning, proving near-inevitability of exploitation and establishing safe planning horizons. This addresses a core problem in AI safety with broad implications for any system using learned world models. Paper 2, while practically useful for EDA/RTL benchmarking, addresses a narrower domain-specific problem (maintaining RTL generation benchmarks) with more limited cross-disciplinary impact. Paper 1's theoretical contributions are more likely to influence multiple research directions in RL, AI safety, and planning.
Paper 2 offers fundamental theoretical contributions to Reinforcement Learning and AI safety by formalizing model exploitation and its inevitability, establishing a bridge to reward hacking. Such foundational insights into the limits of safe planning in world models have a broader, longer-lasting scientific impact across the AI community compared to Paper 1, which, while practically valuable, presents a more domain-specific engineering framework for microservice AIOps.
Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and drawing formal connections to reward hacking. These foundational insights have broad implications for AI safety, world model-based planning, and alignment research. While Paper 2 presents a useful practical contribution (NudgeRL for efficient exploration in RLVR), it is more incremental and narrowly focused on a specific training paradigm for LLMs. Paper 1's theoretical framework is likely to have longer-lasting and broader influence across multiple subfields of AI research.
Paper 1 offers a more foundational, broadly applicable theoretical contribution: a formal definition of model exploitation, a general theory unifying exploitation and reward hacking, and inevitability/avoidability results (including safe-horizon conditions). This targets core questions in RL safety and world-model planning with likely cross-domain impact (alignment, verification, agent design). Paper 2 is timely and practically valuable for legal NLP, especially contamination-aware evaluation and neuro-symbolic robustness, but its scope is narrower (tax law) and primarily empirical; its broader scientific impact depends on generalization beyond the domain and benchmark.
Paper 2 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal definitions, impossibility results, and safe planning horizons. This has broad implications across RL, AI safety, and model-based planning. Paper 1 presents a useful engineering contribution (property-guided LLM synthesis for planning heuristics) with strong empirical results, but its scope is narrower. Paper 2's theoretical framework is more likely to influence multiple research directions and has higher relevance given growing concerns about AI safety and reliable world models.
Paper 1 has higher likely scientific impact: it introduces a formal definition of model exploitation in RL, develops general theory linking exploitation and reward hacking, proves (near-)unavoidability results, and derives a “safe horizon” characterization. This is methodologically rigorous, broadly relevant to RL, model-based planning, AI safety, and evaluation, and timely given widespread world-model use. Paper 2 targets an important application (legal AI) and proposes a promising graph-constrained framework plus a dataset, but current evidence is limited (small corpus, no direct RAG baseline comparison), and the contribution is more domain-specific and engineering-oriented.
Paper 2 offers higher scientific impact by providing foundational, theoretical proofs on the limits of safe planning in reinforcement learning. By formally defining 'model exploitation' and proving its inevitability in large policy sets, it establishes a rigorous bridge to reward hacking. While Paper 1 provides highly practical empirical insights for LLM monitoring, Paper 2's theoretical contributions offer deeper methodological rigor. Its formalization of critical safety limitations will likely shape future fundamental research directions across the broader field of AI alignment and model-based RL, creating a longer-lasting conceptual impact.
Paper 1 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal impossibility results and safety guarantees applicable broadly across RL and AI safety. Its contributions are foundational, with implications for any system using learned world models. Paper 2 presents an applied engineering contribution (an LLM agent framework for colloidal packing simulations) that, while useful, has narrower scope and incremental novelty, primarily integrating existing tools (HOOMD-blue, MCP, LLMs) rather than advancing core scientific understanding.
Paper 1 establishes fundamental theoretical results connecting reward hacking and model exploitation in RL, proving near-inevitability of exploitation and deriving safe planning horizons. This creates a new formal framework with broad implications for model-based RL safety. Paper 2 presents a practical but incremental contribution—a new alignment training method for reasoning LLMs. While useful, CASPO builds on existing DPO methods with confidence calibration. Paper 1's theoretical contributions are more foundational, likely to be widely cited across RL, AI safety, and planning communities, with longer-lasting impact.