Imperfect World Models are Exploitable

Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

May 15, 2026

Frozen v1: this version was superseded on arXiv by v2. Stats below reflect the state at freeze time and will not change.

#296 of 2292 · Artificial Intelligence

Tournament Score

1502 ± 46 (scale: 1050 to 1800)

Win Rate: 74%

Rating

6.8 / 10

Significance 7 · Rigor 8.5 · Novelty 7 · Clarity 8.5

Abstract

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Imperfect World Models are Exploitable"

1. Core Contribution

This paper introduces a formal, ordinal definition of model exploitation in reinforcement learning: a pair of transition models is exploitable if one implies that some policy is strictly preferred over another while the other implies the reverse. This is directly analogous to the reward hacking definition of Skalse et al. (2022), but applied to transition dynamics rather than reward functions. The paper's central message is that imperfect world models almost inevitably invert policy orderings on sufficiently rich policy sets, a sobering result for model-based RL.
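The ordinal definition above can be checked by brute force on a small tabular MDP. The sketch below is illustrative only: the two-state MDP, the function names `policy_value` and `find_inversions`, and the restriction to deterministic policies evaluated from a single start state are all assumptions made here, not constructions taken from the paper.

```python
import itertools
import numpy as np

def policy_value(P, r, policy, gamma, start):
    """Discounted value of a deterministic policy from state `start`.

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    policy: length-S sequence giving the action taken in each state.
    """
    n_states = P.shape[0]
    idx = np.arange(n_states)
    policy = np.asarray(policy)
    P_pi = P[idx, policy]   # (S, S) transition matrix induced by the policy
    r_pi = r[idx, policy]   # (S,) reward vector induced by the policy
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v[start]

def find_inversions(P_true, P_model, r, gamma=0.9, start=0, tol=1e-9):
    """Pairs (p, q) that the model ranks p > q while the truth ranks p < q."""
    n_states, n_actions = r.shape
    policies = list(itertools.product(range(n_actions), repeat=n_states))
    true_v = {p: policy_value(P_true, r, p, gamma, start) for p in policies}
    model_v = {p: policy_value(P_model, r, p, gamma, start) for p in policies}
    return [
        (p, q)
        for p, q in itertools.permutations(policies, 2)
        if model_v[p] > model_v[q] + tol and true_v[p] < true_v[q] - tol
    ]

# Hypothetical example: state 1 is an absorbing goal paying reward 1.
# In state 0, action 0 is "safe" (stay, reward 0.5); action 1 is "risky"
# (reward 0, reaches the goal with probability 0.1 in truth, 0.9 in the model).
P_true = np.zeros((2, 2, 2))
P_true[0, 0] = [1.0, 0.0]
P_true[0, 1] = [0.9, 0.1]
P_true[1, :] = [0.0, 1.0]
P_model = P_true.copy()
P_model[0, 1] = [0.1, 0.9]   # the model is overconfident about the risky action
r = np.array([[0.5, 0.0], [1.0, 1.0]])

inversions = find_inversions(P_true, P_model, r, gamma=0.9, start=0)
print(inversions)
```

In this toy case the model prefers the risky action in state 0 while the true dynamics prefer the safe one, so every pair of policies that differ in that choice is inverted. Enumeration like this only scales to very small finite policy sets.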

The key contributions are: (1) a formal definition of exploitation as value inversion between transition models; (2) a unified theory connecting reward hacking and model exploitation via the rationality of value functions in policy space; (3) a proof that exploitation is unavoidable on any policy set containing an open subset (Theorem 1); (4) a demonstration that finite-policy-set guarantees available for reward hacking do not transfer to exploitation; and (5) a relaxed ε-exploitability notion with a tight safe horizon bound (Theorem 2).
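For readers unfamiliar with the term, "rationality" here means that the value function is a rational function (a ratio of polynomials) of the policy parameters. A short sketch of why this holds in a finite MDP, using standard notation that is assumed here rather than copied from the paper:

```latex
V^{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}
        = \frac{\operatorname{adj}(I - \gamma P_{\pi})\, r_{\pi}}{\det(I - \gamma P_{\pi})},
\qquad
P_{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s)\, P(s' \mid s, a),
\quad
r_{\pi}(s) = \sum_{a} \pi(a \mid s)\, r(s, a).
```

The entries of P_π and r_π are linear in the probabilities π(a | s), so the adjugate and determinant are polynomials in them, and the determinant is nonzero for 0 ≤ γ < 1. Each component of V^π is therefore a ratio of polynomials in the policy parameters.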

2. Methodological Rigor

The mathematical development is rigorous and well-structured. The proofs leverage the fact that the value function is a rational function of the policy parameters (Proposition 4), a clean and underappreciated observation that enables real-analytic machinery. The gradient-based argument (Lemmas 1 and 2) provides an elegant geometric characterization: at any point in policy space, the gradients of two value functions are either linearly independent (yielding exploitation via Lemma 1), antiparallel (again yielding exploitation), or positively proportional (forcing equivalence via Lemma 2). This trichotomy is complete and intuitive.
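The trichotomy can be illustrated numerically. The sketch below classifies two nonzero gradient vectors into the three cases and, when they are not positively proportional, builds a direction along which one value function increases and the other decreases to first order. The function names and the cosine-based test are assumptions made for illustration; this is intuition for the geometry, not the paper's proof of Lemma 1.

```python
import numpy as np

def classify_gradients(g1, g2, tol=1e-9):
    """Classify two gradients as in the trichotomy described above."""
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    n1, n2 = np.linalg.norm(g1), np.linalg.norm(g2)
    if n1 < tol or n2 < tol:
        return "degenerate"  # a zero gradient falls outside the three cases
    cosine = float(g1 @ g2) / (n1 * n2)
    if cosine > 1 - tol:
        return "positively proportional"
    if cosine < -1 + tol:
        return "antiparallel"
    return "linearly independent"

def inverting_direction(g1, g2):
    """Direction d with g1.d > 0 and g2.d < 0 (needs nonzero, non-aligned gradients).

    With unit vectors u1, u2 and d = u1 - u2:
    g1.d = |g1| (1 - cos) > 0 and g2.d = |g2| (cos - 1) < 0 whenever cos < 1.
    """
    g1 = np.asarray(g1, dtype=float)
    g2 = np.asarray(g2, dtype=float)
    return g1 / np.linalg.norm(g1) - g2 / np.linalg.norm(g2)

g_true, g_model = np.array([1.0, 0.0]), np.array([0.0, 1.0])
d = inverting_direction(g_true, g_model)
print(classify_gradients(g_true, g_model), g_true @ d, g_model @ d)
```

A small step along `d` from an interior policy raises the first value function and lowers the second, which is the first-order picture behind an ordering inversion.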

The paper is careful about edge cases. The authors identify that the proof strategy from Skalse et al. (2022) does not transfer due to the nonlinearity of value in transition parameters and the constrained geometry of probability simplices (Section 3.2.1). They also catch a minor gap in Skalse et al.'s Theorem 2 (Proposition 2, requiring non-collinearity of visit counts), which demonstrates careful reading of prior work.

The safe horizon result (Theorem 2) is proven tight via an explicit two-state construction, which strengthens its utility. The connection to the tight simulation lemma of Lobel and Parr (2024) is natural and well-executed.

3. Potential Impact

Theoretical impact: The paper establishes model exploitation as a formal concept parallel to reward hacking, creating a bridge between two previously separate safety concerns in RL. The unified framework via value function rationality is a contribution that may find applications beyond the specific results here. The observation that exploitation is structurally harder to avoid than hacking (no finite-policy-set guarantees) is a meaningful negative result.

Practical relevance: The ε-exploitability framework and safe horizon bound (Theorem 2, Corollary 4) offer practitioners actionable guidance: given a model error bound δ and tolerance ε, one can compute the maximum effective horizon for safe planning. The "square-root heuristic" (1/(1−γ) < √(ε/δ)) is memorable and practically useful. This connects to and corroborates prior work by Jiang et al. (2015) on effective planning horizons.
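The square-root heuristic is easy to apply numerically. The sketch below uses only the inequality quoted above, 1/(1−γ) < √(ε/δ); the exact constants and the norm in which δ is measured in Theorem 2 and Corollary 4 are not reproduced here, and the function names are hypothetical. The shape of the heuristic is at least consistent with simulation-lemma style bounds, in which the value gap grows roughly like δ/(1−γ)².

```python
import math

def effective_horizon_cap(delta, eps):
    """Upper limit on 1/(1 - gamma) implied by the heuristic: sqrt(eps / delta)."""
    if delta <= 0 or eps <= 0:
        raise ValueError("delta and eps must be positive")
    return math.sqrt(eps / delta)

def is_within_safe_horizon(gamma, delta, eps):
    """True if the effective horizon 1/(1 - gamma) is strictly below the cap."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must lie in [0, 1)")
    return 1.0 / (1.0 - gamma) < effective_horizon_cap(delta, eps)

# Hypothetical numbers: model error 0.001 and tolerance 0.1.
cap = effective_horizon_cap(delta=1e-3, eps=0.1)
print(cap)
print(is_within_safe_horizon(0.80, 1e-3, 0.1))
print(is_within_safe_horizon(0.95, 1e-3, 0.1))
```

With these illustrative values the cap is about 10 effective steps, so a discount of 0.8 (effective horizon 5) passes while 0.95 (effective horizon 20) does not. Equivalently, the heuristic requires γ < 1 − √(δ/ε).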

Breadth of influence: The results apply in principle to any model-based decision-making system—Dreamer, MuZero, JEPA-based models, sim-to-real transfer, etc. The paper's framing connects to economics (Lucas critique), robust control, and evolutionary biology, suggesting broad interdisciplinary relevance, though concrete applications to these domains remain to be demonstrated.

4. Timeliness & Relevance

This work is timely given the explosion of interest in world models (Dreamer V3, JEPA, etc.) and the increasing reliance on learned dynamics models for policy synthesis. As world models scale to richer observation spaces (pixels, language), understanding when planning inside such models is safe becomes increasingly critical. The paper addresses this directly from a theoretical angle. The AI safety community's growing concern about reward hacking and specification gaming makes the parallel treatment of model exploitation particularly relevant.

5. Strengths & Limitations

Strengths:

Clean formalization that fills a genuine gap in the theory of model-based RL

The unified theory via value function rationality is technically elegant and yields both exploitation and hacking results as special cases

Careful treatment of the boundary between reward hacking and model exploitation (Proposition 3 shows exploitation → hacking but not vice versa)

The safe horizon result is tight and practically interpretable

Excellent exposition: Figures 1-3 and the running 3-state MDP example provide strong geometric intuition

Identification of a correction needed in prior work (Proposition 2)

Limitations:

Results are restricted to finite state-action MDPs, which limits direct applicability to modern deep RL settings with continuous spaces or latent-state world models

The impossibility results (Theorem 1) are essentially topological—they say exploitation exists but provide no information about its likelihood, severity, or practical frequency

The ε-exploitability bounds (Table 2) are conservative in practice, limiting their direct utility for tight safety certificates

Open Question 1, which concerns finite policy sets (the most practically relevant regime), is left unresolved

No empirical validation: while the paper is theoretical, even simple experiments showing exploitation in learned models would strengthen the practical motivation

The paper does not address how its results interact with common defenses (ensembles, pessimism, domain randomization)

Overall Assessment: This is a solid theoretical contribution that formalizes an important but previously informal concept in model-based RL. The unified treatment of reward hacking and model exploitation via value function rationality is the paper's most lasting contribution. The results are primarily negative (exploitation is unavoidable) with one constructive result (safe horizon). The work would benefit from extensions to continuous/latent-state settings and empirical grounding, but as a foundational theoretical paper, it establishes the right definitions and baseline results for future work.

Rating: 6.8 / 10

Significance 7 · Rigor 8.5 · Novelty 7 · Clarity 8.5

Generated May 18, 2026

Comparison History (23)

vs. Generative AI and the Productivity Divide: Human-AI Complementarities in Education

gemini-3.1 · 5/19/2026

Paper 1 addresses a highly timely and universally relevant topic: the productivity impact of GenAI on knowledge workers. By introducing and empirically validating the concept of 'AI Interaction Competence', it provides a highly citable framework applicable across economics, management, education, and HCI. While Paper 2 offers rigorous theoretical contributions to AI safety, Paper 1's immediate real-world applicability, randomized controlled experimental design, and broad cross-disciplinary appeal give it a higher potential for widespread scientific and societal impact.

vs. Is VLA Reasoning Faithful? Probing Safety of Chain-of-Causation

gemini-3.1 · 5/19/2026

Paper 2 establishes fundamental theoretical bridges between reward hacking and model exploitation in reinforcement learning, addressing core AI safety challenges. Its general theory and proofs are likely to impact a broad range of AI and RL research. Paper 1, while highly relevant for autonomous driving safety, is a narrower empirical study focused on a specific class of VLA models, resulting in a more localized scientific impact.

vs. Latent Action Reparameterization for Efficient Agent Inference

claude-opus-4.6 · 5/19/2026

Paper 1 makes fundamental theoretical contributions by formally defining model exploitation in RL, proving its essential unavoidability, and establishing a formal bridge between reward hacking and model exploitation. These results have broad implications for AI safety, world model-based planning, and alignment research—areas of growing importance. Paper 2 presents a useful engineering contribution (latent action reparameterization for LLM agents) that improves inference efficiency, but it is more incremental and narrower in scope. Paper 1's theoretical framework is likely to influence multiple research directions and serve as a foundational reference for safe planning under imperfect models.

vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

gpt-5.2 · 5/18/2026

Paper 2 likely has higher impact due to timely relevance to AI governance and LLM deployment, clear real-world applicability (auditing/monitoring/intervention for compliance), and breadth across ML, formal methods, and policy. It proposes concrete techniques (LTL-based offline/online monitoring, predictive and intervening monitors) with empirical results and comparative baselines, supporting methodological rigor and adoption potential. Paper 1 offers strong theoretical novelty in RL/world-model exploitation, but its impact may be narrower and more conceptual, with less immediate deployment leverage than compliance tooling for widely used LLM systems.

vs. Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

gemini-3.1 · 5/18/2026

Paper 1 provides a fully open and auditable pipeline for clinical LLMs, directly addressing the critical need for transparency and reproducibility in medical AI. Its practical utility, open-source release, and application to a high-stakes domain like healthcare suggest it will have widespread adoption and substantial real-world impact. While Paper 2 offers valuable theoretical insights into RL safety, Paper 1's immediate relevance to clinical decision support systems gives it broader cross-disciplinary potential.

vs. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

claude-opus-4.6 · 5/18/2026

Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and drawing formal connections to reward hacking—a broadly relevant safety concern. These results have wide implications for any system using learned world models (model-based RL, planning, AI safety). Paper 2 presents a solid engineering contribution combining LLMs with RL for crystal generation, but is more domain-specific and incremental (combining existing techniques: chain-of-thought reasoning, RL alignment, LLMs). Paper 1's theoretical framework is more likely to influence multiple research communities and shape foundational understanding of safe AI planning.

vs. CLEF: EEG Foundation Model for Learning Clinical Semantics

claude-opus-4.6 · 5/18/2026

CLEF addresses a significant practical gap in clinical EEG interpretation with a comprehensive foundation model evaluated on a massive benchmark (234 tasks, 260k sessions). Its immediate clinical applicability, large-scale evaluation, and strong empirical results give it broad impact across neurology, clinical AI, and EEG research. While Paper 1 makes important theoretical contributions connecting reward hacking and model exploitation in RL, its impact is more niche and theoretical. Paper 2's combination of methodological innovation, scale, and direct clinical relevance positions it for broader and more immediate scientific impact.

vs. ALSO: Adversarial Online Strategy Optimization for Social Agents

claude-opus-4.6 · 5/18/2026

Paper 2 addresses a fundamental theoretical question about the reliability of world models in RL, establishing formal connections between reward hacking and model exploitation with proofs of inevitability. This has broad implications across all model-based RL and AI safety research. Its theoretical contributions (novel definitions, impossibility results, safe horizons) provide foundational tools applicable across many domains. Paper 1, while solid applied work on online strategy optimization for LLM social agents, addresses a narrower problem with incremental methodological contributions (combining bandits with neural surrogates) on a specific benchmark.

vs. PAGER: Bridging the Semantic-Execution Gap in Point-Precise Geometric GUI Control

gemini-3.1 · 5/18/2026

Paper 1 establishes a fundamental theoretical framework for model exploitation and reward hacking in reinforcement learning. By proving the inevitability of exploitation under certain conditions, it provides crucial, long-lasting insights into AI safety and the limits of world models. While Paper 2 offers a valuable empirical benchmark and system for GUI control, Paper 1's theoretical contributions have broader, more foundational implications across multiple subfields of AI, leading to higher potential scientific impact.

vs. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution

gpt-5.2 · 5/18/2026

Paper 2 has higher estimated impact: it introduces a principled, general framework (SMC-based) for LLM-driven program evolution with explicit convergence control and finite-sample complexity bounds, plus broad empirical validation across diverse scientific-discovery tasks and practical efficiency gains (fewer LLM calls). This combination of methodological rigor, real-world applicability, and cross-domain relevance is likely to influence both automated discovery and ML systems research. Paper 1 is conceptually novel and important for RL safety theory, but its impact may be narrower and more theoretical with less immediate tooling/benchmark-driven adoption.

vs. Orchard: An Open-Source Agentic Modeling Framework

gemini-3.1 · 5/18/2026

Paper 2 introduces an open-source framework and state-of-the-art models for LLM agents, addressing a critical bottleneck in empirical AI research. Its practical utility, scalable infrastructure, and strong benchmark results across multiple domains will likely drive widespread adoption and accelerate applied research, leading to a broader and faster scientific impact than the theoretical contributions of Paper 1.

vs. RTL-BenchMT: Dynamic Maintenance of RTL Generation Benchmark Through Agent-Assisted Analysis and Revision

claude-opus-4.6 · 5/18/2026

Paper 1 develops fundamental theoretical results connecting reward hacking and model exploitation in reinforcement learning, proving near-inevitability of exploitation and establishing safe planning horizons. This addresses a core problem in AI safety with broad implications for any system using learned world models. Paper 2, while practically useful for EDA/RTL benchmarking, addresses a narrower domain-specific problem (maintaining RTL generation benchmarks) with more limited cross-disciplinary impact. Paper 1's theoretical contributions are more likely to influence multiple research directions in RL, AI safety, and planning.

vs. STAR: A Stage-attributed Triage and Repair framework for RCA Agents in Microservices

gemini-3.1 · 5/18/2026

Paper 2 offers fundamental theoretical contributions to Reinforcement Learning and AI safety by formalizing model exploitation and its inevitability, establishing a bridge to reward hacking. Such foundational insights into the limits of safe planning in world models have a broader, longer-lasting scientific impact across the AI community compared to Paper 1, which, while practically valuable, presents a more domain-specific engineering framework for microservice AIOps.

vs. Nudging Beyond the Comfort Zone: Efficient Strategy-Guided Exploration for RLVR

claude-opus-4.6 · 5/18/2026

Paper 1 establishes fundamental theoretical results about model exploitation in RL, proving its near-inevitability and drawing formal connections to reward hacking. These foundational insights have broad implications for AI safety, world model-based planning, and alignment research. While Paper 2 presents a useful practical contribution (NudgeRL for efficient exploration in RLVR), it is more incremental and narrowly focused on a specific training paradigm for LLMs. Paper 1's theoretical framework is likely to have longer-lasting and broader influence across multiple subfields of AI research.

vs. Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness in Tax Law

gpt-5.2 · 5/18/2026

Paper 1 offers a more foundational, broadly applicable theoretical contribution: a formal definition of model exploitation, a general theory unifying exploitation and reward hacking, and inevitability/avoidability results (including safe-horizon conditions). This targets core questions in RL safety and world-model planning with likely cross-domain impact (alignment, verification, agent design). Paper 2 is timely and practically valuable for legal NLP, especially contamination-aware evaluation and neuro-symbolic robustness, but its scope is narrower (tax law) and primarily empirical; its broader scientific impact depends on generalization beyond the domain and benchmark.

vs. Property-Guided LLM Program Synthesis for Planning

claude-opus-4.6 · 5/18/2026

Paper 2 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal definitions, impossibility results, and safe planning horizons. This has broad implications across RL, AI safety, and model-based planning. Paper 1 presents a useful engineering contribution (property-guided LLM synthesis for planning heuristics) with strong empirical results, but its scope is narrower. Paper 2's theoretical framework is more likely to influence multiple research directions and has higher relevance given growing concerns about AI safety and reliable world models.

vs. Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

gpt-5.2 · 5/18/2026

Paper 1 has higher likely scientific impact: it introduces a formal definition of model exploitation in RL, develops general theory linking exploitation and reward hacking, proves (near-)unavoidability results, and derives a “safe horizon” characterization. This is methodologically rigorous, broadly relevant to RL, model-based planning, AI safety, and evaluation, and timely given widespread world-model use. Paper 2 targets an important application (legal AI) and proposes a promising graph-constrained framework plus a dataset, but current evidence is limited (small corpus, no direct RAG baseline comparison), and the contribution is more domain-specific and engineering-oriented.

vs. Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

gemini-3.1 · 5/18/2026

Paper 2 offers higher scientific impact by providing foundational, theoretical proofs on the limits of safe planning in reinforcement learning. By formally defining 'model exploitation' and proving its inevitability in large policy sets, it establishes a rigorous bridge to reward hacking. While Paper 1 provides highly practical empirical insights for LLM monitoring, Paper 2's theoretical contributions offer deeper methodological rigor. Its formalization of critical safety limitations will likely shape future fundamental research directions across the broader field of AI alignment and model-based RL, creating a longer-lasting conceptual impact.

vs. ColPackAgent: Agent-Skill-Guided Hard-Particle Monte Carlo Workflows for Colloidal Packing

claude-opus-4.6 · 5/18/2026

Paper 1 addresses a fundamental theoretical problem in reinforcement learning—model exploitation and its relationship to reward hacking—establishing formal impossibility results and safety guarantees applicable broadly across RL and AI safety. Its contributions are foundational, with implications for any system using learned world models. Paper 2 presents an applied engineering contribution (an LLM agent framework for colloidal packing simulations) that, while useful, has narrower scope and incremental novelty, primarily integrating existing tools (HOOMD-blue, MCP, LLMs) rather than advancing core scientific understanding.

vs. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

claude-opus-4.6 · 5/18/2026

Paper 1 establishes fundamental theoretical results connecting reward hacking and model exploitation in RL, proving near-inevitability of exploitation and deriving safe planning horizons. This creates a new formal framework with broad implications for model-based RL safety. Paper 2 presents a practical but incremental contribution—a new alignment training method for reasoning LLMs. While useful, CASPO builds on existing DPO methods with confidence calibration. Paper 1's theoretical contributions are more foundational, likely to be widely cited across RL, AI safety, and planning communities, with longer-lasting impact.