Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements
James M. Mazzu
Abstract
As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper attempts to formalize a widely discussed intuition in AI safety: that external control mechanisms will eventually be insufficient to constrain sufficiently capable AI systems, and therefore safety must be "intrinsic" to the system. It does so using a control-theoretic framework, establishing two main results: (1) a class-wide impossibility theorem showing that externally enforced safety strategies fail once system effects exceed bounded external control capacity, and (2) a conditional necessity result that any remaining safety-sustaining strategy must be intrinsic, accompanied by four structural requirements.
The paper's contribution is primarily organizational and clarificatory rather than technically deep. It takes arguments that have circulated informally in AI safety discourse (Bostrom, Russell, Yampolskiy) and places them within a formal control-theoretic frame. The four structural requirements (no external enforcement dependence, safety-compatible genesis, self-modification invariance, capability-scaling consistency) are essentially restatements of known desiderata from the alignment literature packaged under a unified derivation.
Methodological Rigor
The control-theoretic formalization is clean but ultimately straightforward. The core theorem follows almost directly from the assumptions. The proof structure is: assume external control is bounded (A1), assume system effects exceed this bound at the safety boundary (A2), assume the boundary is reachable (A3), then conclude external control cannot maintain invariance. This is essentially a formalization of "if X is stronger than Y, then Y cannot contain X," dressed in control-theoretic language.
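The "if X is stronger than Y" reading can be illustrated with a toy scalar system. The sketch below is not the paper's model: the dynamics dx/dt = a*x + u, the safe set |x| <= 1, and the names `stays_safe`, `drift_gain`, and `u_max` are all assumptions chosen for illustration. The external controller always applies its full bounded authority toward the origin, which is the most corrective action available to it in this toy setting.

```python
def stays_safe(drift_gain, u_max, x0=1.0, dt=1e-3, steps=2000):
    """Euler-integrate dx/dt = drift_gain * x + u, starting on the boundary of |x| <= 1.

    Hypothetical toy model: u is the external control, bounded by |u| <= u_max
    (an A1-style bound). Returns True if the state never leaves the safe set.
    """
    x = x0
    for _ in range(steps):
        # Full-authority correction: push toward the origin as hard as allowed.
        u = -u_max if x > 0 else u_max
        x += dt * (drift_gain * x + u)
        if abs(x) > 1.0 + 1e-9:
            return False  # the state crossed the safety boundary
    return True


if __name__ == "__main__":
    # Outward drift at the boundary (0.5) is below the control bound (1.0).
    print(stays_safe(drift_gain=0.5, u_max=1.0))
    # Outward drift at the boundary (2.0) exceeds the control bound (1.0).
    print(stays_safe(drift_gain=2.0, u_max=1.0))
```

In this toy setting the first call reports that the state stays inside the safe set and the second reports that it leaves on the first step. Once the outward drift at the boundary exceeds `u_max`, no admissible choice of `u` changes the outcome, which is the sense in which the conclusion follows almost directly from an A2-style premise.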
The critical issue is that assumption A2, the supercritical boundary control-authority gap, comes close to assuming the conclusion. A2 states that the system's outward effects exceed bounded external correction; the theorem then proves that bounded external correction cannot maintain safety, so the logical gap between premise and conclusion is narrow. The paper acknowledges this conditionality but frames A2 as an "empirical premise" without addressing whether or how it could be verified.
The lemma proof is correct but trivial: it adds two inequalities. The theorem proof correctly applies standard invariance theory (Nagumo/Blanchini conditions). The mathematical machinery is sound but not technically demanding.
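The shape of such an argument can be written out in a few lines. The notation below is an assumption made for illustration and is not taken from the paper: a control-affine system, a safe set defined by a smooth function h, a control bound standing in for A1, and a drift bound standing in for A2.

```latex
% Illustrative notation (assumed, not the paper's):
%   dynamics        \dot{x} = f(x) + g(x)\,u
%   safe set        \mathcal{S} = \{ x : h(x) \ge 0 \}
%   A1-style bound  \|u\| \le \bar{u}
\begin{align*}
  \dot{h}(x) &= \nabla h(x)^{\top} f(x) + \nabla h(x)^{\top} g(x)\,u
    && \text{(chain rule)} \\
  \nabla h(x^{*})^{\top} f(x^{*}) &\le -\alpha
    && \text{(A2-style drift bound at a boundary point } x^{*},\ h(x^{*}) = 0\text{)} \\
  \nabla h(x^{*})^{\top} g(x^{*})\,u &\le \|g(x^{*})^{\top} \nabla h(x^{*})\|\,\bar{u} =: \beta
    && \text{(Cauchy-Schwarz with the A1-style bound)} \\
  \dot{h}(x^{*}) &\le -\alpha + \beta < 0 \quad \text{whenever } \alpha > \beta
    && \text{(sum of the two inequalities)}
\end{align*}
```

Under these assumptions, every admissible control gives a strictly negative derivative of h at the boundary point, so a Nagumo-type tangency condition for invariance of the safe set fails there. If that boundary point is reachable (an A3-style premise), no bounded external control keeps the trajectory inside the safe set.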
The additional premises E1–E4 for the necessity result are more problematic. E3 (exhaustive binary partition into externally enforced vs. intrinsic) is the load-bearing assumption, and the paper acknowledges that rejecting it breaks the necessity result. E4 simply assumes the existence of an intrinsic candidate, which is precisely what one would want to prove or at least motivate more carefully.
Potential Impact
The paper's potential impact is primarily conceptual rather than technical. It provides a formal vocabulary for discussing a structural concern about AI safety strategies. However, several factors limit its impact:
1. The gap between formalism and practice is large. The paper explicitly acknowledges its framework is "deliberately idealized" and does not model current AI systems. The continuous-time, deterministic, control-affine setup is far from the discrete, stochastic, highly nonlinear reality of modern AI systems.
2. Limited actionable guidance. The four structural requirements are necessary conditions stated at a high level of abstraction. They do not provide mechanisms, algorithms, or empirical criteria. The paper acknowledges Φ (safety-compatible configurations) is "defined conceptually rather than operationally."
3. Risk of circularity. The paper's strongest claim (impossibility of external control) depends on assuming conditions under which external control has already been overwhelmed. Without an independent method for assessing when A2 holds, the theorem cannot guide policy or engineering decisions.
4. Self-promotional aspects. Section 2.7 promotes the author's prior work on "Supertrust" as a candidate intrinsic strategy. While positioned as merely an example, this raises questions about the paper's motivation and framing.
Timeliness & Relevance
The paper addresses a genuinely important question, the long-term sustainability of AI safety, at a time of rapid capability advancement, so the concern is timely. However, the AI safety field has largely moved toward empirical and engineering approaches (mechanistic interpretability, RLHF, constitutional AI, red-teaming), and purely theoretical impossibility results disconnected from practical systems have limited traction.
Overall Assessment
This paper formalizes an important but well-known intuition about the limits of external AI control. The formalization is correct but technically shallow, with assumptions that largely encode the conclusion. The structural requirements derived are sensible but not novel. The paper's primary value is as a clearly written conceptual contribution that organizes existing concerns under a control-theoretic umbrella, but it falls short of providing actionable insights or technically deep results that would significantly advance the field.
Generated May 14, 2026
Comparison History (36)
Paper 2 addresses AI safety, one of the most consequential and timely topics in science and technology. Its control-theoretic formalization of external impossibility results for AI safety provides foundational structural insights applicable across the entire AI safety field. The formal proof that externally enforced safety strategies are structurally insufficient, combined with necessary conditions for viable alternatives, has broad implications for AI governance, policy, and technical safety research. Paper 1, while methodologically interesting in unifying cognitive science debates, addresses a narrower niche in cognitive psychology with less transformative potential.
Paper 2 is more likely to have higher near-term scientific impact: it proposes a concrete, implementable method (dynamic multi-objective reward weighting plus data-utility reweighting) and claims extensive benchmark improvements, making it readily adoptable in RLHF/post-training pipelines across labs and products. Its applications are immediate for aligning and improving LLMs under non-stationary, heterogeneous data/reward settings, and it is timely given current post-training practice. Paper 1 offers valuable conceptual/structural results but is more theoretical, premise-dependent, and less directly actionable, likely limiting broader uptake.
Paper 1 addresses a fundamental and timely question about AI safety sustainability using control theory, providing formal impossibility results and structural requirements for viable safety strategies. Given the rapid advancement of AI capabilities and widespread concern about AI safety, this work has high relevance and broad potential impact across AI safety, policy, and governance. Paper 2, while technically sound, addresses a more specialized topic in abstract argumentation with narrower scope. The timeliness of AI safety research and its potential to influence both research directions and policy gives Paper 1 significantly greater estimated impact.
Paper 2 addresses a critical bottleneck in current AI development (training LLMs for complex reasoning via RL) with a highly practical, empirically validated algorithmic improvement over state-of-the-art methods like GRPO. Its immediate applicability to current, highly active research trends gives it higher potential for rapid, widespread adoption and impact. While Paper 1 offers foundational theoretical work on AI safety, its long-term, abstract nature makes its immediate scientific impact less certain compared to Paper 2's direct utility.
Paper 1 addresses a fundamental, high-stakes question about AI safety with formal control-theoretic proofs establishing structural impossibility results for externally enforced safety strategies. Its implications span the entire AI safety field and policy landscape, providing rigorous foundations for a critical ongoing debate. Paper 2, while novel in its brain-inspired memory architecture for AI agents, addresses a more narrowly scoped engineering problem. Paper 1's formal results about the limits of external control have broader, more lasting implications for AI governance, alignment research, and the trajectory of AI development.
Paper 2 likely has higher scientific impact due to clearer near-term applicability and evaluable empirical gains on standard benchmarks for a widely used problem (LLM updating/hallucinations). It proposes a concrete, scalable method (LightEdit) with demonstrated improvements and cost reductions, making adoption and follow-on work more likely across NLP/ML systems. Paper 1 is conceptually novel and timely for AI safety theory, but it is largely structural/conditional and offers no complete actionable strategy; its impact depends on acceptance of premises and may be narrower and less immediately actionable.
Paper 1 offers higher potential scientific impact by providing foundational mathematical proofs addressing a critical bottleneck in advanced AI: safety and alignment. By formalizing the impossibility of external control for capable AI using control theory, it forces a paradigm shift toward intrinsic safety strategies. While Paper 2 presents a practical, empirical advancement in LLM agent memory, Paper 1 tackles a universally urgent, existential problem with rigorous structural theorems. Its cross-disciplinary approach promises broader, long-term theoretical impact on how the scientific community approaches AGI development.
While Paper 1 offers strong empirical advancements in multi-agent pathfinding, Paper 2 tackles a foundational issue in AI safety with far broader implications. By providing formal control-theoretic proofs regarding the impossibility of external AI control and defining structural requirements for intrinsic safety, Paper 2 establishes critical theoretical boundaries for AGI alignment, a highly timely and universally impactful field.
Paper 1 addresses a foundational challenge in AI safety, offering formal control-theoretic proofs on the limits of external control for highly capable AI. Its theoretical framework provides broad, long-term implications for AGI alignment and safety policy, giving it a wider scope and higher potential impact than Paper 2, which focuses on a more specific, albeit rigorous, algorithmic improvement in multi-agent reinforcement learning.
Paper 2 addresses a fundamental theoretical problem in AI safety using formal control theory to establish impossibility and necessity results. Establishing structural bounds on AI control has profound, long-term implications for AGI development and policy, giving it broader foundational impact than Paper 1's empirical, application-specific framework for multi-agent moderation.
Paper 2 provides foundational theoretical results (impossibility theorems) on AI safety using control theory, addressing a critical, long-term challenge for the field. While Paper 1 offers valuable empirical improvements in LLM reasoning, Paper 2's structural proofs regarding the necessity of intrinsic control for highly capable AI systems have profound, lasting implications for AI alignment, offering a broader and more fundamental scientific impact.
Paper 1 addresses a fundamental, high-stakes question about the structural limits of external AI safety control using formal control-theoretic methods. Its impossibility result and derived structural requirements for viable safety strategies have broad implications across AI safety, alignment research, and policy. The formalization of widely-held intuitions about control limits provides lasting theoretical contributions. Paper 2, while technically sound and useful for social simulation, addresses a narrower problem (LLM-based opinion dynamics) with more incremental contributions. The breadth, timeliness, and foundational nature of Paper 1's results give it higher potential impact.
Paper 2 provides mathematical proofs establishing structural limits on external AI control, forcing a paradigm shift toward intrinsic safety in alignment research. While Paper 1 is highly relevant and timely, it is a position paper formalizing an already trending paradigm (agentic AI). The rigorous impossibility results in Paper 2 offer foundational theoretical constraints for a critical existential problem, giving it higher potential for long-term transformative impact.
Paper 1 is likely to have higher near-term scientific impact: it introduces a concrete, reusable benchmark that bridges real egocentric data and executable environments, directly enabling measurable progress in embodied AI under partial observability. It has clear applications (evaluation/training of household agents, memory/belief tracking), is timely with current interest in embodied agents and video-to-simulation, and can influence multiple communities (robotics, embodied AI, vision, planning/SLAM-style state tracking, benchmarks). Paper 2 offers valuable formal framing for AI safety, but is more foundational/philosophical, harder to operationalize, and its results depend on strong premises, limiting immediate empirical uptake.
Paper 1 offers a concrete, technically grounded advance in knowledge representation and reasoning: it provides missing size-complexity bounds (polynomial growth) for first-order progression in established action classes and shows fragment preservation for decidable FO fragments, directly improving practical applicability and enabling implementable systems with decidability guarantees. Its methodological rigor and clear integration into formal logic/CS theory suggest durable, citable impact across AI planning, KR, and automated reasoning. Paper 2 is timely and conceptually broad, but its impact depends heavily on the realism of premises and offers fewer immediately actionable technical artifacts.
Paper 1 addresses a fundamental, timely question about the structural limits of AI safety as systems become increasingly capable. Its use of control theory to derive formal impossibility results and necessary conditions for viable safety strategies provides a theoretical foundation that could shape the entire field of AI alignment. While Paper 2 offers a useful incremental engineering contribution (5-20% improvements on existing benchmarks for LLM verification), Paper 1's results have broader implications across AI safety research, policy, and governance, with potential to influence how the field conceptualizes long-term safety strategies.
Paper 2 has higher estimated scientific impact due to broader cross-field relevance (AI safety, control theory, governance, alignment), strong novelty in formalizing class-wide impossibility/necessity results, and high timeliness given frontier-model concerns. Its structural theorems and requirements can shape subsequent theoretical and practical safety research agendas beyond any single application domain. Paper 1 is methodologically grounded and practically valuable for educational AI, but its impact is narrower (learner modeling/edtech) and more incremental relative to ongoing neural-symbolic and knowledge tracing work.
Paper 2 addresses the fundamental and timely problem of AI safety as systems become increasingly capable, providing formal control-theoretic proofs about the structural limits of external enforcement strategies. Its results—establishing impossibility of externally enforced safety and necessary conditions for intrinsic safety—have broad implications across AI alignment, policy, and governance. The formalization of widely-held intuitions about AI control limits gives it high potential to influence both theoretical research and practical safety frameworks. Paper 1, while methodologically sound, addresses a narrower problem in multi-criteria decision making with incremental contributions.
Paper 1 has higher likely impact because it presents concrete, experimentally supported findings about a widely used technique (LLM-based memory consolidation) and shows a counterintuitive failure mode with measurable performance regressions on relevant benchmarks, plus actionable design guidance (treat episodes as evidence; gate consolidation). This is timely for agentic LLM systems and broadly applicable to tooling, evaluation, and system design. Paper 2 offers formal, high-level impossibility/necessity claims in AI safety, but without a concrete strategy or empirical grounding its near-term uptake and testability may be lower.
Paper 2 addresses a foundational problem in AI safety by mathematically proving the limits of external control for capable AI systems. While Paper 1 offers a practical and useful optimization for LLM deployment, Paper 2 has a significantly higher potential for broad scientific impact. It establishes theoretical boundaries and necessary conditions for intrinsic AI alignment, which could shape future research paradigms and policy regarding long-term AGI safety.