Sustaining AI safety: Control-theoretic external impossibility, intrinsic necessity, and structural requirements

James M. Mazzu

May 13, 2026

arXiv:2605.12963v1

cs.AI (primary)

#1215 of 2821 · Artificial Intelligence

Tournament Score: 1425 ± 31 (scale 1050 to 1800)

Win Rate: 64%

Rating: 3.5 / 10

Significance 4 · Rigor 4.5 · Novelty 3 · Clarity 7

Abstract

As AI systems become increasingly capable, safety strategies must be evaluated not only by how much they reduce present risk, but by whether they could sustain safety once external control can no longer reliably constrain system behavior. This paper addresses that problem by using control theory to clarify, at a structural level, whether externally enforced safety-sustaining strategies can succeed and, if not, what any alternative strategy would have to satisfy in order to be viable. It establishes two main results. First, under explicit premises including a reachability condition, it proves a class-wide external impossibility result: once the system's effects exceed what bounded external control can counteract, no strategy that depends in any degree on continued external enforcement can sustain AI safety. This failure is structural across the entire externally enforced class rather than contingent on any particular strategy. Second, it establishes a conditional class-level necessity result: if at least one candidate safety-sustaining strategy remains after that elimination, then all such remaining strategies must be intrinsic. It then states four structural requirements for viability: safety may not depend on continued external enforcement; the system's terminal objective must be safety-compatible when first formed; that objective must remain stable under self-modification; and safety must continue to be preserved as capability grows. The paper does not propose a complete strategy for sustaining AI safety. Its contribution is to give formal structure to a widely held concern about the limits of external control. It does so by deriving explicit conditional results that identify which safety-sustaining strategies are ruled out and what any remaining strategies must satisfy.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper attempts to formalize a widely discussed intuition in AI safety: that external control mechanisms will eventually be insufficient to constrain sufficiently capable AI systems, and therefore safety must be "intrinsic" to the system. It does so using a control-theoretic framework, establishing two main results: (1) a class-wide impossibility theorem showing that externally enforced safety strategies fail once system effects exceed bounded external control capacity, and (2) a conditional necessity result that any remaining safety-sustaining strategy must be intrinsic, accompanied by four structural requirements.

The paper's contribution is primarily organizational and clarificatory rather than technically deep. It takes arguments that have circulated informally in AI safety discourse (Bostrom, Russell, Yampolskiy) and places them within a formal control-theoretic frame. The four structural requirements (no external enforcement dependence, safety-compatible genesis, self-modification invariance, capability-scaling consistency) are essentially restatements of known desiderata from the alignment literature packaged under a unified derivation.

Methodological Rigor

The control-theoretic formalization is clean but ultimately straightforward. The core theorem follows almost directly from the assumptions. The proof structure is: assume external control is bounded (A1), assume system effects exceed this bound at the safety boundary (A2), assume the boundary is reachable (A3), then conclude external control cannot maintain invariance. This is essentially a formalization of "if X is stronger than Y, then Y cannot contain X," dressed in control-theoretic language.
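
The proof structure described above can be written out in a few lines. This is a sketch in generic control-affine notation, not the paper's own: the symbols $f$, $g$, $h$, $\bar{u}$ and the form of the safe set are assumptions chosen for illustration, and the paper's exact statements of A1 to A3 may differ.

```latex
% Illustrative notation (assumed, not taken from the paper).
% Dynamics and safe set:
\dot{x} = f(x) + g(x)\,u, \qquad S = \{\, x : h(x) \le 0 \,\}

% A1 (bounded external control):
\|u(t)\| \le \bar{u} \quad \text{for all } t

% A2 (supercritical gap at a boundary point x^* with h(x^*) = 0):
\nabla h(x^*)^{\top} f(x^*) \;>\; \bigl\|\nabla h(x^*)^{\top} g(x^*)\bigr\|\,\bar{u}

% A3 (reachability): some admissible trajectory reaches x^*.

% Consequence at x^*, for every admissible u:
\dot{h}(x^*) = \nabla h^{\top} f + \nabla h^{\top} g\,u
  \;\ge\; \nabla h^{\top} f - \bigl\|\nabla h^{\top} g\bigr\|\,\bar{u} \;>\; 0
% The Nagumo-type condition \dot{h} \le 0 on the boundary therefore fails at x^*,
% so no bounded external control keeps S forward invariant.
```

In this form the step from A2 to the conclusion is a single norm bound on the control term.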

The critical issue is that assumption A2 — the supercritical boundary control-authority gap — essentially assumes the conclusion. A2 states that the system's outward effects exceed bounded external correction. The theorem then proves that bounded external correction cannot maintain safety. The logical gap between premise and conclusion is narrow. The paper acknowledges this conditionality but frames A2 as an "empirical premise" rather than engaging with whether or how it could be verified.

The lemma proof is correct but trivial: it adds two inequalities. The theorem proof correctly applies standard invariance theory (Nagumo/Blanchini conditions). The mathematical machinery is sound but not technically demanding.
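
A toy numerical illustration of the invariance argument, assuming a scalar system that is not taken from the paper: dx/dt = a·x + u with |u| ≤ u_max and safe set |x| ≤ 1. The function name `stays_safe`, the dynamics, and all parameter values are hypothetical and chosen only to show the three premises interacting.

```python
def stays_safe(drift_gain, u_max, x0, dt=1e-3, steps=5000):
    """Euler-integrate dx/dt = drift_gain * x + u; return True if |x| <= 1 throughout.

    The controller is the most favourable bounded one for this toy system:
    it always pushes toward the origin with the full admissible magnitude.
    """
    x = x0
    for _ in range(steps):
        u = -u_max if x > 0 else u_max  # strongest admissible correction
        x += dt * (drift_gain * x + u)
        if abs(x) > 1.0:                # trajectory has left the safe set
            return False
    return True


# Supercritical at the boundary (drift 2.0 > control bound 1.0), started near it.
print(stays_safe(2.0, 1.0, x0=0.9))   # False
# Subcritical (drift 0.5 < control bound 1.0): bounded control suffices.
print(stays_safe(0.5, 1.0, x0=0.9))   # True
# Supercritical at the boundary, but started where control still dominates.
print(stays_safe(2.0, 1.0, x0=0.4))   # True
```

The third case shows why a reachability premise is needed alongside the boundary gap: in this toy system the gap only matters if the state can get close enough to the boundary for the drift to exceed the control bound.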

The additional premises E1–E4 for the necessity result are more problematic. E3 (exhaustive binary partition into externally enforced vs. intrinsic) is the load-bearing assumption, and the paper acknowledges that rejecting it breaks the necessity result. E4 simply assumes the existence of an intrinsic candidate, which is precisely what one would want to prove or at least motivate more carefully.
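
The role of E3 and E4 can be seen in a short set-theoretic sketch. The notation is assumed for illustration and is not the paper's; E1 and E2 are not described in this assessment and are left out.

```latex
% Illustrative set notation (assumed, not the paper's).
% \mathcal{S}: all candidate safety-sustaining strategies.
% E3 (exhaustive binary partition):
\mathcal{S} = \mathcal{S}_{\mathrm{ext}} \cup \mathcal{S}_{\mathrm{int}}, \qquad
\mathcal{S}_{\mathrm{ext}} \cap \mathcal{S}_{\mathrm{int}} = \emptyset

% Impossibility theorem (under A1--A3): every element of \mathcal{S}_{ext} is eliminated,
% so the remaining candidates are
R = \mathcal{S} \setminus \mathcal{S}_{\mathrm{ext}} = \mathcal{S}_{\mathrm{int}}

% E4 (existence): R \neq \emptyset, so the statement "every remaining
% strategy is intrinsic" is not vacuous.

% Without E3, e.g. with a third hybrid class \mathcal{S}_{hyb} not covered by the theorem:
R = \mathcal{S}_{\mathrm{int}} \cup \mathcal{S}_{\mathrm{hyb}} \not\subseteq \mathcal{S}_{\mathrm{int}}
```

Under this reading the necessity result is a set difference, and it fails as soon as a third class of strategies is admitted.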

Potential Impact

The paper's potential impact is primarily conceptual rather than technical. It provides a formal vocabulary for discussing a structural concern about AI safety strategies. However, several factors limit its impact:

1. The gap between formalism and practice is large. The paper explicitly acknowledges its framework is "deliberately idealized" and does not model current AI systems. The continuous-time, deterministic, control-affine setup is far from the discrete, stochastic, highly nonlinear reality of modern AI systems.

2. Limited actionable guidance. The four structural requirements are necessary conditions stated at a high level of abstraction. They do not provide mechanisms, algorithms, or empirical criteria. The paper acknowledges Φ (safety-compatible configurations) is "defined conceptually rather than operationally."

3. Circular risk in the argument. The paper's strongest claim (impossibility of external control) depends on assuming conditions under which external control has already been overwhelmed. Without independent methods to assess when A2 holds, the theorem cannot guide policy or engineering decisions.

4. Self-promotional aspects. Section 2.7 promotes the author's prior work on "Supertrust" as a candidate intrinsic strategy. While positioned as merely an example, this raises questions about the paper's motivation and framing.

Timeliness & Relevance

The paper addresses a genuinely important question — the long-term sustainability of AI safety — at a time of rapid capability advancement. The concern is timely. However, the AI safety field has largely moved toward empirical and engineering approaches (mechanistic interpretability, RLHF, constitutional AI, red-teaming), and purely theoretical impossibility results disconnected from practical systems have limited traction.

Strengths

Clear logical structure. The paper is well-organized with explicit assumptions, numbered definitions, and clear conditional statements.

Honest about limitations. The paper carefully labels which results are conditional on which premises and discusses falsifiability.

Useful conceptual taxonomy. The distinction between externally enforced and intrinsic safety-sustaining strategies, while not new, is clearly articulated.

Comprehensive related work. The literature review connects to multiple relevant threads.

Limitations

Low technical novelty. The mathematical content is elementary control theory. The theorem is a near-tautological consequence of the assumptions.

Assumption A2 does most of the work. It formalizes the conclusion rather than providing an independent structural insight.

No empirical grounding. No examples, simulations, or case studies illustrate when A2 might hold or how the framework applies to actual systems.

Φ remains undefined operationally. The central concept of safety-compatible internal configurations is acknowledged to be conceptual only.

Binary partition (E3) is debatable. Many real safety architectures involve layered, hybrid, or architectural approaches that resist clean binary classification.

Does not advance beyond prior informal arguments. Bostrom (2014), Russell (2019), and others have made essentially the same argument informally. The formalization adds precision but limited new insight.

Single-author paper from a commercial entity without peer review history in control theory or AI safety venues. This affects the paper's standing in the technical community.

Overall Assessment

This paper formalizes an important but well-known intuition about the limits of external AI control. The formalization is correct but technically shallow, with assumptions that largely encode the conclusion. The structural requirements derived are sensible but not novel. The paper's primary value is as a clearly written conceptual contribution that organizes existing concerns under a control-theoretic umbrella, but it falls short of providing actionable insights or technically deep results that would significantly advance the field.

Rating: 3.5 / 10

Significance 4 · Rigor 4.5 · Novelty 3 · Clarity 7

Generated May 14, 2026

Comparison History (36)

vs. Overhang Tower: Resource-Rational Adaptation in Sequential Physical Planning

claude-opus-4.6 · 5/14/2026

Paper 2 addresses AI safety, one of the most consequential and timely topics in science and technology. Its control-theoretic formalization of external impossibility results for AI safety provides foundational structural insights applicable across the entire AI safety field. The formal proof that externally enforced safety strategies are structurally insufficient, combined with necessary conditions for viable alternatives, has broad implications for AI governance, policy, and technical safety research. Paper 1, while methodologically interesting in unifying cognitive science debates, addresses a narrower niche in cognitive psychology with less transformative potential.

vs. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility

gpt-5.2 · 5/14/2026

Paper 2 is more likely to have higher near-term scientific impact: it proposes a concrete, implementable method (dynamic multi-objective reward weighting plus data-utility reweighting) and claims extensive benchmark improvements, making it readily adoptable in RLHF/post-training pipelines across labs and products. Its applications are immediate for aligning and improving LLMs under non-stationary, heterogeneous data/reward settings, and it is timely given current post-training practice. Paper 1 offers valuable conceptual/structural results but is more theoretical, premise-dependent, and less directly actionable, likely limiting broader uptake.

vs. Diversity of Extensions in Abstract Argumentation

claude-opus-4.6 · 5/14/2026

Paper 1 addresses a fundamental and timely question about AI safety sustainability using control theory, providing formal impossibility results and structural requirements for viable safety strategies. Given the rapid advancement of AI capabilities and widespread concern about AI safety, this work has high relevance and broad potential impact across AI safety, policy, and governance. Paper 2, while technically sound, addresses a more specialized topic in abstract argumentation with narrower scope. The timeliness of AI safety research and its potential to influence both research directions and policy gives Paper 1 significantly greater estimated impact.

vs. Selective Off-Policy Reference Tuning with Plan Guidance

gemini-3.1 · 5/14/2026

Paper 2 addresses a critical bottleneck in current AI development (training LLMs for complex reasoning via RL) with a highly practical, empirically validated algorithmic improvement over state-of-the-art methods like GRPO. Its immediate applicability to current, highly active research trends gives it higher potential for rapid, widespread adoption and impact. While Paper 1 offers foundational theoretical work on AI safety, its long-term, abstract nature makes its immediate scientific impact less certain compared to Paper 2's direct utility.

vs. Cognifold: Always-On Proactive Memory via Cognitive Folding

claude-opus-4.6 · 5/14/2026

Paper 1 addresses a fundamental, high-stakes question about AI safety with formal control-theoretic proofs establishing structural impossibility results for externally enforced safety strategies. Its implications span the entire AI safety field and policy landscape, providing rigorous foundations for a critical ongoing debate. Paper 2, while novel in its brain-inspired memory architecture for AI agents, addresses a more narrowly scoped engineering problem. Paper 1's formal results about the limits of external control have broader, more lasting implications for AI governance, alignment research, and the trajectory of AI development.

vs. Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

gpt-5.2 · 5/14/2026

Paper 2 likely has higher scientific impact due to clearer near-term applicability and evaluable empirical gains on standard benchmarks for a widely used problem (LLM updating/hallucinations). It proposes a concrete, scalable method (LightEdit) with demonstrated improvements and cost reductions, making adoption and follow-on work more likely across NLP/ML systems. Paper 1 is conceptually novel and timely for AI safety theory, but it is largely structural/conditional and offers no complete actionable strategy; its impact depends on acceptance of premises and may be narrower and less immediately actionable.

vs. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

gemini-3.1 · 5/14/2026

Paper 1 offers higher potential scientific impact by providing foundational mathematical proofs addressing a critical bottleneck in advanced AI: safety and alignment. By formalizing the impossibility of external control for capable AI using control theory, it forces a paradigm shift toward intrinsic safety strategies. While Paper 2 presents a practical, empirical advancement in LLM agent memory, Paper 1 tackles a universally urgent, existential problem with rigorous structural theorems. Its cross-disciplinary approach promises broader, long-term theoretical impact on how the scientific community approaches AGI development.

vs. Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

gemini-3.1 · 5/14/2026

While Paper 1 offers strong empirical advancements in multi-agent pathfinding, Paper 2 tackles a foundational issue in AI safety with far broader implications. By providing formal control-theoretic proofs regarding the impossibility of external AI control and defining structural requirements for intrinsic safety, Paper 2 establishes critical theoretical boundaries for AGI alignment, a highly timely and universally impactful field.

vs. Macro-Action Based Multi-Agent Instruction Following through Value Cancellation

gemini-3.1 · 5/14/2026

Paper 1 addresses a foundational challenge in AI safety, offering formal control-theoretic proofs on the limits of external control for highly capable AI. Its theoretical framework provides broad, long-term implications for AGI alignment and safety policy, giving it a wider scope and higher potential impact than Paper 2, which focuses on a more specific, albeit rigorous, algorithmic improvement in multi-agent reinforcement learning.

vs. Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

gemini-3.1 · 5/14/2026

Paper 2 addresses a fundamental theoretical problem in AI safety using formal control theory to establish impossibility and necessity results. Establishing structural bounds on AI control has profound, long-term implications for AGI development and policy, giving it broader foundational impact than Paper 1's empirical, application-specific framework for multi-agent moderation.

vs. Learning to Reason with Insight for Informal Theorem Proving

gemini-3.1 · 5/14/2026

Paper 2 provides foundational theoretical results (impossibility theorems) on AI safety using control theory, addressing a critical, long-term challenge for the field. While Paper 1 offers valuable empirical improvements in LLM reasoning, Paper 2's structural proofs regarding the necessity of intrinsic control for highly capable AI systems have profound, lasting implications for AI alignment, offering a broader and more fundamental scientific impact.

vs. ScioMind: Cognitively Grounded Multi-Agent Social Simulation with Anchoring-Based Belief Dynamics and Dynamic Profiles

claude-opus-4.6 · 5/14/2026

Paper 1 addresses a fundamental, high-stakes question about the structural limits of external AI safety control using formal control-theoretic methods. Its impossibility result and derived structural requirements for viable safety strategies have broad implications across AI safety, alignment research, and policy. The formalization of widely-held intuitions about control limits provides lasting theoretical contributions. Paper 2, while technically sound and useful for social simulation, addresses a narrower problem (LLM-based opinion dynamics) with more incremental contributions. The breadth, timeliness, and foundational nature of Paper 1's results give it higher potential impact.

vs. Position: Agentic AI System Is a Foreseeable Pathway to AGI

gemini-3.1 · 5/14/2026

Paper 2 provides mathematical proofs establishing structural limits on external AI control, forcing a paradigm shift toward intrinsic safety in alignment research. While Paper 1 is highly relevant and timely, it is a position paper formalizing an already trending paradigm (agentic AI). The rigorous impossibility results in Paper 2 offer foundational theoretical constraints for a critical existential problem, giving it higher potential for long-term transformative impact.

vs. Ego2World: Compiling Egocentric Cooking Videos into Executable Worlds for Belief-State Planning

gpt-5.2 · 5/14/2026

Paper 1 is likely to have higher near-term scientific impact: it introduces a concrete, reusable benchmark that bridges real egocentric data and executable environments, directly enabling measurable progress in embodied AI under partial observability. It has clear applications (evaluation/training of household agents, memory/belief tracking), is timely with current interest in embodied agents and video-to-simulation, and can influence multiple communities (robotics, embodied AI, vision, planning/SLAM-style state tracking, benchmarks). Paper 2 offers valuable formal framing for AI safety, but is more foundational/philosophical, harder to operationalize, and its results depend on strong premises, limiting immediate empirical uptake.

vs. On the Size Complexity and Decidability of First-Order Progression

gpt-5.2 · 5/14/2026

Paper 1 offers a concrete, technically grounded advance in knowledge representation and reasoning: it provides missing size-complexity bounds (polynomial growth) for first-order progression in established action classes and shows fragment preservation for decidable FO fragments, directly improving practical applicability and enabling implementable systems with decidability guarantees. Its methodological rigor and clear integration into formal logic/CS theory suggest durable, citable impact across AI planning, KR, and automated reasoning. Paper 2 is timely and conceptually broad, but its impact depends heavily on the realism of premises and offers fewer immediately actionable technical artifacts.

vs. Trust but Verify: Introducing DAVinCI -- A Framework for Dual Attribution and Verification in Claim Inference for Language Models

claude-opus-4.6 · 5/14/2026

Paper 1 addresses a fundamental, timely question about the structural limits of AI safety as systems become increasingly capable. Its use of control theory to derive formal impossibility results and necessary conditions for viable safety strategies provides a theoretical foundation that could shape the entire field of AI alignment. While Paper 2 offers a useful incremental engineering contribution (5-20% improvements on existing benchmarks for LLM verification), Paper 1's results have broader implications across AI safety research, policy, and governance, with potential to influence how the field conceptualizes long-term safety strategies.

vs. Neural-Symbolic Knowledge Tracing: Injecting Educational Knowledge into Deep Learning for Responsible Learner Modelling

gpt-5.2 · 5/14/2026

Paper 2 has higher estimated scientific impact due to broader cross-field relevance (AI safety, control theory, governance, alignment), strong novelty in formalizing class-wide impossibility/necessity results, and high timeliness given frontier-model concerns. Its structural theorems and requirements can shape subsequent theoretical and practical safety research agendas beyond any single application domain. Paper 1 is methodologically grounded and practically valuable for educational AI, but its impact is narrower (learner modeling/edtech) and more incremental relative to ongoing neural-symbolic and knowledge tracing work.

vs. Unweighted ranking for value-based decision making with uncertainty

claude-opus-4.6 · 5/14/2026

Paper 2 addresses the fundamental and timely problem of AI safety as systems become increasingly capable, providing formal control-theoretic proofs about the structural limits of external enforcement strategies. Its results—establishing impossibility of externally enforced safety and necessary conditions for intrinsic safety—have broad implications across AI alignment, policy, and governance. The formalization of widely-held intuitions about AI control limits gives it high potential to influence both theoretical research and practical safety frameworks. Paper 1, while methodologically sound, addresses a narrower problem in multi-criteria decision making with incremental contributions.

vs. Useful Memories Become Faulty When Continuously Updated by LLMs

gpt-5.2 · 5/14/2026

Paper 1 has higher likely impact because it presents concrete, experimentally supported findings about a widely used technique (LLM-based memory consolidation) and shows a counterintuitive failure mode with measurable performance regressions on relevant benchmarks, plus actionable design guidance (treat episodes as evidence; gate consolidation). This is timely for agentic LLM systems and broadly applicable to tooling, evaluation, and system design. Paper 2 offers formal, high-level impossibility/necessity claims in AI safety, but without a concrete strategy or empirical grounding its near-term uptake and testability may be lower.

vs. Saliency-Aware Regularized Quantization Calibration for Large Language Models

gemini-3.1 · 5/14/2026

Paper 2 addresses a foundational problem in AI safety by mathematically proving the limits of external control for capable AI systems. While Paper 1 offers a practical and useful optimization for LLM deployment, Paper 2 has a significantly higher potential for broad scientific impact. It establishes theoretical boundaries and necessary conditions for intrinsic AI alignment, which could shape future research paradigms and policy regarding long-term AGI safety.