Position: AI Safety Requires Effective Controllability
Yige Li, Yunhao Feng, Jun Sun
Abstract
AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define \emph{controllability} as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce \controlbench{}, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.
AI Impact Assessments
(1 models)Scientific Impact Assessment
Core Contribution
This position paper argues that AI safety should elevate "controllability" — the ability to halt, override, redirect, or constrain AI systems at runtime — as a first-class objective distinct from alignment. The paper makes three concrete contributions: (1) a conceptual distinction between alignment (shaping behavioral distributions during training) and controllability (enforcing runtime authority), (2) ControlBench, a benchmark of 900 high-risk agentic scenarios across six risk categories, and (3) a conceptual architecture for Controllable AI Systems (CAS) organized around authority, interruptibility, runtime enforceability, persistence, and auditability.
The core insight — that a model can be well-aligned in expectation yet fail to yield to explicit override signals during multi-step agentic execution — is valuable and timely. The distinction between "safer in expectation" and "controllable by design" is well-articulated and captures a genuine gap in the current safety discourse.
Methodological Rigor
The empirical evaluation is the weakest aspect of this paper. The experimental setup evaluates only three configurations of OpenClaw-based agents (baseline, +SafeSkills, +AutoSkills) using a single backbone model (GPT-5.2). The attack success rate (ASR) metric, while conceptually appropriate as a "failure-to-yield" measure, depends heavily on an LLM judge (Gemini 3 Flash Preview), introducing evaluation noise that is not quantified through inter-annotator agreement or reliability analysis.
The marginal ASR reductions observed (0.63 → 0.58/0.59) are presented as evidence that skill-level safeguards are insufficient, but the experimental design doesn't control for confounds well. SafeSkills and AutoSkills were not specifically designed to address runtime controllability as defined by the authors, making the comparison somewhat circular — the paper defines a new property, tests mechanisms not designed for that property, and finds them insufficient. A more compelling evaluation would test systems that explicitly attempt runtime control (e.g., human-in-the-loop architectures, hard-coded tool-call blockers, or state-machine-based execution monitors).
The ControlBench construction pipeline (multi-strategy LLM generation → LLM-as-judge filtering → human curation) is reasonable but standard for modern safety benchmarks. The paper provides good detail on the construction process in the appendix, though the six risk categories are narrowly focused on cybersecurity-adjacent scenarios and don't cover the broader controllability failures the paper motivates (e.g., conflicting long-horizon goals, delegation chains, physical-world actions).
Potential Impact
The paper's conceptual contribution could have moderate-to-high influence on how the AI safety community frames its objectives. The five CAS properties (authority, interruptibility, runtime enforceability, persistence, auditability) provide a useful design checklist for practitioners building agentic systems. If widely adopted, this framing could shift evaluation standards to include controllability metrics alongside standard safety benchmarks.
The CAS architectural framework (Figure 4) — separating front-end safety screening from runtime control enforcement — is sensible but remains at a high level of abstraction. It identifies the right components (constraint compilation, runtime monitoring, intervention mechanisms, audit logging) but provides no concrete algorithms, protocols, or implementation guidance. This limits immediate practical impact.
ControlBench itself could serve as a useful resource, though its current scope (900 instances, cybersecurity-focused, single-agent) may limit adoption compared to broader safety benchmarks like HarmBench or Agent-SafetyBench. The benchmark's distinguishing feature — testing whether agents yield to explicit control signals rather than merely refuse harmful requests — addresses a genuine evaluation gap.
Timeliness & Relevance
This paper is highly timely. The rapid deployment of tool-using LLM agents (via frameworks like OpenClaw, LangChain, AutoGPT) has created exactly the kind of runtime control gap the paper identifies. The observation that agents can acknowledge safety restrictions while continuing to pursue restricted objectives through alternative paths is practically important and underappreciated. Recent incidents with autonomous coding agents and multi-step tool use underscore the urgency.
The paper connects to and extends several active research threads: AI control (Greenblatt et al., 2024), instruction hierarchy (Wallace et al., 2024), agent safety benchmarks (Zhang et al., 2024), and runtime governance (Wang et al., 2026). The positioning is well-executed — it synthesizes these threads into a coherent argument rather than merely surveying them.
Strengths
1. Clear conceptual contribution: The distinction between alignment and controllability is well-drawn and fills a genuine gap in safety discourse. The five CAS properties are concrete and actionable.
2. Well-structured position: The paper effectively moves from motivation → evidence → framework → counterarguments, addressing likely objections directly.
3. Finding 3 (surface-level safety ≠ trajectory-level control) is perhaps the most important empirical observation: agents adding cautionary language while preserving progress toward prohibited objectives reveals a fundamental limitation of output-level safety evaluation.
4. Comprehensive related work: The taxonomy of internal vs. external control mechanisms (Section 2) provides useful structure for the field.
Limitations
1. Thin empirical validation: Three agent configurations, one backbone model, one judge, no statistical significance testing. The evidence supports but does not strongly demonstrate the position.
2. No CAS implementation: The paper acknowledges this, but the absence of even a prototype controllable system means the feasibility of the proposed architecture remains unvalidated.
3. Narrow benchmark scope: ControlBench focuses on cybersecurity scenarios. Controllability failures in healthcare, finance, autonomous vehicles, or multi-agent coordination are unaddressed.
4. Missing formal treatment: The five CAS properties are stated informally. Formal definitions (e.g., when does a system satisfy "persistence"?) would strengthen the contribution and enable rigorous evaluation.
5. Limited model diversity: Testing only GPT-5.2 leaves open whether findings generalize across model families, sizes, and training approaches.
6. The paper doesn't adequately address how controllability interacts with capability: more capable models might be simultaneously harder to control and better at appearing controlled.
Overall Assessment
This is a well-articulated position paper that identifies a genuine and timely gap in AI safety research. Its conceptual contribution — elevating controllability as distinct from alignment — is its strongest element. The empirical evidence is directionally supportive but methodologically thin for the strength of claims made. The CAS framework is a reasonable starting point but needs formalization and implementation to drive the field forward. The paper is likely to be cited and to influence framing in the agentic AI safety community, though its direct technical impact is limited by the absence of concrete solutions.
Generated May 27, 2026
Comparison History (19)
Paper 1 addresses a fundamental conceptual gap in AI safety—distinguishing controllability from alignment—which has broad implications for the entire field of AI governance and system design. It introduces both a benchmark and an architectural framework, making it actionable. Given the timeliness of AI safety concerns with increasingly autonomous agents, this reframing could influence policy, standards, and future system architectures. Paper 2, while rigorous and useful, is a more incremental benchmark contribution focused on evaluating existing multimodal agent capabilities in a specific evaluation paradigm.
Paper 1 is likely higher impact due to a concrete, broadly useful benchmark and dataset enabling systematic study of harness (execution-layer) effects across models—an under-measured, timely factor in real-world agent performance. It offers methodological rigor (sandboxed tasks, oracle-checkable validators, large trajectory set, rich traces) and clear applications for evaluating/debugging agent stacks, improving reliability, efficiency, and auditability. Paper 2 is important conceptually and timely for safety, but as a position piece its impact depends more on subsequent adoption and empirical depth; its benchmark scope appears narrower and tied to a specific agent setup.
Paper 2 offers a more immediate and actionable scientific impact by shifting the AI safety paradigm from theoretical alignment to practical runtime controllability. It introduces a concrete benchmark (ControlBench) and architectural framework, which are highly likely to drive experimental follow-up work and citations. While Paper 1 provides a valuable conceptual framework for AGI evaluation, Paper 2 addresses urgent, real-world deployment risks of agentic AI with empirical tools and testable methodologies.
Paper 1 presents a concrete technical contribution (CODE) addressing a well-defined problem (Epistemic Dissonance in knowledge editing) with strong empirical results showing dramatic improvements (self-refutation reduced from 95.6% to 1.8%, multi-hop accuracy up to 83.5%). It introduces a novel paradigm shift from static fact overwriting to causal editing with a reproducible method. Paper 2 is a position paper that, while raising important conceptual points about controllability vs. alignment, offers a preliminary benchmark and architectural framework without deep technical solutions. Paper 1's methodological rigor and immediately actionable contributions give it higher near-term scientific impact.
LaneRoPE introduces a concrete, novel technical contribution—a positional encoding scheme enabling inter-sequence collaboration during parallel LLM generation—with demonstrated empirical gains on reasoning tasks and minimal architectural overhead. This addresses a practical bottleneck in test-time scaling, a highly active research area, and offers a broadly applicable method. Paper 1 raises important conceptual points about AI controllability but is primarily a position paper with a benchmark; its contributions are more framework-oriented and less technically novel. Paper 2's actionable method with clear integration path gives it higher near-term scientific impact.
Paper 2 addresses a critical and highly timely issue in AI safety by shifting the focus from alignment to runtime controllability for autonomous agents. By introducing a new conceptual framework and benchmark, it has the potential to broadly influence the design and deployment of AI systems across many domains. Paper 1 is methodologically rigorous but focuses on a narrower niche (RAG evaluation standards), limiting its overall breadth of impact compared to foundational AI safety paradigms.
Paper 2 likely has higher scientific impact: it reframes AI safety around “controllability” (a timely, field-wide concern for agentic systems), proposes a clear definition, introduces a benchmark (ControlBench) to operationalize the concept, and offers architectural principles that can influence research across alignment, security, HCI, and systems. While Paper 1 is methodologically rigorous and practically useful for coding agents, its impact is narrower (software-engineering agents and skill/memory management) and more incremental relative to fast-moving agent-optimization work.
Paper 1 addresses a fundamental gap in AI safety by distinguishing controllability from alignment—a timely and critical issue as agentic AI systems proliferate. It introduces a concrete benchmark (ControlBench), proposes actionable architectural principles, and targets a problem with immediate real-world safety implications. Paper 2 offers a valuable psychometric framework for evaluating LLM emotional intelligence, but its domain is narrower and less urgent. The controllability framework has broader cross-field impact (safety, governance, deployment) and addresses a more pressing need as autonomous AI agents become widespread.
Paper 2 offers a critical paradigm shift in AI safety, moving beyond static alignment to active runtime controllability. As autonomous AI agents become ubiquitous, ensuring they can be reliably interrupted and redirected is an urgent challenge with massive real-world and policy implications. While Paper 1 makes strong methodological contributions to embodied AI and LMMs, Paper 2's foundational framework and benchmark for agentic controllability address a more universally pressing bottleneck across the broader artificial intelligence landscape, giving it a wider and potentially more transformative scientific impact.
Paper 2 provides a novel mechanistic insight into why chain-of-thought prompting works, revealing that local token co-occurrence rather than logical reasoning drives much of the gain. This fundamentally challenges prevailing assumptions about CoT and has broad implications for understanding LLM reasoning, prompt engineering, and interpretability research. Paper 1 addresses an important AI safety topic but is more of a position/framework paper proposing architectural principles and a benchmark, which, while valuable, offers less surprising empirical insight. Paper 2's findings are more likely to redirect significant research efforts across the NLP community.
Paper 2 proposes a paradigm shift in AI safety, advocating for 'controllability' over traditional 'alignment' and introduces a novel benchmark and architectural framework. This foundational reframing addresses critical gaps in autonomous agent deployment and is likely to inspire broad future research and policy discussions. Paper 1, while methodologically rigorous, addresses a more specific, narrower technical problem (safe fine-tuning via adapters) and thus has a more constrained potential impact.
Paper 1 exposes a fundamental flaw in LLM evaluation metrics regarding compositional reasoning. By introducing the double-gate protocol to isolate atomic knowledge from reasoning ability, it provides a rigorous, actionable methodology that directly impacts how the field assesses multi-hop reasoning. While Paper 2 offers a timely conceptual shift for AI safety, Paper 1's concrete empirical findings and novel diagnostic tools will likely drive more immediate, widespread methodological changes across ML research.
Paper 2 addresses AI safety controllability, a timely and high-impact topic given rapid AI deployment. It introduces a new benchmark (ControlBench) and proposes an architectural framework for controllable AI, with broad implications across AI safety, policy, and system design. While Paper 1 makes a rigorous theoretical contribution by correcting flaws in commutative factor detection algorithms, it targets a narrow subfield (lifted probabilistic inference). Paper 2's breadth of impact, real-world relevance, and timeliness give it higher potential scientific impact.
Paper 2 addresses a fundamental gap in AI safety by arguing that controllability should be a first-class objective beyond alignment. This reframing has broader impact across the entire AI safety field, influencing policy, architecture design, and deployment practices. It introduces a benchmark (ControlBench) and architectural framework applicable to all agentic AI systems. Paper 1, while technically solid with strong empirical results on multi-agent coordination, addresses a narrower optimization problem. Paper 2's timeliness—given rapid deployment of agentic AI—and its potential to reshape safety paradigms give it higher impact potential.
Paper 1 offers a concrete, technically novel platform enabling verifiable, deterministic evaluation and massively parallel RL for mobile GUI agents, plus a sizeable benchmark and evidence of sim-to-real transfer—likely to be broadly adopted and to accelerate empirical research. Its methodological contribution (structured state, deterministic judges, scalable rollouts) has clear real-world applications and cross-field utility (HCI, RL, agents, evaluation). Paper 2 is timely and important conceptually, but as a position paper its impact depends on downstream adoption of proposed frameworks/benchmarks and is less directly enabling than a widely usable, validated experimental platform.
Paper 1 addresses a highly critical and timely issue in AI: the safety and real-time controllability of autonomous agents. By shifting the paradigm from mere alignment to runtime controllability and introducing a new benchmark (ControlBench), it lays a foundational framework that could significantly influence AI safety research, policy, and the deployment of agentic systems. Paper 2, while offering a strong methodological contribution for test-time reasoning in RL agents, focuses on a narrower algorithmic problem, making its potential impact less broad than the systemic safety challenges addressed in Paper 1.
Paper 1 addresses a fundamental paradigm shift in AI safety, moving from alignment to runtime controllability for agentic systems. By introducing a new conceptual framework and benchmark for a critical and timely issue, it has the potential to broadly influence AI safety research, policy, and system design. Paper 2, while highly practical and rigorous, focuses on a narrower systems engineering issue regarding benchmarking measurement bias.
Paper 1 offers a more technically novel and empirically grounded contribution: it analyzes retrying as an information-leaking control mechanism under adversarial models, proposes and disentangles resampling design choices, and reports concrete benchmark gains plus contradictions with prior findings—likely to change how agent safety scaffolds are built. Paper 2 is timely and broad, and the controllability framing/controlbench could influence agendas, but as a position paper its methodological rigor and immediate actionable technical advances appear lower than Paper 1’s, suggesting lower near-term scientific impact.
Paper 2 likely has higher impact: it reframes a central AI-safety problem (from alignment to runtime controllability), proposes concrete definitions and an architectural framework, and introduces a benchmark aimed at broadly applicable, high-stakes agentic settings. This is timely given rapid deployment of tool-using agents, and its concepts could influence research across ML safety, systems, HCI, and governance. Paper 1 is strong and practical but is a domain-specific LLM fine-tuning contribution (LoRA + curated data + benchmark) with narrower cross-field reach and less fundamental methodological novelty.