D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi
Abstract
Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory effectively predicts probe failure, providing a proxy for sample difficulty. Building on this analysis, we propose D²-Monitor, a bi-level safety monitor for D-LLMs. D²-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildguardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, D²-Monitor achieves state-of-the-art performance with a compact parameter footprint (≤0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.
AI Impact Assessments
Scientific Impact Assessment: D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing
1. Core Contribution
This paper addresses safety monitoring for diffusion large language models (D-LLMs), a nascent but rapidly growing model family. The key insight is that D-LLMs' iterative denoising process exposes a multi-step trajectory of hidden states, and that safety hesitation — defined as intermediate hidden states repeatedly falling near a probe's decision boundary — serves as a strong proxy for sample difficulty. The authors propose D²-Monitor, a bi-level cascade system: a lightweight linear probe runs always-on, computing per-step margins to estimate hesitation severity; when the number of hesitation steps exceeds a threshold, a heavier probe (MLP or temporal attention) is triggered for second-stage classification.
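The cascade described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the margin width `delta`, the step threshold `tau`, and the `heavy_probe` callable are all hypothetical, and the base decision rule (sign of the mean per-step score) is an assumption, since the review does not say how the lightweight probe aggregates its per-step outputs.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Trajectory = Sequence[Vector]  # one hidden-state vector per denoising step


def probe_scores(trajectory: Trajectory, w: Vector, b: float) -> List[float]:
    """Signed linear-probe score at every denoising step."""
    return [sum(wi * hi for wi, hi in zip(w, h)) + b for h in trajectory]


def hesitation_steps(scores: Sequence[float], delta: float) -> int:
    """Count steps whose score lies within `delta` of the decision boundary (0)."""
    return sum(1 for s in scores if abs(s) < delta)


def monitor(
    trajectory: Trajectory,
    w: Vector,
    b: float,
    delta: float,
    tau: int,
    heavy_probe: Callable[[Trajectory], bool],
) -> Tuple[bool, bool]:
    """Return (flagged_unsafe, routed_to_heavy_probe)."""
    if not trajectory:
        raise ValueError("trajectory must contain at least one step")
    scores = probe_scores(trajectory, w, b)
    # Route to the heavier probe only when hesitation exceeds the threshold.
    if hesitation_steps(scores, delta) > tau:
        return heavy_probe(trajectory), True
    # Assumption: the base decision is the sign of the mean per-step score.
    return sum(scores) / len(scores) > 0.0, False
```

In this sketch the heavier probe runs only on trajectories with more than `tau` near-boundary steps, so its cost is paid for a subset of inputs while the linear probe stays always-on.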
The novelty is threefold: (1) the mechanistic discovery that margin-based hesitation severity in D-LLM trajectories predicts probe failure; (2) a training pipeline that uses out-of-fold scoring to curate hesitation trajectories for training the advanced probe; and (3) a dynamic routing mechanism that allocates compute in proportion to difficulty. This is distinct from prior bi-level monitoring work in AR-LLMs (e.g., Constitutional Classifiers++, McKenzie et al.), which pairs lightweight probes with external LLMs rather than using trajectory-intrinsic routing signals.
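One way to picture the out-of-fold curation step in item (2) is the sketch below. It is an illustration under stated assumptions, not the paper's pipeline: `fit_probe` is a hypothetical caller-supplied trainer returning linear weights and a bias, the round-robin fold assignment is arbitrary, and `delta` and `tau` are hypothetical margin and step-count thresholds.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Trajectory = Sequence[Vector]
# Hypothetical trainer: (trajectories, labels) -> (weights, bias)
FitProbe = Callable[[List[Trajectory], List[int]], Tuple[Vector, float]]


def curate_hesitation_set(
    trajectories: Sequence[Trajectory],
    labels: Sequence[int],
    fit_probe: FitProbe,
    k: int,
    delta: float,
    tau: int,
) -> List[int]:
    """Indices of samples whose out-of-fold hesitation count exceeds `tau`."""
    n = len(trajectories)
    selected: List[int] = []
    for fold in range(k):
        # Round-robin split (an assumption): sample i belongs to fold i % k.
        train = [i for i in range(n) if i % k != fold]
        held_out = [i for i in range(n) if i % k == fold]
        w, b = fit_probe([trajectories[i] for i in train],
                         [labels[i] for i in train])
        for i in held_out:
            # Score each held-out trajectory with a probe that never saw it.
            scores = [sum(wi * hi for wi, hi in zip(w, h)) + b
                      for h in trajectories[i]]
            if sum(1 for s in scores if abs(s) < delta) > tau:
                selected.append(i)
    return sorted(selected)
```

Scoring each sample with a probe trained on the other folds keeps the hesitation estimate from being flattered by training-set fit, which is the usual motivation for out-of-fold scoring.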
2. Methodological Rigor
The experimental design is thorough and well controlled.
One methodological strength is the counterfactual validation: the authors measure the accuracy gap between base and advanced probes on routed subsets, confirming that margin-based routing isolates samples that genuinely benefit from the heavier probe (13.5% gap vs. ~7% for entropy/confidence). The adversarial fraction analysis (Appendix E.3) adds semantic interpretability — hesitation severity selectively captures adversarially designed prompts.
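The accuracy-gap check described above can be written down as a short sketch. The function and argument names are hypothetical, and the data in the usage note is toy data, not the paper's results.

```python
from typing import List, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    return sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)


def routed_accuracy_gap(
    base_preds: Sequence[int],
    advanced_preds: Sequence[int],
    labels: Sequence[int],
    routed: Sequence[bool],
) -> float:
    """Advanced-probe accuracy minus base-probe accuracy on routed samples."""
    idx: List[int] = [i for i, r in enumerate(routed) if r]
    if not idx:
        raise ValueError("no samples were routed")
    y = [labels[i] for i in idx]
    base_acc = accuracy([base_preds[i] for i in idx], y)
    adv_acc = accuracy([advanced_preds[i] for i in idx], y)
    return adv_acc - base_acc
```

For example, with labels `[1, 0, 1, 1]`, base predictions `[0, 0, 0, 1]`, advanced predictions `[1, 0, 1, 1]`, and routing mask `[True, False, True, False]`, the gap on the routed subset is 1.0. A larger gap under one routing signal than another suggests that signal is better at isolating samples the heavier probe actually helps with.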
However, statistical significance testing beyond seed robustness (Table 7) is limited, and confidence intervals are not provided for the main results.
3. Potential Impact
Immediate practical value: With ≤0.85M parameters and inference times comparable to single-step methods, D²-Monitor is deployable in resource-constrained settings (edge devices, user-side monitoring). This is timely given Mercury 2's 1009 tokens/sec speed — safety monitoring must match this throughput.
Broader implications:
Limitations on impact: D-LLMs are still an emerging paradigm, and the practical deployment base is small relative to AR-LLMs. The paper only tests models up to 16B parameters. The authors acknowledge vulnerability to adaptive adversaries who could suppress hesitation steps to evade routing — a significant concern for real-world deployment that remains unaddressed experimentally.
4. Timeliness & Relevance
This paper is exceptionally well-timed. D-LLMs have gained rapid momentum (Mercury 2 commercial deployment, LLaDA 2.0 at 100B scale), yet their safety infrastructure lags far behind AR-LLMs. The paper correctly identifies this gap and provides the first systematic study of probe-based safety monitoring for D-LLMs. The observation that alignment alone is insufficient (citing adversarial attack vulnerability) motivates the need for external monitors. Given regulatory pressure (EU AI Act, executive orders) requiring safety guardrails, lightweight monitoring solutions like D²-Monitor address a genuine deployment need.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Summary
D²-Monitor makes a solid contribution at the intersection of D-LLM safety and efficient monitoring. The hesitation severity concept is well-grounded, the experimental evaluation is thorough within its scope, and the practical implications for deployment are clear. The main limitations are scale and adversarial robustness, both acknowledged by the authors. This paper will likely serve as a foundational reference for safety monitoring in D-LLMs.
Generated May 26, 2026
Comparison History (21)
Paper 2 is likely higher impact because it identifies a broad, protocol-dependent measurement confound in LLM confidence calibration that affects many benchmarks, model families, and downstream uses (evaluation, uncertainty estimation, decision-making). Its findings generalize across AR models and provide actionable guidance (a reporting checklist) that can reshape community standards and improve reproducibility. Paper 1 is novel and timely for diffusion LLM safety, but its impact is narrower (limited to D-LLMs and a specific monitoring architecture) and depends on the adoption trajectory of diffusion LLMs.
Paper 1 likely has higher impact due to stronger timeliness and broader applicability: safety monitoring is a cross-cutting requirement for deploying LLMs, and diffusion LLMs are an emerging paradigm with underexplored safety tooling. The hesitation-aware routing idea leverages unique trajectory signals in D-LLMs and offers a practical efficiency–effectiveness trade-off with lightweight always-on monitoring, making real-world adoption plausible. Paper 2 is methodologically interesting for agent RL skill internalization, but is narrower in scope (specific benchmarks/settings) and its near-term deployment relevance is more limited than safety monitoring.
Paper 1 targets an emerging, broadly relevant problem—safety monitoring for diffusion LLMs—introducing a novel trajectory-based “hesitation” signal and an efficient dynamic routing monitor with strong empirical validation across multiple datasets and models. Its impact could span AI safety, model monitoring, and deployment governance, and it is timely given rapid diffusion-model adoption. Paper 2 addresses an important applied domain (battery health forecasting) with solid Transformer innovations and clear real-world value, but its scope is narrower and more domain-specific, likely limiting breadth of cross-field impact.
Paper 2 addresses a novel and timely problem: safety monitoring for diffusion LLMs, an emerging architecture with limited prior safety research. It introduces a principled concept (safety hesitation), a practical framework (D²-Monitor), and demonstrates state-of-the-art results across multiple datasets and models with strong baselines. Paper 1, while offering useful practical insights on harness sensitivity, is limited by its single-model-per-tier design (432 runs but n=1 per tier), which reduces generalizability. Paper 2 has broader impact potential across AI safety, diffusion models, and efficient monitoring, with stronger methodological rigor.
Paper 1 addresses safety and alignment in Large Reasoning Models (LRMs) and Chain-of-Thought, a highly relevant and rapidly growing area of AI research. Understanding how CoT interacts with activation steering and refusal offers critical insights for AI security. Paper 2 focuses on Diffusion LLMs, which currently have much less widespread adoption than autoregressive LRMs. The immediate relevance, broader applicability to state-of-the-art models, and novel insights into CoT's role in model control make Paper 1 more scientifically impactful.
Paper 2 is likely higher impact due to stronger novelty and timeliness: it targets diffusion LLM safety monitoring, an underexplored and rapidly emerging model class, and leverages a distinctive diffusion-specific signal (trajectory “hesitation”) unavailable to AR-LLMs. The proposed dynamic routing monitor is broadly applicable to safety deployment with clear real-world benefits (efficient always-on moderation) and general relevance across many D-LLMs and datasets. Paper 1 addresses an important medical agent issue, but its impact may be narrower (medical tool orchestration) and more incremental relative to existing RL-based tool-selection work.
Paper 1 addresses a highly timely and critical issue (safety monitoring) for an emerging foundation model architecture (diffusion LLMs). Its novel approach exploiting trajectory-level 'safety hesitation' offers an efficient, scalable solution. Given the massive deployment and societal implications of LLMs, safety mechanisms have broader immediate real-world applications and cross-field relevance compared to the more specialized multi-agent RL focus of Paper 2.
Paper 2 addresses a highly timely and critical challenge: safety monitoring for emerging Diffusion LLMs. By leveraging the unique multi-step denoising process to identify 'safety hesitation,' it introduces a novel and efficient dynamic routing mechanism for content moderation. Given the massive real-world implications of LLM safety and the rapid growth of generative AI, this work offers broader immediate applications and cross-disciplinary relevance compared to Paper 1. While Paper 1 demonstrates impressive algorithmic scaling for POMDPs, Paper 2's focus on AI safety alignment provides a higher potential for widespread societal and scientific impact.
Paper 2 addresses long-horizon reasoning and memory in LLM agents, a highly active and broadly applicable area in current AI research. Its state-adaptive memory framework offers scalable improvements for agentic systems across diverse domains. While Paper 1 presents a novel safety approach for diffusion LLMs, its impact is currently limited by the narrower adoption of D-LLMs compared to standard autoregressive models used in widespread agent architectures.
Paper 1 is more novel and impactful: it introduces a diffusion-LLM-specific safety monitoring paradigm leveraging intermediate denoising trajectories and a hesitation-based difficulty signal, then turns it into a practical, compute-adaptive routing system with strong multi-model, multi-dataset results. This has clear real-world deployment relevance (efficient, always-on safety) and broad implications for monitoring other iterative generative models. Paper 2 provides an important diagnostic/metrics insight (median vs mean CE) but is more incremental and primarily affects evaluation/reporting practices rather than enabling new system capabilities.
Paper 1 is more likely to have higher scientific impact due to its novel, diffusion-specific safety signal (trajectory “hesitation”) and a clear, efficient bi-level routing mechanism with strong empirical validation across multiple datasets and D-LLMs. The work is timely (LLM safety) and broadly applicable to monitoring/guardrailing generative models, potentially influencing both research and deployment practices. Paper 2 is innovative and application-relevant for engineering design, but its evidence is limited to two case studies with moderate success rates and higher system-level brittleness, which may constrain generalizability and near-term uptake.
Paper 1 addresses a timely and novel problem—safety monitoring for diffusion-based LLMs—with a rigorous methodology, introducing the concept of 'safety hesitation' and a dynamic routing mechanism. It is evaluated across multiple datasets and models, demonstrating state-of-the-art results. The topic is highly relevant given growing concerns about AI safety and the emergence of diffusion LLMs. Paper 2 is a review/clarification of axiomatic design problem formulation, offering practical guidance but limited novelty, as it primarily revisits existing literature rather than introducing new methods or empirical findings.
Paper 1 addresses a novel and timely problem—safety monitoring for diffusion LLMs—an emerging architecture with growing interest. It introduces a creative concept (safety hesitation) and a practical dynamic routing mechanism, combining novelty with methodological rigor across multiple datasets and models. Paper 2 contributes a useful benchmark for multi-page document parsing, but benchmarks generally have narrower impact unless they become widely adopted standards. Paper 1's contribution to AI safety for a new paradigm of language models has broader implications and higher potential to influence future research directions.
Paper 1 is more novel and timely: it targets safety monitoring for diffusion LLMs, an emerging model class with distinct intermediate-state signals, and introduces a hesitation-aware, dynamically routed monitor with strong efficiency–effectiveness trade-offs. It is evaluated across multiple datasets and D-LLMs with competitive baselines, suggesting solid methodological rigor and broader ML safety relevance. Paper 2 is useful for BPM practice and improves text-to-BPMN modeling with resource awareness, but its impact is more domain-specific and less likely to generalize broadly across fields.
Paper 2 addresses a highly timely and widely relevant problem: the fragility of autonomous computer-use agents in real-world, dynamic environments. With the rapid deployment of MLLM-based agents, benchmarking and improving their robustness to common UI corruptions has immediate, broad real-world applicability. In contrast, Paper 1 is highly innovative but focuses on diffusion LLMs, which currently have a narrower adoption footprint compared to autoregressive models and agentic workflows, leading to a comparatively more niche impact.
Paper 1 likely has higher impact due to its broad, timely benchmark infrastructure for evaluating LLMs on scalable algorithm design in realistic large-scale optimization—an area with clear real-world stakes across industries (logistics, energy, scheduling) and multiple research communities (LLMs/agents, OR, benchmarking, software engineering). Its methodological rigor (expert-derived tasks, standardized instances, hidden evaluation) supports durable, field-wide adoption. Paper 2 is novel and relevant for diffusion-LLM safety, but targets a narrower model class and application niche; its techniques may be less broadly reusable than a benchmark that can steer progress across many models and tasks.
Paper 2 likely has higher scientific impact: it proposes a general, standards-based, executable governance infrastructure (RDF/OWL, SHACL, PROV-O) with a formal model and a “regulatory compiler,” enabling scalable compliance across many AI systems and domains. Its applications span critical infrastructure and regulation, giving broad cross-field relevance (AI, law, semantic web, systems governance) and strong timeliness amid expanding AI governance requirements. Paper 1 is novel and rigorous for diffusion-LLM safety monitoring, but its impact is narrower to a specific model class and monitoring setting.
Paper 2 establishes a rigorous, foundational mathematical framework for widely used deep learning architectures (CNNs, ResNets) using lattice theory. By providing theoretical explanations for the representational power of depth and proposing novel idempotent layer designs, it offers profound, long-lasting theoretical contributions. Paper 1, while highly practical and timely for AI safety, focuses on a specific and currently niche subfield (diffusion LLMs), limiting its broader scientific impact compared to the overarching theoretical advancements of Paper 2.
Paper 2 addresses AI safety in emerging Diffusion LLMs, a highly active and critical field with broad societal and interdisciplinary implications. While Paper 1 offers valuable contributions to hardware verification, its impact is largely confined to a specialized niche. The focus on efficient safety monitoring in generative AI gives Paper 2 broader relevance and higher potential for widespread scientific impact.
Paper 2 likely has higher impact due to timeliness and broad real-world relevance: safety monitoring for diffusion LLMs is an emerging, high-stakes need with immediate deployment pathways. It introduces a clear, general mechanism (hesitation-aware routing) that leverages diffusion-specific trajectory signals, validated across multiple datasets and models with strong efficiency claims. Paper 1 is innovative and strong on label-free diagnostics and inference-time gains, but appears more niche (recursive reasoning/specific tasks) and may have narrower applicability beyond structured reasoning benchmarks.