$D^2$-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

Aoxi Liu, Yupeng Chen, James Oldfield, Guanzhe Hong, Junchi Yu, Baoyuan Wu, Philip Torr, Adel Bibi

May 25, 2026

arXiv:2605.25893v1

cs.AI (primary)

#1425 of 2682 · Artificial Intelligence

Tournament Score: 1402 ± 42

Win Rate: 62%

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Abstract

Despite the emergence of diffusion large language models (D-LLMs) as an alternative to autoregressive large language models (AR-LLMs), safety monitoring for D-LLMs remains largely unexplored. Unlike AR-LLMs, D-LLMs generate text through a multi-step denoising process, exposing intermediate hidden representations that may contain safety-relevant information unavailable in standard single-step monitoring setups. Motivated by the suitability of lightweight probes for always-on monitoring, we analyze which trajectory-level signals best indicate when such probes are likely to struggle. We find that the most informative signal is safety hesitation: intermediate hidden states repeatedly falling within a small margin of the probe's decision boundary. The number of such hesitation steps in a D-LLM's trajectory predicts probe failure effectively, providing a proxy for sample difficulty. Building on this analysis, we propose $D^{2}$-Monitor, a bi-level safety monitor for D-LLMs. $D^{2}$-Monitor adopts a lightweight probe as an always-on monitor to jointly estimate hesitation and perform base classification. When the hesitation level exceeds a threshold, a more expressive but computationally heavier probe is activated. This dynamic routing mechanism allocates monitoring resources efficiently at test time. Evaluated on 3 datasets (WildGuardMix, ToxicChat, OpenAI-Moderation) across 4 D-LLMs, $D^{2}$-Monitor achieves state-of-the-art performance with a compact parameter footprint ($\leq$ 0.85M parameters), and exhibits the best trade-off between effectiveness and efficiency relative to 8 baselines.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: D²-Monitor: Dynamic Safety Monitoring for Diffusion LLMs via Hesitation-Aware Routing

1. Core Contribution

This paper addresses safety monitoring for diffusion large language models (D-LLMs), a nascent but rapidly growing model family. The key insight is that D-LLMs' iterative denoising process exposes a multi-step trajectory of hidden states, and that safety hesitation — defined as intermediate hidden states repeatedly falling near a probe's decision boundary — serves as a strong proxy for sample difficulty. The authors propose D²-Monitor, a bi-level cascade system: a lightweight linear probe runs always-on, computing per-step margins to estimate hesitation severity; when the number of hesitation steps exceeds a threshold, a heavier probe (MLP or temporal attention) is triggered for second-stage classification.
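The cascade described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `hesitation_steps`, `monitor`, `eps` (margin band), `lam` (hesitation threshold), and `advanced_probe` are hypothetical, and taking the base decision from the final denoising step is an assumption the review does not confirm.

```python
import numpy as np

def hesitation_steps(trajectory, w, b, eps):
    """Return per-step probe scores and the number of steps with |score| < eps."""
    scores = trajectory @ w + b                  # one linear-probe score per denoising step
    count = int(np.sum(np.abs(scores) < eps))    # steps that fall inside the margin band
    return scores, count

def monitor(trajectory, w, b, eps, lam, advanced_probe):
    """Bi-level decision: base probe by default, advanced probe when hesitation exceeds lam."""
    scores, count = hesitation_steps(trajectory, w, b, eps)
    if count > lam:
        # Hard sample: hand the whole trajectory to the heavier probe.
        return bool(advanced_probe(trajectory)), "advanced", count
    # Assumption: the base decision is the sign of the final-step score.
    return bool(scores[-1] > 0), "base", count
```

Here `trajectory` is an array of shape (steps, hidden_dim) and `advanced_probe` is any callable that returns an unsafe/safe decision; raising `lam` routes fewer samples to the heavier probe.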

The novelty is threefold: (1) the mechanistic discovery that margin-based hesitation severity in D-LLM trajectories predicts probe failure; (2) a training pipeline that uses out-of-fold scoring to curate hesitation trajectories for training the advanced probe; and (3) a dynamic routing mechanism that allocates compute proportional to difficulty. This is distinct from prior bi-level monitoring work in AR-LLMs (e.g., Constitutional Classifiers++, McKenzie et al.) which pair lightweight probes with external LLMs rather than using trajectory-intrinsic routing signals.
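Out-of-fold scoring, mentioned in point (2), means every training sample is scored by a probe that never saw it during fitting. The sketch below illustrates the idea under simplifying assumptions: each sample is a single feature vector rather than a full trajectory, a least-squares linear probe stands in for whatever probe the authors train, and `out_of_fold_scores`, `select_hesitating`, and `eps` are hypothetical names.

```python
import numpy as np

def out_of_fold_scores(X, y, n_folds=5, seed=0):
    """Score each sample with a linear probe fitted on the other folds only."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    Xb = np.hstack([X, np.ones((len(X), 1))])    # append a bias column
    scores = np.empty(len(X))
    for k, held_out in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        # Stand-in probe: least squares on +/-1 targets (an assumption, not the paper's probe).
        coef, *_ = np.linalg.lstsq(Xb[train], 2.0 * y[train] - 1.0, rcond=None)
        scores[held_out] = Xb[held_out] @ coef
    return scores

def select_hesitating(scores, eps):
    """Indices of samples whose out-of-fold score lies within eps of the boundary."""
    return np.flatnonzero(np.abs(scores) < eps)
```

Samples returned by `select_hesitating` would then form the training pool for the advanced probe, without the base probe having been fitted on the samples it is used to flag.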

2. Methodological Rigor

The experimental design is thorough and well-controlled:

Breadth: Evaluation spans 3 datasets (WildGuardMix, ToxicChat, OpenAI-Moderation), 4 D-LLMs (LLaDA-8B-Base/Instruct, LLaDA-1.5, LLaDA-2.0-mini), and 8 baselines covering single-step, full-trajectory, and sequence-based probes.

Fair comparisons: The authors carefully control for training data size, normalization, and architectural capacity. Out-of-fold scoring prevents data leakage when identifying hesitation trajectories.

Ablations are comprehensive: The paper ablates routing signal choice (margin vs. entropy vs. confidence), hesitation threshold sensitivity, generation length, step length, remasking strategy, and random seeds. The margin signal consistently outperforms alternatives, with a clear mechanistic explanation (probe-intrinsic vs. probe-extrinsic uncertainty).

Efficiency metrics: Beyond accuracy/F1, the paper reports expected parameters (E[P]), FLOPs (analytically derived), inference time, and false rejection rate, giving a multi-dimensional view of the efficiency-effectiveness tradeoff.
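The review does not spell out how expected parameters are computed. One natural reading for a two-stage cascade, stated here as an assumption, is that the base probe is always charged and the advanced probe is charged only for routed samples. The symbols $P_{\text{base}}$, $P_{\text{adv}}$, $\rho$, and $c(x)$ are introduced for illustration; $\lambda$ is the hesitation threshold.

```latex
\mathbb{E}[P] \;=\; P_{\text{base}} \;+\; \rho \, P_{\text{adv}},
\qquad
\rho \;=\; \Pr\bigl( c(x) > \lambda \bigr)
```

Here $c(x)$ is the hesitation-step count of input $x$ and $\rho$ is the fraction of inputs routed to the advanced probe. Under this reading, a larger $\lambda$ lowers $\rho$ and therefore $\mathbb{E}[P]$.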

One methodological strength is the counterfactual validation: the authors measure the accuracy gap between base and advanced probes on routed subsets, confirming that margin-based routing isolates samples that genuinely benefit from the heavier probe (13.5% gap vs. ~7% for entropy/confidence). The adversarial fraction analysis (Appendix E.3) adds semantic interpretability — hesitation severity selectively captures adversarially designed prompts.
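The counterfactual check described above amounts to comparing both probes on the routed subset only. A minimal sketch, assuming hard 0/1 predictions and a boolean routing mask; the function name `routed_accuracy_gap` is hypothetical.

```python
import numpy as np

def routed_accuracy_gap(y_true, base_pred, adv_pred, routed):
    """Advanced-minus-base accuracy, measured only on samples the router selected."""
    y_true = np.asarray(y_true)
    base_pred = np.asarray(base_pred)
    adv_pred = np.asarray(adv_pred)
    routed = np.asarray(routed, dtype=bool)
    if not routed.any():
        raise ValueError("no samples were routed")
    base_acc = np.mean(base_pred[routed] == y_true[routed])
    adv_acc = np.mean(adv_pred[routed] == y_true[routed])
    return float(adv_acc - base_acc)
```

A large positive gap indicates that the routing signal picks out samples where the heavier probe actually helps; a gap near zero would mean the extra compute is spent on samples the base probe already handles.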

However, statistical significance testing beyond seed robustness (Table 7) is limited, and confidence intervals are not provided for the main results.

3. Potential Impact

Immediate practical value: With ≤0.85M parameters and inference times comparable to single-step methods, D²-Monitor is deployable in resource-constrained settings (edge devices, user-side monitoring). This is timely given Mercury 2's 1009 tokens/sec speed — safety monitoring must match this throughput.

Broader implications:

The hesitation concept could generalize beyond safety to other probe-based monitoring tasks in D-LLMs (hallucination detection, factuality checking).

The finding that multi-step trajectories contain richer safety signals than single-step representations establishes a foundation for future D-LLM interpretability research.

The cascade design principle — using intrinsic uncertainty to route compute — could inspire analogous systems in other iterative generative models (e.g., diffusion image models, iterative refinement architectures).

Limitations on impact: D-LLMs are still an emerging paradigm, and the practical deployment base is small relative to AR-LLMs. The paper only tests models up to 16B parameters. The authors acknowledge vulnerability to adaptive adversaries who could suppress hesitation steps to evade routing — a significant concern for real-world deployment that remains unaddressed experimentally.

4. Timeliness & Relevance

This paper is exceptionally well-timed. D-LLMs have gained rapid momentum (Mercury 2 commercial deployment, LLaDA 2.0 at 100B scale), yet their safety infrastructure lags far behind AR-LLMs. The paper correctly identifies this gap and provides the first systematic study of probe-based safety monitoring for D-LLMs. The observation that alignment alone is insufficient (citing adversarial attack vulnerability) motivates the need for external monitors. Given regulatory pressure (EU AI Act, executive orders) requiring safety guardrails, lightweight monitoring solutions like D²-Monitor address a genuine deployment need.

5. Strengths & Limitations

Key Strengths:

Novel and well-motivated signal: Safety hesitation is an elegant concept that bridges D-LLM dynamics with probe uncertainty, supported by rigorous empirical validation (crossing probability, margin persistence).

Excellent efficiency-effectiveness tradeoff: Consistently top-left in Pareto plots across all settings, with 2-150× FLOPs reduction vs. full-trajectory baselines.

Cross-dataset generalization: Performance gains hold when training on WildGuardMix and testing on ToxicChat/OpenAI-Moderation, suggesting the hesitation signal captures model-intrinsic rather than dataset-specific difficulty.

Practical design: The cascade naturally supports budget-constrained deployment via the λ threshold.

Notable Weaknesses:

Scale limitation: No experiments beyond 16B parameters; the trend may not hold at 100B+ scale where D-LLMs become competitive.

Adversarial robustness untested: The paper acknowledges but does not experimentally evaluate adaptive attacks that could suppress hesitation signals.

Limited probe expressivity exploration: Only two advanced probe architectures (MLP, TimeAttn) are tested; the framework's flexibility claim is underexplored.

Binary classification only: Safety is treated as binary safe/unsafe; multi-category or severity-graded detection is not addressed.

D-LLM diversity: All four models are from the LLaDA family; testing on architecturally distinct D-LLMs (e.g., Mercury, MDLM) would strengthen generalization claims.

Summary

D²-Monitor makes a solid contribution at the intersection of D-LLM safety and efficient monitoring. The hesitation severity concept is well-grounded, the experimental evaluation is thorough within its scope, and the practical implications for deployment are clear. The main limitations are scale and adversarial robustness, both acknowledged by the authors. This paper will likely serve as a foundational reference for safety monitoring in D-LLMs.

Rating: 7/10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 8

Generated May 26, 2026

Comparison History (21)

vs. Asking Is Not Enough: Protocol Sensitivity in LLM Confidence Calibration

gpt-5.2 · 5/28/2026

Paper 2 is likely higher impact because it identifies a broad, protocol-dependent measurement confound in LLM confidence calibration that affects many benchmarks, model families, and downstream uses (evaluation, uncertainty estimation, decision-making). Its findings generalize across AR models and provide actionable guidance (a reporting checklist) that can reshape community standards and improve reproducibility. Paper 1 is novel and timely for diffusion LLM safety, but its impact is narrower (limited to D-LLMs and a specific monitoring architecture) and depends on the adoption trajectory of diffusion LLMs.

vs. SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment

gpt-5.2 · 5/28/2026

Paper 1 likely has higher impact due to stronger timeliness and broader applicability: safety monitoring is a cross-cutting requirement for deploying LLMs, and diffusion LLMs are an emerging paradigm with underexplored safety tooling. The hesitation-aware routing idea leverages unique trajectory signals in D-LLMs and offers a practical efficiency–effectiveness trade-off with lightweight always-on monitoring, making real-world adoption plausible. Paper 2 is methodologically interesting for agent RL skill internalization, but is narrower in scope (specific benchmarks/settings) and its near-term deployment relevance is more limited than safety monitoring.

vs. BatteryMFormer: Multi-level Learning for Battery Degradation Trajectory Forecasting

gpt-5.2 · 5/27/2026

Paper 1 targets an emerging, broadly relevant problem—safety monitoring for diffusion LLMs—introducing a novel trajectory-based “hesitation” signal and an efficient dynamic routing monitor with strong empirical validation across multiple datasets and models. Its impact could span AI safety, model monitoring, and deployment governance, and it is timely given rapid diffusion-model adoption. Paper 2 addresses an important applied domain (battery health forecasting) with solid Transformer innovations and clear real-world value, but its scope is narrower and more domain-specific, likely limiting breadth of cross-field impact.

vs. It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

claude-opus-4.6 · 5/27/2026

Paper 2 addresses a novel and timely problem—safety monitoring for diffusion LLMs—an emerging architecture with limited prior safety research. It introduces a principled concept (safety hesitation), a practical framework (D^2-Monitor), and demonstrates state-of-the-art results across multiple datasets and models with strong baselines. Paper 1, while offering useful practical insights on harness sensitivity, is limited by single-model-per-tier design (432 runs but n=1 per tier), reducing generalizability. Paper 2 has broader impact potential across AI safety, diffusion models, and efficient monitoring, with stronger methodological rigor.

vs. Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

gemini-3.1 · 5/27/2026

Paper 1 addresses safety and alignment in Large Reasoning Models (LRMs) and Chain-of-Thought, a highly relevant and rapidly growing area of AI research. Understanding how CoT interacts with activation steering and refusal offers critical insights for AI security. Paper 2 focuses on Diffusion LLMs, which currently have much less widespread adoption than autoregressive LRMs. The immediate relevance, broader applicability to state-of-the-art models, and novel insights into CoT's role in model control make Paper 1 more scientifically impactful.

vs. Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

gpt-5.2 · 5/27/2026

Paper 2 is likely higher impact due to stronger novelty and timeliness: it targets diffusion LLM safety monitoring, an underexplored and rapidly emerging model class, and leverages a distinctive diffusion-specific signal (trajectory “hesitation”) unavailable to AR-LLMs. The proposed dynamic routing monitor is broadly applicable to safety deployment with clear real-world benefits (efficient always-on moderation) and general relevance across many D-LLMs and datasets. Paper 1 addresses an important medical agent issue, but its impact may be narrower (medical tool orchestration) and more incremental relative to existing RL-based tool-selection work.

vs. Partner-Aware Hierarchical Skill Discovery for Robust Human-AI Collaboration

gemini-3.1 · 5/26/2026

Paper 1 addresses a highly timely and critical issue (safety monitoring) for an emerging foundation model architecture (diffusion LLMs). Its novel approach exploiting trajectory-level 'safety hesitation' offers an efficient, scalable solution. Given the massive deployment and societal implications of LLMs, safety mechanisms have broader immediate real-world applications and cross-field relevance compared to the more specialized multi-agent RL focus of Paper 2.

vs. Scaling Observation-aware Planning in Uncertain Domains

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and critical challenge: safety monitoring for emerging Diffusion LLMs. By leveraging the unique multi-step denoising process to identify 'safety hesitation,' it introduces a novel and efficient dynamic routing mechanism for content moderation. Given the massive real-world implications of LLM safety and the rapid growth of generative AI, this work offers broader immediate applications and cross-disciplinary relevance compared to Paper 1. While Paper 1 demonstrates impressive algorithmic scaling for POMDPs, Paper 2's focus on AI safety alignment provides a higher potential for widespread societal and scientific impact.

vs. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

gemini-3.1 · 5/26/2026

Paper 2 addresses long-horizon reasoning and memory in LLM agents, a highly active and broadly applicable area in current AI research. Its state-adaptive memory framework offers scalable improvements for agentic systems across diverse domains. While Paper 1 presents a novel safety approach for diffusion LLMs, its impact is currently limited by the narrower adoption of D-LLMs compared to standard autoregressive models used in widespread agent architectures.

vs. When Mean CE Fails: Median CE Can Better Track Language Model Quality

gpt-5.2 · 5/26/2026

Paper 1 is more novel and impactful: it introduces a diffusion-LLM-specific safety monitoring paradigm leveraging intermediate denoising trajectories and a hesitation-based difficulty signal, then turns it into a practical, compute-adaptive routing system with strong multi-model, multi-dataset results. This has clear real-world deployment relevance (efficient, always-on safety) and broad implications for monitoring other iterative generative models. Paper 2 provides an important diagnostic/metrics insight (median vs mean CE) but is more incremental and primarily affects evaluation/reporting practices rather than enabling new system capabilities.

vs. TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

gpt-5.2 · 5/26/2026

Paper 1 is more likely to have higher scientific impact due to its novel, diffusion-specific safety signal (trajectory “hesitation”) and a clear, efficient bi-level routing mechanism with strong empirical validation across multiple datasets and D-LLMs. The work is timely (LLM safety) and broadly applicable to monitoring/guardrailing generative models, potentially influencing both research and deployment practices. Paper 2 is innovative and application-relevant for engineering design, but its evidence is limited to two case studies with moderate success rates and higher system-level brittleness, which may constrain generalizability and near-term uptake.

vs. A Deep Dive into Axiomatic Design -- Part I: Problem Formulation

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a timely and novel problem—safety monitoring for diffusion-based LLMs—with a rigorous methodology, introducing the concept of 'safety hesitation' and a dynamic routing mechanism. It is evaluated across multiple datasets and models, demonstrating state-of-the-art results. The topic is highly relevant given growing concerns about AI safety and the emergence of diffusion LLMs. Paper 2 is a review/clarification of axiomatic design problem formulation, offering practical guidance but limited novelty, as it primarily revisits existing literature rather than introducing new methods or empirical findings.

vs. MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

claude-opus-4.6 · 5/26/2026

Paper 1 addresses a novel and timely problem—safety monitoring for diffusion LLMs—an emerging architecture with growing interest. It introduces a creative concept (safety hesitation) and a practical dynamic routing mechanism, combining novelty with methodological rigor across multiple datasets and models. Paper 2 contributes a useful benchmark for multi-page document parsing, but benchmarks generally have narrower impact unless they become widely adopted standards. Paper 1's contribution to AI safety for a new paradigm of language models has broader implications and higher potential to influence future research directions.

vs. Beyond Control-Flow: Integrating the Resource Perspective into Multi-Collaborative Process Modeling from Text

gpt-5.2 · 5/26/2026

Paper 1 is more novel and timely: it targets safety monitoring for diffusion LLMs, an emerging model class with distinct intermediate-state signals, and introduces a hesitation-aware, dynamically routed monitor with strong efficiency–effectiveness trade-offs. It is evaluated across multiple datasets and D-LLMs with competitive baselines, suggesting solid methodological rigor and broader ML safety relevance. Paper 2 is useful for BPM practice and improves text-to-BPMN modeling with resource awareness, but its impact is more domain-specific and less likely to generalize broadly across fields.

vs. AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

gemini-3.1 · 5/26/2026

Paper 2 addresses a highly timely and widely relevant problem: the fragility of autonomous computer-use agents in real-world, dynamic environments. With the rapid deployment of MLLM-based agents, benchmarking and improving their robustness to common UI corruptions has immediate, broad real-world applicability. In contrast, Paper 1 is highly innovative but focuses on diffusion LLMs, which currently have a narrower adoption footprint compared to autoregressive models and agentic workflows, leading to a comparatively more niche impact.

vs. FrontierOR: Benchmarking LLMs' Capacity for Efficient Algorithm Design in Large-Scale Optimization

gpt-5.2 · 5/26/2026

Paper 1 likely has higher impact due to its broad, timely benchmark infrastructure for evaluating LLMs on scalable algorithm design in realistic large-scale optimization—an area with clear real-world stakes across industries (logistics, energy, scheduling) and multiple research communities (LLMs/agents, OR, benchmarking, software engineering). Its methodological rigor (expert-derived tasks, standardized instances, hidden evaluation) supports durable, field-wide adoption. Paper 2 is novel and relevant for diffusion-LLM safety, but targets a narrower model class and application niche; its techniques may be less broadly reusable than a benchmark that can steer progress across many models and tasks.

vs. Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems

gpt-5.2 · 5/26/2026

Paper 2 likely has higher scientific impact: it proposes a general, standards-based, executable governance infrastructure (RDF/OWL, SHACL, PROV-O) with a formal model and a “regulatory compiler,” enabling scalable compliance across many AI systems and domains. Its applications span critical infrastructure and regulation, giving broad cross-field relevance (AI, law, semantic web, systems governance) and strong timeliness amid expanding AI governance requirements. Paper 1 is novel and rigorous for diffusion-LLM safety monitoring, but its impact is narrower to a specific model class and monitoring setting.

vs. Lattice theory and algebraic models for deep convolutional learning based on mathematical morphology

gemini-3.1 · 5/26/2026

Paper 2 establishes a rigorous, foundational mathematical framework for widely used deep learning architectures (CNNs, ResNets) using lattice theory. By providing theoretical explanations for the representational power of depth and proposing novel idempotent layer designs, it offers profound, long-lasting theoretical contributions. Paper 1, while highly practical and timely for AI safety, focuses on a specific and currently niche subfield (diffusion LLMs), limiting its broader scientific impact compared to the overarching theoretical advancements of Paper 2.

vs. SpecAlign: A Semantic Alignment Framework for SystemVerilog Assertion Generation

gemini-3.1 · 5/26/2026

Paper 2 addresses AI safety in emerging Diffusion LLMs, a highly active and critical field with broad societal and interdisciplinary implications. While Paper 1 offers valuable contributions to hardware verification, its impact is largely confined to a specialized niche. The focus on efficient safety monitoring in generative AI gives Paper 2 broader relevance and higher potential for widespread scientific impact.

vs. Boosting Inference with Guided Reasoning: Stochastic Exploration for Recursive Models

gpt-5.2 · 5/26/2026

Paper 2 likely has higher impact due to timeliness and broad real-world relevance: safety monitoring for diffusion LLMs is an emerging, high-stakes need with immediate deployment pathways. It introduces a clear, general mechanism (hesitation-aware routing) that leverages diffusion-specific trajectory signals, validated across multiple datasets and models with strong efficiency claims. Paper 1 is innovative and strong on label-free diagnostics and inference-time gains, but appears more niche (recursive reasoning/specific tasks) and may have narrower applicability beyond structured reasoning benchmarks.
