Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

Jun Wen Leong

Jun 10, 2026 · arXiv:2606.11949v1

cs.LG · cs.CR · stat.ML

#3824 of 5669 · cs.LG

Tournament Score: 1357±42 · Win Rate: 45%

Rating: 5.5/10 · Significance 6 · Rigor 7 · Novelty 4 · Clarity 7.5

Abstract

We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS=46/300) but collapses for all other classifiers (ESS~300): logistic density ratio estimation achieves perfect source/target separability in high-dimensional embedding spaces, clipping all importance weights to the floor. DeBERTa shows a gradient from effective correction (paraphrase, ESS=46) to near-total collapse (adversarial suffix, ESS=206). PCA to 32 dimensions breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition reveals classifier (eta^2=0.243), shift type (eta^2=0.237), and their interaction (eta^2=0.185) all contribute substantially to detection latency variance (all p<0.001), indicating per-classifier monitoring profiles are necessary.
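The headline interval can be checked with a few lines of arithmetic. The sketch below assumes a Wilson score interval and treats the 800 cells as independent Bernoulli trials; neither assumption is stated in the abstract, and the function name `wilson_interval` is only illustrative. Under those assumptions the result agrees with the reported [84.1%, 88.8%] to the stated precision.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, trials, confidence=0.95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the abstract: 693 valid detections out of 800 cells.
low, high = wilson_interval(693, 800)
print(f"{693 / 800:.1%} valid detection, 95% CI [{low:.1%}, {high:.1%}]")
```

The same function applied to the 17/20 temporal-jailbreak result would give a much wider interval, which is worth keeping in mind when comparing the regimes.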

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper presents an online monitoring system for detecting distributional shift in deployed LLM safety classifiers, combining sliding-window KS statistics with conformal prediction for post-detection adaptation. The system is evaluated in a pre-registered factorial design crossing 4 classifiers × 5 shift conditions × 20 seeds × 2 window sizes (800 cells). The three main contributions are: (1) a multi-channel shift detection framework achieving 86.6% valid detection with controlled false alarm rates, (2) identification of a density-ratio collapse mechanism in high-dimensional generative embeddings that defeats weighted conformal prediction, and (3) a variance decomposition revealing substantial classifier×shift-type interactions that necessitate per-classifier monitoring profiles.

The problem addressed—silent degradation of safety classifiers under distribution shift—is practically important. The paper correctly identifies that production safety systems typically lack real-time labels, making proactive monitoring essential.

Methodological Rigor

Strengths in experimental design: The pre-registered factorial design is commendable and unusual for this area. The 800-cell evaluation with 20 seeds per condition provides genuine statistical power, and the variance decomposition is a sophisticated analytical choice that reveals interaction effects invisible to single-classifier studies. The paper is unusually transparent about its metrics—the "valid detection" criterion requiring both true detection and clean negative controls is conservative and honest.
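To make the variance decomposition concrete, here is a minimal sketch of eta-squared for a balanced two-way layout (classifier × shift type). It is an illustration only: the paper's design also includes window size and seeds, its exact model is not described here, and the latencies below are hypothetical toy numbers chosen to show a crossover interaction.

```python
from statistics import fmean


def eta_squared_two_way(cells):
    """Eta-squared per effect for a balanced two-way design.

    cells[i][j] is a list of replicate latencies for classifier i, shift j.
    """
    n_a, n_b, n_rep = len(cells), len(cells[0]), len(cells[0][0])
    all_values = [x for row in cells for cell in row for x in cell]
    grand = fmean(all_values)
    mean_a = [fmean([x for cell in row for x in cell]) for row in cells]
    mean_b = [fmean([x for row in cells for x in row[j]]) for j in range(n_b)]
    mean_ab = [[fmean(cell) for cell in row] for row in cells]

    ss_total = sum((x - grand) ** 2 for x in all_values)
    ss_a = n_b * n_rep * sum((m - grand) ** 2 for m in mean_a)
    ss_b = n_a * n_rep * sum((m - grand) ** 2 for m in mean_b)
    ss_ab = n_rep * sum(
        (mean_ab[i][j] - mean_a[i] - mean_b[j] + grand) ** 2
        for i in range(n_a)
        for j in range(n_b)
    )
    return {
        "classifier": ss_a / ss_total,
        "shift": ss_b / ss_total,
        "interaction": ss_ab / ss_total,
    }


# Hypothetical latencies: rows = (encoder, decoder), columns = (paraphrase, adversarial).
toy = [
    [[10, 12], [50, 52]],  # encoder: fast on paraphrase, slow on adversarial
    [[48, 50], [12, 14]],  # decoder: the reverse pattern
]
print(eta_squared_two_way(toy))
```

In this toy case nearly all variance sits in the interaction term, because the main effects cancel out. The paper's reported values (0.243, 0.237, 0.185) sum to 0.665, so roughly a third of the latency variance is attributed to other terms or residual noise.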

Concerns: The methodological novelty is limited. The individual components (sliding-window KS tests, weighted conformal prediction, MMD detection) are all well-established. The paper acknowledges this, positioning itself as an empirical study rather than a methodological contribution, but this limits its theoretical impact. The sliding-window KS variant lacks the time-uniform guarantee of confidence sequences, and the empirical FAR calibration (50 null streams, 97th percentile threshold) is pragmatic but theoretically unsatisfying.
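The empirical calibration described above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: Gaussian draws replace classifier scores, the window and stream lengths are arbitrary, and all function names are hypothetical. The threshold is taken as the 97th percentile of the per-stream maximum KS statistic over 50 null streams, which is one plausible reading of the procedure.

```python
import random
from bisect import bisect_right
from math import ceil


def ks_statistic(sorted_ref, window):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    sorted_win = sorted(window)
    n_ref, n_win = len(sorted_ref), len(sorted_win)
    return max(
        abs(bisect_right(sorted_ref, x) / n_ref - bisect_right(sorted_win, x) / n_win)
        for x in sorted_ref + sorted_win
    )


def window_stats(sorted_ref, stream, window):
    """KS statistic for every full trailing window, in stream order."""
    return [
        ks_statistic(sorted_ref, stream[end - window:end])
        for end in range(window, len(stream) + 1)
    ]


def calibrate_threshold(sorted_ref, null_streams, window, percentile=97):
    """Percentile of the per-stream maximum statistic under no shift."""
    maxima = sorted(max(window_stats(sorted_ref, s, window)) for s in null_streams)
    rank = ceil(percentile / 100 * len(maxima))  # 1-based rank
    return maxima[min(rank, len(maxima)) - 1]


def first_alarm(sorted_ref, stream, window, threshold):
    """Index of the first step whose trailing window exceeds the threshold."""
    for offset, stat in enumerate(window_stats(sorted_ref, stream, window)):
        if stat > threshold:
            return window - 1 + offset
    return None


rng = random.Random(0)


def draw(n, mu):
    return [rng.gauss(mu, 1.0) for _ in range(n)]


reference = sorted(draw(200, 0.0))                  # stand-in reference scores
null_streams = [draw(160, 0.0) for _ in range(50)]  # 50 no-shift streams
threshold = calibrate_threshold(reference, null_streams, window=40)

stream = draw(80, 0.0) + draw(80, 3.0)              # synthetic onset at step 80
alarm = first_alarm(reference, stream, window=40, threshold=threshold)
print(f"threshold={threshold:.3f}, first alarm at step {alarm}")
```

Because the threshold is tuned on a fixed horizon, the false alarm rate is only controlled for streams of about that length. This is the practical consequence of lacking the time-uniform guarantee noted above.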

The conformal adaptation evaluation reveals more failures than successes. Weighted conformal prediction works meaningfully only for DeBERTa on paraphrase shift (ESS=46, +39pp recovery). For 11 of 12 classifier×shift combinations, density-ratio estimation collapses entirely. While diagnosing this collapse is valuable, the system's adaptation component is largely non-functional as presented. The PCA-to-32-dimensions fix is demonstrated only on temporal shift (with a secondary confirmation on paraphrase using different calibration splits), leaving generalizability uncertain.
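The link between ESS and the collapse can be illustrated with a short sketch. It assumes the Kish formula for effective sample size and a standard weighted split-conformal quantile with the test point's mass at infinity; the scores, weights, and clipping floor are hypothetical. When every weight is clipped to the same floor, ESS equals the calibration set size and the weighted threshold is identical to the unweighted one, so no adaptation takes place.

```python
from math import inf


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2 (assumed definition)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_conformal_threshold(scores, weights, test_weight, epsilon=0.1):
    """Weighted (1 - epsilon) quantile of calibration scores.

    The test point's normalized weight is placed at +infinity.
    """
    norm = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / norm
        if cumulative >= 1 - epsilon - 1e-12:  # tolerance for float round-off
            return score
    return inf


# Hypothetical nonconformity scores for a calibration set of 300 points.
scores = [i / 300 for i in range(1, 301)]

# Collapse case: every importance weight clipped to the same floor.
floor_weights = [0.01] * 300
# Informative case: the density ratio puts most mass on high-score points.
tilted_weights = [1.0 if s > 0.85 else 0.01 for s in scores]

results = {}
for name, w in [("collapsed", floor_weights), ("informative", tilted_weights)]:
    ess = effective_sample_size(w)
    q = weighted_conformal_threshold(scores, w, test_weight=max(w))
    results[name] = (ess, q)
    print(f"{name}: ESS={ess:.0f}, threshold={q:.3f}")
```

In this toy setup the informative weights give an ESS near 50 and move the threshold upward, while the collapsed weights leave it at the ordinary split-conformal quantile. That matches the reviewed pattern in which an ESS close to 300 signals no correction rather than a healthy sample.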

The Regime C (adversarial) evaluation is based on only 22 examples with identical Llama Guard scores (0.78), making the "14/40 detections" a replication of a single observation rather than 14 independent events. The paper acknowledges this limitation but still presents it as a ground-truth regime.

Potential Impact

Practical relevance: The finding that per-classifier monitoring profiles are necessary (due to substantial interaction effects) is directly actionable for production safety systems. The crossover interaction—encoders detect paraphrase fast but adversarial slow, decoders vice versa—provides concrete deployment guidance.

The density-ratio collapse diagnosis is potentially the most impactful finding, as it warns practitioners against naively applying weighted conformal prediction in high-dimensional embedding spaces of generative models. The demonstration that PCA to ≤32 dimensions breaks the collapse provides a practical remedy, though validation is limited.

The trade-off between the confidence-sequence (CS) and KS detectors at low contamination (97% vs. 43% detection, respectively, at 30% mixing) is operationally significant, as real-world drift is rarely 100% contamination. This finding alone justifies the multi-channel architecture.

However, the system is evaluated only on binary safety classifiers with scalar outputs. Multi-category taxonomies, which are increasingly standard in production, are explicitly excluded. The reference distributions use only WildGuardMix unharmful examples, which may not represent production traffic variability.

Timeliness & Relevance

The paper addresses a genuine current need. LLM safety classifiers are deployed at massive scale, adversarial adaptation is accelerating (GCG attacks, jailbreak evolution), and the silent failure mode is well-documented. The timing is appropriate—safety classifier monitoring is an emerging operational concern with limited empirical study.

The paper's positioning against recent work (Prinster et al., 2025; Sahoo et al., 2026) is well-executed, and it identifies a gap in the literature: no prior work analyzes classifier×shift interactions for safety classifiers using factorial designs.

Strengths

1. Pre-registered factorial design with 800 cells and honest reporting of all deviations

2. Variance decomposition revealing interaction effects (η²=0.185) that are invisible to single-classifier studies

3. Transparent failure reporting: the density-ratio collapse, Regime C corpus limitations, and N=5 vs N=20 replication differences are all honestly characterized

4. Multi-channel architecture with clear operational roles for each detector

5. Reproducibility: code, configurations, and verification scripts are provided

Limitations

1. Limited methodological novelty: the system assembles known components rather than introducing new methods

2. Conformal adaptation mostly fails: 11/12 classifier×shift combinations show density-ratio collapse

3. Narrow classifier scope: only binary classifiers with scalar outputs

4. Small adversarial corpus (22 examples) undermines Regime C claims

5. PCA fix validated on limited shifts: temporal (primary) and paraphrase (secondary with different calibration), leaving generalizability open

6. No comparison to recent unified frameworks like WATCH (Prinster et al., 2025) on the same benchmarks

Overall Assessment

This is a solid empirical study that addresses a practical problem with a well-designed evaluation. Its primary value lies in the factorial analysis revealing interaction effects and the diagnosis of density-ratio collapse, rather than in the monitoring system itself, which assembles known components. The conformal adaptation component is largely non-functional, which limits the paper's claim of providing an adaptive system. The work would benefit from comparison against unified monitoring frameworks and extension to multi-category classifiers. The honest reporting of failures and limitations is a notable strength.

Rating: 5.5/10

Significance 6 · Rigor 7 · Novelty 4 · Clarity 7.5

Generated Jun 11, 2026

Comparison History (20)

Won vs. When Does Routing Become Interpretable? Causal Probes on Block Attention Residuals

Paper 2 has higher impact potential due to strong real-world applicability (deployed safety classifier monitoring), timely relevance to LLM safety, and higher methodological rigor (pre-registered, large factorial evaluation with CIs and variance decomposition). It offers actionable findings about when conformal adaptation fails (importance-weight collapse) and practical mitigations (dimensionality reduction), with implications for online ML monitoring and reliability engineering. Paper 1 is novel for interpretability of routing and provides useful causal analysis, but its contributions are narrower and more architecture-specific, likely limiting breadth and immediate deployment impact.

gpt-5.2·Jun 12, 2026

Won vs. Enhanced Low-Density Region Exploration in Classifier-Guided Diffusion Models Through Modified Reverse Diffusion Sampling

Paper 1 addresses the critical and highly timely challenge of AI safety under distribution shifts and adversarial attacks. Its rigorous, pre-registered factorial evaluation and practical solutions for high-dimensional conformal adaptation offer broad real-world utility for deploying robust AI systems. Paper 2 presents a valuable but comparatively narrower algorithmic improvement for generative diffusion models.

gemini-3.1-pro-preview·Jun 12, 2026

Lost vs. The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

Paper 2 introduces a novel geometric framework (Stable Recovery Manifold hypothesis) that fundamentally recharacterizes catastrophic forgetting as an accessibility/alignment problem rather than information destruction. This conceptual reframing has broad implications across continual learning, neuroscience-inspired AI, and lifelong learning systems. The clean theoretical insight (k_t stability, principal-angle drift predicting recoverability) is elegant and generalizable. Paper 1, while technically thorough with its factorial evaluation of conformal adaptation for safety classifiers, addresses a more narrowly scoped engineering problem and reveals significant limitations (ESS collapse for most classifiers), reducing its practical impact.

claude-opus-4-6·Jun 12, 2026

Lost vs. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Paper 2 likely has higher impact: it proposes a broadly applicable, theoretically grounded framework (Radon–Nikodym/path-measure derivation, convergence to reward-tilted distributions, provably optimal AJD loss) for fine-tuning any-length discrete diffusion models—an emerging, timely paradigm in generative modeling. The method can influence multiple areas (sequence modeling, RL-style fine-tuning, decoding/inference) and may generalize across tasks and modalities. Paper 1 is rigorous and valuable for deployed safety monitoring, but its impact is narrower and results reveal brittleness (importance-weight collapse) that may limit general adoption.

gpt-5.2·Jun 12, 2026

Won vs. Accelerating Speculative Diffusions via Block Verification

Paper 2 addresses a highly critical and timely issue—the reliability of deployed AI safety classifiers under distribution shift and adversarial attacks. Its methodological rigor is exceptional, featuring a pre-registered factorial evaluation and deep statistical variance analysis. While Paper 1 offers a neat algorithmic adaptation for diffusion models, its practical impact (a 6.3% speedup) is relatively marginal compared to the urgent real-world need for robust, adaptive safety mechanisms in LLM deployment demonstrated in Paper 2.

gemini-3.1-pro-preview·Jun 12, 2026

Won vs. WHAR Arena: Benchmarking the State of the Art in Efficient Wearable Human Activity Recognition

Paper 1 has higher potential impact due to its timely focus on deployed AI safety under distribution shift, combining online shift detection with conformal risk control—an approach that can generalize across safety-critical NLP deployments. Its pre-registered, factorial evaluation and variance decomposition indicate strong methodological rigor and actionable insights (e.g., when importance-weighted conformal fails and how dimensionality reduction mitigates it). The work connects to multiple fields (ML monitoring, sequential testing, conformal inference, adversarial robustness, AI safety). Paper 2 is highly useful infrastructure, but benchmarks typically yield narrower conceptual novelty.

gpt-5.2·Jun 12, 2026

Won vs. Easy-to-Use Shielding for Reinforcement Learning

Paper 2 demonstrates higher potential impact due to its exceptional methodological rigor, including a pre-registered factorial evaluation and detailed statistical analysis. Furthermore, it addresses an extremely timely and critical problem: the robustness of deployed safety classifiers against distributional shifts, real-world jailbreaks, and adversarial attacks. While Paper 1 provides useful infrastructure for Safe RL, Paper 2 tackles immediate real-world deployment challenges for modern foundational models with a statistically sound adaptive framework.

gemini-3.1-pro-preview·Jun 11, 2026

Won vs. Finding Multiple Interpretations in Datasets

Paper 1 has higher likely impact due to a more novel and timely contribution to deploying safety classifiers under distribution shift, combining online sequential shift detection with conformal abstention adaptation and a large, preregistered factorial evaluation. It targets real-world ML safety operations (monitoring, jailbreak/adversarial shift, error-rate control) with quantified reliability and clear failure analysis (importance-weight collapse, PCA fix), suggesting broadly applicable tooling across domains using deployed classifiers. Paper 2 is useful for interpretability/model multiplicity but is narrower in demonstrated scope (single dataset) and methodological detail.

gpt-5.2·Jun 11, 2026

Won vs. Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

Paper 2 addresses a highly critical and timely challenge in AI safety—monitoring deployed classifiers for distributional shifts and adversarial attacks. It features exceptional methodological rigor with a large-scale pre-registered factorial evaluation and provides deep insights into the failure modes of conformal prediction in high-dimensional spaces. In contrast, Paper 1 offers a solid but straightforward application of an existing foundation model to a specific predictive maintenance task, lacking the broader theoretical and interdisciplinary implications of Paper 2.

gemini-3.1-pro-preview·Jun 11, 2026

Lost vs. Harness In-Context Operator Learning with Chain of Operators

Paper 1 introduces Chain of Operators (CHOP), a novel framework that extends in-context operator learning to out-of-distribution tasks without retraining, drawing creative parallels between prompt engineering in LLMs and operator composition in scientific computing. This bridges two active research areas (neural operators and in-context learning) with broad applicability across PDEs and scientific domains. Paper 2 addresses an important but narrower engineering problem of monitoring deployed safety classifiers, with solid empirical work but limited novelty beyond combining existing techniques (conformal prediction, sequential statistics, importance weighting). CHOP's transferability across PDE families suggests deeper theoretical implications.

claude-opus-4-6·Jun 11, 2026
