Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao, Yanyan Zhao, Yutai Hou, Qianchao Wang, Dandan Tu

#201 of 2292 · Artificial Intelligence
Tournament Score: 1518±47
Win Rate: 84% (21 wins, 4 losses, 25 matches)
Rating: 7.2/10

Abstract

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies and formalizes Safety Geometry Collapse (SGC) — a representation-geometric failure mode explaining why multimodal LLMs (MLLMs) fail to refuse harmful inputs expressed through non-text modalities despite strong text-only safety alignment. The key insight is that multimodal inputs induce a systematic representation drift that compresses the separability along the text-aligned refusal direction, making the model unable to distinguish harmful from benign inputs. The authors formalize this through a two-dimensional diagnostic space (refusal direction + modality-induced drift direction) and propose Conditional Refusal Separability (CRS) as a quantitative metric. They then propose ReGap, a training-free, inference-time method that adaptively corrects modality-induced drift using a self-rectification signal — an internal indicator of whether drift correction causes the model to recover refusal behavior during forward dynamics.
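The two-dimensional diagnostic view described above can be illustrated with a small sketch. This assessment does not give the paper's exact CRS formula, so the score below is a stand-in: a d'-style gap between harmful and benign projections on the refusal axis, computed only for samples whose drift coordinate falls in a given bin. The function names, the toy vectors, and the bin edges are all hypothetical.

```python
import math
from statistics import mean, pstdev


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def coords(hidden, refusal_dir, drift_dir):
    """Project a hidden state onto the (refusal, drift) diagnostic axes."""
    return dot(hidden, refusal_dir), dot(hidden, drift_dir)


def separability(harmful_r, benign_r, eps=1e-8):
    """Gap between class means on the refusal axis, scaled by pooled spread.

    Assumption: a d'-style score standing in for the paper's CRS metric.
    """
    gap = mean(harmful_r) - mean(benign_r)
    spread = math.sqrt((pstdev(harmful_r) ** 2 + pstdev(benign_r) ** 2) / 2)
    return gap / (spread + eps)


def conditional_separability(samples, refusal_dir, drift_dir, lo, hi):
    """Separability restricted to samples with drift coordinate in [lo, hi)."""
    harmful, benign = [], []
    for hidden, is_harmful in samples:
        r, d = coords(hidden, refusal_dir, drift_dir)
        if lo <= d < hi:
            (harmful if is_harmful else benign).append(r)
    if len(harmful) < 2 or len(benign) < 2:
        return None  # too few samples in this drift bin
    return separability(harmful, benign)


# Toy 2-D hidden states: axis 0 is the refusal axis, axis 1 is the drift axis.
REFUSAL_DIR = [1.0, 0.0]
DRIFT_DIR = [0.0, 1.0]
samples = [
    ([3.0, 0.1], True), ([5.0, 0.2], True),     # harmful, low drift
    ([-1.0, 0.1], False), ([1.0, 0.2], False),  # benign, low drift
    ([0.0, 2.1], True), ([2.0, 2.2], True),     # harmful, high drift
    ([-1.0, 2.1], False), ([1.0, 2.2], False),  # benign, high drift
]

crs_low = conditional_separability(samples, REFUSAL_DIR, DRIFT_DIR, 0.0, 1.0)
crs_high = conditional_separability(samples, REFUSAL_DIR, DRIFT_DIR, 2.0, 3.0)
print(crs_low, crs_high)
```

In this toy data the high-drift bin has a smaller score than the low-drift bin, which is the qualitative pattern the review attributes to Safety Geometry Collapse. Sliding the `[lo, hi)` window along the drift axis would give a curve loosely analogous to the sliding-window analysis the review mentions.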

Methodological Rigor

The paper follows a well-structured investigative methodology organized around three research questions, building from diagnosis to intervention to mechanism analysis:

1. RQ1 (Diagnosis): The CRS metric and its correlation with attack success rate (ASR) across drift levels are convincingly demonstrated across three models. The sliding-window analysis in Figure 3 shows consistent monotonic relationships among drift magnitude, CRS degradation, and ASR increase.

2. RQ2 (Causal validation): The fixed-strength intervention experiment provides interventional evidence by subtracting the drift direction (orthogonalized against the refusal direction), which is a clean experimental design. The fact that improvement occurs despite the intervention being orthogonal to the refusal direction at the intervention layer strengthens the causal argument.

3. RQ3 (Mechanism): The self-rectification analysis reveals that harmful inputs exhibit stronger refusal-recovery dynamics than benign inputs after partial drift correction, providing both mechanistic understanding and a practical signal for adaptive correction.
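The RQ2 intervention (subtracting a drift direction that has first been orthogonalized against the refusal direction) can be sketched in a few lines. This is a minimal illustration with plain Python lists, not the authors' implementation: the vectors, the `strength` value, and the function names are hypothetical, and a real system would apply the edit to a chosen layer's activations inside the model.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def orthogonalize(drift_dir, refusal_dir):
    """Remove the refusal component from the drift direction, then renormalize.

    Raises ValueError if the drift direction is parallel to the refusal direction.
    """
    r_hat = unit(refusal_dir)
    overlap = dot(drift_dir, r_hat)
    return unit([d - overlap * r for d, r in zip(drift_dir, r_hat)])


def counteract_drift(hidden, drift_dir, refusal_dir, strength):
    """Fixed-strength correction: h' = h - strength * d_perp."""
    d_perp = orthogonalize(drift_dir, refusal_dir)
    return [h - strength * d for h, d in zip(hidden, d_perp)]


# Toy example: the estimated drift direction partly overlaps the refusal axis.
refusal_dir = [1.0, 0.0, 0.0]
drift_dir = [0.6, 0.8, 0.0]
hidden = [1.0, 3.0, 0.5]

corrected = counteract_drift(hidden, drift_dir, refusal_dir, strength=2.0)
print(dot(hidden, refusal_dir), dot(corrected, refusal_dir))
```

Because the subtracted vector is orthogonal to the refusal direction, the refusal coordinate at the intervention layer is unchanged by construction. Any recovery in refusal behavior must therefore arise in later layers, which is why the review treats this design as strengthening the causal argument.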

Limitations in rigor: The method requires per-model, per-modality hyperparameter tuning (diagnostic layers, thresholds, strong/weak λ values — see Tables 3 and 4), which somewhat undermines the claim of a general-purpose method. The reliance on a single calibration dataset (Omni-SafetyBench) and the binary adaptive scheme (two discrete correction strengths) are acknowledged but non-trivial constraints. The evaluation uses gpt-5-mini as judge, which, while validated with 95.8% human agreement, introduces external model dependency.
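To make the binary adaptive scheme concrete, the sketch below picks one of two correction strengths from a scalar self-rectification signal. The review does not specify how the signal is computed or in which direction the threshold is applied, so everything here is an assumption: the signal is taken to be the gain in refusal projection at a diagnostic layer after a probe pass with partial correction, and a larger gain is taken to select the stronger correction. `AdaptiveConfig` and its values are hypothetical placeholders for the per-model, per-modality settings the review refers to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdaptiveConfig:
    """Hypothetical per-model, per-modality settings (all values are placeholders)."""
    threshold: float
    weak_strength: float
    strong_strength: float


def self_rectification_signal(refusal_proj_uncorrected, refusal_proj_probe):
    """Assumed signal: gain in refusal projection after a partially corrected probe pass."""
    return refusal_proj_probe - refusal_proj_uncorrected


def choose_strength(signal, cfg):
    """Binary rule: strong correction if the signal reaches the threshold, else weak."""
    return cfg.strong_strength if signal >= cfg.threshold else cfg.weak_strength


cfg = AdaptiveConfig(threshold=0.5, weak_strength=1.0, strong_strength=4.0)

# An input whose refusal projection recovers sharply under the probe pass.
likely_harmful = choose_strength(self_rectification_signal(0.2, 1.4), cfg)

# An input whose refusal projection barely moves.
likely_benign = choose_strength(self_rectification_signal(0.2, 0.3), cfg)

print(likely_harmful, likely_benign)
```

With only two discrete strengths, every input near the threshold receives one of two quite different corrections, which is one way to see why the review flags the binary scheme and its calibrated thresholds as constraints.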

Potential Impact

Practical impact: ReGap offers a deployable, training-free inference-time defense for multimodal safety that does not require retraining or access to multimodal safety data — a significant practical advantage over post-training methods like VLGuard-SFT and SPA-VL-DPO, which require curated safety datasets for each modality. The preservation of utility (Tables 1-2) while reducing ASR substantially is compelling for real-world deployment.

Conceptual impact: The SGC framework provides a unifying geometric explanation for multimodal safety failures that goes beyond the "insufficient safety data" narrative. This reframes the problem from a data issue to a representation alignment issue, which could redirect research efforts toward modality alignment rather than modality-specific safety tuning. The connection to text-only jailbreak literature (where adversarial inputs shift representations to bypass safety boundaries) provides theoretical coherence.

Broader influence: The self-rectification phenomenon, in which models recover safety behavior through their own forward dynamics once drift is corrected, is an interesting mechanistic finding that could influence interpretability research and inform future alignment strategies. The idea that models already "know" an input is harmful but are geometrically prevented from acting on this knowledge is powerful.

Timeliness & Relevance

This paper addresses a critical and timely problem. As MLLMs are deployed across modalities (vision, audio, video, omni-modal), the multimodal safety gap is an active concern. Recent work has demonstrated numerous multimodal jailbreaks, and existing defenses are fragmented across modalities. The paper's evaluation across vision, audio, and omni-modal settings on recent models (Qwen2.5-Omni, Qwen3-Omni, MiniCPM-o) demonstrates relevance to the current model landscape. The training-free, inference-time nature of ReGap addresses deployment practicalities where retraining is costly.

Strengths

1. Elegant geometric framework: The two-dimensional safety space (refusal + drift) is simple, interpretable, and analytically productive. CRS provides a principled metric linking representation geometry to behavioral outcomes.

2. Strong experimental design: The progression from observation → quantification → causal intervention → mechanism → method is exemplary scientific methodology.

3. Cross-modality generality: Unlike prior work focused on vision-only safety, this paper addresses vision, audio, video, and omni-modal settings within a unified framework.

4. Self-rectification discovery: The finding that models can self-correct when drift is partially removed is mechanistically interesting and practically useful as an adaptive signal.

5. Comprehensive evaluation: 4,644 harmful + 1,500 benign examples across 6 benchmarks, 3 models, with human validation of the automated judge.

Limitations & Weaknesses

1. Hyperparameter sensitivity: Tables 3-4 reveal substantial model- and modality-specific tuning (18 separate configurations). The sensitivity to these choices is not thoroughly ablated.

2. Unified drift assumption: Using a single drift direction per modality is an acknowledged simplification. The paper shows this works but doesn't explore when it might fail (e.g., for highly diverse multimodal inputs).

3. Limited model scope: Three models from two families. The approach requires access to internal representations, excluding closed-source systems.

4. Incomplete ASR reduction: ReGap still leaves substantial residual ASR in some settings (e.g., 63% on AudioJailBench Video for MiniCPM-o, 65.4% on MM-SafetyBench for MiniCPM-o), suggesting the geometric account captures only part of the safety failure.

5. Computational overhead: Despite claims of efficiency, the probe forward pass adds ~26% latency for Qwen2.5-Omni-3B (Table 5), which is non-trivial for production systems.

6. Static calibration: The calibrated thresholds and directions are fixed post-calibration, potentially limiting robustness to distribution shift in deployment.

Overall Assessment

This is a well-executed mechanistic study that provides both diagnostic understanding and a practical method for an important problem. The geometric framework is elegant and the experimental methodology is thorough. The primary weaknesses lie in the substantial per-model tuning required and incomplete ASR reduction in some settings. Nevertheless, the conceptual contribution of SGC and the practical utility of ReGap represent meaningful advances in multimodal safety research.

Rating: 7.2/10
Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 8

Generated May 19, 2026

Comparison History (25)

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
gemini-3.1 · 5/20/2026

Paper 2 addresses the critical and highly timely issue of safety in Multimodal LLMs. By providing a geometric understanding of the multimodal safety gap and introducing a training-free, inference-time correction method (ReGap), it offers immediate, practical real-world applications for deploying safe AI systems. While Paper 1 provides strong methodological rigor and novel theoretical bounds for multi-agent RL, Paper 2's focus on AI safety aligns with one of the most pressing challenges in the broader AI community today, likely leading to wider and faster scientific impact.

vs. SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning
gemini-3.1 · 5/19/2026

Paper 1 addresses a critical bottleneck in LLM reasoning—step-level credit assignment—by introducing a novel, self-contained hindsight self-distillation method. Eliminating the need for external teachers or annotations significantly advances scalable oversight and reinforcement learning for agentic systems. While Paper 2 offers valuable insights into multimodal safety and representation engineering, the methodological innovation in Paper 1 has broader, more fundamental implications for advancing autonomous reasoning capabilities, positioning it for higher widespread impact across the AI community.

vs. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
claude-opus-4.6 · 5/19/2026

Paper 2 addresses a critical and timely problem—multimodal LLM safety—with both theoretical insight (Safety Geometry Collapse) and a practical solution (ReGap). It introduces a novel geometric framework explaining why safety fails across modalities, provides causal validation, and offers a training-free fix. This has broad real-world impact given widespread MLLM deployment. Paper 1, while comprehensive, is primarily a benchmark contribution for a niche area (qualitative spatial/temporal reasoning) with less immediate practical impact and narrower audience appeal.

vs. Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners
gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to its cross-disciplinary contribution: it introduces a rare joint benchmark linking model performance, human learning behavior, and concurrent fMRI, and reports large, controlled gains in brain-activity predictability over RL baselines. This can influence cognitive neuroscience, computational psychiatry, AI evaluation, and model design. Its applications span mechanistic understanding of learning and principled human-alignment metrics. Paper 1 is timely and practically valuable for multimodal safety, but its impact is more confined to safety engineering for MLLMs and relies on representation interventions whose generality may vary across architectures and modalities.

vs. TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning
gemini-3.1 · 5/19/2026

Paper 2 addresses the urgent and widespread issue of multimodal LLM safety. By formalizing 'Safety Geometry Collapse' and proposing a training-free intervention (ReGap), it provides deep theoretical insights and immediate practical solutions for AI alignment. While Paper 1 offers strong advances for embodied AI and robotics, Paper 2 has a broader, more immediate impact across the rapidly expanding field of foundation models, directly tackling critical safety vulnerabilities that affect millions of current AI deployments.

vs. LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection
gpt-5.2 · 5/19/2026

Paper 1 has higher likely impact due to its novel representation-geometry framing of multimodal safety failure (“Safety Geometry Collapse”), a causal intervention validating the mechanism, and a practical training-free inference-time mitigation (ReGap) with broad relevance to rapidly growing multimodal LLM deployment. Its applications are immediate (real-time safety hardening) and potentially generalizable across models/modalities, affecting ML safety, alignment, and representation learning. Paper 2 is methodologically sound and useful for prognostics/RUL, but its domain scope is narrower and depends on evidence-bank curation and task-specific knowledge bases, limiting cross-field breadth and timeliness relative to LLM safety.

vs. Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap
gpt-5.2 · 5/19/2026

Paper 2 has higher impact potential due to a clearer novel framing (representation-geometric “Safety Geometry Collapse”), stronger methodological rigor (quantified metrics, causal intervention, benchmarked evaluation), and immediate real-world applicability (training-free inference-time safety improvement for deployed multimodal LLMs). Its contributions generalize across models and modalities, affecting AI safety, robustness, and representation learning broadly, and are highly timely given rapid MLLM deployment. Paper 1 is interesting but appears more domain-specific, relies heavily on LLM-generated FCMs with less evident validation, and has narrower, more speculative downstream impact.

vs. ADR: An Agentic Detection System for Enterprise Agentic AI Security
claude-opus-4.6 · 5/19/2026

Paper 1 offers a deeper scientific contribution by identifying and formalizing 'Safety Geometry Collapse' as a fundamental representation-geometric phenomenon explaining why multimodal LLMs fail to transfer safety capabilities. It provides novel theoretical insights (refusal direction, modality drift, conditional refusal separability), causal validation through interventions, and a principled training-free method (ReGap). This mechanistic understanding has broad implications for the alignment and safety research community. Paper 2, while practically impactful as an engineering system deployed at scale, is more of a systems/engineering contribution with narrower scientific novelty, focused on a specific enterprise security application.

vs. Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents
gemini-3.1 · 5/19/2026

Paper 1 addresses a highly critical and timely issue—multimodal LLM safety—offering immediate, practical applications for deploying safer AI. Its geometric analysis and training-free intervention (ReGap) provide rigorous, scalable solutions. In contrast, Paper 2 focuses on theoretical and philosophical concepts of artificial subjectivity in simple gridworlds, which has less immediate real-world utility and narrower breadth of impact.

vs. EXG: Self-Evolving Agents with Experience Graphs
gemini-3.1 · 5/19/2026

Paper 1 addresses a critical vulnerability in multimodal LLMs (safety alignment across modalities) using a novel geometric perspective. Its theoretical depth in identifying 'Safety Geometry Collapse' and practical, training-free intervention (ReGap) offer profound implications for AI safety. While Paper 2 presents a useful framework for agent memory, Paper 1's focus on fundamental safety alignment in widely deployed foundational models provides a higher potential for broad, immediate impact in both mechanistic understanding and practical deployment.

vs. MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop
gpt-5.2 · 5/19/2026

Paper 2 is likely higher impact due to a clearer conceptual novelty (Safety Geometry Collapse), a representation-geometric framing with causal validation via activation interventions, and a broadly applicable, training-free inference-time mitigation (ReGap). Its relevance is timely given urgent multimodal safety needs, and the approach can generalize across models, modalities, and safety benchmarks—potentially influencing both ML safety research and deployment practices. Paper 1 is strong applied engineering with impressive production metrics and sustainability analysis, but it is more domain-specific and incremental (pipeline + HITL) with narrower cross-field scientific influence.

vs. Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics
gemini-3.1 · 5/19/2026

Paper 1 addresses a critical and immediate challenge in AI safety (multimodal jailbreaks) using a highly novel mechanistic interpretability lens. By identifying 'Safety Geometry Collapse' and proposing a training-free, inference-time intervention (ReGap) that does not compromise utility, it offers an easily adoptable solution with broad implications for deploying secure multimodal LLMs. Paper 2 is valuable for AI-driven scientific reasoning, but its approach relies on domain-specific dataset curation for fine-tuning, which has a narrower immediate impact compared to fundamentally understanding and fixing representation-level safety alignment.

vs. Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models
claude-opus-4.6 · 5/19/2026

Paper 1 addresses the critical and timely problem of multimodal LLM safety with a novel geometric analysis framework (Safety Geometry Collapse) and a practical training-free solution (ReGap). It offers both theoretical insights into why safety fails across modalities and a deployable fix, with broad implications for AI safety. Paper 2 makes solid contributions to RL for diffusion MLLMs but addresses a narrower technical problem (image generation optimization). Paper 1's impact spans safety, alignment, and deployment of MLLMs—areas of urgent societal importance—giving it broader and more consequential reach.

vs. Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use
claude-opus-4.6 · 5/19/2026

Paper 1 addresses the critical and timely problem of multimodal LLM safety with a novel geometric perspective ('Safety Geometry Collapse'), provides both mechanistic understanding and a practical training-free solution (ReGap), and demonstrates broad applicability across multiple benchmarks. The safety implications for deployed MLLMs give it high real-world relevance. Paper 2 offers interesting insights into the 'knowing-doing gap' in tool use, but addresses a narrower problem with more limited practical impact. Paper 1's combination of theoretical depth, causal validation, and practical solution gives it stronger overall impact potential.

vs. Rollout Cards: A Reproducibility Standard for Agent Research
gemini-3.1 · 5/19/2026

Paper 1 addresses a systemic methodological crisis in AI agent research by proposing 'rollout cards,' a standardization akin to Model Cards. By exposing how hidden reporting rules can invert model rankings and providing an open-source framework, it has the potential to fundamentally change how agent research is evaluated and published. While Paper 2 offers deep technical insights into multimodal safety, Paper 1's structural impact on scientific rigor and reproducibility gives it broader, longer-lasting significance across the AI community.

vs. Reinforcing VLAs in Task-Agnostic World Models
gemini-3.1 · 5/19/2026

Paper 2 addresses a critical, timely issue in AI safety for Multimodal LLMs. By identifying 'Safety Geometry Collapse' and offering a training-free, inference-time correction method (ReGap), it provides a highly scalable and easily adoptable solution. While Paper 1 offers strong contributions to robotics and VLA scalability, Paper 2's focus on foundational model safety has broader, more immediate real-world implications across diverse AI deployment sectors.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction
gpt-5.2 · 5/19/2026

Paper 2 likely has higher impact due to a novel, mechanistic framing ("Safety Geometry Collapse") with clear, testable metrics and a demonstrated causal intervention, plus an immediately deployable, training-free inference method (ReGap) validated on multiple benchmarks. Its applications are timely and broad—improving safety of widely used multimodal LLMs across domains—and it offers a general representation-level perspective that can influence both safety research and model design. Paper 1 is a valuable, rigorous synthesis for clinical AI, but as a review/framework it is less directly transformative than a new method with strong empirical validation.

vs. PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play
claude-opus-4.6 · 5/19/2026

Paper 1 introduces a novel geometric framework ('Safety Geometry Collapse') for understanding multimodal safety failures, identifies a causal mechanism (modality-induced drift), and proposes a practical training-free inference-time solution (ReGap). It addresses a critical and timely problem—MLLM safety—with broad implications for AI deployment. The combination of theoretical insight, causal validation, and practical mitigation is compelling. Paper 2 contributes a solid population-based self-play framework for LLM reasoning, but builds more incrementally on existing RLVR and population-based training ideas. Safety alignment has broader cross-field impact than reasoning benchmark improvements.

vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
gpt-5.2 · 5/19/2026

Paper 2 (TRACE) likely has higher impact: it targets a ubiquitous, broadly relevant failure mode (hallucinations) across essentially all LLM deployments, and demonstrates unusually strong generality—single training-free method, fixed hyperparameter, validated across 15 models/8 families/3 benchmarks with no regressions. Its cross-layer, per-input adaptive intervention is a novel framing beyond fixed “truth direction” approaches and is immediately deployable. Paper 1 is innovative and important for MLLM safety, but its scope is narrower (multimodal safety alignment) and applicability depends on specific MLLM architectures and safety setups.

vs. Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation
gemini-3.1 · 5/19/2026

Paper 2 addresses a critical and highly timely issue—safety in multimodal large language models—by providing a novel geometric explanation and an effective training-free mitigation strategy. Its implications for AI safety and alignment have broad, immediate impact across the rapidly growing field of generative AI. In contrast, Paper 1 presents an incremental architectural improvement to PPO for a narrower domain (multi-UAV control), resulting in comparatively lower overall scientific and real-world impact.