StructBreak: Structural Cognitive Overload-Induced Safety Failures in MLLMs
Yang Luo, Xinran Liu, Tiantian Ji, Zhiyi Yin, Lingyun Peng, Shuyu Li
Abstract
Multimodal Large Language Models (MLLMs) excel at structural reasoning yet suffer from a sharp logical brittleness in structural consistency. We term this phenomenon Structural Cognitive Overload (SCO), a byproduct of the contention between deep reasoning and safety alignment. However, prior work has predominantly targeted typographic and pixel-level perturbations, leaving the study of SCO largely unexplored. To this end, we propose StructBreak, an automated end-to-end framework designed to quantify SCO. By leveraging StructBreak, we uncover a novel higher-order cognitive overload attack paradigm; notably, this attack operates under a practical black-box setting, requiring no internal model access. Consequently, we utilize this framework to establish a comprehensive benchmark spanning ten diverse threat scenarios. Empirical evaluations on six leading MLLMs reveal that SCO readily triggers toxic generation, yielding a 92% average ASR (up to 97% on Gemini 2.5). To elucidate the mechanism of SCO, we further conduct model-level interpretations spanning attention dynamics, latent space topology, and geometric analysis. Our findings reveal that StructBreak acts as a novel structural channel to circumvent safety filters. Furthermore, the limited efficacy of inherent safety mechanisms underscores that current alignment paradigms are insufficient for the era of complex multimodal reasoning.
AI Impact Assessments
Scientific Impact Assessment: StructBreak
1. Core Contribution
StructBreak identifies and formalizes a novel vulnerability in Multimodal Large Language Models (MLLMs) termed Structural Cognitive Overload (SCO) — the phenomenon where complex visual structures (specifically Visual Knowledge Graphs) exhaust a model's finite attention budget, causing safety alignment mechanisms to be "crowded out" by structural reasoning demands. The paper contributes: (a) an automated end-to-end framework for generating adversarial VKGs, (b) a benchmark spanning ten threat categories, (c) empirical evaluation across six frontier MLLMs showing 92% average attack success rate, and (d) mechanistic interpretability analyses explaining why the attack works at the attention, latent-space, and geometric levels.
The key novelty is the attack surface itself — unlike typographic attacks (FigStep) or pixel-level adversarial perturbations, StructBreak operates at a semantic-structural level. It embeds harmful intent within graph topologies that force models into a "parse-then-execute" mode, consuming attention resources that would otherwise support safety compliance. This is a conceptually distinct and higher-order attack paradigm.
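The "attention budget" argument can be made concrete with a small measurement. The following is a minimal sketch, not the paper's metric: it assumes that one row of softmax attention weights and the positions of the safety-relevant tokens (for example, a system instruction) are already available. The function name `safety_attention_share` and the toy weights are hypothetical and chosen only for illustration.

```python
def safety_attention_share(attention_row, safety_positions):
    """Return the fraction of one query token's attention mass that
    lands on the given safety-relevant key positions.

    attention_row: list of non-negative attention weights over key positions.
    safety_positions: indices of the tokens treated as safety-relevant
                      (an assumption; the source does not define them).
    """
    total = sum(attention_row)
    if total <= 0:
        raise ValueError("attention row must have positive total mass")
    safety_mass = sum(attention_row[i] for i in safety_positions)
    return safety_mass / total


# Hypothetical toy rows: position 0 plays the role of the safety instruction.
short_context = [0.5, 0.25, 0.25]
# Same instruction, but more structural tokens now compete for attention.
long_context = [0.125, 0.125, 0.25, 0.25, 0.25]

print(safety_attention_share(short_context, [0]))  # 0.5
print(safety_attention_share(long_context, [0]))   # 0.125
```

In this toy setup the share on the safety token falls as structural tokens are added, which is the "crowding out" effect the assessment describes. Whether real models behave this way is an empirical question that the sketch does not answer.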
2. Methodological Rigor
Strengths in experimental design:
Methodological concerns:
3. Potential Impact
Immediate impact on AI safety: This work exposes a fundamental tension between reasoning capability and safety alignment that is highly relevant as MLLMs become more capable structural reasoners. The "competency-vulnerability paradox" — that stronger reasoning models are more susceptible — is a particularly concerning finding for the trajectory of model development.
Practical threat assessment: The attack is black-box, low-cost ($0.07 per sample), single-shot effective, and produces near-zero refusal rates. This combination makes it a genuinely practical threat vector, distinct from gradient-based attacks requiring white-box access.
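For readers who want to reproduce the headline metrics, the sketch below shows how attack success rate (ASR), refusal rate, and evaluation cost could be tallied from per-sample judgments. It assumes each sample has already been labeled `"success"`, `"refusal"`, or `"other"` by some judge; that label set and the function names are hypothetical. Only the $0.07 per-sample cost comes from the assessment, and the outcomes in the example are toy values, not results from the paper.

```python
from collections import Counter

COST_PER_SAMPLE_USD = 0.07  # per-sample cost quoted in the assessment


def summarize_outcomes(outcomes):
    """Compute ASR, refusal rate, and total cost for one model.

    outcomes: list of labels, each "success", "refusal", or "other"
              (a hypothetical labeling scheme).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one judged sample")
    counts = Counter(outcomes)
    return {
        "asr": counts["success"] / n,
        "refusal_rate": counts["refusal"] / n,
        "cost_usd": n * COST_PER_SAMPLE_USD,
    }


def macro_average_asr(per_model_asr):
    """Unweighted mean of per-model ASR values."""
    if not per_model_asr:
        raise ValueError("need at least one model")
    return sum(per_model_asr.values()) / len(per_model_asr)


# Toy example with made-up labels.
toy = ["success", "success", "success", "refusal"]
print(summarize_outcomes(toy))
print(macro_average_asr({"model_a": 0.5, "model_b": 1.0}))
```

The unweighted mean is one plausible reading of "average ASR"; the source does not say whether the reported 92% is weighted by sample count per model.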
Influence on alignment research: The mechanistic evidence of safety attention dissipation provides concrete targets for defensive research. The finding that StructBreak representations are orthogonal to the model's refusal direction suggests that current linear safety probes are insufficient, potentially motivating non-linear safety monitoring approaches.
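The orthogonality claim can be illustrated with a short geometric check. The sketch below assumes the refusal direction is estimated as the difference between the mean hidden state on harmful prompts and the mean on harmless prompts, which is one common estimator in interpretability work; the source does not say which estimator the paper uses. All vectors are hypothetical two-dimensional toys.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def refusal_direction(harmful_states, harmless_states):
    """Difference-of-means estimate of a refusal direction (an assumed estimator)."""
    mu_harmful = mean_vector(harmful_states)
    mu_harmless = mean_vector(harmless_states)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy hidden states (hypothetical): harmful prompts shift along the first axis.
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[0.0, 0.0], [2.0, 0.0]]
direction = refusal_direction(harmful, harmless)  # [2.0, 0.0]

plain_harmful_state = [3.0, 0.0]  # aligned with the refusal direction
structural_state = [0.0, 5.0]     # orthogonal to it

print(cosine(plain_harmful_state, direction))  # 1.0
print(cosine(structural_state, direction))     # 0.0
```

A linear probe that thresholds the projection onto `direction` would flag the first state and miss the second, which is why a near-zero cosine is read in the assessment as evidence that linear safety probes are insufficient.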
Broader implications: The work connects to the emerging theme that scaling reasoning capabilities may inadvertently create new safety vulnerabilities — a message relevant to the broader responsible AI community.
4. Timeliness & Relevance
This paper is exceptionally timely. The release of reasoning-focused models (GPT-5, Gemini 2.5, DeepSeek-R1) has dramatically increased MLLMs' structural reasoning capabilities, making exactly this type of vulnerability increasingly exploitable. The paper addresses a clear gap: prior jailbreak research focused on surface-level perturbations (typography, adversarial noise), while the structural reasoning attack surface was largely unexplored. As MLLMs are increasingly deployed for chart analysis, document understanding, and knowledge graph reasoning, understanding these vulnerabilities is urgent.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Minor concerns:
Overall Assessment
StructBreak makes a significant contribution to MLLM safety research by identifying and systematically exploiting a previously underexplored vulnerability class. The combination of a practical attack framework, strong empirical results, and multi-level mechanistic analysis makes this a comprehensive and impactful paper. The competency-vulnerability paradox is a finding with potentially far-reaching implications for alignment research. While the theoretical grounding could be stronger and the defensive analysis more thorough, the core contribution — demonstrating that structural reasoning complexity systematically undermines safety alignment — is both novel and important for the field.
Generated May 26, 2026
Comparison History (21)
While Paper 1 offers a highly rigorous and valuable open-source hardware implementation for datacenter networking, Paper 2 addresses a critical and rapidly expanding field: AI safety. By exposing a novel vulnerability (StructBreak) with an alarming 92% success rate on state-of-the-art MLLMs like Gemini, Paper 2 promises broader, more immediate cross-disciplinary impact. Its findings directly challenge current AI alignment paradigms and provide deep mechanistic insights, making it exceptionally timely and relevant to a massive global research community.
Paper 2 addresses a critical, unresolved foundation of Retrieval-Augmented Generation (RAG)—distinguishing whether a model uses retrieved facts or parametric memory. By introducing a novel interpretability method (CRM) to solve the 'attribution blind spot,' it provides a crucial step toward verifiable grounding for high-stakes deployments. While Paper 1 presents a significant safety vulnerability in MLLMs, Paper 2's fundamental contribution to model interpretability, attribution, and reliability gives it broader and more enduring scientific and practical impact across the AI field.
Paper 2 likely has higher impact due to strong timeliness and broad relevance: it addresses safety failures in widely deployed multimodal LLMs, proposes an automated black-box attack/measurement framework, and introduces a benchmark across diverse threat scenarios. Its findings (high ASR across major models) are immediately actionable for alignment, evaluation, and policy, with cross-field implications spanning ML security, HCI safety, and AI governance. Paper 1 is novel and valuable for biomedical hypothesis generation, but its impact is more domain-specific and may face higher barriers to real-world adoption and validation.
SIA addresses a fundamental challenge in AI—self-improvement through both harness and weight updates—unifying two previously disjoint research paradigms. It demonstrates strong empirical results across three diverse domains (56.6%-502% improvements), suggesting broad applicability. This work advances the long-horizon goal of autonomous AI improvement, which has transformative potential. While StructBreak is valuable for AI safety (revealing structural vulnerabilities in MLLMs), it is more narrowly focused on attack methodology. SIA's framework for closing the loop on AI self-improvement has broader and more lasting implications for the field.
Paper 1 likely has higher scientific impact due to stronger novelty and broader relevance: it introduces Structural Cognitive Overload as a new failure mode in MLLM safety, provides an automated black-box attack/measurement framework, and reports very high attack success across multiple frontier models with mechanistic analyses. This is timely for AI safety, alignment, and multimodal reasoning, and can influence evaluation practices and mitigation research across many domains. Paper 2 is methodologically solid and practically valuable for financial NLP, but its impact is more domain-specific and incremental (noise-robust learning + dataset expansion) compared to Paper 1’s cross-field safety implications.
StructBreak addresses a critical and timely vulnerability in widely-deployed multimodal LLMs (achieving 92-97% attack success rates on models like Gemini 2.5), which has immediate implications for AI safety. The discovery of Structural Cognitive Overload as a novel attack paradigm, combined with comprehensive benchmarking across six leading MLLMs and mechanistic interpretability analysis, provides broad impact across AI safety, alignment, and deployment. Paper 1, while methodologically sound with its energy shields concept, addresses a more niche area of runtime fairness with incremental contributions over existing fairness shields.
StructBreak identifies a novel vulnerability class (Structural Cognitive Overload) in MLLMs with extremely high attack success rates (92-97%), revealing fundamental limitations in current safety alignment paradigms. This has broader impact across AI safety, affecting all multimodal models. While EvoCode-Bench is a solid benchmarking contribution for coding agents, it is more incremental—extending existing evaluation paradigms to multi-turn settings. StructBreak's findings have more urgent real-world implications for deployed systems and are likely to influence safety research and alignment methods across the field.
Paper 2 addresses a critical and highly timely issue in AI safety—vulnerabilities in Multimodal Large Language Models (MLLMs). By introducing the novel concept of Structural Cognitive Overload (SCO) and demonstrating a 92% attack success rate across leading models, it offers deep theoretical insights and broad implications for AI alignment. In contrast, Paper 1 presents a valuable but narrower applied tool for scaling geospatial workflows. The broad applicability, methodological rigor involving deep model interpretability, and fundamental importance to AI safety give Paper 2 a significantly higher potential scientific impact.
Paper 1 likely has higher scientific impact due to its methodological novelty (a new black-box structural attack paradigm for MLLM safety), broad applicability across many multimodal systems and threat scenarios, and strong empirical/interpretability analysis (benchmarks across six MLLMs, high ASR, mechanistic probes). It targets a timely, general alignment vulnerability relevant to AI safety, robustness, and multimodal reasoning. Paper 2 is valuable and societally important, but its scope is narrower (mental health SI evaluation), relies heavily on LLM-simulated users/judges (potential validity concerns), and is more domain-specific.
Paper 2 addresses foundational vulnerabilities in Multimodal LLMs, proposing a novel attack paradigm with broad implications for AI safety and alignment across all domains. In contrast, Paper 1, while highly valuable for medical AI, is limited to a specific domain (dentistry) and functions primarily as a benchmark rather than uncovering fundamental mechanisms of model failure.
StructBreak addresses a fundamental and timely vulnerability in multimodal LLMs (safety alignment failures), which has broader implications across the AI safety community. Its 92-97% attack success rate on leading models like Gemini 2.5 reveals critical weaknesses in current alignment paradigms, likely prompting urgent responses from major AI labs. The novel SCO concept, black-box attack setting, comprehensive benchmark across 10 threat scenarios, and mechanistic interpretability analysis give it high novelty and broad impact. Paper 1, while solid, addresses a more niche problem (graph fraud detection) with incremental improvements over existing methods.
Paper 2 is likely to have higher impact due to strong novelty (structural cognitive overload as an attack/safety failure mode in MLLMs), high timeliness given rapid MLLM deployment, and broad relevance across AI safety, multimodal reasoning, alignment, and security. It proposes an automated black-box framework plus a benchmark across multiple threat scenarios and evaluates six major models with high attack success rates, supported by interpretability analyses—suggesting methodological breadth and actionable insights for real-world safety mitigation. Paper 1 is useful but more niche (ontology/taxonomy tooling) with narrower cross-field urgency.
Paper 1 addresses a fundamental bottleneck in LLM reasoning (Chain-of-Thought flaws) by introducing a scalable, unsupervised RL method to mitigate premature confidence. By removing the need for expensive process reward models while significantly improving accuracy and faithfulness across diverse domains, it offers a foundational advancement in how reasoning models can be trained. While Paper 2 provides valuable insights into AI safety and MLLM vulnerabilities, Paper 1's methodology has a broader potential to reshape general LLM training paradigms and scale test-time compute.
Paper 1 addresses a critical and highly timely issue in AI safety—vulnerabilities in state-of-the-art Multimodal Large Language Models (MLLMs). By introducing a novel attack paradigm with exceptionally high success rates on leading models like Gemini 2.5, and providing mechanistic interpretations, it offers profound implications for AI alignment and security. While Paper 2 presents a valuable framework for time series tasks, the broader relevance, urgency, and potential real-world consequences of securing foundational MLLMs give Paper 1 a significantly higher potential scientific impact.
Paper 1 likely has higher scientific impact: it introduces a scalable, verifiable RL training pipeline for computer-use agents plus a large dataset (32k tuples) and 110 synthetic yet high-fidelity environments, addressing a key bottleneck for RLVR in CUAs. It demonstrates measurable performance gains and transfer (OSWorld-Verified, WebArena) and plans full open-sourcing, enabling broad adoption across agents, RL, HCI, and automation. Paper 2 is timely and important for safety, but mainly provides an attack/benchmark in a narrower niche with less direct capability-building impact.
Paper 1 addresses a critical and timely AI safety vulnerability in multimodal LLMs, demonstrating a novel attack paradigm (Structural Cognitive Overload) with extremely high success rates (92-97% ASR) across leading models. Its findings have broad implications for AI safety alignment, a field of intense current interest. The mechanistic interpretability analysis adds depth. Paper 2, while technically sound, tackles a narrower problem (distilling game code generation into smaller models) with more incremental contributions (SFT+RLVR on a specific domain). Paper 1's safety implications give it wider cross-field relevance and urgency.
Paper 2 addresses a critical vulnerability in state-of-the-art MLLMs, introducing a novel structural attack that bypasses safety filters with a 92% success rate. Given the urgent global focus on AI safety and alignment, uncovering fundamental flaws in frontier models has broader, more immediate implications across the AI community than the domain-specific simulation platform for mobile GUI agents presented in Paper 1.
Paper 2 demonstrates higher potential impact by uncovering a fundamental vulnerability in Multimodal LLMs (Structural Cognitive Overload) that achieves a 92-97% attack success rate on state-of-the-art models like Gemini. By combining a practical black-box attack framework with deep mechanistic interpretability, it fundamentally challenges current safety alignment paradigms. While Paper 1 offers valuable insights into AI control and sampling strategies, Paper 2 exposes a broader, more critical flaw in multimodal reasoning safety that will likely spur significant, widespread follow-up research in foundational model alignment.
Paper 1 exposes a critical, novel vulnerability in state-of-the-art Multimodal LLMs, demonstrating a 92% attack success rate in bypassing safety filters via structural cognitive overload. The implications for AI safety and alignment are profound and urgently relevant. While Paper 2 offers a solid methodological improvement for multi-agent planning (9.75% gain), Paper 1's fundamental challenge to current safety alignment paradigms and its comprehensive evaluation across leading models promise a much broader and immediate impact across the AI community.
Paper 2 likely has higher impact: it introduces a new, broadly relevant safety failure mode (Structural Cognitive Overload) for MLLMs, an automated black-box attack/evaluation framework, and a large benchmark across threat scenarios with strikingly high attack success rates, plus mechanistic analyses. This is timely and consequential for deployment, policy, and alignment research across multimodal AI. Paper 1 is rigorous and offers a useful benchmark for KG construction from tables, but its scope is narrower (table-to-KG fidelity and retrieval diagnostics) and likely impacts a more specialized community.