Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

#1091 of 2292 · Artificial Intelligence
Tournament Score: 1418±45
Win Rate: 47% (8 wins, 9 losses, 17 matches)
Rating: 5.8/10

Abstract

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework"

1. Core Contribution

This paper identifies a fundamental vulnerability in concept erasure methods for diffusion models (DMs) and proposes ConceptAgent, a black-box, training-free, multi-agent framework to exploit it. The key insight is that concept erasure primarily disrupts early-stage text-semantic alignment during denoising, while later stages—where generation becomes increasingly state-driven rather than text-driven—remain largely intact. By injecting surrogate-guided structured noise into intermediate denoising states, the method bypasses erased text-to-concept mappings without requiring model parameters, gradients, or internal representations.
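To make the idea of starting from a surrogate-guided noisy state concrete, the sketch below noises a stand-in surrogate latent to an intermediate timestep. The review does not state the paper's injection rule, so this uses the textbook DDPM forward process as an assumption. The linear beta schedule, the timestep 700, the latent shape, and the names alpha_bar and noisy_state are all hypothetical.

```python
import numpy as np

def alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def noisy_state(x0, t, abar, rng):
    """Textbook DDPM forward noising:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps

rng = np.random.default_rng(0)
abar = alpha_bar()

# Stand-in for the latent of a surrogate image (hypothetical shape and values).
surrogate = rng.standard_normal((4, 8, 8))

# Noisy state from which a denoiser could resume; t=700 is an arbitrary choice.
x_t = noisy_state(surrogate, t=700, abar=abar, rng=rng)

# Fraction of the surrogate signal amplitude that survives at this timestep.
signal_share = float(np.sqrt(abar[700]))
```

A larger t leaves less of the surrogate in x_t, which is one way to picture the trade-off between carrying the surrogate's semantics forward and letting the model reshape them.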

The trajectory-based perspective is the paper's most meaningful conceptual contribution. The decomposition of the denoising process into a "text-conditioned estimate" and a "semantic-noise estimate" (Eq. 2), with the observation that their relative dominance shifts across timesteps, provides a useful analytical framework. Theorems 3.1 and 3.2 formalize the notions of "semantic dominance transition" and "trajectory intersection," respectively, offering theoretical grounding for why surrogate-guided states can lead to awakening of erased concepts.

2. Methodological Rigor

Theoretical analysis: The two theorems are stated but not rigorously proven in the main text. Theorem 3.1 essentially states that when the text-conditioned component dominates, text controls semantics, and vice versa—this is somewhat tautological without quantitative characterization of the transition point t*. Theorem 3.2 claims that non-injective transition operators allow trajectory intersection, which is geometrically intuitive but lacks formal proof regarding when and under what conditions such intersections occur in practice. The theoretical claims would benefit from tighter bounds and explicit assumptions about the model class.
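The geometric intuition behind trajectory intersection can be shown with a toy map. The sketch below is not the paper's transition operator: integer halving is simply a convenient non-injective function, and the names step and trajectory are hypothetical.

```python
def step(state):
    """Toy non-injective transition: both 2k and 2k + 1 map to k."""
    return state // 2

def trajectory(start, num_steps):
    """Iterate the transition and record every visited state."""
    states = [start]
    for _ in range(num_steps):
        states.append(step(states[-1]))
    return states

a = trajectory(12, 3)  # [12, 6, 3, 1]
b = trajectory(13, 3)  # [13, 6, 3, 1]

# Distinct starting points, identical states from the first step onward.
merged_at = next(i for i, (u, v) in enumerate(zip(a, b)) if u == v)  # 1
```

The example also shows why the claim needs qualification: the merge here follows directly from the choice of map, whereas for a learned denoiser the conditions under which two trajectories meet are exactly what a proof would have to establish.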

Empirical validation: The controlled experiments in Section 4 (Figure 2) are well-designed and provide convincing evidence for the trajectory-based vulnerability. The three experimental conditions—late-stage text removal, mid-trajectory model switching, and early-stage erasure—effectively decompose the role of text conditioning across denoising stages.

Evaluation methodology: The use of ACC (ResNet-50 classification) and CLIP score as metrics is standard but limited. ACC measures whether the classifier recognizes the target concept, but doesn't capture visual quality, diversity, or subtlety of awakening. The evaluation across three erasure methods (UCE, RECE, SPEED) and three MLLMs (GPT-5, Gemini 3.0, Qwen-7B) provides reasonable breadth. However, only four target concepts are tested (golf ball, tench, garbage truck, pistol), which is a narrow evaluation set that limits conclusions about generalizability.
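For reference, the two metrics reduce to simple computations once classifier predictions and embeddings are available. The sketch below assumes predictions and embeddings have already been produced by some classifier and encoder. The labels, the vectors, and the names accuracy and cosine_score are hypothetical, and CLIP-style scores are often rescaled in ways not shown here.

```python
import numpy as np

def accuracy(predicted_labels, target_label):
    """Fraction of generated images that the classifier assigns to the target concept."""
    predicted = np.asarray(predicted_labels)
    return float(np.mean(predicted == target_label))

def cosine_score(image_emb, text_emb):
    """Cosine similarity between an image embedding and a text embedding."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    denom = np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    return float(image_emb @ text_emb / denom)

# Hypothetical classifier outputs for four generated images.
preds = ["tench", "tench", "goldfish", "tench"]
acc = accuracy(preds, "tench")  # 0.75

# Hypothetical two-dimensional embeddings, for illustration only.
score = cosine_score([1.0, 0.0], [1.0, 1.0])  # about 0.7071
```

Both numbers say only whether the target concept is recognizable, which is the limitation raised above: neither reflects visual quality or diversity.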

Reproducibility concerns: Several design choices appear manually tuned without clear justification—tf=70 steps for noise injection, tcan=35 for composition refinement, K=100 surrogates, J=3 physical refinement prompts. The sensitivity to these hyperparameters is not analyzed.
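A sensitivity analysis of the kind requested could begin with a plain grid over the four settings. The sketch below only enumerates configurations and does not run any attack. The keys t_f, t_can, K, and J stand for the noise-injection step, the composition-refinement step, the surrogate count, and the number of refinement prompts. Apart from the reported values (70, 35, 100, 3), every grid value is a hypothetical choice.

```python
from itertools import product

# Reported settings sit in the middle of each list; neighbours are hypothetical.
grid = {
    "t_f": [50, 70, 90],
    "t_can": [25, 35, 45],
    "K": [50, 100, 200],
    "J": [1, 3, 5],
}

def sweep(grid):
    """Yield one configuration dict per point of the Cartesian product."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(sweep(grid))
reported = {"t_f": 70, "t_can": 35, "K": 100, "J": 3}
```

Varying one setting at a time around the reported configuration would cut the 81 grid points to 9 runs, which may be the more realistic option given the cost of MLLM calls.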

3. Potential Impact

AI Safety: The paper's primary impact lies in exposing vulnerabilities in concept erasure—a critical safety mechanism for deployed DMs. Demonstrating that black-box attacks can awaken erased concepts (including harmful ones like nudity and blood, as shown in Figure 8) is an important finding for the safety community. This could motivate development of more robust erasure techniques that account for trajectory-level dynamics.

Defense design: The trajectory-based perspective suggests that future erasure methods should not only disrupt text-concept mappings but also address semantic propagation through intermediate states. This is a constructive insight that could reshape erasure method design.

Dual-use concerns: The paper demonstrates awakening of safety-critical concepts (pistol, nudity, blood) under black-box constraints, which raises dual-use concerns. While the authors frame this as exposing vulnerabilities, the detailed methodology and promised code release lower the barrier for malicious use.

4. Timeliness & Relevance

The paper addresses a timely concern. Concept erasure has become a standard safety mechanism for deployed diffusion models, and understanding its limitations is urgent. The black-box setting is particularly relevant since most users interact with DMs through APIs without model access. Prior awakening attacks largely assumed white-box access, so this work fills a genuine gap.

The use of modern MLLMs (GPT-5, Gemini 3.0) as agent backbones is timely, leveraging the reasoning capabilities of frontier models for structured adversarial attacks—a paradigm that is likely to become increasingly relevant.

5. Strengths & Limitations

Strengths:

  • The trajectory-based analysis provides genuine insight into how semantic information propagates during denoising, moving beyond the static view of concept erasure
  • The black-box, training-free setting is practically meaningful and underexplored
  • The multi-agent decomposition (Strategist, Guesser, Director, Referee) provides modularity and interpretability
  • Cross-model evaluation (SD v1.4 and v2.1) and cross-MLLM evaluation demonstrate reasonable generalizability
  • The empirical analysis in Section 4 is particularly well-designed and illuminating

Limitations:

  • Only 4 target concepts tested—severely limits generalizability claims, especially for abstract or complex concepts
  • Theoretical analysis lacks formal proofs; theorems are stated without derivation
  • Heavy dependence on powerful MLLMs (GPT-5) for agent reasoning raises questions about accessibility and cost
  • No comparison with other black-box attacks beyond concept arithmetic (ARC) and one white-box method (CCE)
  • The pipeline involves many sequential steps and hyperparameters, making it complex and potentially brittle
  • No ablation study examining the contribution of individual agents or the sensitivity to key parameters (tf, tcan, K)
  • The evaluation does not measure preservation of non-erased concepts or overall model utility
  • FID or other distributional quality metrics are absent
  • The surrogate concept selection process relies heavily on MLLM reasoning quality, which may vary unpredictably

Additional Observations

The paper's framing as a "multi-agent" system, while trendy, is somewhat superficial: the agents are essentially pipeline stages orchestrated by MLLM prompting. The multi-agent terminology adds complexity without clear methodological benefit over a standard pipeline design.

The comparison baseline set is thin. More recent awakening methods should be compared, and the comparison with ARC and CCE lacks controlled experimental conditions (e.g., ARC is tested only under SPEED erasure in Figure 6).

Rating: 5.8/10
Significance 6.5 · Rigor 4.5 · Novelty 6 · Clarity 6.5

Generated May 19, 2026

Comparison History (17)

    vs. Probabilistic Tiny Recursive Model
    gemini-3.1 · 5/20/2026

    Paper 1 presents a fundamental advancement in efficient AI reasoning by enabling tiny models (7M parameters) to outperform massive frontier LLMs on complex tasks using test-time compute scaling. This challenges the current paradigm of relying solely on massive scale for reasoning, offering immense potential for edge computing, democratizing AI, and inspiring alternative architectures. Paper 2, while offering important insights into AI safety and diffusion model vulnerabilities, has a narrower scope focused on adversarial attacks against concept erasure.

    vs. Beyond Rational Illusion: Behaviorally Realistic Strategic Classification
    gemini-3.1 · 5/20/2026

    Paper 2 addresses an urgent and highly topical issue: the safety and security of widely deployed generative AI models. By exposing black-box vulnerabilities in concept erasure mechanisms, it has immediate, real-world implications for AI deployment and red-teaming. While Paper 1 offers a strong interdisciplinary bridge, the explosive growth and focus on foundation model safety gives Paper 2 a higher potential for rapid, widespread scientific impact.

    vs. Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification
    claude-opus-4.6 · 5/19/2026

    Paper 1 (CardioThink) addresses a critical gap in clinical AI by introducing structured physician-inspired reasoning for ECG diagnosis, combining interpretability with accuracy. Its novel SSPO optimization method and clinically aligned multi-stage reasoning framework have broad implications for trustworthy medical AI beyond just ECG. Paper 2 (ConceptAgent) makes a solid contribution to AI safety by exposing limitations of concept erasure in diffusion models, but its impact is more niche—primarily relevant to the adversarial robustness of generative models. Paper 1's potential for real-world clinical deployment and cross-domain applicability in medical AI gives it higher overall impact.

    vs. Learning Lifted Action Models from Traces with Minimal Information About Actions and States
    gpt-5.2 · 5/19/2026

    Paper 2 is likely to have higher impact due to timeliness and broad relevance: it addresses safety and robustness of diffusion models, a high-velocity area with immediate implications for deployment, governance, and adversarial ML. The proposed training-free, black-box multi-agent attack framework is a novel angle and can influence both defensive concept-erasure research and evaluation standards across generative models. Paper 1 is methodologically rigorous and valuable for planning/representation learning, but its impact is narrower and in a more mature subfield, with fewer cross-domain and near-term real-world stakes.

    vs. AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment
    gemini-3.1 · 5/19/2026

    Paper 2 addresses a critical bottleneck in aligning Large Language Models for complex reasoning: token-level credit assignment in reinforcement learning. By introducing a novel self-distillation mechanism with a reflection bottleneck, it tackles the timely and highly impactful problem of improving LLM reasoning capabilities without late-stage collapse. While Paper 1 offers valuable insights into diffusion model vulnerabilities (AI safety), advancing foundational LLM reasoning (Paper 2) currently commands broader applicability and higher transformative potential across diverse domains such as mathematics, science, and general tool-use.

    vs. Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery
    gemini-3.1 · 5/19/2026

    Paper 2 demonstrates significant breadth of impact and potential for real-world applications by accelerating scientific discovery across multiple disciplines (physics, chemistry, biology, materials science). Its inclusion of theoretical guarantees alongside a real-world wet-lab experiment (battery electrolytes) showcases strong methodological rigor. While Paper 1 addresses an important AI safety vulnerability, Paper 2's potential to broadly enhance empirical research optimization gives it a higher overall scientific impact.

    vs. Multi-Paradigm Agent Interaction in Practice: A Systematic Analysis of Generator-Evaluator, ReAct Loop, and Adversarial Evaluation in the buddyMe Framework
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a critical vulnerability in AI safety by revealing fundamental flaws in concept erasure for diffusion models. Its novel, black-box approach provides deep theoretical insights into denoising dynamics and semantic control. In contrast, Paper 2 is an empirical system-integration study of existing LLM agent paradigms. While practically useful for software engineering, Paper 1 offers significantly higher theoretical novelty, methodological innovation, and broader implications for the rapidly growing field of generative AI alignment and security.

    vs. Do LLMs Experience an Internal Polylogue? Investigating Reasoning through the Lens of Personas
    gemini-3.1 · 5/19/2026

    Paper 2 addresses the critical field of LLM reasoning and mechanistic interpretability. By introducing the 'polylogue' to dynamically monitor and steer latent persona directions during generation, it provides a novel, interpretable approach to reasoning-time control. This has broader implications for improving LLM accuracy, alignment, and safety across numerous NLP applications, giving it a higher potential impact than Paper 1's narrower focus on bypassing concept erasure in diffusion models.

    vs. EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
    gemini-3.1 · 5/19/2026

    Paper 2 tackles the critical challenge of dynamic, execution-time coordination in LLM-based multi-agent systems, moving beyond static workflows. This advancement is highly relevant for solving complex, long-horizon tasks, a major bottleneck in developing autonomous agents. Its broad applicability across reasoning and research benchmarks (e.g., GAIA) suggests a wider transformative impact on general AI capabilities compared to Paper 1's narrower, albeit important, focus on bypassing concept erasure in image diffusion models.

    vs. Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
    gpt-5.2 · 5/19/2026

    Paper 2 introduces a broadly applicable reliability criterion (policy invariance) for LLM-based safety evaluation, plus concrete metrics (Policy Invariance Score) and reporting (Judge Card) with a released protocol—likely to reshape how many benchmarks and deployments audit evaluators. Its impact spans AI safety, evaluation science, alignment, and agent benchmarking, and is highly timely given widespread reliance on LLM-as-a-judge. Paper 1 is novel and important for diffusion-model security, but is narrower in scope and mainly exposes limitations/attacks on concept erasure rather than providing a general evaluation standard across LLM systems.

    vs. SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science
    claude-opus-4.6 · 5/19/2026

    Paper 2 addresses a fundamental and timely AI safety concern—the robustness of concept erasure in diffusion models—with a novel black-box multi-agent framework. Its finding that concept erasure merely suppresses rather than eliminates concepts has broad implications for AI safety policy and practice. The trajectory-based analysis provides deep mechanistic insight into diffusion model behavior. Paper 1, while valuable, targets a narrower niche (LLM benchmarking for scientific task formulation) with incremental contribution to the benchmark landscape. Paper 2's safety implications give it broader cross-field impact and greater urgency.

    vs. CATO: Charted Attention for Neural PDE Operators
    claude-opus-4.6 · 5/19/2026

    CATO addresses fundamental challenges in neural PDE operators—computational cost on complex geometries and physical fidelity—with a principled approach combining learned chart mappings, efficient axial attention, and derivative-aware losses, backed by theoretical guarantees and strong empirical results (26.76% improvement with 82% fewer parameters). This has broad impact across scientific computing, engineering simulation, and computational physics. Paper 2, while relevant to AI safety, is more narrowly focused on adversarial attacks against concept erasure in diffusion models—an important but more niche concern with less cross-disciplinary reach and methodological depth.

    vs. TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?
    gemini-3.1 · 5/19/2026

    Paper 1 exposes critical vulnerabilities in generative AI safety by demonstrating how erased concepts in diffusion models can be awakened under black-box constraints. This provides fundamental insights into model dynamics and semantic control, significantly impacting the broader AI alignment and security communities. In contrast, while Paper 2 introduces a valuable and comprehensive benchmark, its impact is primarily confined to the specific application of LLMs within the telecommunications industry, lacking the fundamental methodological novelty of Paper 1.

    vs. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
    gemini-3.1 · 5/19/2026

    Paper 1 addresses a critical and widespread methodological crisis in AI: the unreliability of interactive agent benchmarks. By introducing a framework for evidence-supported bounds, it has the potential to fundamentally shift how the entire field evaluates and reports agent capabilities, preventing misleading progress metrics. Paper 2 is highly relevant for generative AI safety and red-teaming, but its focus on bypassing diffusion model concept erasure is narrower in scope. Establishing rigorous evaluation standards (Paper 1) typically has a broader, more foundational impact across multiple domains of AI research.

    vs. Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
    claude-opus-4.6 · 5/19/2026

    Paper 1 addresses a fundamental security vulnerability in diffusion models by demonstrating that concept erasure methods are fundamentally flawed, even under black-box constraints. This reveals deep insights about the denoising process and has significant implications for AI safety, a critically important and timely area. Paper 2, while technically sound, addresses a more incremental optimization problem (routing between reasoning/non-reasoning LLM judges) with narrower impact. Paper 1's findings about the limitations of concept erasure are likely to influence future safety research and policy discussions more broadly.

    vs. Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities
    gpt-5.2 · 5/19/2026

    Paper 2 likely has higher impact due to broader relevance and timeliness: it addresses safety and controllability of diffusion models, a central, fast-moving area with implications for AI governance and deployment. Its black-box, training-free multi-agent awakening attack is novel and practically consequential, potentially affecting many existing concept-erasure defenses and prompting new robust methods. The trajectory-based insight into denoising dynamics may generalize across generative models. Paper 1 is solid and application-relevant (clinical segmentation with missing modalities) but is more domain-specific with narrower cross-field reach.

    vs. Context Pruning for Coding Agents via Multi-Rubric Latent Reasoning
    claude-opus-4.6 · 5/19/2026

    Paper 2 addresses a fundamental security vulnerability in diffusion models—showing that concept erasure methods are fundamentally flawed even under black-box constraints. This has broader impact across AI safety, trustworthy AI, and generative model governance. The finding that erased concepts can be awakened without any model access is a significant negative result with implications for policy and deployment. Paper 1, while technically solid, addresses a narrower optimization problem (token efficiency for coding agents) with incremental improvements. Paper 2's cross-disciplinary relevance to safety, security, and generative AI gives it higher potential impact.