Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel
Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.
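The abstract's claim that each PGD step becomes "unreliable in direction and limited in magnitude" can be made concrete with a small sketch. The code below is a generic illustration of one ℓ₂ PGD step and of two simple gradient measurements, not the authors' implementation: the function names are hypothetical, the inputs are plain Python lists standing in for flattened images, and the paper's actual static and dynamic gradient indicators may be defined differently.

```python
import math

def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(c * c for c in v))

def pgd_l2_step(x_adv, x_clean, grad, step_size, epsilon):
    """One l2 PGD ascent step, then projection onto the epsilon-ball and clipping."""
    g_norm = l2_norm(grad)
    if g_norm == 0.0:
        return list(x_adv)  # a zero gradient gives the attacker no direction
    # Move along the normalized gradient direction.
    moved = [xa + step_size * g / g_norm for xa, g in zip(x_adv, grad)]
    # Project the perturbation back onto the l2 ball of radius epsilon.
    delta = [m - xc for m, xc in zip(moved, x_clean)]
    d_norm = l2_norm(delta)
    if d_norm > epsilon:
        delta = [d * epsilon / d_norm for d in delta]
    # Keep pixel values inside the valid range [0, 1].
    return [min(1.0, max(0.0, xc + d)) for xc, d in zip(x_clean, delta)]

def first_order_gain(grad, step_size):
    """First-order loss increase of a normalized step: step_size * ||grad||."""
    return step_size * l2_norm(grad)

def cosine_similarity(g1, g2):
    """Direction agreement between two gradients (1 = same, 0 = orthogonal)."""
    denom = l2_norm(g1) * l2_norm(g2)
    if denom == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(g1, g2)) / denom
```

Read this way, a small gradient norm lowers the first-order gain of every step, and a low cosine similarity between gradients at consecutive iterates means successive steps partly cancel. Neither effect changes where the decision boundary lies, which is the distinction the weaknesses below turn on.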
This paper proposes that training image classifiers with reinforcement learning (policy gradient + ε-greedy exploration) instead of supervised learning (cross-entropy) disrupts gradient-based adversarial attacks. The key insight is that RL's stochastic optimization signal acts as an implicit regularizer, producing models with smaller gradient magnitudes and highly unstable gradient directions. The paper further proposes RL-adv, combining RL training with adversarial training (TRADES), claiming a "dual-layer defense" operating at both gradient-level and decision-boundary-level.
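For readers unfamiliar with treating classification as an RL problem, the following is a minimal sketch of what a policy-gradient objective with ε-greedy exploration could look like for a single example. It rests on assumptions that the summary above does not confirm: the class prediction is treated as the action, the reward is 1 for a correct prediction and 0 otherwise, and the loss is the plain REINFORCE term. All function names are hypothetical, and the paper's actual reward and objective may differ.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon pick a uniformly random class, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(probs))
    return max(range(len(probs)), key=probs.__getitem__)

def reinforce_loss(logits, label, epsilon, rng):
    """REINFORCE loss for one example: -reward * log pi(action).

    Assumed reward: 1.0 if the chosen class equals the label, else 0.0.
    Returns (loss, action, reward).
    """
    probs = softmax(logits)
    action = epsilon_greedy_action(probs, epsilon, rng)
    reward = 1.0 if action == label else 0.0
    loss = -reward * math.log(probs[action])
    return loss, action, reward

rng = random.Random(0)
loss, action, reward = reinforce_loss([2.0, 0.5, -1.0], label=0, epsilon=0.1, rng=rng)
```

Under this assumed reward, a wrong action yields zero loss and therefore no gradient for that example, whereas cross-entropy always produces a gradient toward the true label. That is one plausible source of the sparser, noisier training signal the summary describes.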
1. The AutoAttack results undermine the central claim. Under AutoAttack, plain RL shows essentially no advantage over plain SL (16.71% vs 15.42% on CIFAR-10). The authors acknowledge this but frame it as expected — yet this is precisely the scenario that matters most in adversarial robustness research. Since Athalye et al. (2018) and Croce & Hein (2020), the community has established that robustness claims must survive adaptive attacks. The large PGD gap (56% vs 5%) appears to be largely an artifact of gradient masking/obfuscation rather than genuine robustness.
2. Gradient masking concerns. The pattern observed — high robustness against standard PGD, vulnerability to transfer attacks, and no advantage under gradient-free attacks — is the classic signature of gradient masking/obfuscated gradients, as characterized by Athalye et al. (2018). The paper does not adequately address this connection. While the authors provide mechanism analysis explaining *why* gradients are disrupted, the adversarial robustness community has explicitly warned that gradient disruption without genuine boundary hardening is a known failure mode, not a defense.
3. RL-adv comparison fairness. The RL-adv vs SL-adv comparison uses identical TRADES configurations, but RL requires 20× more training epochs. The computational budget difference means RL models see far more gradient updates. It's unclear whether SL-adv trained for equivalent compute (or with other regularization techniques that also flatten gradients, such as input gradient regularization or spectral normalization) would close the gap.
4. Perturbation budget choice. The primary evaluation uses ε=7.0 under ℓ₂, which is extremely large for CIFAR-10 (32×32 images normalized to [0,1]). At this budget, most models are trivially broken, making the comparison less meaningful for practical deployment. Standard ℓ₂ budgets in the literature are typically 0.5-1.0 for CIFAR-10.
5. Theoretical contribution. The theoretical section (Section 8) formalizes intuitive observations with standard smoothness/Lipschitz assumptions but provides no novel bounds specific to RL-trained models. Propositions 1 and 2 are straightforward applications of smoothness and telescoping — they explain *how* gradient properties affect attack success but don't prove *that* RL produces these properties.
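The scale concern in point 4 can be checked with simple arithmetic. The sketch below assumes CIFAR-10 inputs are 32×32 RGB images (3072 coordinates) with values in [0,1]; the helper names are hypothetical and used only for illustration.

```python
import math

# CIFAR-10 input size: 32 x 32 pixels with 3 color channels.
DIMENSIONS = 32 * 32 * 3  # 3072 coordinates, each in [0, 1]

def max_l2_distance(dimensions=DIMENSIONS):
    """Largest possible l2 distance between two images with values in [0, 1]."""
    return math.sqrt(dimensions)

def rms_per_coordinate(epsilon, dimensions=DIMENSIONS):
    """Root-mean-square change per coordinate for a perturbation of l2 norm epsilon."""
    return epsilon / math.sqrt(dimensions)

for eps in (0.5, 1.0, 7.0):
    rms = rms_per_coordinate(eps)
    print(f"eps={eps}: RMS change per coordinate = {rms:.4f}")
```

Under these assumptions, ε=7.0 corresponds to an RMS change of roughly 0.126 per coordinate, about 13% of the full pixel range, compared with roughly 0.009 to 0.018 for budgets of 0.5 to 1.0.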
The paper identifies an interesting phenomenon, namely that RL training changes gradient structure, but the practical implications are limited.
The most valuable contribution may be the detailed mechanism analysis framework, which could inform future work on understanding how different training paradigms shape loss landscapes.
Adversarial robustness remains an active research area, but the community has moved toward certified defenses, foundation model robustness, and understanding robustness-accuracy tradeoffs at scale. The paper addresses a somewhat exhausted research direction (empirical defenses against ℓp attacks on CIFAR), and the gradient masking phenomenon it rediscovers has been well-characterized since 2018.
The paper presents a systematic investigation of an interesting observation but ultimately rediscovers gradient masking through a different mechanism. The honest reporting of AutoAttack and transfer attack results is commendable, but these results also demonstrate that the core contribution (plain RL robustness) is largely illusory. The RL-adv combination shows genuine promise on CIFAR-10, but the advantages diminish on harder tasks and the computational cost is prohibitive. The paper would be strengthened by comparisons with cheaper gradient regularization alternatives and by evaluation at standard perturbation budgets.
Generated Jun 11, 2026
Paper 1 addresses a fundamental theoretical gap in understanding truncated positional encodings for GNNs, which are widely used in practice but poorly understood theoretically. It provides rigorous expressivity results with practical implications for PE design. Paper 2's RL-based adversarial defense, while interesting, faces significant concerns: adaptive attacks specifically designed to handle gradient disruption may overcome the defense, and the adversarial robustness community has repeatedly shown that gradient-masking defenses (which this resembles) tend to provide false security. Paper 1's theoretical contributions are more lasting and foundational.
Paper 1 addresses a critical, timely issue in frontier AI: the alignment tension between helpfulness and harmlessness in RLHF. By providing a mechanistic understanding of this conflict at the neuron level, it offers fundamental insights that directly impact LLM safety and capability. While Paper 2 presents an innovative adversarial defense for image classifiers, Paper 1's focus on LLM alignment tackles a more pressing bottleneck in current AI research, promising broader and more immediate real-world impact in the rapidly growing field of AI safety.
Paper 2 is likely higher impact due to strong timeliness and broad relevance to adversarial robustness, a central ML security problem with cross-domain implications. It proposes a distinctive training paradigm (policy-gradient RL for classifiers) and provides extensive empirical evaluation across datasets, architectures, and attack classes plus mechanism analyses, suggesting higher methodological rigor and generality. The potential real-world application (robust deployment against practical attacks) is immediate. Paper 1 is novel and useful for deployment/quantization, but its scope is narrower and impact may be more specialized to efficient inference for time-series/rollout models.
Paper 1 presents a more rigorous and novel contribution by discovering that RL training implicitly disrupts gradient structures exploited by adversarial attacks, offering a complementary defense mechanism. It provides comprehensive mechanistic analysis across multiple datasets, architectures, and attack types, with clear practical implications for adversarial robustness—a critical problem in AI safety. Paper 2's emotion-augmented learning framework, while creative, is more incremental, relies on a less well-motivated analogy between biological emotion and neural network training, and demonstrates narrower improvements on standard benchmarks without strong mechanistic justification.
Paper 2 addresses latent reasoning in large language models, a highly timely and critical area of current AI research. By making hidden-state recurrence compatible with standard on-policy RL (like GRPO) and enabling mechanistic interpretability, it solves major bottlenecks in developing efficient, reasoning-capable models. Its potential impact on the rapidly growing field of LLM reasoning outweighs Paper 1, which, while novel in using RL for adversarial robustness, applies to a more traditional and specific subfield of model security.
Paper 2 likely has higher scientific impact: it introduces a large-scale, carefully controlled benchmark (balanced 2x2 factorial design) addressing a timely, high-stakes issue in LLM deployment—epistemic susceptibility to citation/authority signals. The artifact (220k prompts + code) enables broad, reproducible follow-on research across domains (science, law, medicine, general knowledge) and intersects ML, HCI, information science, and AI governance. Paper 1 is novel but narrower (adversarial robustness via RL training) and may face adoption barriers due to RL training cost and potential reliance on gradient obfuscation-like effects.
Paper 2 has higher likely impact due to its methodological rigor and broad, durable relevance: it provides computable two-sided a posteriori error certificates for PINNs under verifiable local conditions, advancing reliability/certification—an enabling requirement for scientific and engineering deployment. The results generalize prior work (weaker assumptions, sharper bounds), include explicit linear-system formulas, and propose certificate-informed training. Paper 1 is novel and timely for adversarial robustness, but RL-induced gradient disruption may be less universally applicable, potentially vulnerable to adaptive/non-gradient attacks, and has narrower cross-field reach than rigorous certification theory for differential-equation solvers.
Paper 1 likely has higher impact due to its timely, broadly relevant contribution to adversarial robustness in deep learning, a core and widely applicable ML security problem. The idea that RL-style training can systematically disrupt gradient structure (plus mechanistic analysis and a hybrid RL+adversarial-training defense) is novel and potentially transferable across vision models and other domains where gradient-based attacks dominate. It also evaluates across multiple datasets/architectures and attack types, suggesting stronger methodological rigor and real-world applicability than Paper 2’s more domain-specific chess representation-learning demonstration.
Paper 2 has higher estimated impact: it tackles adversarial robustness—a broad, timely, high-stakes problem in ML security—with systematic multi-dataset, multi-architecture evaluation and detailed mechanism analysis, plus a practical hybrid (RL-adv) that improves robustness against diverse attack classes. Its implications span security, optimization, and training methodology. Paper 1 is novel and elegant for operator learning and interpretability, but its demonstrated scope (limited PDE families) suggests narrower near-term real-world uptake and field breadth compared to adversarial defense advances.
Paper 1 addresses a critical challenge in AI security by introducing a novel conceptual defense: using RL as an implicit regularizer to disrupt adversarial gradient optimization. Its deep mechanistic analysis and the proposed dual-layer defense open new avenues for hybrid training methodologies. This fundamental contribution to neural network security offers broader real-world implications and higher theoretical impact across deep learning domains compared to the specialized time-series clustering framework in Paper 2.