Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning
Chi Zhang, Haibo Qiu, Qiming Zhang, Yufei Xu, Xinbo Gao, Jing Zhang
Abstract
We present Seirênes, a self-play RL framework that transforms contextual interference from a failure mode of LLM reasoning into an internal training signal for co-evolving more resilient reasoners. While RL with verifiable rewards has significantly advanced reasoning capabilities, models can still exhibit fragility when encountering non-idealized contexts: scenarios characterized by superfluous information, tangential instructions, or incidental correlations that differ from the clean distributions typical of standard benchmarks. Seirênes harnesses this vulnerability through a parameter-shared adversarial self-play loop. Within this framework, a single model is trained both to construct plausible yet distracting contexts that expose its own reasoning blind spots and to solve problems by discerning the essential task from these perturbations to recover the core underlying logic. By pitting these competing objectives against each other, Seirênes compels the model to move beyond superficial pattern matching and anchors its capabilities in robust underlying reasoning. This continuous interaction sustains an informative co-evolutionary curriculum as the model improves. Across seven mathematical reasoning benchmarks, Seirênes achieves average gains of +10.2, +9.1, and +7.2 points at the 4B, 7B, and 30B model scales, respectively. In addition, distracting contexts produced by the 4B Seirênes model reduce the accuracy of top-tier closed-source models (GPT and Gemini) by roughly 4 to 5 points, revealing Seirênes' general ability to uncover reasoning models' blind spots.
AI Impact Assessments
Scientific Impact Assessment: Seirênes
1. Core Contribution
Seirênes introduces a novel self-play reinforcement learning framework where a single shared-parameter LLM simultaneously learns two roles: an Adversary that generates plausible but misleading contextual hints, and a Reasoner that must solve the original problem despite these perturbations. The key insight is that contextual brittleness—a known failure mode where LLMs are derailed by irrelevant or misleading information—can be repurposed as an internal training signal through adversarial co-evolution. Unlike prior self-play approaches (R-Zero, Absolute Zero, SPICE) that focus on task generation or verification, Seirênes keeps the task fixed and evolves only the surrounding context. This is a genuinely novel axis of self-play that complements existing paradigms.
2. Methodological Rigor
The paper demonstrates strong methodological discipline across several dimensions:
Experimental design: The evaluation spans seven mathematical reasoning benchmarks, three model backbones (4B, 7B, 30B), and includes both clean-accuracy and robustness evaluations. The inclusion of post-training-cutoff benchmarks (AIME 2026, HMMT 2026) provides meaningful contamination controls.
Budget-controlled comparisons: The wall-clock matched experiments (Fig. 2a) are particularly well-designed, comparing vanilla DAPO (G1=8), rollout-matched DAPO (G1=24), and Seirênes under identical 40-hour budgets on the same hardware. This rules out the trivial explanation that gains come from additional compute.
Ablation quality: The static vs. evolving hints ablation (Fig. 3) is crucial—it demonstrates that offline hints from a stronger model (Gemini 3 Flash) plateau below the evolving adversary, isolating the value of dynamic co-evolution. The per-step attack-strength statistics (Table 4, Appendix E) provide compelling data-level evidence: Seirênes maintains increasing attack pressure over training while static hints decay.
Reward formulation analysis: The natural bounding properties of the adversarial reward (Section 4.2) are elegantly motivated—the "too hard" and "too easy" regimes are automatically handled without auxiliary stabilization, which is a clean theoretical contribution.
Limitations: The three-round rollout structure (R1, R2, R3) inherently increases per-step cost. While the authors introduce mitigation strategies (latency-aware scheduling, mastery-aware sampling, bounded FIFO buffers), Seirênes remains more expensive than vanilla RL. The paper acknowledges this honestly. Additionally, the REINFORCE-style update for the Adversary (Eq. 12) is a simpler choice than GRPO, though the authors justify this with the built-in variance reduction from averaged success rates.
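The bounding behaviour noted under reward formulation can be illustrated with a small sketch. This assessment does not reproduce the paper's adversarial reward, so the form below is an assumption made purely for illustration: the Adversary is rewarded by the drop in the Reasoner's averaged success rate when a hint is added, floored at zero. The function names and the example rollout outcomes are hypothetical and are not taken from the paper.

```python
def success_rate(outcomes):
    """Fraction of rollouts whose final answer was verified as correct (1) vs. not (0)."""
    return sum(outcomes) / len(outcomes)


def adversary_reward(clean_outcomes, hinted_outcomes):
    """Hypothetical adversary reward (an assumption, not the paper's equation).

    Rewards the hint by how much it lowers the Reasoner's averaged success
    rate relative to the clean problem, floored at zero.
    """
    drop = success_rate(clean_outcomes) - success_rate(hinted_outcomes)
    return max(0.0, drop)


# "Too hard": the Reasoner already fails without a hint, so nothing can drop.
too_hard = adversary_reward([0, 0, 0, 0], [0, 0, 0, 0])

# "Too easy": the Reasoner stays correct despite the hint, so the hint earns nothing.
too_easy = adversary_reward([1, 1, 1, 1], [1, 1, 1, 1])

# Informative middle: the hint flips some, but not all, rollouts.
informative = adversary_reward([1, 1, 1, 0], [1, 0, 0, 0])

print(too_hard, too_easy, informative)
```

Under this assumed form, the reward vanishes at both extremes and is positive only when a hint changes outcomes on a problem the Reasoner can otherwise solve, which matches the qualitative claim that no auxiliary stabilization is needed. Averaging over several rollouts before taking the difference is also one way to read the "built-in variance reduction" cited for the REINFORCE-style update.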
3. Potential Impact
Immediate applications: The framework directly improves mathematical reasoning. Average gains of +10.2 (4B), +9.1 (7B), and +7.2 (30B) points are substantial, especially on competition-level benchmarks. These are clean test-time gains without any hint conditioning, meaning the resulting model is simply a better reasoner.
Cross-model attack transfer: The finding that a 4B model's adversarial hints degrade GPT-5.1 by ~4 points and Gemini 3 Flash by ~5 points (Table 5) is remarkable and suggests the adversary discovers generalizable reasoning blind spots rather than model-specific artifacts. This has implications for both red-teaming and robustness evaluation.
Robustness generalization: Table 2 shows that adversarial-context training transfers to structural perturbations and out-of-domain distractors (MMLU, OpenBookQA), suggesting the model learns genuinely more robust reasoning rather than memorizing the specific hint distribution.
Broader paradigm: The insight that contextual interference can serve as an internal training signal has potential applications beyond math—RAG robustness, tool-use reliability, and multi-step agent reasoning all face contextual perturbation challenges.
4. Timeliness & Relevance
This paper is extremely timely. The RLVR paradigm is the dominant post-training approach for reasoning LLMs, yet robustness to contextual perturbations remains a significant gap. Prior work (Math-Perturb, GSM-IC) has documented 27-31% accuracy drops from structural reformulations—a critical weakness for deployment. The self-play reasoning community (R-Zero, Absolute Zero, SPICE) has focused on task generation; Seirênes opens a complementary and arguably more fundamental axis by targeting *how* models reason rather than *which* problems they solve.
The cooperative-hint literature (SAGE, LUFFY, Scaf-GRPO, InT) provides important context—Table 1 shows these methods help, but Seirênes achieves competitive or superior results by using context adversarially rather than cooperatively. The intriguing observation that cooperative-hint methods may increase susceptibility to distractors (Table 2) raises important questions about the nature of context-dependent reasoning.
5. Strengths & Limitations
Key strengths:
Notable limitations:
Overall Assessment
Seirênes presents a well-executed and timely contribution that introduces a genuinely novel form of self-play for LLM reasoning. The co-evolutionary adversarial context framework is theoretically motivated, practically effective, and opens a new research direction. The experimental methodology is thorough with appropriate controls, and the results are convincing across scales and benchmarks. The cross-model attack transfer finding adds unexpected depth to the contribution.
Generated May 13, 2026
Comparison History (20)
Paper 1 (Seirênes) has higher estimated impact due to its broader applicability and timeliness. It addresses a critical and widely recognized problem—LLM reasoning fragility—with a novel self-play framework that scales across model sizes and demonstrates practical impact even against top-tier commercial models. The approach is intuitive, immediately actionable, and relevant to the massive community working on LLM reasoning. Paper 2 is theoretically rigorous with novel convergence guarantees for decentralized multi-agent settings, but its impact is narrower, targeting a more specialized audience. Seirênes' combination of strong empirical gains and broad relevance gives it higher potential impact.
Paper 1 targets a high-stakes, high-impact domain (personalized medicine) with a concrete methodological contribution (sMMD subset-level alignment) addressing a recognized causal inference tradeoff, and demonstrates benefits on large real-world ICU cohorts plus human-AI evaluation, supporting real-world deployment and clinical relevance. Its impact could span causal representation learning, domain shift robustness, and interpretable clinical decision support. Paper 2 is novel and timely for LLM robustness, but evidence is mainly benchmark-based and may face faster obsolescence as base models and training paradigms change, with less direct societal application than clinical decision support.
Paper 2 likely has higher impact: it introduces a broadly applicable training framework for improving LLM reasoning robustness via adversarial self-play, addressing a timely, high-visibility weakness (contextual fragility) with demonstrated gains across multiple benchmarks and scales, plus transferable adversarial examples that expose blind spots in other frontier models. This suggests wide relevance to alignment, evaluation, and deployment robustness. Paper 1 is methodologically elegant and practical for constrained sampling without retraining, but its impact is narrower (discrete diffusion inference-time constraint handling) and more domain-specific.
Formal Conjectures provides a foundational, evolving benchmark of 2615 formalized research-level math problems in Lean 4, including open conjectures that have already led to new mathematical discoveries. Its impact spans both AI and mathematics communities, establishing lasting infrastructure for evaluating and advancing automated reasoning. While Seirênes presents a clever adversarial self-play method with solid empirical gains, it represents an incremental improvement in LLM robustness training. Formal Conjectures' broader utility as a community-driven, zero-contamination benchmark with real mathematical impact gives it higher long-term scientific significance.
Paper 2 appears to offer a broader, more cross-cutting contribution: a model-agnostic, mathematically derived forecasting condition for undesirable behavior shifts with real-time warning potential across many chatbot architectures and high-stakes domains (safety, healthcare, finance, defense). Its claimed validations include diverse models, production-scale chatbots, and prospective prediction aligned with a large external corpus, suggesting strong methodological ambition and real-world relevance. Paper 1 is novel and useful for robustness in LLM reasoning, but its impact is more scoped to training methodology and reasoning benchmarks, with narrower immediate societal application breadth.
Paper 2 is likely to have higher scientific impact due to a more novel, general training framework (adversarial self-play with evolving distractions) that directly improves LLM robustness in realistic, noisy contexts. It demonstrates methodological rigor via scalable experiments across multiple benchmarks and model sizes, reports substantial gains, and provides an additional diagnostic contribution by generating transferable adversarial distractions that degrade strong closed models. Its applications span reliable reasoning, safety, and evaluation, making it broadly impactful and timely. Paper 1 offers valuable conceptual/statistical framing for LLM-as-annotator, but is narrower in application and less likely to drive widespread methodological adoption.
Paper 1 introduces a novel, domain-agnostic self-play RL framework that fundamentally improves LLM reasoning robustness, addressing a critical flaw in current AI models. Its methodological innovation and broad applicability across the rapidly expanding field of LLM reasoning give it higher potential scientific impact than Paper 2, which, while highly valuable for industrial applications, is a domain-specific benchmark confined to programmatic CAD.
Paper 2 presents a highly novel adversarial self-play framework that addresses a fundamental flaw in LLM reasoning—fragility to contextual noise. Its empirical validation is broader and more rigorous, demonstrating significant performance gains (+7-10 points) across multiple model scales (4B to 30B) and showing transferability by successfully distracting top-tier proprietary models. While Paper 1 offers a practical efficiency improvement for smaller models, Paper 2's approach to evolving resilient reasoners represents a more profound contribution to foundational reinforcement learning paradigms for LLMs.
Paper 1 addresses a fundamental limitation in LLMs—reasoning fragility to contextual distractions—using a novel adversarial self-play RL framework. This provides a methodologically rigorous approach to improving core reasoning capabilities, offering broad implications for foundation model training. Paper 2, while highly relevant to multi-agent systems, focuses more on software engineering abstractions and orchestration frameworks. Therefore, Paper 1 demonstrates higher methodological innovation and greater potential to influence the foundational science of AI reasoning.
Paper 2 likely has higher scientific impact: it introduces a novel training framework (adversarial self-play with evolving distractions) that directly improves LLM robustness and reasoning, with sizable multi-benchmark gains across model scales and demonstrated transfer by exposing blind spots in strong closed models. This is timely for reliable deployment and can influence RLHF/RLVR, robustness, and evaluation. Paper 1 is valuable meta-science with an open dataset/tool and relevance to AI governance, but its impact is more interpretive and may be narrower/less immediately actionable than a broadly applicable training method with quantitative improvements.
Seirênes introduces a novel adversarial self-play RL framework that addresses a fundamental vulnerability in LLM reasoning—fragility to contextual distractions. It demonstrates strong empirical results across multiple benchmarks and model scales (4B-30B), with substantial accuracy gains (+7-10 points). The finding that a small 4B model can generate distractions that degrade top-tier models (GPT, Gemini) is particularly impactful. The methodology is more technically innovative, combining adversarial training with co-evolutionary curricula. Paper 2 contributes a benchmark, which, while useful, has narrower methodological novelty and impact scope.
Seirênes presents a novel self-play RL framework that addresses a fundamental vulnerability in LLM reasoning—fragility to contextual distractions—with strong empirical results across multiple benchmarks and scales (+7-10 points). The adversarial self-play approach for co-evolving robust reasoners is innovative and broadly applicable. Its ability to expose blind spots even in top-tier closed-source models demonstrates generalizability. Paper 2 provides useful mechanistic insights into AVLLMs via cross-modal sink tokens, but its scope is narrower (interpretability of a specific model class) and the practical contribution (training-free hallucination mitigation) is more incremental.
Seirênes introduces a more novel and broadly impactful paradigm—adversarial self-play for reasoning robustness—that addresses a fundamental vulnerability in LLM reasoning (fragility to contextual distractions). Its contributions are more multifaceted: it serves as both a training framework and a diagnostic tool for uncovering reasoning blind spots in even top-tier models. The larger empirical gains (+7-10 points across scales), scalability from 4B to 30B, and demonstrated transferability of generated distractions to closed-source models suggest broader real-world impact. StaRPO's stability metrics are useful but more incremental, refining existing RL frameworks rather than introducing a new training paradigm.
Paper 1 presents a highly novel adversarial self-play reinforcement learning framework that fundamentally improves LLM reasoning robustness. Advancing RL for LLM reasoning is currently a critical frontier in AI research. Its results demonstrate significant quantitative gains across scales and reveal transferability against frontier models like GPT and Gemini. In contrast, while Paper 2 offers a practical and cost-effective approach for agent memory, gradient-free retrieval-augmented methods are less likely to drive fundamental algorithmic shifts compared to new RL paradigms for reasoning.
Paper 2 (Seirênes) likely has higher scientific impact due to a more broadly applicable and timely contribution: adversarial self-play to improve robustness to contextual distractions, a key real-world failure mode for deployed LLMs. The co-evolving distractor/solver setup is novel and can transfer across tasks/domains beyond math, with clear implications for safety, reliability, and red-teaming. It reports gains across seven benchmarks and multiple model scales, plus produces transferable adversarial contexts that degrade strong closed models, suggesting broad methodological and practical relevance. Paper 1 is impactful for SWE agents, but narrower in scope.
Paper 1 offers a highly novel adversarial self-play approach that directly addresses LLM reasoning fragility. It demonstrates substantial, quantified empirical gains across multiple model scales and shows generalization by uncovering blind spots in state-of-the-art closed-source models. Paper 2 presents a solid agentic framework but lacks the specific quantitative evidence and broad applicability demonstrated in Paper 1's abstract, making Paper 1 likely to have a larger and more immediate scientific impact.
Paper 1 has higher likely impact due to broader, timelier relevance and clearer methodological novelty: an adversarial self-play RL scheme that directly targets a known, general weakness of LLMs (contextual interference) and demonstrates consistent gains across seven established reasoning benchmarks and multiple model scales, plus transfers to exposing blind spots in closed models. Its contributions generalize across many tasks where robustness to distractors matters. Paper 2 is application-rich and introduces new negotiation benchmarks, but its impact is narrower (negotiation/emotion modeling), relies heavily on simulated evaluations, and the mixture-of-agents/Bayesian orchestration is a less clearly distinct methodological advance.
Seirênes introduces a novel and broadly applicable self-play RL framework that addresses a fundamental weakness in LLM reasoning—fragility to contextual distractions. It demonstrates strong empirical gains (+7-10 points) across multiple scales and benchmarks, and even exposes vulnerabilities in top closed-source models. Its methodological innovation (parameter-shared adversarial self-play for robust reasoning) has broad applicability beyond math to any reasoning domain. IndustryBench, while valuable for industrial LLM evaluation, is more niche—focused on Chinese industrial procurement QA—and serves primarily as a diagnostic benchmark rather than a methodological advance with wide-reaching impact.
Paper 2 tackles a fundamental and widespread issue in LLMs—fragility in reasoning when facing contextual distractions. Its innovative adversarial self-play RL framework not only improves reasoning across multiple scales but also generates transferable adversarial examples that affect top-tier models. While Paper 1 offers a valuable benchmark for customer service agents, Paper 2's methodological advancements in core LLM reasoning capabilities provide a broader, more profound impact on the development of robust artificial intelligence.
Paper 2 (Seirênes) addresses a critical and timely problem in LLM reasoning robustness with a novel self-play adversarial framework that shows strong empirical results across multiple scales and benchmarks. Its broad applicability to improving LLM reasoning—a central concern in AI research—and its demonstration that even top-tier models have exploitable blind spots give it wide impact potential. Paper 1 (miss-MDPs) makes a solid theoretical contribution bridging missing data theory and POMDPs, but addresses a more niche intersection with narrower immediate applications. The timeliness and breadth of Paper 2's contributions to the rapidly evolving LLM field give it higher estimated impact.