The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan

May 26, 2026

arXiv:2605.26494v1

cs.AI (primary), cs.CL, cs.LG

#289 of 2682 · Artificial Intelligence

Tournament Score: 1509 ± 44 (scale 1050 to 1800)

Win Rate: 76%

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 6 · Clarity 7

Abstract

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; (iii) the latest M2.7 checkpoint takes an early step toward self-evolution -- autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MiniMax-M2 Series

1. Core Contribution

The MiniMax-M2 series presents a Mixture-of-Experts (MoE) language model family with 229.9B total parameters but only 9.8B activated per token, targeting agentic deployment scenarios. The paper's three main contributions are: (i) agent-driven data pipelines producing verifiable trajectories for coding and cowork tasks; (ii) Forge, a scalable RL system designed for long-horizon agent training supporting both white-box and black-box agents; and (iii) an early self-evolution capability where M2.7 can debug training runs and modify its own scaffold.

The central thesis—that a small activated parameter footprint can achieve frontier-tier performance on agentic tasks—is substantiated by competitive results across numerous benchmarks against Claude Opus 4.6, GPT 5.4, and Gemini 3.1 Pro. The efficiency argument is compelling: ~10B activated parameters achieving parity with systems presumably activating 10-100× more parameters per token represents meaningful practical value for deployment cost and latency.
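The efficiency argument above comes down to a simple ratio. The short Python sketch below works it out from the two parameter counts quoted in the assessment. The variable names are illustrative, and the 10-100× multiplier is the reviewer's presumption about competing systems, not a measured figure.

```python
# Figures quoted in the assessment above.
TOTAL_PARAMS_B = 229.9   # total parameters, in billions
ACTIVE_PARAMS_B = 9.8    # parameters activated per token, in billions

# Fraction of the model that participates in each token's forward pass.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

# Reviewer's presumption (not a measurement): competitors activate
# 10x to 100x more parameters per token.
presumed_low_b = 10 * ACTIVE_PARAMS_B
presumed_high_b = 100 * ACTIVE_PARAMS_B

print(f"active fraction: {active_fraction:.2%}")
print(f"presumed competitor range: {presumed_low_b:.0f}B to {presumed_high_b:.0f}B active")
```

Under these numbers roughly 4.26% of the weights are active per token, and the presumed competitor range works out to about 98B to 980B activated parameters.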

2. Methodological Rigor

Architecture. The model design choices are well-motivated through ablation studies. The fine-grained expert design (256 experts, 8 active) with sigmoid gating is empirically validated, and the decision to use full attention over hybrid SWA is backed by thorough experimentation across pre-training and post-SFT evaluation (Tables 2-3). The transparency about hybrid attention's failures—despite theoretical appeal—is unusually candid for an industry report.
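To make the "256 experts, 8 active, sigmoid gating" description concrete, here is a minimal Python sketch of one common form of sigmoid-gated top-k routing. It is an assumption about the general technique, not the paper's implementation: the function `route`, the renormalization of the selected scores, and the random stand-in logits are all hypothetical.

```python
import math
import random

NUM_EXPERTS = 256  # expert count quoted in the assessment
TOP_K = 8          # active experts per token quoted in the assessment

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def route(logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Return {expert_index: mixing_weight} for a single token.

    Assumption: every expert gets an independent sigmoid score, the
    top_k scores are kept, and the kept scores are renormalized to sum to 1.
    """
    scores = [sigmoid(z) for z in logits]
    chosen = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}

# Stand-in router output for one token (random, for illustration only).
rng = random.Random(0)
token_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

weights = route(token_logits)
print(len(weights), round(sum(weights.values()), 6))
```

Unlike a softmax gate, sigmoid scores are computed independently per expert, so one expert's score does not shrink when another's grows; only the final renormalization couples the selected experts.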

Data Pipelines. The SWE-scaling pipeline is impressively comprehensive, covering PR collection, Docker environment synthesis, task diversification, and multi-language support. The AppDev pipeline's three-layer Agent-as-a-Verifier (execution, interaction, visual aesthetics) is a meaningful methodological contribution for evaluating generated applications beyond static analysis. The cowork data pipelines demonstrate thoughtful domain-specific reward design.

RL System. Forge's architecture is well-formulated. The MDP formulation clearly delineates the boundary between policy (LLM generation) and environment (everything else). The windowed-FIFO scheduling addresses a real tension between distributional consistency and throughput. The claimed training speedup of up to 40× from prefix-tree merging would be significant if validated. However, the paper provides limited quantitative ablation of how much each Forge component contributes to final performance.
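The two Forge mechanisms named above can be illustrated with a toy Python sketch. The assessment does not spell out either mechanism, so both functions rest on assumed readings. `merged_token_count` assumes prefix-tree merging means rollouts that share a token prefix store and process that prefix once. `windowed_fifo_pop` assumes windowed FIFO means the trainer may take finished rollouts out of order, but only from a fixed-size window at the head of the submission queue. All names and the toy data are hypothetical.

```python
def merged_token_count(sequences):
    """Count tokens left after merging sequences into a prefix tree (trie).

    Each trie node is one token position, so tokens on a shared prefix
    are counted once instead of once per sequence.
    """
    root = {}
    unique_tokens = 0
    for seq in sequences:
        node = root
        for token in seq:
            if token not in node:
                node[token] = {}
                unique_tokens += 1
            node = node[token]
    return unique_tokens

def windowed_fifo_pop(queue, finished, window):
    """Pop the oldest finished rollout among the first `window` queue entries.

    Returns None when nothing inside the window has finished, even if a
    later rollout has; that is what bounds how far the trainer can skip ahead.
    """
    for idx, rollout_id in enumerate(queue[:window]):
        if rollout_id in finished:
            return queue.pop(idx)
    return None

# Toy data: four rollouts share a 1000-token context, then diverge for 10 tokens.
shared_prefix = list(range(1000))
rollouts = [shared_prefix + [(branch, step) for step in range(10)] for branch in range(4)]
naive_tokens = sum(len(r) for r in rollouts)
merged_tokens = merged_token_count(rollouts)
print(naive_tokens, merged_tokens)

# Toy queue in submission order; r1 and r3 finished early.
queue = ["r0", "r1", "r2", "r3"]
finished = {"r1", "r3"}
first = windowed_fifo_pop(queue, finished, window=2)
second = windowed_fifo_pop(queue, finished, window=2)
print(first, second)
```

In the toy data, merging cuts 4040 tokens to 1040. That ratio is a property of the example and says nothing about the paper's 40× claim. In the queue example the first call returns `r1` and the second returns `None`, because `r3` has finished but sits outside the window until `r0` or `r2` completes. Setting `window=1` gives strict FIFO, and a very large window gives greedy first-finished-first-served.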

Weaknesses in rigor. Several benchmark results use internal, non-public benchmarks (NL2Repo, VIBE-Pro, HyperTask, MM Claw, MEWC v2, Finance Modeling Pro, RISE), making independent verification impossible. The self-evolution claims, while intriguing, are demonstrated primarily through MLE Bench Lite results and qualitative descriptions rather than rigorous controlled experiments isolating self-evolution's contribution. The "30% to 50% of daily iteration workload" claim for autonomous debugging lacks formal measurement methodology.

3. Potential Impact

Deployment efficiency. The 9.8B activated parameter footprint has direct economic implications for production deployment, potentially democratizing access to frontier-tier agentic capabilities. This matters enormously for real-world adoption where inference cost scales with activated parameters.

Agentic RL infrastructure. Forge's design—particularly the training-inference-agent decoupling and black-box agent support—could influence how the field approaches RL training for agentic systems. The ability to train arbitrary agent architectures without framework modifications addresses a genuine scalability bottleneck.

Data pipeline methodology. The SWE-scaling pipeline's approach to synthesizing verifiable coding tasks from GitHub PRs, including multi-language Docker environment synthesis and task diversification, provides a replicable template for the community (though the data itself isn't released).

Self-evolution. If the self-evolution capabilities mature, they represent a paradigm shift in model development. The current demonstration is embryonic but directionally significant—autonomous scaffold modification and training debugging could dramatically accelerate model iteration cycles.

4. Timeliness & Relevance

The paper directly addresses the industry's shift toward agentic AI deployment, which is arguably the most active frontier in applied AI. The efficiency thesis is particularly timely as inference costs become a primary bottleneck for agentic workflows requiring hundreds of tool calls per task. The work on verifiable reward construction for diverse agentic domains addresses a critical bottleneck in scaling RL for agents beyond simple QA tasks.

5. Strengths & Limitations

Key Strengths:

Exceptional breadth of engineering and system design, covering architecture, data, RL infrastructure, and deployment

Honest negative results on hybrid attention that will save others significant compute

Comprehensive within-series progression tracking (M2→M2.5→M2.7) demonstrating consistent improvement

The windowed-FIFO scheduling and prefix tree merging are concrete, generalizable techniques

Multi-domain mixed RL training with curriculum design across context length, difficulty, and domain ratios

Notable Limitations:

Many benchmarks are internal and non-reproducible; strong reliance on proprietary evaluation

Self-evolution claims are more aspirational than rigorously demonstrated; the MLE Bench Lite case study, while interesting, doesn't isolate self-evolution from other improvements

No model weights or data releases mentioned, limiting reproducibility

The paper reads as a technical report rather than a focused research contribution—the breadth comes at the cost of depth on any single innovation

Missing detailed compute costs, training duration, and infrastructure specifications

Performance gaps remain visible on several benchmarks (HLE: 28.0 vs 44.7 for Gemini; Terminal-Bench: 57.0 vs 75.1 for GPT 5.4), and the paper's framing sometimes overstates competitiveness

The CISPO algorithm is cited from prior work without sufficient novelty analysis here

Comparison to Prior Art: M2 builds on DeepSeek-V3's MTP design, MiniMax-M1's CISPO, and standard MoE techniques. The primary novelty lies in the system-level integration and the agentic data/RL pipeline rather than in individual algorithmic contributions.

Summary

MiniMax-M2 is an impressive engineering achievement demonstrating that small activated-parameter MoE models can approach frontier performance on agentic tasks. Its strongest contributions are the comprehensive agentic data pipelines with verifiable rewards, the Forge RL system architecture, and the transparent architectural ablations. The self-evolution narrative, while compelling, requires more rigorous validation. The work's impact is somewhat limited by its industrial-report nature, reliance on internal benchmarks, and lack of open artifacts.

Rating: 7/10

Significance 7.5 · Rigor 6.5 · Novelty 6 · Clarity 7

Generated May 27, 2026

Comparison History (21)

vs. Plant, Persist, Trigger: Sleeper Attack on Large Language Model Agents

gemini-3.1 · 5/28/2026

Paper 1 introduces a frontier-tier foundation model featuring novel agent-native RL and self-evolution capabilities. Foundation model papers that demonstrate architectural efficiency (MoE) and new training paradigms typically achieve massive adoption, set new industry baselines, and drive broader scientific impact across the AI community compared to domain-specific security benchmarks, despite the strong novelty of Paper 2.

vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation

gemini-3.1 · 5/28/2026

Paper 2 introduces a frontier-level foundational MoE model with significant scale and highly efficient activation, tailored for real-world agentic tasks and self-evolution. Foundational models of this caliber typically generate widespread impact across multiple domains in AI research and industry applications. While Paper 1 offers a highly novel methodological advance in knowledge editing, Paper 2's comprehensive system design, RL framework, and broad utility give it higher potential for widespread scientific and practical impact.

vs. RULER: Representation-Level Verification of Machine Unlearning

gpt-5.2 · 5/28/2026

Paper 1 is more likely to have higher scientific impact due to a clearer, more novel scientific contribution: representation-level verification metrics for machine unlearning (including an oracle-free diagnostic) that expose failures missed by standard output-level tests, demonstrated across multiple modalities and methods with statistical rigor. This targets an urgent, broadly relevant problem (privacy/compliance and trustworthy ML) and provides reusable evaluation tools likely to influence future unlearning research and standards. Paper 2 sounds impactful but reads more like a systems/model release with less verifiable methodological detail in the abstract.

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

claude-opus-4.6 · 5/27/2026

Paper 1 introduces a fundamentally new research direction—agent lifespan engineering—addressing the overlooked problem of how AI agents degrade over time post-deployment. AgingBench provides a novel benchmark framework with clear taxonomy (compression, interference, revision, maintenance aging), diagnostic methodology, and extensive empirical validation across 14 models and ~400 runs. This opens a new subfield with broad implications for reliable agent deployment. Paper 2, while technically impressive as a large MoE model release, is primarily an engineering contribution in the competitive LLM scaling space with incremental novelty. Paper 1's conceptual framework is more likely to spawn follow-up research and shift evaluation paradigms.

vs. The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

gpt-5.2 · 5/27/2026

Paper 2 identifies a fundamental, under-addressed failure mode in retrieval-augmented generation (distinguishing parametric memory from retrieved evidence) and proposes a principled, testable detection method (CRM) with cross-model validation and mechanistic probes. Its implications span safety, evaluation, interpretability, and any high-stakes RAG deployment, making it broadly impactful and timely. Paper 1 is ambitious and application-relevant, but resembles an incremental systems/model-scaling advance in an already crowded MoE/agent-RL landscape, with less clear generalizable scientific insight beyond engineering integration.

vs. LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

gpt-5.2 · 5/27/2026

Paper 1 has higher potential scientific impact: it proposes a large-scale MoE LLM optimized for agentic deployment, combining novel agent-generated verifiable trajectory data, an agent-native RL system (Forge) for long-horizon training, and early self-evolution capabilities. These contributions are methodologically ambitious, timely, and broadly relevant across LLM training, RL, systems, and autonomous agents, with clear real-world application potential. Paper 2 is useful and practical but is a more incremental engineering extension of an entity linking approach/library with narrower scope and likely smaller cross-field impact.

vs. Automatic Layer Selection for Hallucination Detection

gemini-3.1 · 5/27/2026

Paper 2 introduces a large-scale, frontier-tier Mixture-of-Experts foundation model series (MiniMax-M2) designed specifically for agentic deployment. Its contributions span scalable agent-native reinforcement learning, autonomous self-evolution, and large-scale verifiable data pipelines. While Paper 1 provides a useful technique for hallucination detection, Paper 2 offers massive breadth of impact across numerous domains (coding, deep search, reasoning) and demonstrates state-of-the-art capabilities in the rapidly growing field of autonomous AI agents.

vs. UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

claude-opus-4.6 · 5/27/2026

Paper 2 (MiniMax-M2) presents a complete, frontier-tier MoE language model series with novel contributions spanning architecture (229.9B params, 9.8B activated), agent-native RL training (Forge), agent-driven data pipelines, and early self-evolution capabilities. Its breadth of impact is larger—touching model architecture, training infrastructure, agentic AI, and practical deployment at scale. Paper 1 (UnityMAS-O) contributes a useful RL optimization framework for multi-agent LLM systems but is more incremental, extending existing infrastructure (verl) with multi-agent abstractions. M2's combination of efficiency, scale, and self-improvement represents a more transformative contribution.

vs. Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

claude-opus-4.6 · 5/27/2026

MiniMax-M2 represents a more impactful contribution: it introduces a full frontier-tier MoE language model system with novel agent-native RL training (Forge), self-evolution capabilities, and demonstrates state-of-the-art performance across multiple benchmarks with only 9.8B activated parameters. Its architectural innovations (windowed-FIFO scheduling, prefix-tree merging, agent-driven data pipelines) and the self-evolution paradigm have broader implications for the field. Paper 2, while valuable, is more narrowly focused on mobile GUI navigation with dataset and toolkit contributions that serve a specific subdomain.

vs. Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

claude-opus-4.6 · 5/27/2026

Paper 1 introduces a frontier-tier MoE language model series with significant architectural and training innovations (agent-driven data pipelines, scalable RL system, self-evolution capabilities) backed by extensive empirical results across multiple benchmarks. It addresses core challenges in efficient large-scale AI deployment and agentic systems. Paper 2 presents a conceptual framework for measuring agentic technical debt—a useful management tool but relatively narrow in scope, limited to a simulation/spreadsheet illustration, and lacking the empirical depth and broad scientific contribution of Paper 1.

vs. Learning to Reason Efficiently with A* Post-Training

claude-opus-4.6 · 5/27/2026

MiniMax-M2 presents a complete frontier-scale MoE language model system with novel agent-driven training pipelines, a scalable RL system (Forge), and early self-evolution capabilities. Its breadth of impact spans agentic coding, reasoning, and real-world deployment at scale, representing a significant engineering and scientific contribution. Paper 2, while intellectually interesting in applying A* search principles to LLM reasoning training, is narrower in scope (1B-3B models, formal proof generation) and represents an incremental advance in the reasoning-training space. The scale, novelty of self-evolution, and practical deployment implications give Paper 1 higher impact potential.

vs. MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

claude-opus-4.6 · 5/27/2026

MiniMax-M2 introduces a frontier-scale MoE language model series with novel contributions across multiple dimensions: agent-driven data pipelines, a scalable agent-native RL system (Forge), and early steps toward self-evolution. Its 229.9B parameter model with only 9.8B activated parameters represents significant architectural innovation, and it achieves frontier-tier performance across multiple benchmarks. The breadth of impact—spanning RL training systems, MoE architectures, and agentic AI—is substantially larger than Paper 2, which presents an incremental optimization framework for on-device mobile GUI agents with modest (23%) latency improvements on a narrower problem scope.

vs. TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

claude-opus-4.6 · 5/27/2026

MiniMax-M2 represents a major advance in efficient large-scale language models with a novel Mixture-of-Experts architecture (229.9B params, 9.8B activated), agent-native RL training infrastructure, and steps toward self-evolution. Its breadth of impact spans model architecture, training methodology, and agentic AI deployment, with frontier-tier benchmark results. While TADDLE addresses the timely and important problem of LLM-generated review detection with a solid benchmark contribution, its scope is narrower (peer review quality), and its methodological contributions, though rigorous, are more incremental compared to M2's systems-level innovations.

vs. LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

claude-opus-4.6 · 5/27/2026

MiniMax-M2 presents a comprehensive large-scale MoE language model system with broader scientific contributions spanning architecture design, agent-native RL training (Forge), self-evolution capabilities, and frontier-tier performance across multiple benchmark categories. Its impact spans model architecture, training infrastructure, and agentic AI deployment. While LongSeeker introduces a valuable context management paradigm (Context-ReAct) with strong results on search benchmarks, its scope is narrower—focused specifically on context orchestration for search agents. M2's breadth of contributions, scale, and potential to influence multiple research directions gives it higher estimated impact.

vs. Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

gemini-3.1 · 5/27/2026

Paper 2 addresses a critical bottleneck in test-time scaling, a highly timely and rapidly growing area of LLM research. By introducing a novel stochastic backtracking algorithm, it offers a foundational methodological improvement applicable across various models. In contrast, while Paper 1 presents an impressive large-scale MoE and agentic system, its contributions are largely engineering-heavy and system-specific, giving Paper 2 broader potential for widespread algorithmic adoption and scientific impact.

vs. SAM: State-Adaptive Memory for Long-Horizon Reasoning Agent

gpt-5.2 · 5/27/2026

Paper 2 has higher estimated impact due to its broader, end-to-end contribution: a large-scale MoE model family, novel agent-driven data generation with executable/verifiable trajectories, and a scalable agent-native RL/training system (Forge) that likely generalizes across many agentic tasks. Its real-world applicability (coding, search, office workflows), timeliness (agentic RL, efficient inference via sparse activation), and potential to influence both systems and training practice are high. Paper 1 is a strong, focused memory framework but narrower in scope and likely incremental relative to existing retrieval/compression work.

vs. From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

claude-opus-4.6 · 5/27/2026

MiniMax-M2 represents a major engineering and scientific contribution: a frontier-scale MoE model with only 9.8B activated parameters achieving top-tier performance, novel agent-native RL training infrastructure (Forge), and early self-evolution capabilities. Its breadth of impact spans agentic coding, reasoning, and real-world deployment at scale. Paper 2 makes a solid theoretical and empirical contribution on distribution shift in multi-turn dialogue RL, but addresses a narrower problem with more incremental advances. The scale, novelty of the agentic training paradigm, and practical deployment potential of Paper 1 give it substantially higher impact.

vs. When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

gpt-5.2 · 5/27/2026

Paper 2 likely has higher scientific impact due to its larger-scale, end-to-end system contribution (MoE model + agentic data pipeline + scalable RL infrastructure) and broad real-world applicability (agentic coding, search, office tasks). Its claims target frontier-level performance and deployment-relevant training/inference/agent decoupling, potentially influencing both research and industry practice across model architecture, RL systems, and agent evaluation. Paper 1 is more focused and rigorous in diagnosing multi-agent RL stability tradeoffs, but its scope and immediate cross-field impact are narrower.

vs. Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

claude-opus-4.6 · 5/27/2026

Paper 2 introduces a full frontier-tier MoE language model series (MiniMax-M2) with novel contributions in agent-driven data pipelines, scalable agent-native RL training (Forge), and early self-evolution capabilities. Its breadth of impact spans agentic AI, efficient inference (9.8B active parameters from 229.9B total), and multiple application domains. While Paper 1 addresses an important but narrower problem (recovering general capabilities during domain specialization via improved distillation), Paper 2's system-level innovations, frontier performance, and introduction of self-evolving training paradigms represent broader and more transformative contributions to the field.

vs. Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

gpt-5.2 · 5/27/2026

Paper 1 likely has higher scientific impact due to broader, more foundational contributions: a new large MoE model series optimized for low activation, an agent-driven data pipeline with executable/verifiable trajectories, and a scalable RL infrastructure for long-horizon agent training. These advances are general-purpose and can transfer across many domains and downstream systems, aligning with timely interest in agentic LLMs and efficiency. Paper 2 is novel and valuable but more domain-specific (supply chains) with smaller benchmark scope, so its cross-field impact is likely narrower.