MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni

#1181 of 2821 · Artificial Intelligence
Tournament Score: 1428±47
Win Rate: 61% (11 wins, 7 losses, 18 matches)
Rating: 7.5/10

Abstract

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: MINDGAMES

1. Core Contribution

MINDGAMES introduces a multi-game evaluation platform for assessing LLM agents' social and strategic reasoning capabilities across four complementary game environments: Colonel Blotto (opponent modeling), Iterated Prisoner's Dilemma (trust/betrayal with communication), Codenames (cooperative inference under signaling constraints), and Secret Mafia (sustained deception and deduction). The platform is built on TextArena with TrueSkill-based rating, unified agent interfaces, and full trajectory logging. The paper's most distinctive contribution is not just the benchmark itself but the measurement audit that accompanies it—a systematic analysis of when leaderboard rankings reflect genuine strategic ability versus artifacts of error handling and opponent failures.

The paper delivers four concrete artifacts: (1) the benchmark platform, (2) a dataset of 29,571 games (243M tokens) with turn-level trajectories, (3) MG-Ref, an offline tournament protocol against frozen reference agents, and (4) a detailed synthesis of design patterns from 944 submissions across 76 teams.

2. Methodological Rigor

The benchmark design is principled along two axes—information structure and incentive alignment—yielding games that probe meaningfully different reasoning facets. The formal game model (Section 3.4) is carefully specified, and the instantiation for each game environment is precise.

The confound analysis (Section 5.1) is the paper's strongest analytical contribution. The authors identify that Secret Mafia's 50.3% game-level error rate, combined with early terminations (average <3 turns vs. expected 8-12), creates an error-survival confound where top-ranked agents primarily benefit from opponents failing rather than from strategic excellence. The proposed diagnostic—game-level error rate combined with median termination depth as a fraction of expected length—is simple, transferable, and immediately useful for other live-arena evaluations. The transparency about what their own benchmark *cannot* measure is commendable.
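The diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `GameRecord` type, its field names, and the `validity_diagnostic` function are hypothetical, and the sample games are synthetic rather than drawn from the released dataset.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class GameRecord:
    # Hypothetical record layout, assumed for illustration only.
    turns_played: int  # turns completed before the game terminated
    had_error: bool    # True if any agent in the game committed an error


def validity_diagnostic(games, expected_turns):
    """Return (game-level error rate, median termination depth / expected length)."""
    if not games:
        raise ValueError("need at least one game")
    # Share of games in which at least one error occurred.
    error_rate = sum(g.had_error for g in games) / len(games)
    # Median game length expressed as a fraction of the expected length.
    depth_ratio = median(g.turns_played for g in games) / expected_turns
    return error_rate, depth_ratio


# Synthetic example: three of four games contain an error and end early.
games = [
    GameRecord(turns_played=2, had_error=True),
    GameRecord(turns_played=3, had_error=True),
    GameRecord(turns_played=9, had_error=False),
    GameRecord(turns_played=2, had_error=True),
]
error_rate, depth_ratio = validity_diagnostic(games, expected_turns=10)
print(f"error rate = {error_rate:.2f}, depth ratio = {depth_ratio:.2f}")
```

A high error rate together with a low depth ratio is the pattern the review associates with the error-survival confound. Any cutoff for "high" or "low" would be a judgment call; none is stated here.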

However, there are methodological limitations. The TrueSkill rating system, while appropriate for variable-size matches, is sensitive to the composition and size of the agent pool. Stage II had relatively few qualified agents per track, leading to sparse matchmaking that the authors acknowledge. The behavioral diversity analysis (Section 5.2) relies on embedding final responses rather than internal reasoning traces, limiting interpretive depth. The MG-Ref protocol for Secret Mafia uses only 5 active identities—well below the 15-agent threshold the authors themselves identify as needed for stable ratings.

3. Potential Impact

Immediate impact on LLM agent evaluation: The dataset of ~30K games with full trajectories fills a genuine gap. Prior benchmarks either don't release trajectory data or cover single game types. The error-attribution framework (Clean/Caused/Witnessed/Self-Forfeit/Opp-Forfeit) provides a vocabulary for discussing evaluation validity that the field currently lacks.

Training data for multi-agent reasoning: The 243M-token trajectory corpus, with structured metadata linking observations to actions to outcomes, is directly usable for training future agents through SFT or RL, potentially accelerating progress on social reasoning.

Design pattern synthesis: The finding that cognitive scaffolding without paired training *hurts* performance (Section 4.1c) is practically important—multiple independent teams converged on this result. Similarly, the observation that data curation dominates raw volume in multi-agent settings, and that modular perception-reasoning-action pipelines independently emerged across teams, provides actionable guidance.

Broader evaluation methodology: The "validity gradient" concept—where different environments within the same benchmark produce rankings of fundamentally different interpretability—could reshape how the community designs and interprets multi-agent benchmarks. The simple diagnostic (error rate + termination depth) is transferable beyond this specific setting.

4. Timeliness & Relevance

The paper addresses a pressing need. LLMs are being deployed as interactive agents in multi-party settings (customer service, negotiation, collaborative planning), yet evaluation infrastructure has lagged. Static ToM benchmarks have been criticized for measuring pattern matching rather than genuine social reasoning. The interactive, multi-game approach directly responds to these criticisms.

The competition setting (NeurIPS 2025) ensures ecological validity—the submissions represent genuine attempts at building effective multi-agent systems under time and resource constraints, rather than idealized laboratory conditions. The finding that an RL-trained 8B model can outperform prompted GPT-5 in the Generalization track is timely given debates about scaling versus algorithmic innovation.

5. Strengths & Limitations

Key Strengths:

  • Exceptional transparency about evaluation limitations; the measurement audit is as valuable as the benchmark itself
  • Large-scale trajectory dataset with rich metadata enables diverse downstream research
  • The MG-Ref offline protocol enables reproducible evaluation without live infrastructure
  • Cross-cutting analysis of 944 submissions yields generalizable insights about agent architecture (e.g., the scaffolding-without-training failure mode, the importance of data curation)
  • The game selection covers genuinely complementary reasoning demands rather than variations on a theme

Notable Limitations:

  • Secret Mafia, arguably the most interesting environment for ToM evaluation, produced the least interpretable results due to the error-survival confound. The benchmark's most ambitious evaluation target is also its weakest measurement point.
  • The paper cannot make strong claims about whether observed strategic behavior reflects genuine theory of mind versus sophisticated pattern matching—the fundamental question motivating the work remains open.
  • Population effects are significant: rankings depend heavily on which other agents are in the pool. MG-Ref partially addresses this but with acknowledged limitations.
  • The two-track, two-division structure with different matchmaking dynamics makes cross-division comparisons unreliable, limiting some analyses.
  • Role advantage analysis (Section 5.3) reveals Mafia consistently outperforms Villager roles, but whether this reflects inherent game asymmetry versus LLM-specific biases (e.g., RLHF truthfulness bias) is not disentangled.

Overall Assessment: This is a well-executed benchmark paper that makes its strongest contribution not through the benchmark itself but through the honest, detailed analysis of what live multi-agent evaluation can and cannot measure. The combination of a usable platform, a large trajectory dataset, an offline evaluation protocol, and a rigorous measurement audit makes it a significant resource for the field. The insights about agent design patterns from nearly 1,000 submissions add substantial practical value.

    Rating: 7.5/10
    Significance 7.5 · Rigor 7.5 · Novelty 6.5 · Clarity 8

    Generated May 29, 2026

    Comparison History (18)

    vs. Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk
    gemini-3.1 · 5/29/2026

    Paper 1 offers profound foundational contributions by formalizing Nested Contextual Causal Bandits and providing causal PAC-Bayesian excess-risk bounds. Its methodological rigor in addressing safe, certified deployment in multi-timescale sequential decision-making solves critical bottlenecks for real-world AI applications (e.g., healthcare, autonomous systems). While Paper 2 provides a timely and valuable empirical benchmark for LLM agents, Paper 1 introduces fundamentally novel theoretical frameworks and algorithmic guarantees that will likely yield deeper, long-lasting scientific impact across reinforcement learning, causality, and AI safety.

    vs. NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact: it introduces a scalable live multi-agent evaluation arena with sustained interaction, ratings, and large released trajectories from a major competition (944 agents, 29,571 games). This enables broad downstream research on strategic/social reasoning, agent robustness, evaluation validity, and benchmarking protocols (MG-Ref), impacting ML, multi-agent systems, game theory, and safety. Paper 1 is valuable and theory-grounded, but smaller in scale (137 items, Chinese contexts) and narrower in applicability. Paper 2 is more timely for interactive agent deployment and offers stronger methodological infrastructure.

    vs. CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval
    gemini-3.1 · 5/29/2026

    Paper 2 introduces a comprehensive evaluation arena and a large-scale dataset for multi-agent LLM reasoning. In the rapidly evolving LLM field, robust benchmarks and evaluation frameworks (like arenas) tend to have exceptional scientific impact as they establish standard metrics and guide future research directions. While Paper 1 presents a strong, novel methodology for tool retrieval, Paper 2's broad applicability to the fundamental challenge of evaluating strategic and social reasoning gives it a wider potential impact.

    vs. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet
    claude-opus-4.6 · 5/29/2026

    Paper 2 (Scaling Monosemanticity) has significantly higher scientific impact. It addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—and demonstrates that sparse autoencoders can extract meaningful, causally relevant features from frontier LLMs. The discovery of multimodal, multilingual features and safety-relevant features (deception, power-seeking) has broad implications for AI alignment, model understanding, and governance. Paper 1 contributes a useful benchmark and dataset for multi-agent LLM evaluation, but is more incremental in scope. Paper 2 has already catalyzed substantial follow-up research across the interpretability field.

    vs. Quantifying and Optimizing Simplicity via Polynomial Representations
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a novel, theoretically grounded simplicity metric based on polynomial representations that addresses a fundamental open question in deep learning—quantifying and leveraging simplicity bias for generalization. It demonstrates broad applicability across diverse tasks (image/text classification, vision-language models, RL) and provides both a diagnostic tool and a practical regularizer. Paper 1, while valuable as a benchmark and dataset release for multi-agent LLM evaluation, is more narrowly scoped to a specific evaluation paradigm and competition cycle. Paper 2's foundational contribution to understanding generalization has broader and deeper potential impact across machine learning.

    vs. NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs
    claude-opus-4.6 · 5/29/2026

    NaRA addresses a fundamental architectural limitation in adapting PEFT methods to diffusion LLMs, proposing a principled noise-aware adaptation mechanism with theoretical justification and empirical validation across multiple benchmarks. This has broader impact because it introduces a reusable technique applicable to the growing field of diffusion-based language models, with clear methodological novelty (noise-conditioned hypernetwork for LoRA). Paper 1, while valuable as a benchmark/dataset contribution, is more niche—focused on evaluating multi-agent LLM social reasoning through game competitions—and its findings are more observational than methodologically transformative.

    vs. Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale
    claude-opus-4.6 · 5/29/2026

    Paper 2 demonstrates higher potential scientific impact due to its large-scale real-world deployment (57,954 essays, 10,195 students, 120 schools, 2 years), addressing a pressing practical need in K-12 education. It provides actionable insights about human-AI collaboration dynamics (ceiling effects, adaptive collaboration), with broad implications across education, HCI, and AI policy. Paper 1, while valuable as a benchmark contribution, is more narrowly focused on LLM agent evaluation in game settings, with findings (brittle rule adherence, scaffolding dependence) that are somewhat expected. Paper 2's empirical scale and cross-disciplinary relevance give it broader impact potential.

    vs. HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering
    claude-opus-4.6 · 5/29/2026

    MINDGAMES introduces a novel evaluation paradigm for multi-agent LLM reasoning with broader scientific impact. It addresses a fundamental gap in understanding LLM social/strategic reasoning, releases a large dataset (29,571 games), and provides reusable infrastructure. Its findings about brittleness, error-survival confounds, and scaffolding dependence have implications across AI safety, alignment, and agent design. Paper 1, while technically solid, represents an incremental improvement in RAG pipelines for document QA—a more narrowly scoped engineering contribution with improvements limited to specific benchmarks.

    vs. ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact due to broader cross-field relevance (multi-agent evaluation, social reasoning, game theory, benchmarking), timeliness for agentic LLM deployment, and immediate real-world utility via a live competition platform plus a large released dataset and standardized offline protocol. Its contribution can become shared infrastructure for many labs and drive comparable progress. Paper 1 is technically solid and useful for MoE deployment efficiency, but it is narrower (post-training MoE compression) and more incremental relative to existing pruning/merging lines, with impact concentrated in systems/serving contexts.

    vs. Multi-Adapter Representation Interventions via Energy Calibration
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact: it introduces a broadly useful, timely evaluation arena/dataset for multi-agent social/strategic reasoning, an increasingly central deployment setting for LLMs. The platform (live competition, standardized interface, ratings, offline protocol, large released dataset) can become shared infrastructure across labs, enabling reproducible benchmarking and follow-on research across RL, agent architectures, evaluation science, and alignment. Paper 1 is a solid, innovative alignment method, but its impact is narrower (representation interventions) and more likely to be superseded by subsequent techniques, whereas evaluation infrastructure tends to have longer-lasting, field-wide influence.

    vs. Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a novel, comprehensive evaluation platform (MINDGAMES) for multi-agent LLM reasoning with a large-scale dataset (29,571 games), addressing a timely gap in understanding LLM social/strategic reasoning. It has broader impact across AI safety, multi-agent systems, and cognitive science. Paper 1 is a solid benchmarking study but addresses a narrower question (positional encoding for EEG transformers) with incremental findings (no universal best strategy). Paper 2's released dataset, competition infrastructure, and insights into LLM limitations have greater potential to influence multiple research communities.

    vs. The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact: it introduces a scalable, reusable evaluation platform and large public dataset for multi-agent social/strategic reasoning, addressing a timely gap as agentic LLM deployments grow. The live competition, standardized interface, TrueSkill ratings, trajectory logging, and offline tournament protocol (MG-Ref) enable broad community adoption and follow-on research across ML, NLP, multi-agent systems, and AI evaluation. Paper 1 offers a valuable, more specialized analysis of masked diffusion decoding/training misalignment for reasoning, but its applicability is narrower and less infrastructure-building.

    vs. MolLingo: Molecule-Native Representations for LLM-Powered Scientific Agents
    claude-opus-4.6 · 5/29/2026

    MolLingo demonstrates higher scientific impact potential due to its novel approach combining multi-agent LLM coordination with chemically meaningful molecular representations (BFE), addressing a critical real-world application in drug discovery. It shows strong quantitative improvements across four benchmarks, including a fourfold docking score improvement over GPT-5.4. The work bridges AI and chemistry/biology with immediate practical implications for therapeutic design. While MINDGAMES contributes a valuable evaluation platform for multi-agent social reasoning, it is primarily a benchmarking contribution with findings that highlight limitations rather than solutions, and its impact is more narrowly focused on LLM evaluation methodology.

    vs. Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching
    gemini-3.1 · 5/29/2026

    Paper 1 introduces a large-scale, novel evaluation framework and dataset for multi-agent LLM reasoning, addressing a critical gap in assessing 'theory of mind' and strategic capabilities. Its extensive empirical foundation (944 agents, ~30k games) and open-sourced offline protocol provide a foundational benchmark likely to drive significant future research, whereas Paper 2 focuses on an applied combination of existing techniques for production efficiency.

    vs. Towards Faithful Agentic XAI: A Verification Method and an Open-World Benchmark for Better Model Faithfulness
    claude-opus-4.6 · 5/29/2026

    Paper 2 introduces a large-scale, multi-game evaluation platform for multi-agent LLM reasoning with a substantial dataset (29,571 games) and community engagement (76 teams, 944 agents). It addresses the timely and broad challenge of evaluating social/strategic reasoning in LLMs, with broader applicability across AI safety, multi-agent systems, and cognitive science. Paper 1 addresses the important but narrower problem of faithfulness in agentic XAI. While methodologically sound, its impact is more specialized. Paper 2's benchmark, dataset release, and competition framework have greater potential to catalyze follow-up research across multiple fields.

    vs. SKILLC: Learning Autonomous Skill Internalization in LLM Agents via Contrastive Credit Assignment
    gpt-5.2 · 5/29/2026

    Paper 2 likely has higher impact due to its broad, timely evaluation infrastructure for multi-agent social/strategic reasoning, a critical gap for real-world LLM deployment. It delivers a large public dataset, a standardized arena with ratings, and an offline reproducible protocol (MG-Ref), enabling widespread benchmarking across research groups and subfields (LLM agents, game theory, safety/alignment, evaluation). Paper 1 is methodologically innovative and useful for skill internalization in agentic RL, but its applicability is narrower and impact more specialized.

    vs. ReasonOps: Operator Segmentation for LLM Reasoning Traces
    gemini-3.1 · 5/29/2026

    ReasonOps tackles a critical and highly timely problem: understanding the opaque, long reasoning traces of modern 'thinking' LLMs. Its unsupervised discovery of universal reasoning operators provides a novel interpretability framework with broad applicability. The demonstrated downstream tasks, such as early correctness prediction and model fingerprinting, offer significant advancements for LLM evaluation, safety, and efficiency, giving it deeper methodological impact than the multi-agent benchmarking approach of Paper 2.

    vs. Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison
    gemini-3.1 · 5/29/2026

    Paper 2 presents a large-scale, live arena for evaluating multi-agent LLMs with a massive dataset of almost 30,000 games and participation from nearly a thousand agents. Its focus on social and strategic reasoning (Theory of Mind) addresses a critical and highly active area of AI research. While Paper 1 offers a valuable diagnostic tool for personal AI memory, Paper 2's broader scope, competitive framework, and comprehensive analysis of multi-agent dynamics provide broader applications and higher potential impact across the AI community.