MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni
Abstract
Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.
AI Impact Assessments
Scientific Impact Assessment: MINDGAMES
1. Core Contribution
MINDGAMES introduces a multi-game evaluation platform for assessing LLM agents' social and strategic reasoning capabilities across four complementary game environments: Colonel Blotto (opponent modeling), Iterated Prisoner's Dilemma (trust/betrayal with communication), Codenames (cooperative inference under signaling constraints), and Secret Mafia (sustained deception and deduction). The platform is built on TextArena with TrueSkill-based rating, unified agent interfaces, and full trajectory logging. The paper's most distinctive contribution is not just the benchmark itself but the measurement audit that accompanies it—a systematic analysis of when leaderboard rankings reflect genuine strategic ability versus artifacts of error handling and opponent failures.
The paper delivers four concrete artifacts: (1) the benchmark platform, (2) a dataset of 29,571 games (243M tokens) with turn-level trajectories, (3) MG-Ref, an offline tournament protocol against frozen reference agents, and (4) a detailed synthesis of design patterns from 944 submissions across 76 teams.
2. Methodological Rigor
The benchmark design is principled along two axes—information structure and incentive alignment—yielding games that probe meaningfully different reasoning facets. The formal game model (Section 3.4) is carefully specified, and the instantiation for each game environment is precise.
The confound analysis (Section 5.1) is the paper's strongest analytical contribution. The authors identify that Secret Mafia's 50.3% game-level error rate, combined with early terminations (average <3 turns vs. expected 8-12), creates an error-survival confound where top-ranked agents primarily benefit from opponents failing rather than from strategic excellence. The proposed diagnostic—game-level error rate combined with median termination depth as a fraction of expected length—is simple, transferable, and immediately useful for other live-arena evaluations. The transparency about what their own benchmark *cannot* measure is commendable.
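The diagnostic described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `validity_diagnostic`, the `(had_error, turns_played)` record shape, and the sample games are all hypothetical, and the expected length of 10 turns is simply an assumed midpoint of the 8-12 range quoted above. No decision thresholds are included because the review does not state any.

```python
from statistics import median


def validity_diagnostic(games, expected_turns):
    """Return (game-level error rate, median depth / expected length).

    `games` is a list of (had_error, turns_played) pairs. This record
    shape is an assumption made for illustration only.
    """
    if not games:
        raise ValueError("need at least one game")
    if expected_turns <= 0:
        raise ValueError("expected_turns must be positive")

    # Share of games in which at least one agent error was recorded.
    error_rate = sum(1 for had_error, _ in games if had_error) / len(games)

    # Median number of turns reached, relative to a full-length game.
    depth_fraction = median(turns for _, turns in games) / expected_turns
    return error_rate, depth_fraction


# Hypothetical sample: two games cut short by errors, two clean games.
sample_games = [(True, 2), (True, 3), (False, 10), (False, 9)]
error_rate, depth_fraction = validity_diagnostic(sample_games, expected_turns=10)
print(f"error rate = {error_rate:.2f}, depth fraction = {depth_fraction:.2f}")
```

A high error rate together with a low depth fraction is the pattern the review associates with the error-survival confound. The median is used here because the diagnostic is stated in terms of median termination depth, whereas the Secret Mafia figure quoted above is an average.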
However, there are methodological limitations. The TrueSkill rating system, while appropriate for variable-size matches, is sensitive to the composition and size of the agent pool. Stage II had relatively few qualified agents per track, leading to sparse matchmaking that the authors acknowledge. The behavioral diversity analysis (Section 5.2) relies on embedding final responses rather than internal reasoning traces, limiting interpretive depth. The MG-Ref protocol for Secret Mafia uses only 5 active identities—well below the 15-agent threshold the authors themselves identify as needed for stable ratings.
3. Potential Impact
Immediate impact on LLM agent evaluation: The dataset of ~30K games with full trajectories fills a genuine gap. Prior benchmarks either don't release trajectory data or cover single game types. The error-attribution framework (Clean/Caused/Witnessed/Self-Forfeit/Opp-Forfeit) provides a vocabulary for discussing evaluation validity that the field currently lacks.
Training data for multi-agent reasoning: The 243M-token trajectory corpus, with structured metadata linking observations to actions to outcomes, is directly usable for training future agents through SFT or RL, potentially accelerating progress on social reasoning.
Design pattern synthesis: The finding that cognitive scaffolding without paired training *hurts* performance (Section 4.1c) is practically important—multiple independent teams converged on this result. Similarly, the observation that data curation dominates raw volume in multi-agent settings, and that modular perception-reasoning-action pipelines independently emerged across teams, provides actionable guidance.
Broader evaluation methodology: The "validity gradient" concept—where different environments within the same benchmark produce rankings of fundamentally different interpretability—could reshape how the community designs and interprets multi-agent benchmarks. The simple diagnostic (error rate + termination depth) is transferable beyond this specific setting.
4. Timeliness & Relevance
The paper addresses a pressing need. LLMs are being deployed as interactive agents in multi-party settings (customer service, negotiation, collaborative planning), yet evaluation infrastructure has lagged. Static ToM benchmarks have been criticized for measuring pattern matching rather than genuine social reasoning. The interactive, multi-game approach directly responds to these criticisms.
The competition setting (NeurIPS 2025) ensures ecological validity—the submissions represent genuine attempts at building effective multi-agent systems under time and resource constraints, rather than idealized laboratory conditions. The finding that an RL-trained 8B model can outperform prompted GPT-5 in the Generalization track is timely given debates about scaling versus algorithmic innovation.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment: This is a well-executed benchmark paper that makes its strongest contribution not through the benchmark itself but through the honest, detailed analysis of what live multi-agent evaluation can and cannot measure. The combination of a usable platform, a large trajectory dataset, an offline evaluation protocol, and a rigorous measurement audit makes it a significant resource for the field. The insights about agent design patterns from nearly 1,000 submissions add substantial practical value.
Generated May 29, 2026
Comparison History (18)
Paper 1 offers profound foundational contributions by formalizing Nested Contextual Causal Bandits and providing causal PAC-Bayesian excess-risk bounds. Its methodological rigor in addressing safe, certified deployment in multi-timescale sequential decision-making solves critical bottlenecks for real-world AI applications (e.g., healthcare, autonomous systems). While Paper 2 provides a timely and valuable empirical benchmark for LLM agents, Paper 1 introduces fundamentally novel theoretical frameworks and algorithmic guarantees that will likely yield deeper, long-lasting scientific impact across reinforcement learning, causality, and AI safety.
Paper 2 likely has higher impact: it introduces a scalable live multi-agent evaluation arena with sustained interaction, ratings, and large released trajectories from a major competition (944 agents, 29,571 games). This enables broad downstream research on strategic/social reasoning, agent robustness, evaluation validity, and benchmarking protocols (MG-Ref), impacting ML, multi-agent systems, game theory, and safety. Paper 1 is valuable and theory-grounded, but smaller in scale (137 items, Chinese contexts) and narrower in applicability. Paper 2 is more timely for interactive agent deployment and offers stronger methodological infrastructure.
Paper 2 introduces a comprehensive evaluation arena and a large-scale dataset for multi-agent LLM reasoning. In the rapidly evolving LLM field, robust benchmarks and evaluation frameworks (like arenas) tend to have exceptional scientific impact as they establish standard metrics and guide future research directions. While Paper 1 presents a strong, novel methodology for tool retrieval, Paper 2's broad applicability to the fundamental challenge of evaluating strategic and social reasoning gives it a wider potential impact.
Paper 2 (Scaling Monosemanticity) has significantly higher scientific impact. It addresses a fundamental question in AI safety and mechanistic interpretability—whether dictionary learning scales to production models—and demonstrates that sparse autoencoders can extract meaningful, causally relevant features from frontier LLMs. The discovery of multimodal, multilingual features and safety-relevant features (deception, power-seeking) has broad implications for AI alignment, model understanding, and governance. Paper 1 contributes a useful benchmark and dataset for multi-agent LLM evaluation, but is more incremental in scope. Paper 2 has already catalyzed substantial follow-up research across the interpretability field.
Paper 2 introduces a novel, theoretically grounded simplicity metric based on polynomial representations that addresses a fundamental open question in deep learning—quantifying and leveraging simplicity bias for generalization. It demonstrates broad applicability across diverse tasks (image/text classification, vision-language models, RL) and provides both a diagnostic tool and a practical regularizer. Paper 1, while valuable as a benchmark and dataset release for multi-agent LLM evaluation, is more narrowly scoped to a specific evaluation paradigm and competition cycle. Paper 2's foundational contribution to understanding generalization has broader and deeper potential impact across machine learning.
NaRA addresses a fundamental architectural limitation in adapting PEFT methods to diffusion LLMs, proposing a principled noise-aware adaptation mechanism with theoretical justification and empirical validation across multiple benchmarks. This has broader impact because it introduces a reusable technique applicable to the growing field of diffusion-based language models, with clear methodological novelty (noise-conditioned hypernetwork for LoRA). Paper 1, while valuable as a benchmark/dataset contribution, is more niche—focused on evaluating multi-agent LLM social reasoning through game competitions—and its findings are more observational than methodologically transformative.
Paper 2 demonstrates higher potential scientific impact due to its large-scale real-world deployment (57,954 essays, 10,195 students, 120 schools, 2 years), addressing a pressing practical need in K-12 education. It provides actionable insights about human-AI collaboration dynamics (ceiling effects, adaptive collaboration), with broad implications across education, HCI, and AI policy. Paper 1, while valuable as a benchmark contribution, is more narrowly focused on LLM agent evaluation in game settings, with findings (brittle rule adherence, scaffolding dependence) that are somewhat expected. Paper 2's empirical scale and cross-disciplinary relevance give it broader impact potential.
MINDGAMES introduces a novel evaluation paradigm for multi-agent LLM reasoning with broader scientific impact. It addresses a fundamental gap in understanding LLM social/strategic reasoning, releases a large dataset (29,571 games), and provides reusable infrastructure. Its findings about brittleness, error-survival confounds, and scaffolding dependence have implications across AI safety, alignment, and agent design. Paper 1, while technically solid, represents an incremental improvement in RAG pipelines for document QA—a more narrowly scoped engineering contribution with improvements limited to specific benchmarks.
Paper 2 likely has higher impact due to broader cross-field relevance (multi-agent evaluation, social reasoning, game theory, benchmarking), timeliness for agentic LLM deployment, and immediate real-world utility via a live competition platform plus a large released dataset and standardized offline protocol. Its contribution can become shared infrastructure for many labs and drive comparable progress. Paper 1 is technically solid and useful for MoE deployment efficiency, but it is narrower (post-training MoE compression) and more incremental relative to existing pruning/merging lines, with impact concentrated in systems/serving contexts.
Paper 2 likely has higher impact: it introduces a broadly useful, timely evaluation arena/dataset for multi-agent social/strategic reasoning, an increasingly central deployment setting for LLMs. The platform (live competition, standardized interface, ratings, offline protocol, large released dataset) can become shared infrastructure across labs, enabling reproducible benchmarking and follow-on research across RL, agent architectures, evaluation science, and alignment. Paper 1 is a solid, innovative alignment method, but its impact is narrower (representation interventions) and more likely to be superseded by subsequent techniques, whereas evaluation infrastructure tends to have longer-lasting, field-wide influence.
Paper 2 introduces a novel, comprehensive evaluation platform (MINDGAMES) for multi-agent LLM reasoning with a large-scale dataset (29,571 games), addressing a timely gap in understanding LLM social/strategic reasoning. It has broader impact across AI safety, multi-agent systems, and cognitive science. Paper 1 is a solid benchmarking study but addresses a narrower question (positional encoding for EEG transformers) with incremental findings (no universal best strategy). Paper 2's released dataset, competition infrastructure, and insights into LLM limitations have greater potential to influence multiple research communities.
Paper 2 likely has higher impact: it introduces a scalable, reusable evaluation platform and large public dataset for multi-agent social/strategic reasoning, addressing a timely gap as agentic LLM deployments grow. The live competition, standardized interface, TrueSkill ratings, trajectory logging, and offline tournament protocol (MG-Ref) enable broad community adoption and follow-on research across ML, NLP, multi-agent systems, and AI evaluation. Paper 1 offers a valuable, more specialized analysis of masked diffusion decoding/training misalignment for reasoning, but its applicability is narrower and less infrastructure-building.
MolLingo demonstrates higher scientific impact potential due to its novel approach combining multi-agent LLM coordination with chemically meaningful molecular representations (BFE), addressing a critical real-world application in drug discovery. It shows strong quantitative improvements across four benchmarks, including a fourfold docking score improvement over GPT-5.4. The work bridges AI and chemistry/biology with immediate practical implications for therapeutic design. While MINDGAMES contributes a valuable evaluation platform for multi-agent social reasoning, it is primarily a benchmarking contribution with findings that highlight limitations rather than solutions, and its impact is more narrowly focused on LLM evaluation methodology.
Paper 1 introduces a large-scale, novel evaluation framework and dataset for multi-agent LLM reasoning, addressing a critical gap in assessing 'theory of mind' and strategic capabilities. Its extensive empirical foundation (944 agents, ~30k games) and open-sourced offline protocol provide a foundational benchmark likely to drive significant future research, whereas Paper 2 focuses on an applied combination of existing techniques for production efficiency.
Paper 2 introduces a large-scale, multi-game evaluation platform for multi-agent LLM reasoning with a substantial dataset (29,571 games) and community engagement (76 teams, 944 agents). It addresses the timely and broad challenge of evaluating social/strategic reasoning in LLMs, with broader applicability across AI safety, multi-agent systems, and cognitive science. Paper 1 addresses the important but narrower problem of faithfulness in agentic XAI. While methodologically sound, its impact is more specialized. Paper 2's benchmark, dataset release, and competition framework have greater potential to catalyze follow-up research across multiple fields.
Paper 2 likely has higher impact due to its broad, timely evaluation infrastructure for multi-agent social/strategic reasoning, a critical gap for real-world LLM deployment. It delivers a large public dataset, a standardized arena with ratings, and an offline reproducible protocol (MG-Ref), enabling widespread benchmarking across research groups and subfields (LLM agents, game theory, safety/alignment, evaluation). Paper 1 is methodologically innovative and useful for skill internalization in agentic RL, but its applicability is narrower and impact more specialized.
ReasonOps tackles a critical and highly timely problem: understanding the opaque, long reasoning traces of modern 'thinking' LLMs. Its unsupervised discovery of universal reasoning operators provides a novel interpretability framework with broad applicability. The demonstrated downstream tasks, such as early correctness prediction and model fingerprinting, offer significant advancements for LLM evaluation, safety, and efficiency, giving it deeper methodological impact than the multi-agent benchmarking approach of Paper 2.
Paper 2 presents a large-scale, live arena for evaluating multi-agent LLMs with a massive dataset of almost 30,000 games and participation from nearly a thousand agents. Its focus on social and strategic reasoning (Theory of Mind) addresses a critical and highly active area of AI research. While Paper 1 offers a valuable diagnostic tool for personal AI memory, Paper 2's wider scope, competitive framework, and comprehensive analysis of multi-agent dynamics give it more applications and higher potential impact across the AI community.