GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala

May 22, 2026

arXiv:2605.23238v1

cs.AI (primary) · cs.GT · cs.LG · cs.MA

#684 of 2682 · Artificial Intelligence

Tournament Score: 1461 ± 43 (scale 1050 to 1800)

Win Rate: 58%

Rating: 7.2 / 10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 7.8

Abstract

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GENSTRAT

Core Contribution

GENSTRAT introduces a procedurally generated benchmark for evaluating LLM strategic reasoning in two-player zero-sum imperfect-information card games ("generalized betting games" or GBGs). The key innovations are threefold: (1) a parameterized game generator that can produce fresh games on demand, addressing contamination and saturation concerns; (2) a six-axis capability-profile decomposition (state space, temporal depth, information sensitivity, opponent modeling, risk, brittleness) that characterizes *how* models are strong rather than just *how much*; and (3) a "jaggedness" measure quantifying local volatility of performance across strategically similar games. The paper evaluates nine frontier LLMs across 50 benchmark games in a 36,000+ match tournament.

Methodological Rigor

The statistical methodology is notably thorough. The additive paired-comparison estimator with sum-to-zero constraints is well-motivated for zero-sum games. The paper employs paired-cluster bootstrapping throughout, with appropriate cluster definitions. Multiple robustness checks — leave-one-game-out, llama-excluded refits, axis-space partition refits, and a CFR+ solver baseline (Spearman ρ=0.95) — strengthen confidence in the leaderboard ordering.
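As an illustration of what an additive paired-comparison fit with a sum-to-zero constraint can look like, the sketch below assumes each match yields a payoff margin modeled as the difference of two model strengths. This form, the function name `fit_strengths`, and the toy data are assumptions made for illustration; the paper's exact estimator is not reproduced in this review.

```python
import numpy as np

def fit_strengths(matches, n_models):
    """Least-squares strengths under the ASSUMED model margin = alpha_i - alpha_j.

    matches: list of (i, j, margin), where margin is model i's payoff minus
    model j's payoff in one zero-sum match.
    """
    X = np.zeros((len(matches), n_models))
    y = np.zeros(len(matches))
    for row, (i, j, margin) in enumerate(matches):
        X[row, i] = 1.0   # model i contributes +alpha_i
        X[row, j] = -1.0  # model j contributes -alpha_j
        y[row] = margin
    # Strengths are identified only up to an additive constant, so the
    # solution is centered to enforce sum(alpha) == 0.
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha - alpha.mean()

# Hypothetical toy data: three models with noiseless margins.
matches = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 2.0)]
alpha = fit_strengths(matches, 3)
```

The centering step is what makes the constraint natural for zero-sum play: only strength differences affect predicted margins, so fixing the mean at zero removes the one free degree of freedom.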

The six complexity axes are grounded in Monte Carlo simulation with explicit formulas (Appendix C), and variance inflation factors confirm acceptable collinearity (all VIF < 5). The capability-profile regression applies Benjamini-Hochberg correction across 54 hypothesis tests, showing appropriate multiple-testing awareness.
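For readers unfamiliar with the Benjamini-Hochberg step-up procedure mentioned above, a minimal sketch follows. The function name and the example p-values are illustrative only and do not come from the paper.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    """Return a boolean mask of hypotheses rejected at false-discovery rate q."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)                      # indices of p-values, ascending
    thresholds = q * np.arange(1, n + 1) / n   # BH threshold for each rank
    below = p[order] <= thresholds
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank passing its threshold
        reject[order[: k + 1]] = True          # reject everything up to that rank
    return reject

# Hypothetical p-values, not taken from the paper.
mask_a = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
mask_b = benjamini_hochberg([0.01, 0.02, 0.03, 0.04])
```

With 54 tests, as in the capability-profile regression, the smallest p-value would need to fall below 0.05/54 to pass on its own, while larger ones can still be rejected if enough smaller ones precede them.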

However, several methodological concerns warrant attention. The game acceptance rate (~1/6 of candidates) and the restriction to games with ≤10 average moves per player constrain the space substantially. The jaggedness measure J_m does not subtract the fitted capability-profile surface, so smooth gradients contribute alongside genuine local volatility; the paper acknowledges this but does not resolve it. The six axes, while conceptually distinct, carry mechanical correlations (state-space × information-sensitivity r=0.65), and the paper acknowledges these are "complementary rather than fully orthogonal." The per-game strength estimates α̂_{m,g} are used as primitives for downstream analyses but inherit estimation noise that propagates through the capability profiles and jaggedness measures in ways that aren't fully deconvolved.
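The concern about smooth gradients can be made concrete with a toy example. The sketch below uses a hypothetical nearest-neighbor volatility score, which is not the paper's jaggedness definition, to show that a perfectly smooth linear strength surface still scores above zero unless the fitted trend is subtracted first. All names and data are assumptions for illustration.

```python
import numpy as np

def local_volatility(axes, strength, k=3):
    """HYPOTHETICAL score: mean absolute strength gap to the k nearest games in axis space."""
    axes = np.asarray(axes, dtype=float)
    strength = np.asarray(strength, dtype=float)
    dists = np.linalg.norm(axes[:, None, :] - axes[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)             # a game is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]
    gaps = np.abs(strength[:, None] - strength[neighbors])
    return gaps.mean()

def detrend(axes, strength):
    """Subtract an ordinary least-squares linear surface fitted over the axes."""
    axes = np.asarray(axes, dtype=float)
    X = np.column_stack([np.ones(len(axes)), axes])
    coef, *_ = np.linalg.lstsq(X, strength, rcond=None)
    return strength - X @ coef

# Synthetic data: 50 games on six axes, strength exactly linear in the axes.
rng = np.random.default_rng(0)
axes = rng.uniform(size=(50, 6))
strength = axes @ np.array([1.0, -0.5, 0.3, 0.0, 0.2, -0.1])

raw = local_volatility(axes, strength)                    # positive despite smoothness
flat = local_volatility(axes, detrend(axes, strength))    # numerically zero
```

In this toy setting the raw score reflects only the gradient, so a model with a steep but predictable profile would look volatile; the detrended score isolates what the surface does not explain.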

Potential Impact

The paper addresses a genuine gap in LLM evaluation. As LLMs are deployed in economic agent roles (pricing, auctions, negotiations), understanding their strategic reasoning beyond canonical games is practically important. The procedural generation approach offers several concrete benefits:

Contamination resistance: Training on the 50-game benchmark doesn't exhaust the distribution — fresh games can always be drawn. This is a meaningful advance over fixed benchmarks.

Deployment-relevant diagnostics: The capability profile + jaggedness pairing provides actionable information. The finding that gpt-5 and claude are strong but jagged while gemini-3.1-pro is strong and smooth has direct deployment implications — a deployer wanting predictable performance across novel games would favor gemini-3.1-pro despite comparable overall scores.

Scalability: The complexity dial and the amplification effect in multi-agent settings mean the benchmark can track frontier improvements without redesign.

The framework could influence adjacent fields: mechanism design (how do LLM agents behave in novel auction formats?), multi-agent systems, and AI safety (understanding where strategic reasoning breaks down unpredictably).

Timeliness & Relevance

This work is well-timed. LLM-as-economic-agent deployment is accelerating (Anthropic's Project Vend/Deal, algorithmic collusion studies), and existing benchmarks (GTBench, GameBench, AvalonBench) use fixed canonical games that are increasingly susceptible to contamination and saturation. The procedural generation approach directly addresses this bottleneck. The inclusion of the latest frontier models (gpt-5, claude-sonnet-4-6, gemini-3.1-pro) makes the results immediately relevant.

Strengths

1. Well-designed evaluation framework: The combination of procedural generation, multi-axis decomposition, and jaggedness creates a richer evaluation than any single metric. The finding that the variance decomposition ratio σ²_MG/σ²_M rises to 1.29 when excluding the outlier llama model demonstrates that model×game interaction is substantial — justifying the entire capability-profile enterprise.

2. Substantive empirical findings: The paper reveals that models with near-identical aggregate strength have qualitatively different profiles. Claude concentrates its advantage on brittleness alone; gemini-3.1-pro gains broadly across axes. This is a non-trivial finding that single-number leaderboards miss entirely.

3. Rank-stability analysis: The per-game rank-stability test (Appendix P) showing that reversal significance concentrates on easy games (12/16 easiest games at q<0.05, 0/17 hardest) is an insightful finding about when aggregate rankings can be trusted.

4. Thinking-budget ablation: The paired design (identical deals, same opponent) isolates the reasoning-budget effect cleanly.
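The variance ratio cited in strength 1 compares model-by-game interaction variance with model main-effect variance. A naive plug-in version can be sketched as below, assuming a complete models-by-games matrix of per-game strength estimates. The paper's actual variance-component estimator is not described in this review, and this version ignores estimation noise, which would inflate the interaction term.

```python
import numpy as np

def interaction_to_main_ratio(strength):
    """Naive plug-in ratio of interaction variance to model main-effect variance.

    strength: 2-D array, rows = models, columns = games (ASSUMED layout).
    """
    s = np.asarray(strength, dtype=float)
    grand = s.mean()
    model_effect = s.mean(axis=1) - grand      # how strong each model is overall
    game_effect = s.mean(axis=0) - grand       # per-game offset shared by all models
    interaction = s - grand - model_effect[:, None] - game_effect[None, :]
    var_m = np.mean(model_effect ** 2)
    var_mg = np.mean(interaction ** 2)
    return var_mg / var_m

# Hypothetical 2x2 example: main effects and interactions of equal size.
ratio = interaction_to_main_ratio([[3.0, -1.0], [-1.0, -1.0]])
# Purely additive example: no interaction at all.
additive = interaction_to_main_ratio([[2.0, 0.0], [0.0, -2.0]])
```

A ratio above 1, such as the 1.29 reported with llama excluded, means which game is being played shifts a model's advantage by more than the typical gap between models.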

Limitations

1. Narrow game class: Two-player zero-sum imperfect-information card games are a restricted slice of strategic environments. Cooperative games, n-player settings, non-betting domains, and repeated interactions are absent. The paper's external validity claims should be read cautiously.

2. No non-LLM baselines at scale: The CFR+ solver baseline covers only 5 seeds. Without systematic comparison to game-theoretic solvers across the full 50-game benchmark, it's hard to calibrate absolute competence levels.

3. Axis validation: The six axes are measured via Monte Carlo simulation of random agents, but whether they capture the dimensions that matter most for *LLM* strategic reasoning (as opposed to game-theoretic difficulty in general) isn't validated independently.

4. Sample size concerns: The benchmark contains only 50 games, and the paper itself describes the axis-conditional analyses as directionally robust but not definitive. The thinking ablation has some families with only 50-80 matched pairs.

5. Reproducibility: Provider snapshot dependence and non-reproducibility of proprietary model outputs limit exact replication, though the game generator is deterministic.

Overall Assessment

GENSTRAT represents a well-executed methodological contribution to LLM evaluation. Its procedural generation approach, multi-axis decomposition, and jaggedness measure together offer a substantially richer evaluation framework than existing fixed-game benchmarks. The empirical findings — particularly the divergent capability profiles among top-tier models and the concentration of rank instability on easy games — are genuinely informative. The main limitations are the restricted game class and the gap between the demonstrated evaluation framework and the broader strategic deployment scenarios motivating the work.

Rating: 7.2 / 10

Significance 7 · Rigor 7.5 · Novelty 7.5 · Clarity 7.8

Generated May 25, 2026

Comparison History (26)

vs. Learning When to Optimize: Verified Optimization Skills from Expert GPU-Kernel Lineages

gemini-3.1 · 5/28/2026

Paper 1 addresses a critical and broad challenge in AI: evaluating the strategic reasoning of LLM agents dynamically to prevent benchmark contamination and saturation. Its intersection of AI, game theory, and economic agent modeling offers a wider breadth of impact and tackles a more foundational problem in AI safety and evaluation than Paper 2, which focuses on the narrower, albeit valuable, domain of GPU kernel optimization.

vs. Energy Shields for Fairness

claude-opus-4.6 · 5/26/2026

Paper 1 introduces a fundamentally novel concept—energy shields for runtime fairness—that bridges physics-inspired control theory with algorithmic fairness, providing formal safety and liveness guarantees. This is a stronger methodological contribution with broad applicability across any sequential decision-making system requiring fairness. Paper 2, while timely and well-executed, is primarily a benchmarking framework for LLM strategic reasoning—useful but more incremental. Paper 1's formal guarantees, synthesis procedure, and novel conceptual framework have greater potential to influence multiple research communities (fairness, formal methods, control theory).

vs. PALoRA: Projection-Adaptive LoRA for Preserving Reasoning in Large Language Models

gemini-3.1 · 5/26/2026

Paper 1 introduces a fundamental methodological improvement to LLM fine-tuning by addressing the critical plasticity-stability dilemma. By enabling factual knowledge injection without catastrophic forgetting of reasoning skills, PALoRA offers immediate, broad utility across all domains utilizing fine-tuned LLMs. While Paper 2 presents an innovative and necessary evaluation framework for strategic reasoning, its impact is more narrowly focused on benchmarking LLMs as economic agents. Paper 1's algorithmic advancement provides a core capability enhancement with wider real-world applications and broader impact across the AI community.

vs. AI Cartography: Mapping the Latent Landscape of AI Benchmark Ecosystems

gpt-5.2 · 5/26/2026

Paper 2 has higher potential scientific impact because it introduces a broadly applicable measurement framework (CFA + Generalizability Theory) for quantifying noise, dependence, and reliability across benchmark ecosystems, using a very large dataset (4,000+ models). Its results directly affect how the community interprets leaderboards, builds benchmarks, and estimates scaling—issues spanning ML evaluation, psychometrics, and policy. Paper 1 is novel and useful for strategic-reasoning evaluation, but its scope is narrower (procedurally generated zero-sum card games) and less immediately general across the broader benchmarking landscape.

vs. Uncertainty Reasoning with Large Language Models for Explainable Disease Diagnosis

claude-opus-4.6 · 5/26/2026

Paper 2 addresses a critical real-world problem—explainable and verifiable medical diagnosis—combining LLMs with formal logic in a neuro-symbolic framework. Its impact spans AI, healthcare, and clinical decision support, with direct patient safety implications. The interpretability and auditability features address key barriers to deploying AI in medicine. Paper 1, while methodologically interesting for benchmarking LLM strategic reasoning, serves a narrower community (AI evaluation/game theory) and its procedurally generated card games, though clever, have less immediate real-world applicability compared to clinical diagnosis systems.

vs. DART: Semantic Recoverability for Structured Tool Agents

gemini-3.1 · 5/25/2026

Paper 2 addresses a fundamental and widespread bottleneck in LLM research: benchmark saturation and contamination. By introducing procedurally generated environments for strategic reasoning and novel evaluation metrics like 'jaggedness', it offers broad, long-lasting implications for AI evaluation, economics, and game theory. While Paper 1 provides a valuable systems-level solution for agent reliability, Paper 2's methodological innovations in evaluating complex, strategic model behavior give it a higher potential for broad scientific impact.

vs. Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

gemini-3.1 · 5/25/2026

Paper 2 (GENSTRAT) addresses a critical crisis in AI: benchmark saturation and data contamination. By introducing procedurally generated strategic environments, it provides an evergreen evaluation method for LLMs. Its focus on economic/strategic reasoning, paired with novel metrics like jaggedness and capability profiling, has broad implications for AI safety, game theory, and real-world deployment in high-stakes marketplaces. While Paper 1 offers a valuable algorithmic improvement for agent skill management, Paper 2 provides a foundational methodological shift for evaluating and understanding the strategic behavior of frontier AI systems across multiple disciplines.

vs. Mediative Fuzzy Logic: From Type-1 Foundations to Type-2, Type-3 and Quantum Extensions

gpt-5.2 · 5/25/2026

Paper 2 has higher likely impact due to its timeliness (LLM deployment safety/economics), clear and reusable methodological contribution (procedural game generator, capability profiles, jaggedness metric), and immediate applicability for evaluation and governance. It offers a scalable, contamination-resistant benchmark paradigm with broad relevance across ML evaluation, AI safety, computational economics, and game theory. Paper 1 is conceptually novel and mathematically rich, but its extensions (type-3/quantum) may be harder to validate empirically and likely have narrower near-term adoption outside fuzzy-logic communities.

vs. Causal Probing for Internal Visual Representations in Multimodal Large Language Models

gemini-3.1 · 5/25/2026

Paper 2 provides foundational insights into the internal mechanisms and scaling laws of Multimodal LLMs through causal probing. Its findings on concept encoding and the perception-reasoning disconnect offer deep contributions to AI interpretability, potentially guiding future architecture design. While Paper 1 introduces a highly innovative evaluation benchmark for strategic reasoning, Paper 2's mechanistic approach uncovers fundamental structural properties of models, likely driving broader and more profound impact across the fields of AI alignment and development.

vs. When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

gpt-5.2 · 5/25/2026

Paper 1 is more likely to have higher impact: it introduces an evergreen, procedurally generated benchmark distribution plus new diagnostics (capability profiles and jaggedness) that address benchmark saturation/contamination and provide deployment-relevant evaluation of strategic reasoning. The methodology is broadly applicable across AI safety, econ/market design, and evaluation science, and includes substantial empirical rigor (large game pool, many head-to-head matches, multiple frontier/open models). Paper 2 targets an important failure mode in multi-agent planning, but appears narrower in scope and offers a more incremental workflow with moderate gains and less clearly generalizable evaluation infrastructure.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gpt-5.2 · 5/25/2026

Paper 2 likely has higher impact: it introduces a general, procedurally generated, contamination-resistant benchmark for strategic reasoning with a principled capability-profile decomposition and a new stability/jaggedness metric. This framework is broadly applicable to evaluating LLMs as agents across economics, game theory, security, and multi-agent AI, and is timely given deployment in markets and negotiations. Paper 1 identifies an important inverse-scaling failure mode in forecasting and improves evaluation practice, but its domain scope is narrower (time-series/tail risk) and the main contribution is diagnostic rather than a widely extensible evaluation paradigm.

vs. Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

gpt-5.2 · 5/25/2026

Paper 1 has higher likely impact: it introduces an evergreen, contamination-resistant evaluation paradigm for strategic reasoning via procedurally generated imperfect-information games plus interpretable diagnostics (capability profiles and jaggedness) that can generalize across messy real deployments of LLM agents. This is timely as LLMs enter economic/strategic settings and offers a reusable benchmark infrastructure with broad relevance to AI evaluation, agentic safety, and economics. Paper 2 advances multimodal knowledge editing robustness, but is more specialized, method-dependent, and its real-world adoption hinges on editing-stack compatibility and semantic-validity assumptions for latent adversaries.

vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability

claude-opus-4.6 · 5/25/2026

Paper 1 addresses a fundamental methodological gap in VLM explainability by identifying evaluation collapse in cross-modal settings and proposing a principled metric (Synergistic Faithfulness) grounded in game-theoretic concepts. It offers broad impact across XAI, multimodal AI, and AI safety, with rigorous evaluation across multiple architectures and methods. Paper 2 contributes a useful benchmark for LLM strategic reasoning but is more narrowly scoped to game-theoretic evaluation. Paper 1's novel metric framework, identification of a systematic flaw in current evaluation paradigms, and applicability to high-stakes AI auditing give it greater potential for widespread methodological adoption.

vs. Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

gpt-5.2 · 5/25/2026

Paper 1 likely has higher scientific impact: it introduces an evergreen, procedurally generated benchmark distribution plus new diagnostic constructs (capability profiles across multiple strategic axes and a “jaggedness” stability measure) for evaluating LLM strategic behavior in deployment-relevant, variable environments. This is a broadly useful methodological contribution for AI evaluation, safety, and agentic economics, with potential to become a standard stress-test framework resistant to saturation/contamination. Paper 2 is a solid, practical inference-time control improvement for ReAct agents, but is more incremental and narrower in scope.

vs. Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

gpt-5.2 · 5/25/2026

Paper 2 likely has higher scientific impact: it introduces a novel agentic loop (IDS) that jointly synthesizes code and mechanized proofs, demonstrably closing a major capability gap (2/7 to 7/7) on formally verified distributed systems with strong efficiency gains. The real-world applications (verified infrastructure, safety-critical software) are immediate and broad across PL, formal methods, distributed systems, and AI. If the evaluation is rigorous and reproducible, the results suggest a step-change. Paper 1 is timely and useful for LLM evaluation, but its impact is more incremental and narrower to benchmarking.

vs. EVE-Agent: Evidence-Verifiable Self-Evolving Agents

claude-opus-4.6 · 5/25/2026

EVE-Agent addresses a fundamental and broadly applicable challenge in self-evolving AI agents—ensuring evidence verifiability in self-generated training data. This has wide implications for trustworthy AI, retrieval-augmented generation, and scalable agent training without human supervision. Paper 2 (GENSTRAT) contributes a valuable benchmark for strategic reasoning in LLMs, but benchmarks tend to have narrower and more transient impact. EVE-Agent's principle that self-evolving agents must justify their training data introduces a methodological contribution with lasting influence across multiple agent paradigms, making it more impactful overall.

vs. ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

gemini-3.1 · 5/25/2026

Paper 1 addresses a highly critical and broad challenge in AI: evaluating the strategic reasoning of LLMs as autonomous economic agents. Its introduction of procedurally generated, evergreen benchmarks and novel metrics like 'jaggedness' offers a highly innovative, contamination-resistant evaluation paradigm. While Paper 2 presents valuable advancements in automated theorem proving, its scope is more specialized. Paper 1's methodology has wider implications across AI safety, game theory, and economics, giving it a broader potential scientific impact.

vs. CLORE: Content-Level Optimization for Reasoning Efficiency

claude-opus-4.6 · 5/25/2026

GENSTRAT introduces a novel evaluation paradigm for LLM strategic reasoning with procedurally generated games, capability profiling, and a jaggedness metric—contributions that are methodologically innovative and broadly applicable across AI safety, economics, and multi-agent systems. Its evergreen, contamination-resistant benchmarking addresses a fundamental limitation of static benchmarks. CLORE offers a solid but more incremental contribution—optimizing reasoning trace efficiency via content-level editing within an established RL post-training pipeline. While useful, it targets a narrower problem (math reasoning efficiency) with less cross-disciplinary impact potential.

vs. SMDD-Bench: Can LLMs Solve Real-World Small Molecule Drug Design Tasks?

claude-opus-4.6 · 5/25/2026

SMDD-Bench addresses a critical real-world application (drug design) with direct translational potential, combining LLM evaluation with computational chemistry across 502 task instances and 102 protein targets. Its impact spans AI, chemistry, and pharmaceutical sciences. While GENSTRAT offers a novel procedural approach to evaluating strategic reasoning with strong methodological contributions, its scope is narrower (game theory/strategic reasoning). SMDD-Bench's practical relevance to drug discovery, broader interdisciplinary appeal, and potential to accelerate pharmaceutical research give it higher estimated scientific impact.

vs. SkillOpt: Executive Strategy for Self-Evolving Agent Skills

claude-opus-4.6 · 5/25/2026

SkillOpt introduces a novel and practically impactful framework—treating agent skills as trainable text-space parameters with optimizer discipline (learning rate, validation, edit buffers). It demonstrates broad empirical impact across 52 evaluation cells, multiple models, and execution harnesses, with large accuracy gains (+19-25 points). Its transferability across models and environments enhances real-world applicability. While GENSTRAT contributes a useful benchmarking methodology for strategic reasoning, SkillOpt addresses a more fundamental problem in AI agent development with a generalizable optimization paradigm that could influence how agent skills are developed across the field.