Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners

Botos Csaba, Sreejan Kumar, Austin Tudor David Andrews, Laurence Hunt, Chris Summerfield, Joshua B. Tenenbaum, Rui Ponte Costa, Marcelo G. Mattar

May 8, 2026

arXiv:2605.08019v1

cs.AI (primary), q-bio.NC

#113 of 2292 · Artificial Intelligence

Tournament Score: 1539 ± 46 (scale 1050 to 1800)

Win Rate: 94%

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 8 · Clarity 8.5

Abstract

Humans rapidly learn abstract knowledge when encountering novel environments and flexibly deploy this knowledge to guide efficient and intelligent action. Can modern AI systems learn and plan in a similar way? We study this question using a dataset of complex human gameplay with concurrent fMRI recordings, in which participants learn novel video games that require rule discovery, hypothesis revision, and multi-step planning. We jointly evaluate models by their ability to play the games, match human learning behavior, and predict brain activity during the same task, comparing a suite of frontier Large Reasoning Models (LRMs) against model-free and model-based deep reinforcement learning agents and a Bayesian theory-based agent. We find that frontier LRMs most closely match human behavioral patterns during game discovery and predict brain activity an order of magnitude better than both reinforcement learning alternatives across cortical and subcortical regions, with effects robust to permutation controls. Through targeted manipulations, we further show that brain alignment reflects the model's in-context representation of the game state rather than its downstream planning or reasoning. Our results establish LRMs as compelling computational accounts of human learning and decision making in complex, naturalistic environments. Project page with interactive replays: https://botcs.github.io/reason-to-play/

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Reason to Play: Behavioral and Brain Alignment Between Frontier LRMs and Human Game Learners"

1. Core Contribution

This paper provides the first joint behavioral and neural evaluation of frontier Large Reasoning Models (LRMs) as computational accounts of human learning during interactive gameplay. Using the VGDL-fMRI dataset—where humans learn novel video games requiring rule discovery, hypothesis revision, and multi-step planning while undergoing fMRI—the authors compare eight LRMs (Qwen 3.5 family, DeepSeek family) against model-free RL (DDQN), model-based RL (EfficientZero), and a Bayesian theory-based agent (EMPA). The key findings are: (i) LRMs match human discovery efficiency (5–11× closer than deep RL baselines in Earth Mover's Distance), (ii) LRM representations predict BOLD activity an order of magnitude better than all RL alternatives across cortical and subcortical regions, and (iii) targeted ablations show the brain alignment reflects in-context game state representations rather than planning or reasoning processes.

2. Methodological Rigor

The methodology is thorough and well-controlled. The multi-turn dialogue paradigm for LRM evaluation represents a meaningful advance over prior single-step prompting approaches. The behavioral evaluation uses appropriate metrics (discovery EMD, capability progression, Kaplan-Meier survival), and the brain encoding pipeline employs banded ridge regression with nuisance regressors (motor, game identity, temporal variables), cross-validated across level partitions.
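To make the "discovery EMD" metric named above concrete, here is a minimal sketch of a one-dimensional Earth Mover's (Wasserstein-1) distance between two empirical samples. It assumes the compared quantity is a scalar per episode, such as the number of steps needed to discover a rule. The function name `emd_1d` and the toy step counts are hypothetical illustrations, not the authors' pipeline or data.

```python
from bisect import bisect_right

def emd_1d(a, b):
    """1-D Earth Mover's (Wasserstein-1) distance between two empirical samples.

    Computed as the area between the two empirical CDFs.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a_sorted, b_sorted = sorted(a), sorted(b)
    # Every point at which either empirical CDF can jump.
    grid = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_a = bisect_right(a_sorted, left) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, left) / len(b_sorted)
        # Both CDFs are constant on [left, right), so the area is a rectangle.
        total += abs(cdf_a - cdf_b) * (right - left)
    return total

# Hypothetical steps-to-discovery samples (illustrative only).
human_steps = [3, 5, 8, 12]
model_steps = [4, 6, 9, 13]
rl_steps = [40, 55, 70, 90]

print(emd_1d(human_steps, model_steps))  # small distance: similar distributions
print(emd_1d(human_steps, rl_steps))     # large distance: very different distributions
```

A smaller value means the model's discovery distribution sits closer to the human one, which is the sense in which the assessment reports LRMs as "closer" than the deep RL baselines.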

The control analyses are comprehensive: random-initialization controls (6.4–6.9× gap, confirming learned representations matter), temporal shuffle controls (within-episode, level, and game), prompt ablations (minimal vs. elaborate vs. oracle), context-length truncation, and replication with plain ridge regression. The sign-flip permutation tests for whole-brain maps add statistical rigor. The authors also transparently document seven bugs in the prior DDQN codebase and methodological improvements, which strengthens confidence in the baseline comparisons.
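The sign-flip permutation test mentioned above can be sketched as follows for a single voxel or region. The sketch assumes one paired difference in encoding accuracy per participant (for example, LRM minus RL baseline) and a one-sided test of whether the mean difference exceeds zero. The function name, the number of permutations, and the example values are hypothetical, and the authors' whole-brain procedure may differ in detail, for instance in how it corrects for multiple comparisons.

```python
import random

def sign_flip_pvalue(diffs, n_perm=10000, seed=0):
    """One-sided sign-flip permutation p-value for mean(diffs) > 0.

    Under the null hypothesis the per-participant differences are symmetric
    around zero, so randomly flipping their signs yields a null distribution
    for the mean.
    """
    if not diffs:
        raise ValueError("diffs must be non-empty")
    rng = random.Random(seed)
    n = len(diffs)
    observed = sum(diffs) / n
    exceed = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if flipped >= observed:
            exceed += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (exceed + 1) / (n_perm + 1)

# Hypothetical per-participant differences in encoding r (illustrative only).
consistent_gain = [0.02, 0.03, 0.01, 0.04, 0.02, 0.05, 0.03, 0.02, 0.01, 0.03]
no_effect = [0.02, -0.02, 0.01, -0.01, 0.03, -0.03, 0.02, -0.02, 0.01, -0.01]

print(sign_flip_pvalue(consistent_gain))  # small p-value
print(sign_flip_pvalue(no_effect))        # large p-value
```

The test makes no assumption about the shape of the difference distribution beyond symmetry under the null, which is one reason it is a common choice for group-level fMRI maps.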

However, several limitations temper the strength of claims. The noise ceiling problem is acknowledged but unresolved—every human trajectory is unique (diverging within 4–5 steps), making it impossible to estimate per-voxel reliability ceilings. This means the absolute magnitude of encoding improvements cannot be contextualized against an upper bound. The r-values are low (0.01–0.10), and while the authors argue these are comparable to prior language-model encoding work, the comparison to Kumar et al. (2024) on story listening is imperfect given the vastly different paradigms. The EMPA comparison uses only the HRR representation of its theory-building component, not its full planning/exploration architecture, creating an asymmetry.

3. Potential Impact

Cognitive science and neuroscience: This paper opens a new front in using LRMs as process-level models of human cognition during interactive learning—extending the "DNN as brain model" paradigm beyond sensory perception into interactive cognition. The finding that text-trained models predict visual cortex activity better than RL agents trained on game grids is provocative and will stimulate debate about the nature of representational alignment.

AI evaluation: The behavioral analysis provides a template for human-referenced evaluation of LRM agents in interactive settings, complementing existing benchmarks like ARC-AGI. The discovery–execution gap finding (LRMs perseverate on winning trajectories) reveals a specific failure mode that could guide future post-training.

Mechanistic interpretability: The paper argues that brain activity provides an external reference signal for model representations—complementary to internal interpretability methods. This framing could influence how the interpretability community thinks about validation.

Dataset contribution: Over 100,000 reasoning traces from eight models across twelve games constitute a rich resource for studying in-context problem-solving dynamics.

4. Timeliness & Relevance

The paper is exceptionally timely. LRMs (o1, DeepSeek-R1, Qwen3.5) are the current frontier of AI development, and there is intense interest in whether chain-of-thought reasoning constitutes genuine planning or sophisticated pattern matching. The neuroscience community has been seeking models that capture interactive cognition beyond passive perception—this paper directly addresses that gap. The recent work by Paugam et al. (2025) showing that from-scratch trained DNNs produce brittle brain encoding on this exact dataset creates a natural "negative result" that this paper supersedes.

5. Strengths & Limitations

Key Strengths:

Joint behavioral + neural evaluation on matched tasks is rare and scientifically powerful—it constrains model selection beyond what either alone could achieve.

Breadth of comparison: Eight LRMs spanning multiple families, scales, and architectures, plus three baseline paradigms, evaluated on twelve games and 32 participants.

Ablation discipline: The targeted manipulations (reasoning removal, prompt variation, context truncation, temporal shuffling) systematically isolate what drives the encoding signal.

Transparency: Bug documentation, open release of traces and code, interactive supplementary materials.

The "representation without reasoning" finding is conceptually important—it suggests LRM-brain alignment captures shared representational structure rather than shared cognitive algorithms.

Key Limitations:

Uncontrollable LRM priors: LRMs may have seen game-related content during pre-training. The authors acknowledge this but cannot control for it.

No noise ceiling: Cross-regional differences in absolute encoding accuracy are largely uninterpretable without reliability estimates.

Passive encoding only: The brain encoding extracts features from passive observation, not active play. The paper explicitly flags this as capturing "representational alignment" not "process alignment," but this significantly limits the scope of the claim that LRMs are "compelling computational accounts of human learning."

Scale-performance dissociation unexplained: Brain encoding doesn't scale with model size (Qwen3.5-35B-A3B > much larger models), and behavioral performance doesn't predict encoding accuracy well. This undermines a clean narrative.

Sample size: n=21 for brain encoding (one cohort) is modest, though not unusual for fMRI.

No fine-grained cognitive attribution: The games confound multiple cognitive demands, preventing component-level claims about which aspects of cognition drive alignment.

Summary

This is a high-quality, well-executed study that establishes LRMs as the current best computational proxies for human behavior and brain representations during interactive learning. The joint behavioral-neural framework is novel and the controls are thorough. The main limitation is that the brain encoding captures representational similarity rather than shared cognitive processes, which the authors are transparent about. The work will likely catalyze a new line of research using LRMs as cognitive models in interactive paradigms.

Rating: 7.8 / 10

Significance 8 · Rigor 7.5 · Novelty 8 · Clarity 8.5

Generated May 11, 2026

Comparison History (24)

vs. Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

gpt-5.2 · 5/19/2026

Paper 2 has higher likely impact due to its cross-disciplinary contribution: it introduces a rare joint benchmark linking model performance, human learning behavior, and concurrent fMRI, and reports large, controlled gains in brain-activity predictability over RL baselines. This can influence cognitive neuroscience, computational psychiatry, AI evaluation, and model design. Its applications span mechanistic understanding of learning and principled human-alignment metrics. Paper 1 is timely and practically valuable for multimodal safety, but its impact is more confined to safety engineering for MLLMs and relies on representation interventions whose generality may vary across architectures and modalities.

vs. LongSeeker: Elastic Context Orchestration for Long-Horizon Search Agents

gemini-3.1 · 5/16/2026

Paper 2 bridges AI, cognitive science, and neuroscience by aligning Large Reasoning Models with human fMRI data and learning behaviors. This interdisciplinary approach addresses fundamental questions about artificial and human intelligence, offering broader scientific implications than Paper 1, which primarily focuses on an algorithmic optimization for AI search agents. Paper 2's novel methodology and implications for understanding cognitive representations give it a higher potential for widespread scientific impact across multiple fields.

vs. Process Matters more than Output for Distinguishing Humans from Machines

gpt-5.2 · 5/16/2026

Paper 1 has higher likely scientific impact due to a more novel cross-disciplinary contribution: jointly benchmarking frontier LRMs on real human learning behavior, game performance, and fMRI brain-prediction in complex, naturalistic tasks, with strong quantitative gains and mechanistic ablations (state representation vs planning). This tightly connects AI evaluation with cognitive neuroscience and may influence both model development and theories of human learning. Paper 2 is timely and useful for human–machine discrimination and evaluation, but its impact is more application/policy-facing and may be narrower scientifically, with process features potentially task- and dataset-specific.

vs. KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation

gemini-3.1 · 5/16/2026

Paper 1 bridges artificial intelligence, cognitive science, and neuroscience by demonstrating that Large Reasoning Models align with human brain activity and behavioral patterns during complex learning tasks. This interdisciplinary approach offers profound foundational insights into both human cognition and AI architectures, providing broader and deeper scientific impact than Paper 2, which primarily contributes a practical benchmark for mobile agent evaluation within the narrower intersection of AI and HCI.

vs. Case-Specific Rubrics for Clinical AI Evaluation: Methodology, Validation, and LLM-Clinician Agreement Across 823 Encounters

gpt-5.2 · 5/16/2026

Paper 2 has higher potential scientific impact due to its novelty and breadth: it introduces a joint benchmark spanning task performance, human behavioral learning dynamics, and fMRI brain-prediction in complex, naturalistic games, and finds strong advantages for frontier LRMs over RL and Bayesian baselines with robustness controls. This connects AI evaluation to cognitive neuroscience, offering a new computational account of human learning and decision-making and a widely reusable paradigm. Paper 1 is methodologically valuable and highly applicable to clinical AI evaluation, but its impact is more domain-specific and incremental.

vs. How Far Are Large Multimodal Models from Human-Level Spatial Action? A Benchmark for Goal-Oriented Embodied Navigation in Urban Airspace

claude-opus-4.6 · 5/16/2026

Paper 2 bridges AI and cognitive neuroscience by demonstrating that Large Reasoning Models align with human brain activity (fMRI) and behavior during complex game learning. This cross-disciplinary contribution—linking frontier AI models to neural data—offers novel insights into both AI capabilities and human cognition, with broad implications for computational neuroscience, cognitive science, and AI alignment. Paper 1, while rigorous in benchmarking LMMs for embodied navigation, is more incremental as a benchmark/evaluation paper. Paper 2's finding that LRM representations predict brain activity an order of magnitude better than RL alternatives is a striking result likely to generate significant interdisciplinary interest.

vs. Data-driven Circuit Discovery for Interpretability of Language Models

claude-opus-4.6 · 5/16/2026

Paper 1 presents a novel cross-disciplinary study bridging AI and cognitive neuroscience, demonstrating that frontier LRMs align with human brain activity and behavior during complex game learning. This establishes LRMs as computational models of human cognition—a high-impact finding with broad implications for both AI and neuroscience. The methodology is rigorous (fMRI data, multiple model comparisons, permutation controls, targeted manipulations). Paper 2 makes a solid contribution to mechanistic interpretability by identifying limitations of existing circuit discovery and proposing DCD, but its scope is narrower, focused on improving an existing methodology within ML interpretability rather than opening fundamentally new cross-disciplinary connections.

vs. SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context

gemini-3.1 · 5/16/2026

Paper 2 bridges AI, cognitive science, and neuroscience by demonstrating that Large Reasoning Models align with human brain activity and behavior during complex tasks. This interdisciplinary contribution offers profound insights into both artificial and biological intelligence, promising broader scientific impact than Paper 1, which, while valuable, proposes a domain-specific engineering framework for software engineering agents.

vs. OpenMobile: Building Open Mobile Agents with Task and Trajectory Synthesis

gpt-5.2 · 5/16/2026

Paper 2 likely has higher scientific impact due to its open-source release of data/code and a scalable framework for task+trajectory synthesis that can be directly adopted and extended by many groups, accelerating progress in mobile agents. Its strong benchmark results on widely used, timely evaluations (AndroidWorld and others), plus analyses addressing data overlap/overfitting, increase credibility and practical relevance. Paper 1 is novel and cross-disciplinary (LLMs–cognition–neuroscience) but depends on specialized fMRI datasets and its impact may be narrower and harder to translate into broad, reusable infrastructure.

vs. LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

gemini-3.1 · 5/11/2026

Paper 1 bridges AI and cognitive neuroscience by demonstrating that frontier Large Reasoning Models align with human brain activity and behavior during complex learning. This interdisciplinary approach offers fundamental scientific insights into both human cognition and AI representations, promising a broader and more profound scientific impact across multiple fields compared to Paper 2's application-specific methodological improvements for lightweight GUI agents.

vs. Can Agents Price a Reaction? Evaluating LLMs on Chemical Cost Reasoning

claude-opus-4.6 · 5/11/2026

Paper 2 bridges AI and cognitive neuroscience by demonstrating that Large Reasoning Models align with human brain activity and behavioral patterns during complex learning tasks. This interdisciplinary contribution—linking frontier AI models to fMRI data and human cognition—has broader impact across AI, neuroscience, and cognitive science. The finding that LRMs predict brain activity better than RL alternatives is novel and opens new research directions in understanding both human cognition and AI representations. Paper 1, while rigorous and practical, addresses a narrower chemistry benchmarking problem with more incremental contributions.

vs. Multi-Environment POMDPs with Finite-Horizon Objectives

claude-opus-4.6 · 5/11/2026

Paper 1 presents a highly novel interdisciplinary study bridging AI and cognitive neuroscience, demonstrating that Large Reasoning Models align with human brain activity and behavior during complex game learning. This finding has broad implications for understanding human cognition, AI alignment, and computational neuroscience. The combination of fMRI data, behavioral analysis, and frontier AI models is timely and innovative. Paper 2, while technically solid in establishing complexity results and a practical algorithm for MEPOMDPs, addresses a narrower theoretical problem with more limited cross-disciplinary impact and audience.

vs. AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance

gemini-3.1 · 5/11/2026

Paper 2 bridges AI, cognitive science, and neuroscience by linking frontier Large Reasoning Models to human brain activity and learning behaviors. Its highly interdisciplinary approach and novel insights into AI-human cognitive alignment offer broader potential impact than Paper 1, which, while highly valuable for AI safety standardization, functions primarily as a meta-analysis and catalogue of existing benchmarks.

vs. Efficient Data Selection for Multimodal Models via Incremental Optimization Utility

gemini-3.1 · 5/11/2026

Paper 2 bridges AI, neuroscience, and cognitive science by mapping LRM representations to human fMRI data. This interdisciplinary approach offers profound theoretical insights into both artificial and biological intelligence, having a broader scientific impact than Paper 1, which provides a highly practical but more narrowly focused methodological improvement for model training efficiency.

vs. Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

claude-opus-4.6 · 5/11/2026

Paper 1 presents a novel interdisciplinary contribution connecting frontier Large Reasoning Models to human cognition using fMRI data, establishing LRMs as computational accounts of human learning and decision-making. This bridges AI and cognitive neuroscience in a timely and impactful way, with broad implications for understanding both human intelligence and AI alignment with human cognition. Paper 2, while solid, addresses a more narrow robotics/navigation problem with incremental methodological improvements. Paper 1's novelty, breadth of impact across AI, neuroscience, and cognitive science, and timeliness given the rise of LRMs give it substantially higher potential impact.

vs. The Limits of AI-Driven Allocation: Optimal Screening under Aleatoric Uncertainty

gemini-3.1 · 5/11/2026

Paper 2 bridges AI, cognitive science, and neuroscience by mapping frontier Large Reasoning Models (LRMs) directly to human fMRI data and behavioral learning. Establishing LRMs as plausible computational accounts of human decision-making in complex environments is highly novel and broadly relevant to multiple rapidly growing fields. While Paper 1 offers a valuable operational framework for public policy and resource allocation, Paper 2 addresses fundamental scientific questions regarding the nature of artificial versus human intelligence, giving it higher potential for cross-disciplinary scientific impact and citations.

vs. From Storage to Experience: A Survey on the Evolution of LLM Agent Memory Mechanisms

gpt-5.2 · 5/11/2026

Paper 1 has higher potential impact due to its novel empirical alignment of frontier LRMs with both human behavior and fMRI signals in complex, naturalistic game learning, alongside strong comparative baselines and robustness controls. It contributes actionable insights for AI and computational neuroscience (e.g., in-context state representation driving brain alignment) and provides a dataset/task framework likely to be reused broadly. Paper 2 is a conceptual survey with an organizing framework; useful and timely, but typically lower impact than a rigorous, data-driven result that can change modeling practice and enable follow-on experiments.

vs. MPD$^2$-Router: Mask-aware Multi-expert Prior-regularized Dual-head Deferral Router in Glaucoma Screening and Diagnosis

gpt-5.2 · 5/11/2026

Paper 1 likely has higher scientific impact due to its novelty and breadth: it links frontier Large Reasoning Models to human learning using a rare combination of complex gameplay, behavior, and concurrent fMRI, and reports large gains in brain-activity prediction with mechanistic manipulations. This positions LRMs as computational models of cognition, with cross-field relevance (AI, neuroscience, psychology, RL) and strong timeliness. Paper 2 is methodologically solid and clinically relevant, but it is more incremental (advanced routing/deferral in a specific medical domain) with narrower cross-disciplinary reach.

vs. Tacit Knowledge Extraction via Logic Augmented Generation and Active Inference

claude-opus-4.6 · 5/11/2026

Paper 2 demonstrates a striking convergence between frontier LRMs and human cognition using rigorous neuroscience methodology (fMRI), behavioral alignment, and brain predictivity metrics. It bridges AI and cognitive neuroscience in a novel way, showing LRMs as computational models of human learning—a finding with broad implications for both AI development and understanding human cognition. Paper 1 addresses an important but narrower knowledge engineering problem. Paper 2's interdisciplinary scope, methodological rigor (permutation controls, targeted manipulations), and timeliness regarding LRM capabilities give it substantially broader impact potential.

vs. Inference Time Causal Probing in LLMs

gpt-5.2 · 5/11/2026

Paper 1 likely has higher scientific impact due to its cross-disciplinary novelty (linking frontier LRMs to human learning behavior and fMRI brain activity in complex, naturalistic tasks), strong real-world relevance to cognitive neuroscience and AI evaluation, and broad applicability as a benchmark for “human-aligned” learning/planning. Its methodological contribution—joint behavioral, task-performance, and neural predictivity with robustness controls and mechanistic manipulations—supports rigor and could influence both neuroscience modeling and AI model assessment. Paper 2 is useful and timely for interpretability/control, but its impact is narrower and more incremental within existing causal editing/probing lines.