Monte Carlo Pass Search: Using Trajectory Generation for 3D Counterfactual Pass Evaluation in Football

Andrew Kang, Priya Narasimhan

Jun 9, 2026 · arXiv:2606.11120v1

cs.AI · cs.CV

#2968 of 3489 · Artificial Intelligence

Tournament Score: 1290 ± 44

Win Rate: 42%

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 5.5 · Clarity 7

Abstract

We recast pass evaluation in football (soccer) as a Monte Carlo Tree Search (MCTS)-like evaluation problem whose components mostly exist in the literature under different names: a value model (possession value), a world model (multi-agent trajectories with ball interactions), and a policy over counterfactual actions (sampling pass variants with noise). Building on the first public high-fidelity tracking dataset with 3D ball trajectories from the Bundesliga, we introduce Monte Carlo Pass Search (MCPS), which infers kick parameters for each observed pass, samples execution variants and option variants, rolls each candidate forward with a ball-conditioned world model until the next ball interaction, and scores outcomes with a learned value model to obtain a distribution over gained value. This distribution enables distribution-aware attribution with two complementary execution-surplus scores used for analysis and ranking: mean-based and percentile-based scores. To make the world model sample-efficient under limited public data, we adapt a discrete-token, autoregressive trajectory generator from autonomous driving (SMART) and show it yields strong best-of-20 forecasting accuracy compared to baselines, while supporting fully hypothetical rollouts for downstream evaluation. We have released model checkpoints and code.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Monte Carlo Pass Search (MCPS)

1. Core Contribution

The paper introduces Monte Carlo Pass Search (MCPS), a framework that recasts pass evaluation in football as a Monte Carlo Tree Search-like problem. The key idea is to evaluate a pass not as a point estimate of value, but as a *distribution* over counterfactual outcomes. The framework has four main components: (1) inference of physical kick parameters from observed passes, (2) sampling of local (execution noise) and global (alternative option) counterfactual pass variants, (3) a learned world model (adapted from autonomous driving) that rolls out multi-agent trajectories conditioned on sampled ball flights, and (4) a possession value model that scores terminal states. The dual local/global search produces distribution-aware metrics that disentangle execution quality from decision quality.
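To make the sample, roll out, score loop concrete, here is a minimal sketch in Python. It is a toy illustration, not the authors' implementation: `simulate_landing`, `toy_value`, `sample_local_variants`, the drag-free ballistic model, the pitch coordinates, and the noise magnitudes are all hypothetical stand-ins for the paper's physics simulator, SMART-based world model, and learned possession value model. It covers only local (execution-noise) variants and omits player rollouts and global option variants.

```python
import math
import random

G = 9.81  # gravitational acceleration, m/s^2

def simulate_landing(x, y, speed, azimuth, elevation):
    """Drag-free ballistic flight: return the (x, y) point where the ball lands."""
    flight_time = 2.0 * speed * math.sin(elevation) / G
    ground_speed = speed * math.cos(elevation)
    return (x + ground_speed * math.cos(azimuth) * flight_time,
            y + ground_speed * math.sin(azimuth) * flight_time)

def toy_value(point, goal=(105.0, 34.0)):
    """Hypothetical stand-in for a value model: closer to the goal scores higher."""
    return 1.0 / (1.0 + math.dist(point, goal))

def sample_local_variants(kick, n, rng, speed_sd=0.5, angle_sd=0.03):
    """Perturb the inferred kick (speed, azimuth, elevation) with Gaussian noise.
    The noise scales are assumptions chosen for illustration only."""
    speed, azimuth, elevation = kick
    return [(max(0.1, rng.gauss(speed, speed_sd)),
             rng.gauss(azimuth, angle_sd),
             min(math.pi / 2, max(0.01, rng.gauss(elevation, angle_sd))))
            for _ in range(n)]

def gained_value_distribution(origin, kick, n=200, seed=0):
    """Score each sampled variant and return the list of value gains."""
    rng = random.Random(seed)
    start_value = toy_value(origin)
    gains = []
    for variant in sample_local_variants(kick, n, rng):
        end_point = simulate_landing(*origin, *variant)
        gains.append(toy_value(end_point) - start_value)
    return gains

# A forward pass from midfield: 20 m/s, straight ahead, 0.4 rad elevation.
gains = gained_value_distribution(origin=(50.0, 34.0), kick=(20.0, 0.0, 0.4))
```

The output is a list of gains rather than a single number, which is the point of the framing: downstream scores are computed from the whole distribution.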

The conceptual contribution is well-articulated: the authors explicitly frame pass evaluation as a planning problem with policy, world model, and value model components, connecting football analytics to the reinforcement learning and world-model literature. This framing, while not technically novel in isolation (the authors acknowledge each component exists separately), is synthesized in a coherent and operationally useful way.

2. Methodological Rigor

The methodology is reasonable but has significant limitations that the authors partially acknowledge:

Strengths in methodology:

The adaptation of SMART (a discrete-token autoregressive model from autonomous driving) to football is creative and shows good best-of-20 trajectory forecasting results (minADE 2.4, minFDE 4.7), outperforming baselines including Sports-Traj.

The CEM-style kick parameter inference with a physics-based ball-flight simulator is a principled approach.

The dual local/global search design provides a clean conceptual separation between execution and decision evaluation.
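For readers unfamiliar with the best-of-K metrics quoted above, the following sketch shows the common definition of minADE and minFDE for a single agent. The function name `min_ade_fde` and the toy trajectories are hypothetical, and the paper's exact aggregation (per agent or per scene, and the units of the reported 2.4 and 4.7) is not stated in this assessment, so treat this as an assumption about the convention rather than the authors' evaluation code.

```python
import math

def min_ade_fde(samples, truth):
    """samples: K candidate trajectories, each a list of (x, y) points.
    truth: the observed trajectory, same length as each candidate.
    Returns (minADE, minFDE), each minimized independently over the K samples."""
    ades, fdes = [], []
    for traj in samples:
        errors = [math.dist(p, q) for p, q in zip(traj, truth)]
        ades.append(sum(errors) / len(errors))  # average displacement error
        fdes.append(errors[-1])                 # final displacement error
    return min(ades), min(fdes)

truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
samples = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],  # constant offset of 1 unit
    [(0.0, 2.0), (1.0, 2.0), (2.0, 0.5)],  # worse on average, better at the end
]
min_ade, min_fde = min_ade_fde(samples, truth)
# min_ade is 1.0 (first sample); min_fde is 0.5 (second sample)
```

Because the minimum is taken over samples, these metrics reward coverage of plausible futures and say little about how well calibrated the full sample set is, which is relevant to the calibration concern raised under the weaknesses.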

Weaknesses in methodology:

The dataset is extremely small: only 7 matches (5/1/1 train/val/test split), which severely limits the reliability of all learned components. The possession value (PV) model (Table 4) actually *underperforms* a ball-only baseline on shot AUROC (0.73 vs. 0.78). The authors explain this by suggesting the model "has learned some additional context", but that explanation is speculative and undermines confidence in the scoring function that is central to the entire framework.

The Player-to-Touch module shows substantially lower accuracy than existing baselines (Top-1 Acc: 0.605 vs. 0.679/0.899), which the authors attribute to a deliberately harder setting. While this argument has some merit, the cascading errors through the pipeline (trajectory → touch prediction → ball-at-touch → value) are not rigorously quantified.

Only 512 of the total passes survived the kick-parameter fitting quality filter, further limiting the evaluation scope.

There is no end-to-end validation of the MCPS framework itself — no ground-truth "pass quality" labels, no comparison to expert assessments, and no calibration analysis of the output distributions.

3. Potential Impact

Practical applications: The framework could be valuable for coaching, scouting, and recruitment in professional football by offering richer pass evaluation than existing point-estimate metrics. The visualization tools (opportunity/sensitivity views) are directly applicable to coaching workflows.

Broader methodological impact: The connection between MCTS-style evaluation and sports analytics is potentially influential — it provides a template for applying world-model-based reasoning to other sports decisions (shots, dribbles, set pieces). The adaptation of autonomous driving trajectory models to sports is a useful cross-domain transfer that others may follow.

Reproducibility: The release of code and model checkpoints is commendable and addresses a major pain point in football analytics, where proprietary data and models dominate. This alone could catalyze follow-up work.

However, the impact is constrained by the small data regime and the lack of convincing downstream validation. Without evidence that MCPS rankings correlate with expert judgment or predictive outcomes, the framework remains a conceptual demonstration rather than a validated tool.

4. Timeliness & Relevance

The paper is well-timed, leveraging the first public high-fidelity tracking dataset with 3D ball trajectories (Bassek et al., 2025). It addresses a genuine bottleneck: the proprietary nature of football analytics has limited reproducible research. The connection to generative AI and world models is topical. The cross-pollination from autonomous driving (SMART) to sports is timely given the maturity of AV trajectory prediction methods.

5. Strengths & Limitations

Key Strengths:

Clean conceptual framing that unifies existing football analytics components under the MCTS/world-model paradigm

Open-source release of code and checkpoints, rare in football analytics

Strong trajectory forecasting results relative to available baselines

Principled dual local/global search design for separating execution from decision quality

Distribution-aware metrics (mean-difference and percentile surplus) are more informative than point estimates
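This assessment does not reproduce the paper's formulas for the two execution-surplus scores. One plausible reading, offered purely as an assumption, is that the mean-based score compares the observed pass's gained value with the mean over sampled execution variants, while the percentile-based score is the share of variants the observed pass beats. The names `mean_surplus` and `percentile_surplus` and the toy numbers are hypothetical.

```python
from statistics import fmean

def mean_surplus(observed_gain, variant_gains):
    """Observed gain minus the average gain across sampled variants."""
    return observed_gain - fmean(variant_gains)

def percentile_surplus(observed_gain, variant_gains):
    """Share of variants the observed pass outperforms, ties counted as half."""
    below = sum(g < observed_gain for g in variant_gains)
    ties = sum(g == observed_gain for g in variant_gains)
    return (below + 0.5 * ties) / len(variant_gains)

# Toy distribution with one rare, very valuable variant.
variant_gains = [0.01, 0.02, 0.02, 0.03, 0.10]
observed = 0.03

mean_score = mean_surplus(observed, variant_gains)        # about -0.006
pct_score = percentile_surplus(observed, variant_gains)   # 0.7
```

In this toy case the two scores disagree: one outlier drags the mean above the observed pass, while the percentile view shows the pass beating most variants. That kind of disagreement is what a single point estimate would hide.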

Notable Weaknesses:

Extremely limited training data (7 matches) undermines all learned components

The PV model, which is the final scoring function for the entire pipeline, does not outperform a ball-only baseline

No end-to-end validation or calibration of MCPS outputs against ground truth or expert judgment

Evaluation is limited to a single test match case study, making it impossible to assess generalization

The Player-to-Touch module significantly underperforms existing methods, and errors compound through the pipeline

The local/global search distributions are hand-designed with manually chosen perturbation magnitudes — sensitivity to these hyperparameters is not analyzed

Overall Assessment

MCPS presents an intellectually appealing framework that connects football analytics to world-model-based planning in a principled way. The conceptual contribution is solid, and the open-source release addresses a real community need. However, the paper is fundamentally limited by data scarcity, resulting in under-performing sub-components (especially the value model) and a lack of convincing end-to-end validation. The work reads more as a proof-of-concept or framework proposal than a validated methodology. Its impact will depend heavily on whether the community adopts and scales the approach with larger datasets.

Rating: 4.5 / 10

Significance 5.5 · Rigor 3.5 · Novelty 5.5 · Clarity 7

Generated Jun 10, 2026

Comparison History (19)

Won vs. A Lightweight Multi-Agent Framework for Automated Concrete Barrier Design

Paper 2 demonstrates higher scientific impact potential due to its cross-disciplinary innovation (combining MCTS with sports analytics and autonomous driving models), broader methodological contributions (adapting SMART from autonomous driving to football trajectory prediction, novel distribution-aware attribution), and wider applicability beyond its specific domain. It leverages a unique 3D tracking dataset, introduces a reusable framework for counterfactual evaluation, and bridges multiple active research communities. Paper 1, while practically useful, addresses a narrow engineering application with a relatively straightforward multi-agent LLM orchestration approach, and its main finding about model scale is already well-documented in the LLM literature.

claude-opus-4-6 · Jun 11, 2026

Won vs. AutoMine Solution for AV2 2026 Scenario Mining Challenge

Paper 2 introduces a novel framework (MCPS) that creatively adapts techniques across domains (MCTS, autonomous driving trajectory models) for sports analytics, with broader methodological contributions including distribution-aware attribution and cross-domain transfer of trajectory generation models. It releases code and checkpoints, enabling reproducibility. Paper 1 is a competition solution report with incremental engineering contributions combining existing LLM/VLM techniques. Paper 2 has greater novelty, broader cross-field impact (sports analytics, multi-agent modeling, counterfactual reasoning), and stronger methodological rigor.

claude-opus-4-6 · Jun 11, 2026

Won vs. What Spatial Memory Must Store: Occlusion as the Test for Language-Agent Memory

Paper 2 introduces a novel framework (MCPS) combining MCTS-style evaluation with 3D ball tracking data for football pass evaluation, bridging sports analytics, multi-agent trajectory prediction, and counterfactual reasoning. It adapts methods from autonomous driving (SMART) to a new domain, releases code/checkpoints, and uses a novel public dataset. Paper 1 addresses a narrower problem (occlusion handling in language-agent memory palaces) with results that the authors themselves acknowledge as 'near-tautological,' and the confirmatory studies remain future work. Paper 2 has broader cross-domain impact, stronger methodological novelty, and more immediate practical applications.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

Paper 1 addresses a fundamental and highly relevant challenge in modern AI (memory limits in long-horizon language agents), offering broad applicability across numerous domains relying on LLMs. Paper 2, while methodologically innovative in adapting autonomous driving techniques to sports analytics, has a much narrower scope of impact primarily restricted to football data science.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Architect-Ant: Editable Automatic Furnishing of Architectural Floor Plans

Paper 1 offers a highly practical and scalable solution to a major bottleneck in architecture and real estate design. By combining a novel dataset, a domain-specific language, and vision-language models for procedural reasoning, it presents a comprehensive neuro-symbolic framework. While Paper 2 introduces an innovative multi-agent world model approach for sports analytics, Paper 1 has broader immediate real-world applications and commercial potential across multiple large-scale industries.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Soul Computing: A Theoretical Framework and Technical Architecture for Intelligent Agents with Independent Consciousness

Paper 1 presents a concrete, novel methodological integration (MCTS-style counterfactual search + learned world/value models) enabled by rare 3D tracking data, with measurable evaluation, model adaptations, and released code/checkpoints—supporting reproducibility and near-term uptake in sports analytics and trajectory-modeling research. Its approach is timely (world models, counterfactual evaluation), has clear real-world applications (player/team decision analysis), and can generalize to other multi-agent domains. Paper 2 is largely conceptual/theoretical with unclear formalism, validation, or implementable methodology, making near-term scientific impact less likely.

gpt-5.2 · Jun 10, 2026

Lost vs. HIPIF: Hierarchical Planning and Information Folding for Long-Horizon LLM Agent Learning

Paper 2 addresses a fundamental and highly timely challenge in artificial intelligence: improving LLM agents' performance in long-horizon tasks by mitigating long-context interference. Its proposed methodology has broad applicability across numerous domains where autonomous agents are deployed. In contrast, Paper 1 focuses on a niche application (sports analytics for football). While methodologically rigorous and innovative in its specific domain, Paper 1 lacks the cross-disciplinary breadth and widespread technological relevance of Paper 2.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. A Negative Result on Cross-Model Activation Transfer in a Pythia Multi-Hop Setting

Paper 1 introduces a novel framework (Monte Carlo Pass Search) combining multiple ML techniques for counterfactual pass evaluation in football, with released code/checkpoints and a public 3D tracking dataset. It has clear real-world applications in sports analytics, methodological novelty in adapting autonomous driving trajectory models to sports, and broad appeal across ML and sports science communities. Paper 2, while methodologically sound, reports a scoped negative result on cross-model activation transfer in a narrow setting with small models, limiting its broader impact and applicability despite contributing useful knowledge about mechanistic interpretability limitations.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Infini Memory: Maintainable Topic Documents for Long-Term LLM Agent Memory

Paper 2 addresses a fundamental challenge in LLM agents—long-term persistent memory—which is highly relevant given the explosive growth of LLM agent research. The topic-structured document approach with iterative retrieval is broadly applicable across many agent applications. Paper 1, while technically sophisticated and well-executed, targets a narrow domain (football/soccer pass evaluation) with limited cross-disciplinary impact. The LLM memory problem affects a much larger research community and has wider real-world applications, giving Paper 2 greater potential scientific impact despite Paper 1's strong methodological contribution within sports analytics.

claude-opus-4-6 · Jun 10, 2026

Won vs. A complementary study on PlanGPT: Evaluation with defined Performance Metrics and comparison with a planner

Paper 1 is more novel and methodologically substantial: it integrates an MCTS-like counterfactual evaluation framework with a learned multi-agent world model using rare public 3D ball-tracking data, enabling distributional pass value attribution and releasing code/checkpoints. It has clear real-world application in sports analytics and broader relevance to counterfactual reasoning, model-based RL, and generative trajectory modeling. Paper 2 is mainly a replication/benchmarking study of an existing LLM planner, with limited innovation and narrower impact, despite being timely and useful for validation.

gpt-5.2 · Jun 10, 2026
