GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

Kyeongjin Ahn, Seungeon Lee, Krishna P. Gummadi, Meeyoung Cha

May 19, 2026

arXiv:2605.20006v1

cs.AI (primary)

#461 of 2292 · Artificial Intelligence

Tournament Score: 1476 ± 42

Win Rate: 70%

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Abstract

Geospatial reasoning requires solving image-grounded problems over the complex spatial structure of a scene. However, developing this capability is hindered by the cost of annotating a vast and combinatorial question space. We propose GeoX, a self-play framework that acquires spatial logic through executable programs that yield verifiable rewards, without relying on large-scale human-curated data. Given a satellite or aerial image, our framework employs a single multimodal policy that proposes spatial problems as executable programs and solves them under three reasoning modes (abduction, deduction, and induction) over spatial primitives and an image understanding tool. A verifier executes each program and converts its output into a reward signal that jointly optimizes the two roles via reinforcement learning. GeoX consistently improves its base VLMs by up to 5.5 points on average, matching or exceeding conventional baselines trained on millions of curated examples. Alongside the proposed method, we release a benchmark for geospatial understanding accumulated through self-play.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: GeoX

1. Core Contribution

GeoX introduces a self-play framework for geospatial reasoning over satellite and aerial imagery that eliminates the need for human-curated question-answer training data. The key insight is that spatial structure inherent in remote sensing imagery can be extracted via executable programs composed of geometric, topological, and aggregation primitives paired with an open-vocabulary segmenter. A single multimodal policy alternates between a proposer (generating executable spatial problems) and a solver (answering them under three reasoning modes: abduction, deduction, and induction), with program execution providing verifiable reward signals for reinforcement learning.
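To make the idea of execution-based verification concrete, here is a minimal sketch under stated assumptions. The primitives (`area`, `centroid`, `left_of`), the toy list-of-lists mask format standing in for segmenter output, and the exact-match reward in `verify` are all hypothetical names and choices made for illustration; they are not GeoX's actual primitive set or reward definition.

```python
from typing import Callable, Dict, List, Tuple

# Toy binary mask (rows of 0/1); a stand-in for open-vocabulary segmenter output.
Mask = List[List[int]]


def area(mask: Mask) -> int:
    """Aggregation primitive (hypothetical): count foreground pixels."""
    return sum(sum(row) for row in mask)


def centroid(mask: Mask) -> Tuple[float, float]:
    """Geometric primitive (hypothetical): mean (row, col) of foreground pixels."""
    pts = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not pts:
        raise ValueError("mask has no foreground pixels")
    n = len(pts)
    return (sum(r for r, _ in pts) / n, sum(c for _, c in pts) / n)


def left_of(a: Mask, b: Mask) -> bool:
    """Relational primitive (hypothetical): is a's centroid left of b's?"""
    return centroid(a)[1] < centroid(b)[1]


Program = Callable[[Dict[str, Mask]], object]


def verify(program: Program, masks: Dict[str, Mask], solver_answer: object) -> float:
    """Execute the proposed program and reward an exactly matching answer."""
    return 1.0 if solver_answer == program(masks) else 0.0


# Toy scene: a 2x2 building on the left, a water strip on the right.
masks = {
    "building": [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]],
    "water": [[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1]],
}


def proposed_program(m: Dict[str, Mask]) -> int:
    """'Building area if the building lies left of the water, else 0.'"""
    return area(m["building"]) if left_of(m["building"], m["water"]) else 0


reward_correct = verify(proposed_program, masks, 4)
reward_wrong = verify(proposed_program, masks, 3)
```

In this toy setup the ground truth comes entirely from running the program on the masks, so no human-written answer is needed; the solver would see only the image and the question, not the intermediate values.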

The contribution is genuinely novel in combining three ideas: (1) self-play for visual reasoning in a geospatial domain, (2) executable program-based verification replacing human annotation, and (3) three complementary reasoning modes from a single problem proposal. The paper directly addresses the combinatorial explosion of spatial questions over complex overhead imagery—a real bottleneck that has limited prior work to narrow template-based questions.

2. Methodological Rigor

The methodology is well-structured. The framework builds on established RL foundations (RLVR, self-play from Absolute Zero) and adapts them thoughtfully to the visual-spatial domain. Several design choices demonstrate care:

Intermediate values hidden from policy: Preventing the model from exploiting execution traces as shortcuts is a sound design decision that forces genuine visual reasoning.

Learnability reward: The proposer reward (Eq. 10) that peaks when the solver succeeds ~50% of the time creates an adaptive curriculum—problems at the frontier of solver ability.

Induction verification via held-out splits: The train/test partition of input-output pairs for induction prevents memorization.

Seed initialization: Starting from a minimal template (presence detection) and growing to compositional problems demonstrates the framework's bootstrapping ability.
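The learnability reward described above can be illustrated with a small sketch. The exact form of Eq. 10 is not reproduced in this assessment, so the function below uses 4p(1 - p) purely as an assumed stand-in: it is one simple shape that is zero when the solver always fails or always succeeds and maximal at a 50% success rate. The name `learnability_reward` is hypothetical.

```python
from typing import Sequence


def learnability_reward(solver_outcomes: Sequence[bool]) -> float:
    """Assumed stand-in for a proposer reward that peaks at 50% solver success.

    solver_outcomes holds one True/False per solver attempt on the same problem.
    Returns 4 * p * (1 - p), where p is the empirical success rate.
    """
    if not solver_outcomes:
        raise ValueError("need at least one solver attempt")
    p = sum(solver_outcomes) / len(solver_outcomes)
    return 4.0 * p * (1.0 - p)


too_easy = learnability_reward([True, True, True, True])       # p = 1.0
too_hard = learnability_reward([False, False, False, False])   # p = 0.0
frontier = learnability_reward([True, False, True, False])     # p = 0.5
mostly_hard = learnability_reward([True, False, False, False]) # p = 0.25
```

Under this assumed shape, problems the solver always or never solves earn the proposer nothing, which is what pushes proposals toward the frontier of solver ability.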

However, there are methodological concerns:

The SFT warm-up stage, though described as "parser warm-up," introduces a confound. The claim of "zero curated data" is somewhat undermined by this step, even if the warm-up data is self-generated.

The reliance on a single open-vocabulary segmenter (SegEarth-OV3) means that segmentation errors propagate through the entire reward pipeline. The authors acknowledge this but don't quantify its impact.

Evaluation uses N_eval=32 responses with majority voting, which is computationally expensive and may inflate apparent performance relative to single-pass baselines.
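The majority-voting concern can be made concrete with a short sketch. The function name `majority_vote`, the tie-breaking rule (first-seen answer wins), and the toy answer distribution are assumptions for illustration; the paper's aggregation procedure may differ.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]


# Toy case with 32 samples: the correct answer "7" is only a plurality.
samples = ["7"] * 12 + ["6"] * 11 + ["8"] * 9
voted = majority_vote(samples)

# Fraction of individual samples that are correct, i.e. expected single-pass accuracy.
single_pass_accuracy = samples.count("7") / len(samples)
```

Here the voted answer is correct even though an individual sample is right only 37.5% of the time, which is the kind of gap that makes a 32-sample score hard to compare with single-pass baselines.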

3. Potential Impact

Direct impact on remote sensing AI: The framework addresses a genuine scalability bottleneck. Remote sensing imagery is proliferating far faster than annotation capacity, making self-supervised approaches essential. GeoX's improvements are particularly strong on counting and spatial relation tasks—precisely the tasks where human curation is most expensive.

Broader methodological impact: The paper demonstrates that self-play with verifiable rewards can extend beyond mathematics and code (where Absolute Zero operates) to perception-grounded domains. This could inspire similar approaches in medical imaging, autonomous driving scene understanding, or architectural analysis. The key transferable insight is that physical structure in images can serve as its own supervision source when combined with executable verification.

Benchmark contribution: The self-grown benchmark covering nine compositional dimensions fills a gap. The compositional analysis (Figure 4) convincingly shows that existing VQA benchmarks are sparse in their dimensional coverage, while GeoX produces denser, more compositional problems.

4. Timeliness & Relevance

The paper is well-timed along multiple axes:

The RL-from-verifiable-rewards paradigm (DeepSeek-R1, Absolute Zero) is rapidly gaining traction; extending it to visual-spatial reasoning is a natural and timely step.

The explosion of satellite imagery (from Sentinel, commercial providers) creates urgent demand for scalable reasoning methods.

The limitations of curated data for VLMs in specialized domains are widely recognized.

5. Strengths & Limitations

Key Strengths:

Zero curated data with competitive performance: Matching or exceeding models trained on millions of curated pairs (e.g., EarthDial with 11.1M pairs) is remarkable and the paper's strongest selling point.

Principled ablation: The ablation study (Table 2) is well-designed, isolating contributions of each reasoning mode, adaptive curriculum (BaseGen vs. Full), and proposer reward (SolvOnly vs. Full). The finding that removing abduction hurts most provides genuine insight.

Compositional analysis: The nine-dimension framework and graph visualization provide a principled way to characterize problem diversity beyond simple counts.

Qualitative evidence: The Chain-of-Thought examples (Figures 10-12) show the model genuinely following program logic grounded in image evidence.

Notable Limitations:

Single-tool dependency: Restricting T to one segmenter limits the framework to mask-derivable spatial reasoning. Depth, road networks, and metadata reasoning are excluded. The authors flag this honestly, but it significantly constrains current applicability.

Modest absolute improvements on some tasks: While +5.5 average improvement for LLaVA is notable, Qwen gains are smaller (+1.9), and on several individual tasks (e.g., Event Detection for LLaVA, Area for Qwen), performance doesn't improve or even degrades.

Computational cost: 60 hours on 4×H200 GPUs is substantial, and the paper doesn't discuss efficiency compared to conventional fine-tuning.

Limited base model diversity: Only two base VLMs are tested (both 7B). Scaling behavior to larger or smaller models is unknown.

Segmenter quality ceiling: The framework's performance is ultimately bounded by the segmenter's accuracy, creating a hard ceiling that curated-data approaches don't face.

6. Additional Observations

The primitive usage analysis (Figure 6, Table 7) revealing a long-tail distribution is informative—it shows the proposer discovers diverse problem types but heavily favors certain operations, suggesting room for curriculum design improvements. The paper's framing around "spatial physics" and "geospatial world models" is aspirational but not yet fully realized; the current system handles 2D spatial relations rather than physical dynamics.

Rating: 7/10

Significance 7.5 · Rigor 7 · Novelty 7.5 · Clarity 7.5

Generated May 20, 2026

Comparison History (23)

vs. Implicit Safety Alignment from Crowd Preferences

gemini-3.1 · 5/22/2026

Paper 1 addresses AI safety and alignment, a critical challenge in modern AI and LLMs. By extracting implicit safety criteria from crowd preferences without explicit safety rewards, it offers a scalable solution to a major bottleneck in RLHF. Paper 2's self-play framework is highly innovative, but its focus on geospatial reasoning makes its immediate impact more domain-specific compared to the broader applicability of Paper 1.

vs. Implicit Safety Alignment from Crowd Preferences

claude-opus-4.6 · 5/22/2026

GeoX introduces a novel self-play framework for geospatial reasoning that eliminates the need for large-scale human annotations, combining executable program generation with verifiable rewards across multiple reasoning modes. This addresses a significant bottleneck in geospatial AI and demonstrates strong empirical results matching models trained on millions of curated examples. While Paper 2 addresses important safety alignment questions, its contribution is more incremental within the well-explored RLHF/safe RL space. GeoX's broader applicability to remote sensing, urban planning, and environmental monitoring, plus its benchmark release, gives it higher potential impact.

vs. What Counts as AI Sycophancy? A Taxonomy and Expert Survey of a Fragmented Construct

gemini-3.1 · 5/22/2026

Paper 1 introduces a novel, scalable self-play framework for geospatial reasoning, eliminating the need for massive human-annotated datasets. Its use of verifiable rewards and executable programs advances fundamental VLM reasoning capabilities. While Paper 2 provides a valuable taxonomy for AI alignment, Paper 1 offers a technical breakthrough with broad real-world applications in remote sensing and autonomous systems, along with a new benchmark, likely leading to higher broader scientific impact.

vs. Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents

gemini-3.1 · 5/22/2026

Paper 1 addresses a highly critical and universal bottleneck in the rapidly growing field of LLM agents: systematic debugging and diagnostics at scale. Its framework for corpus-level trace diagnostics offers broad, cross-domain utility for developers and researchers, leading to substantial performance improvements. While Paper 2 presents an innovative self-play approach for geospatial reasoning, its direct impact is largely confined to a specific subfield, whereas Paper 1's methodology will impact the foundational development and deployment of LLM agents across all domains.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

claude-opus-4.6 · 5/21/2026

Paper 1 addresses a fundamental theoretical gap in a widely-used alignment technique (DPO vs RLHF), which impacts the entire LLM alignment community. It provides rigorous theoretical analysis identifying failure modes of DPO, proves when equivalence breaks down, and proposes a principled fix (CPO) with provable guarantees. Given the massive adoption of DPO across the field, this work has broad implications for all practitioners doing preference optimization. Paper 2, while innovative in its self-play approach for geospatial reasoning, addresses a more specialized domain with narrower impact. The foundational nature of Paper 1's contribution to LLM alignment gives it higher potential impact.

vs. Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

claude-opus-4.6 · 5/21/2026

Paper 2 addresses a fundamental theoretical gap in DPO vs RLHF equivalence, which is central to modern LLM alignment—a topic with enormous breadth of impact across all AI applications. It provides rigorous mathematical characterization of failure modes, proposes a practical fix (CPO), and has immediate relevance given DPO's widespread adoption. Paper 1 is innovative for geospatial reasoning with self-play but targets a narrower domain. Paper 2's theoretical insights and practical implications for the entire LLM training ecosystem give it substantially broader and more timely impact.

vs. AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

claude-opus-4.6 · 5/20/2026

GeoX introduces a novel self-play framework with verifiable rewards for geospatial reasoning that addresses a fundamental data scarcity problem, combining executable program synthesis with three reasoning modes. Its methodological innovation (self-play + verifiable rewards without human-curated data) is more technically novel and transferable. Paper 2, while addressing an important problem in AI-assisted research, is more of an engineering integration of existing ideas (multi-agent debate, self-healing execution) into a pipeline, with evaluation on a self-created benchmark. GeoX's approach has broader methodological impact across spatial AI and reinforcement learning communities.

vs. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

gpt-5.2 · 5/20/2026

Paper 1 (GeoX) has higher estimated impact due to stronger novelty (self-play with executable, verifiable rewards for image-grounded geospatial reasoning) and clearer real-world applicability (remote sensing, mapping, disaster response, defense, urban planning). It also contributes a benchmark, aiding field-wide progress. The approach is timely and could generalize to other grounded reasoning domains. Paper 2 (SIGMA) is a solid, relevant improvement to multi-agent LLM aggregation via signed graphs, but is more incremental and likely narrower in downstream impact compared to a new data/learning paradigm plus benchmark in an important application area.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

claude-opus-4.6 · 5/20/2026

GeoX presents a concrete, novel methodology (self-play with verifiable rewards for geospatial reasoning) with demonstrated empirical results, a released benchmark, and practical applications in remote sensing. It introduces an innovative training paradigm that reduces reliance on expensive human annotations while achieving competitive performance. Paper 1, while addressing an important topic (trust in agent networks), is a vision/conceptual paper without empirical validation. GeoX's combination of methodological novelty, quantitative results, and cross-domain applicability (RL, VLMs, geospatial AI) gives it higher near-term scientific impact.

vs. Memory-Augmented Reinforcement Learning Agent for CAD Generation

claude-opus-4.6 · 5/20/2026

GeoX demonstrates higher potential scientific impact due to several factors: (1) its self-play framework with verifiable rewards is a novel paradigm that eliminates dependence on expensive human annotations, applicable beyond geospatial reasoning; (2) it addresses three reasoning modes (abduction, deduction, induction) providing broader methodological contribution; (3) the release of a benchmark enables community-wide progress; (4) geospatial AI has vast real-world applications (urban planning, disaster response, environmental monitoring); (5) the approach of using executable programs as verifiable rewards is timely and generalizable to other domains requiring spatial reasoning.

vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

gpt-5.2 · 5/20/2026

Paper 1 is likely higher impact: it introduces a novel self-play + verifiable-reward framework for geospatial reasoning in VLMs, reducing reliance on expensive human annotation and releasing a benchmark, which can catalyze broader follow-on work. Its applications span remote sensing, mapping, disaster response, and spatial planning, and the core idea (programmatic self-play with execution-based rewards across abduction/deduction/induction) may generalize to other grounded reasoning domains. Paper 2 is valuable for training robustness, but resembles systems/controls tuning around existing optimizers with narrower methodological novelty and external applicability.

vs. From Prompts to Pavement Through Time: Temporal Grounding in Agentic Scene-to-Plan Reasoning

gemini-3.1 · 5/20/2026

Paper 1 introduces a highly scalable self-play framework with verifiable rewards that eliminates the need for massive human-annotated datasets in geospatial reasoning, demonstrating strong quantitative improvements over baselines. In contrast, while Paper 2 addresses a critical topic in autonomous vehicles (temporal grounding), its results show no statistically significant quantitative improvements in standard metrics. Paper 1's methodological innovation in multimodal reinforcement learning and its definitive empirical success suggest a broader and more immediate scientific impact across both fundamental AI research and applied geospatial fields.

vs. AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

claude-opus-4.6 · 5/20/2026

GeoX introduces a novel self-play framework for geospatial reasoning that combines executable programs, verifiable rewards, and reinforcement learning without requiring large-scale human annotations. This represents a more fundamental methodological innovation with broader impact: it addresses the core challenge of data scarcity in geospatial AI, introduces a new reasoning paradigm (abduction/deduction/induction over spatial primitives), and releases a benchmark. Paper 1, while practical, offers an incremental engineering contribution (training-free token reduction for GUI agents) with narrower scope. GeoX's approach could generalize to other spatial reasoning domains beyond geospatial applications.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gemini-3.1 · 5/20/2026

Paper 1 offers higher scientific impact due to its methodological innovation in foundational AI. By introducing self-play and verifiable rewards to multimodal geospatial reasoning, it addresses a major bottleneck (costly data curation) in vision-language models. This self-improving framework can be generalized across numerous domains like remote sensing, urban planning, and disaster response. While Paper 2 is highly rigorous and valuable for industrial control, Paper 1's advancement of generalizable AI reasoning capabilities and the release of a new benchmark will likely drive broader cross-disciplinary adoption and citations.

vs. Generative Recursive Reasoning

gemini-3.1 · 5/20/2026

Paper 1 introduces a fundamental methodological advancement in neural reasoning by enabling probabilistic, multi-trajectory latent search (GRAM), addressing a critical bottleneck in AI inference scaling. Its broad applicability to general reasoning tasks gives it widespread relevance across the entire machine learning community. Paper 2 presents a strong application of self-play and verifiable rewards, but its focus is restricted to the specific domain of geospatial vision. Consequently, Paper 1 has a significantly higher potential for broad scientific impact and foundational innovation.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gpt-5.2 · 5/20/2026

GeoX is more novel and broadly impactful: it introduces a self-play RL framework with executable, verifiable rewards to learn geospatial reasoning from imagery, and releases a benchmark—advancing both method and data for an important, under-annotated domain. Its real-world applications (remote sensing, mapping, disaster response, urban planning) are immediate and high-value, and the verifiable-program setup suggests methodological rigor and transferability to other grounded reasoning tasks. MOCHA is timely and useful for LLM-agent prompt/skill optimization, but is narrower in scope and impact, and improvements are incremental within an established multi-objective optimization paradigm.

vs. Responsible Agentic AI Requires Explicit Provenance

claude-opus-4.6 · 5/20/2026

GeoX presents a novel, concrete technical framework combining self-play reinforcement learning with executable program verification for geospatial reasoning—a growing field with significant real-world applications (remote sensing, urban planning, disaster response). It demonstrates measurable improvements (5.5 points) over strong baselines without requiring large-scale human annotation, and releases a benchmark. Paper 2, while addressing an important topic (responsible AI provenance), is primarily a position/framework paper with only preliminary experiments. Its impact depends on adoption of proposed frameworks, which historically faces challenges. GeoX's methodological contribution and empirical results offer more immediate and citable scientific impact.

vs. Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

gpt-5.2 · 5/20/2026

Paper 2 (MEMOIR) has higher estimated impact due to broader applicability and clearer real-world relevance: solver synthesis for combinatorial optimization spans many high-value domains (logistics, scheduling, routing, chip design). The proposed cross-branch knowledge transfer via a two-level memory hierarchy is a generally reusable agentic search innovation, and results emphasize rigor-relevant metrics (validity, quality at matched budget, and reduced variance across runs). Paper 1 is novel and timely for geospatial VLMs, but its domain is narrower and gains are more incremental.

vs. EXG: Self-Evolving Agents with Experience Graphs

gpt-5.2 · 5/20/2026

Paper 2 (GeoX) is likely higher impact due to stronger novelty and timeliness: self-play with executable programs and verifiable rewards for multimodal geospatial reasoning reduces dependence on costly annotations and targets a high-value, under-served domain. It offers clear real-world applications (remote sensing, mapping, disaster response) and broader cross-field relevance (VLMs, RL, program synthesis, geospatial AI). Releasing a benchmark further amplifies adoption and reproducibility. Paper 1 is valuable but more incremental within crowded agent-memory/experience-structuring work and may have less immediate domain-specific payoff.

vs. Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

gpt-5.2 · 5/20/2026

Paper 1 introduces a novel self-play + verifiable-reward framework for geospatial reasoning, addressing a major bottleneck (costly combinatorial annotations) and demonstrating measurable gains over large curated-data baselines while also releasing a new benchmark. This combination of methodological innovation, domain significance (remote sensing, GIS, robotics, disaster response), and potential cross-field influence (program-based RL for multimodal reasoning) suggests broader and longer-lasting scientific impact. Paper 2 is timely and practically important for LLM systems, but is primarily an empirical characterization that may have narrower novelty and more limited long-term citation reach.