Theo Uscidda, Marta Tintore Gazulla, Maks Ovsjanikov, Federico Tombari, Leonidas Guibas
Current Large Reasoning Models (LRMs) exhibit remarkable general capabilities but significantly underperform in spatial reasoning tasks. Existing approaches treat this gap as a knowledge deficit, relying on supervised fine-tuning (SFT) to ingest labeled spatial data from external vision sources or synthetic engines. In contrast, we argue that for many tasks, spatial reasoning capabilities are already present in pre-trained LRMs but require alignment through logical coherence under geometric 2D and 3D constraints. In this work, we propose a self-supervised reinforcement learning (RL) framework that targets the internal reasoning process without requiring ground-truth annotations. By formalizing the notion of consistency verifiers -- reward functions that check for geometric and semantic consistency under transformations -- we demonstrate that models can improve their spatial reasoning abilities. We use both image transformations, like flipping, and textual transformations, like swapping the order of objects in the question, and propose a new optimal transport-based RL strategy, OT-GRPO, which is a minimal-matching variant of group relative policy optimization tailored to pairwise verifiers. We show that this label-free consistency training approaches the accuracy of models trained with ground-truth supervision and achieves similar generalization across diverse tasks and data domains.
This paper introduces a self-supervised RL framework for improving spatial reasoning in vision-language models (VLMs) by exploiting consistency under geometric and semantic transformations as a reward signal, eliminating the need for ground-truth labels. The key insight is that spatial reasoning capabilities likely already exist in pre-trained models but need alignment through logical coherence. The authors formalize "consistency verifiers" — reward functions that check whether model answers satisfy expected relationships (invariance or equivariance) under known transformations like image flipping, cropping, or textual object/relation swapping. They also introduce OT-GRPO, a minimal-matching variant of Group Relative Policy Optimization that uses optimal transport to pair completions adversarially, ensuring high rewards require genuine consistency rather than lucky alignment.
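To make the verifier idea concrete, here is a minimal sketch in Python. It assumes binary True/False answers, models each transformation's effect as a single bit (invariant or equivariant), and reads the composition of effects as parity (XOR). The names `Transform`, `compose`, and `consistency_reward` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

# Hypothetical encoding of a transformation's effect on a binary answer.
INVARIANT, EQUIVARIANT = 0, 1


@dataclass(frozen=True)
class Transform:
    name: str
    effect: int  # INVARIANT: answer unchanged; EQUIVARIANT: answer flips


def compose(*transforms: Transform) -> int:
    """Combine effects by parity: an odd number of flips yields a flip."""
    effect = INVARIANT
    for t in transforms:
        effect ^= t.effect
    return effect


def consistency_reward(answer_orig: bool, answer_transformed: bool, effect: int) -> float:
    """Return 1.0 if the two answers relate as the transformation predicts."""
    expected = (not answer_orig) if effect == EQUIVARIANT else answer_orig
    return 1.0 if answer_transformed == expected else 0.0


# Illustrative depth question: "Is object A closer than object B?"
# Assumed effects: a horizontal flip keeps the answer, swapping A and B flips it.
hflip = Transform("horizontal_flip", INVARIANT)
swap = Transform("swap_objects_in_question", EQUIVARIANT)

effect = compose(hflip, swap)
print(consistency_reward(True, False, effect))
print(consistency_reward(True, True, effect))
```

Note that the reward never consults a ground-truth label: it only checks that the pair of answers is mutually coherent, which is the property the review attributes to consistency verifiers.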
The methodology is well-grounded. The formalization of consistency verifiers is clean: transformations decompose into image and text components, each with known effects (invariant or equivariant), and composition follows a simple parity rule. The OT-GRPO contribution is theoretically motivated through a Wasserstein distance interpretation (Proposition A.3), and the random baseline analysis (Proposition A.1) provides a clear justification for why minimal matching is preferable — expected reward under random guessing decays as O(1/√K) versus a constant 1/2 for alternatives.
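The random-baseline argument can be checked numerically under simplifying assumptions that are this sketch's, not necessarily the paper's: answers are binary, K is the number of completions per group, both groups guess with independent fair coins, and the group reward is the fraction of consistent pairs. Under these assumptions the minimal matching leaves exactly |a + b - K| consistent pairs, where a and b count the True answers in each group (with transformed answers mapped back through the expected effect). Proposition A.1 may be stated differently; the function names below are hypothetical.

```python
import random


def minimal_matching_reward(a: int, b: int, k: int) -> float:
    """Lowest achievable fraction of consistent pairs in a one-to-one matching.

    a: number of True answers on the original input
    b: number of True answers on the transformed input, after mapping back
    """
    return abs(a + b - k) / k


def random_pairing_reward(orig, mapped, rng) -> float:
    """Fraction of consistent pairs under a uniformly random pairing."""
    shuffled = mapped[:]
    rng.shuffle(shuffled)
    return sum(x == y for x, y in zip(orig, shuffled)) / len(orig)


def simulate(k: int, trials: int = 5000, seed: int = 0):
    """Average both rewards when every completion is a fair coin flip."""
    rng = random.Random(seed)
    min_total = rand_total = 0.0
    for _ in range(trials):
        orig = [rng.random() < 0.5 for _ in range(k)]
        mapped = [rng.random() < 0.5 for _ in range(k)]
        min_total += minimal_matching_reward(sum(orig), sum(mapped), k)
        rand_total += random_pairing_reward(orig, mapped, rng)
    return min_total / trials, rand_total / trials


for k in (4, 16, 64):
    minimal, rand = simulate(k)
    print(f"K={k:3d}  minimal={minimal:.3f}  random={rand:.3f}")
```

In this toy setting the random pairing stays near 1/2 for every K, while the minimal-matching average shrinks roughly like 1/√(πK), matching the O(1/√K) decay described above.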
The experimental design is thorough. Four complementary tasks (orientation, depth, size, relative distance) on two domains (KITTI outdoor, SUN RGB-D indoor) with two model sizes (3B, 7B) provide comprehensive coverage. The comparison between consistency and accuracy training is fair — same data, same augmentations, same hyperparameters, differing only in the reward signal. Cross-task and cross-domain transfer matrices (Figures 3-4) are particularly informative.
However, there are methodological concerns. The tasks are all relatively simple binary VQA problems with well-defined geometric structure. The transformations used are hand-designed and task-specific — it's unclear how this framework would extend to tasks without such clean equivariance properties. The paper acknowledges this limitation but doesn't explore it deeply.
Practical impact: The ability to improve spatial reasoning without ground-truth labels is valuable for robotics, navigation, and embodied AI, where spatial annotation pipelines are expensive and error-prone. The label corruption experiment (Figure 6) is practically compelling — consistency training overtakes accuracy training at just 20% noise, and real annotation pipelines chaining depth estimators, calibrators, and detectors likely exceed this threshold.
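A back-of-the-envelope sketch of why label noise weakens an accuracy reward, assuming symmetric, independent flips of binary labels. This noise model is an assumption; the review does not say how the experiment in Figure 6 corrupts labels. A consistency reward never reads the labels, so in this model its signal does not depend on the flip rate.

```python
def observed_agreement(true_accuracy: float, flip_rate: float) -> float:
    """Expected agreement between model answers and noisy binary labels.

    Assumes each label is flipped independently with probability flip_rate.
    """
    return true_accuracy * (1 - flip_rate) + (1 - true_accuracy) * flip_rate


for rate in (0.0, 0.2, 0.5):
    perfect = observed_agreement(1.0, rate)
    chance = observed_agreement(0.5, rate)
    print(f"flip_rate={rate:.1f}  perfect model={perfect:.2f}  chance model={chance:.2f}")
```

At a 20% flip rate a perfect model agrees with only 80% of the labels, so one in five accuracy rewards points the wrong way; at 50% the labels carry no signal at all.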
Methodological impact: The consistency verifier framework could generalize beyond spatial reasoning to any domain where known transformations induce predictable answer changes. The OT-GRPO algorithm for handling pairwise rewards in GRPO is a clean technical contribution applicable whenever verifiers naturally score pairs rather than individuals.
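As an illustration of how a pairwise verifier might feed a group-relative update, the sketch below picks the one-to-one pairing with the lowest total reward and then standardizes the resulting per-completion rewards within the group. Several choices here are assumptions rather than the paper's specification: the brute-force search (with uniform weights on K completions per side, optimal transport reduces to an assignment problem, so a real implementation would likely use an assignment solver), the rule that each completion inherits the reward of its matched pair, and all function names.

```python
from itertools import permutations
from statistics import mean, pstdev


def minimal_matching(pair_reward):
    """Return the one-to-one pairing with the lowest total reward.

    pair_reward[i][j] scores original completion i against transformed
    completion j. Brute force over permutations, so only for small groups.
    """
    k = len(pair_reward)
    return min(
        permutations(range(k)),
        key=lambda p: sum(pair_reward[i][p[i]] for i in range(k)),
    )


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within the group, as in GRPO-style advantages."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of K = 4 binary answers; transformed answers are assumed
# to be already mapped back through the expected effect.
orig = [True, True, True, False]
mapped = [True, True, False, False]
pair_reward = [[1.0 if x == y else 0.0 for y in mapped] for x in orig]

match = minimal_matching(pair_reward)
rewards = [pair_reward[i][match[i]] for i in range(len(orig))]
advantages = group_relative_advantages(rewards)
print(match, rewards, advantages)
```

In this example the adversarial pairing leaves only one consistent pair out of four, whereas a lucky pairing could have produced three, which is the sense in which minimal matching guards against rewards earned by chance alignment.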
Broader influence: The paper adds to the growing evidence that RL post-training acts more as a selector or aligner of existing capabilities than as a source of new ones (consistent with Yue et al., 2025 and Chen et al., 2025b). This insight about the nature of model improvement is significant for the field.
The paper is highly timely. Spatial reasoning is a recognized bottleneck for VLMs (30-40% gap vs. humans on recent benchmarks). The RL post-training paradigm following DeepSeek-R1 is generating intense interest, and self-supervised reward signals are an active frontier. The work sits at the intersection of these two trends, offering an alternative to the dominant paradigm of scaling labeled data.
1. Elegant framework: The consistency verifier formalization is simple yet powerful. The invariance/equivariance dichotomy under transformations is natural for spatial tasks and the composition rule is practical.
2. Strong empirical results: Consistency training comes within 2-3 percentage points (pp) of ground-truth accuracy training across four tasks, two model sizes, and two domains. The cross-task and cross-domain transfer results are particularly compelling.
3. Principled OT matching: The adversarial pairing via optimal transport is well-motivated theoretically and delivers consistent empirical improvements over random and one-to-all alternatives with negligible computational overhead.
4. Comprehensive ablations: Label corruption robustness, pairing strategy comparison, extension to numeric tasks, and comparison against seven self-supervised baselines provide thorough validation.
5. Practical relevance: The framework addresses a real problem — annotation quality in spatial reasoning pipelines — with a principled solution.
1. Task simplicity: All four core tasks are binary True/False questions with clean geometric structure. Extension to compositional spatial reasoning, multi-step reasoning, or open-ended spatial questions remains unexplored.
2. Transformation design: The transformations are hand-crafted and domain-specific. Scaling to new spatial tasks or non-spatial domains requires manual identification of appropriate equivariances.
3. Narrow evaluation scope: Only two model sizes (3B, 7B) of a single model family (Qwen2.5-VL) are tested. Generalization to other architectures is unverified.
4. Self-supervised baseline comparison: The Visual Jigsaw and SSL4RL checkpoints are used without fine-tuning on the same data. This is a pragmatic choice, but it makes the comparison somewhat asymmetric.
5. Numeric task exploration is preliminary: Only counting and absolute distance are tested, with limited analysis of failure modes or scaling behavior.
6. The 2-3pp gap: While small, this gap is consistent and may compound in downstream applications requiring high reliability.
This is a well-executed paper with a clean conceptual contribution. The idea that consistency under known transformations can substitute for ground-truth supervision is compelling and well-demonstrated for spatial reasoning. The OT-GRPO algorithm is a useful technical contribution. The main limitation is the restriction to relatively simple tasks with clean geometric structure, leaving open whether the approach scales to harder compositional reasoning. Nevertheless, the practical implications for reducing annotation dependence and the methodological clarity make this a solid contribution.
Generated Jun 11, 2026
Paper 2 addresses a fundamental and widespread bottleneck in LLM agents—handling long-horizon tasks and managing context length. Its hierarchical memory approach significantly reduces token usage (up to 78%) while maintaining or improving reasoning quality across general tasks. While Paper 1 presents an innovative self-supervised RL approach for spatial reasoning, Paper 2's focus on general agentic memory and efficiency offers broader applicability and higher potential impact across various real-world AI agent deployments.
Paper 2 addresses the pervasive issue of knowledge conflicts in Retrieval-Augmented Generation (RAG) systems. By shifting from a context-aware to a conflict-aware paradigm, it tackles a critical reliability bottleneck applicable to nearly all LLM deployments. While Paper 1 introduces a novel self-supervised RL approach for spatial reasoning, Paper 2's potential to improve factual accuracy and robustness against erroneous contexts in general LLM applications gives it a broader and more immediate scientific and practical impact.
Paper 1 introduces a novel paradigm: improving spatial reasoning in LRMs through self-supervised consistency verification rather than labeled data, with a new RL strategy (OT-GRPO). This challenges the dominant assumption that spatial reasoning requires external supervision and has broad implications for reasoning alignment beyond spatial tasks. Its theoretical contribution (consistency verifiers, optimal transport-based RL) is also more foundational. Paper 2, while practically useful for resource-constrained QA with its latent memory compression, is more incremental: it optimizes token efficiency in RAG systems and offers narrower conceptual novelty.
Paper 1 addresses a critical and highly timely issue in LLM deployment: AI safety and sycophancy in memory-augmented models. As persistent memory becomes standard in consumer LLMs, identifying and mitigating memory-induced errors has immediate, broad real-world applicability. While Paper 2 presents an innovative self-supervised RL method for spatial reasoning, Paper 1's findings have broader implications for general LLM alignment, safety, and architecture, granting it wider interdisciplinary relevance.
Paper 1 is likely to have higher scientific impact due to greater novelty (label-free, self-supervised RL via consistency verifiers and OT-GRPO), broader applicability across LLM/LRM reasoning tasks, and strong timeliness in foundational AI alignment and reasoning research. Its approach could generalize to multiple domains (vision-language, planning, verification) and influence model training paradigms. Paper 2 is valuable and rigorous with clear real-world relevance to infrastructure FE modeling, but its impact is narrower (engineering workflow automation) and more application-specific, with less methodological innovation at the core scientific level.
Paper 1 offers higher fundamental scientific impact by addressing a core cognitive gap in Large Reasoning Models (spatial reasoning) without relying on ground-truth annotations. By formalizing consistency verifiers and introducing a novel RL strategy (OT-GRPO), it advances the critical frontier of self-improving models and unsupervised alignment. While Paper 2 provides exceptional systems-level and practical deployment contributions for multi-agent orchestration, Paper 1's methodological innovations in algorithmic self-improvement have broader implications for foundational model training and reasoning.
Paper 1 introduces a novel self-supervised RL framework (OT-GRPO) for improving spatial reasoning in LRMs without ground-truth labels, leveraging consistency verifiers and optimal transport-based policy optimization. This has broad impact across AI/ML, addressing a fundamental limitation of LRMs with a generalizable methodology. Paper 2 presents a domain-specific framework for BIM compliance checking with narrower applicability to AEC industry. Paper 1's methodological innovations (consistency-based self-supervision, OT-GRPO) are more transferable across fields and address a timely problem in foundation model research.
Paper 1 introduces a novel self-supervised RL framework (OT-GRPO) for improving spatial reasoning in LRMs without ground-truth labels, addressing a fundamental limitation with broad applicability across vision-language tasks. The consistency verifier concept and optimal transport-based RL strategy represent significant methodological innovations. Paper 2, while practically useful, addresses a narrow domain (concrete barrier design) with an application-focused framework combining existing tools (AutoGen, LLMs). Paper 1's contributions to reasoning alignment, label-free training, and generalizable methodology give it substantially broader scientific impact potential.
Paper 1 has higher estimated impact due to stronger novelty and broader applicability: it introduces label-free, self-supervised RL via consistency verifiers leveraging geometric/semantic invariances, potentially generalizable beyond spatial reasoning to other reasoning domains (e.g., logical, causal) using transformation-based constraints. The OT-GRPO optimization tailored to pairwise verifiers is a methodological contribution. It addresses a timely, widely observed weakness in LRMs without reliance on external supervision, increasing real-world feasibility. Paper 2 is valuable for agent training, but is more domain-specific and closer to incremental refinement of self-distillation.
Paper 1 introduces a novel self-supervised RL framework (OT-GRPO) for improving spatial reasoning in LRMs without ground-truth labels, addressing a fundamental capability gap with a principled methodological contribution (consistency verifiers, optimal transport-based policy optimization). It demonstrates that label-free training can come close to matching supervised approaches, which has broad implications for AI alignment and reasoning. Paper 2 proposes a reference architecture for AI agent governance. This is valuable for enterprise security, but it is more of a systems/engineering contribution with narrower scope, lacks empirical evaluation against live agents, and addresses an emerging but more application-specific problem.