Generative Recursive Reasoning

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

May 19, 2026

arXiv:2605.19376v2

cs.AI (primary)

#179 of 2292 · Artificial Intelligence

Tournament Score: 1523 ± 43 (scale 1050 to 1800)

Win Rate: 68%

Rating: 6.5/10

Significance 6.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_θ(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_θ(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Generative Recursive Reasoning (GRAM)

1. Core Contribution

GRAM introduces stochastic latent transitions into Recursive Reasoning Models (RRMs), transforming deterministic single-trajectory recursive refinement into a probabilistic multi-trajectory framework. The key insight is that reasoning systems should be both "deep" (many recursive refinement steps) and "wide" (exploring multiple latent trajectories in parallel). The model adds a learned Gaussian perturbation (stochastic guidance) to deterministic state updates at each recursion step, with both a mean (directional steering) and variance (exploration) component. This is trained via amortized variational inference with an ELBO objective, yielding a latent-variable generative model that supports both conditional reasoning p(y|x) and unconditional generation p(x).
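The mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: `deterministic_update` and `guidance_params` are hypothetical stand-ins for the learned shared transition and the learned guidance network, and the constants inside them are arbitrary. The only point is the shape of the update: a deterministic refinement plus a Gaussian perturbation whose mean steers the state and whose variance controls exploration.

```python
import math
import random


def deterministic_update(z, x):
    """Hypothetical shared transition: pull the latent state toward the input."""
    return [0.5 * zi + 0.5 * xi for zi, xi in zip(z, x)]


def guidance_params(z, x):
    """Hypothetical stand-in for a learned guidance network.

    Returns a per-dimension mean (directional steering) and
    log-variance (exploration scale). Values are arbitrary.
    """
    mean = [0.1 * (xi - zi) for zi, xi in zip(z, x)]
    log_var = [-2.0 for _ in z]
    return mean, log_var


def stochastic_step(z, x, rng):
    """One recursion step: deterministic update plus a Gaussian perturbation."""
    base = deterministic_update(z, x)
    mean, log_var = guidance_params(z, x)
    return [
        b + m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for b, m, lv in zip(base, mean, log_var)
    ]


def rollout(x, depth, seed):
    """Apply the same stochastic step `depth` times from a zero initial state."""
    rng = random.Random(seed)
    z = [0.0] * len(x)
    for _ in range(depth):
        z = stochastic_step(z, x, rng)
    return z


# Two seeds give two different latent trajectories for the same input.
trajectory_a = rollout([1.0, -1.0], depth=16, seed=0)
trajectory_b = rollout([1.0, -1.0], depth=16, seed=1)
```

Because the step function is shared across iterations, depth is set by the loop count and width by the number of seeds.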

The contribution is conceptually clean: it bridges probabilistic latent state-space models (VRNN, Dreamer) with the emerging recursive reasoning paradigm, reinterpreting stochastic dynamics as computation rather than temporal modeling. The "width" axis of inference-time scaling through parallel trajectory sampling is a genuinely useful addition to the recursive reasoning toolkit.

2. Methodological Rigor

Strengths: The variational formulation is well-grounded. The hierarchical instantiation with high-level stochastic and low-level deterministic components is a sensible design choice, as is the decision to inject noise only at the abstract reasoning level. The ablation study (Table 3) is thorough and informative—it isolates the contribution of stochastic guidance across different architectural configurations, demonstrating that naive stochasticity (random initialization, stochastic decoding) does not help, validating the variational framework.

Concerns: The training objective (Equation 14) is a truncated surrogate rather than the true ELBO, with gradients propagated only through the final transition of each supervision step. While the authors provide empirical validation (Figure 8) showing both the surrogate and full ELBO improve monotonically, this is a biased approximation whose theoretical properties are not analyzed. The gap between surrogate and full ELBO is acknowledged but hand-waved. Additionally, the KL coefficient β varies substantially across tasks (0.04 to 0.5), suggesting sensitivity that could limit generalization.
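To make the role of β concrete, here is a minimal sketch of a β-weighted variational loss. It uses the standard closed-form KL divergence between diagonal Gaussians; it is not the paper's Equation 14 and does not model the gradient truncation discussed above. The function names and the example numbers are illustrative assumptions.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (
            lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0
        )
    return total


def beta_weighted_loss(recon_nll, kl_terms, beta):
    """Negative-ELBO-style loss: reconstruction NLL plus beta times summed KL."""
    return recon_nll + beta * sum(kl_terms)


# Illustrative numbers only: the same per-step KL terms weigh very
# differently at the two ends of the reported beta range.
kl_per_step = [gaussian_kl([1.0], [0.0], [0.0], [0.0])] * 4
loss_small_beta = beta_weighted_loss(2.0, kl_per_step, beta=0.04)
loss_large_beta = beta_weighted_loss(2.0, kl_per_step, beta=0.5)
```

With these numbers the KL contribution to the loss grows from 0.08 to 1.0 across the reported range of β, which gives a sense of how strongly the coefficient changes the balance between fit and regularization.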

The experimental comparisons are appropriately scoped to recursive reasoning baselines (Looped TF, HRM, TRM) rather than frontier LLMs, which is honest and methodologically sound. However, the benchmark suite is relatively narrow—Sudoku, ARC-AGI, N-Queens, Graph Coloring, and binarized MNIST are all small-scale, discrete, highly structured problems. The claim of "generative recursive reasoning" as a general framework is stronger than what the experiments demonstrate.

3. Potential Impact

Direct impact on recursive reasoning: GRAM establishes that stochastic guidance is a consistently beneficial module that can be added to any recursive architecture. This is immediately actionable for the growing community working on weight-sharing Transformers and recursive reasoning models.

Width-based inference scaling: The demonstration that parallel trajectory sampling can substitute for deeper recursion (GRAM with N=20 at 16 iterations outperforming TRM at 320 iterations) is practically significant. It converts a serial bottleneck into a parallelizable operation, which is better suited to modern hardware.
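The arithmetic behind this comparison is worth spelling out: 20 trajectories of 16 iterations use the same 320 transition evaluations as one trajectory of 320 iterations, but only 16 of them have to run one after another. The sketch below shows that accounting together with a generic best-of-N selection. How GRAM actually aggregates its trajectories is not described in this assessment, so `best_of_n` and its `score` argument are assumptions for illustration only.

```python
import random


def compute_profile(num_trajectories, depth):
    """Total transition evaluations versus steps that must run sequentially."""
    return {
        "total_steps": num_trajectories * depth,
        "serial_steps": depth,
    }


def best_of_n(run_trajectory, score, n, seed=0):
    """Run n independent trajectories and keep the highest-scoring output.

    Hypothetical selection rule; each trajectory gets its own RNG, so the
    n runs are independent and could execute in parallel.
    """
    candidates = [run_trajectory(random.Random(seed + i)) for i in range(n)]
    return max(candidates, key=score)


wide = compute_profile(num_trajectories=20, depth=16)
deep = compute_profile(num_trajectories=1, depth=320)
```

Here `wide` and `deep` have equal `total_steps`, while `serial_steps` differs by a factor of 20.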

Multi-solution tasks: The mode collapse analysis for deterministic RRMs (Figure 4 right, Table 1) is a clear and important finding. GRAM's ability to maintain coverage across multi-solution landscapes addresses a real limitation of existing approaches.
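One way to quantify mode collapse on a multi-solution task is solution coverage: the fraction of all valid solutions that appear among a model's samples. The sketch below does this for N-Queens by brute-force enumeration, which is only practical for small boards. The metric definition and function names are assumptions for illustration and may differ from the measure used in the paper.

```python
from itertools import permutations


def is_valid_queens(cols):
    """cols[r] is the queen's column in row r; check columns and diagonals."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False
    return all(
        abs(cols[i] - cols[j]) != j - i
        for i in range(n)
        for j in range(i + 1, n)
    )


def all_solutions(n):
    """Enumerate every valid placement by brute force (small n only)."""
    return {p for p in permutations(range(n)) if is_valid_queens(p)}


def coverage(samples, n):
    """Fraction of distinct valid solutions that the samples contain."""
    found = {tuple(s) for s in samples if is_valid_queens(s)}
    return len(found) / len(all_solutions(n))


# A collapsed sampler that always returns the same 4-queens solution
# covers half of the solution set; a diverse one covers all of it.
collapsed = coverage([[1, 3, 0, 2]] * 5, n=4)
diverse = coverage([[1, 3, 0, 2], [2, 0, 3, 1]], n=4)
```

A deterministic model that is always correct still scores low on this metric, which is the distinction between accuracy and coverage that the finding above relies on.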

Unconditional generation: The Sudoku generation results (99.05% validity from empty boards) are impressive and demonstrate an interesting capability, though the binarized MNIST results are modest compared to existing generative models.
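A validity figure like this can be checked mechanically. The sketch below assumes that "validity" means a fully filled 9×9 grid in which every row, column, and 3×3 box contains the digits 1 through 9 exactly once; the paper's exact criterion is not stated here. `pattern_grid` is a hypothetical helper that builds one known-valid grid for the demonstration.

```python
def is_valid_sudoku(grid):
    """True if every row, column, and 3x3 box holds the digits 1..9 exactly once."""
    if len(grid) != 9 or any(len(row) != 9 for row in grid):
        return False
    digits = set(range(1, 10))
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(9)} for c in range(9)]
    boxes = [
        {grid[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(unit == digits for unit in rows + cols + boxes)


def validity_rate(grids):
    """Fraction of generated grids that are valid completed Sudokus."""
    return sum(is_valid_sudoku(g) for g in grids) / len(grids)


def pattern_grid():
    """Hypothetical helper: a known-valid grid built from a shifting pattern."""
    return [
        [(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)]
        for r in range(9)
    ]


good = pattern_grid()
bad = [row[:] for row in good]
bad[0][0] = bad[0][1]  # duplicate a digit in the first row
rate = validity_rate([good, bad])
```

Validity alone does not measure diversity, so a generator should also be checked for how many distinct grids it produces.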

Limitations on broader impact: The sequential nature of deep supervision and the small scale of experiments (10M parameters, simple benchmarks) pose significant barriers to scaling. The authors acknowledge this. Real-world impact depends on whether this approach can scale to problems beyond constraint satisfaction puzzles.

4. Timeliness & Relevance

The paper is well-timed. The recursive reasoning paradigm (HRM, TRM, Looped Transformers) is gaining traction as an alternative to scaling model size, and inference-time compute scaling is a hot topic. The idea that reasoning models should maintain uncertainty and explore multiple hypotheses resonates with current discussions about the limitations of deterministic chain-of-thought reasoning. The connection to the "consciousness prior" (Bengio, 2017) and System 2 reasoning adds conceptual depth.

5. Strengths & Limitations

Key Strengths:

Clean conceptual contribution with a well-motivated formulation

Comprehensive ablations that isolate the source of gains (variational framework, not mere randomness)

Width-based scaling is a practical and novel inference-time strategy for RRMs

Strong performance on multi-solution tasks where the advantage is clearly demonstrated

The latent trajectory visualization (Figures 18-19) effectively illustrates the exploration mechanism

Notable Limitations:

Scale is extremely limited (10M parameters, small benchmarks)—unclear if gains persist at larger scales

The truncated ELBO is a significant approximation with no theoretical guarantees

Task diversity is narrow: all benchmarks involve discrete constraint satisfaction or simple image generation

No comparison with other probabilistic reasoning approaches (e.g., particle-based methods, ensemble approaches)

The unconditional generation capability, while interesting, is not competitive with modern generative models on standard benchmarks

The interaction between augmentation and sampling (Figure 14) suggests diminishing returns when training data is sufficient, potentially limiting the approach's value in data-rich regimes

Training efficiency is acknowledged as a limitation but not quantified relative to baselines

Overall Assessment

GRAM makes a solid architectural contribution to the recursive reasoning paradigm by introducing principled stochasticity through variational inference. The multi-trajectory exploration capability and width-based scaling are genuinely useful ideas. However, the experimental scope is narrow, the scale is small, and the practical significance beyond structured puzzle-solving remains unclear. The paper is well-written and well-ablated, but the gap between the ambitious framing ("future neural reasoning systems") and the actual experimental evidence (small constraint satisfaction puzzles) tempers enthusiasm about broader impact.

Rating: 6.5/10

Significance 6.5 · Rigor 6.5 · Novelty 7 · Clarity 7.5

Generated May 21, 2026

Comparison History (19)

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it demonstrates a large-scale, first-of-its-kind evaluation on genuinely open math problems with concrete successes and active deployment in multiple research domains, indicating immediate real-world applicability and timeliness. Its methodology is rigorous due to formal verification in Lean and clear quantitative benchmarks (Erdős, OEIS). The breadth spans mathematics, computer science, and AI tooling for scientific discovery. Paper 1 is novel in probabilistic recursive reasoning and may influence ML architectures, but its current evidence is mainly on benchmark reasoning/constraint tasks, with less direct near-term transformative application than automated formal proof search solving open problems.

vs. Towards a General Intelligence and Interface for Wearable Health Data

gemini-3.1 · 5/22/2026

Paper 1 leverages an unprecedented dataset of 5 million participants to build a foundation model for wearable health. Its rigorous evaluation across 35 tasks, integration with LLM agents, and direct clinical validation demonstrate immense potential for real-world healthcare applications. While Paper 2 offers a valuable theoretical advancement in neural reasoning, Paper 1's scale, practical utility, and broad interdisciplinary impact give it a significantly higher potential for immediate and transformative scientific impact.

vs. Advancing Mathematics Research with AI-Driven Formal Proof Search

gemini-3.1 · 5/22/2026

Paper 2 demonstrates significant real-world impact by autonomously solving previously open mathematical problems (Erdős problems and OEIS conjectures) and being actively deployed across multiple research fields. While Paper 1 introduces a novel and rigorous theoretical framework for latent reasoning, Paper 2's concrete achievements in advancing actual mathematics research showcase a much broader and more immediate scientific breakthrough.

vs. Towards a General Intelligence and Interface for Wearable Health Data

claude-opus-4.6 · 5/22/2026

Paper 1 presents a foundation model for wearable health pretrained on unprecedented scale (1 trillion minutes, 5 million participants), demonstrating broad applicability across 35 health tasks with clinical validation. Its real-world impact potential in personalized healthcare, combined with novel integration of LLM agents and clinical evaluation, gives it exceptional breadth and immediate applicability. Paper 2 introduces an interesting theoretical framework for probabilistic recursive reasoning, but its impact is more narrowly scoped to the neural reasoning community and lacks the same scale of empirical validation and real-world deployment potential.

vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

gemini-3.1 · 5/21/2026

While Paper 1 offers a highly valuable and rigorously designed application for clinical cardiology, Paper 2 addresses a fundamental challenge in artificial intelligence: moving beyond autoregressive sequence generation to stochastic, multi-trajectory latent reasoning. This foundational methodological advancement in extended computation and inference-time scaling has the potential for broader impact across numerous domains and applications within AI.

vs. AgentCo-op: Retrieval-Based Synthesis of Interoperable Multi-Agent Workflows

gpt-5.2 · 5/21/2026

Paper 2 (GRAM) is more conceptually novel and broadly impactful: it generalizes recursive reasoning into a probabilistic latent-trajectory framework, enabling multi-hypothesis computation and inference-time scaling—ideas applicable across reasoning, planning, generation, and uncertainty modeling. Its methodological framing (latent-variable generative model + variational inference) is principled and extensible. Paper 1 is strong and timely for tool/agent orchestration in scientific workflows, but its impact may be narrower and more engineering/system-integration focused, with novelty largely in composition and repair mechanisms rather than a new foundational modeling paradigm.

vs. AI Co-Mathematician: Accelerating Mathematicians with Agentic AI

claude-opus-4.6 · 5/21/2026

The AI Co-Mathematician demonstrates immediate, tangible scientific impact by helping mathematicians solve open problems and achieving state-of-the-art results on FrontierMath (48% on Tier 4). It addresses a broad, practical need across all of mathematics with a deployed interactive system. While GRAM introduces a theoretically interesting framework for probabilistic recursive reasoning, it remains more preliminary, tested on structured reasoning benchmarks rather than demonstrating real-world breakthroughs. The Co-Mathematician's combination of practical utility, benchmark performance, and direct impact on mathematical discovery gives it higher near-term and broad scientific impact.

vs. Mahjax: A GPU-Accelerated Mahjong Simulator for Reinforcement Learning in JAX

gemini-3.1 · 5/21/2026

Paper 1 proposes a novel architectural framework for probabilistic recursive reasoning, addressing fundamental challenges in neural computation and inference-time scaling. Its implications span broad areas of AI, including LLMs and complex problem-solving. Paper 2 presents a valuable but niche GPU-accelerated Mahjong simulator for RL. The theoretical and methodological advancements in Paper 1 offer significantly wider applicability and higher potential impact across the machine learning community.

vs. \ECUAS{n}: A family of metrics for principled evaluation of uncertainty-augmented systems

gpt-5.2 · 5/21/2026

Paper 2 (GRAM) is more novel and broadly impactful: it reframes recursive reasoning as probabilistic multi-trajectory latent computation, enabling inference-time scaling, multi-hypothesis reasoning, and both conditional and unconditional generation. This direction is timely given interest in test-time compute and robust reasoning, and it can influence multiple areas (reasoning architectures, generative modeling, inference methods). Paper 1 offers a principled metric family for evaluating uncertainty-augmented systems, valuable for deployment and benchmarking, but it is narrower in scope and more incremental relative to existing proper scoring rule and selective prediction evaluation work.

vs. AutoRPA: Efficient GUI Automation through LLM-Driven Code Synthesis from Interactions

gemini-3.1 · 5/21/2026

Paper 2 proposes a foundational advancement in neural reasoning by introducing probabilistic multi-trajectory latent computation as an alternative to standard autoregressive models. This theoretical innovation has broad implications across all of machine learning and AI reasoning. In contrast, Paper 1 presents a highly practical but domain-specific engineering solution for GUI automation. Paper 2's core architectural contributions offer greater potential for widespread methodological impact.

vs. Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. This has broad implications for reasoning systems, generative modeling, and inference-time scaling. While Paper 2 (DBE) offers a useful methodological contribution to LLM evaluation by addressing benchmark saturation, it is more incremental and narrowly scoped to evaluation methodology. GRAM's novelty in unifying conditional reasoning and unconditional generation through variational inference has greater potential to influence future research directions across multiple subfields.

vs. From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

claude-opus-4.6 · 5/21/2026

Paper 1 introduces GRAM, a novel technical framework that combines recursive reasoning with generative probabilistic modeling, addressing fundamental limitations of deterministic recursive reasoning models. It offers concrete methodological contributions (stochastic latent trajectories, amortized variational inference) with demonstrated improvements on reasoning tasks and inference-time scaling. Paper 2 is a review/synthesis paper that organizes existing work into a unified framework for clinical trajectory modeling. While comprehensive and valuable, review papers typically have less direct scientific impact than papers introducing novel methods. GRAM's contributions to neural reasoning architectures have broad applicability across AI research.

vs. Latent Action Reparameterization for Efficient Agent Inference

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamentally novel framework that combines recursive reasoning with probabilistic generative modeling, addressing core limitations of deterministic recursive reasoning models. It contributes new theoretical foundations (latent-variable generative model for reasoning, variational inference training) and demonstrates capabilities across multiple paradigms (conditional reasoning, unconditional generation, inference-time scaling). Paper 2's LAR, while practically useful for LLM agent efficiency, is more incremental—applying learned action abstractions (a well-studied concept in RL/planning) to LLM agents. GRAM's broader theoretical contribution and potential to influence reasoning architectures gives it higher long-term impact.

vs. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

gpt-5.2 · 5/21/2026

Paper 2 has higher estimated scientific impact due to its strong real-world applicability and timeliness: it targets runtime safety/behavioral control for foundation models in high-stakes social domains, a pressing cross-disciplinary need. It introduces a systems/robotics-inspired framing (closed-loop trajectory constraints) and validates it in three concrete deployments, suggesting broader adoption potential across HCI, robotics, ML safety, and applied AI. Paper 1 is novel methodologically for probabilistic multi-trajectory recursive reasoning, but its immediate impact is more contained within ML reasoning/modeling benchmarks and may face adoption friction without clear downstream killer apps.

vs. Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

gpt-5.2 · 5/21/2026

Paper 2 (GRAM) introduces a broadly applicable modeling paradigm: probabilistic multi-trajectory recursive latent reasoning with variational training and inference-time scaling (depth and sampling). This is a clear algorithmic innovation with potential to influence reasoning architectures across NLP, vision, planning, and constraint satisfaction, and aligns with current interest in test-time compute and robust multi-hypothesis reasoning. Paper 1 is timely and valuable for safety in social deployments, but its impact may be more domain- and system-integration-specific, with weaker generality and fewer formal guarantees than the framing suggests.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamental new framework for neural reasoning by making recursive reasoning models probabilistic, enabling multi-trajectory computation with theoretical grounding in variational inference. This addresses core questions about how neural systems should implement extended computation, with broad applicability across reasoning, generation, and constraint satisfaction. While SceneCode is a strong engineering contribution for embodied AI scene synthesis, GRAM's conceptual innovation in combining recursive latent reasoning with generative modeling has broader potential impact across multiple fields of AI research, offering a new paradigm for inference-time scaling and probabilistic reasoning.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamental architectural innovation for neural reasoning—turning deterministic recursive reasoning into probabilistic multi-trajectory computation. This addresses a core challenge in AI (how to implement extended computation in neural systems) with broad theoretical and practical implications across reasoning, generation, and inference-time scaling. SceneCode is a strong engineering contribution for indoor scene synthesis but is more application-specific. GRAM's framework-level contribution to reasoning architectures has wider potential impact across multiple fields and research directions.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of both autoregressive and deterministic recursive models. Its contributions span generative modeling, variational inference, and reasoning—broad foundational areas with wide applicability. While ScenePilot makes a solid contribution to autonomous driving safety testing with practical value, its impact is more domain-specific. GRAM's theoretical novelty, breadth of impact across reasoning and generation tasks, and its potential to influence future neural architecture design give it higher estimated scientific impact.

vs. ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

claude-opus-4.6 · 5/21/2026

GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. Its breadth of impact spans reasoning, generation, and inference-time scaling—topics central to modern AI research. While ScenePilot makes a solid contribution to autonomous driving safety testing with its boundary-band scenario generation approach, it addresses a more domain-specific problem. GRAM's novelty in unifying generative modeling with recursive reasoning and its broader applicability to diverse reasoning tasks give it higher potential scientific impact.