Generative Recursive Reasoning

Junyeob Baek, Mingyu Jo, Minsu Kim, Mengye Ren, Yoshua Bengio, Sungjin Ahn

May 19, 2026

arXiv:2605.19376v1

Frozen v1: this version was superseded on arXiv by v2. Stats below reflect the state at freeze time and will not change.

#83 of 2292 · Artificial Intelligence

Tournament Score

1548 ± 43 (scale: 1050 to 1800)

90%

Win Rate

Wins

Losses

Matches

Rating

6.8 / 10

Significance: 7

Rigor: 7

Novelty: 7

Clarity: 8

Abstract

How should future neural reasoning systems implement extended computation? Recursive Reasoning Models (RRMs) offer a promising alternative to autoregressive sequence extension by performing iterative latent-state refinement with shared transition functions. Yet existing RRMs are largely deterministic, following a single latent trajectory and converging to a single prediction. We introduce Generative Recursive reAsoning Models (GRAM), a framework that turns recursive latent reasoning into probabilistic multi-trajectory computation. GRAM models reasoning as a stochastic latent trajectory, enabling multiple hypotheses, alternative solution strategies, and inference-time scaling through both recursive depth and parallel trajectory sampling. This yields a latent-variable generative model supporting conditional reasoning via $p_\theta(y \mid x)$ and, with fixed or absent inputs, unconditional generation via $p_\theta(x)$. Trained with amortized variational inference, GRAM improves over deterministic recurrent and recursive baselines on structured reasoning and multi-solution constraint satisfaction tasks, while demonstrating an unconditional generation capability. https://ahn-ml.github.io/gram-website/

AI Impact Assessments

(1 model)

Scientific Impact Assessment: Generative Recursive Reasoning (GRAM)

1. Core Contribution

GRAM introduces probabilistic multi-trajectory computation into the Recursive Reasoning Model (RRM) paradigm. While existing RRMs (HRM, TRM, Looped Transformers) perform deterministic iterative latent-state refinement—following a single trajectory per input—GRAM injects learned stochastic perturbations at each recursion step, modeling reasoning as a distribution over latent trajectories. This is formalized as a latent-variable generative model trained via amortized variational inference (ELBO optimization), where a prior generates trajectories at inference and a posterior guides training. The framework supports both conditional reasoning p(y|x) and unconditional generation p(x), and introduces "width-based" inference-time scaling through parallel trajectory sampling.
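To make the prior/posterior training signal concrete, here is a minimal sketch of the per-step KL term that an ELBO of this kind typically contains. It assumes diagonal Gaussian prior and posterior transitions; the function names and the `beta` weight are hypothetical, and this is not the paper's exact objective.

```python
import math

def gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

def negative_elbo(recon_nll, kl_terms, beta=1.0):
    # recon_nll: negative log-likelihood of the target under the decoder.
    # kl_terms: one KL(posterior || prior) value per recursion step.
    # beta: assumed weighting knob, not taken from the source.
    return recon_nll + beta * sum(kl_terms)

# Identical posterior and prior give zero KL; a unit mean shift gives 0.5.
kl_same = gaussian_kl([0.0], [1.0], [0.0], [1.0])
kl_shift = gaussian_kl([1.0], [1.0], [0.0], [1.0])
loss = negative_elbo(recon_nll=2.0, kl_terms=[kl_same, kl_shift])
```

The KL terms pull the posterior (which sees the target during training) toward the prior (which must generate trajectories alone at inference), which is the role the assessment attributes to the two distributions.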

The key technical mechanism is a learnable stochastic guidance signal: after a deterministic update u_t, a state-dependent Gaussian perturbation ε_t is sampled and added. Both mean and variance are learned, enabling the model to adaptively steer exploration. A hierarchical instantiation separates high-level (slow, stochastic) and low-level (fast, deterministic) latent dynamics.
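The update described above can be sketched in a few lines. This is an illustration only: `deterministic_update` and `guidance_params` are hypothetical stand-ins for the learned transition and the learned mean/variance heads, and the hierarchical high-level/low-level split is omitted.

```python
import math
import random

def deterministic_update(z, x):
    # Hypothetical stand-in for the shared transition function:
    # each latent coordinate moves halfway toward the input.
    return [0.5 * zi + 0.5 * xi for zi, xi in zip(z, x)]

def guidance_params(u):
    # Hypothetical stand-in for the learned heads that map the updated
    # state to a per-coordinate mean and standard deviation.
    mean = [0.1 * math.tanh(ui) for ui in u]
    std = [0.05 + 0.05 / (1.0 + abs(ui)) for ui in u]
    return mean, std

def stochastic_step(z, x, rng):
    u = deterministic_update(z, x)                       # deterministic update u_t
    mean, std = guidance_params(u)                       # state-dependent parameters
    eps = [rng.gauss(m, s) for m, s in zip(mean, std)]   # perturbation eps_t
    return [ui + ei for ui, ei in zip(u, eps)]           # perturbed next state

def rollout(x, steps, seed):
    rng = random.Random(seed)
    z = [0.0] * len(x)
    for _ in range(steps):
        z = stochastic_step(z, x, rng)
    return z

# Two seeds give two different latent trajectories for the same input.
z_a = rollout([1.0, -2.0, 0.5], steps=16, seed=0)
z_b = rollout([1.0, -2.0, 0.5], steps=16, seed=1)
```

Because the noise parameters depend on the current state, the perturbation can in principle shrink where the model is confident and grow where exploration helps, which is what distinguishes it from a fixed noise schedule.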

2. Methodological Rigor

The formulation is mathematically clean. The latent-variable model with Markov transitions, the ELBO derivation, and the variational posterior structure are well-specified. However, several methodological concerns warrant attention:

Truncated gradient approximation: The practical training objective (Eq. 14) only propagates gradients through the final transition of each supervision step, making it a biased surrogate for the full ELBO. While the authors provide empirical validation showing both the surrogate and full ELBO decrease during training (Appendix A.3), the theoretical implications of this bias remain unanalyzed. The precedent from Dreamer-family models is cited but the approximation quality may differ substantially in the reasoning setting.

Experimental scope: Evaluations are conducted on controlled benchmarks (Sudoku-Extreme, ARC-AGI, N-Queens, Graph Coloring, binarized MNIST) rather than large-scale or naturalistic reasoning tasks. While appropriate for probing architectural properties, this limits conclusions about generalizability. The comparison against LLMs is rightly positioned as "reference points" rather than controlled baselines, but this also means we don't know how GRAM performs on reasoning tasks where LLMs excel.

Ablation quality: The ablations are thorough and informative. Table 3 demonstrates that stochastic guidance provides consistent gains across architectures, and that naive stochasticity (random initialization, stochastic decoding) applied to TRM does not help—strengthening the claim that the variational framework itself is essential. The decomposition of stochasticity vs. guidance components reveals task-dependent behavior (stochasticity alone works for Sudoku but fails on N-Queens), adding nuance.

Statistical reporting: Results include means and standard deviations across runs, which is good practice. However, the number of runs is not always clearly stated.

3. Potential Impact

Multi-solution reasoning: The most compelling contribution is demonstrating that deterministic RRMs structurally cannot handle multi-solution problems (Table 1, Figure 4 right). GRAM's ability to maintain solution diversity while achieving high constraint satisfaction (99.7% accuracy on 8×8 N-Queens vs. 96.3% for AR) addresses a genuine capability gap.
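As an illustration of what accuracy and diversity mean on a multi-solution task, the sketch below checks N-Queens placements and counts distinct valid ones among a batch of samples. The `summarize` helper and its two metrics are assumptions for illustration; the paper's exact evaluation protocol is not reproduced here.

```python
from itertools import permutations

def is_valid_queens(cols):
    """cols[r] is the column of the queen placed in row r."""
    n = len(cols)
    if sorted(cols) != list(range(n)):
        return False  # a column is reused or out of range
    rising = {r + c for r, c in enumerate(cols)}
    falling = {r - c for r, c in enumerate(cols)}
    return len(rising) == n and len(falling) == n  # no shared diagonals

def summarize(samples):
    # Returns (fraction of valid samples, number of distinct valid solutions).
    valid = [tuple(s) for s in samples if is_valid_queens(s)]
    return len(valid) / len(samples), len(set(valid))

# The 8x8 board admits 92 solutions, so a single-output solver can
# recover at most one of them per input.
total_solutions = sum(is_valid_queens(p) for p in permutations(range(8)))

samples = [
    [0, 4, 7, 5, 2, 6, 1, 3],
    [0, 4, 7, 5, 2, 6, 1, 3],   # repeated solution
    [1, 3, 5, 7, 2, 0, 6, 4],
    [0, 1, 2, 3, 4, 5, 6, 7],   # all queens on one diagonal
]
accuracy, distinct = summarize(samples)
```

A deterministic model would score well on the first metric and stay at one on the second, which is the gap the assessment highlights.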

Inference-time scaling: The width-based scaling axis is practically significant. GRAM with N=20 samples at 16 iterations outperforms TRM at 320 iterations (97.0% vs. 90.5% on Sudoku), bypassing sequential latency bottlenecks. The LPRM mechanism for trajectory selection is a pragmatic addition.
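Width-based scaling can be sketched as best-of-N selection over independently sampled trajectories. Everything here is a toy under stated assumptions: `sample_trajectory` is a hypothetical noisy solver, and `toy_score` stands in for a learned selector such as the LPRM, whose actual form is not described in this assessment.

```python
import random

def sample_trajectory(x, steps, rng):
    # Hypothetical stochastic refinement: move halfway toward x, plus noise.
    z = 0.0
    for _ in range(steps):
        z = z + 0.5 * (x - z) + rng.gauss(0.0, 0.1)
    return z

def toy_score(z, target=3.0):
    # Stand-in for a learned trajectory scorer; a real selector would not
    # have access to the target.
    return -abs(z - target)

def best_of_n(x, n, steps, score, seed=0):
    rng = random.Random(seed)
    candidates = [sample_trajectory(x, steps, rng) for _ in range(n)]
    return max(candidates, key=score)

def compute_budget(n, steps):
    # Total transition evaluations vs. steps that must run one after another.
    return {"total_steps": n * steps, "sequential_depth": steps}

single = best_of_n(3.0, n=1, steps=16, score=toy_score)
wide = best_of_n(3.0, n=20, steps=16, score=toy_score)
budget = compute_budget(n=20, steps=16)
```

If cost is counted in transition evaluations, 20 samples at 16 iterations amount to 320 evaluations in total but only 16 sequential steps, since the samples can run in parallel. Whether the source intends the TRM comparison as matched compute is not stated here.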

Unconditional generation: Demonstrating that the same recursive framework can generate valid Sudoku boards (99.05% validity with 10.9M parameters vs. 91.33% for D3PM with 55.1M) is a novel finding. The monotonic improvement with additional inference steps beyond training-time depth is particularly interesting for the generative modeling community.
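A validity figure like the one quoted above is presumably the fraction of generated boards that satisfy all Sudoku constraints. The following is a minimal sketch of such a check; the helper names are hypothetical and the paper's exact metric definition is an assumption.

```python
def is_valid_sudoku(board):
    """True if a 9x9 board has digits 1..9 exactly once per row, column, box."""
    if len(board) != 9 or any(len(row) != 9 for row in board):
        return False
    target = set(range(1, 10))
    rows = [set(row) for row in board]
    cols = [set(col) for col in zip(*board)]
    boxes = [
        {board[r][c] for r in range(br, br + 3) for c in range(bc, bc + 3)}
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    ]
    return all(group == target for group in rows + cols + boxes)

def validity_rate(boards):
    return sum(is_valid_sudoku(b) for b in boards) / len(boards)

# A known valid board built from a shifted-row pattern.
good = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]

# Corrupt one cell to create a duplicate in its row, column, and box.
bad = [row[:] for row in good]
bad[0][0] = bad[0][1]

rate = validity_rate([good, bad])
```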

Broader influence: The paper could influence the design of future reasoning architectures by establishing that stochasticity in latent computation is not just noise but a structured exploration mechanism. The connection between probabilistic state-space models (from RL/video prediction) and reasoning architectures is intellectually productive.

4. Timeliness & Relevance

The paper addresses a timely question: how to scale reasoning beyond autoregressive token generation. With the rapid development of inference-time compute strategies (test-time scaling, best-of-N, process reward models), GRAM offers a principled probabilistic foundation for these ideas within compact recurrent architectures. The demonstrated complementarity between depth and width scaling aligns with current industry interest in efficient inference.

However, the acknowledged limitation—sequential deep supervision limiting training efficiency compared to Transformers—is a significant barrier to practical adoption at scale. This constrains near-term impact on foundation model development.

5. Strengths & Limitations

Strengths:

Clean conceptual framework connecting recursive reasoning, probabilistic inference, and generative modeling

Stochastic guidance as a general-purpose architectural extension (consistent gains across all backbone variants)

Compelling demonstration that deterministic recursion fundamentally fails on multi-solution problems

Width-based scaling provides practical efficiency gains over depth-only approaches

Strong ablations distinguishing the variational framework from naive randomness

Limitations:

Evaluation limited to controlled/synthetic benchmarks; unclear how benefits transfer to natural language reasoning or larger-scale problems

Training efficiency concerns (deep supervision, sequential computation) limit scalability

The truncated ELBO approximation introduces bias whose impact is characterized only empirically

The LPRM for trajectory selection adds complexity and assumes access to accuracy signals during training

Comparison scope is narrow—missing comparisons with other probabilistic/ensemble reasoning approaches (e.g., mixture-of-experts, dropout-based uncertainty)

Binarized MNIST is an extremely limited test of generative capability; richer image or text generation experiments would strengthen claims

6. Overall Assessment

GRAM makes a well-motivated and technically sound contribution to the recursive reasoning literature. The core insight—that stochastic latent transitions enable multi-hypothesis reasoning and width-based scaling—is validated through careful experiments and ablations. The paper is clearly written with good reproducibility details. However, the impact is bounded by the narrow experimental scope and scalability limitations. The contribution is best understood as establishing a design principle for future architectures rather than providing an immediately deployable system.

Rating: 6.8 / 10

Significance 7 · Rigor 7 · Novelty 7 · Clarity 8

Generated May 20, 2026

Comparison History (31)

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gpt-5.2 · 5/22/2026

Paper 2 likely has higher impact: it introduces a broadly applicable modeling framework (GRAM) that generalizes recursive reasoning to probabilistic multi-trajectory computation, enabling inference-time scaling, multimodal hypothesis exploration, and both conditional and unconditional generation. This is a novel architectural/training contribution with potential to influence reasoning, planning, generative modeling, and efficient test-time compute across many domains. Paper 1 is timely and valuable for evaluation and safety in forecasting, but it is more domain-specific (forecasting/tail risk) and primarily diagnostic/recommendational rather than proposing a new general modeling paradigm.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

gpt-5.2 · 5/22/2026

Paper 2 (GRAM) is more novel and broadly impactful: it generalizes recursive reasoning by making latent trajectories stochastic, enabling multi-hypothesis computation, inference-time scaling via depth and parallel sampling, and both conditional reasoning and unconditional generation. This targets a central, timely problem in ML (extended computation and reasoning) with potential applicability across domains (planning, constraint solving, language, vision) rather than a single scientific vertical. Paper 1 is strong and application-relevant for chemistry, but its impact is likely narrower and more engineering/modular-integration focused.

vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

gemini-3.1 · 5/22/2026

Paper 2 introduces a fundamental architectural innovation in neural reasoning by enabling probabilistic multi-trajectory computation in latent space. This shift from deterministic, autoregressive models to stochastic recursive models has profound implications for the future design of general AI reasoning systems. While Paper 1 provides valuable empirical insights into current LLM limitations in specific forecasting scenarios, Paper 2 offers a broader, foundational methodology that could widely influence how extended computation and constraint satisfaction are approached across the field of machine learning.

vs. SciCore-Mol: Augmenting Large Language Models with Pluggable Molecular Cognition Modules

claude-opus-4.6 · 5/22/2026

Paper 1 introduces a fundamentally new computational framework (GRAM) that generalizes recursive reasoning to probabilistic multi-trajectory computation, with broad implications across AI reasoning, generation, and inference-time scaling. Its theoretical contribution—unifying conditional reasoning and unconditional generation in a latent-variable framework with variational training—addresses a core question in neural computation. Paper 2, while practically valuable for molecular AI, is more application-specific and follows an established pattern of augmenting LLMs with domain-specific modules. Paper 1's foundational nature gives it broader potential impact across multiple fields of AI research.

vs. Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

gemini-3.1 · 5/20/2026

Paper 1 offers a profound theoretical breakthrough by providing the first finite-sample guarantee for neural Q-learning under decentralized partial observability. By formalizing multi-agent LLM pipelines across trust/vendor boundaries as IC-SMDPs, it addresses a highly relevant, real-world bottleneck in compound AI systems. While Paper 2 tackles the important trend of inference-time scaling, Paper 1's combination of rigorous mathematical bounds, novel extension of the approximate information state framework, and diverse empirical validation (from synthetic to multi-LLM reasoning) suggests a deeper and more foundational scientific impact on multi-agent reinforcement learning.

vs. Learn-by-Wire Training Control Governance: Bounded Autonomous Training Under Stress for Stability and Efficiency

gemini-3.1 · 5/20/2026

Paper 2 proposes a fundamental architectural shift in neural reasoning, moving beyond autoregressive generation to probabilistic, multi-trajectory latent recursive computation. This addresses a critical frontier in AI: scaling inference-time compute and System 2 reasoning. While Paper 1 offers highly practical engineering improvements for LLM training stability, Paper 2's theoretical novelty and potential to influence next-generation reasoning architectures give it broader, paradigm-shifting scientific impact across the field of machine learning.

vs. Attributing Emergence in Million-Agent Systems

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental methodological gap in multi-agent systems research by providing scalable attribution methods for million-agent LLM simulations, backed by a formal impossibility theorem (Attribution Scaling Bias) and validated on real-world Bluesky data. It bridges computational social science, game theory, and LLM-based simulation with broad interdisciplinary impact. Paper 1 contributes a solid incremental advance in latent reasoning models, but Paper 2's combination of theoretical contribution, practical scalability (4-5 orders of magnitude speedup), and the demonstration that small-scale studies are fundamentally inadequate for nonlinear indicators has broader and more transformative implications.

vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) is likely higher impact due to greater conceptual novelty and breadth: it reframes recursive/iterative reasoning as a probabilistic latent-trajectory generative model, enabling multi-hypothesis computation, inference-time scaling via depth and sampling, and both conditional and unconditional generation. This is a broadly applicable modeling paradigm relevant to reasoning, generative modeling, and scalable inference across domains. Paper 1 is rigorous and practically useful for mitigating a specific MLLM hallucination mode, but its contribution is more targeted (mechanistic diagnosis + intervention on attention heads) and may generalize less widely.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

claude-opus-4.6 · 5/20/2026

Paper 2 (GRAM) introduces a novel framework for probabilistic recursive reasoning that addresses fundamental limitations of deterministic recursive reasoning models. It offers broad applicability across reasoning and generation tasks, with a principled probabilistic formulation enabling multi-trajectory computation and inference-time scaling. Paper 1, while providing useful empirical insights about when agent skills help in cybersecurity, is a negative result with limited statistical significance (p=0.71) in a narrow domain. GRAM's methodological contribution to neural reasoning architectures has substantially broader potential impact across AI research.

vs. Probabilistic Tiny Recursive Model

gemini-3.1 · 5/20/2026

Paper 1 introduces a fundamental methodological innovation by formalizing recursive reasoning as a probabilistic generative model trained via amortized variational inference. This theoretical framework provides a principled foundation for latent trajectory generation. In contrast, Paper 2 proposes a highly effective but narrower test-time heuristic (noise injection) for existing models. Thus, Paper 1 offers broader theoretical applicability and deeper structural innovation.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gemini-3.1 · 5/20/2026

Paper 2 introduces a novel, foundational framework for neural reasoning (GRAM) that addresses fundamental limitations in current models by enabling probabilistic multi-trajectory computation. This approach has broad applicability across AI, generative modeling, and reasoning tasks. In contrast, Paper 1 is a highly specific case study on a single math problem using a specific API, which limits its impact to the specialized niche of AI-assisted formal theorem proving.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

claude-opus-4.6 · 5/20/2026

Paper 2 introduces a fundamentally new framework (GRAM) for neural reasoning that combines recursive latent reasoning with probabilistic multi-trajectory computation. This addresses a core challenge in AI—how to implement extended computation in neural systems—with broad implications across reasoning, generation, and inference-time scaling. Its novelty (probabilistic recursive reasoning), methodological depth (variational inference framework), and breadth of applicability across multiple AI domains give it significantly higher potential impact than Paper 1, which offers an incremental improvement (2.1% AUC) on a specific deepfake detection benchmark using emotion cues.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

claude-opus-4.6 · 5/20/2026

GRAM introduces a fundamentally new framework for neural reasoning by combining recursive latent-state refinement with probabilistic multi-trajectory computation, addressing core limitations of deterministic recursive reasoning models. This has broad implications for reasoning architectures, inference-time scaling, and generative modeling. While POLAR-Bench addresses the important and timely topic of privacy-utility trade-offs in LLM agents, it is primarily a benchmark contribution with diagnostic findings rather than a new methodology. GRAM's theoretical novelty, methodological depth, and potential to influence future reasoning system design give it higher long-term scientific impact.

vs. SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) introduces a broadly applicable probabilistic extension to recursive reasoning models, enabling multi-trajectory latent computation, hypothesis diversity, and inference-time scaling—ideas likely to transfer across many ML domains (reasoning, planning, constraint solving, and generative modeling). Its methodological framing (latent-variable model with variational inference) is general and aligns with current interest in scalable test-time compute and structured reasoning. Paper 1 is innovative and impactful for embodied AI/robotics simulation, but its impact is more domain-specific (indoor scene/asset generation toolchain) and depends on ecosystem adoption.

vs. Latent Action Reparameterization for Efficient Agent Inference

gemini-3.1 · 5/20/2026

Paper 2 introduces a fundamental architectural shift by enabling probabilistic, multi-trajectory recursive latent reasoning, offering a significant departure from standard autoregressive models. This has broad implications for the future of neural reasoning systems and generative models. While Paper 1 offers valuable efficiency improvements for LLM agents, Paper 2's theoretical depth and potential to redefine extended computation in deep learning give it a higher potential for foundational scientific impact across AI.

vs. MOCHA: Multi-Objective Chebyshev Annealing for Agent Skill Optimization

gpt-5.2 · 5/20/2026

Paper 2 (GRAM) introduces a broadly applicable probabilistic framework for recursive latent-state reasoning with multi-trajectory computation, enabling hypothesis diversity, inference-time scaling, and both conditional reasoning and unconditional generation. This is novel at the modeling level, potentially impacting core sequence modeling, reasoning, and generative modeling across domains, and aligns with timely interest in test-time compute and structured reasoning. Paper 1 is a strong, practical contribution for LLM-agent prompt/skill optimization, but is more domain-specific and likely narrower in cross-field impact despite solid empirical gains.

vs. Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On

gpt-5.2 · 5/20/2026

Paper 2 proposes a concrete, novel modeling framework (stochastic multi-trajectory recursive latent reasoning) with a clear training method (amortized variational inference) and demonstrated empirical gains plus new capabilities (inference-time scaling, multi-solution reasoning, unconditional generation). This combination of methodological innovation, rigor, and broad applicability across reasoning, generative modeling, and scaling makes it more likely to influence subsequent research. Paper 1 is a vision/conceptual framework for trustworthy agent networks—timely and important, but likely less impactful scientifically without technical mechanisms, formalization, or empirical validation.

vs. Neurosymbolic Learning for Inference-Time Argumentation

gemini-3.1 · 5/20/2026

Paper 2 introduces a fundamental architectural advancement in neural reasoning through probabilistic multi-trajectory latent computation. This addresses a critical, highly active area in AI (inference-time scaling and extended computation), offering broad applicability across various domains. While Paper 1 provides a valuable neurosymbolic approach for claim verification, its scope is more specialized compared to the foundational innovations proposed in Paper 2.

vs. GeoX: Mastering Geospatial Reasoning Through Self-Play and Verifiable Rewards

gemini-3.1 · 5/20/2026

Paper 1 introduces a fundamental methodological advancement in neural reasoning by enabling probabilistic, multi-trajectory latent search (GRAM), addressing a critical bottleneck in AI inference scaling. Its broad applicability to general reasoning tasks gives it widespread relevance across the entire machine learning community. Paper 2 presents a strong application of self-play and verifiable rewards, but its focus is restricted to the specific domain of geospatial vision. Consequently, Paper 1 has a significantly higher potential for broad scientific impact and foundational innovation.

vs. Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

gpt-5.2 · 5/20/2026

Paper 1 introduces a novel probabilistic extension of recursive reasoning models (multi-trajectory stochastic latent refinement with variational training) that can scale inference via depth and sampling, spanning conditional reasoning and unconditional generation. This is a broadly applicable modeling contribution with potential impact across ML reasoning, planning, and generative modeling, and is timely given interest in test-time compute and alternatives to autoregression. Paper 2 offers an important evaluation/validity criterion for LLM-based assessment, but its scope is more domain-specific and primarily diagnostic rather than a general-purpose methodological advance.