Learning to Reason Efficiently with A* Post-Training

Andreas Opedal, Francesco Ignazio Re, Abulhair Saparov, Mrinmaya Sachan, Bernhard Schölkopf, Ryan Cotterell

May 23, 2026

arXiv:2605.24597v1 PDF

cs.AI(primary)cs.CLcs.LG

#400of 2682·Artificial Intelligence

#400 of 2682 · Artificial Intelligence

Tournament Score

1493±43

10501800

68%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance6.5

Rigor7.5

Novelty6

Clarity8

Tournament Score

1493±43

10501800

68%

Win Rate

Wins

Losses

Matches

Rating

6.5/ 10

Significance

Rigor

Novelty

Clarity

Abstract

Many applications of large language models (LLMs) require deductive reasoning, yet models frequently produce incorrect or redundant inference steps. We frame natural language inference as a search problem where the final answer is the valid proof itself, requiring a reasoning procedure in which intermediate inferences are correct. Specifically, we investigate whether LLMs can learn to generate correct and efficient proofs with guidance from A* search -- an algorithm that guarantees an optimally efficient path to a goal. We explore two training techniques: supervised fine-tuning on execution traces from A* and reinforcement learning with A*-informed process reward models. Empirically, we find that Llama-3.2 models in the 1B--3B range benefit substantially from A* post training, going from near-zero accuracy to outperforming DeepSeek-V3.2 -- a much larger model. Our analysis uncovers a trade-off: while simple correctness rewards maximize accuracy, A*-informed signals strike a balance between accuracy and efficiency. Furthermore, we find that on larger search spaces, models trained with imperfect heuristics exhibit superior accuracy. Our results demonstrate a promising direction towards reasoning guided by principles derived from classical search algorithms.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: "Learning to Reason Efficiently with A* Post-Training"

1. Core Contribution

This paper bridges classical AI search algorithms (specifically A* search) with modern LLM post-training for deductive reasoning in natural language. The key novelty is twofold: (1) extending A* search to hypergraphs for natural language deductive reasoning, where inference rules may have multiple premises, and (2) deriving training signals from A* execution traces for both supervised fine-tuning (SFT) and reinforcement learning via A*-informed process reward models (PRMs). The paper frames natural language inference as requiring the *entire proof* to be correct—not just the final answer—which is a meaningful departure from the dominant paradigm of outcome-based evaluation.

The A*-informed PRM (Eq. 7) is an elegant design choice: it normalizes raw priority scores via exponential decay, sets the location parameter adaptively to the A* trace's own score (ensuring a reward of 1 for perfect alignment), and gracefully handles the cost-minimization-to-reward-maximization inversion. The dependency heuristic—a relaxation based on unary projections of rule dependencies—provides a practical, consistent heuristic that doesn't require solving the full problem.

2. Methodological Rigor

The formal framework is carefully constructed. The connection between logic programs (Datalog), hypergraph search, and A* is well-established through proper definitions, and the extension to hypergraphs follows Felzenszwalb & McAllester (2007) rigorously. The formalization of search traces as quotient sets under string homomorphisms (§A.1) is technically precise and clearly distinguishes this work from Lehnert et al. (2024).

The experimental design is generally sound: two datasets (ProofWriter and DeepRD) with different search space sizes, three search heuristics, four reward models, and two model sizes. Evaluation uses both accuracy (full proof correctness) and efficiency (pushes and pops relative to shortest proof). The inclusion of confidence intervals (Wilson score CIs for accuracy, bootstrap CIs for efficiency) is appropriate.

However, several methodological limitations deserve note. First, the evaluation is limited to relatively small models (1B-3B parameters), and the comparison to DeepSeek-V3.2 uses in-context learning rather than fine-tuning, making it somewhat asymmetric. Second, the datasets are synthetic/semi-synthetic, and generalization to more naturalistic reasoning tasks remains untested. Third, the RL experiments train for only one epoch, and the sensitivity to hyperparameters (KL temperature, group size) is not explored. Fourth, the paper acknowledges but does not deeply investigate why RL fails when starting from a weak SFT checkpoint (Fig. 9), which limits the practical recipe's generality.

3. Potential Impact

Direct impact on LLM reasoning training: The paper provides a concrete framework for incorporating search-algorithmic principles into LLM training. The accuracy-efficiency Pareto front analysis (Figs. 3-4) is practically useful for practitioners choosing reward models based on their priorities.

Broader methodological impact: The "learning to search" paradigm applied to natural language reasoning could extend beyond deductive logic to other structured reasoning tasks (e.g., planning, constraint satisfaction). The A*-informed PRM design could inspire similar reward models based on other classical algorithms.

Limitations on impact: The restriction to Datalog-style reasoning (polynomial-time decidable) limits immediate applicability to harder reasoning domains. The requirement for ground-truth access to the search algorithm at training time (for computing priority scores) means the approach doesn't easily transfer to domains where no symbolic solver exists. The improvement from near-zero to ~93% accuracy, while dramatic, largely reflects that small models cannot perform the task at all without task-specific training, rather than demonstrating that A* signals are uniquely powerful.

4. Timeliness & Relevance

This paper is well-timed given the current intense interest in reasoning models (DeepSeek-R1, OpenAI o-series) and the growing recognition that outcome-based RLVR is insufficient for faithful reasoning. The "overthinking" problem is receiving significant attention, and this paper offers a principled approach grounded in decades of AI search theory. The connection between process rewards and search heuristics is a natural and underexplored direction.

5. Strengths & Limitations

Key Strengths:

*Principled framework*: The formalization connecting logic programs, hypergraph search, A*, and natural language verbalization is comprehensive and could serve as a foundation for future work.

*Nuanced empirical findings*: The discovery that imperfect heuristics outperform perfect ones on larger search spaces (DeepRD) is non-obvious and practically important—it echoes the classical insight that learning from suboptimal demonstrations can improve robustness.

*Pareto front analysis*: Showing that no single reward model dominates across accuracy and efficiency provides genuine insight into reward design trade-offs.

*Rigorous evaluation*: Evaluating full proof correctness rather than final-answer accuracy sets a higher and more meaningful bar.

Key Limitations:

*Scale and generalization*: Only 1B-3B models tested; synthetic datasets only; unclear how findings transfer to more complex or naturalistic reasoning.

*Asymmetric baseline comparison*: Comparing fine-tuned 3B models to ICL with DeepSeek-V3.2 conflates model capacity with training paradigm.

*Limited RL exploration*: Single-epoch training, one RL algorithm (GRPO), and the failure mode when starting from weak checkpoints suggest the approach may be brittle.

*Practical applicability*: Requires a symbolic implementation of the reasoning task to generate A* traces and compute rewards, limiting deployment to domains with formal specifications.

*Incremental over Lehnert et al. (2024)*: While the extension to hypergraphs and natural language is non-trivial, the core idea of training on A* traces is not new.

The paper makes a solid contribution to understanding how classical search principles can improve LLM reasoning, with careful formalism and informative experiments. Its primary limitation is the gap between the controlled synthetic setting and the messy reality of real-world reasoning tasks.

Rating:6.5/ 10

Significance 6.5Rigor 7.5Novelty 6Clarity 8

Generated May 26, 2026

Comparison History (25)

vs. Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

gpt-5.25/27/2026

Paper 2 has higher likely impact because it defines a timely, under-addressed deployment problem (longitudinal agent reliability), proposes a concrete benchmark (AgingBench) with mechanistic taxonomy and diagnostics, and evaluates broadly across models, scenarios, memory policies, and agent regimes—supporting methodological rigor and generality. Its contributions apply across many deployed agent systems (memory, retrieval, maintenance), influencing evaluation standards and engineering practices beyond any single reasoning task. Paper 1 is novel in combining A* with post-training for proofs, but its scope is narrower (NLI/proof-style reasoning) and may be more sensitive to task/setup specifics.

vs. The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

claude-opus-4.65/27/2026

MiniMax-M2 presents a complete frontier-scale MoE language model system with novel agent-driven training pipelines, a scalable RL system (Forge), and early self-evolution capabilities. Its breadth of impact spans agentic coding, reasoning, and real-world deployment at scale, representing a significant engineering and scientific contribution. Paper 2, while intellectually interesting in applying A* search principles to LLM reasoning training, is narrower in scope (1B-3B models, formal proof generation) and represents an incremental advance in the reasoning-training space. The scale, novelty of self-evolution, and practical deployment implications give Paper 1 higher impact potential.

vs. Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

claude-opus-4.65/27/2026

Paper 2 identifies a fundamental evaluation flaw ('composition collapse') affecting how the entire field assesses multi-hop reasoning in LLMs, introducing a principled diagnostic protocol applicable across models and benchmarks. Its insight that aggregate metrics mask critical composition failures challenges widespread evaluation practices, with broad methodological implications. Paper 1, while solid, addresses a more specific application (proof generation via A* search guidance) with narrower scope. Paper 2's contribution to evaluation methodology and its revelation of hidden failure modes will likely influence more research directions and prompt reevaluation of existing claims.

vs. SIA: Self Improving AI with Harness & Weight Updates

gpt-5.25/27/2026

Paper 2 is likely higher impact: it introduces a clear, principled bridge between classical optimal search (A*) and LLM reasoning, with broadly applicable training signals (trace SFT + A*-informed RL) that target both correctness and efficiency. The result—small models rivaling much larger ones—has strong practical implications for cost-effective reasoning systems and is highly timely. Methodologically, the framing is crisp and analyzable, with interpretable trade-offs and insights about imperfect heuristics. Paper 1 is ambitious and cross-domain, but “self-improving” harness+weights loops raise reproducibility/safety concerns and may be harder to generalize rigorously.

vs. CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

gemini-3.15/26/2026

Paper 2 proposes a fundamental advancement in LLM reasoning by integrating A* search for post-training, an extremely timely and critical area of AI research. Demonstrating that 1B-3B models can outperform much larger models like DeepSeek-V3.2 via A*-informed process rewards offers massive implications for efficient, deductive AI reasoning. While Paper 1 introduces a valuable benchmark for causal discovery, Paper 2 provides a broadly applicable methodological breakthrough with wider potential impact across all LLM applications requiring logical deduction.

vs. Uncertainty Decomposition via Cyclical SG-MCMC and Soft-label Learning for Subjective NLP

claude-opus-4.65/26/2026

Paper 1 introduces a novel framework connecting classical A* search with LLM reasoning via post-training, demonstrating that small models (1-3B) can outperform much larger models like DeepSeek-V3.2. This addresses a critical challenge in LLM reasoning with broad applicability across many tasks. The approach is highly timely given the focus on reasoning efficiency, and the integration of search algorithms with RL-based training opens a promising research direction. Paper 2 makes a solid but more incremental contribution in a narrower subfield (subjective NLP/uncertainty quantification), combining existing techniques (cSG-MCMC, soft labels) rather than proposing a fundamentally new paradigm.

vs. CODESKILL: Learning Self-Evolving Skills for Coding Agents

gpt-5.25/26/2026

Paper 2 is likely higher impact due to broader, more general significance: it introduces a principled bridge between classical optimal search (A*) and LLM post-training for reasoning, applicable beyond any single domain (e.g., theorem proving, planning, verification, NLI). The framing of proofs as search with A*-guided supervision/RL is novel and timely for improving reasoning efficiency and correctness, with strong evidence (small models surpassing much larger ones) suggesting practical value and scalability. Paper 1 is impactful for coding agents, but is more domain-specific.

vs. Fairness of Explanations in Artificial Intelligence (AI): A Unifying Framework, Axioms, and Future Direction toward Responsible AI

claude-opus-4.65/26/2026

Paper 1 identifies a novel and important blind spot at the intersection of algorithmic fairness and explainable AI, providing the first unified theoretical framework for explanation fairness. Its conditional invariance framework, seven-dimensional taxonomy, and practical audit workflow address a critical gap in responsible AI with broad implications across high-stakes domains (healthcare, criminal justice, credit). While Paper 2 makes a solid contribution applying A* search to LLM reasoning with promising empirical results, Paper 1's conceptual contribution is more foundational, has broader cross-disciplinary impact, and addresses an increasingly urgent societal need for fair and accountable AI systems.

vs. ATWL: A Formal Language for Representing, Comparing, and Reusing Visual Analytics Workflows

gpt-5.25/26/2026

Paper 2 likely has higher impact due to a broadly applicable, timely advance in LLM reasoning: integrating classical A* search signals into post-training to improve proof correctness and efficiency, with strong empirical gains on small models outperforming much larger baselines. This is methodologically concrete (SFT on search traces + RL with process rewards), generalizable across reasoning tasks, and relevant to current industry and academic efforts to make LLM reasoning reliable and cost-efficient. Paper 1 is novel for visual analytics workflow formalization but is narrower in domain and adoption-dependent.

vs. GRAIL: AI translation for scientists application workflow on satellite data

claude-opus-4.65/26/2026

Paper 2 addresses a fundamental challenge in LLM reasoning by connecting classical A* search algorithms with modern post-training techniques (SFT and RL with process reward models). It demonstrates that small models (1B-3B) can outperform much larger models like DeepSeek-V3.2, which has broad implications across AI. The novelty of framing reasoning as search with A*-informed training, the rigorous methodology, and the generalizable insights about efficiency-accuracy tradeoffs give it significantly broader impact potential. Paper 1, while useful, is a narrower domain-specific engineering contribution for geospatial workflow translation.

vs. Large Vision-Language Models Get Lost in Attention

gpt-5.25/26/2026

Paper 2 has higher potential impact: it offers a broadly applicable, theoretically grounded framework (information theory + geometry) for interpreting residual updates and functionally separating attention vs FFN roles, with striking empirical evidence that attention weights can be randomized with little or no performance loss. This challenges a core design assumption across LVLMs/Transformers and could drive architectural simplification and efficiency gains across many multimodal systems. Paper 1 is novel and useful for reasoning, but is more niche (NLI/proof search) and tied to A*-guided training rather than a general rethink of model components.

vs. RECTOR: Priority-Aware Rule-Based Reranking for Compliance-Aware Autonomous Driving Trajectory Selection

claude-opus-4.65/26/2026

Paper 1 addresses a fundamental challenge in LLM reasoning by bridging classical AI search (A*) with modern language model training. It demonstrates that small models (1-3B) can outperform much larger ones (DeepSeek-V3.2) through A*-guided post-training, offering broad impact across all LLM reasoning applications. The novelty of combining process reward models with A* heuristics and the insights about imperfect heuristics improving accuracy in large search spaces are significant contributions. Paper 2, while practical, is a narrower engineering contribution focused on trajectory reranking in autonomous driving with incremental improvements and acknowledged limitations (proxy-only, open-loop, single dataset).

vs. EPPC-OASIS: Ontology-Aware Adaptation and Structured Inference Refinement for Electronic Patient-Provider Communication Mining in Secure Messages

gemini-3.15/26/2026

Paper 1 addresses a fundamental limitation in general LLM capabilities—deductive reasoning—by innovatively integrating classical A* search principles. Its results are highly impactful, enabling 1B-3B parameter models to outperform significantly larger state-of-the-art models. This broad applicability and dramatic performance leap offer widespread implications across the AI field. In contrast, Paper 2 focuses on a niche application (clinical NLP message mining) and yields only modest performance gains (~2% F1), resulting in a narrower and less transformative scientific impact.

vs. SkillFlow: Flow-Driven Recursive Skill Evolution for Agentic Orchestration

claude-opus-4.65/26/2026

Paper 1 presents a more focused and novel contribution by bridging classical A* search with LLM reasoning training, demonstrating that small models (1B-3B) can outperform much larger models like DeepSeek-V3.2. The clear theoretical grounding in optimal search algorithms, the discovery of meaningful trade-offs between accuracy and efficiency, and the surprising finding about imperfect heuristics provide deep insights. Paper 2, while comprehensive, combines multiple existing ideas (flow matching, skill libraries, credit assignment) into a complex framework that may be harder to reproduce and build upon. Paper 1's simplicity and fundamental insight give it broader potential impact.

vs. MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

claude-opus-4.65/26/2026

Paper 1 presents a novel framework connecting classical A* search with LLM reasoning through post-training, demonstrating that small models (1-3B) can outperform much larger ones. This addresses a fundamental challenge in LLM reasoning with broad implications across many applications. The insight about A*-informed process reward models balancing accuracy and efficiency, and the finding about imperfect heuristics improving accuracy in larger search spaces, offer deep theoretical contributions. Paper 2, while practical and well-engineered, is more of an infrastructure/benchmark contribution for a specific domain (mobile GUI agents) with narrower impact scope.

vs. Credit Assignment with Resets in Language Model Reasoning

gemini-3.15/26/2026

Paper 1 addresses a fundamental bottleneck in RL post-training for LLMs—credit assignment over long reasoning trajectories. By proposing a fully self-supervised reset mechanism (SRPO) grounded in the Conservative Policy Iteration framework, it offers a highly scalable and domain-agnostic approach. While Paper 2 presents impressive empirical results using A* search, its reliance on formal search spaces and heuristics may limit its application to specific deductive reasoning tasks, whereas Paper 1's methodology is more broadly applicable to general open-ended reasoning tasks.

vs. Market Regime Council for Dynamic Credit Assignment in Multi-Agent LLM Decision Systems

claude-opus-4.65/26/2026

Paper 1 addresses a fundamental challenge in LLM reasoning by bridging classical AI search (A*) with modern LLM training, offering broad methodological contributions applicable across many domains. It demonstrates that small models can outperform much larger ones through principled training, with insights about heuristic-guided learning that generalize beyond the specific task. Paper 2, while technically sound, addresses a narrower application (crypto portfolio management) with limited generalizability. Paper 1's contributions to reasoning efficiency and process reward modeling have broader impact potential across the AI research community.

vs. Privacy-Preserving Local Language Models for Longitudinal Data Retrieval in Chronic Dermatologic Disease: Implementation in Pemphigus Patients

claude-opus-4.65/26/2026

Paper 1 presents a novel methodological contribution by bridging classical A* search with LLM reasoning through post-training, demonstrating that small models can outperform much larger ones. It addresses a fundamental challenge in AI reasoning with broad applicability across many domains. Paper 2 is a well-executed but incremental clinical application study evaluating an existing small language model on a narrow use case (pemphigus records), with limited novelty in methodology and restricted generalizability beyond its specific clinical domain.

vs. Entropy-Gradient Inversion: Moving Toward Internal Mechanism of Large Reasoning Models

claude-opus-4.65/26/2026

Paper 1 introduces a novel theoretical concept (Entropy-Gradient Inversion) that provides mechanistic insight into how large reasoning models work internally, addressing a fundamental gap in understanding. It also proposes a practical method (CorR-PO) that leverages this insight for RL-based training without costly external verifiers. The combination of interpretability discovery and practical training improvement across multiple model scales gives it broader impact potential. Paper 2, while valuable in applying A* search principles to LLM reasoning, is more incremental—applying classical search to guide training in a specific domain (proof generation) with smaller models.

vs. CoRe-Code: Collaborative Reinforcement Learning for Code Generation

claude-opus-4.65/26/2026

Paper 1 introduces a novel conceptual bridge between classical A* search and LLM reasoning via post-training, offering both theoretical grounding and surprising empirical results (1-3B models outperforming DeepSeek-V3.2). The framing of reasoning as search with optimality guarantees is more foundational and broadly applicable. Paper 2, while solid, applies a relatively incremental Planner-Coder multi-agent paradigm to code generation using existing RL techniques (GRPO). Paper 1's insights on heuristic-guided training and the accuracy-efficiency tradeoff open richer research directions across reasoning tasks.