Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

May 7, 2026

arXiv:2605.06638v1

cs.AI (primary), cs.CL

#124 of 2292 · Artificial Intelligence

Tournament Score: 1535±47

Win Rate: 88%

Rating: 7.5 / 10

Significance 7.5 · Rigor 8.5 · Novelty 7 · Clarity 8.5

Abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency.

AI Impact Assessments

(1 models)

Scientific Impact Assessment: "Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key"

1. Core Contribution

This paper introduces ScaleLogic, a synthetic logical reasoning framework that provides independent, fine-grained control over two axes of task difficulty: proof depth (horizon) and logical expressiveness (from simple implication-only logic up to first-order reasoning with conjunction, disjunction, negation, and universal quantification). The key empirical finding is a power-law relationship between RL training compute $T$ and reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), where the scaling exponent $\gamma$ increases monotonically with logical expressiveness (from 1.04 to 2.60). The paper further demonstrates that training on more expressive logic yields stronger downstream transfer to real-world reasoning benchmarks (up to +10.66 points over base), establishing that *what* a model trains on, not just *how much* it trains, shapes downstream transfer.
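As an illustration of how an exponent like $\gamma$ is typically estimated, the sketch below fits $T = a \cdot D^{\gamma}$ by ordinary least squares in log-log space. It is not the authors' fitting code: the function name `fit_power_law`, the prefactor `a`, and the data points are hypothetical, and the data are generated from an exact power law purely to show that the fit recovers the exponent.

```python
import numpy as np


def fit_power_law(depths, compute):
    """Fit T = a * D**gamma by least squares on log T versus log D.

    Returns (gamma, a, r_squared), with R^2 measured in log-log space.
    """
    log_d = np.log(np.asarray(depths, dtype=float))
    log_t = np.log(np.asarray(compute, dtype=float))
    # A power law is a straight line in log-log space:
    # log T = gamma * log D + log a
    gamma, log_a = np.polyfit(log_d, log_t, 1)
    residuals = log_t - (gamma * log_d + log_a)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((log_t - log_t.mean()) ** 2))
    return float(gamma), float(np.exp(log_a)), 1.0 - ss_res / ss_tot


# Hypothetical measurements (NOT from the paper): compute generated from an
# exact power law with gamma = 1.7, so the fit should recover that exponent.
depths = np.array([4, 6, 8, 10, 12, 14])
compute = 50.0 * depths ** 1.7
gamma, a, r2 = fit_power_law(depths, compute)
print(f"gamma={gamma:.2f}, a={a:.1f}, R^2={r2:.4f}")
```

With real measurements, `compute` would be whichever compute metric is being tracked (steps, tokens, FLOPs, or GPU-hours) at the point where a target accuracy is reached.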

2. Methodological Rigor

The experimental methodology is notably thorough:

Controlled environment design: The backward proof construction guarantees unique derivability, and Z3 SMT solver verification on 1,000 samples per configuration provides strong soundness guarantees. Shortcut controls (random predicates, shuffled ordering, uniform corruption sampling) are carefully implemented.

Robust scaling analysis: The power-law fits achieve $R^{2} > 0.99$ across all settings. The authors provide extensive robustness checks: power-law vs. exponential comparison via AIC ($\Delta\text{AIC} \geq +7.1$), sensitivity to accuracy threshold (85% vs. 90%), multiple compute metrics (steps, tokens, FLOPs, GPU-hours), and multi-seed validation showing exponent stability within 0.02.

Confound isolation: The multi-entity ablation (Table 7) cleanly shows that the elevated $\gamma$ at +Quantification is driven by quantification itself, not the multi-entity dimension.

Cross-algorithm validation: Testing DAPO, GRPO, and GSPO confirms that the power-law relationship is algorithm-agnostic.

However, there are methodological limitations. All main experiments use Qwen3-4B with only partial replication on 8B. The depth ranges used for fitting are relatively narrow (e.g., 4-14 for +Quantification), and the authors appropriately caveat that the power-law characterization describes the observed regime rather than an asymptotic law. Single-seed runs for the main scaling curves, while validated via multi-seed checks at +Conjunction, leave some uncertainty.
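The power-law versus exponential comparison via AIC mentioned above can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes both models are fit by least squares on $\log T$ with Gaussian residuals, uses the standard form $\text{AIC} = n \ln(\text{RSS}/n) + 2k$ (up to an additive constant), and defines $\Delta\text{AIC}$ as exponential minus power law so that positive values favor the power law. That sign convention, the helper names, and the noisy data are all assumptions.

```python
import numpy as np


def aic_least_squares(residuals, n_params):
    """AIC for a Gaussian least-squares fit, up to an additive constant."""
    n = len(residuals)
    rss = float(np.sum(np.asarray(residuals) ** 2))
    return n * np.log(rss / n) + 2 * n_params


def delta_aic(depths, compute):
    """Return AIC(exponential) - AIC(power law); positive favors the power law."""
    d = np.asarray(depths, dtype=float)
    log_t = np.log(np.asarray(compute, dtype=float))
    # Power law: log T is linear in log D.
    pow_coef = np.polyfit(np.log(d), log_t, 1)
    pow_res = log_t - np.polyval(pow_coef, np.log(d))
    # Exponential: log T is linear in D.
    exp_coef = np.polyfit(d, log_t, 1)
    exp_res = log_t - np.polyval(exp_coef, d)
    return float(aic_least_squares(exp_res, 2) - aic_least_squares(pow_res, 2))


# Hypothetical data (NOT from the paper): a power law with mild log-normal noise.
rng = np.random.default_rng(0)
depths = np.array([4, 6, 8, 10, 12, 14])
noise = np.exp(rng.normal(0.0, 0.02, size=depths.size))
compute = 50.0 * depths ** 1.7 * noise
print(f"Delta AIC = {delta_aic(depths, compute):.1f}")
```

Because both models have the same number of parameters here, the difference reduces to $n \ln(\text{RSS}_{\exp} / \text{RSS}_{\text{pow}})$, so with few depth points the comparison is sensitive to how narrow the fitted depth range is.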

3. Potential Impact

Immediate applications: ScaleLogic provides a ready-to-use benchmark and training framework for the RLVR community. Its controlled axes enable principled curriculum design and diagnostic evaluation of reasoning capabilities.

Broader implications: The central finding—that expressiveness governs both training efficiency and downstream transfer—has significant implications for RL post-training data curation. Rather than simply scaling data volume, practitioners should focus on the structural richness of training problems. The power-law characterization provides a practical tool for predicting training budgets.

Transfer results: The +10.66 point improvement on downstream math/reasoning benchmarks from purely synthetic logical training is practically significant and demonstrates that abstract logical reasoning skills transfer to real-world tasks. The monotonic relationship between expressiveness and transfer gain is a particularly actionable finding.
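To make the budget-prediction point concrete, the fitted relation can be turned into a relative-cost estimate. This is a sketch under two assumptions: that the prefactor is constant within a setting, and that the power law holds over the depths being compared, which is not guaranteed outside the fitted range. $D_1$ and $D_2$ are illustrative depths, not values from the paper.

$$
\frac{T(D_2)}{T(D_1)} = \left(\frac{D_2}{D_1}\right)^{\gamma}
$$

Under these assumptions, doubling the depth multiplies the required compute by $2^{1.04} \approx 2.06$ at the least expressive setting ($\gamma = 1.04$) and by $2^{2.60} \approx 6.06$ at the most expressive setting ($\gamma = 2.60$).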

4. Timeliness & Relevance

This work addresses a critical bottleneck in the current RL-for-reasoning wave. Following DeepSeek-R1 and similar models, the community recognizes that RLVR can improve reasoning, but lacks controlled environments to understand *why* and *how*. The paper fills a gap clearly identified in Table 1: existing data sources (math/code, SAT, Knights and Knaves, game-based) fail to simultaneously provide verifiability, scalability, and controllable horizon/expressiveness. The work is highly timely given the rapid adoption of RL post-training and the emerging interest in scaling laws for this paradigm.

5. Strengths & Limitations

Key Strengths:

The two-axis decomposition (depth × expressiveness) is elegant and enables clean causal attribution of difficulty sources.

The monotonic expressiveness-exponent relationship is a clean, memorable finding with clear predictive utility.

Comprehensive ablations (curriculum vs. uniform vs. difficult-only; candidate count scaling; OOD generalization) provide a rich picture of training dynamics.

The observation that curriculum training reduces $\gamma$ (from 1.70 to 1.33 at +Conjunction) has practical value for training pipeline design.

The OOD generalization analysis (Figure 5b), showing that performance collapses at ~3× $D_{\text{train}}$ regardless of training depth, reveals a fundamental horizon limitation.

Notable Limitations:

The logic hierarchy, while carefully nested, remains within classical deductive reasoning. Richer fragments (equality, higher-order logic, abductive reasoning, probabilistic reasoning) are unaddressed.

No theoretical explanation is offered for why the power law holds or why different operators change $\gamma$. The small gap between $\gamma$ for +Conjunction (1.72) and +Negation (1.81), whose standard errors overlap, suggests the hierarchy may not be fully monotonic at finer granularity.

The downstream gains, while significant, are evaluated on a 4B model. Whether expressiveness-driven transfer persists at larger scales where base capabilities are stronger remains unclear.

The synthetic problems use random 5-letter predicates, eliminating semantic grounding. It's unclear whether adding semantic content would change the scaling dynamics.

The fixed candidate count $B = 4$ throughout most experiments is reasonable but limits the generality of conclusions about the interaction between depth and branching factor.

Additional Observations

The qualitative example (Appendix L) effectively illustrates how synthetic logic training encourages systematic case-splitting behavior on math problems—a compelling mechanistic hint. The framework's open-ended extensibility (adding new logical operators) makes it a potential community resource. The paper is well-written with clear figures and a logical structure.

Rating: 7.5 / 10

Significance 7.5 · Rigor 8.5 · Novelty 7 · Clarity 8.5

Generated May 8, 2026

Comparison History (25)

vs. How Far Are We From True Auto-Research?

gpt-5.2 · 5/20/2026

Paper 2 has higher impact potential: it introduces a controllable benchmark (ScaleLogic) enabling systematic, reproducible scaling studies of RL for long-horizon reasoning, and reports clear quantitative laws (power-law scaling with depth; expressiveness-dependent exponents) with strong fits and validation across RL methods plus curriculum effects. Its findings are timely for LLM post-training, with direct implications for compute planning and task design, and broad relevance across ML, reasoning, and alignment. Paper 1 is valuable as an audit/diagnostic of agentic auto-research, but is more observational and narrower in immediate methodological generalization.

vs. Ads in AI Chatbots? An Analysis of How Large Language Models Navigate Conflicts of Interest

gemini-3.1 · 5/16/2026

Paper 1 establishes fundamental scaling laws for reinforcement learning in LLMs, quantifying the relationship between compute, reasoning depth, and expressiveness. This foundational contribution to core AI development and theoretical understanding promises broader, longer-lasting scientific impact than Paper 2's empirical analysis of chatbot advertising behaviors, which, while ethically important, addresses transient deployment phenomena.

vs. Thinking Diffusion: Penalize and Guide Visual-Grounded Reasoning in Diffusion Multimodal Language Models

gpt-5.2 · 5/16/2026

Paper 1 likely has higher scientific impact due to its more general, theory-like contribution: a controlled framework (ScaleLogic) enabling systematic study of RL scaling with reasoning horizon and logical expressiveness, with strong empirical evidence (power-law scaling, high R^2) and transfer results across downstream math/reasoning plus robustness across RL methods/curricula. This can influence multiple subfields (scaling laws, RL for LLMs, reasoning benchmarks, curriculum learning). Paper 2 is timely and useful for diffusion multimodal models, but its techniques (penalties/guidance) appear more model/paradigm-specific and likely narrower in cross-field impact.

vs. Unsteady Metrics and Benchmarking Cultures of AI Model Builders

claude-opus-4.6 · 5/16/2026

Paper 2 introduces a novel, controlled framework (ScaleLogic) for studying RL-based reasoning in LLMs, discovering precise power-law scaling relationships between compute, reasoning depth, and logical expressiveness. This provides actionable, quantitative insights for training methodology (e.g., expressiveness matters more than volume, curriculum learning improves efficiency) with strong methodological rigor (R²>0.99). Paper 1 offers valuable meta-analysis of benchmarking practices but is more observational/descriptive. Paper 2's findings on scaling laws and transfer learning have broader, more immediate impact on how the field trains and evaluates reasoning models.

vs. MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

gpt-5.2 · 5/16/2026

Paper 2 likely has higher scientific impact: it introduces a solver-grounded multimodal benchmark with verified instances, clear evaluation protocol, and broad real-world relevance to operations research, supply chains, scheduling, and decision support. Its methodological rigor (instance generation + exact-solver verification) and timeliness (multimodal LLMs, code generation) make it a durable community resource that can standardize progress and attract cross-field adoption. Paper 1 offers strong insights into RL scaling laws for reasoning, but it is more synthetic and may influence a narrower slice of RL-for-reasoning research compared to a widely usable benchmark.

vs. LLM Safety From Within: Detecting Harmful Content with Internal Representations

gemini-3.1 · 5/16/2026

Paper 2 offers fundamental scientific insights into a critical, timely problem: scaling RL for LLM reasoning. By establishing clear power laws between training compute, reasoning depth, and logical expressiveness, it provides foundational knowledge that will guide the development of next-generation reasoning models. While Paper 1 presents a highly efficient and practical engineering solution for LLM safety, Paper 2's methodological rigor in defining synthetic scaling laws has broader implications for the theoretical understanding and future training paradigms of AI, giving it a higher potential scientific impact.

vs. Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

gemini-3.1 · 5/8/2026

Paper 1 establishes fundamental power-law scaling relationships between RL compute, reasoning depth, and logical expressiveness. By addressing the critical bottleneck of long-horizon reasoning in LLMs and demonstrating actionable transfer to general benchmarks, it offers highly rigorous, foundational insights that will directly guide the next generation of LLM training methodologies. While Paper 2 provides a valuable safety taxonomy, Paper 1's rigorous empirical laws and direct impact on model capability give it broader, transformative potential across AI research.

vs. Emergent Strategic Reasoning Risks in AI: A Taxonomy-Driven Evaluation Framework

gemini-3.1 · 5/8/2026

Paper 2 establishes fundamental scaling laws for reinforcement learning in LLM reasoning, a critical and rapidly growing area of AI research. By demonstrating a mathematically rigorous power-law relationship between training compute and reasoning depth, and linking logical expressiveness to downstream transfer efficiency, it provides highly actionable insights for developing next-generation reasoning models. While Paper 1 offers a valuable safety benchmark, Paper 2's foundational methodological contributions to understanding capability scaling give it a broader and potentially more transformative scientific impact across the AI community.

vs. Hidden in Plain Sight: Visual-to-Symbolic Analytical Solution Inference from Field Visualizations

gemini-3.1 · 5/8/2026

Paper 2 addresses a fundamental challenge in AI—improving LLM reasoning via RL—and establishes quantitative scaling laws linking compute, reasoning depth, and expressiveness. Its findings demonstrate broad downstream transfer to mathematics and general reasoning, promising widespread impact across the AI community. While Paper 1 introduces a highly novel visual-to-symbolic pipeline, its immediate impact is narrower, primarily benefiting specific subfields in physics and AI-assisted scientific discovery.

vs. MolRecBench-Wild: A Real-World Benchmark for Optical Chemical Structure Recognition

claude-opus-4.6 · 5/8/2026

Paper 1 makes fundamental contributions to understanding how RL training scales with reasoning complexity in LLMs, discovering precise power-law relationships between compute and reasoning depth modulated by logical expressiveness. This has broad implications for LLM training methodology, curriculum design, and scaling laws—topics central to AI research. The finding that training content (expressiveness) matters more than volume for downstream transfer is highly actionable. Paper 2, while valuable for the OCSR community, addresses a narrower domain-specific benchmarking problem with more limited cross-field impact.

vs. Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning

gpt-5.2 · 5/8/2026

Paper 1 offers a more novel and broadly informative contribution: a controlled, scalable reasoning environment (ScaleLogic) that disentangles horizon and logical expressiveness, plus empirical scaling laws with strong fits and systematic analysis across RL methods and curricula. These results can influence understanding of RL-for-reasoning training dynamics and guide dataset/task design, with potential impact across LLM training, scaling theory, and reasoning benchmarks. Paper 2 is practically valuable for cost-effective inference, but is narrower (routing policy engineering) and closer to existing adaptive compute/routing lines, with more limited cross-field impact.

vs. Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs

claude-opus-4.6 · 5/8/2026

Paper 1 makes fundamental contributions to understanding RL-based reasoning in LLMs, discovering precise power-law scaling relationships between compute, reasoning depth, and logical expressiveness. This has broad implications for training methodology across the LLM field. The controlled ScaleLogic framework enables systematic study of a critical open question, and the finding that training expressiveness (not just volume) drives downstream transfer is highly impactful. Paper 2 addresses a narrower applied problem (PII detection in HTTP traffic) with an LLM pipeline that, while useful, offers more incremental and domain-specific contributions.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gemini-3.1 · 5/8/2026

While Paper 1 provides valuable insights into LLM scaling laws and reasoning, Paper 2 presents a groundbreaking leap in medical AI. By creating a multimodal generative model of human physiology capable of in silico clinical trial simulation and accurate disease prediction, HealthFormer introduces the foundation for clinical digital twins. Its ability to accurately predict individualized responses to interventions across multiple independent cohorts demonstrates immense potential for revolutionizing personalized medicine, clinical trial design, and longitudinal healthcare, giving it a much broader and deeper real-world scientific impact.

vs. Simulating clinical interventions with a generative multimodal model of human physiology

gemini-3.1 · 5/8/2026

Paper 1 presents a paradigm-shifting approach to personalized medicine by developing a generative 'world model' of human physiology. Its ability to simulate clinical interventions in silico and accurately forecast disease endpoints without task-specific training offers massive real-world applications for clinical trials and digital twins. While Paper 2 provides valuable foundational insights into LLM scaling laws for reasoning, Paper 1 demonstrates unprecedented scale, multi-modal integration, and rigorous real-world validation across diverse medical cohorts, giving it a profoundly wider and more transformative potential impact on human health and applied AI.

vs. Who Prices Cognitive Labor in the Age of Agents? A Position on Compute-Anchored Wages

gemini-3.1 · 5/8/2026

Paper 2 addresses a critical and highly active area in AI—teaching LLMs reasoning via RL. By establishing empirical power-law scaling and demonstrating the importance of logical expressiveness for downstream transfer, it provides actionable insights for AI development. Paper 1 offers an interesting theoretical economic framework for AI labor, but Paper 2's empirical methodology and direct applicability to current LLM training bottlenecks give it a higher potential for immediate, widespread scientific impact and citations.

vs. ReasonSTL: Bridging Natural Language and Signal Temporal Logic via Tool-Augmented Process-Rewarded Learning

claude-opus-4.6 · 5/8/2026

Paper 2 addresses a fundamental question about RL-based reasoning in LLMs with broad implications. It introduces a controlled framework (ScaleLogic) revealing power-law scaling relationships between compute, reasoning depth, and logical expressiveness—offering principled insights applicable across many domains. The finding that training expressiveness matters more than volume for transfer is highly impactful for the LLM training community. Paper 1, while solid engineering work on NL-to-STL translation, addresses a narrower domain (formal specification for cyber-physical systems) with more limited cross-field impact.

vs. Mind the Gap? A Distributional Comparison of Real and Synthetic Priors for Tabular Foundation Models

gemini-3.1 · 5/8/2026

Paper 1 investigates scaling laws for RL in LLM reasoning, a highly prominent and rapidly evolving area of AI. By establishing power laws for training compute versus reasoning depth and demonstrating transfer to downstream tasks, it provides foundational insights likely to influence the development of next-generation reasoning models. Paper 2, while rigorous, focuses on tabular foundation models, which has a significantly narrower scope and lesser broader impact across the field compared to general LLM reasoning capabilities.

vs. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

gpt-5.2 · 5/8/2026

Paper 1 has higher potential impact due to a broadly relevant, timely question (how RL scales for long-horizon LLM reasoning) and a controlled benchmark (ScaleLogic) enabling systematic, generalizable scaling-law findings across RL methods, with strong quantitative fits and demonstrated transfer to standard reasoning/math benchmarks. Its insights on expressiveness vs. compute scaling could influence RLHF/RLAIF training design across many domains. Paper 2 is innovative and rigorous (provable budget conservation) with clear engineering value for code agents, but its impact is narrower (multi-agent codegen orchestration) and relies on proxy misrouting metrics rather than widely adopted benchmark outcomes.

vs. When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

claude-opus-4.6 · 5/8/2026

Paper 2 presents novel empirical findings with a controlled synthetic framework (ScaleLogic) that reveals precise power-law scaling relationships between RL training compute and reasoning depth/expressiveness. These quantitative insights into how logical expressiveness governs training difficulty and downstream transfer are immediately actionable for the LLM training community. Paper 1, while offering a useful conceptual framework for sycophancy, is a position paper that primarily reorganizes existing observations into a taxonomy without empirical validation. Paper 2's methodological rigor, scalable framework, and broadly applicable findings give it significantly higher potential impact.

vs. When Helpfulness Becomes Sycophancy: Sycophancy is a Boundary Failure Between Social Alignment and Epistemic Integrity in Large Language Models

claude-opus-4.6 · 5/8/2026

Paper 1 presents novel empirical findings with rigorous methodology: a controlled framework (ScaleLogic) revealing power-law scaling relationships between RL training compute and reasoning depth/expressiveness, with strong quantitative results (R²>0.99). It provides actionable insights for training LLMs and demonstrates downstream transfer benefits. Paper 2, while addressing an important problem (sycophancy), is a position paper proposing a conceptual framework and taxonomy without empirical validation. Paper 1's quantitative contributions, practical implications for RL-based LLM training, and methodological rigor give it substantially higher potential for scientific impact.