Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov
Abstract
Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. We introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics, from simple implication-only logic ("if-then") to more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute follows a power law with respect to reasoning depth (R² > 0.99), and that the scaling exponent increases monotonically with logical expressiveness, from 1.04 to 2.60. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to 10.66 points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and that curriculum-based training substantially improves scaling efficiency.
AI Impact Assessments
Scientific Impact Assessment: "Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key"
1. Core Contribution
This paper introduces ScaleLogic, a synthetic logical reasoning framework that provides independent, fine-grained control over two axes of task difficulty: proof depth (horizon) and logical expressiveness (from simple implication-only logic up to first-order reasoning with conjunction, disjunction, negation, and universal quantification). The key empirical finding is a power-law relationship between RL training compute and reasoning depth (R² > 0.99), where the scaling exponent increases monotonically with logical expressiveness (from 1.04 to 2.60). The paper further demonstrates that training on more expressive logic yields stronger downstream transfer to real-world reasoning benchmarks (+10.66 points over base), establishing that *what* a model trains on, not just *how much* it trains, shapes downstream transfer.
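To make the reported exponents concrete, the following worked arithmetic shows what a power law implies when the proof depth doubles. The symbols are assumptions chosen for illustration and are not taken from the paper: C is RL training compute, d is proof depth, k is a constant prefactor, and α is the scaling exponent. Only the two exponent values (1.04 and 2.60) come from the text above.

```latex
% Assumed notation (not from the paper): C = RL training compute,
% d = proof depth, k = constant prefactor, \alpha = scaling exponent.
\[
  C(d) \approx k\, d^{\alpha}
  \quad\Longrightarrow\quad
  \frac{C(2d)}{C(d)} = 2^{\alpha}
\]
\[
  \alpha = 1.04:\; 2^{1.04} \approx 2.06
  \qquad
  \alpha = 2.60:\; 2^{2.60} \approx 6.06
\]
```

Under this reading, doubling the depth roughly doubles the required compute in the least expressive setting but multiplies it by about six in the most expressive one, within the depth range where the fit was observed.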
2. Methodological Rigor
The experimental methodology is notably thorough.
However, there are methodological limitations. All main experiments use Qwen3-4B with only partial replication on 8B. The depth ranges used for fitting are relatively narrow (e.g., 4-14 for +Quantification), and the authors appropriately caveat that the power-law characterization describes the observed regime rather than an asymptotic law. Single-seed runs for the main scaling curves, while validated via multi-seed checks at +Conjunction, leave some uncertainty.
3. Potential Impact
Immediate applications: ScaleLogic provides a ready-to-use benchmark and training framework for the RLVR community. Its controlled axes enable principled curriculum design and diagnostic evaluation of reasoning capabilities.
Broader implications: The central finding—that expressiveness governs both training efficiency and downstream transfer—has significant implications for RL post-training data curation. Rather than simply scaling data volume, practitioners should focus on the structural richness of training problems. The power-law characterization provides a practical tool for predicting training budgets.
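As an illustration of how a power-law characterization could be used to estimate training budgets, here is a minimal sketch that fits an exponent by linear regression in log-log space and then extrapolates. Everything in it is hypothetical: the function names `fit_power_law` and `predict_compute`, the depth grid, the prefactor, and the noise-free synthetic compute values are assumptions made for the example, not data or code from the paper.

```python
import numpy as np


def fit_power_law(depths, compute):
    """Fit compute ~= k * depth**alpha by least squares in log-log space.

    Returns (alpha, k, r2), where r2 is computed on the log-transformed data.
    """
    log_d = np.log(np.asarray(depths, dtype=float))
    log_c = np.log(np.asarray(compute, dtype=float))
    # A power law is a straight line in log-log space: log C = alpha * log d + log k.
    alpha, log_k = np.polyfit(log_d, log_c, 1)
    pred = alpha * log_d + log_k
    ss_res = np.sum((log_c - pred) ** 2)
    ss_tot = np.sum((log_c - log_c.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return float(alpha), float(np.exp(log_k)), float(r2)


def predict_compute(depth, alpha, k):
    """Estimate the compute at a given depth from a fitted power law."""
    return k * depth ** alpha


# Hypothetical, noise-free synthetic data (NOT measurements from the paper):
# compute generated from an assumed exponent of 2.6 and prefactor of 3.0.
depths = np.array([4, 6, 8, 10, 12, 14])
compute = 3.0 * depths ** 2.6

alpha, k, r2 = fit_power_law(depths, compute)
budget_at_20 = predict_compute(20, alpha, k)
print(f"alpha={alpha:.2f}, k={k:.2f}, R^2={r2:.4f}, budget(d=20)={budget_at_20:.1f}")
```

Note that the estimate at depth 20 lies outside the fitted range, so it should be read as an extrapolation of the observed regime rather than a guaranteed budget.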
Transfer results: The +10.66 point improvement on downstream math/reasoning benchmarks from purely synthetic logical training is practically significant and demonstrates that abstract logical reasoning skills transfer to real-world tasks. The monotonic relationship between expressiveness and transfer gain is a particularly actionable finding.
4. Timeliness & Relevance
This work addresses a critical bottleneck in the current RL-for-reasoning wave. Following DeepSeek-R1 and similar models, the community recognizes that RLVR can improve reasoning, but lacks controlled environments to understand *why* and *how*. The paper fills a gap clearly identified in Table 1: existing data sources (math/code, SAT, Knights and Knaves, game-based) fail to simultaneously provide verifiability, scalability, and controllable horizon/expressiveness. The work is highly timely given the rapid adoption of RL post-training and the emerging interest in scaling laws for this paradigm.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Additional Observations
The qualitative example (Appendix L) effectively illustrates how synthetic logic training encourages systematic case-splitting behavior on math problems—a compelling mechanistic hint. The framework's open-ended extensibility (adding new logical operators) makes it a potential community resource. The paper is well-written with clear figures and a logical structure.
Generated May 8, 2026
Comparison History (25)
Paper 2 has higher impact potential: it introduces a controllable benchmark (ScaleLogic) enabling systematic, reproducible scaling studies of RL for long-horizon reasoning, and reports clear quantitative laws (power-law scaling with depth; expressiveness-dependent exponents) with strong fits and validation across RL methods plus curriculum effects. Its findings are timely for LLM post-training, with direct implications for compute planning and task design, and broad relevance across ML, reasoning, and alignment. Paper 1 is valuable as an audit/diagnostic of agentic auto-research, but is more observational and narrower in immediate methodological generalization.
Paper 1 establishes fundamental scaling laws for reinforcement learning in LLMs, quantifying the relationship between compute, reasoning depth, and expressiveness. This foundational contribution to core AI development and theoretical understanding promises broader, longer-lasting scientific impact than Paper 2's empirical analysis of chatbot advertising behaviors, which, while ethically important, addresses transient deployment phenomena.
Paper 1 likely has higher scientific impact due to its more general, theory-like contribution: a controlled framework (ScaleLogic) enabling systematic study of RL scaling with reasoning horizon and logical expressiveness, with strong empirical evidence (power-law scaling, high R²) and transfer results across downstream math/reasoning plus robustness across RL methods/curricula. This can influence multiple subfields (scaling laws, RL for LLMs, reasoning benchmarks, curriculum learning). Paper 2 is timely and useful for diffusion multimodal models, but its techniques (penalties/guidance) appear more model/paradigm-specific and likely narrower in cross-field impact.
Paper 2 introduces a novel, controlled framework (ScaleLogic) for studying RL-based reasoning in LLMs, discovering precise power-law scaling relationships between compute, reasoning depth, and logical expressiveness. This provides actionable, quantitative insights for training methodology (e.g., expressiveness matters more than volume, curriculum learning improves efficiency) with strong methodological rigor (R²>0.99). Paper 1 offers valuable meta-analysis of benchmarking practices but is more observational/descriptive. Paper 2's findings on scaling laws and transfer learning have broader, more immediate impact on how the field trains and evaluates reasoning models.
Paper 2 likely has higher scientific impact: it introduces a solver-grounded multimodal benchmark with verified instances, clear evaluation protocol, and broad real-world relevance to operations research, supply chains, scheduling, and decision support. Its methodological rigor (instance generation + exact-solver verification) and timeliness (multimodal LLMs, code generation) make it a durable community resource that can standardize progress and attract cross-field adoption. Paper 1 offers strong insights into RL scaling laws for reasoning, but it is more synthetic and may influence a narrower slice of RL-for-reasoning research compared to a widely usable benchmark.
Paper 2 offers fundamental scientific insights into a critical, timely problem: scaling RL for LLM reasoning. By establishing clear power laws between training compute, reasoning depth, and logical expressiveness, it provides foundational knowledge that will guide the development of next-generation reasoning models. While Paper 1 presents a highly efficient and practical engineering solution for LLM safety, Paper 2's methodological rigor in defining synthetic scaling laws has broader implications for the theoretical understanding and future training paradigms of AI, giving it a higher potential scientific impact.
Paper 1 establishes fundamental power-law scaling relationships between RL compute, reasoning depth, and logical expressiveness. By addressing the critical bottleneck of long-horizon reasoning in LLMs and demonstrating actionable transfer to general benchmarks, it offers highly rigorous, foundational insights that will directly guide the next generation of LLM training methodologies. While Paper 2 provides a valuable safety taxonomy, Paper 1's rigorous empirical laws and direct impact on model capability give it broader, transformative potential across AI research.
Paper 2 establishes fundamental scaling laws for reinforcement learning in LLM reasoning, a critical and rapidly growing area of AI research. By demonstrating a mathematically rigorous power-law relationship between training compute and reasoning depth, and linking logical expressiveness to downstream transfer efficiency, it provides highly actionable insights for developing next-generation reasoning models. While Paper 1 offers a valuable safety benchmark, Paper 2's foundational methodological contributions to understanding capability scaling give it a broader and potentially more transformative scientific impact across the AI community.
Paper 2 addresses a fundamental challenge in AI—improving LLM reasoning via RL—and establishes quantitative scaling laws linking compute, reasoning depth, and expressiveness. Its findings demonstrate broad downstream transfer to mathematics and general reasoning, promising widespread impact across the AI community. While Paper 1 introduces a highly novel visual-to-symbolic pipeline, its immediate impact is narrower, primarily benefiting specific subfields in physics and AI-assisted scientific discovery.
Paper 1 makes fundamental contributions to understanding how RL training scales with reasoning complexity in LLMs, discovering precise power-law relationships between compute and reasoning depth modulated by logical expressiveness. This has broad implications for LLM training methodology, curriculum design, and scaling laws—topics central to AI research. The finding that training content (expressiveness) matters more than volume for downstream transfer is highly actionable. Paper 2, while valuable for the OCSR community, addresses a narrower domain-specific benchmarking problem with more limited cross-field impact.
Paper 1 offers a more novel and broadly informative contribution: a controlled, scalable reasoning environment (ScaleLogic) that disentangles horizon and logical expressiveness, plus empirical scaling laws with strong fits and systematic analysis across RL methods and curricula. These results can influence understanding of RL-for-reasoning training dynamics and guide dataset/task design, with potential impact across LLM training, scaling theory, and reasoning benchmarks. Paper 2 is practically valuable for cost-effective inference, but is narrower (routing policy engineering) and closer to existing adaptive compute/routing lines, with more limited cross-field impact.
Paper 1 makes fundamental contributions to understanding RL-based reasoning in LLMs, discovering precise power-law scaling relationships between compute, reasoning depth, and logical expressiveness. This has broad implications for training methodology across the LLM field. The controlled ScaleLogic framework enables systematic study of a critical open question, and the finding that training expressiveness (not just volume) drives downstream transfer is highly impactful. Paper 2 addresses a narrower applied problem (PII detection in HTTP traffic) with an LLM pipeline that, while useful, offers more incremental and domain-specific contributions.
While Paper 1 provides valuable insights into LLM scaling laws and reasoning, Paper 2 presents a groundbreaking leap in medical AI. By creating a multimodal generative model of human physiology capable of in silico clinical trial simulation and accurate disease prediction, HealthFormer introduces the foundation for clinical digital twins. Its ability to accurately predict individualized responses to interventions across multiple independent cohorts demonstrates immense potential for revolutionizing personalized medicine, clinical trial design, and longitudinal healthcare, giving it a much broader and deeper real-world scientific impact.
Paper 1 presents a paradigm-shifting approach to personalized medicine by developing a generative 'world model' of human physiology. Its ability to simulate clinical interventions in silico and accurately forecast disease endpoints without task-specific training offers massive real-world applications for clinical trials and digital twins. While Paper 2 provides valuable foundational insights into LLM scaling laws for reasoning, Paper 1 demonstrates unprecedented scale, multi-modal integration, and rigorous real-world validation across diverse medical cohorts, giving it a profoundly wider and more transformative potential impact on human health and applied AI.
Paper 2 addresses a critical and highly active area in AI—teaching LLMs reasoning via RL. By establishing empirical power-law scaling and demonstrating the importance of logical expressiveness for downstream transfer, it provides actionable insights for AI development. Paper 1 offers an interesting theoretical economic framework for AI labor, but Paper 2's empirical methodology and direct applicability to current LLM training bottlenecks give it a higher potential for immediate, widespread scientific impact and citations.
Paper 2 addresses a fundamental question about RL-based reasoning in LLMs with broad implications. It introduces a controlled framework (ScaleLogic) revealing power-law scaling relationships between compute, reasoning depth, and logical expressiveness—offering principled insights applicable across many domains. The finding that training expressiveness matters more than volume for transfer is highly impactful for the LLM training community. Paper 1, while solid engineering work on NL-to-STL translation, addresses a narrower domain (formal specification for cyber-physical systems) with more limited cross-field impact.
Paper 1 investigates scaling laws for RL in LLM reasoning, a highly prominent and rapidly evolving area of AI. By establishing power laws for training compute versus reasoning depth and demonstrating transfer to downstream tasks, it provides foundational insights likely to influence the development of next-generation reasoning models. Paper 2, while rigorous, focuses on tabular foundation models, which has a significantly narrower scope and lesser broader impact across the field compared to general LLM reasoning capabilities.
Paper 1 has higher potential impact due to a broadly relevant, timely question (how RL scales for long-horizon LLM reasoning) and a controlled benchmark (ScaleLogic) enabling systematic, generalizable scaling-law findings across RL methods, with strong quantitative fits and demonstrated transfer to standard reasoning/math benchmarks. Its insights on expressiveness vs. compute scaling could influence RLHF/RLAIF training design across many domains. Paper 2 is innovative and rigorous (provable budget conservation) with clear engineering value for code agents, but its impact is narrower (multi-agent codegen orchestration) and relies on proxy misrouting metrics rather than widely adopted benchmark outcomes.
Paper 2 presents novel empirical findings with a controlled synthetic framework (ScaleLogic) that reveals precise power-law scaling relationships between RL training compute and reasoning depth/expressiveness. These quantitative insights into how logical expressiveness governs training difficulty and downstream transfer are immediately actionable for the LLM training community. Paper 1, while offering a useful conceptual framework for sycophancy, is a position paper that primarily reorganizes existing observations into a taxonomy without empirical validation. Paper 2's methodological rigor, scalable framework, and broadly applicable findings give it significantly higher potential impact.
Paper 1 presents novel empirical findings with rigorous methodology: a controlled framework (ScaleLogic) revealing power-law scaling relationships between RL training compute and reasoning depth/expressiveness, with strong quantitative results (R²>0.99). It provides actionable insights for training LLMs and demonstrates downstream transfer benefits. Paper 2, while addressing an important problem (sycophancy), is a position paper proposing a conceptual framework and taxonomy without empirical validation. Paper 1's quantitative contributions, practical implications for RL-based LLM training, and methodological rigor give it substantially higher potential for scientific impact.