What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou
Abstract
Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper challenges a widely held assumption in the LLM training community: that code data in pretraining corpora broadly enhances reasoning capabilities. Through controlled ablation experiments on a 10T-token corpus with fine-grained domain separation, the authors demonstrate three key findings: (1) pure executable code improves programming but actually *competes* with mathematical reasoning under fixed-budget training; (2) reasoning gains previously attributed to code are better explained by cross-domain structured reasoning traces (Code-NL data like notebooks, Q&A pages, and mixed-format documents); and (3) increasing the density of "cognitive scaffolds"—structurally organized math samples with explicit intermediate reasoning steps—within a fixed math budget yields substantial gains on complex mathematical reasoning while preserving programming performance.
The central insight is a disambiguation: what prior work lumped together as "code" actually contains two distinct signals—executable programming artifacts and structured reasoning traces embedded in natural language. Separating these reveals that the reasoning benefits come from the latter, not the former.
Methodological Rigor
The experimental design is notably strong in several respects. The 10T-token corpus scale is realistic and industrially relevant, lending credibility to the findings. The seven-domain taxonomy with explicit criteria for distinguishing Code from Code-NL is well-motivated and addresses a genuine confound in prior work (e.g., Aryabumi et al. 2025 counted Markdown, CSS, and HTML as "code"). The fixed-budget substitution design—where ablated tokens are replaced by proportional upsampling of remaining domains—isolates marginal contributions rather than conflating removal with reduced training data.
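The fixed-budget substitution described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `ablate_domain`, the domain labels, and the mixture weights are hypothetical. Only the idea of dropping one domain and upsampling the remaining domains proportionally, so that the total token budget stays constant, comes from the description above.

```python
def ablate_domain(mixture, dropped):
    """Remove one domain and rescale the others so the total budget is unchanged.

    mixture: dict mapping domain name -> sampling weight (hypothetical values).
    dropped: name of the domain to ablate.
    """
    if dropped not in mixture:
        raise KeyError(f"unknown domain: {dropped}")
    total = sum(mixture.values())
    remaining = total - mixture[dropped]
    if remaining <= 0:
        raise ValueError("cannot ablate the only domain with non-zero weight")
    # Each surviving domain is scaled by the same factor, so the ratios
    # between surviving domains are preserved.
    scale = total / remaining
    return {name: weight * scale
            for name, weight in mixture.items() if name != dropped}


# Hypothetical four-domain mixture; the paper's taxonomy has seven domains.
baseline = {"code": 0.2, "code_nl": 0.1, "math": 0.1, "other": 0.6}
no_code = ablate_domain(baseline, "code")

for name, weight in no_code.items():
    print(f"{name}: {weight:.4f}")
```

Because every surviving domain is scaled by the same factor, any change in a benchmark score between the baseline and the ablated run reflects the substitution of one domain for more of the others, not a smaller training set.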
The paper validates findings across multiple architectures and scales: dense models at 1B and 5B, and MoE variants with 16, 32, and 64 experts. This cross-validation meaningfully strengthens the claims. The benchmark suite is comprehensive, spanning 30+ evaluations across five capability dimensions.
However, there are methodological limitations. The cognitive scaffold selection relies on a FastText classifier trained on code vs. non-code samples as a proxy for "structure"—this is acknowledged but somewhat circular, as it uses code-derived features to identify structured math data. The paper does not systematically sweep replacement ratios for scaffolds, making it difficult to assess sensitivity. The 200B training tokens per configuration (not the full 10T) may not fully capture long-horizon interactions that emerge at larger scales. Additionally, the causal language around "competition" and "negative coupling" should be interpreted carefully—these are fixed-budget substitution effects, not pure causal claims, though the authors are mostly careful about this distinction.
Potential Impact
The practical implications are significant for the LLM training community. If the findings generalize, they suggest that:
1. Data taxonomy matters more than volume: Simply adding more code to pretraining is suboptimal; careful separation of executable code from mixed-format reasoning traces is essential.
2. Targeted data composition beats broad domain mixing: The cognitive scaffolding approach—increasing density of structured reasoning samples within a fixed domain budget—offers a practical lever for improving complex reasoning without additional compute.
3. Cross-domain trade-offs are real and measurable: The expert-routing analysis provides mechanism-level evidence that data composition shapes internal model specialization, giving practitioners diagnostic tools for understanding domain competition.
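The second point, raising scaffold density inside a fixed math budget, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function name `split_math_budget`, the token counts, and the swept fractions are hypothetical, and the paper's actual replacement ratio is not stated in this assessment.

```python
def split_math_budget(math_tokens, scaffold_fraction):
    """Split a fixed math token budget into scaffold and plain math tokens.

    math_tokens: total tokens allotted to the math domain (held constant).
    scaffold_fraction: share of that budget drawn from structured samples
        with explicit intermediate steps (hypothetical knob).
    """
    if not 0.0 <= scaffold_fraction <= 1.0:
        raise ValueError("scaffold_fraction must lie in [0, 1]")
    scaffold = round(math_tokens * scaffold_fraction)
    # The plain share is whatever remains, so the two parts always sum
    # to the original budget and no other domain loses tokens.
    return {"scaffold": scaffold, "plain": math_tokens - scaffold}


# Hypothetical sweep over densities with an illustrative math budget.
MATH_BUDGET = 20_000_000_000
for fraction in (0.0, 0.25, 0.5, 0.75):
    parts = split_math_budget(MATH_BUDGET, fraction)
    print(fraction, parts["scaffold"], parts["plain"])
```

Since only the composition of the math budget changes, the code, Code-NL, and other domain budgets are untouched, which is consistent with the reported observation that programming performance is largely preserved.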
These insights directly inform industrial pretraining pipelines. The finding that code hurts complex math (e.g., -71.53% on Minerva-Math, -47.16% on OlympiadBench) under fixed budgets is particularly actionable for teams building math-focused models.
Timeliness & Relevance
This paper addresses a timely question. As foundation models scale and data curation becomes a primary bottleneck, understanding *what* in training data drives *which* capabilities is critical. The "code improves reasoning" narrative has become conventional wisdom, influencing how major labs allocate their training budgets. Providing a more nuanced picture—that structured reasoning traces, not executable code per se, drive reasoning gains—is a valuable corrective.
The work also connects to the growing interest in data-centric AI, where the focus shifts from architectural innovation to understanding data properties. The cognitive scaffolding concept aligns with recent work on chain-of-thought and process-based reasoning, suggesting that exposing intermediate reasoning structure during pretraining (not just fine-tuning) is beneficial.
Strengths
Limitations
Overall Assessment
This is a well-executed empirical study that adds important nuance to a widely accepted claim about code and reasoning. The fine-grained domain separation is the key contribution, and the cognitive scaffolding concept, while preliminary, points toward more principled data-centric optimization. The main limitations are the lack of theoretical depth and the relatively modest model scales (dense models at 1B and 5B), but the practical relevance and thoroughness of the experiments make this a valuable contribution.
Generated May 20, 2026
Comparison History (22)
Paper 2 investigates a fundamental question about pretraining data composition, challenging the widely held belief that pure code training improves general reasoning. By showing that structured reasoning traces drive these improvements, it provides actionable insights for optimizing foundation model training. This deep methodological understanding will likely have a broader and more lasting impact across the field than Paper 1, which introduces an evaluation benchmark that, while valuable, may become obsolete as models rapidly evolve.
Paper 2 provides fundamental insights into pretraining data composition by refining the prevalent assumption that pure code improves general reasoning. Its findings on structured reasoning signals will directly influence how foundation models are trained globally, offering broader and more profound methodological impact compared to the introduction of a new benchmark in Paper 1.
Paper 1 likely has higher scientific impact: it addresses a central, timely question in foundation-model training (whether code causally improves general reasoning) using large-scale controlled pretraining with domain-separated data, yielding actionable guidance for data-centric optimization and revealing trade-offs across capabilities. Its conclusions affect broad LM development (math, knowledge, programming) and can influence corpus design and training strategies across many labs. Paper 2 is rigorous and useful for MLLM hallucination mitigation, but its scope is narrower (modality-conflict cases, inference-time head interventions) and may generalize less broadly than Paper 1’s data-composition findings.
Paper 2 addresses a fundamental question about how code and structured reasoning affect language model capabilities, with implications across the entire LLM training ecosystem. Its findings—that structured reasoning traces matter more than executable code for reasoning, and its mechanistic analysis of expert activation patterns—provide actionable insights for foundation model development at scale. Paper 1, while practically valuable for a specific region, addresses a narrow geographic application with standard ML methods (RF + XGBoost). Paper 2's breadth of impact across AI research, its methodological rigor with controlled 10T-token experiments, and its timeliness given the centrality of data mixture optimization in LLM development give it substantially higher scientific impact.
Paper 1 likely has higher scientific impact due to broad relevance to foundation model training and data-centric optimization: it challenges a common assumption (code improves general reasoning), provides controlled large-scale pretraining evidence, and offers actionable insights (structured reasoning traces, density trade-offs) with mechanism-level routing analysis. Its findings can influence LM dataset design across many domains. Paper 2 is methodologically strong and practically valuable for wastewater control with safety guarantees, but its impact is more domain-specific and likely narrower across fields.
Paper 1 challenges and clarifies a widely held assumption in foundation model pretraining—that code data universally improves general reasoning. By identifying that structured reasoning traces, rather than pure code, drive these gains, it offers highly impactful, actionable insights for data curation and LLM training. Paper 2 presents a valuable but narrower negative result regarding agent skills in high-feedback environments, making its broader scientific impact more limited compared to the fundamental LLM pretraining findings in Paper 1.
Paper 2 addresses a fundamental question about how code and structured reasoning signals contribute to LLM capabilities, with experiments at massive scale (10T tokens). Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for foundation model training data composition across the entire field. Paper 1 tackles a more niche problem (multi-task unlearning) with solid but incremental contributions limited to computer vision benchmarks. Paper 2's insights are more timely, broadly applicable, and likely to influence how major labs design pretraining data mixtures.
Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities during pretraining, providing mechanistic insights via expert-activation patterns. Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for data-centric optimization of foundation models across the entire AI field. Paper 1, while valuable for health AI applications, is more applied and incremental, evaluating an existing LLM on PHR-augmented queries. Paper 2's insights will likely influence training data strategies for many future models, giving it broader and deeper scientific impact.
Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities, with experiments at massive scale (10T tokens). Its findings about data composition for training foundation models have broad implications for the entire LLM training paradigm and will influence how future models are built. Paper 1 offers an incremental improvement (2.1% AUC) in deepfake detection using emotion cues—a narrower problem with limited generalization gains. Paper 2's mechanistic insights and practical guidance for data-centric optimization have far broader impact across AI research.
Paper 2 has higher potential impact due to broader relevance and stronger real-world applicability: it provides controlled large-scale pretraining evidence about what data actually improves mathematical reasoning, with actionable guidance for dataset design and mechanistic routing analyses. Its conclusions can influence foundation-model training across many domains (math, code, general reasoning) and are timely for current LLM development. Paper 1 is a valuable, reproducible case study for AI-assisted formalization, but it is narrower in scope and reports a partially unresolved proof, limiting immediate methodological and practical impact beyond the theorem-proving community.
Paper 1 addresses a foundational and widely debated topic in LLM pretraining: the role of code in developing reasoning capabilities. By conducting massive-scale (10T tokens) controlled experiments, it challenges prevailing assumptions and provides actionable insights into data-centric optimization (structured reasoning signals vs. pure code). This will likely broadly influence how next-generation foundation models are trained. While Paper 2 introduces a valuable privacy benchmark, Paper 1's fundamental insights into model cognition and pretraining data dynamics offer a wider, more transformative impact on the trajectory of AI capability research.
Paper 2 addresses a fundamental question about LLM training data composition and mathematical reasoning at scale (10T tokens), with findings that challenge a widely-held assumption (code improves general reasoning). Its insights about structured reasoning traces vs. executable code have broad implications for foundation model training across the industry. Paper 1, while methodologically rigorous, addresses a narrower problem (skill library management in self-evolving agents) with a smaller community of practitioners. Paper 2's mechanistic analysis via expert-activation patterns and practical data-centric optimization strategies give it wider applicability and timeliness.
Paper 2 has higher potential impact due to broader relevance and timeliness: it addresses foundation-model pretraining data composition at 10T-token scale, clarifying a widely held assumption (code improves reasoning) and providing actionable, data-centric strategies for improving mathematical reasoning with mechanism-level evidence (routing/expert activations). Its findings can influence LM training practices across many domains and applications. Paper 1 is solid and practically useful for wearable HAR, but its scope is narrower and primarily advances a specific application/architecture combination.
Paper 2 addresses a fundamental question about how code affects LLM reasoning—a topic central to the massive ongoing investment in foundation model training. Its findings that structured reasoning traces (not executable code) drive reasoning gains, combined with practical data-centric optimization strategies validated at 10T-token scale, have immediate broad impact on how the entire field approaches pretraining data composition. Paper 1, while theoretically rigorous in formalizing controllable simulation as causal inference, addresses a narrower problem in conversational agent evaluation with a more specialized audience.
Paper 2 addresses a fundamental question about what drives reasoning improvements in LLMs, providing systematic evidence through large-scale controlled experiments on a 10T-token corpus. Its findings—that structured reasoning traces, not executable code per se, improve mathematical reasoning—have broad implications for LLM pretraining data strategies across the entire field. Paper 1, while methodologically sound, focuses on a narrower adversarial attack technique. Paper 2's insights into data-centric optimization and cross-domain interactions will likely influence foundation model training practices more broadly and durably.
Paper 1 addresses a fundamental question about how LLMs acquire reasoning capabilities, challenging the prevailing assumption that pure code drives these gains. By identifying structured reasoning traces as the true catalyst and providing mechanistic evidence, it offers transformative insights for foundation model pre-training. While Paper 2 provides a valuable inference acceleration technique, Paper 1's discoveries regarding data composition and cognitive scaffolding have a deeper, more foundational scientific impact on AI development.
Paper 1 addresses a fundamental question about what drives reasoning improvements in foundation models—a topic central to the entire LLM training pipeline. Its controlled 10T-token pretraining experiments provide causal evidence that structured reasoning traces, not executable code per se, drive reasoning gains. This insight has broad implications for data curation strategies across the industry. Paper 2 presents a solid engineering contribution (signed graphs for multi-agent reasoning) but is more incremental, building on existing MAS frameworks with a specific architectural modification. Paper 1's findings reshape understanding of pretraining data composition, affecting a wider research community.
Paper 2 has higher likely scientific impact due to its strong real-world applicability (telemedicine decision support), timeliness (multimodal, real-time clinical AI), and broader cross-field influence spanning ML, HCI, and clinical medicine. It proposes a novel low-latency dual-agent multimodal system and introduces a task-relevant evaluation (TelePACES) with a sizable randomized crossover simulation study, lending methodological credibility. Paper 1 is rigorous and valuable for data-centric LLM training insights, but its impact is more specialized to model training/benchmarking and less directly transformative for high-stakes deployment.
Paper 2 likely has higher impact due to timely relevance to deployed memory-augmented LLM agents and safety. It introduces a concrete, broadly applicable failure mode (memory laundering), a measurable metric (SPG), and actionable mitigation guidance tied to intervention placement, making it immediately useful for real-world systems. The phenomenon generalizes across agent architectures and connects to security, alignment, and HCI. Paper 1 is rigorous and valuable for data-centric training insights, but its impact is more specialized to pretraining data composition debates and may translate more slowly into deployment practices.
Paper 1 addresses a fundamental question about why code in pretraining improves reasoning, providing controlled experiments at scale (10T tokens) that challenge a widely-held assumption. Its findings—that structured reasoning traces, not executable code per se, drive reasoning gains—have broad implications for data-centric optimization across the entire foundation model training community. Paper 2 presents a useful but more incremental framework for LLM agents. Paper 1's mechanistic insights and practical guidance for training data composition will likely influence a wider range of future research and industry practices.