What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou

May 19, 2026

arXiv:2605.19762v1

cs.AI (primary), cs.CL

#110 of 2292 · Artificial Intelligence

Tournament Score: 1539 ± 45

Win Rate: 91%

Rating: 7.3 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Abstract

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper challenges a widely held assumption in the LLM training community: that code data in pretraining corpora broadly enhances reasoning capabilities. Through controlled ablation experiments on a 10T-token corpus with fine-grained domain separation, the authors demonstrate three key findings: (1) pure executable code improves programming but actually *competes* with mathematical reasoning under fixed-budget training; (2) reasoning gains previously attributed to code are better explained by cross-domain structured reasoning traces (Code-NL data like notebooks, Q&A pages, and mixed-format documents); and (3) increasing the density of "cognitive scaffolds"—structurally organized math samples with explicit intermediate reasoning steps—within a fixed math budget yields substantial gains on complex mathematical reasoning while preserving programming performance.

The central insight is a disambiguation: what prior work lumped together as "code" actually contains two distinct signals—executable programming artifacts and structured reasoning traces embedded in natural language. Separating these reveals that the reasoning benefits come from the latter, not the former.

Methodological Rigor

The experimental design is notably strong in several respects. The 10T-token corpus scale is realistic and industrially relevant, lending credibility to the findings. The seven-domain taxonomy with explicit criteria for distinguishing Code from Code-NL is well-motivated and addresses a genuine confound in prior work (e.g., Aryabumi et al. 2025 counted Markdown, CSS, and HTML as "code"). The fixed-budget substitution design—where ablated tokens are replaced by proportional upsampling of remaining domains—isolates marginal contributions rather than conflating removal with reduced training data.
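The fixed-budget substitution idea can be illustrated with a minimal sketch. The domain names and mixture weights below are hypothetical placeholders, not the paper's seven-domain taxonomy or its actual proportions; only the 200B-token per-configuration budget comes from the assessment. The point is that removing a domain and rescaling the rest keeps the total token count unchanged, so any score difference reflects composition rather than less training data.

```python
def ablate_domain(weights, domain):
    """Drop one domain and rescale the remaining weights proportionally.

    The returned weights sum to 1, so the overall token budget is unchanged
    and the relative proportions among surviving domains are preserved.
    """
    if domain not in weights:
        raise KeyError(f"unknown domain: {domain}")
    remaining = {d: w for d, w in weights.items() if d != domain}
    total = sum(remaining.values())
    if total <= 0:
        raise ValueError("no remaining domains to upsample")
    return {d: w / total for d, w in remaining.items()}


# Hypothetical mixture for illustration only.
mixture = {"web": 0.50, "code": 0.20, "code_nl": 0.10, "math": 0.20}

ablated = ablate_domain(mixture, "code")

# Tokens per domain under a fixed 200B-token budget.
token_budget = 200_000_000_000
tokens_per_domain = {d: round(w * token_budget) for d, w in ablated.items()}

for name, count in tokens_per_domain.items():
    print(f"{name}: {count:,} tokens")
```

Because the surviving domains are scaled by a common factor, their ratios to one another stay the same before and after the ablation.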

The paper validates findings across multiple architectures and scales: dense models at 1B and 5B, and MoE variants with 16, 32, and 64 experts. This cross-validation meaningfully strengthens the claims. The benchmark suite is comprehensive, spanning 30+ evaluations across five capability dimensions.

However, there are methodological limitations. The cognitive scaffold selection relies on a FastText classifier trained on code vs. non-code samples as a proxy for "structure"—this is acknowledged but somewhat circular, as it uses code-derived features to identify structured math data. The paper does not systematically sweep replacement ratios for scaffolds, making it difficult to assess sensitivity. The 200B training tokens per configuration (not the full 10T) may not fully capture long-horizon interactions that emerge at larger scales. Additionally, the causal language around "competition" and "negative coupling" should be interpreted carefully—these are fixed-budget substitution effects, not pure causal claims, though the authors are mostly careful about this distinction.
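To make the missing ratio sweep concrete, here is a rough sketch of how scaffold density could be varied inside a fixed math budget. It is not the authors' pipeline: the `structure_score` values stand in for whatever the FastText-based classifier outputs, and the function name, threshold, and toy documents are all assumptions made for illustration.

```python
def build_math_subset(docs, budget_tokens, scaffold_ratio, threshold=0.5):
    """Select math documents under a fixed token budget.

    docs: list of (doc_id, n_tokens, structure_score) tuples, where
          structure_score is a hypothetical classifier output in [0, 1].
    scaffold_ratio: share of the budget reserved for high-scoring
          ("scaffold") documents; the rest is filled with other math data.
    Returns (chosen_ids, tokens_used).
    """
    scaffolds = sorted((d for d in docs if d[2] >= threshold), key=lambda d: -d[2])
    plain = [d for d in docs if d[2] < threshold]

    chosen, used = [], 0
    scaffold_cap = int(budget_tokens * scaffold_ratio)

    # First fill the scaffold share, highest structure score first.
    for doc_id, n_tokens, _ in scaffolds:
        if used + n_tokens <= scaffold_cap:
            chosen.append(doc_id)
            used += n_tokens

    # Then top up with non-scaffold math data until the budget is reached.
    for doc_id, n_tokens, _ in plain:
        if used + n_tokens <= budget_tokens:
            chosen.append(doc_id)
            used += n_tokens

    return chosen, used


# Toy documents: (id, token count, hypothetical structure score).
docs = [("a", 100, 0.9), ("b", 100, 0.8), ("c", 100, 0.2),
        ("d", 100, 0.1), ("e", 100, 0.7)]

# A sweep over replacement ratios, the kind of sensitivity check noted as missing.
for ratio in (0.0, 0.5, 1.0):
    ids, used = build_math_subset(docs, budget_tokens=300, scaffold_ratio=ratio)
    print(ratio, ids, used)
```

In a sweep like this, each ratio would correspond to one training run, which is what makes a full sensitivity analysis expensive at pretraining scale.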

Potential Impact

The practical implications are significant for the LLM training community. If the findings generalize, they suggest that:

1. Data taxonomy matters more than volume: Simply adding more code to pretraining is suboptimal; careful separation of executable code from mixed-format reasoning traces is essential.

2. Targeted data composition beats broad domain mixing: The cognitive scaffolding approach—increasing density of structured reasoning samples within a fixed domain budget—offers a practical lever for improving complex reasoning without additional compute.

3. Cross-domain trade-offs are real and measurable: The expert-routing analysis provides mechanism-level evidence that data composition shapes internal model specialization, giving practitioners diagnostic tools for understanding domain competition.
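As a rough illustration of the routing diagnostic in point 3, the sketch below compares how often each expert is activated for tokens from two domains. The assessment does not say which statistic the paper uses, so the Jensen-Shannon divergence here is an assumption, and the routing traces are toy data rather than real model outputs.

```python
import math
from collections import Counter


def expert_distribution(expert_ids, num_experts):
    """Fraction of tokens routed to each expert, as a list of length num_experts."""
    if not expert_ids:
        raise ValueError("expert_ids must be non-empty")
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy routing traces: the expert index chosen for each token (hypothetical).
code_tokens = [0, 0, 1, 1]
math_tokens = [2, 2, 3, 3]
code_nl_tokens = [0, 1, 2, 3]

num_experts = 4
p_code = expert_distribution(code_tokens, num_experts)
p_math = expert_distribution(math_tokens, num_experts)
p_code_nl = expert_distribution(code_nl_tokens, num_experts)

print("code vs math:   ", js_divergence(p_code, p_math))
print("code vs code_nl:", js_divergence(p_code, p_code_nl))
```

Under this kind of measure, a low divergence between two domains would suggest shared experts, while a high divergence would suggest separate specialization. Whether either pattern corresponds to competition or synergy still has to be checked against downstream scores.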

These insights directly inform industrial pretraining pipelines. The finding that code hurts complex math (e.g., -71.53% on Minerva-Math, -47.16% on OlympiadBench) under fixed budgets is particularly actionable for teams building math-focused models.

Timeliness & Relevance

This paper addresses a timely question. As foundation models scale and data curation becomes a primary bottleneck, understanding *what* in training data drives *which* capabilities is critical. The "code improves reasoning" narrative has become conventional wisdom, influencing how major labs allocate their training budgets. Providing a more nuanced picture—that structured reasoning traces, not executable code per se, drive reasoning gains—is a valuable corrective.

The work also connects to the growing interest in data-centric AI, where the focus shifts from architectural innovation to understanding data properties. The cognitive scaffolding concept aligns with recent work on chain-of-thought and process-based reasoning, suggesting that exposing intermediate reasoning structure during pretraining (not just fine-tuning) is beneficial.

Strengths

Scale and realism: 10T-token corpus with industrial-grade data pipeline; results validated across dense (1B, 5B) and MoE (16, 32, and 64 experts) model configurations

Fine-grained domain separation: The Code vs. Code-NL distinction is the paper's key methodological innovation, revealing a confound in prior work

Comprehensive evaluation: 30+ benchmarks across 5 capability dimensions with training curves over tokens

Mechanistic evidence: MoE routing analysis provides interpretable evidence for domain competition/synergy beyond just downstream metrics

Practical actionability: The cognitive scaffolding approach is simple, scalable (FastText-based), and immediately applicable

Limitations

Limited model scale: While multiple architectures are tested, the largest models are relatively small by current standards; competitive dynamics may differ at 70B+ scale

Scaffold selection circularity: Using code-trained classifiers to find "structured" math data conflates structural formatting with reasoning depth

No systematic ratio sweep: The paper doesn't optimize the scaffold replacement ratio, limiting prescriptive power

Reproducibility concerns: Despite claiming public availability, the 10T corpus and exact data pipeline may be difficult to fully replicate

Limited theoretical grounding: The "negative coupling" framework is descriptive rather than predictive; no formal model explains when domains will compete vs. cooperate

Single training objective: All experiments use standard autoregressive pretraining; interactions may differ under instruction tuning or RLHF

Overall Assessment

This is a well-executed empirical study that provides important nuance to a widely accepted claim about code and reasoning. The fine-grained domain separation is the key contribution, and the cognitive scaffolding concept, while preliminary, points toward more principled data-centric optimization. The main limitation is the lack of theoretical depth and the relatively modest model scales, but the practical relevance and thoroughness of the experiments make this a valuable contribution.

Rating: 7.3 / 10

Significance 7.5 · Rigor 7 · Novelty 7 · Clarity 7.5

Generated May 20, 2026

Comparison History (22)

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gemini-3.1 · 5/21/2026

Paper 2 investigates a fundamental question about pretraining data composition, challenging the widely held belief that pure code training improves general reasoning. By showing that structured reasoning traces drive these improvements, it provides actionable insights for optimizing foundation model training. This deep methodological understanding will likely have a broader and more lasting impact across the field than Paper 1, which introduces an evaluation benchmark that, while valuable, may become obsolete as models rapidly evolve.

vs. DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation

gemini-3.1 · 5/21/2026

Paper 2 provides fundamental insights into pretraining data composition by refining the prevalent assumption that pure code improves general reasoning. Its findings on structured reasoning signals will directly influence how foundation models are trained globally, offering broader and more profound methodological impact compared to the introduction of a new benchmark in Paper 1.

vs. Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact: it addresses a central, timely question in foundation-model training (whether code causally improves general reasoning) using large-scale controlled pretraining with domain-separated data, yielding actionable guidance for data-centric optimization and revealing trade-offs across capabilities. Its conclusions affect broad LM development (math, knowledge, programming) and can influence corpus design and training strategies across many labs. Paper 2 is rigorous and useful for MLLM hallucination mitigation, but its scope is narrower (modality-conflict cases, inference-time head interventions) and may generalize less broadly than Paper 1’s data-composition findings.

vs. HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning affect language model capabilities, with implications across the entire LLM training ecosystem. Its findings—that structured reasoning traces matter more than executable code for reasoning, and its mechanistic analysis of expert activation patterns—provide actionable insights for foundation model development at scale. Paper 1, while practically valuable for a specific region, addresses a narrow geographic application with standard ML methods (RF + XGBoost). Paper 2's breadth of impact across AI research, its methodological rigor with controlled 10T-token experiments, and its timeliness given the centrality of data mixture optimization in LLM development give it substantially higher scientific impact.

vs. Explainable Wastewater Digital Twins: Adaptive Context-Conditioned Structured Simulators with Self-Falsifying Decision Support

gpt-5.2 · 5/20/2026

Paper 1 likely has higher scientific impact due to broad relevance to foundation model training and data-centric optimization: it challenges a common assumption (code improves general reasoning), provides controlled large-scale pretraining evidence, and offers actionable insights (structured reasoning traces, density trade-offs) with mechanism-level routing analysis. Its findings can influence LM dataset design across many domains. Paper 2 is methodologically strong and practically valuable for wastewater control with safety guarantees, but its impact is more domain-specific and likely narrower across fields.

vs. When Skills Don't Help: A Negative Result on Procedural Knowledge for Tool-Grounded Agents in Offensive Cybersecurity

gemini-3.1 · 5/20/2026

Paper 1 challenges and clarifies a widely held assumption in foundation model pretraining—that code data universally improves general reasoning. By identifying that structured reasoning traces, rather than pure code, drive these gains, it offers highly impactful, actionable insights for data curation and LLM training. Paper 2 presents a valuable but narrower negative result regarding agent skills in high-feedback environments, making its broader scientific impact more limited compared to the fundamental LLM pretraining findings in Paper 1.

vs. Interference-Aware Multi-Task Unlearning

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning signals contribute to LLM capabilities, with experiments at massive scale (10T tokens). Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for foundation model training data composition across the entire field. Paper 1 tackles a more niche problem (multi-task unlearning) with solid but incremental contributions limited to computer vision benchmarks. Paper 2's insights are more timely, broadly applicable, and likely to influence how major labs design pretraining data mixtures.

vs. Evaluating the Utility of Personal Health Records in Personalized Health AI

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities during pretraining, providing mechanistic insights via expert-activation patterns. Its findings—that structured reasoning traces rather than executable code drive reasoning improvements—have broad implications for data-centric optimization of foundation models across the entire AI field. Paper 1, while valuable for health AI applications, is more applied and incremental, evaluating an existing LLM on PHR-augmented queries. Paper 2's insights will likely influence training data strategies for many future models, giving it broader and deeper scientific impact.

vs. EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code and structured reasoning signals affect LLM capabilities, with experiments at massive scale (10T tokens). Its findings about data composition for training foundation models have broad implications for the entire LLM training paradigm and will influence how future models are built. Paper 1 offers an incremental improvement (2.1% AUC) in deepfake detection using emotion cues—a narrower problem with limited generalization gains. Paper 2's mechanistic insights and practical guidance for data-centric optimization have far broader impact across AI research.

vs. Using Aristotle API for AI-Assisted Theorem Proving in Lean 4: A Formalisation Case Study of the Grasshopper Problem

gpt-5.2 · 5/20/2026

Paper 2 has higher potential impact due to broader relevance and stronger real-world applicability: it provides controlled large-scale pretraining evidence about what data actually improves mathematical reasoning, with actionable guidance for dataset design and mechanistic routing analyses. Its conclusions can influence foundation-model training across many domains (math, code, general reasoning) and are timely for current LLM development. Paper 1 is a valuable, reproducible case study for AI-assisted formalization, but it is narrower in scope and reports a partially unresolved proof, limiting immediate methodological and practical impact beyond the theorem-proving community.

vs. POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

gemini-3.1 · 5/20/2026

Paper 1 addresses a foundational and widely debated topic in LLM pretraining: the role of code in developing reasoning capabilities. By conducting massive-scale (10T tokens) controlled experiments, it challenges prevailing assumptions and provides actionable insights into data-centric optimization (structured reasoning signals vs. pure code). This will likely broadly influence how next-generation foundation models are trained. While Paper 2 introduces a valuable privacy benchmark, Paper 1's fundamental insights into model cognition and pretraining data dynamics offer a wider, more transformative impact on the trajectory of AI capability research.

vs. Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about LLM training data composition and mathematical reasoning at scale (10T tokens), with findings that challenge a widely-held assumption (code improves general reasoning). Its insights about structured reasoning traces vs. executable code have broad implications for foundation model training across the industry. Paper 1, while methodologically rigorous, addresses a narrower problem (skill library management in self-evolving agents) with a smaller community of practitioners. Paper 2's mechanistic analysis via expert-activation patterns and practical data-centric optimization strategies give it wider applicability and timeliness.

vs. KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

gpt-5.2 · 5/20/2026

Paper 2 has higher potential impact due to broader relevance and timeliness: it addresses foundation-model pretraining data composition at 10T-token scale, clarifying a widely held assumption (code improves reasoning) and providing actionable, data-centric strategies for improving mathematical reasoning with mechanism-level evidence (routing/expert activations). Its findings can influence LM training practices across many domains and applications. Paper 1 is solid and practically useful for wearable HAR, but its scope is narrower and primarily advances a specific application/architecture combination.

vs. Controllable User Simulation

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about how code affects LLM reasoning—a topic central to the massive ongoing investment in foundation model training. Its findings that structured reasoning traces (not executable code) drive reasoning gains, combined with practical data-centric optimization strategies validated at 10T-token scale, have immediate broad impact on how the entire field approaches pretraining data composition. Paper 1, while theoretically rigorous in formalizing controllable simulation as causal inference, addresses a narrower problem in conversational agent evaluation with a more specialized audience.

vs. Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

claude-opus-4.6 · 5/20/2026

Paper 2 addresses a fundamental question about what drives reasoning improvements in LLMs, providing systematic evidence through large-scale controlled experiments on a 10T-token corpus. Its findings—that structured reasoning traces, not executable code per se, improve mathematical reasoning—have broad implications for LLM pretraining data strategies across the entire field. Paper 1, while methodologically sound, focuses on a narrower adversarial attack technique. Paper 2's insights into data-centric optimization and cross-domain interactions will likely influence foundation model training practices more broadly and durably.

vs. Parallel Prefix Verification for Speculative Generation

gemini-3.1 · 5/20/2026

Paper 1 addresses a fundamental question about how LLMs acquire reasoning capabilities, challenging the prevailing assumption that pure code drives these gains. By identifying structured reasoning traces as the true catalyst and providing mechanistic evidence, it offers transformative insights for foundation model pre-training. While Paper 2 provides a valuable inference acceleration technique, Paper 1's discoveries regarding data composition and cognitive scaffolding have a deeper, more foundational scientific impact on AI development.

vs. Conflict-Resilient Multi-Agent Reasoning via Signed Graph Modeling

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental question about what drives reasoning improvements in foundation models—a topic central to the entire LLM training pipeline. Its controlled 10T-token pretraining experiments provide causal evidence that structured reasoning traces, not executable code per se, drive reasoning gains. This insight has broad implications for data curation strategies across the industry. Paper 2 presents a solid engineering contribution (signed graphs for multi-agent reasoning) but is more incremental, building on existing MAS frameworks with a specific architectural modification. Paper 1's findings reshape understanding of pretraining data composition, affecting a wider research community.

vs. Towards Conversational Medical AI with Eyes, Ears and a Voice

gpt-5.2 · 5/20/2026

Paper 2 has higher likely scientific impact due to its strong real-world applicability (telemedicine decision support), timeliness (multimodal, real-time clinical AI), and broader cross-field influence spanning ML, HCI, and clinical medicine. It proposes a novel low-latency dual-agent multimodal system and introduces a task-relevant evaluation (TelePACES) with a sizable randomized crossover simulation study, lending methodological credibility. Paper 1 is rigorous and valuable for data-centric LLM training insights, but its impact is more specialized to model training/benchmarking and less directly transformative for high-stakes deployment.

vs. State Contamination in Memory-Augmented LLM Agents

gpt-5.2 · 5/20/2026

Paper 2 likely has higher impact due to timely relevance to deployed memory-augmented LLM agents and safety. It introduces a concrete, broadly applicable failure mode (memory laundering), a measurable metric (SPG), and actionable mitigation guidance tied to intervention placement, making it immediately useful for real-world systems. The phenomenon generalizes across agent architectures and connects to security, alignment, and HCI. Paper 1 is rigorous and valuable for data-centric training insights, but its impact is more specialized to pretraining data composition debates and may translate more slowly into deployment practices.

vs. Harnessing LLM Agents with Skill Programs

claude-opus-4.6 · 5/20/2026

Paper 1 addresses a fundamental question about why code in pretraining improves reasoning, providing controlled experiments at scale (10T tokens) that challenge a widely-held assumption. Its findings—that structured reasoning traces, not executable code per se, drive reasoning gains—have broad implications for data-centric optimization across the entire foundation model training community. Paper 2 presents a useful but more incremental framework for LLM agents. Paper 1's mechanistic insights and practical guidance for training data composition will likely influence a wider range of future research and industry practices.