Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Abstract
Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.
AI Impact Assessments
Scientific Impact Assessment: "Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training"
1. Core Contribution
This paper introduces a taxonomy of chain-of-thought (CoT) reasoning formats—Explicit CoT (fully decomposed), Composed CoT (multiple operations bundled but explicitly listed), and Implicit CoT (intermediate operations omitted)—and systematically studies how these compression formats affect supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). The core novelty lies in constructing a controlled synthetic testbed that independently varies difficulty (number of operations), compression granularity, data size, and CoT type, enabling precise isolation of each factor's effect.
The paper produces four principal findings: (i) coarser CoT granularity demands more SFT data, (ii) Composed CoT and Implicit CoT benefit more from data scaling than Explicit CoT does, and Composed CoT also benefits from data repetition whereas Implicit CoT tends toward memorization under it, (iii) RLVR can decompose compressed reasoning steps that SFT cannot, and (iv) unidirectional CoT orderings generalize better than hierarchical ones on longer sequential tasks.
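To make the three formats concrete, the following is a minimal sketch of how a single chain of operations could be rendered at different granularities. Only the modulus 23 comes from the assessment's description of the synthetic task; the operation set, the step syntax, the `render_cot` helper, and the reading of Implicit CoT as "results only, operations left out" are all assumptions made for illustration, not the authors' data format.

```python
MOD = 23  # modulus named in the assessment; the rest of this sketch is assumed

# Hypothetical operation set; the source does not list the actual operations.
OPS = {
    "+": lambda x, k: (x + k) % MOD,
    "-": lambda x, k: (x - k) % MOD,
    "*": lambda x, k: (x * k) % MOD,
}

def render_cot(start, chain, g, show_ops=True):
    """Render a chain of (symbol, operand) pairs, bundling g operations per step.

    g=1 with show_ops=True   -> Explicit-style: one operation per step
    g>1 with show_ops=True   -> Composed-style: g operations listed, one result
    show_ops=False           -> one possible Implicit-style: results only
    Returns (list of step strings, final answer).
    """
    steps, x = [], start
    for i in range(0, len(chain), g):
        expr = str(x)
        for sym, k in chain[i:i + g]:
            x = OPS[sym](x, k)           # apply the operation modulo 23
            expr = f"({expr} {sym} {k})"  # record it in the step text
        steps.append(f"{expr} mod {MOD} = {x}" if show_ops else str(x))
    return steps, x

chain = [("+", 3), ("*", 4), ("-", 10), ("+", 7)]
explicit, answer = render_cot(5, chain, g=1)
composed, _ = render_cot(5, chain, g=2)
implicit, _ = render_cot(5, chain, g=2, show_ops=False)
```

In this sketch the same four-operation problem yields four Explicit-style steps, two Composed-style steps, and two bare intermediate values in the Implicit-style rendering, which illustrates how coarser formats shorten the trace while hiding more of the computation.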
2. Methodological Rigor
The experimental design is commendable in its controlled setup. The synthetic arithmetic task (sequential operations modulo 23) provides clean control over confounds that would arise in natural data. The authors test across multiple model families (Qwen2.5 and Llama-3) and sizes (0.5B to 14B), lending robustness to their claims.
However, several methodological concerns arise:
The decomposition experiment (Section 3.3) is cleverly designed: models are trained on even-op tasks with g=2 and tested on odd-op tasks, which cannot be covered by two-operation chunks alone and so require at least one single-operation (g=1) step. This provides a clean test of whether models can break apart learned chunks.
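The logic of this train/test split can be sketched in a few lines. The specific lengths below and the `chunk_sizes` helper are hypothetical, and placing the leftover operation at the end of the chain is an assumption; the source states only that training uses even-op tasks with g=2 and testing uses odd-op tasks.

```python
def chunk_sizes(n_ops, g=2):
    """Sizes of the steps when n_ops operations are grouped g at a time.

    Any remainder becomes a shorter final step (an assumption of this sketch).
    """
    full, rest = divmod(n_ops, g)
    return [g] * full + ([rest] if rest else [])

train_lengths = [2, 4, 6]  # hypothetical even-op training lengths
test_lengths = [3, 5, 7]   # hypothetical odd-op test lengths

# Step sizes the model sees during SFT versus those needed at test time.
seen_in_training = {s for n in train_lengths for s in chunk_sizes(n)}
needed_at_test = {s for n in test_lengths for s in chunk_sizes(n)}
unseen = needed_at_test - seen_in_training
```

Under these assumptions the only step size missing from training is the single-operation step, so solving an odd-op task requires the model to split a learned two-operation chunk into its parts.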
3. Potential Impact
Practical implications: The findings offer actionable guidelines for practitioners designing CoT training data under resource constraints: use Composed CoT with repetition when data is limited; scale data diversity for compressed formats; prefer unidirectional CoT orderings for sequential tasks. These are directly applicable to the growing industry of reasoning model training.
Theoretical implications: The RLVR decomposition finding contributes to the ongoing debate about whether RL discovers genuinely new solutions or merely sharpens existing distributions. The evidence that RLVR can decompose compressed reasoning chunks into atomic operations, enabling unseen compositions, supports an optimistic view of RL's generalization capacity. This connects meaningfully to the compositional generalization literature (Lake & Baroni, 2018; Yuan et al., 2026).
Broader influence: The taxonomy itself (Explicit/Composed/Implicit) provides a useful conceptual framework that could be adopted by the community for characterizing reasoning compression methods. However, the real-world applicability remains unvalidated.
4. Timeliness & Relevance
The paper is highly timely. Token cost optimization for reasoning LLMs is an active bottleneck—especially as agentic LLM deployments scale. The paper directly addresses the practical question of how to train models on shorter reasoning traces without sacrificing generalization. The SFT-then-RL pipeline analysis is particularly relevant given current industry practices (e.g., DeepSeek-R1, OpenAI o1).
The investigation of when data repetition helps versus hurts is also timely, connecting to recent findings (LIMO, s1) showing that small high-quality datasets with repetition can sometimes outperform larger datasets—but this paper shows this doesn't hold universally across CoT types.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
This is a solid empirical study that provides useful insights into an increasingly important practical problem. The experimental design is clean and the findings are clearly communicated. The RLVR decomposition result is the most novel and impactful finding, offering evidence for RL's capacity to discover compositional solutions beyond what SFT provides. However, the reliance on a single synthetic task type and the absence of real-world validation significantly constrain the paper's impact. The work is best viewed as a controlled preliminary investigation that establishes hypotheses requiring validation in more realistic settings.
Generated May 28, 2026
Comparison History (20)
Paper 1 addresses the fundamental and timely challenge of reasoning efficiency in LLMs through a rigorous, controlled experimental framework. Its systematic taxonomy of CoT compression types, combined with insights about SFT vs. RL dynamics, has broader applicability across the field. The findings about data scaling, memorization risks, and how RL decomposes compressed reasoning provide actionable guidance for practitioners. Paper 2 tackles an important but narrower problem in knowledge editing with a clever solution, but Paper 1's breadth of impact—spanning reasoning, post-training methodology, and data efficiency—gives it higher potential influence.
Paper 1 offers higher scientific impact due to its timeliness and broad applicability to the highly active field of LLM reasoning. By systematically analyzing how compressed Chain-of-Thought data affects supervised fine-tuning and reinforcement learning, it addresses a critical bottleneck: balancing reasoning capability with token efficiency. Its insights into data scaling and how RL decomposes compressed steps provide foundational guidelines for future LLM post-training. While Paper 2 presents a novel optimization trick inspired by jailbreaking, Paper 1's comprehensive taxonomy and rigorous evaluation of reasoning mechanisms will likely influence a wider range of core LLM development.
Paper 1 presents a highly impactful real-world application with broad cross-disciplinary reach, enabling non-experts in fields like biology and physics to automatically build state-of-the-art AI models. While Paper 2 offers valuable foundational insights into LLM reasoning compression, Paper 1's AIBuildAI-2 agent addresses a critical bottleneck in applied scientific discovery. Its novel evolving knowledge system and verifiable top-tier performance on MLE-Bench demonstrate exceptional potential to accelerate research across diverse domains, giving it a broader overall scientific impact.
Paper 1 likely has higher scientific impact due to stronger novelty and broader real-world applicability: it bridges formal methods (e.g., LTL semantics) with LLM auditing/monitoring/intervention, addressing urgent governance and safety needs across many deployed AI systems. The contribution spans offline auditing, runtime predictive monitoring, and intervention—actionable tools for compliance with regulations and norms—making it timely and cross-disciplinary (AI safety, software verification, policy). Paper 2 is rigorous and useful for post-training efficiency, but its impact is more confined to LLM training practice and synthetic-task findings.
Paper 2 likely has higher scientific impact because it identifies a counterintuitive, safety- and decision-critical inverse-scaling phenomenon (more capable LLMs making worse distributional forecasts) in domains with tail risk, validated across simulated benchmarks and multiple real-world datasets. It also surfaces an evaluation failure mode (threshold metrics masking upper-tail errors) with immediate implications for forecasting benchmarks and deployment in finance/epidemiology. While Paper 1 is methodologically careful and useful for LLM training practice, its impact is more specialized to post-training CoT compression choices, whereas Paper 2 affects evaluation, reliability, and high-stakes applications broadly.
Paper 2 addresses a critical and systemic vulnerability in AI research: the validity of agent benchmarks. By demonstrating that popular benchmarks can be easily reward-hacked without solving tasks and providing an automated framework to patch these flaws, it directly impacts how the entire field evaluates and directs AI progress. This broad applicability, combined with its high relevance to AI safety and evaluation rigor, gives it a higher potential scientific impact than Paper 1's narrower focus on CoT data compression mechanics.
Paper 1 addresses a highly timely and practically important problem—understanding how compressed reasoning data affects LLM post-training—which is central to the rapidly growing field of LLM reasoning optimization. Its systematic taxonomy (Explicit, Composed, Implicit CoT), comprehensive experimental design, and novel insights about RL decomposing compressed steps have broad implications for the large community working on LLM training efficiency. Paper 2 makes solid theoretical contributions to algorithmic recourse but targets a narrower community. The explosive growth of LLM research and the practical cost implications of Paper 1's findings give it higher potential impact.
Paper 2 likely has higher scientific impact due to broader applicability and timeliness: it studies compressed chain-of-thought—a central, widely used post-training technique affecting cost/performance across many LLM deployments. It proposes a clear taxonomy, uses controlled synthetic tasks to vary key factors, and runs extensive experiments across model families/sizes, yielding mechanistic insights (e.g., SFT vs RLVR behavior, memorization risks) with actionable guidance. Paper 1 is novel and valuable for medical safety evaluation, but its scope is narrower (health QA, 60 questions) and impact is more domain-specific.
Paper 1 addresses one of the most pressing challenges in modern AI: optimizing Chain-of-Thought (CoT) reasoning and post-training (SFT/RL) for LLMs. By providing actionable insights into how reasoning data compression affects capabilities and RLVR, it directly informs the development of next-generation reasoning models. While Paper 2 offers a rigorous and novel evaluation metric for VLM explainability, Paper 1 has broader, more immediate implications for scaling model capabilities, reducing inference costs, and advancing the foundational training paradigms of frontier LLMs, granting it higher overall scientific and practical impact.
Paper 1 addresses a fundamental and broadly applicable challenge in LLM post-training: understanding how compressed reasoning data affects model performance. Its systematic taxonomy of CoT types, controlled experiments across model families, and novel findings about RL decomposing compressed steps provide foundational insights relevant to the entire LLM training community. Paper 2, while solid, addresses a narrower domain (chemical reaction diagram parsing) with incremental improvements on a specific benchmark. Paper 1's findings have broader implications for efficient training data design across all reasoning tasks.
Paper 1 addresses a highly fundamental and timely challenge in LLMs: the trade-off between reasoning performance and token cost. By establishing a taxonomy for CoT compression and providing empirical insights into how SFT and RLVR interact with compressed reasoning, it offers broad theoretical and practical implications for training next-generation reasoning models. Paper 2, while methodologically rigorous, focuses on a more specific algorithmic improvement (adaptive weighting in RL for open-ended QA), which has a narrower scope and potentially less foundational impact across the broader AI landscape.
Paper 2 (PEAM) is more novel and broadly impactful: it proposes a concrete framework for continual, parametric skill memory in embodied agents, combining modular MoE-LoRA adapters, failure–correction contrastive internalization, and self-triggered consolidation—mechanisms with clear real-world relevance to robotics and interactive agents. It targets long-horizon autonomy and catastrophic forgetting, central open problems, and is timely given rapid growth in agentic LLMs. Paper 1 is rigorous and valuable for understanding CoT compression in post-training, but its impact is likely more incremental and narrower to LLM training practice.
Paper 1 addresses a critical bottleneck in current AI development: the trade-off between LLM reasoning performance and token efficiency. By providing empirical insights into how compressed reasoning data affects supervised fine-tuning and reinforcement learning, it offers immediate, highly practical applications for optimizing LLMs. While Paper 2 presents a valuable ethical framework, Paper 1 demonstrates greater methodological rigor through controlled experiments and will likely have a much broader and more immediate technical impact on the rapidly moving field of AI engineering.
Paper 1 provides novel empirical insights into how compressed reasoning data affects LLM post-training, with actionable findings about data scaling, memorization, and the interplay between SFT and RL. These results directly inform practical training strategies. Paper 2, while useful as a survey/taxonomy mapping ToT to classical search, is primarily a synthesis of existing work rather than presenting new methods or empirical discoveries. Paper 1's controlled experiments revealing mechanisms (e.g., RL decomposing compressed steps, data repetition effects) offer more original scientific contributions with broader implications for the rapidly growing field of reasoning LLMs.
Paper 1 provides a fundamental, mechanistic understanding of how compressed reasoning data influences LLM post-training. By systematically analyzing supervised fine-tuning (SFT) and reinforcement learning (RL) dynamics through a novel Chain-of-Thought taxonomy, it addresses core theoretical gaps in model training. While Paper 2 offers a strong practical application for inference acceleration, Paper 1's rigorous methodological design and foundational insights into training dynamics will likely have a broader, longer-lasting scientific impact on how future reasoning models are systematically developed and optimized.
Paper 2 addresses a fundamental and broadly applicable challenge in LLM training—understanding how compressed chain-of-thought reasoning data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT), controlled experiments, and actionable findings about SFT vs. RL dynamics have wide implications for the entire LLM community. Paper 1, while addressing an interesting niche in cinematic multi-talker video generation benchmarking, targets a narrower domain with fewer downstream applications. Paper 2's insights into data efficiency, reasoning compression, and training mechanisms are more broadly impactful and timely.
Paper 2 addresses a more broadly impactful and timely problem—understanding how compressed chain-of-thought reasoning affects LLM post-training. It provides a systematic taxonomy, controlled experiments across model families/sizes, and novel insights about SFT vs. RL dynamics that are relevant to the entire LLM research community. Paper 1, while technically sound, addresses a narrower problem (multimodal sentiment analysis) with experiments on only one dataset (CMU-MOSI), limiting its generalizability and breadth of impact. Paper 2's findings have wider applicability to LLM training efficiency and reasoning optimization.
Paper 2 addresses a critical bottleneck in foundational AI research: optimizing chain-of-thought reasoning to balance LLM performance and token cost. Its insights into how compressed reasoning data affects supervised fine-tuning and reinforcement learning have broad, immediate implications across the rapidly moving field of LLM post-training. While Paper 1 offers a valuable clinical application (IBD detection), Paper 2 provides fundamental methodological insights that will influence a significantly wider range of researchers and applications in AI.
Paper 1 addresses a fundamental question about how compressed reasoning data affects LLM post-training, providing a systematic taxonomy and controlled experiments that yield broadly applicable insights (e.g., the finding that RL decomposes compressed SFT steps). These mechanistic insights are relevant to the entire LLM training community. Paper 2, while technically sound, addresses a narrower problem (co-evolving prompts and topologies for multi-agent systems) with incremental improvements on existing benchmarks. Paper 1's findings have broader implications for CoT design, data efficiency, and understanding SFT vs. RL dynamics.
Paper 2 addresses a more fundamental and timely question about LLM reasoning efficiency—understanding how compressed chain-of-thought data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT) and findings about SFT vs. RL dynamics have broad implications for the entire LLM training community. The insights about data scaling, memorization risks, and how RL decompresses compressed reasoning steps are novel and actionable. Paper 1, while addressing a practical problem in model routing, targets a narrower audience and a more incremental infrastructure challenge with less fundamental scientific contribution.