Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

Kohsei Matsutani, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

#431 of 2682 · Artificial Intelligence
Tournament Score: 1488±48 · Win Rate: 75% (15 wins, 5 losses, 20 matches)
Rating: 5.8/10

Abstract

Large language models (LLMs) can now solve complex problems through long chain-of-thought (CoT) reasoning, but the trade-off between performance and token cost remains a central challenge. To address this issue, supervised fine-tuning (SFT) often uses compressed reasoning data, where CoT traces are shortened into compact forms. However, the effect of such compressed reasoning data on post-training remains poorly understood. In this paper, we propose a taxonomy of CoT consisting of Explicit CoT, which outputs all operations without aggregation, Composed CoT, which combines multiple operations into a single step, and Implicit CoT, which omits intermediate operations. We construct a synthetic compositional reasoning task that allows controlled variation of difficulty, compression granularity, and data size, and conduct a comprehensive set of experiments across different model families and sizes. Notably, we find that (i) coarser CoT requires more SFT data, (ii) compared with Explicit CoT, Composed CoT and Implicit CoT benefit more from data scaling, while Composed CoT benefits from data repetition and Implicit CoT tends to lead to memorization, (iii) unlike SFT, subsequent reinforcement learning (RL) with verifiable rewards (RLVR) decomposes compressed steps learned during SFT, and (iv) unidirectional CoT ordering shows stronger generalization on longer sequential tasks. Our findings provide implications for CoT design under data resource constraints and offer important insights into the mechanisms of SFT and RL in LLM post-training.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training"

1. Core Contribution

This paper introduces a taxonomy of chain-of-thought (CoT) reasoning formats—Explicit CoT (fully decomposed), Composed CoT (multiple operations bundled but explicitly listed), and Implicit CoT (intermediate operations omitted)—and systematically studies how these compression formats affect supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). The core novelty lies in constructing a controlled synthetic testbed that independently varies difficulty (number of operations), compression granularity, data size, and CoT type, enabling precise isolation of each factor's effect.
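
To make the three formats concrete, here is a minimal sketch of how one problem could be rendered in each style. It assumes a toy chain of add and multiply operations under modulus 23 (the modulus comes from the task description in this assessment). The operation set, the trace wording, and the names `apply_op` and `render_cot` are hypothetical illustrations, not the authors' data format.

```python
MOD = 23  # modulus named in this assessment; everything else here is assumed


def apply_op(value, op):
    """Apply one (symbol, operand) operation modulo MOD."""
    symbol, operand = op
    if symbol == "+":
        return (value + operand) % MOD
    if symbol == "*":
        return (value * operand) % MOD
    raise ValueError(f"unknown operation: {symbol}")


def render_cot(start, ops, style, g=2):
    """Return the trace lines for one problem in the requested CoT style.

    explicit: one operation per step
    composed: g operations bundled per step, each still written out
    implicit: no intermediate steps, only the final answer
    """
    if style not in ("explicit", "composed", "implicit"):
        raise ValueError(f"unknown style: {style}")
    if style == "implicit":
        value = start
        for op in ops:
            value = apply_op(value, op)
        return [f"answer = {value}"]

    step = 1 if style == "explicit" else g
    lines, value = [], start
    for i in range(0, len(ops), step):
        expr = str(value)
        for op in ops[i:i + step]:
            expr = f"({expr} {op[0]} {op[1]})"
            value = apply_op(value, op)
        lines.append(f"{expr} mod {MOD} = {value}")
    lines.append(f"answer = {value}")
    return lines


if __name__ == "__main__":
    ops = [("+", 7), ("*", 3), ("+", 20), ("*", 2)]
    for style in ("explicit", "composed", "implicit"):
        print(style, render_cot(5, ops, style))
```

With these four operations the explicit trace has four steps, the composed trace (g=2) has two, and the implicit trace has none, which is the compression axis the taxonomy describes.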

The paper produces four principal findings: (i) coarser CoT granularity demands more SFT data; (ii) compared with Explicit CoT, compressed CoT (Composed and Implicit) benefits more from data scaling, and when data is repeated rather than scaled, Composed CoT still benefits while Implicit CoT tends toward memorization; (iii) RLVR can decompose compressed reasoning steps that SFT cannot; and (iv) unidirectional CoT orderings generalize better than hierarchical ones on longer sequential tasks.

2. Methodological Rigor

The experimental design is commendable in its controlled setup. The synthetic arithmetic task (sequential operations modulo 23) provides clean control over confounds that would arise in natural data. The authors test across multiple model families (Qwen2.5 and Llama-3) and sizes (0.5B to 14B), lending robustness to their claims.
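
As a rough illustration of the kind of control such a testbed offers, the sketch below samples sequential modulo-23 problems with a chosen number of operations and builds a training split and a longer-chain test split. The operation set (`+` and `*`), the chain lengths, the sample counts, and the function names are all assumptions made for this example and are not taken from the paper.

```python
import random

MOD = 23  # modulus named in this assessment


def sample_problem(num_ops, rng):
    """Draw a start value and a linear chain of num_ops operations (assumed op set)."""
    start = rng.randrange(MOD)
    ops = [(rng.choice("+*"), rng.randrange(1, MOD)) for _ in range(num_ops)]
    return start, ops


def solve(start, ops):
    """Evaluate the chain left to right, reducing modulo MOD after each operation."""
    value = start
    for symbol, operand in ops:
        if symbol == "+":
            value = (value + operand) % MOD
        else:
            value = (value * operand) % MOD
    return value


def build_split(lengths, per_length, seed):
    """Build a reproducible list of problems for the given chain lengths."""
    rng = random.Random(seed)
    data = []
    for n in lengths:
        for _ in range(per_length):
            start, ops = sample_problem(n, rng)
            data.append({"start": start, "ops": ops, "answer": solve(start, ops)})
    return data


# Hypothetical sizes: difficulty is the chain length, data size is per_length.
train = build_split(lengths=[2, 4, 6], per_length=100, seed=0)
ood_test = build_split(lengths=[8, 10], per_length=50, seed=1)
```

Because chain length and sample count are separate arguments, difficulty and data size can be varied independently, which is the property the assessment credits the synthetic setup with.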

However, several methodological concerns arise:

  • Synthetic-only evaluation: The entire study is conducted on a single synthetic task type—sequential modular arithmetic with linear dependency chains (in-degree and out-degree of 1). This severely limits generalizability claims. Real reasoning tasks involve branching, search, backtracking, and heterogeneous operation types.
  • OOD definition: Out-of-distribution is defined solely as longer compositional chains, not domain shift or structural variation. This is a narrow notion of generalization.
  • Limited statistical reporting: The paper does not report confidence intervals or significance tests. Given the stochastic nature of training, this is a notable gap.
  • RLVR decomposition analysis: The claim that RLVR "decomposes" compressed steps is supported primarily by response length dynamics and a few qualitative examples. A more rigorous analysis of the model's internal representations or systematic categorization of generated reasoning traces would strengthen this finding.

On the positive side, the decomposition experiment (Section 3.3) is cleverly designed: training on even-op tasks with g=2 and testing on odd-op tasks, which require a fractional (g=1) step, provides a clean test of whether models can break apart learned chunks.
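
One way to read the even/odd design is through how a chain splits into steps of size g. The sketch below is an interpretation of the description above, not the authors' code, and `chunk_ops` and `needs_decomposition` are hypothetical helpers. It shows that an even-length chain is covered entirely by two-operation steps, while an odd-length chain leaves a single-operation remainder that a model trained only on g=2 steps has never produced in isolation.

```python
def chunk_ops(ops, g):
    """Split a chain of operations into consecutive steps of at most g operations."""
    if g < 1:
        raise ValueError("g must be at least 1")
    return [ops[i:i + g] for i in range(0, len(ops), g)]


def needs_decomposition(num_ops, g):
    """True if a chain of this length cannot be covered by full g-sized steps alone."""
    return num_ops % g != 0


if __name__ == "__main__":
    even_chain = ["op1", "op2", "op3", "op4"]
    odd_chain = ["op1", "op2", "op3", "op4", "op5"]
    print(chunk_ops(even_chain, 2))  # every step bundles two operations
    print(chunk_ops(odd_chain, 2))   # the final step holds a single operation
```

Under this reading, success on odd-length chains requires the model to execute one operation from a learned pair on its own, which is what the assessment means by breaking apart learned chunks.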

    3. Potential Impact

    Practical implications: The findings offer actionable guidelines for practitioners designing CoT training data under resource constraints: use Composed CoT with repetition when data is limited; scale data diversity for compressed formats; prefer unidirectional CoT orderings for sequential tasks. These are directly applicable to the growing industry of reasoning model training.

    Theoretical implications: The RLVR decomposition finding contributes to the ongoing debate about whether RL discovers genuinely new solutions or merely sharpens existing distributions. The evidence that RLVR can decompose compressed reasoning chunks into atomic operations, enabling unseen compositions, supports an optimistic view of RL's generalization capacity. This connects meaningfully to the compositional generalization literature (Lake & Baroni, 2018; Yuan et al., 2026).

    Broader influence: The taxonomy itself (Explicit/Composed/Implicit) provides a useful conceptual framework that could be adopted by the community for characterizing reasoning compression methods. However, the real-world applicability remains unvalidated.

    4. Timeliness & Relevance

    The paper is highly timely. Token cost optimization for reasoning LLMs is an active bottleneck—especially as agentic LLM deployments scale. The paper directly addresses the practical question of how to train models on shorter reasoning traces without sacrificing generalization. The SFT-then-RL pipeline analysis is particularly relevant given current industry practices (e.g., DeepSeek-R1, OpenAI o1).

    The investigation of when data repetition helps versus hurts is also timely, connecting to recent findings (LIMO, s1) showing that small high-quality datasets with repetition can sometimes outperform larger datasets—but this paper shows this doesn't hold universally across CoT types.

    5. Strengths & Limitations

    Key Strengths:

  • Clean, well-controlled experimental design enabling causal-style reasoning about each variable
  • Comprehensive model coverage across families and scales
  • The decomposition experiment (even/odd op) is a particularly elegant design
  • Clearly stated takeaways that translate into practical guidelines
  • The RLVR finding—that on-policy exploration enables decomposition unavailable to SFT—is a genuinely interesting mechanistic insight
  • Code availability enhances reproducibility
    Notable Limitations:

  • Single task type: Sequential arithmetic with modular operations is a narrow proxy for real reasoning. Tasks involving search, planning, or multi-branch computation are not addressed.
  • No real-world validation: Claims about practical CoT design remain conjectural without evaluation on benchmarks like GSM8K, MATH, or code generation.
  • Linear chain structure only: The computational graph has in-degree and out-degree 1—a significant restriction acknowledged but not addressed.
  • Scalability questions: Whether findings hold for frontier-scale models (70B+) or more complex reasoning domains is unknown.
  • Hierarchical CoT comparison: The comparison in Section 3.4 is somewhat unfair, since the hierarchical format imposes fundamentally different cognitive demands (variable management, deferred evaluation) that may call for different training approaches rather than simply more data.
  • Missing analysis of error types: Understanding *how* models fail (e.g., arithmetic errors vs. structural errors vs. format compliance) would provide deeper mechanistic insight.
    Overall Assessment

    This is a solid empirical study that provides useful insights into an increasingly important practical problem. The experimental design is clean and the findings are clearly communicated. The RLVR decomposition result is the most novel and impactful finding, offering evidence for RL's capacity to discover compositional solutions beyond what SFT provides. However, the reliance on a single synthetic task type and the absence of real-world validation significantly constrain the paper's impact. The work is best viewed as a controlled preliminary investigation that establishes hypotheses requiring validation in more realistic settings.

    Rating: 5.8/10
    Significance 6 · Rigor 6.5 · Novelty 5.5 · Clarity 7.5

    Generated May 28, 2026

    Comparison History (20)

    vs. From Fact Overwriting to Knowledge Evolution: Causal Editing via On-Policy Self-Distillation
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses the fundamental and timely challenge of reasoning efficiency in LLMs through a rigorous, controlled experimental framework. Its systematic taxonomy of CoT compression types, combined with insights about SFT vs. RL dynamics, has broader applicability across the field. The findings about data scaling, memorization risks, and how RL decomposes compressed reasoning provide actionable guidance for practitioners. Paper 2 tackles an important but narrower problem in knowledge editing with a clever solution, but Paper 1's breadth of impact—spanning reasoning, post-training methodology, and data efficiency—gives it higher potential influence.

    vs. Cross-Entropy Games and Frost Training
    gemini-3.1 · 5/28/2026

    Paper 1 offers higher scientific impact due to its timeliness and broad applicability to the highly active field of LLM reasoning. By systematically analyzing how compressed Chain-of-Thought data affects supervised fine-tuning and reinforcement learning, it addresses a critical bottleneck: balancing reasoning capability with token efficiency. Its insights into data scaling and how RL decomposes compressed steps provide foundational guidelines for future LLM post-training. While Paper 2 presents a novel optimization trick inspired by jailbreaking, Paper 1's comprehensive taxonomy and rigorous evaluation of reasoning mechanisms will likely influence a wider range of core LLM development.

    vs. AIBuildAI-2: A Knowledge-Enhanced Agent for Automatically Building AI Models
    gemini-3.1 · 5/28/2026

    Paper 1 presents a highly impactful real-world application with broad cross-disciplinary reach, enabling non-experts in fields like biology and physics to automatically build state-of-the-art AI models. While Paper 2 offers valuable foundational insights into LLM reasoning compression, Paper 1's AIBuildAI-2 agent addresses a critical bottleneck in applied scientific discovery. Its novel evolving knowledge system and verifiable top-tier performance on MLE-Bench demonstrate exceptional potential to accelerate research across diverse domains, giving it a broader overall scientific impact.

    vs. Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
    gpt-5.2 · 5/28/2026

    Paper 1 likely has higher scientific impact due to stronger novelty and broader real-world applicability: it bridges formal methods (e.g., LTL semantics) with LLM auditing/monitoring/intervention, addressing urgent governance and safety needs across many deployed AI systems. The contribution spans offline auditing, runtime predictive monitoring, and intervention—actionable tools for compliance with regulations and norms—making it timely and cross-disciplinary (AI safety, software verification, policy). Paper 2 is rigorous and useful for post-training efficiency, but its impact is more confined to LLM training practice and synthetic-task findings.

    vs. Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact because it identifies a counterintuitive, safety- and decision-critical inverse-scaling phenomenon (more capable LLMs making worse distributional forecasts) in domains with tail risk, validated across simulated benchmarks and multiple real-world datasets. It also surfaces an evaluation failure mode (threshold metrics masking upper-tail errors) with immediate implications for forecasting benchmarks and deployment in finance/epidemiology. While Paper 1 is methodologically careful and useful for LLM training practice, its impact is more specialized to post-training CoT compression choices, whereas Paper 2 affects evaluation, reliability, and high-stakes applications broadly.

    vs. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical and systemic vulnerability in AI research: the validity of agent benchmarks. By demonstrating that popular benchmarks can be easily reward-hacked without solving tasks and providing an automated framework to patch these flaws, it directly impacts how the entire field evaluates and directs AI progress. This broad applicability, combined with its high relevance to AI safety and evaluation rigor, gives it a higher potential scientific impact than Paper 1's narrower focus on CoT data compression mechanics.

    vs. Causal Algorithmic Recourse: Foundations and Methods
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a highly timely and practically important problem—understanding how compressed reasoning data affects LLM post-training—which is central to the rapidly growing field of LLM reasoning optimization. Its systematic taxonomy (Explicit, Composed, Implicit CoT), comprehensive experimental design, and novel insights about RL decomposing compressed steps have broad implications for the large community working on LLM training efficiency. Paper 2 makes solid theoretical contributions to algorithmic recourse but targets a narrower community. The explosive growth of LLM research and the practical cost implications of Paper 1's findings give it higher potential impact.

    vs. MIRA: A Bilingual Benchmark for Medical Information Response Audit
    gpt-5.2 · 5/28/2026

    Paper 2 likely has higher scientific impact due to broader applicability and timeliness: it studies compressed chain-of-thought—a central, widely used post-training technique affecting cost/performance across many LLM deployments. It proposes a clear taxonomy, uses controlled synthetic tasks to vary key factors, and runs extensive experiments across model families/sizes, yielding mechanistic insights (e.g., SFT vs RLVR behavior, memorization risks) with actionable guidance. Paper 1 is novel and valuable for medical safety evaluation, but its scope is narrower (health QA, 60 questions) and impact is more domain-specific.

    vs. Measuring Cross-Modal Synergy: A Benchmark for VLM Explainability
    gemini-3.1 · 5/28/2026

    Paper 1 addresses one of the most pressing challenges in modern AI: optimizing Chain-of-Thought (CoT) reasoning and post-training (SFT/RL) for LLMs. By providing actionable insights into how reasoning data compression affects capabilities and RLVR, it directly informs the development of next-generation reasoning models. While Paper 2 offers a rigorous and novel evaluation metric for VLM explainability, Paper 1 has broader, more immediate implications for scaling model capabilities, reducing inference costs, and advancing the foundational training paradigms of frontier LLMs, granting it higher overall scientific and practical impact.

    vs. MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental and broadly applicable challenge in LLM post-training: understanding how compressed reasoning data affects model performance. Its systematic taxonomy of CoT types, controlled experiments across model families, and novel findings about RL decomposing compressed steps provide foundational insights relevant to the entire LLM training community. Paper 2, while solid, addresses a narrower domain (chemical reaction diagram parsing) with incremental improvements on a specific benchmark. Paper 1's findings have broader implications for efficient training data design across all reasoning tasks.

    vs. EAPO: Entropy-Driven Adaptive Positive-Negative Sample Weighting for Policy Optimization in Open-Ended QA
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a highly fundamental and timely challenge in LLMs: the trade-off between reasoning performance and token cost. By establishing a taxonomy for CoT compression and providing empirical insights into how SFT and RLVR interact with compressed reasoning, it offers broad theoretical and practical implications for training next-generation reasoning models. Paper 2, while methodologically rigorous, focuses on a more specific algorithmic improvement (adaptive weighting in RL for open-ended QA), which has a narrower scope and potentially less foundational impact across the broader AI landscape.

    vs. PEAM: Parametric Embodied Agent Memory through Contrastive Internalization of Experience in Minecraft
    gpt-5.2 · 5/28/2026

    Paper 2 (PEAM) is more novel and broadly impactful: it proposes a concrete framework for continual, parametric skill memory in embodied agents, combining modular MoE-LoRA adapters, failure–correction contrastive internalization, and self-triggered consolidation—mechanisms with clear real-world relevance to robotics and interactive agents. It targets long-horizon autonomy and catastrophic forgetting, central open problems, and is timely given rapid growth in agentic LLMs. Paper 1 is rigorous and valuable for understanding CoT compression in post-training, but its impact is likely more incremental and narrower to LLM training practice.

    vs. The Illusion of Opting in AI-Mediated Consequential Decisions
    gemini-3.1 · 5/28/2026

    Paper 1 addresses a critical bottleneck in current AI development: the trade-off between LLM reasoning performance and token efficiency. By providing empirical insights into how compressed reasoning data affects supervised fine-tuning and reinforcement learning, it offers immediate, highly practical applications for optimizing LLMs. While Paper 2 presents a valuable ethical framework, Paper 1 demonstrates greater methodological rigor through controlled experiments and will likely have a much broader and more immediate technical impact on the rapidly moving field of AI engineering.

    vs. Tree of Thoughts as a Classical Heuristic Search Problem: Formal Foundations and Design Patterns
    claude-opus-4.6 · 5/28/2026

    Paper 1 provides novel empirical insights into how compressed reasoning data affects LLM post-training, with actionable findings about data scaling, memorization, and the interplay between SFT and RL. These results directly inform practical training strategies. Paper 2, while useful as a survey/taxonomy mapping ToT to classical search, is primarily a synthesis of existing work rather than presenting new methods or empirical discoveries. Paper 1's controlled experiments revealing mechanisms (e.g., RL decomposing compressed steps, data repetition effects) offer more original scientific contributions with broader implications for the rapidly growing field of reasoning LLMs.

    vs. Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
    gemini-3.1 · 5/28/2026

    Paper 1 provides a fundamental, mechanistic understanding of how compressed reasoning data influences LLM post-training. By systematically analyzing supervised fine-tuning (SFT) and reinforcement learning (RL) dynamics through a novel Chain-of-Thought taxonomy, it addresses core theoretical gaps in model training. While Paper 2 offers a strong practical application for inference acceleration, Paper 1's rigorous methodological design and foundational insights into training dynamics will likely have a broader, longer-lasting scientific impact on how future reasoning models are systematically developed and optimized.

    vs. MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a fundamental and broadly applicable challenge in LLM training—understanding how compressed chain-of-thought reasoning data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT), controlled experiments, and actionable findings about SFT vs. RL dynamics have wide implications for the entire LLM community. Paper 1, while addressing an interesting niche in cinematic multi-talker video generation benchmarking, targets a narrower domain with fewer downstream applications. Paper 2's insights into data efficiency, reasoning compression, and training mechanisms are more broadly impactful and timely.

    vs. A Conflict-Aware Penalty and Statistical Loss Framework for Balancing Modalities and Enhancing Stability in Multimodal Sentiment Analysis
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a more broadly impactful and timely problem—understanding how compressed chain-of-thought reasoning affects LLM post-training. It provides a systematic taxonomy, controlled experiments across model families/sizes, and novel insights about SFT vs. RL dynamics that are relevant to the entire LLM research community. Paper 1, while technically sound, addresses a narrower problem (multimodal sentiment analysis) with experiments on only one dataset (CMU-MOSI), limiting its generalizability and breadth of impact. Paper 2's findings have wider applicability to LLM training efficiency and reasoning optimization.

    vs. GraD-IBD: Graph Representation Learning from Diagnosis Trajectories for Early Detection of Inflammatory Bowel Disease
    gemini-3.1 · 5/28/2026

    Paper 2 addresses a critical bottleneck in foundational AI research: optimizing chain-of-thought reasoning to balance LLM performance and token cost. Its insights into how compressed reasoning data affects supervised fine-tuning and reinforcement learning have broad, immediate implications across the rapidly moving field of LLM post-training. While Paper 1 offers a valuable clinical application (IBD detection), Paper 2 provides fundamental methodological insights that will influence a significantly wider range of researchers and applications in AI.

    vs. TCP-MCP: Landscape-Guided Co-Evolution of Prompts and Communication Topologies for Multi-Agent Systems
    claude-opus-4.6 · 5/28/2026

    Paper 1 addresses a fundamental question about how compressed reasoning data affects LLM post-training, providing a systematic taxonomy and controlled experiments that yield broadly applicable insights (e.g., the finding that RL decomposes compressed SFT steps). These mechanistic insights are relevant to the entire LLM training community. Paper 2, while technically sound, addresses a narrower problem (co-evolving prompts and topologies for multi-agent systems) with incremental improvements on existing benchmarks. Paper 1's findings have broader implications for CoT design, data efficiency, and understanding SFT vs. RL dynamics.

    vs. Continual Model Routing in Evolving Model Hubs
    claude-opus-4.6 · 5/28/2026

    Paper 2 addresses a more fundamental and timely question about LLM reasoning efficiency—understanding how compressed chain-of-thought data affects post-training. Its systematic taxonomy (Explicit, Composed, Implicit CoT) and findings about SFT vs. RL dynamics have broad implications for the entire LLM training community. The insights about data scaling, memorization risks, and how RL decompresses compressed reasoning steps are novel and actionable. Paper 1, while addressing a practical problem in model routing, targets a narrower audience and a more incremental infrastructure challenge with less fundamental scientific contribution.