Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor
Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang
Abstract
Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.
AI Impact Assessments
Scientific Impact Assessment: "Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor"
1. Core Contribution
The paper introduces Thinking as Compression (TaC), a paradigm that reframes the thinking traces of reasoning LLMs as compressed context for downstream inference. Rather than designing specialized compression modules (token pruners, soft-token encoders, relevance scorers), TaC leverages the model's intrinsic reasoning capability to distill long contexts into compact, query-conditioned traces. The key insight—that reasoning is inherently a compression process—is intuitive but underexplored in the compression literature.
The extended variant, TaC-C, adds a reward-driven optimization framework (via GRPO) with three components: (1) a utility reward based on downstream answerer performance, (2) a budget control reward with a soft tolerance window, and (3) an anti-hacking constraint to prevent the thinker from embedding answers directly in traces. The decoupled Thinker–Answerer architecture ensures the compressed trace must be self-sufficient.
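The three reward components described above could be combined roughly as follows. This is a minimal sketch, not the paper's formulation: the token-level F1 utility, the linear decay outside the tolerance window, the regex-based leak check, the weight `w_budget`, and every function name here are assumptions made for illustration.

```python
import re
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (a crude stand-in for a tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Utility term: token-level F1 between the answerer's output and the gold answer."""
    pred, ref = _tokens(prediction), _tokens(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def budget_reward(trace_len: int, budget: int, tolerance: float = 0.1) -> float:
    """Budget term: 1.0 inside the soft window, decaying linearly to 0 outside it.

    The window shape and decay rate are assumptions, not taken from the paper.
    """
    lo, hi = budget * (1 - tolerance), budget * (1 + tolerance)
    if lo <= trace_len <= hi:
        return 1.0
    gap = (lo - trace_len) if trace_len < lo else (trace_len - hi)
    return max(0.0, 1.0 - gap / budget)


def announces_answer(trace: str) -> bool:
    """Hypothetical anti-hacking check: flag traces that state a final answer outright."""
    return re.search(r"\b(final\s+)?answer\s*(is|:)", trace.lower()) is not None


def tac_c_reward(trace: str, prediction: str, gold: str,
                 budget: int, w_budget: float = 0.5) -> float:
    """Zero out leaking traces; otherwise add the utility and weighted budget terms."""
    if announces_answer(trace):
        return 0.0
    return token_f1(prediction, gold) + w_budget * budget_reward(len(_tokens(trace)), budget)
```

Here `prediction` stands for the answerer's output given only the trace, so the utility term rewards traces that are self-sufficient, while the leak check keeps the thinker from doing the answerer's job.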
2. Methodological Rigor
Strengths:
Concerns:
3. Potential Impact
The conceptual framing—reasoning as compression—is compelling and could influence how the community thinks about intermediate representations in LLMs. Several potential impact vectors:
However, the practical impact is tempered by the observation that TaC-C requires running a reasoning model on the full context first—the very operation that context compression aims to avoid. The value proposition is strongest when: (a) the trace is reused across multiple queries or models, or (b) the thinker is significantly smaller/cheaper than the answerer.
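The two conditions above can be made concrete with a back-of-the-envelope cost model. The sketch below is illustrative only and rests on several assumptions: cost is linear in the number of context tokens read, the cost of decoding the trace is ignored, and `break_even_reuses` and its parameters are hypothetical names rather than anything defined in the paper.

```python
def break_even_reuses(thinker_cost_per_token: float,
                      answerer_cost_per_token: float,
                      compression_ratio: float) -> float:
    """Number of trace reuses at which compressing first becomes cheaper.

    Without compression, each answerer call reads the full context (N tokens).
    With compression, the thinker reads N tokens once, and each answerer call
    reads only N / compression_ratio tokens. N cancels out of the comparison.
    """
    if compression_ratio <= 1:
        raise ValueError("compression_ratio must exceed 1 for any saving")
    saved_per_reuse = answerer_cost_per_token * (1 - 1 / compression_ratio)
    return thinker_cost_per_token / saved_per_reuse


# Same model as thinker and answerer at 4x compression:
# one use does not pay off, two uses do.
same_model = break_even_reuses(1.0, 1.0, 4)

# Thinker four times cheaper than the answerer at 8x compression:
# a single use already pays off.
cheap_thinker = break_even_reuses(0.25, 1.0, 8)
```

Under these assumptions a break-even value above 1 means a single-query deployment costs more with compression than without it, which is the concern raised above.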
4. Timeliness & Relevance
The paper is timely given the explosion of reasoning models (DeepSeek-R1, OpenAI o-series, Qwen3) and the growing need for efficient long-context processing. The idea of repurposing reasoning capabilities for compression aligns with the broader trend of discovering emergent abilities in LLMs. The connection between information-theoretic views of CoT and compression is intellectually stimulating and well-cited.
5. Strengths & Limitations
Key Strengths:
Notable Weaknesses:
Summary
This paper presents a creative and well-executed study connecting reasoning traces to context compression. The empirical results are strong within the evaluated setting, and the conceptual contribution is genuinely novel. However, the practical impact is limited by the requirement to process full context during trace generation, the narrow evaluation scope, and questions about whether "compression" is the right framing for what is essentially query-conditioned summarization optimized with RL. The work opens interesting research directions but needs broader validation to substantiate its claims.
Generated May 28, 2026
Comparison History (18)
Paper 1 reveals a fundamental and novel intrinsic capability of LLMs, bridging reasoning and context compression. This conceptual shift ('Thinking as Compression') has broad theoretical implications for how we understand and utilize LLM reasoning processes, likely inspiring cross-disciplinary research. While Paper 2 offers a highly practical systems architecture for serving agents, Paper 1's discovery of emergent compression behaviors offers deeper scientific insights into model mechanics and representation learning.
Paper 2 likely has higher impact: it uncovers a general internal mechanism (spontaneous topology reconstruction) with a theoretically grounded explanation (anisotropy/representation bottleneck) and offers a training-free, plug-and-play intervention (SLASH). This combination of interpretability + actionable method can transfer broadly to graph reasoning, molecular property prediction, and structured data understanding across many LLMs. Paper 1 is timely and useful for long-context efficiency, but it leans on prompting and reward optimization around “thinking traces,” which may be less generalizable and harder to standardize given variability and policy constraints around chain-of-thought.
Paper 1 introduces a highly novel paradigm by linking reasoning traces with context compression, revealing an intrinsic capability of thinking models. This conceptual leap offers broad utility for inference acceleration without specialized training modules. While Paper 2 addresses an important safety gap with a solid RL methodology, Paper 1's fundamental reinterpretation of reasoning as compression is more likely to inspire diverse downstream applications and broader follow-up research across the LLM efficiency, reasoning, and architecture communities.
Paper 2 introduces a fundamentally novel insight—that reasoning models inherently function as context compressors—which reframes two active research areas (chain-of-thought reasoning and context compression) under a unified lens. This conceptual bridge has broader impact across NLP, enabling practical inference acceleration without specialized modules. The strong empirical gains (17-23% improvements) and the paradigm's simplicity increase adoption potential. Paper 1 addresses an important but narrower problem (abstention under insufficient information) with a well-engineered but more incremental solution, limiting its cross-field impact compared to Paper 2's broader theoretical contribution.
Paper 1 presents a novel and practical paradigm (TaC) that repurposes reasoning models as context compressors, achieving substantial quantitative improvements (17-23% gains) over strong baselines. It addresses the highly relevant problem of long-context efficiency with a surprisingly simple yet effective approach, has immediate practical applications for LLM deployment, and reveals an unexpected connection between reasoning and compression. Paper 2 provides valuable mechanistic interpretability insights into reasoning circuits but has narrower applicability and less immediate practical impact, primarily confirming and extending existing understanding of attention head specialization.
Paper 2 presents a more novel and broadly applicable insight—that reasoning models inherently perform context compression—which reframes understanding of LLM thinking traces and has wide implications for efficiency across many LLM applications. The substantial performance improvements (17-23% gains) and the elegance of requiring no dedicated compression module make it highly impactful. Paper 1, while useful, addresses a narrower problem (claim-citation verification) with more incremental gains (+4.5 F1) using a relatively straightforward pipeline approach. Paper 2's conceptual contribution and breadth of applicability give it greater potential impact.
Paper 2 addresses a highly timely and widely relevant problem in AI (LLM context compression) by introducing a novel paradigm that leverages reasoning models. Its impact extends across the rapidly growing field of natural language processing and AI inference optimization. In contrast, Paper 1, while rigorous and offering new best solutions, focuses on a much more niche operations research problem (a specific variant of the facility location problem), limiting its breadth of impact compared to Paper 2.
Paper 1 addresses a highly timely and critical bottleneck in LLMs (long context compression) by leveraging the emerging paradigm of 'thinking' models. Its approach offers immediate, practical real-world applications for inference acceleration without requiring specialized compression modules. The substantial empirical improvements on standard benchmarks suggest broad and immediate utility. While Paper 2 tackles an important aspect of AI safety, its reliance on formal calculus for dynamic norms is more theoretical and may face broader adoption challenges in current deep learning-dominated pipelines.
While Paper 1 offers a practical engineering solution for LLM efficiency, Paper 2 tackles a fundamental theoretical debate in AI: whether LLMs build internal world models. By introducing a comprehensive benchmark and uncovering a universal 'reasoning cliff' akin to human working memory limits, Paper 2 provides profound cognitive and architectural insights that will likely drive foundational research in multimodal and augmented reasoning across the broader AI community.
Paper 2 is likely higher impact due to greater novelty and broader relevance: it reframes reasoning traces as an intrinsic context-compression mechanism, a timely problem for long-context LLM efficiency. It shows strong benchmark gains at high compression ratios and proposes a simple, general optimization (TaC-C) without dedicated compressors, making adoption easier across models and tasks. Paper 1 addresses important operational governance (learned abstention, auditability), but the approach is more domain/interface-oriented with narrower cross-field impact and less clear algorithmic generality beyond calibrated triage classification.
Paper 1 has higher estimated impact due to a more novel, structured approach (typed compositional code DAG with co-evolution of planner and tool library), strong methodological breadth (retrieval theory, reward shaping analysis, well-formedness), and broader applicability to tool-augmented agents, program synthesis, and scalable skill libraries under context limits. Its demonstrated scaling benefit (8B matching/exceeding 32B on GSM8K/MATH) suggests substantial real-world utility. Paper 2 is timely and practical for long-context acceleration, but relies more on prompting/rewarding existing “thinking traces,” with narrower scope and potentially higher reproducibility/privacy concerns.
Paper 1 has higher estimated scientific impact. It proposes a novel paradigm (Thinking as Compression) that leverages intrinsic LLM reasoning traces for long-context compression, removing the need for dedicated compressors and introducing a constrained optimization variant with strong benchmark gains. The approach is timely for scaling LLM inference, broadly applicable across many long-context tasks and systems, and likely to influence both model prompting/optimization and deployment practices. Paper 2 is rigorous and valuable for interpretable, standards-grounded reward design in building DRL, but its impact is more domain-specific and incremental relative to existing reward-shaping practices.
Paper 2 addresses a fundamental bottleneck in LLMs (long context inference) with a highly novel, broadly applicable paradigm that leverages reasoning models for context compression. Its impact spans across all domains using LLMs, making it highly timely and relevant. In contrast, Paper 1, while methodologically rigorous and valuable, is confined to the specific domain of materials synthesis, limiting its overall breadth of scientific impact.
Paper 1 provides a fundamental, mechanistic understanding of how compressed reasoning data influences LLM post-training. By systematically analyzing supervised fine-tuning (SFT) and reinforcement learning (RL) dynamics through a novel Chain-of-Thought taxonomy, it addresses core theoretical gaps in model training. While Paper 2 offers a strong practical application for inference acceleration, Paper 1's rigorous methodological design and foundational insights into training dynamics will likely have a broader, longer-lasting scientific impact on how future reasoning models are systematically developed and optimized.
Paper 1 offers a fundamental and highly novel insight by repurposing reasoning/thinking models as natural context compressors. This addresses a critical bottleneck in LLM inference (long context management) without requiring specialized modules. Its potential impact spans across almost all LLM applications. While Paper 2 presents a valuable benchmark and framework for audio-visual reasoning, Paper 1's approach has broader applicability, greater methodological simplicity, and tackles a more pervasive challenge in the current AI landscape.
Paper 2 introduces a highly novel conceptual shift by repurposing reasoning/thinking traces as context compressors. This innovative paradigm connects two critical areas of LLM research (reasoning and long-context efficiency) without requiring dedicated compression modules. While Paper 1 provides a valuable and rigorous benchmarking framework, Paper 2 offers a more foundational algorithmic insight that could broadly influence future architectures and optimization strategies for LLM inference.
Paper 1 is more novel in reframing “thinking traces” as an intrinsic, model-driven context compression mechanism, offering a broadly applicable paradigm for long-context efficiency without extra compressor models. It targets a timely bottleneck (long-context inference cost) with clear, sizable benchmark gains and a lightweight reward-based control method, likely generalizable across tasks beyond QA. Paper 2 is practically relevant, but offline RL for improving LLMs (including for code) is a more incremental extension of established RLHF/offline RL ideas, with impact narrower to code post-training and dependent on dataset/verification choices.
Paper 1 addresses the critical bottleneck of long-context LLM inference by introducing a highly novel and intuitive paradigm: repurposing reasoning traces as context compression. Its significant empirical gains and immediate, broad applicability to any long-context task give it higher potential for widespread adoption and real-world impact compared to Paper 2's more specialized focus on multi-stakeholder alignment.