Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor

Guoxin Ma, Yibing Liu, Chengzhengxu Li, Yu Liang, Yan Wang, Yueyang Zhang, Kecheng Chen, Zhaohan Zhang

May 27, 2026

arXiv:2605.28713v1

cs.AI (primary)

#513 of 2682 · Artificial Intelligence

Tournament Score: 1477 ± 49

Win Rate: 78%

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Abstract

Context compression aims to shorten long context inputs with minimal information loss for LLM inference acceleration. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model itself can naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, already outperforming most representative compression methods. Further, given that raw thinking output may struggle with budget control and shortcut behaviors, we introduce Thinking as Compression Constrained (TaC-C), leveraging a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match Score (EM), respectively.

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Thinking as Compression: Your Reasoning Model is Secretly a Context Compressor"

1. Core Contribution

The paper introduces Thinking as Compression (TaC), a paradigm that reframes the thinking traces of reasoning LLMs as compressed context for downstream inference. Rather than designing specialized compression modules (token pruners, soft-token encoders, relevance scorers), TaC leverages the model's intrinsic reasoning capability to distill long contexts into compact, query-conditioned traces. The key insight—that reasoning is inherently a compression process—is intuitive but underexplored in the compression literature.

The extended variant, TaC-C, adds a reward-driven optimization framework (via GRPO) with three components: (1) a utility reward based on downstream answerer performance, (2) a budget control reward with a soft tolerance window, and (3) an anti-hacking constraint to prevent the thinker from embedding answers directly in traces. The decoupled Thinker–Answerer architecture ensures the compressed trace must be self-sufficient.
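The three reward components described above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the paper's formulation: the function names, the linear decay outside the tolerance window, the `w_budget` weight, and the regex-based answer detector are all hypothetical, and the utility score (for example the answerer's F1 on the trace) is simply passed in as a number.

```python
import re

# Hypothetical heuristic: flag traces that explicitly announce an answer.
# The paper's actual anti-hacking check is not described in this review.
ANSWER_PATTERN = re.compile(r"\b(final answer|the answer is|answer:)", re.IGNORECASE)


def budget_reward(trace_len: int, budget: int, tolerance: float = 0.1) -> float:
    """Return 1.0 while the trace stays within budget * (1 + tolerance),
    then decay linearly to 0.0 as the overshoot approaches one full budget."""
    overshoot = trace_len - budget * (1.0 + tolerance)
    if overshoot <= 0:
        return 1.0
    return max(0.0, 1.0 - overshoot / budget)


def declares_answer(trace: str) -> bool:
    """True if the trace contains an explicit answer declaration."""
    return ANSWER_PATTERN.search(trace) is not None


def trace_reward(utility: float, trace_len: int, budget: int, trace: str,
                 w_budget: float = 0.5) -> float:
    """Combine utility and budget rewards; zero the reward for hacked traces."""
    if declares_answer(trace):
        return 0.0
    return utility + w_budget * budget_reward(trace_len, budget, tolerance=0.1)


if __name__ == "__main__":
    clean = "Paris is the capital of France; the Louvre is located in Paris."
    hacked = "The answer is Paris."
    print(trace_reward(0.8, 100, 100, clean))
    print(trace_reward(1.0, 20, 100, hacked))
```

Zeroing the whole reward for a flagged trace is one possible design; a subtractive penalty would be another. The review does not say which form the authors use.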

2. Methodological Rigor

Strengths:

The pilot study (TaC-Vanilla) provides useful empirical grounding for the central claim before introducing TaC-C, demonstrating that even prompt-only thinking traces outperform several baselines.

The ablation study (Table 5) cleanly isolates the contribution of each reward component, revealing that removing the anti-hack constraint leads to 71.67% hack rate while boosting raw scores—a genuinely informative finding.

The scaling study (Figure 4) across 1.7B–14B thinkers shows diminishing returns beyond 4B, which has practical implications.

Training dynamics (Figure 5) transparently show how reward components evolve during optimization.

Concerns:

The evaluation is limited to extractive/multi-hop QA benchmarks (NaturalQuestions, 2WikiMQA, HotpotQA, MuSiQue). These are well-suited but narrow; the paper does not test on summarization, code understanding, or more diverse long-context tasks despite acknowledging this limitation.

The "compression ratio" semantics differ from baselines. TaC-C generates new text rather than selecting/pruning existing tokens. Table 4 shows TaC-C achieves only 1.13% actual compression ratio on LoCoMo (essentially generating very short traces), which raises questions about fair comparison. The compressed output is fundamentally different in nature—it's generated text, not a subset of original tokens.

The end-to-end cost analysis is incomplete. TaC-C requires a forward pass through the Thinker on the full context to *generate* the trace, which itself involves processing the entire long context. The latency comparison in Table 4 appears to measure only the answerer's inference, not the thinker's generation cost. This is a significant omission for a method motivated by "inference acceleration."

The training set is relatively small (3,000 instances from the evaluation benchmarks), raising concerns about data leakage, even if train/test splits are properly maintained.
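To make the compression-ratio concern concrete, the arithmetic can be sketched as below. The sketch assumes the 1.13% figure from Table 4 means compressed length divided by original length, which this review does not state explicitly; the function names and the 2,000-token example are illustrative only.

```python
def compression_factor(original_tokens: int, compressed_tokens: int) -> float:
    """How many times shorter the compressed context is (e.g. 4.0 for 4x)."""
    return original_tokens / compressed_tokens


def retained_fraction(original_tokens: int, compressed_tokens: int) -> float:
    """Share of the original length that remains after compression."""
    return compressed_tokens / original_tokens


def token_budget(original_tokens: int, factor: int) -> int:
    """Token budget implied by a target compression factor."""
    return original_tokens // factor


if __name__ == "__main__":
    # A hypothetical 2,000-token context at the 4x and 8x settings.
    print(token_budget(2000, 4), token_budget(2000, 8))
    # Under the stated assumption, a retained fraction of 1.13%
    # corresponds to a compression factor of roughly 88x.
    print(round(1 / 0.0113, 1))
```

Read this way, a 1.13% retained fraction is far beyond the nominal 4x and 8x settings, which is why comparing it against token-pruning baselines at a fixed ratio needs care.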

3. Potential Impact

The conceptual framing—reasoning as compression—is compelling and could influence how the community thinks about intermediate representations in LLMs. Several potential impact vectors:

Practical deployment: If the thinker's generation cost is amortized or the thinker is small (4B suffices per scaling analysis), TaC-C could serve as a preprocessing step for RAG pipelines, generating reusable compressed contexts.

Cross-model transferability (Table 3) is a genuine advantage over soft-token methods, since TaC-C produces natural language traces consumable by any downstream model.

Connection to test-time compute literature: The paper bridges context compression and chain-of-thought reasoning, potentially inspiring work on optimizing reasoning traces for purposes beyond answer quality (e.g., memory, planning, tool use).

However, the practical impact is tempered by the observation that TaC-C requires running a reasoning model on the full context first—the very operation that context compression aims to avoid. The value proposition is strongest when: (a) the trace is reused across multiple queries or models, or (b) the thinker is significantly smaller/cheaper than the answerer.
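The two conditions above can be illustrated with a toy cost model. It is a sketch under simplifying assumptions: costs are abstract units per call, `thinker_cost` covers reading the full context and generating the trace, and every number in the example is invented for illustration rather than taken from the paper.

```python
import math
from typing import Optional


def direct_cost(n_queries: int, full_cost: float) -> float:
    """Answerer reads the full context for every query."""
    return n_queries * full_cost


def tac_cost(n_queries: int, thinker_cost: float, compressed_cost: float,
             reuse: bool = True) -> float:
    """Thinker builds the trace (once if reused), answerer reads the trace."""
    thinker_runs = 1 if reuse else n_queries
    return thinker_runs * thinker_cost + n_queries * compressed_cost


def break_even_queries(thinker_cost: float, full_cost: float,
                       compressed_cost: float) -> Optional[int]:
    """Smallest query count at which a reused trace is no more expensive
    than direct full-context answering; None if that never happens."""
    saving = full_cost - compressed_cost
    if saving <= 0:
        return None
    return math.ceil(thinker_cost / saving)


if __name__ == "__main__":
    # Case (a): an expensive thinker whose trace is reused across queries.
    print(break_even_queries(thinker_cost=12, full_cost=10, compressed_cost=2))
    # Case (b): a cheap thinker that is rerun for every query.
    print(tac_cost(3, thinker_cost=3, compressed_cost=2, reuse=False),
          direct_cost(3, full_cost=10))
```

In this model a thinker that costs more than the answerer's full-context pass only pays off through reuse, while a sufficiently cheap thinker can pay off on every query.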

4. Timeliness & Relevance

The paper is timely given the explosion of reasoning models (DeepSeek-R1, OpenAI o-series, Qwen3) and the growing need for efficient long-context processing. The idea of repurposing reasoning capabilities for compression aligns with the broader trend of discovering emergent abilities in LLMs. The connection between information-theoretic views of CoT and compression is intellectually stimulating and well-cited.

5. Strengths & Limitations

Key Strengths:

Novel and elegant conceptual framing that unifies reasoning and compression

Strong empirical results: 17–23% improvements over best baselines are substantial

Clean experimental design with proper ablations, scaling studies, and transferability analysis

The anti-hacking reward component addresses a real and nuanced problem in RL-trained language models

Practical simplicity: LoRA training with GRPO, no separate critic model

Notable Weaknesses:

Unfair efficiency comparison: The thinker must process the full context, so total compute is not reduced; it may even increase. The paper frames this as "compression" but the full-context processing still occurs.

Limited task diversity: Only QA benchmarks; the generality claim is aspirational rather than demonstrated.

The "compression" framing is somewhat misleading: TaC-C performs query-conditioned summarization/extraction with RL optimization. Calling it "compression" obscures important differences from methods that actually reduce computational cost at inference time.

Budget control semantics: At 8× compression, TaC-C achieves similar performance to 4×, raising questions about whether the budget constraint is truly binding or the traces are simply short by default.

Relatively short contexts: Average context lengths are 1,000–3,000 tokens (Table 6), which are not "long" by modern standards (128K+ context windows). Performance on truly long contexts is unexplored.

Summary

This paper presents a creative and well-executed study connecting reasoning traces to context compression. The empirical results are strong within the evaluated setting, and the conceptual contribution is genuinely novel. However, the practical impact is limited by the requirement to process full context during trace generation, the narrow evaluation scope, and questions about whether "compression" is the right framing for what is essentially query-conditioned summarization optimized with RL. The work opens interesting research directions but needs broader validation to substantiate its claims.

Rating: 6.5 / 10

Significance 6.5 · Rigor 6 · Novelty 7.5 · Clarity 7.5

Generated May 28, 2026

Comparison History (18)

vs. A Policy-Driven Runtime Layer for Agentic LLM Serving

gemini-3.1 · 5/28/2026

Paper 1 reveals a fundamental and novel intrinsic capability of LLMs, bridging reasoning and context compression. This conceptual shift ('Thinking as Compression') has broad theoretical implications for how we understand and utilize LLM reasoning processes, likely inspiring cross-disciplinary research. While Paper 2 offers a highly practical systems architecture for serving agents, Paper 1's discovery of emergent compression behaviors offers deeper scientific insights into model mechanics and representation learning.

vs. SLASH the Sink: Sharpening Structural Attention Inside LLMs

gpt-5.2 · 5/28/2026

Paper 2 likely has higher impact: it uncovers a general internal mechanism (spontaneous topology reconstruction) with a theoretically grounded explanation (anisotropy/representation bottleneck) and offers a training-free, plug-and-play intervention (SLASH). This combination of interpretability + actionable method can transfer broadly to graph reasoning, molecular property prediction, and structured data understanding across many LLMs. Paper 1 is timely and useful for long-context efficiency, but it leans on prompting and reward optimization around “thinking traces,” which may be less generalizable and harder to standardize given variability and policy constraints around chain-of-thought.

vs. Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories

gemini-3.1 · 5/28/2026

Paper 1 introduces a highly novel paradigm by linking reasoning traces with context compression, revealing an intrinsic capability of thinking models. This conceptual leap offers broad utility for inference acceleration without specialized training modules. While Paper 2 addresses an important safety gap with a solid RL methodology, Paper 1's fundamental reinterpretation of reasoning as compression is more likely to inspire diverse downstream applications and broader follow-up research across the LLM efficiency, reasoning, and architecture communities.

vs. Bridging the Detection-to-Abstention Gap in Reasoning Models under Insufficient Information

claude-opus-4.6 · 5/28/2026

Paper 2 introduces a fundamentally novel insight—that reasoning models inherently function as context compressors—which reframes two active research areas (chain-of-thought reasoning and context compression) under a unified lens. This conceptual bridge has broader impact across NLP, enabling practical inference acceleration without specialized modules. The strong empirical gains (17-23% improvements) and the paradigm's simplicity increase adoption potential. Paper 1 addresses an important but narrower problem (abstention under insufficient information) with a well-engineered but more incremental solution, limiting its cross-field impact compared to Paper 2's broader theoretical contribution.

vs. Revealing Algorithmic Deductive Circuits for Logical Reasoning

claude-opus-4.6 · 5/28/2026

Paper 1 presents a novel and practical paradigm (TaC) that repurposes reasoning models as context compressors, achieving substantial quantitative improvements (17-23% gains) over strong baselines. It addresses the highly relevant problem of long-context efficiency with a surprisingly simple yet effective approach, has immediate practical applications for LLM deployment, and reveals an unexpected connection between reasoning and compression. Paper 2 provides valuable mechanistic interpretability insights into reasoning circuits but has narrower applicability and less immediate practical impact, primarily confirming and extending existing understanding of attention head specialization.

vs. DeepSciVerify: Verifying Scientific Claim--Citation Alignment via LLM-Driven Evidence Escalation

claude-opus-4.6 · 5/28/2026

Paper 2 presents a more novel and broadly applicable insight—that reasoning models inherently perform context compression—which reframes understanding of LLM thinking traces and has wide implications for efficiency across many LLM applications. The substantial performance improvements (17-23% gains) and the elegance of requiring no dedicated compression module make it highly impactful. Paper 1, while useful, addresses a narrower problem (claim-citation verification) with more incremental gains (+4.5 F1) using a relatively straightforward pipeline approach. Paper 2's conceptual contribution and breadth of applicability give it greater potential impact.

vs. An Enhanced Large Neighborhood Search Approach for the Capacitated Facility Location Problem with Incompatible Customers

gemini-3.1 · 5/28/2026

Paper 2 addresses a highly timely and widely relevant problem in AI (LLM context compression) by introducing a novel paradigm that leverages reasoning models. Its impact extends across the rapidly growing field of natural language processing and AI inference optimization. In contrast, Paper 1, while rigorous and offering new best solutions, focuses on a much more niche operations research problem (a specific variant of the facility location problem), limiting its breadth of impact compared to Paper 2.

vs. Reasoning and Planning with Dynamically Changing Norms

gemini-3.1 · 5/28/2026

Paper 1 addresses a highly timely and critical bottleneck in LLMs (long context compression) by leveraging the emerging paradigm of 'thinking' models. Its approach offers immediate, practical real-world applications for inference acceleration without requiring specialized compression modules. The substantial empirical improvements on standard benchmarks suggest broad and immediate utility. While Paper 2 tackles an important aspect of AI safety, its reliance on formal calculus for dynamic norms is more theoretical and may face broader adoption challenges in current deep learning-dominated pipelines.

vs. Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

gemini-3.1 · 5/28/2026

While Paper 1 offers a practical engineering solution for LLM efficiency, Paper 2 tackles a fundamental theoretical debate in AI: whether LLMs build internal world models. By introducing a comprehensive benchmark and uncovering a universal 'reasoning cliff' akin to human working memory limits, Paper 2 provides profound cognitive and architectural insights that will likely drive foundational research in multimodal and augmented reasoning across the broader AI community.

vs. Auditable Decision Models with Learned Abstention and Real-Time Steering

gpt-5.2 · 5/28/2026

Paper 2 is likely higher impact due to greater novelty and broader relevance: it reframes reasoning traces as an intrinsic context-compression mechanism, a timely problem for long-context LLM efficiency. It shows strong benchmark gains at high compression ratios and proposes a simple, general optimization (TaC-C) without dedicated compressors, making adoption easier across models and tasks. Paper 1 addresses important operational governance (learned abstention, auditability), but the approach is more domain/interface-oriented with narrower cross-field impact and less clear algorithmic generality beyond calibrated triage classification.

vs. CoCoDA: Co-evolving Compositional DAG for Tool-Augmented Agents

gpt-5.2 · 5/28/2026

Paper 1 has higher estimated impact due to a more novel, structured approach (typed compositional code DAG with co-evolution of planner and tool library), strong methodological breadth (retrieval theory, reward shaping analysis, well-formedness), and broader applicability to tool-augmented agents, program synthesis, and scalable skill libraries under context limits. Its demonstrated scaling benefit (8B matching/exceeding 32B on GSM8K/MATH) suggests substantial real-world utility. Paper 2 is timely and practical for long-context acceleration, but relies more on prompting/rewarding existing “thinking traces,” with narrower scope and potentially higher reproducibility/privacy concerns.

vs. PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

gpt-5.2 · 5/28/2026

Paper 1 has higher estimated scientific impact. It proposes a novel paradigm (Thinking as Compression) that leverages intrinsic LLM reasoning traces for long-context compression, removing the need for dedicated compressors and introducing a constrained optimization variant with strong benchmark gains. The approach is timely for scaling LLM inference, broadly applicable across many long-context tasks and systems, and likely to influence both model prompting/optimization and deployment practices. Paper 2 is rigorous and valuable for interpretable, standards-grounded reward design in building DRL, but its impact is more domain-specific and incremental relative to existing reward-shaping practices.

vs. ProvMind: Provenance-grounded reasoning for materials synthesis

gemini-3.1 · 5/28/2026

Paper 2 addresses a fundamental bottleneck in LLMs (long context inference) with a highly novel, broadly applicable paradigm that leverages reasoning models for context compression. Its impact spans across all domains using LLMs, making it highly timely and relevant. In contrast, Paper 1, while methodologically rigorous and valuable, is confined to the specific domain of materials synthesis, limiting its overall breadth of scientific impact.

vs. Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training

gemini-3.1 · 5/28/2026

Paper 1 provides a fundamental, mechanistic understanding of how compressed reasoning data influences LLM post-training. By systematically analyzing supervised fine-tuning (SFT) and reinforcement learning (RL) dynamics through a novel Chain-of-Thought taxonomy, it addresses core theoretical gaps in model training. While Paper 2 offers a strong practical application for inference acceleration, Paper 1's rigorous methodological design and foundational insights into training dynamics will likely have a broader, longer-lasting scientific impact on how future reasoning models are systematically developed and optimized.

vs. Agentic Active Omni-Modal Perception for Multi-Hop Audio-Visual Reasoning

gemini-3.1 · 5/28/2026

Paper 1 offers a fundamental and highly novel insight by repurposing reasoning/thinking models as natural context compressors. This addresses a critical bottleneck in LLM inference (long context management) without requiring specialized modules. Its potential impact spans across almost all LLM applications. While Paper 2 presents a valuable benchmark and framework for audio-visual reasoning, Paper 1's approach has broader applicability, greater methodological simplicity, and tackles a more pervasive challenge in the current AI landscape.

vs. HRBench: Benchmarking and Understanding Thinking-Mode Switch Strategies in Hybrid-Reasoning LLMs

gemini-3.1 · 5/28/2026

Paper 2 introduces a highly novel conceptual shift by repurposing reasoning/thinking traces as context compressors. This innovative paradigm connects two critical areas of LLM research (reasoning and long-context efficiency) without requiring dedicated compression modules. While Paper 1 provides a valuable and rigorous benchmarking framework, Paper 2 offers a more foundational algorithmic insight that could broadly influence future architectures and optimization strategies for LLM inference.

vs. Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning

gpt-5.2 · 5/28/2026

Paper 1 is more novel in reframing “thinking traces” as an intrinsic, model-driven context compression mechanism, offering a broadly applicable paradigm for long-context efficiency without extra compressor models. It targets a timely bottleneck (long-context inference cost) with clear, sizable benchmark gains and a lightweight reward-based control method, likely generalizable across tasks beyond QA. Paper 2 is practically relevant, but offline RL for improving LLMs (including for code) is a more incremental extension of established RLHF/offline RL ideas, with impact narrower to code post-training and dependent on dataset/verification choices.

vs. Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

gemini-3.1 · 5/28/2026

Paper 1 addresses the critical bottleneck of long-context LLM inference by introducing a highly novel and intuitive paradigm: repurposing reasoning traces as context compression. Its significant empirical gains and immediate, broad applicability to any long-context task give it higher potential for widespread adoption and real-world impact compared to Paper 2's more specialized focus on multi-stakeholder alignment.