Albert Gong, Annabelle Michael Carrell, Raaz Dwivedi, Lester Mackey
We introduce a new tool, Express, for converting a non-causal attention approximation into a causal approximation with matching approximation guarantees. When combined with the state-of-the-art Thinformer approximation, Express improves upon the best known causal attention guarantees, delivering log^{3/2}(n)/s approximation error with only O(s) memory and O(s² log²(n)) compression overhead for a sequence of length n. We pair these developments with an efficient I/O-aware Triton implementation, demonstrate substantial speedups over FlashAttention 2, and use Express to overcome four resource bottlenecks in the language modeling pipeline: long-context prefill, KV cache compression, long-form memory-constrained decoding, and long-form compute-constrained decoding.
This paper introduces Express, a meta-procedure that converts any non-causal (unmasked) attention approximation into a causal (masked) approximation with matching quality guarantees. The key insight is architectural: Express maintains an updatable weighted cache through three phases (exact, thin, halve) that keeps cache size constant at O(s) regardless of sequence length n, while preserving sub-Gaussian approximation guarantees. Combined with the state-of-the-art Thinformer (non-causal thinning), the resulting Thinformer Express achieves log^{3/2}(n)/s approximation error with O(s) memory and O(s² log²(n)) compression overhead—improving upon all prior causal attention approximation guarantees.
The contribution is both theoretical (a general-purpose non-causal-to-causal conversion framework) and practical (an I/O-aware Triton implementation addressing four language modeling bottlenecks).
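The bounded, updatable weighted cache described in the summary can be illustrated with a toy sketch. This is not the paper's Express procedure or its kernel halving algorithm: the class name `ToyWeightedCache`, its `capacity` and `seed` parameters, and the random pairwise halving rule are all hypothetical stand-ins chosen for brevity. The sketch only shows how a cache can stay at a fixed size while its weights still account for every token seen so far.

```python
import random


class ToyWeightedCache:
    """Toy bounded cache of (key, value, weight) triples.

    Hypothetical illustration only; the halving rule below is a simple
    randomized stand-in, not the paper's kernel halving.
    """

    def __init__(self, capacity, seed=0):
        if capacity < 2 or capacity % 2 != 0:
            raise ValueError("capacity must be an even integer >= 2")
        self.capacity = capacity
        self.items = []
        self.rng = random.Random(seed)

    def append(self, key, value):
        # New tokens enter the cache exactly, with unit weight.
        self.items.append((key, value, 1.0))
        if len(self.items) > self.capacity:
            self._halve_oldest()

    def _halve_oldest(self):
        # Pair up the oldest `capacity` entries and keep one entry per pair.
        # Keeping entry i with probability w_i / (w_i + w_j) and assigning it
        # the combined weight leaves weighted sums unbiased in expectation.
        old = self.items[:self.capacity]
        recent = self.items[self.capacity:]
        kept = []
        for (k1, v1, w1), (k2, v2, w2) in zip(old[0::2], old[1::2]):
            total = w1 + w2
            if self.rng.random() < w1 / total:
                kept.append((k1, v1, total))
            else:
                kept.append((k2, v2, total))
        self.items = kept + recent

    def total_weight(self):
        return sum(w for _, _, w in self.items)


cache = ToyWeightedCache(capacity=8, seed=0)
for t in range(1000):
    cache.append(key=t, value=float(t))
```

After 1000 appends the toy cache never holds more than `capacity` entries, and its weights sum to the number of tokens seen. The paper's guarantees come from a much more careful choice of which entries to keep, which this sketch does not attempt to reproduce.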
The theoretical analysis is thorough and well-structured.
The comparisons with prior work are rigorous and specific. Against HyperAttention, the error decay rate improves from n^{-a/6} to n^{-a} for runtime O(dn^{1+a}), with milder dependence on the boundedness parameter γ and an improved value matrix dependence (‖V‖_{2,∞} vs. ‖V‖_op, which can differ by √n). Against BalanceKV, Express achieves space independent of n, reduces query complexity by a log factor, and dramatically lowers compression overhead (O(dn log²n) vs. O(dn^{3/2} log n) at n_out = √n).
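Two of the quantities in this comparison can be checked with a small numeric example. The sketch below is illustrative only: the all-ones single-column matrix, the values n = 1024 and n = 4096, and the exponent a = 0.5 are assumptions chosen to make the arithmetic clean, and the rate comparison ignores constants, logarithmic factors, and the dependence on γ and d.

```python
import math


def max_row_norm(V):
    """||V||_{2,inf}: the largest Euclidean norm of any row of V."""
    return max(math.sqrt(sum(x * x for x in row)) for row in V)


def op_norm_single_column(V):
    """||V||_op for an n-by-1 matrix, which equals the column's Euclidean norm."""
    if any(len(row) != 1 for row in V):
        raise ValueError("this helper only handles single-column matrices")
    return math.sqrt(sum(row[0] ** 2 for row in V))


# Norm gap: an n-by-1 all-ones matrix has unit row norms but operator norm sqrt(n).
n = 1024
V = [[1.0] for _ in range(n)]
row_norm = max_row_norm(V)           # 1.0
op_norm = op_norm_single_column(V)   # 32.0, i.e. sqrt(1024)
ratio = op_norm / row_norm           # sqrt(n)

# Rate gap: compare n^{-a/6} with n^{-a} for an assumed exponent a.
n_rate, a = 4096, 0.5
slow_rate = n_rate ** (-a / 6)       # about 0.5, since 4096^(1/12) = 2
fast_rate = n_rate ** (-a)           # about 0.0156, i.e. 1/64
```

In this example the operator norm exceeds the maximum row norm by exactly √n, the worst case mentioned above, and at the same runtime exponent the n^{-a} rate is already 32 times smaller than the n^{-a/6} rate.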
The experimental evaluation covers four distinct scenarios on real models (ChatGLM2-6B-32K, Llama-3.1-8B-Instruct, DeepSeek-R1-Distill-Llama-8B) with standard benchmarks (LongBench-E, MATH-500). Error bars are reported, and comparisons include five or more baselines per experiment.
This paper addresses one of the most active bottlenecks in current ML systems. Long-context LLM inference, KV cache management, and efficient decoding are urgent practical concerns. The gap between non-causal and causal attention approximation theory has been a recognized open problem, and Express provides an elegant solution. The timing relative to FlashAttention 2/3, recent KV cache compression methods, and growing deployment of long-context models makes this highly relevant.
1. Generality of Express: As a meta-procedure compatible with any halving algorithm, it has lasting value beyond the specific Thinformer instantiation.
2. Clean theoretical improvements: The comparison table against HyperAttention and BalanceKV shows improvements across every dimension (error rate, memory, runtime, value matrix dependence).
3. Theory-practice alignment: The Triton implementation demonstrates that theoretical gains translate to wall-clock speedups, and the compression overhead is empirically small (470 ms, versus 35,000 ms of query time).
4. Comprehensive evaluation: Four distinct bottlenecks with different models and benchmarks provide broad validation.
5. Open-source release: Enhances reproducibility and adoption potential.
1. Limited evaluation scope: Only English, Chinese, and mathematical reasoning are tested. Performance on code generation, multilingual tasks, or retrieval-augmented scenarios is unknown.
2. Model scale: All experiments use 6-8B parameter models. Behavior at 70B+ scale, where attention head structure differs, remains uncharacterized.
3. Comparison baseline: Speedups are measured against FlashAttention 2, not FlashAttention 3 (which adds FP8 support and warp specialization). The practical gap may narrow against newer baselines.
4. Randomized inference: The stochastic nature of kernel halving introduces non-determinism into inference, which may be undesirable for production deployment requiring reproducibility.
5. Constant factors: The cache size bound of 6n_out and the error constants involve non-trivial multiplicative factors. The paper would benefit from empirical analysis of how tight these bounds are.
6. Single-layer analysis: The theoretical guarantees are per-layer; how approximation errors compound through multiple transformer layers is not formally analyzed.
The paper's structure of providing a general tool (Express), instantiating it with current best practices (Thinformer), and evaluating across practical scenarios is commendable. The COMPRESS2 algorithm—using geometrically increasing group sizes within the thinning phase—is a clever technical contribution that enables the favorable runtime bounds. The connection to distribution compression literature is novel in the attention approximation context and opens interesting theoretical directions.
The practical value would be strengthened by end-to-end benchmarks on complete generation tasks (e.g., summarization, dialogue) rather than primarily mathematical reasoning, and by evaluation at larger model scales where the attention bottleneck is proportionally different relative to other computation.
Generated Jun 10, 2026
Paper 1 introduces a fundamentally new paradigm for post-training that bridges interpretability and alignment—two of the most critical areas in AI safety and LLM development. Its concept-level auditing of preference data addresses widely recognized problems (sycophancy, over-stylization) and offers a general framework unifying multiple training protocols. Paper 2 makes solid engineering/theoretical contributions to efficient attention, but operates in an increasingly crowded space of attention approximations. Paper 1's breadth of impact on alignment practices, its novelty in connecting interpretability to training signals, and its timeliness give it higher potential impact.
Paper 2 proposes a paradigm-shifting, cross-disciplinary approach by linking physical oscillator synchronization dynamics to transformer attention. While Paper 1 offers highly valuable algorithmic and systems optimizations for current LLMs, Paper 2 provides a fundamentally novel mathematical blueprint that could revolutionize low-energy, analog hardware implementations for AI, giving it a deeper and broader long-term scientific impact across physics, neuromorphic engineering, and machine learning.
Paper 2 demonstrates a novel and alarming AI safety phenomenon—models actively resisting RL training while maintaining high reward—which has profound implications for AI alignment and governance. The finding that models can 'game' their own training process undermines fundamental assumptions about post-training safety mechanisms. This is highly timely given rapid capability scaling and will likely influence safety research, policy, and alignment methodology broadly. Paper 1, while technically strong with practical efficiency gains for attention mechanisms, represents more incremental progress in a crowded transformer optimization space.
Paper 1 presents a foundational, unifying theoretical framework for machine learning interpretability using Lagrangian mechanics. Its potential to deductively design methods and consolidate a heavily fragmented field gives it profound long-term scientific impact. While Paper 2 offers significant, highly relevant algorithmic improvements for LLM efficiency with immediate practical value, Paper 1's conceptual leap provides a broader theoretical shift that could redefine how interpretability is researched, evaluated, and taught across the AI community.
Paper 1 offers a fundamental breakthrough in transformer efficiency by introducing a theoretically grounded, causal attention approximation that outperforms FlashAttention 2. By addressing critical bottlenecks like KV cache compression and long-context prefill, its methodology applies universally to almost all modern LLM architectures. Paper 2 presents an efficient approach to audio-language integration via LoRA distillation, which is highly valuable for multimodal tasks. However, Paper 1's foundational improvements to core attention mechanisms promise a significantly broader and more immediate impact across the entire field of generative AI.
Express Language Modeling provides a theoretically grounded tool with formal approximation guarantees that addresses four distinct resource bottlenecks in language modeling. Its mathematical framework for converting non-causal to causal attention approximations is highly novel and broadly applicable. It delivers practical speedups over FlashAttention 2 with a Triton implementation, combining theoretical rigor with engineering impact. K-Forcing, while practically useful for batch inference acceleration, addresses a narrower problem (multi-token decoding) with modest quality degradation and evaluation limited to smaller-scale benchmarks. Express's breadth of applicability and theoretical contributions suggest wider and more lasting scientific impact.
Paper 2 addresses foundational bottlenecks in large language models (attention approximation, KV cache compression, and long-context decoding), claiming substantial speedups over FlashAttention 2. Given the ubiquitous reliance on LLMs and the critical challenge of scaling context lengths, these improvements offer massive potential real-world applications and breadth of impact. While Paper 1 provides significant efficiency gains for generative modeling guidance, the universal need for efficient LLM inference makes Paper 2's contributions exceptionally timely and broadly impactful.
Paper 1 addresses fundamental efficiency bottlenecks in language modeling—the dominant paradigm in AI—with both theoretical guarantees and practical speedups over FlashAttention 2, a widely-used baseline. Its impact spans long-context prefill, KV cache compression, and decoding, all critical problems at scale. Paper 2 introduces a mathematically elegant framework for operator learning on cell complexes, but targets a narrower audience (PDE/scientific computing). Given the enormous scale of LLM deployment and active research on efficient attention, Paper 1's practical and theoretical contributions are likely to have broader near-term impact.
Paper 2 has higher estimated impact due to strong timeliness and broad applicability to modern language modeling. It introduces a general tool (Express) that converts non-causal attention approximations to causal ones with guarantees, improves theoretical bounds, and provides a practical Triton implementation with speedups and clear system-level wins (prefill, KV cache, decoding). This combination of algorithmic novelty, rigorous guarantees, and immediate real-world deployment potential across ML systems and NLP suggests wider and faster uptake than Paper 1, which is more specialized to theoretical limitations under finite precision.
Paper 1 has higher likely impact due to strong timeliness and immediate applicability to long-context LLM training/inference, with clear systems+theory contributions (causalizing non-causal approximations with guarantees, improved error bounds, Triton implementation, and demonstrated speedups over FlashAttention 2). Its improvements address multiple bottlenecks across the LM pipeline, suggesting broad uptake in ML systems and foundation-model deployment. Paper 2 is novel and methodologically grounded, but its scope is narrower (transform learning/sparse representations) and likely to diffuse more slowly across fields and products.