Chenxiao Yang, Nathan Srebro, Zhiyuan Li
We tightly characterize the VC dimension of depth- Transformers with a total of parameters, mapping an input sequence of length to a single output, establishing an upper bound of and a nearly matching lower bound of . We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity and that any learning rule that uses chain-of-thought data requires at least examples, where is the input length and is the number of autoregressive steps.
This paper provides tight (up to lower-order terms) characterizations of the VC dimension of depth- Transformers with parameters operating on sequences of length : an upper bound of and a lower bound of . It further extends this analysis to chain-of-thought (CoT) autoregressive generation, establishing that teacher forcing achieves optimal sample complexity up to the gap in the lower bound, where is the generation length.
The paper solves a fundamental open problem in the statistical learning theory of Transformers: what is the parametric capacity of the Transformer architecture? Prior work either relied on norm-based bounds (which depend on weight magnitudes and are scale-sensitive) or provided loose parametric bounds. This work settles the question for hard-attention Transformers with ReLU feed-forward layers.
The paper demonstrates exceptional technical rigor on multiple fronts:
Upper bounds: The proof extends the piecewise-polynomial framework of Bartlett et al. (1998, 2019) to handle attention's data-dependent routing. The key insight is that hard attention introduces discrete branching decisions (pairwise score comparisons) that can be counted via Warren's theorem. The careful accounting of attention branching polynomials per layer—arising from pairwise key comparisons per head, heads, and inputs—is technically clean and yields the factor naturally.
Lower bounds: The "Recursive Retrieval Machine" construction is an elegant and nontrivial gadget. The idea of encoding label configurations in context positions and using attention as a nearest-neighbor retrieval mechanism (via the key encoding trick) is creative. The base- encoding that chains retrieval rounds through left-shift operations demonstrates deep understanding of what hard-attention Transformers can compute.
CoT analysis: The introduction of two distinct complexity measures—trace shattering dimension (TSdim) and answer shattering dimension (ASdim)—is conceptually clean and captures the train-test asymmetry inherent in CoT learning. The scratchpad construction for the CoT lower bound is particularly ingenious: it shows that intermediate tokens can serve as label-invariant computation (a lookup table) while the final token extracts label-dependent information.
The near-tightness of all bounds (matching up to factors in the lower bound) is a hallmark of careful analysis.
Foundational understanding: This work provides the definitive parametric complexity characterization for (hard-attention) Transformers, analogous to the classical results for feed-forward networks. This serves as a baseline for all subsequent generalization analyses of Transformers.
Translating expressivity to learnability: Many papers establish that Transformers of certain size can *represent* specific functions. This work provides the missing link: with the VC dimension known, representational results directly translate to sample complexity guarantees for learning.
CoT training theory: The proof that teacher forcing is sample-optimal for Transformers is practically significant. It provides theoretical justification for the dominant training paradigm used in practice (e.g., training LLMs on next-token prediction with ground-truth prefixes).
Sequence length dependence: The multiplicative factor in the VC dimension is a refined finding. It shows that while Transformers' capacity grows only logarithmically with sequence length (much better than sequence-length-dependent architectures), this dependence is *multiplicative* with , not additive—improving upon Edelman et al. (2022)'s bound.
This work is highly timely given the dominance of Transformers and the growing interest in chain-of-thought reasoning. The theoretical understanding of why and when CoT helps, and whether teacher forcing is optimal, directly addresses active research questions in the LLM community. The sample complexity characterization also informs scaling laws discussions.
1. Near-tight bounds: The gap between upper and lower bounds is only , which is negligible in practice.
2. Novel constructions: The Recursive Retrieval Machine and scratchpad amplification are technically creative constructions that may find use in other lower bound arguments.
3. Comprehensive treatment: The paper addresses both the standard classification setting and the autoregressive CoT setting, with matching upper and lower bounds for both.
4. Clean conceptual framework: The TSdim/ASdim distinction elegantly captures the CoT learning problem's structure.
5. Practical relevance: The universality of teacher forcing result has direct implications for practice.
1. Hard attention only: The results do not extend to softmax attention, which is used in all practical Transformers. The authors acknowledge this, noting that the exponential function in softmax creates fundamental difficulties. The best known bound for softmax is , leaving a significant gap.
2. No layer normalization: LayerNorm is standard in practice but excluded here. This is understandable technically but limits direct applicability.
3. Parametric (not norm-based) analysis: The VC dimension analysis allows arbitrary real-valued weights. In practice, generalization is often better explained by norm-based or margin-based bounds. However, the authors correctly note this serves as a fundamental baseline.
4. Computational aspects ignored: The sample complexity bounds assume ERM/consistency can be achieved computationally, which is NP-hard in general for neural networks.
5. The gap: While small, closing this gap remains open.
This is a strong theoretical contribution that resolves a natural and important open question about the statistical capacity of Transformers. The techniques are sophisticated, the results are nearly tight, and the CoT analysis provides novel insights about autoregressive learning. The main limitation—restriction to hard attention—is significant but well-understood and clearly stated. The work sets a new benchmark for the learning-theoretic understanding of Transformer architectures.
Generated Jun 9, 2026
Paper 2 provides fundamental, tight theoretical bounds on the sample complexity and VC dimension of Transformers, including chain-of-thought reasoning. Given the dominant role of Transformers in modern AI, these foundational theoretical results are highly timely and likely to have a massive, broad impact on machine learning theory and the understanding of large language models, surpassing the more specialized, though innovative, surrogate modeling approach presented in Paper 1.
Paper 1 provides tight theoretical characterizations (VC dimension, sample complexity) for Transformers and chain-of-thought learning—foundational results with broad implications across machine learning theory. These tight bounds fill a significant gap in understanding Transformer generalization and will likely be widely cited. Paper 2 presents a practical decentralized training algorithm (GASLoC) that improves communication efficiency for LLM pretraining, which is valuable but more incremental and narrower in scope, addressing an engineering bottleneck rather than establishing fundamental theoretical understanding.
Paper 1 provides foundational theoretical results by tightly characterizing the VC dimension and sample complexity of Transformers, including the mechanics of Chain-of-Thought learning. These fundamental mathematical bounds have long-lasting, broad scientific implications for understanding why LLMs work. While Paper 2 presents an innovative and highly practical systems engineering feat for automating kernel synthesis, Paper 1's theoretical contributions are more likely to have a profound and enduring impact on the core science of machine learning.
Paper 1 provides tight, fundamental theoretical bounds on the VC dimension and sample complexity of Transformers and chain-of-thought learning. Establishing definitive mathematical limits for the dominant architecture in AI offers lasting foundational impact, whereas optimizer analyses like Paper 2, while highly practical, are often tied to more transient empirical trends.
Paper 2 has higher likely impact due to strong real-world relevance and timeliness: securing multi-stage LLM post-training is an immediate practical concern, and the sequential poisoning threat model exposes overlooked compound vulnerabilities with actionable implications for audits and defenses. It offers concrete empirical findings (additive vs complementary effects across pipelines) and releases code, aiding adoption and follow-on work. Paper 1 is theoretically novel and rigorous, but its impact may be narrower and more indirect (learning-theory bounds that may not change practice quickly), whereas Paper 2 can influence both ML security research and deployment practices across many systems.
Paper 2 likely has higher impact: it introduces a novel, causally motivated RL framework (policy-conditioned counterfactual credit + validity gating + constrained optimization) and demonstrates measurable gains and reduced reward hacking across multiple realistic long-horizon agent benchmarks with statistical testing and human audit. Its applications (safer, more reliable language agents with tools) are immediate and broadly relevant to RL, agent safety, and LLM deployment. Paper 1 is rigorous and theoretically valuable, but its main contribution (tight VC/sample complexity bounds) is more specialized and may translate more slowly into practical advances.
Paper 2 likely has higher scientific impact due to strong real-world applicability and timeliness: it targets LLM inference cost, a major bottleneck, and offers practical, controllable compute budgets with decode-time adaptive routing—features directly valuable for deployment across many systems. It also appears empirically validated on widely used model families, increasing adoption potential and cross-field impact (systems, ML, NLP). Paper 1 is methodologically rigorous and theoretically novel, but its impact may be narrower and slower to translate into practice than an inference-efficiency framework.
Paper 2 provides tight theoretical characterizations of VC dimension and sample complexity for Transformers, including chain-of-thought learning—topics of immense current interest. These fundamental results have broad implications across machine learning theory, deep learning practice, and understanding of LLMs. The near-matching upper and lower bounds represent a significant theoretical achievement with wide applicability. Paper 1 addresses a more niche problem (HDLSS tabular data generation) with a well-engineered but incremental framework, limiting its breadth of impact compared to foundational Transformer theory.
Paper 2 likely has higher impact due to timely relevance (safe/deployable unlearning under ubiquitous quantization), clear real-world applicability for regulated deployment, and a concrete method (MANSU) plus a new verification metric (CAD). It identifies a systematic failure mode (updates below quantization bin width) with an explanatory tradeoff and proposes mechanistically grounded fixes that can generalize across models/benchmarks. Paper 1 is rigorous and novel theoretically, but its impact is more specialized (learning-theory characterization) and may translate less directly into near-term practice compared to quantization-robust unlearning.
Paper 2 likely has higher near-term scientific impact: it introduces a practical, scalable training framework for a widely used paradigm (RLVR) and directly addresses a common failure mode (zero group-level advantage). The method (trace tournaments + anchor-based comparisons + Bradley–Terry fitting) is broadly applicable to LLM reasoning, improves benchmark performance, and reduces compute—strong real-world relevance and timeliness. Paper 1 is theoretically novel and rigorous with tight bounds, but its impact is more specialized and indirect for practitioners compared to an immediately deployable training improvement.