Philipp Grohs, Matěj Trödler
We investigate limitations of learning neural networks from point evaluations under finite-precision computations and accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on samples can achieve a convergence rate higher than the Monte Carlo rate in the norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the setting.
This paper establishes fundamental lower bounds on the sampling complexity for learning neural networks with tanh activation functions under finite-precision arithmetic. The main result (Theorem 6) shows that no adaptive randomized algorithm using point samples can achieve an approximation error better than — the Monte Carlo rate — unless grows exponentially with network architecture parameters. This extends the landmark results of Berner, Grohs, and Voigtländer (2023) from ReLU networks to the smooth sigmoidal (tanh) setting.
The key technical innovation is a novel construction of sharply localized bump functions using iterated tanh activations. Unlike ReLU, where compactly supported bump functions are straightforward to construct due to the exact zero region, tanh is real-analytic and never exactly vanishes. The authors circumvent this by introducing finite-precision arithmetic as a modeling assumption: tails that decay below machine precision become indistinguishable from zero under any quantizer. This is both a clever technical workaround and a physically meaningful modeling choice, since all practical computations operate under finite precision.
The paper is technically demanding and the proofs are detailed and rigorous. The construction proceeds through several carefully orchestrated stages:
1. Fixed-point analysis of iterated tanh (Section 4.1): The authors characterize the fixed-point structure of for , showing convergence of iterates toward a step-like function . The convergence rates are quantified through exit-time bounds (Lemma 17) and attraction estimates (Lemma 18-19), which are essential for controlling how many layers are needed to sharpen the bump.
2. Coarse bump construction and separation (Section 4.2): Lemma 20 establishes that a carefully designed average of shifted tanh pairs produces a function that is positive on a small cube and sufficiently negative outside a slightly larger cube. The analysis requires several non-trivial auxiliary lemmas (Lemmas 21-24).
3. Bump sharpening and grid packing (Section 4.3): Theorem 25 combines the coarse bump with iterative sharpening, and Theorem 27 provides the full constructive result. The final proof uses a pigeonhole/volume-packing argument with Markov's inequality to extend from deterministic to randomized algorithms.
The proofs are complete and self-contained, with all constants tracked explicitly. The authors even provide an interactive web application for numerical verification of the theorem's conditions, which is commendable for reproducibility.
Theoretical significance: This paper addresses a genuine gap in the theory. The prior ReLU results relied heavily on homogeneity () and exact compact support — properties absent for smooth activations. Demonstrating that hardness results persist for tanh networks, albeit through a fundamentally different mechanism (finite-precision truncation of analytic tails), broadens the scope of impossibility results in learning theory.
Practical relevance: The tanh activation is widely used in RNNs, LSTMs, PINNs, and is closely related to GELU used in transformers. The result implies that for networks of even moderate size (the authors give a concrete example with , , ), uniform reconstruction from samples is infeasible. The stability implication — that networks differing by in sampled values can differ by in — highlights an intrinsic ill-posedness that no algorithmic improvement can overcome.
Broader impact on approximation theory: The connection to the impossibility results of Platte, Trefethen, and Kuijlaars (2011) for analytic interpolation is insightful, and the authors correctly note that their result is stronger: the instability holds regardless of sampling point placement, not just for equispaced points.
The paper addresses a timely question given the growing interest in understanding fundamental limits of neural network learning. As the community pushes for theoretical guarantees in scientific computing (PINNs) and AI safety, understanding when learning is fundamentally impossible — independent of computational budget — is crucial. The finite-precision perspective is particularly relevant as quantized inference and low-precision training become standard practice.
This is a strong theoretical contribution that meaningfully extends the understanding of fundamental learning limitations from piecewise-linear to smooth neural network classes. The technical depth is substantial, the modeling choices are well-justified, and the results are both qualitatively interesting and quantitatively precise. The main caveat is the essential role of finite precision, which weakens the impossibility compared to the ReLU analog but remains practically relevant.
Generated Jun 10, 2026
Paper 2 establishes fundamental theoretical limitations on learning tanh neural networks under finite precision, extending important prior results from ReLU to tanh settings. This addresses a core question in approximation theory and learning theory with broad implications for understanding neural network learnability. Paper 1 proposes a practical but incremental forgetting mechanism for non-stationary RL that, while useful, addresses a narrower problem with acknowledged limitations. Paper 2's rigorous mathematical contributions (novel bump function constructions, impossibility results) are more likely to have lasting theoretical impact across multiple subfields of machine learning and computational mathematics.
Paper 2 likely has higher scientific impact: it establishes fundamental, architecture-relevant lower bounds for learning tanh networks under realistic finite-precision constraints, extending prior ReLU impossibility results and clarifying intrinsic limits beyond algorithmic choices. Such theory can influence broad areas (learning theory, numerical analysis, approximation theory, and practical ML expectations) and is timely given growing attention to compute/precision constraints. Paper 1 is a solid algorithmic/efficiency contribution for a specific bandit model, but its impact is narrower and more incremental (sketching applied to reduce OFUL-MLogB costs) compared to a general limitation result.
Paper 2 likely has higher scientific impact: it provides rigorous, general lower bounds on learnability under finite precision, extending foundational ReLU results to tanh via a novel bump-function construction. Such negative results can influence theory, algorithm design, and expectations across approximation theory, learning theory, and numerical analysis. Paper 1 is an interesting architectural synthesis demonstrated in chess, but appears more domain-specific and incremental relative to existing masked/predictive SSL paradigms, with less guaranteed breadth and rigor than the theoretical limitations established in Paper 2.
Paper 2 has higher potential impact because it delivers a fundamental, broadly applicable theoretical limitation: finite-precision constraints yield Monte Carlo–rate lower bounds for learning tanh networks unless sampling grows exponentially. Such results can influence learning theory, numerical analysis, and practical ML assumptions across many domains and architectures, and are timely given concerns about precision, quantization, and reliability. Paper 1 proposes an efficient clustering method with promising applications, but its impact is likely narrower (time-series clustering) and more contingent on empirical adoption than a general impossibility/lower-bound result.
Paper 1 establishes fundamental theoretical limitations for learning neural networks under finite precision, offering rigorous mathematical guarantees that apply broadly across deep learning. In contrast, Paper 2 presents an applied, domain-specific architectural combination for EEG data. While Paper 2 has direct clinical applications, Paper 1's insights into the learnability and computational bounds of networks hold broader, more foundational scientific significance for artificial intelligence.
Paper 2 establishes fundamental theoretical limitations on learning tanh neural networks under finite precision, extending known impossibility results from ReLU to smooth activations. This addresses a core question in approximation theory and learning theory with broad implications across machine learning. Paper 1, while methodologically sound, is an incremental empirical study on a specific HAR dataset with a relatively narrow scope (zero-shot IMU-based activity recognition). Paper 2's theoretical contribution—proving exponential lower bounds on sample complexity—is more likely to influence multiple research communities and have lasting impact.
Paper 2 has higher estimated impact due to strong timeliness and broad applicability to modern language modeling. It introduces a general tool (Express) that converts non-causal attention approximations to causal ones with guarantees, improves theoretical bounds, and provides a practical Triton implementation with speedups and clear system-level wins (prefill, KV cache, decoding). This combination of algorithmic novelty, rigorous guarantees, and immediate real-world deployment potential across ML systems and NLP suggests wider and faster uptake than Paper 1, which is more specialized to theoretical limitations under finite precision.
Paper 1 addresses fundamental theoretical limitations of learning neural networks under finite precision, extending important lower bounds from ReLU to tanh networks. This contributes to computational learning theory with broad implications for understanding learnability. The novel construction of localized bump functions via iterated tanh activations is a meaningful technical contribution. Paper 2, while practical, is a thesis presenting an incremental empirical study on trajectory data augmentation selection strategies with domain-specific findings and limited generalizability. Paper 1's theoretical results have broader and more lasting impact across machine learning theory.
Paper 1 addresses a timely and broadly impactful issue in AI interpretability and MoE model pruning—a rapidly growing area given the deployment of large MoE models. It introduces a rigorous causal framework (connecting Pearl's causal hierarchy to interpretability practices) and provides concrete empirical evidence challenging widely-used assumptions, which could reshape pruning methodologies and interpretability standards across the field. Paper 2, while technically strong, extends known impossibility results from ReLU to tanh networks in a narrower theoretical niche, with less immediate practical impact and a smaller affected community.
Paper 1 addresses the highly practical and timely problem of uncertainty quantification in machine learning. By directly integrating conformal prediction into training, it offers broad, real-world utility across safety-critical domains like healthcare and finance. Paper 2, while methodologically rigorous, focuses on theoretical limitations of finite precision for tanh networks, which has a narrower scope and lower potential for immediate, widespread application.