Limitations of Learning Tanh Neural Networks with Finite Precision

Philipp Grohs, Matěj Trödler

Jun 9, 2026arXiv:2606.11104v1

cs.LGstat.ML

#4729of 5669·cs.LG

#4729 of 5669 · cs.LG

Tournament Score

1305±43

10501750

39%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor8.5

Novelty7.5

Clarity7

Abstract

We investigate limitations of learning $\tanh$ neural networks from point evaluations under finite-precision computations and $L^{p}$ accuracy guarantees, building on Berner, Grohs, and Voigtländer (2023). Our approach is based on a novel construction of sharply localized bump functions via iterated $\tanh$ activations. Using this mechanism, we show that, in a finite-precision setting, no adaptive randomized algorithm based on $m$ samples can achieve a convergence rate higher than the Monte Carlo rate $O(m^{-1/p})$ in the $L^{p}$ norm, unless the sampling budget grows exponentially with the size of the network parameters and architecture. The results reveal fundamental limitations imposed by finite precision on the learnability of classes containing localized bump functions, extending previous results for ReLU networks to the $\tanh$ setting.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper establishes fundamental lower bounds on the sampling complexity for learning neural networks with tanh activation functions under finite-precision arithmetic. The main result (Theorem 6) shows that no adaptive randomized algorithm using $m$ point samples can achieve an $L^{p}$ approximation error better than $O(m^{-1/p})$ — the Monte Carlo rate — unless $m$ grows exponentially with network architecture parameters. This extends the landmark results of Berner, Grohs, and Voigtländer (2023) from ReLU networks to the smooth sigmoidal (tanh) setting.

The key technical innovation is a novel construction of sharply localized bump functions using iterated tanh activations. Unlike ReLU, where compactly supported bump functions are straightforward to construct due to the exact zero region, tanh is real-analytic and never exactly vanishes. The authors circumvent this by introducing finite-precision arithmetic as a modeling assumption: tails that decay below machine precision $\varepsilon_p$ become indistinguishable from zero under any quantizer. This is both a clever technical workaround and a physically meaningful modeling choice, since all practical computations operate under finite precision.

Methodological Rigor

The paper is technically demanding and the proofs are detailed and rigorous. The construction proceeds through several carefully orchestrated stages:

1. Fixed-point analysis of iterated tanh (Section 4.1): The authors characterize the fixed-point structure of $\sigma_a(x) = \tanh(ax)$ for $a > 1$ , showing convergence of iterates $\sigma_a^n$ toward a step-like function $\text{sign}(x) \cdot x^\star$ . The convergence rates are quantified through exit-time bounds (Lemma 17) and attraction estimates (Lemma 18-19), which are essential for controlling how many layers are needed to sharpen the bump.

2. Coarse bump construction and separation (Section 4.2): Lemma 20 establishes that a carefully designed average of shifted tanh pairs produces a function that is positive on a small cube and sufficiently negative outside a slightly larger cube. The analysis requires several non-trivial auxiliary lemmas (Lemmas 21-24).

3. Bump sharpening and grid packing (Section 4.3): Theorem 25 combines the coarse bump with iterative sharpening, and Theorem 27 provides the full constructive result. The final proof uses a pigeonhole/volume-packing argument with Markov's inequality to extend from deterministic to randomized algorithms.

The proofs are complete and self-contained, with all constants tracked explicitly. The authors even provide an interactive web application for numerical verification of the theorem's conditions, which is commendable for reproducibility.

Potential Impact

Theoretical significance: This paper addresses a genuine gap in the theory. The prior ReLU results relied heavily on homogeneity ( $\rho(\lambda x) = \lambda \rho(x)$ ) and exact compact support — properties absent for smooth activations. Demonstrating that hardness results persist for tanh networks, albeit through a fundamentally different mechanism (finite-precision truncation of analytic tails), broadens the scope of impossibility results in learning theory.

Practical relevance: The tanh activation is widely used in RNNs, LSTMs, PINNs, and is closely related to GELU used in transformers. The result implies that for networks of even moderate size (the authors give a concrete example with $d = 15$ , $B = 45$ , $L = 12$ ), uniform reconstruction from samples is infeasible. The stability implication — that networks differing by $2 \times 10^{-69}$ in sampled values can differ by $0.5$ in $L^\infty$ — highlights an intrinsic ill-posedness that no algorithmic improvement can overcome.

Broader impact on approximation theory: The connection to the impossibility results of Platte, Trefethen, and Kuijlaars (2011) for analytic interpolation is insightful, and the authors correctly note that their result is stronger: the instability holds regardless of sampling point placement, not just for equispaced points.

Timeliness & Relevance

The paper addresses a timely question given the growing interest in understanding fundamental limits of neural network learning. As the community pushes for theoretical guarantees in scientific computing (PINNs) and AI safety, understanding when learning is fundamentally impossible — independent of computational budget — is crucial. The finite-precision perspective is particularly relevant as quantized inference and low-precision training become standard practice.

Strengths & Limitations

Strengths:

Novel and technically sophisticated construction that overcomes the main obstacles (analyticity, non-homogeneity) of extending ReLU hardness results to smooth activations

The finite-precision assumption is natural, well-motivated, and minimally invasive (only requires small values map to zero)

Explicit, non-asymptotic bounds with a concrete numerical example

Interactive verification tool enhances reproducibility

Clean exposition despite technical complexity; Table 1 and Figure 2 are effective organizational aids

Limitations:

The lower bound structure differs from the ReLU case: the factor

m^{-1/d}

is absent, and the bound is uniformly bounded due to

|\tanh| \leq 1

. This makes direct comparison with ReLU results somewhat nuanced.

The finite-precision assumption, while natural, is essential — the result does not hold in exact arithmetic. This is a weaker impossibility statement than the ReLU case, which holds without such assumptions.

The choice

b = \tanh(c/2)

introduces a potentially suboptimal dependence on

B^{1/q}/c

; the authors acknowledge that

b \asymp cM

might improve this but leave it for future work.

The effective dimension reduction parameter

s

must satisfy constraint (2), which can be restrictive for small architectures.

The paper focuses exclusively on lower bounds and does not discuss whether existing algorithms actually achieve the Monte Carlo rate, leaving the question of tightness open.

Summary

This is a strong theoretical contribution that meaningfully extends the understanding of fundamental learning limitations from piecewise-linear to smooth neural network classes. The technical depth is substantial, the modeling choices are well-justified, and the results are both qualitatively interesting and quantitatively precise. The main caveat is the essential role of finite precision, which weakens the impossibility compared to the ReLU analog but remains practically relevant.

Rating:7.2/ 10

Significance 7.5Rigor 8.5Novelty 7.5Clarity 7

Generated Jun 10, 2026

Comparison History (23)

Wonvs. Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

Paper 2 establishes fundamental theoretical limitations on learning tanh neural networks under finite precision, extending important prior results from ReLU to tanh settings. This addresses a core question in approximation theory and learning theory with broad implications for understanding neural network learnability. Paper 1 proposes a practical but incremental forgetting mechanism for non-stationary RL that, while useful, addresses a narrower problem with acknowledged limitations. Paper 2's rigorous mathematical contributions (novel bump function constructions, impossibility results) are more likely to have lasting theoretical impact across multiple subfields of machine learning and computational mathematics.

claude-opus-4-6·Jun 11, 2026

Wonvs. Efficient Multinomial Logistic Bandit via Frequent Directions

Paper 2 likely has higher scientific impact: it establishes fundamental, architecture-relevant lower bounds for learning tanh networks under realistic finite-precision constraints, extending prior ReLU impossibility results and clarifying intrinsic limits beyond algorithmic choices. Such theory can influence broad areas (learning theory, numerical analysis, approximation theory, and practical ML expectations) and is timely given growing attention to compute/precision constraints. Paper 1 is a solid algorithmic/efficiency contribution for a specific bandit model, but its impact is narrower and more incremental (sketching applied to reduce OFUL-MLogB costs) compared to a general limitation result.

gpt-5.2·Jun 11, 2026

Wonvs. RePAIR: Predictive Self-Supervised Representation Learning in Chess

Paper 2 likely has higher scientific impact: it provides rigorous, general lower bounds on learnability under finite precision, extending foundational ReLU results to tanh via a novel bump-function construction. Such negative results can influence theory, algorithm design, and expectations across approximation theory, learning theory, and numerical analysis. Paper 1 is an interesting architectural synthesis demonstrated in chess, but appears more domain-specific and incremental relative to existing masked/predictive SSL paradigms, with less guaranteed breadth and rigor than the theoretical limitations established in Paper 2.

gpt-5.2·Jun 11, 2026

Wonvs. Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

Paper 2 has higher potential impact because it delivers a fundamental, broadly applicable theoretical limitation: finite-precision constraints yield Monte Carlo–rate lower bounds for learning tanh networks unless sampling grows exponentially. Such results can influence learning theory, numerical analysis, and practical ML assumptions across many domains and architectures, and are timely given concerns about precision, quantization, and reliability. Paper 1 proposes an efficient clustering method with promising applications, but its impact is likely narrower (time-series clustering) and more contingent on empirical adoption than a general impossibility/lower-bound result.

gpt-5.2·Jun 11, 2026

Wonvs. Transformer Based Model for Spatiotemporal Feature Learning in EEG Emotion Recognition

Paper 1 establishes fundamental theoretical limitations for learning neural networks under finite precision, offering rigorous mathematical guarantees that apply broadly across deep learning. In contrast, Paper 2 presents an applied, domain-specific architectural combination for EEG data. While Paper 2 has direct clinical applications, Paper 1's insights into the learnability and computational bounds of networks hold broader, more foundational scientific significance for artificial intelligence.

gemini-3.1-pro-preview·Jun 10, 2026

Wonvs. Closing the Modality Gap in Zero-Shot HAR: Contrastive Training and Separability-Optimized Prototypes on IMU Data

Paper 2 establishes fundamental theoretical limitations on learning tanh neural networks under finite precision, extending known impossibility results from ReLU to smooth activations. This addresses a core question in approximation theory and learning theory with broad implications across machine learning. Paper 1, while methodologically sound, is an incremental empirical study on a specific HAR dataset with a relatively narrow scope (zero-shot IMU-based activity recognition). Paper 2's theoretical contribution—proving exponential lower bounds on sample complexity—is more likely to influence multiple research communities and have lasting impact.

claude-opus-4-6·Jun 10, 2026

Lostvs. Express Language Modeling

Paper 2 has higher estimated impact due to strong timeliness and broad applicability to modern language modeling. It introduces a general tool (Express) that converts non-causal attention approximations to causal ones with guarantees, improves theoretical bounds, and provides a practical Triton implementation with speedups and clear system-level wins (prefill, KV cache, decoding). This combination of algorithmic novelty, rigorous guarantees, and immediate real-world deployment potential across ML systems and NLP suggests wider and faster uptake than Paper 1, which is more specialized to theoretical limitations under finite precision.

gpt-5.2·Jun 10, 2026

Wonvs. A Systematic Approach for Selecting Trajectories for Data Augmentation

Paper 1 addresses fundamental theoretical limitations of learning neural networks under finite precision, extending important lower bounds from ReLU to tanh networks. This contributes to computational learning theory with broad implications for understanding learnability. The novel construction of localized bump functions via iterated tanh activations is a meaningful technical contribution. Paper 2, while practical, is a thesis presenting an incremental empirical study on trajectory data augmentation selection strategies with domain-specific findings and limited generalizability. Paper 1's theoretical results have broader and more lasting impact across machine learning theory.

claude-opus-4-6·Jun 10, 2026

Lostvs. From Observation to Intervention: A Causal Audit of Expert Importance in Mixture-of-Experts Models

Paper 1 addresses a timely and broadly impactful issue in AI interpretability and MoE model pruning—a rapidly growing area given the deployment of large MoE models. It introduces a rigorous causal framework (connecting Pearl's causal hierarchy to interpretability practices) and provides concrete empirical evidence challenging widely-used assumptions, which could reshape pruning methodologies and interpretability standards across the field. Paper 2, while technically strong, extends known impossibility results from ReLU to tanh networks in a narrower theoretical niche, with less immediate practical impact and a smaller affected community.

claude-opus-4-6·Jun 10, 2026

Lostvs. SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

Paper 1 addresses the highly practical and timely problem of uncertainty quantification in machine learning. By directly integrating conformal prediction into training, it offers broad, real-world utility across safety-critical domains like healthcare and finance. Paper 2, while methodologically rigorous, focuses on theoretical limitations of finite precision for tanh networks, which has a narrower scope and lower potential for immediate, widespread application.

gemini-3.1-pro-preview·Jun 10, 2026

#4729of 5669·cs.LG

#4729 of 5669 · cs.LG

Tournament Score

1305±43

10501750

39%

Win Rate

Wins

Losses

Matches

Rating

7.2/ 10

Significance7.5

Rigor8.5

Novelty7.5

Clarity7