Jakob Galley, Vahid Shahverdi, Axel Flinth
We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework utilizing *tensorizable networks* to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.
This paper addresses a fundamental question at the intersection of geometric deep learning, optimization theory, and classical mechanics: Can symmetries in training data create new conserved quantities during gradient-flow training? The answer is nuanced and depends critically on the loss function and architecture.
The paper delivers two main theoretical results:
1. Negative result (Theorem 3): For analytic, non-polynomial margin losses (e.g., logistic, exponential) and finite symmetry groups, data symmetries generically do *not* produce new integrals of motion. The proof exploits the infinite Taylor expansion of non-polynomial analytic functions to show that group-averaged gradients span the same space as unaveraged gradients.
2. Positive result (Theorem 4): For MSE loss and *tensorizable networks* — a newly introduced class of architectures where input and parameter dependence separate via a lifted feature space — data augmentation *can* create new conserved quantities. The mechanism works through a lifted symmetry group acting on the feature space, which can be continuous even when the data symmetry group is discrete.
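Both results concern integrals of motion of gradient flow. As a reminder, here is the standard derivation of what that means; the symbols $\theta$ (parameters), $L$ (training loss), and $h$ (candidate conserved quantity) are notation chosen for this sketch and are not taken from the paper.

```latex
\begin{align*}
\dot{\theta}(t) &= -\nabla L(\theta(t)), \\
\frac{d}{dt}\, h(\theta(t))
  &= \big\langle \nabla h(\theta(t)),\, \dot{\theta}(t) \big\rangle
   = -\big\langle \nabla h(\theta(t)),\, \nabla L(\theta(t)) \big\rangle .
\end{align*}
```

A differentiable $h$ is therefore constant along every gradient-flow trajectory exactly when $\nabla h(\theta)$ is orthogonal to $\nabla L(\theta)$ at every $\theta$. Whether data symmetries add new conserved quantities comes down to whether they shrink the set of directions that $\nabla L$ can take.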
The concept of tensorizable networks is itself a notable conceptual contribution. These are networks whose dependence on parameters and inputs can be separated through an intermediate representation, a class that encompasses linear networks, polynomial networks, and Lightning Attention. This abstraction cleanly isolates where the symmetry analysis can be performed.
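To make the separation of parameter and input dependence concrete, here is a minimal sketch for one polynomial network: a two-layer network with square activation. The example and all function names are illustrative assumptions, not the paper's definition; the sketch only checks the elementary identity that the network output equals the contraction of a parameter-only tensor with an input-only lifted feature.

```python
import numpy as np

def quadratic_network(a, W, x):
    """Two-layer network with square activation: sum_j a_j * (w_j . x)**2."""
    return float(a @ (W @ x) ** 2)

def parameter_tensor(a, W):
    """Parameter side of the factorization: sum_j a_j * outer(w_j, w_j)."""
    return np.einsum("j,ji,jk->ik", a, W, W)

def lifted_features(x):
    """Input side of the factorization: the lifted feature outer(x, x)."""
    return np.outer(x, x)

def tensorized_output(a, W, x):
    """Contract the parameter tensor with the lifted input features."""
    return float(np.sum(parameter_tensor(a, W) * lifted_features(x)))

rng = np.random.default_rng(0)
a = rng.normal(size=4)          # output weights (hypothetical sizes)
W = rng.normal(size=(4, 3))     # hidden-layer weights
x = rng.normal(size=3)          # a single input

direct = quadratic_network(a, W, x)
separated = tensorized_output(a, W, x)
print(np.isclose(direct, separated))
```

In this toy case a symmetry acting on `x` induces an action on `outer(x, x)`, which is the kind of lifted feature space where the review says the symmetry analysis takes place.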
The mathematical framework is rigorous and carefully constructed, and the proofs leverage sophisticated tools.
The experiments are explicitly presented as qualitative/illustrative rather than empirical validation at scale, which is appropriate for a theory paper. They demonstrate approximate conservation under gradient descent discretization and finite Haar sampling.
Theoretical impact: This work opens a new direction in understanding how data structure interacts with optimization geometry. The classical conservation law literature for neural networks (Marcotte et al. 2023, 2024) focused on data-independent integrals of motion. This paper extends that framework to data-dependent settings, which is more realistic.
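A classical example of a data-independent integral of motion is the balancedness quantity $W_1 W_1^\top - W_2^\top W_2$ of a two-layer linear network $x \mapsto W_2 W_1 x$, which gradient flow conserves for any dataset. The sketch below is not taken from the paper: the sizes, data, and function names are assumptions chosen for illustration. It runs plain gradient descent on an MSE loss and measures how far the quantity drifts, which also illustrates why conservation is only approximate once gradient flow is discretized.

```python
import numpy as np

def balancedness(W1, W2):
    """Quantity conserved by gradient flow for f(x) = W2 @ W1 @ x."""
    return W1 @ W1.T - W2.T @ W2

def train(step_size, total_time, seed=0):
    """Run gradient descent on an MSE loss; return the drift of balancedness."""
    rng = np.random.default_rng(seed)
    d, h, m, n = 3, 4, 2, 20                 # input, hidden, output, samples
    X = rng.normal(size=(d, n))              # synthetic inputs
    Y = rng.normal(size=(m, d)) @ X          # targets from a random linear teacher
    W1 = 0.5 * rng.normal(size=(h, d))
    W2 = 0.5 * rng.normal(size=(m, h))
    Q0 = balancedness(W1, W2)
    for _ in range(int(round(total_time / step_size))):
        E = W2 @ W1 @ X - Y                  # residuals, shape (m, n)
        G1 = W2.T @ E @ X.T / n              # gradient w.r.t. W1
        G2 = E @ (W1 @ X).T / n              # gradient w.r.t. W2
        W1, W2 = W1 - step_size * G1, W2 - step_size * G2
    return np.linalg.norm(balancedness(W1, W2) - Q0)

# Same data, initialization, and total training time; only the step size differs.
drift_coarse = train(1e-2, 2.0)
drift_fine = train(1e-3, 2.0)
print(drift_coarse, drift_fine)
```

Each gradient descent step changes the quantity by a term of second order in the step size, so the drift should shrink as the step size decreases, approaching exact conservation in the gradient-flow limit.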
Practical implications: Conserved quantities constrain the trajectories that training can follow through the optimization landscape, which has downstream consequences in practice.
Connections to adjacent fields: The Noether-inspired framework bridges classical mechanics, representation theory, and deep learning theory. The characterization via Frobenius's theorem and the appearance of orthogonal/unitary/symplectic groups suggest deep structural connections to physics-inspired machine learning.
The paper is well-timed. There is growing interest in (1) geometric deep learning and equivariant architectures, (2) implicit bias and conservation laws in optimization, and (3) understanding data augmentation theoretically. This work sits precisely at the intersection, providing a principled framework where these threads meet.
The inclusion of Lightning Attention as a tensorizable network is particularly timely given the dominance of attention mechanisms, though the analysis applies to a simplified (unnormalized, single-head) variant.
This is a mathematically sophisticated paper that introduces a well-motivated question and provides clean, rigorous answers under specific assumptions. The dichotomy between polynomial and non-polynomial losses is a genuine insight. The tensorizable network framework and the complete representation-theoretic characterization are substantial intellectual contributions. However, the practical applicability is currently limited by the restrictiveness of both the loss function assumptions and the tensorizable architecture class.
Generated Jun 10, 2026
Paper 1 addresses a fundamental theoretical question about the relationship between data symmetries and conservation laws in neural network training, with broad implications across deep learning theory. It introduces the novel concept of tensorizable networks and provides rigorous proofs connecting symmetry, loss functions, and training dynamics. This has wider impact across multiple areas (optimization theory, architecture design, data augmentation). Paper 2, while rigorous and useful, addresses a more specialized problem (error bounds for PINNs applied to ODEs) with narrower scope and incremental advances over existing a posteriori bounds.
Paper 2 addresses fundamental theoretical questions about neural network optimization, providing mathematical proofs on how data symmetries affect conserved quantities during training. Such foundational insights into network dynamics generally have a broader and more lasting scientific impact across the deep learning community. Paper 1, while highly practical and empirically rigorous, focuses on a narrower applied problem (continual tabular anomaly detection), making its overall scientific footprint likely more domain-specific compared to the foundational theory established in Paper 2.
Paper 2 has higher likely impact: it introduces a practical, timely PEFT alternative that works with precompiled/high-throughput inference engines by optimizing only raw visual inputs, and demonstrates competitive results vs LoRA on multiple benchmarks and model sizes—suggesting immediate real-world applicability and broad relevance to multimodal LLM deployment. Paper 1 is theoretically novel and rigorous, but its impact is narrower (training dynamics/conservation laws under specific assumptions) and less directly actionable for widespread systems, though it may influence theory-focused subfields.
Paper 2 addresses a fundamental theoretical question connecting symmetries, conservation laws, and neural network training dynamics. This bridges deep learning theory with mathematical physics concepts (Noether's theorem analogy), offering broad theoretical implications across multiple fields. The introduction of 'tensorizable networks' as a framework and the rigorous proofs about when data symmetries do/don't yield conserved quantities provide foundational insights. Paper 1, while practically useful, represents an incremental engineering contribution in the crowded speech-LLM adaptation space. Paper 2's theoretical depth and cross-disciplinary nature suggest broader long-term scientific impact.
Paper 1 addresses a fundamental theoretical question about the relationship between data symmetries and conservation laws in neural network training dynamics, with broad implications across deep learning theory. It introduces a novel mathematical framework (tensorizable networks) that encompasses multiple architectures. While Paper 2 presents a strong applied contribution to brain-computer interfaces with state-of-the-art results, its impact is more narrowly scoped to neural population modeling. Paper 1's theoretical insights about symmetry, conservation laws, and gradient flow have potential to influence a wider range of fields and future research directions.
Paper 2 likely has higher impact due to clear, actionable findings on benchmark saturation and the metric–utility gap in EEG denoising, with immediate implications for how the field evaluates models and for edge/BCI deployment. It uses controlled capacity sweeps, cross-dataset tests, multiple downstream decoders, and statistical testing, strengthening rigor and generalizability. The results are timely amid model scaling trends and broadly relevant to ML-for-health, signal processing, and benchmarking methodology. Paper 1 is more theoretical and novel, but its main conclusion is largely negative (generic non-emergence of conserved quantities), potentially limiting near-term adoption and applications.
Paper 2 addresses a fundamental question connecting data symmetries, conservation laws, and neural network training dynamics—a topic with broad implications across deep learning theory, physics-informed ML, and optimization. Its framework of 'tensorizable networks' introduces a novel structural concept applicable to multiple architectures. Paper 1, while technically rigorous and advancing multi-agent RL locality analysis, addresses a more specialized problem with narrower impact. Paper 2's interdisciplinary nature (connecting physics concepts to ML theory) and relevance to understanding training dynamics give it broader potential impact.
Paper 2 offers fundamental theoretical insights into neural network training dynamics, linking physics concepts (conservation laws) with deep learning. While Paper 1 provides a useful methodological improvement for time series forecasting, Paper 2's rigorous mathematical proofs and introduction of 'tensorizable networks' have broader foundational impact, offering a deeper understanding that applies across various deep learning architectures and tasks.
Paper 2 offers higher immediate scientific and practical impact due to its timeliness and clear real-world applications in evaluating LLMs and RAG systems. While Paper 1 provides rigorous fundamental theory on neural network dynamics, Paper 2 directly addresses a critical bottleneck in modern AI: costly human annotations and LLM judge bias. By significantly reducing computational complexity and annotation requirements, Paper 2 provides a highly scalable methodological framework that will broadly impact AI engineering, search, and information retrieval domains.
Paper 1 has higher likely impact: it targets the timely, high-demand area of mechanistic interpretability for sparse autoencoders, offering a unified geometric/set-theoretic framework with concrete notions (detection/separation/approximation), bounds, and explanations of observed SAE phenomena, plus empirical demonstrations. This combination of conceptual clarity and practical relevance could generalize across interpretability, representation learning, and theory. Paper 2 is theoretically interesting but largely a negative result (generic non-conservation) with more niche applicability; its positive cases are restricted (e.g., MSE, tensorizable architectures), likely limiting breadth and near-term influence.