Conservation Laws from Data Symmetry in Neural Networks

Jakob Galley, Vahid Shahverdi, Axel Flinth

Jun 9, 2026 · arXiv:2606.10913v1

cs.LG · stat.ML

#2418 of 5669 · cs.LG

Tournament Score: 1419 ± 43 (axis range 1050 to 1750)

Win Rate: 56%

Rating: 6.8 / 10

Significance 7 · Rigor 8.5 · Novelty 7.5 · Clarity 7.5

Abstract

We explore whether intrinsic symmetries of the training data lead to conserved quantities during gradient-flow training of neural networks. Under the assumption that the loss function is analytic and non-polynomial, we prove that data symmetries generically do not induce any additional integrals of motion. For mean squared error (MSE) loss, on the other hand, there are situations in which data augmentation yields extra conserved quantities. We build a framework, utilizing *tensorizable networks*, to describe this phenomenon. Tensorizable networks are a family of architectures whose dependence on parameters and inputs can be separated using an intermediate representation. They include linear and polynomial networks, as well as Lightning Attention.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a fundamental question at the intersection of geometric deep learning, optimization theory, and classical mechanics: Can symmetries in training data create new conserved quantities during gradient-flow training? The answer is nuanced and depends critically on the loss function and architecture.

The paper delivers two main theoretical results:

1. Negative result (Theorem 3): For analytic, non-polynomial margin losses (e.g., logistic, exponential) and finite symmetry groups, data symmetries generically do *not* produce new integrals of motion. The proof exploits the infinite Taylor expansion of non-polynomial analytic functions to show that group-averaged gradients span the same space as unaveraged gradients.

2. Positive result (Theorem 4): For MSE loss and *tensorizable networks* — a newly introduced class of architectures where input and parameter dependence separate via a lifted feature space — data augmentation *can* create new conserved quantities. The mechanism works through a lifted symmetry group $\mathcal{H}$ acting on the feature space, which can be continuous even when the data symmetry group $G$ is discrete.
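To make the last point concrete, here is a minimal NumPy sketch of how a discrete data symmetry can give rise to a continuous one at the level of second moments. It assumes $G = C_3$ acting on $\mathbb{R}^3$ by cyclic coordinate shifts and looks only at the group-averaged covariance, which is all that an MSE loss on a linear model sees of the inputs. Whether this coincides with the paper's construction of $\mathcal{H}$ is an assumption, and the names `shift`, `group_average`, and `rotation_about` are illustrative.

```python
import numpy as np

# Generator of C_3 acting on R^3 by cyclic coordinate shift (assumed action).
shift = np.roll(np.eye(3), 1, axis=0)

def group_average(cov):
    """Average a covariance matrix over the three elements of C_3."""
    elems = [np.linalg.matrix_power(shift, k) for k in range(3)]
    return sum(g @ cov @ g.T for g in elems) / 3.0

def rotation_about(axis, angle):
    """Rodrigues formula: rotation by `angle` about the unit vector along `axis`."""
    u = axis / np.linalg.norm(axis)
    cross = np.array([[0.0, -u[2], u[1]],
                      [u[2], 0.0, -u[0]],
                      [-u[1], u[0], 0.0]])
    return (np.cos(angle) * np.eye(3) + np.sin(angle) * cross
            + (1.0 - np.cos(angle)) * np.outer(u, u))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 200))            # 200 un-augmented samples in R^3
cov = x @ x.T / x.shape[1]               # empirical second moment
cov_avg = group_average(cov)             # second moment of the C_3-augmented data

# Any rotation about (1, 1, 1), not just multiples of 120 degrees, fixes cov_avg.
rot = rotation_about(np.ones(3), angle=0.7)
invariance_gap = np.abs(rot @ cov_avg @ rot.T - cov_avg).max()
raw_gap = np.abs(rot @ cov @ rot.T - cov).max()
print(f"averaged: {invariance_gap:.2e}, raw: {raw_gap:.2e}")
```

The averaged covariance is a symmetric circulant matrix, so its eigenspaces are the line spanned by $(1,1,1)$ and the plane orthogonal to it. Every rotation about that line preserves both, which is why the invariance gap is at floating-point level while the un-averaged covariance is generically not invariant.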

The concept of tensorizable networks is itself a notable contribution. These are networks satisfying $f_\theta(x) = M(\theta)T(x)$, encompassing linear networks, polynomial networks, and Lightning Attention. This abstraction cleanly separates where symmetry analysis can be performed.
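As a concrete illustration of the factorization $f_\theta(x) = M(\theta)T(x)$, the sketch below uses a small network with quadratic activation, $f_\theta(x) = \sum_j a_j (w_j^\top x)^2$. This particular network, and the Python functions `M` and `T`, are assumptions chosen for illustration; the paper's own examples may be parametrized differently.

```python
import numpy as np

def network(a, W, x):
    """Quadratic-activation network: sum_j a_j * (w_j . x)^2."""
    return float(a @ (W @ x) ** 2)

def M(a, W):
    """Parameter-only factor: the (d, d) matrix sum_j a_j w_j w_j^T."""
    return (W.T * a) @ W

def T(x):
    """Input-only lifted feature: the flattened outer product x x^T."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
d, h = 4, 6
a, W, x = rng.normal(size=h), rng.normal(size=(h, d)), rng.normal(size=d)

direct = network(a, W, x)
factored = float(M(a, W).ravel() @ T(x))   # linear in T(x) for fixed parameters
print(direct, factored)   # the two values agree up to floating-point error
```

Here the lifted feature space is the space of $d \times d$ matrices: all dependence on the input sits in $T(x) = x x^\top$, and all dependence on the parameters sits in $M(\theta)$.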

Methodological Rigor

The mathematical framework is rigorous and carefully constructed. The proofs leverage sophisticated tools:

Theorem 3 uses a Vandermonde-type argument: the infinite non-zero Taylor coefficients of the analytic loss create an overdetermined system that prevents gradient collapse under group averaging. Assumption 1 (injectivity modulo signs of $\chi_{x,\theta}$) is explicitly stated and shown to hold generically.

Theorem 4 relies on the first fundamental theorem of the orthogonal group (Weyl) to show that $O(V)$-invariance of the loss implies dependence only through $P^\top P$, forcing the gradient flow to preserve $\text{range}(P)$.

Appendix D provides a complete characterization of $\mathcal{H}$ using real representation theory (Maschke's theorem, Schur's lemma, Frobenius's classification of real division algebras), showing $\mathcal{H}$ decomposes into products of orthogonal, unitary, or compact symplectic groups depending on the type (real/complex/quaternionic) of irreducible representations.
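The last step of the Theorem 4 argument can be unpacked with a short calculation. This is a sketch under the assumption that the loss can be written as $L(P) = g(P^\top P)$ for a differentiable $g$; the symbols $g$, $S$, and $U$ are introduced here for illustration and are not taken from the paper.

```latex
\begin{aligned}
\mathrm{d}L &= \big\langle \nabla g(P^\top P),\; \mathrm{d}P^\top P + P^\top \mathrm{d}P \big\rangle
             = \big\langle P\,S,\; \mathrm{d}P \big\rangle,
\qquad S := \nabla g(P^\top P) + \nabla g(P^\top P)^\top, \\
\dot{P} &= -\nabla L(P) = -P\,S
\quad\Longrightarrow\quad
P(t) = P(0)\,U(t), \qquad \dot{U} = -U\,S, \quad U(0) = I .
\end{aligned}
```

Since $U(t)$ solves a linear matrix ODE starting from the identity, it stays invertible, so the column space of $P(t)$ equals that of $P(0)$ for all $t$.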

The experiments are explicitly presented as qualitative/illustrative rather than empirical validation at scale, which is appropriate for a theory paper. They demonstrate approximate conservation under gradient descent discretization and finite Haar sampling.
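The effect of finite Haar sampling can be illustrated with a small sketch. It assumes augmentation by the full orthogonal group $O(n)$, drawn via the QR decomposition of a Gaussian matrix, and measures how far a finite-sample average of $Q \Sigma Q^\top$ is from its exact Haar average $(\operatorname{tr}\Sigma / n)\, I$. The group, the matrix `cov`, and the sample count are illustrative choices, not the paper's experimental setup.

```python
import numpy as np

def haar_orthogonal(n, rng):
    """Draw one Haar-distributed matrix from O(n) via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(n, n)))
    # Fix the sign ambiguity of QR so the distribution is exactly Haar.
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(0)
n, num_samples = 3, 4000
cov = np.diag([3.0, 1.0, 0.5])

samples = [haar_orthogonal(n, rng) for _ in range(num_samples)]
cov_avg = sum(q @ cov @ q.T for q in samples) / num_samples

# The exact Haar average of Q cov Q^T is (trace(cov) / n) * I.
target = np.trace(cov) / n * np.eye(n)
sampling_error = np.abs(cov_avg - target).max()
print(f"max deviation after {num_samples} samples: {sampling_error:.3f}")
```

The deviation shrinks roughly like one over the square root of the number of samples, so a symmetry that holds exactly for the full Haar average holds only approximately for any finite augmentation set.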

Potential Impact

Theoretical impact: This work opens a new direction in understanding how data structure interacts with optimization geometry. The classical conservation law literature for neural networks (Marcotte et al. 2023, 2024) focused on data-independent integrals of motion. This paper extends that framework to data-dependent settings, which is more realistic.
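For readers unfamiliar with data-independent integrals of motion, a standard example is the two-layer linear network $f(x) = W_2 W_1 x$, for which $W_1 W_1^\top - W_2^\top W_2$ is constant along gradient flow whenever the loss depends on the parameters only through the product $W_2 W_1$. The sketch below checks this numerically with small-step gradient descent on an MSE loss; the dimensions, step size, and random teacher are assumptions made for illustration and are not drawn from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o, n = 3, 4, 2, 50
X = rng.normal(size=(d, n))
Y = rng.normal(size=(o, d)) @ X            # targets from a random linear teacher
W1 = 0.5 * rng.normal(size=(h, d))
W2 = 0.5 * rng.normal(size=(o, h))

def loss(W1, W2):
    """Mean squared error of the two-layer linear network on (X, Y)."""
    return 0.5 * np.sum((W2 @ W1 @ X - Y) ** 2) / n

def conserved(W1, W2):
    """Quantity that is constant under exact gradient flow for this model."""
    return W1 @ W1.T - W2.T @ W2

q_init, loss_init, W1_init = conserved(W1, W2), loss(W1, W2), W1.copy()
lr = 1e-4                                   # small step to mimic gradient flow
for _ in range(5000):
    G = (W2 @ W1 @ X - Y) @ X.T / n         # gradient w.r.t. the product W2 @ W1
    W1, W2 = W1 - lr * W2.T @ G, W2 - lr * G @ W1.T

drift = np.abs(conserved(W1, W2) - q_init).max()
movement = np.abs(W1 - W1_init).max()
loss_final = loss(W1, W2)
print(f"loss {loss_init:.3f} -> {loss_final:.3f}, "
      f"weights moved {movement:.3f}, drift {drift:.2e}")
```

Under discrete gradient descent the quantity drifts at second order in the step size per iteration, so the drift should be much smaller than the change in the weights themselves.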

Practical implications: Understanding conserved quantities constrains the optimization landscape, which has downstream consequences for:

Implicit bias characterization (what solutions gradient descent favors)

Initialization sensitivity analysis

Understanding when data augmentation changes optimization dynamics qualitatively versus merely smoothing the loss

Connections to adjacent fields: The Noether-inspired framework bridges classical mechanics, representation theory, and deep learning theory. The characterization of $\mathcal{H}$ via Frobenius's theorem and the appearance of orthogonal/unitary/symplectic groups suggests deep structural connections to physics-inspired machine learning.

Timeliness & Relevance

The paper is well-timed. There is growing interest in (1) geometric deep learning and equivariant architectures, (2) implicit bias and conservation laws in optimization, and (3) understanding data augmentation theoretically. This work sits precisely at the intersection, providing a principled framework where these threads meet.

The inclusion of Lightning Attention as a tensorizable network is particularly timely given the dominance of attention mechanisms, though the analysis applies to a simplified (unnormalized, single-head) variant.
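A rough sketch of why such a simplified attention layer fits the tensorizable template: without softmax or normalization, a single head computes $(XW_Q)(XW_K)^\top(XW_V)$, which is cubic in the tokens and depends on the parameters only through $W_Q W_K^\top$ and $W_V$. The code below assumes exactly this unmasked form, without the causal masking or decay used in practical Lightning Attention implementations; the functions `M` and `T` are illustrative names.

```python
import numpy as np

def linear_attention(X, Wq, Wk, Wv):
    """Unnormalized single-head attention without softmax: (X Wq)(X Wk)^T (X Wv)."""
    return (X @ Wq) @ (X @ Wk).T @ (X @ Wv)

def M(Wq, Wk, Wv):
    """Parameter-only factors: the bilinear form Wq Wk^T and the value map Wv."""
    return Wq @ Wk.T, Wv

def T(X):
    """Input-only lifted feature: T[i,a,b,c] = X[i,a] * sum_j X[j,b] * X[j,c]."""
    return np.einsum("ia,jb,jc->iabc", X, X, X)

rng = np.random.default_rng(0)
n_tokens, d, d_head = 5, 3, 2
X = rng.normal(size=(n_tokens, d))
Wq, Wk, Wv = (rng.normal(size=(d, d_head)) for _ in range(3))

A, B = M(Wq, Wk, Wv)
# For fixed parameters the output is linear in the lifted feature T(X).
factored = np.einsum("iabc,ab,cd->id", T(X), A, B)
direct = linear_attention(X, Wq, Wk, Wv)
max_gap = np.abs(direct - factored).max()
print(max_gap)
```

The pair `(A, B)` plays the role of $M(\theta)$, viewed as a linear map on the lifted feature tensor, and the agreement of `direct` and `factored` is the separation of input and parameter dependence.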

Strengths

Clean dichotomy: The contrast between analytic non-polynomial losses (no new conservation laws) and polynomial/MSE loss (possible new conservation laws) is elegant and provides clear conceptual guidance.

Complete algebraic characterization: The full characterization of $\mathcal{H}$ in Appendix D via representation theory is thorough and self-contained.

Novel architectural abstraction: Tensorizable networks provide a useful conceptual tool that may find applications beyond this paper.

Concrete examples: The $C_{3}$-linear model and Lightning Attention examples make the abstract framework tangible.

Limitations

Restricted loss functions: Theorem 3 covers only margin losses; cross-entropy with softmax, for instance, is not directly addressed. Theorem 4 requires MSE specifically.

Tensorizable networks are restrictive: Many practical architectures (ReLU networks, transformers with softmax attention, normalization layers) are not tensorizable. The paper acknowledges this but leaves extensions to future work.

Realizability gap: Not every $\mathcal{H}$-symmetry can be pulled back to parameter space via condition (25). The paper does not characterize when realizability holds in general.

No discussion of approximate conservation: In practice, approximate symmetries and finite training time mean exact conservation is never achieved. The framework does not address stability or perturbation analysis.

Scale of experiments: The experiments are minimal (3D linear model, small attention model). While appropriate for a theory paper, larger-scale verification would strengthen confidence.

Infinite groups: Theorem 3 is restricted to finite groups; whether similar results hold for compact infinite groups is unaddressed.

Overall Assessment

This is a mathematically sophisticated paper that introduces a well-motivated question and provides clean, rigorous answers under specific assumptions. The dichotomy between polynomial and non-polynomial losses is a genuine insight. The tensorizable network framework and the complete representation-theoretic characterization of $\mathcal{H}$ represent substantial intellectual contributions. However, the practical applicability is currently limited by the restrictiveness of both the loss function assumptions and the tensorizable architecture class.

Rating: 6.8 / 10

Significance 7 · Rigor 8.5 · Novelty 7.5 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (16)

Won vs. Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

Paper 1 addresses a fundamental theoretical question about the relationship between data symmetries and conservation laws in neural network training, with broad implications across deep learning theory. It introduces the novel concept of tensorizable networks and provides rigorous proofs connecting symmetry, loss functions, and training dynamics. This has wider impact across multiple areas (optimization theory, architecture design, data augmentation). Paper 2, while rigorous and useful, addresses a more specialized problem (error bounds for PINNs applied to ODEs) with narrower scope and incremental advances over existing a posteriori bounds.

claude-opus-4-6 · Jun 11, 2026

Won vs. TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

Paper 2 addresses fundamental theoretical questions about neural network optimization, providing mathematical proofs on how data symmetries affect conserved quantities during training. Such foundational insights into network dynamics generally have a broader and more lasting scientific impact across the deep learning community. Paper 1, while highly practical and empirically rigorous, focuses on a narrower applied problem (continual tabular anomaly detection), making its overall scientific footprint likely more domain-specific compared to the foundational theory established in Paper 2.

gemini-3.1-pro-preview · Jun 11, 2026

Lost vs. Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

Paper 2 has higher likely impact: it introduces a practical, timely PEFT alternative that works with precompiled/high-throughput inference engines by optimizing only raw visual inputs, and demonstrates competitive results vs LoRA on multiple benchmarks and model sizes—suggesting immediate real-world applicability and broad relevance to multimodal LLM deployment. Paper 1 is theoretically novel and rigorous, but its impact is narrower (training dynamics/conservation laws under specific assumptions) and less directly actionable for widespread systems, though it may influence theory-focused subfields.

gpt-5.2 · Jun 11, 2026

Won vs. AuRA: Internalizing Audio Understanding into LLMs as LoRA

Paper 2 addresses a fundamental theoretical question connecting symmetries, conservation laws, and neural network training dynamics. This bridges deep learning theory with mathematical physics concepts (Noether's theorem analogy), offering broad theoretical implications across multiple fields. The introduction of 'tensorizable networks' as a framework and the rigorous proofs about when data symmetries do/don't yield conserved quantities provide foundational insights. Paper 1, while practically useful, represents an incremental engineering contribution in the crowded speech-LLM adaptation space. Paper 2's theoretical depth and cross-disciplinary nature suggest broader long-term scientific impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. GRAFT: Gain-Recalibrated Adapters for Transformer-Based Neural Population Activity Modeling

Paper 1 addresses a fundamental theoretical question about the relationship between data symmetries and conservation laws in neural network training dynamics, with broad implications across deep learning theory. It introduces a novel mathematical framework (tensorizable networks) that encompasses multiple architectures. While Paper 2 presents a strong applied contribution to brain-computer interfaces with state-of-the-art results, its impact is more narrowly scoped to neural population modeling. Paper 1's theoretical insights about symmetry, conservation laws, and gradient flow have potential to influence a wider range of fields and future research directions.

claude-opus-4-6 · Jun 10, 2026

Lost vs. How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Paper 2 likely has higher impact due to clear, actionable findings on benchmark saturation and the metric–utility gap in EEG denoising, with immediate implications for how the field evaluates models and for edge/BCI deployment. It uses controlled capacity sweeps, cross-dataset tests, multiple downstream decoders, and statistical testing, strengthening rigor and generalizability. The results are timely amid model scaling trends and broadly relevant to ML-for-health, signal processing, and benchmarking methodology. Paper 1 is more theoretical and novel, but its main conclusion is largely negative (generic non-emergence of conserved quantities), potentially limiting near-term adoption and applications.

gpt-5.2 · Jun 10, 2026

Won vs. A Unified Framework for Locality in Scalable MARL

Paper 2 addresses a fundamental question connecting data symmetries, conservation laws, and neural network training dynamics—a topic with broad implications across deep learning theory, physics-informed ML, and optimization. Its framework of 'tensorizable networks' introduces a novel structural concept applicable to multiple architectures. Paper 1, while technically rigorous and advancing multi-agent RL locality analysis, addresses a more specialized problem with narrower impact. Paper 2's interdisciplinary nature (connecting physics concepts to ML theory) and relevance to understanding training dynamics give it broader potential impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Causal Semantic Alignment for LLM-based Time Series Forecasting

Paper 2 offers fundamental theoretical insights into neural network training dynamics, linking physics concepts (conservation laws) with deep learning. While Paper 1 provides a useful methodological improvement for time series forecasting, Paper 2's rigorous mathematical proofs and introduction of 'tensorizable networks' have broader foundational impact, offering a deeper understanding that applies across various deep learning architectures and tasks.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Paper 2 offers higher immediate scientific and practical impact due to its timeliness and clear real-world applications in evaluating LLMs and RAG systems. While Paper 1 provides rigorous fundamental theory on neural network dynamics, Paper 2 directly addresses a critical bottleneck in modern AI: costly human annotations and LLM judge bias. By significantly reducing computational complexity and annotation requirements, Paper 2 provides a highly scalable methodological framework that will broadly impact AI engineering, search, and information retrieval domains.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Paper 1 has higher likely impact: it targets the timely, high-demand area of mechanistic interpretability for sparse autoencoders, offering a unified geometric/set-theoretic framework with concrete notions (detection/separation/approximation), bounds, and explanations of observed SAE phenomena, plus empirical demonstrations. This combination of conceptual clarity and practical relevance could generalize across interpretability, representation learning, and theory. Paper 2 is theoretically interesting but largely a negative result (generic non-conservation) with more niche applicability; its positive cases are restricted (e.g., MSE, tensorizable architectures), likely limiting breadth and near-term influence.

gpt-5.2 · Jun 10, 2026
