Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

Claudio Nordio

Jun 8, 2026 · arXiv:2606.09744v1

cs.LG · cond-mat.dis-nn

#4324 of 5669 · cs.LG

Tournament Score: 1330 ± 43 (scale 1050 to 1750)

Win Rate: 38%

Rating: 3.5/10

Significance 4 · Rigor 4.5 · Novelty 5 · Clarity 6.5

Abstract

We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

1. Core Contribution

The paper proposes reformulating gradient descent in feed-forward ReLU networks (with fixed readout and quadratic loss) from a weight-space dynamics into a "collective dynamics" expressed in terms of fields defined on the training-set space. The central finding is that this reformulation reveals a hierarchy of three types of dynamical variables:

Activation fields (u_ℓ): sufficient for closure at depth 1

Conjugate fields (b_ℓ): needed starting at depth 2, defined recursively as backpropagated weighted activation indicators

Pullback Gram metrics (G_ℓ): quadratic weight-dependent operators that emerge at depth ≥3, taking the form W^T D W where D is a co-activation projector

The key structural claim is that the residual kernel decomposes layer-wise as K = Σ Q^(ℓ-1) · S^(ℓ), where Q are second-order activation overlaps and S are fourth-order conjugate-field correlators. This means depth increases geometric complexity of the dynamical state without increasing the statistical order of kernel observables.
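As a concrete illustration of the depth-1 statement, the following minimal Python sketch checks on a hand-picked toy network that one gradient step changes the residuals by -η K r when no unit crosses its threshold, with K taken as the elementwise product of the input Gram matrix and a co-activation matrix. The data, weights, learning rate, helper names, and the unnormalized form of K are assumptions made for this illustration; the paper's own normalization and notation may differ.

```python
# Toy check of a depth-1 kernel factorization (all values are illustrative).
# Model: f(x) = sum_i readout[i] * relu(W[i] . x), readout fixed,
# loss = 0.5 * sum_mu r_mu^2 with residual r_mu = f(x_mu) - y_mu.

X = [[1.0, 0.5], [-0.3, 1.2], [0.8, -0.7]]                 # training inputs
y = [1.0, -0.5, 0.8]                                        # targets
W = [[0.6, -0.4], [-0.9, 0.2], [0.3, 0.7], [-0.5, -0.8]]    # trainable hidden weights
readout = [0.5, -0.5, 0.5, -0.5]                            # fixed readout (assumed values)
eta = 0.01                                                  # learning rate
n = len(X)


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def pattern(W):
    """Activation indicators theta[i][mu] = 1 if unit i is active on sample mu."""
    return [[1.0 if dot(w, x) > 0 else 0.0 for x in X] for w in W]


def residuals(W):
    return [sum(readout[i] * max(dot(w, X[mu]), 0.0) for i, w in enumerate(W)) - y[mu]
            for mu in range(n)]


def kernel(W):
    """K[v][m] = (x_v . x_m) * sum_i readout[i]^2 * theta[i][v] * theta[i][m]."""
    th = pattern(W)
    gram = [[dot(X[v], X[m]) for m in range(n)] for v in range(n)]       # input geometry
    coact = [[sum(readout[i] ** 2 * th[i][v] * th[i][m] for i in range(len(W)))
              for m in range(n)] for v in range(n)]                      # co-activation
    return [[gram[v][m] * coact[v][m] for m in range(n)] for v in range(n)]


def gd_step(W, eta):
    """One full-batch gradient descent step on the hidden weights only."""
    th, r = pattern(W), residuals(W)
    return [[w[k] - eta * readout[i] * sum(r[m] * th[i][m] * X[m][k] for m in range(n))
             for k in range(len(w))] for i, w in enumerate(W)]


r_old = residuals(W)
K = kernel(W)
W_new = gd_step(W, eta)
r_new = residuals(W_new)

# Kernel prediction for the updated residuals: r - eta * K r.
predicted = [r_old[v] - eta * sum(K[v][m] * r_old[m] for m in range(n)) for v in range(n)]

pattern_unchanged = pattern(W_new) == pattern(W)
max_error = max(abs(p - q) for p, q in zip(predicted, r_new))
print("activation pattern unchanged:", pattern_unchanged)
print("max |predicted - actual| residual:", max_error)
```

Because ReLU is piecewise linear, the agreement in this sketch is limited only by floating-point rounding as long as the activation pattern stays fixed; it says nothing about the deeper cases, where the review notes that closure needs additional variables.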

2. Methodological Rigor

The paper proceeds through careful, explicit derivation for depths 1 through 4, then extrapolates to arbitrary depth. The calculations are straightforward applications of the chain rule and gradient descent updates, presented with commendable transparency. However, several concerns arise:

Strengths in rigor:

The one-hidden-layer case is exact and completely self-contained

The two-hidden-layer derivation is clean and the emergence of conjugate fields is well-motivated

Properties of the co-activation projectors (symmetry, positive semidefiniteness, idempotence) are rigorously established
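The projector properties listed above are easy to spot-check numerically. The sketch below assumes that the co-activation projector for a pair of samples is the diagonal 0/1 matrix marking units active on both samples, and that the pullback Gram operator is G = W^T D W with W the weights feeding those units; both forms, the indicator vectors, and the weight values are illustrative assumptions rather than the paper's exact definitions.

```python
# Spot-check of projector properties for an assumed co-activation projector D
# and the induced pullback Gram operator G = W^T D W (illustrative values only).

theta_mu = [1, 0, 1, 1]   # units active on sample mu (hypothetical)
theta_nu = [1, 1, 0, 1]   # units active on sample nu (hypothetical)
W_next = [[0.6, -0.4, 0.1],
          [-0.9, 0.2, 0.5],
          [0.3, 0.7, -0.2],
          [-0.5, -0.8, 0.4]]   # weights into the 4 units from 3 upstream units


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*B)] for row in A]


n_units = len(theta_mu)
# D is diagonal with entry 1 where a unit is active on both samples.
D = [[float(theta_mu[i] * theta_nu[i]) if i == j else 0.0 for j in range(n_units)]
     for i in range(n_units)]
G = matmul(transpose(W_next), matmul(D, W_next))

d_symmetric = D == transpose(D)
d_idempotent = matmul(D, D) == D
g_symmetric = all(abs(G[j][k] - G[k][j]) < 1e-12
                  for j in range(len(G)) for k in range(len(G)))


def quad_form(M, v):
    return sum(v[j] * M[j][k] * v[k] for j in range(len(v)) for k in range(len(v)))


# Positive semidefiniteness is only probed on a few test vectors here.
test_vectors = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, -1, 0.5], [-0.3, 0.2, 0.9]]
min_quad = min(quad_form(G, v) for v in test_vectors)

print("D symmetric:", d_symmetric, "| D idempotent:", d_idempotent)
print("G symmetric:", g_symmetric, "| min v^T G v over test vectors:", min_quad)
```

Under these assumptions v^T G v equals the squared norm of D W v, which is why the quadratic form cannot be negative; the test vectors only illustrate that fact and do not replace the paper's proof.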

Weaknesses in rigor:

The paper consistently neglects threshold-crossing events (that is, it assumes Δa = 0), which is a significant approximation for ReLU networks. This is stated but its implications are not quantified or bounded. For finite learning rates, activation pattern changes can be substantial.

The "closure" at depth ≥3 is qualified in footnote 2—the Gram operators G_ℓ explicitly depend on weights W^(ℓ+1), so the system is not truly closed without tracking weight evolution. The paper acknowledges this but the framing sometimes obscures this important caveat.

The extension to arbitrary depth (Section 7) is presented as suggestive rather than proven. The recursive structure is plausible but a formal induction proof is absent.

No numerical experiments validate the theoretical framework, even for the simplest cases.

The fixed-readout assumption is restrictive and eliminates important dynamics in the final layer.
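The threshold-crossing concern raised in this list can be made tangible with a toy experiment. The sketch below, written for a one-hidden-layer network with hand-picked values, counts how many activation indicators flip after a single gradient step and how far the actual residuals drift from the fixed-pattern kernel prediction r - η K r. All numbers, helper names, and the unnormalized kernel are assumptions for illustration, not quantities taken from the paper.

```python
# Toy illustration: activation flips and kernel-prediction error vs. learning rate.
# Model: f(x) = sum_i readout[i] * relu(W[i] . x), readout fixed, quadratic loss.

X = [[1.0, 0.5], [-0.3, 1.2], [0.8, -0.7]]                  # training inputs
y = [1.0, -0.5, 0.8]                                         # targets
W0 = [[0.6, -0.4], [-0.9, 0.2], [0.3, 0.7], [-0.5, -0.8]]    # initial hidden weights
readout = [0.5, -0.5, 0.5, -0.5]                             # fixed readout (assumed)
n = len(X)


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def pattern(W):
    return [[dot(w, x) > 0 for x in X] for w in W]


def residuals(W):
    return [sum(readout[i] * max(dot(w, X[m]), 0.0) for i, w in enumerate(W)) - y[m]
            for m in range(n)]


def gd_step(W, eta):
    th, r = pattern(W), residuals(W)
    return [[w[k] - eta * readout[i] * sum(r[m] * th[i][m] * X[m][k] for m in range(n))
             for k in range(len(w))] for i, w in enumerate(W)]


def kernel_prediction(W, eta):
    """Residuals after one step if the activation pattern stayed frozen."""
    th, r = pattern(W), residuals(W)
    K = [[dot(X[v], X[m]) * sum(readout[i] ** 2 * th[i][v] * th[i][m]
                                for i in range(len(W)))
          for m in range(n)] for v in range(n)]
    return [r[v] - eta * sum(K[v][m] * r[m] for m in range(n)) for v in range(n)]


def run(eta):
    W1 = gd_step(W0, eta)
    flips = sum(p != q for row0, row1 in zip(pattern(W0), pattern(W1))
                for p, q in zip(row0, row1))
    error = max(abs(p - q) for p, q in zip(kernel_prediction(W0, eta), residuals(W1)))
    return flips, error


results = {eta: run(eta) for eta in (0.01, 1.0)}
for eta, (flips, error) in results.items():
    print(f"eta={eta}: flipped indicators={flips}, max prediction error={error:.3e}")
```

With these hand-picked values the small step leaves the activation pattern intact, while the large step flips at least one indicator and the frozen-pattern prediction no longer matches. A single toy network cannot bound the effect in general, which is the gap the review points out.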

3. Potential Impact

The reformulation offers a potentially useful lens for understanding deep network training:

Theoretical understanding: The factorization of learning dynamics into geometric (input overlap) and dynamical (co-activation) components could inform theoretical analyses of feature learning in finite-width networks.

Connection to NTK: The framework generalizes NTK-type analysis while retaining finite-width structure, potentially bridging lazy and feature-learning regimes.

Spectral theory connection: The speculative connection to WeightWatcher's heavy-tailed spectral observations (Section 9.3) is interesting but entirely qualitative—no computation or simulation supports this link.

Geometric interpretation: The pullback Gram metric interpretation could inspire new architectural insights or training diagnostics.

However, the practical impact is currently limited by the restrictive assumptions (fixed readout, ReLU only, quadratic loss, neglected threshold crossings) and the absence of any empirical validation.

4. Timeliness & Relevance

The paper addresses a relevant question—understanding what gradient descent actually does in deep networks beyond weight-space optimization. The concurrent appearance of Cha et al. (2025) on weight Gram matrices suggests the community is converging on similar objects from different angles. The connection to the NTK literature and mean-field/field-theoretic approaches to deep learning (Roberts, Yaida, Hanin 2022) positions this work within active research threads.

The timing is appropriate as the field increasingly recognizes that lazy/NTK descriptions miss feature learning, and researchers seek intermediate descriptions that capture finite-width phenomena while maintaining analytical tractability.

5. Strengths & Limitations

Key Strengths:

Clean mathematical presentation with explicit, reproducible derivations

The progressive construction from 1 to 4 hidden layers effectively conveys the structural emergence

The observation that weight dependence enters only through quadratic Gram operators is elegant

The factorization K = Q · S and its persistence across depths is a concrete, testable structural claim

The fourth-order closure property is a non-trivial finding about the complexity of collective descriptions

Notable Limitations:

The "closure" is incomplete—G_ℓ operators carry explicit weight dependence, undermining the stated goal of eliminating weights

No experiments whatsoever—not even a toy demonstration on a small network

The neglect of threshold crossings is uncontrolled and potentially invalidates the analysis for practical learning rates

Fixed readout is a severe restriction; the final layer's trainability is often crucial

The paper does not analyze convergence, stability, or any dynamical consequences of the formulation

The connection to WeightWatcher is speculative without quantitative support

Self-described as a "draft research note," suggesting incompleteness

The paper does not discuss how this compares to or improves upon existing mean-field descriptions or tensor program frameworks

6. Additional Observations

The paper is essentially a calculation paper: it derives equations but does not analyze their consequences. Questions such as whether this representation reveals new regimes of training, suggests new optimization strategies, or can predict generalization remain entirely unaddressed. The paper would benefit enormously from (1) numerical validation, (2) analysis of at least one non-trivial consequence of the framework, and (3) a more honest treatment of the closure limitation.

The writing is clear and well-organized, though repetitive due to the case-by-case construction. The paper could be significantly condensed.

Rating: 3.5/10

Significance 4 · Rigor 4.5 · Novelty 5 · Clarity 6.5

Generated Jun 9, 2026

Comparison History (24)

Lost vs. Encoding the Euler Characteristic Transform

Paper 1 presents a practical, broadly applicable framework for encoding topological shape descriptors (ECT) with neural networks, demonstrating improvements across six diverse benchmarks covering multiple data modalities. It introduces actionable architectural insights (continuous encoding vs. discretization, representation architecture comparisons) with immediate utility in applied ML/TDA. Paper 2 offers elegant theoretical analysis of ReLU network dynamics via collective kernels, but its scope is narrower (specific architecture, quadratic loss) and lacks empirical validation, limiting near-term impact. Paper 1's combination of methodological novelty, empirical rigor, and cross-domain applicability gives it broader potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Algorithmic and Minimax Complexities in Kernel Bandits

Paper 1 has higher likely impact due to a clearer, field-bridging unification of two influential kernel-bandit theories (GP-UCB and DEC/MAMS), plus concrete algorithmic contributions (heterogeneous algorithmic priors, safeguarded master) and an explicit separation construction that sharpens understanding of overparameterized regimes. This combination of conceptual clarification, new methods, and a negative/limitation result is timely for bandits and learning theory, with direct implications for practical kernelized exploration. Paper 2 is elegant and potentially insightful for theory of deep learning dynamics, but appears more exploratory and narrower in immediate applicability and rigor/closure beyond certain depths.

gpt-5.2 · Jun 10, 2026

Lost vs. SPACR: Single-Pass Adaptive Training of Uncertainty-Aware Conformal Regressors

SPACR addresses a practical and widely relevant problem in uncertainty quantification for machine learning, offering a method that integrates conformal prediction into training with clear computational and performance benefits. It has immediate real-world applicability across diverse domains requiring reliable prediction intervals. Paper 2 provides interesting theoretical insights into neural network learning dynamics and kernel structures, but its impact is more niche, primarily advancing theoretical understanding of deep network training without immediate practical applications. SPACR's combination of methodological novelty, practical utility, and demonstrated empirical improvements gives it broader potential impact.

claude-opus-4-6 · Jun 10, 2026

Won vs. Population-Aware Physics-Informed Neural Particle Flow for Bayesian Update

Paper 1 offers a fundamental theoretical contribution to understanding deep learning dynamics by reformulating gradient descent in terms of collective kernel structures and discovering a hierarchy of weight-induced Gram operators. This has broad implications for understanding neural network training, generalization, and the role of depth—core questions in deep learning theory. Paper 2 is a solid incremental improvement to a specific method (PINPF) with relatively narrow application scope (Bayesian particle transport). Paper 1's theoretical insights are more likely to influence multiple research directions across the ML theory community.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Exploring the Design Space of Reward Backpropagation for Flow Matching

Paper 2 likely has higher impact: it targets an urgent, widely-used problem (preference alignment for large text-to-image flow/diffusion models) and proposes a unified framework (FlowBP) that systematizes prior connector methods while offering practical, scalable variants with clear memory/gradient-stability advantages and demonstrated gains on major modern model families. Its real-world applicability and timeliness are strong, with broad relevance to generative modeling and RLHF-style alignment. Paper 1 is conceptually novel for theory of deep learning dynamics, but its immediate applicability and cross-field uptake are less certain.

gpt-5.2 · Jun 10, 2026

Won vs. PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

Paper 2 addresses fundamental theoretical questions regarding the learning dynamics of deep neural networks. Theoretical advancements that mathematically describe gradient descent dynamics often yield a broader, more profound long-term impact across the entire machine learning field compared to specialized, domain-specific architectures like the federated graph learning framework proposed in Paper 1.

gemini-3.1-pro-preview · Jun 9, 2026

Won vs. A Universal Dense Football Event Representation Based on TabTransformer

Paper 2 addresses a fundamental theoretical question about deep learning dynamics, deriving closed-form descriptions of gradient descent in terms of collective fields on training-set space rather than weight space. This reveals new structural insights (hierarchy of weight-induced Gram operators) about how information propagates in deep ReLU networks, connecting to the broader neural tangent kernel literature. Its breadth of impact across theoretical ML, optimization theory, and deep learning understanding is greater than Paper 1, which applies an existing architecture (TabTransformer) to a specific sports analytics domain with incremental improvements.

claude-opus-4-6 · Jun 9, 2026

Lost vs. In-Context Learning for Latent Space Bayesian Optimization

Paper 2 demonstrates higher potential scientific impact due to its direct applicability to critical real-world problems like molecular design and protein engineering. By bridging the highly timely fields of in-context learning foundation models and latent-space Bayesian optimization, it offers a practical tool for scientific discovery across chemistry and biology. While Paper 1 provides rigorous theoretical insights into neural network learning dynamics, Paper 2's methodological innovation solves a practical distribution mismatch problem, paving the way for immediate, broad impact in applied sciences and AI-driven drug discovery.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. A Unifying Framework for Concept-Based Representational Similarity

Paper 2 addresses the highly timely and impactful fields of AI interpretability, representation learning, and model alignment. By providing a unifying framework, a new benchmark, and a novel autoencoder model (CoSAE), it offers broad utility across multiple domains like AI safety and multimodal learning. Paper 1 offers rigorous theoretical insights into neural network learning dynamics, but its impact is likely confined to a narrower theoretical machine learning audience compared to the broader practical and conceptual implications of Paper 2.

gemini-3.1-pro-preview · Jun 9, 2026

Lost vs. AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

Paper 1 presents a complete, practical system (AutoMegaKernel) with extensive empirical validation across multiple GPU architectures, demonstrating real speedups for LLM inference. It addresses a timely problem (efficient LLM deployment), includes novel contributions in static verification of GPU kernel safety, agent-driven code synthesis, and cross-architecture retargeting. Paper 2 provides interesting theoretical insights into neural network training dynamics via kernel decompositions, but its impact is more incremental within a well-studied theoretical area (NTK-style analyses) and lacks immediate practical applications. Paper 1's breadth of impact across systems, ML, and compiler communities, combined with its open-source release, gives it higher potential impact.

claude-opus-4-6 · Jun 9, 2026
