Claudio Nordio
We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.
The paper proposes reformulating gradient descent in feed-forward ReLU networks (with fixed readout and quadratic loss) from a weight-space dynamics into a "collective dynamics" expressed in terms of fields defined on the training-set space. The central finding is that this reformulation reveals a hierarchy of three types of dynamical variables:
The key structural claim is that the residual kernel decomposes layer-wise as K = Σ_ℓ Q^(ℓ-1) · S^(ℓ), where the Q^(ℓ-1) are second-order activation overlaps and the S^(ℓ) are fourth-order conjugate-field correlators. This means that depth increases the geometric complexity of the dynamical state without increasing the statistical order of the kernel observables.
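The single-hidden-layer case described in the abstract can be checked numerically. The sketch below is an illustration and is not taken from the paper: it assumes a network f(x) = Σ_i a_i ReLU(w_i · x) with a fixed readout a, and it assumes the product of the two factors acts entry by entry on pairs of training points. All names (`collective_kernel_demo`, `gate`, `G`, `C`) are hypothetical. It builds the residual kernel directly from the Jacobian with respect to the hidden weights and compares it with the product of the input Gram matrix and the readout-weighted co-activation matrix.

```python
import random


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(p * q for p, q in zip(u, v))


def collective_kernel_demo(n_samples=5, n_inputs=3, n_hidden=8, seed=0):
    """Compare the Jacobian Gram matrix with its factorized form.

    Assumed model (not from the paper): f(x) = sum_i a[i] * relu(W[i] . x),
    with the readout a held fixed and only W trained.
    """
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_samples)]
    W = [[rng.gauss(0, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
    a = [rng.choice([-1.0, 1.0]) for _ in range(n_hidden)]  # fixed readout

    # gate[mu][i] = 1 if hidden unit i is active on training point mu.
    gate = [[1.0 if dot(W[i], X[mu]) > 0 else 0.0 for i in range(n_hidden)]
            for mu in range(n_samples)]

    # Jacobian row for point mu: d f(x_mu) / d W[i][k] = a[i] * gate * X[mu][k].
    J = [[a[i] * gate[mu][i] * X[mu][k]
          for i in range(n_hidden) for k in range(n_inputs)]
         for mu in range(n_samples)]

    # Kernel computed directly in weight space: K = J J^T.
    K_direct = [[dot(J[mu], J[nu]) for nu in range(n_samples)]
                for mu in range(n_samples)]

    # Input-geometric factor: Gram matrix of the training inputs.
    G = [[dot(X[mu], X[nu]) for nu in range(n_samples)]
         for mu in range(n_samples)]

    # Dynamical factor: readout-weighted co-activation counts.
    C = [[sum(a[i] ** 2 * gate[mu][i] * gate[nu][i] for i in range(n_hidden))
          for nu in range(n_samples)]
         for mu in range(n_samples)]

    # Entry-wise product of the two factors.
    K_factored = [[G[mu][nu] * C[mu][nu] for nu in range(n_samples)]
                  for mu in range(n_samples)]
    return K_direct, K_factored


if __name__ == "__main__":
    K_direct, K_factored = collective_kernel_demo()
    max_gap = max(abs(K_direct[m][n] - K_factored[m][n])
                  for m in range(len(K_direct)) for n in range(len(K_direct)))
    print("largest entry-wise difference:", max_gap)
```

In this toy setting the input Gram matrix never changes during training, and only the co-activation matrix moves as units switch on or off. How this maps onto the paper's Q and S objects at greater depth is not shown here.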
The paper proceeds through careful, explicit derivation for depths 1 through 4, then extrapolates to arbitrary depth. The calculations are straightforward applications of the chain rule and gradient descent updates, presented with commendable transparency. However, several concerns arise:
The reformulation offers a potentially useful lens for understanding deep network training:
However, the practical impact is currently limited by the restrictive assumptions (fixed readout, ReLU only, quadratic loss, neglected threshold crossings) and the absence of any empirical validation.
The paper addresses a relevant question: what gradient descent actually does in deep networks beyond weight-space optimization. The concurrent appearance of Cha et al. (2025) on weight Gram matrices suggests the community is converging on similar objects from different angles. The connection to the NTK literature and to mean-field/field-theoretic approaches to deep learning (Roberts, Yaida, Hanin 2022) positions this work within active research threads.
The timing is appropriate as the field increasingly recognizes that lazy/NTK descriptions miss feature learning, and researchers seek intermediate descriptions that capture finite-width phenomena while maintaining analytical tractability.
The paper is essentially a calculation paper: it derives equations but does not analyze their consequences. Whether the representation reveals new regimes of training, suggests new optimization strategies, or can predict generalization remains entirely unaddressed. The paper would benefit enormously from (1) numerical validation, (2) analysis of at least one non-trivial consequence of the framework, and (3) a more honest treatment of the closure limitation.
The writing is clear and well-organized, though repetitive due to the case-by-case construction. The paper could be significantly condensed.
Generated Jun 9, 2026
Paper 1 presents a practical, broadly applicable framework for encoding topological shape descriptors (ECT) with neural networks, demonstrating improvements across six diverse benchmarks covering multiple data modalities. It introduces actionable architectural insights (continuous encoding vs. discretization, representation architecture comparisons) with immediate utility in applied ML/TDA. Paper 2 offers elegant theoretical analysis of ReLU network dynamics via collective kernels, but its scope is narrower (specific architecture, quadratic loss) and lacks empirical validation, limiting near-term impact. Paper 1's combination of methodological novelty, empirical rigor, and cross-domain applicability gives it broader potential impact.
Paper 1 has higher likely impact due to a clearer, field-bridging unification of two influential kernel-bandit theories (GP-UCB and DEC/MAMS), plus concrete algorithmic contributions (heterogeneous algorithmic priors, safeguarded master) and an explicit separation construction that sharpens understanding of overparameterized regimes. This combination of conceptual clarification, new methods, and a negative/limitation result is timely for bandits and learning theory, with direct implications for practical kernelized exploration. Paper 2 is elegant and potentially insightful for theory of deep learning dynamics, but appears more exploratory and narrower in immediate applicability and rigor/closure beyond certain depths.
SPACR addresses a practical and widely relevant problem in uncertainty quantification for machine learning, offering a method that integrates conformal prediction into training with clear computational and performance benefits. It has immediate real-world applicability across diverse domains requiring reliable prediction intervals. Paper 2 provides interesting theoretical insights into neural network learning dynamics and kernel structures, but its impact is more niche, primarily advancing theoretical understanding of deep network training without immediate practical applications. SPACR's combination of methodological novelty, practical utility, and demonstrated empirical improvements gives it broader potential impact.
Paper 1 offers a fundamental theoretical contribution to understanding deep learning dynamics by reformulating gradient descent in terms of collective kernel structures and discovering a hierarchy of weight-induced Gram operators. This has broad implications for understanding neural network training, generalization, and the role of depth—core questions in deep learning theory. Paper 2 is a solid incremental improvement to a specific method (PINPF) with relatively narrow application scope (Bayesian particle transport). Paper 1's theoretical insights are more likely to influence multiple research directions across the ML theory community.
Paper 2 likely has higher impact: it targets an urgent, widely-used problem (preference alignment for large text-to-image flow/diffusion models) and proposes a unified framework (FlowBP) that systematizes prior connector methods while offering practical, scalable variants with clear memory/gradient-stability advantages and demonstrated gains on major modern model families. Its real-world applicability and timeliness are strong, with broad relevance to generative modeling and RLHF-style alignment. Paper 1 is conceptually novel for theory of deep learning dynamics, but its immediate applicability and cross-field uptake are less certain.
Paper 2 addresses fundamental theoretical questions regarding the learning dynamics of deep neural networks. Theoretical advancements that mathematically describe gradient descent dynamics often yield a broader, more profound long-term impact across the entire machine learning field compared to specialized, domain-specific architectures like the federated graph learning framework proposed in Paper 1.
Paper 2 addresses a fundamental theoretical question about deep learning dynamics, deriving closed-form descriptions of gradient descent in terms of collective fields on training-set space rather than weight space. This reveals new structural insights (hierarchy of weight-induced Gram operators) about how information propagates in deep ReLU networks, connecting to the broader neural tangent kernel literature. Its breadth of impact across theoretical ML, optimization theory, and deep learning understanding is greater than Paper 1, which applies an existing architecture (TabTransformer) to a specific sports analytics domain with incremental improvements.
Paper 2 demonstrates higher potential scientific impact due to its direct applicability to critical real-world problems like molecular design and protein engineering. By bridging the highly timely fields of in-context learning foundation models and latent-space Bayesian optimization, it offers a practical tool for scientific discovery across chemistry and biology. While Paper 1 provides rigorous theoretical insights into neural network learning dynamics, Paper 2's methodological innovation solves a practical distribution mismatch problem, paving the way for immediate, broad impact in applied sciences and AI-driven drug discovery.
Paper 2 addresses the highly timely and impactful fields of AI interpretability, representation learning, and model alignment. By providing a unifying framework, a new benchmark, and a novel autoencoder model (CoSAE), it offers broad utility across multiple domains like AI safety and multimodal learning. Paper 1 offers rigorous theoretical insights into neural network learning dynamics, but its impact is likely confined to a narrower theoretical machine learning audience compared to the broader practical and conceptual implications of Paper 2.
Paper 1 presents a complete, practical system (AutoMegaKernel) with extensive empirical validation across multiple GPU architectures, demonstrating real speedups for LLM inference. It addresses a timely problem (efficient LLM deployment), includes novel contributions in static verification of GPU kernel safety, agent-driven code synthesis, and cross-architecture retargeting. Paper 2 provides interesting theoretical insights into neural network training dynamics via kernel decompositions, but its impact is more incremental within a well-studied theoretical area (NTK-style analyses) and lacks immediate practical applications. Paper 1's breadth of impact across systems, ML, and compiler communities, combined with its open-source release, gives it higher potential impact.