Gauthier Boeshertz, Razvan Pascanu, Claudia Clopath
Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models trained on CIFAR10, focusing specifically on the effective rank of the error signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines; for example, on CIFAR100 with a ResNet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.
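To make the mechanism in the abstract concrete, the following is a minimal NumPy sketch of how the hidden-layer update differs between BP and FA in a two-layer MLP. It is not the authors' code: the network sizes, the squared-error loss, and the names `forward`, `hidden_grad`, and `cosine` are all assumptions chosen for illustration. The only difference between the two updates is the matrix used to route the output error back to the hidden layer: the transpose of the forward weights for BP, a fixed random matrix `B` for FA.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer MLP with ReLU hidden units (illustrative architecture)."""
    h_pre = x @ W1.T              # (N, d_hidden) pre-activations
    h = np.maximum(h_pre, 0.0)    # ReLU
    y = h @ W2.T                  # (N, d_out) linear readout
    return h_pre, h, y

def hidden_grad(x, h_pre, err, feedback):
    """Update direction for W1 given the output error and a feedback matrix.

    feedback has shape (d_hidden, d_out).
    BP: feedback = W2.T (transpose of the forward weights).
    FA: feedback = B, a fixed random matrix that is never trained.
    """
    delta_h = (err @ feedback.T) * (h_pre > 0)  # error routed to hidden layer
    return delta_h.T @ x / x.shape[0]

def cosine(a, b):
    """Cosine similarity between two matrices, treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy problem: random inputs, targets, and weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 10))
target = rng.normal(size=(32, 3))
W1 = rng.normal(size=(16, 10)) * 0.1
W2 = rng.normal(size=(3, 16)) * 0.1
B = rng.normal(size=(16, 3)) * 0.1   # fixed random feedback weights

h_pre, h, y = forward(x, W1, W2)
err = y - target                      # gradient of 0.5 * squared error w.r.t. y

g_bp = hidden_grad(x, h_pre, err, W2.T)  # standard backprop gradient for W1
g_fa = hidden_grad(x, h_pre, err, B)     # feedback-alignment update for W1
alignment = cosine(g_bp, g_fa)           # how closely the FA update tracks BP
```

Tracking `alignment` over the course of training is one simple way to observe whether the forward weights are aligning with the feedback weights, as the abstract describes.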
This paper provides a diagnosis and partial remedy for a well-known problem: Feedback Alignment (FA) fails to scale to deeper architectures. The core insight is that FA gradients suffer from rank collapse — they become confined to a low-dimensional subspace compared to backpropagation (BP) gradients, limiting the parameter space exploration needed for weight-feedback alignment to continue. The paper proposes two complementary interventions: (1) the Muon optimizer, which orthogonalizes momentum updates to flatten the singular value spectrum, and (2) Batch Normalization (BN), which promotes orthogonal hidden representations. Combined, these yield substantial improvements: e.g., ResNet-18 on CIFAR-100 jumps from 1.4% (baseline FA) to 46.1% (Muon+BN).
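The diagnosis and the Muon remedy can be illustrated with a short NumPy sketch. Two assumptions should be noted. First, the paper's exact definition of effective rank is not given here; the sketch uses one common entropy-based definition (the exponential of the Shannon entropy of the normalised singular values). Second, Muon is described above only as orthogonalising momentum updates; the sketch uses an exact SVD as a stand-in for whatever approximate orthogonalisation an optimiser would use in practice. The function names and the synthetic update matrix are hypothetical.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), where p are the
    singular values of M normalised to sum to one."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]                      # drop numerically zero entries
    return float(np.exp(-np.sum(p * np.log(p))))

def orthogonalise(M):
    """Replace M = U S V^T by U V^T, i.e. set all singular values to 1.

    Exact SVD is used here for clarity; it stands in for the cheaper
    approximate orthogonalisation an optimiser would apply.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

# Hypothetical update matrix whose spectrum decays geometrically,
# mimicking an update confined to a low-dimensional subspace.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(64, 32)))
V, _ = np.linalg.qr(rng.normal(size=(32, 32)))
spectrum = 0.5 ** np.arange(32)
update = U @ np.diag(spectrum) @ V.T

rank_before = effective_rank(update)                 # far below 32
rank_after = effective_rank(orthogonalise(update))   # flat spectrum
```

Because orthogonalisation flattens the singular value spectrum, `rank_after` reaches the maximum possible value for a 64 by 32 matrix, whereas `rank_before` reflects the few dominant directions of the original update.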
The contribution is primarily diagnostic rather than algorithmic — neither Muon nor BN is novel, but the paper reinterprets their utility through the lens of gradient dimensionality in the FA setting, which is a genuinely new perspective.
The experimental methodology is generally sound, with systematic controlled comparisons.
Within the FA/bio-plausible learning community: This paper provides actionable insights. Identifying rank collapse as a bottleneck gives the community a concrete target for future methods. The observation that gradient geometry, not just alignment, matters is a useful reframing.
Within optimization theory: The finding that Muon's benefits are much more pronounced in FA than BP is interesting — it suggests FA is a useful "stress test" for understanding optimizer behavior when learning signals are approximate.
Biological plausibility: As the authors acknowledge, neither Muon nor BN is biologically plausible in its current form. However, the connection to divisive normalization and homeostatic regulation (discussed in Section 6) opens interesting bridges to neuroscience. The suggestion that maintaining high-dimensional representations is crucial for learning with approximate error signals could inform neural coding theories.
Practical impact: Limited. FA still substantially underperforms BP, and the paper does not claim to close this gap. The practical utility of FA itself remains unclear outside neuroscience-motivated research.
The paper is timely in two respects:
However, the field has somewhat moved beyond pure FA toward methods that adapt feedback weights (Akrout et al., 2019; Kunin et al., 2020) or use other credit assignment schemes. The paper's self-imposed constraint of not adapting feedback weights limits its relevance to a subset of the bio-plausible learning literature.
This is a well-executed diagnostic study that identifies gradient rank collapse as a key failure mode of FA and demonstrates that two existing techniques (Muon, BN) can partially mitigate it. The paper's main value lies in its geometric analysis rather than in proposing new methods. While the improvements are substantial in relative terms, the absolute gap with BP remains large, and the lack of theoretical grounding limits the depth of the contribution. The work will be useful to the bio-plausible learning community but has limited broader impact.
Generated Jun 10, 2026
Paper 2 has higher estimated impact: it tackles a central, timely problem in deep learning—scaling biologically plausible alternatives to backprop—introduces a clear mechanistic diagnosis (rank collapse of the FA error signal), and demonstrates consistent performance gains on modern architectures/benchmarks (e.g., ResNet-18 on CIFAR100). The insight about low-dimensional gradient dynamics can influence optimization, learning theory, and neuroscience-inspired ML. Paper 1 is valuable for fairness in synthetic data, but its contributions are more niche and method-focused (PGM SDG, group-wise models) with narrower cross-field reach.
Paper 2 addresses a highly timely and impactful problem—aligning state-of-the-art flow matching models (e.g., FLUX, SD3.5) with human preferences. Its framework solves critical memory and gradient scaling issues in modern generative AI. While Paper 1 tackles an interesting foundational problem (biologically plausible learning), its empirical validation is limited to older architectures (ResNet-18 on CIFAR100), whereas Paper 2 demonstrates immediate applicability and scalability to large-scale, cutting-edge models, guaranteeing broader immediate adoption.
Paper 1 has higher likely impact due to stronger novelty and broader relevance to current LLM practice: it unifies and tests hypotheses behind task vectors/LoRA/activation steering, introduces an explicit “non-stationary local linear geometry” picture, and adds theory explaining why random search can work in high dimensions. Its applications span model editing, steering, fine-tuning, and interpretability across many pretrained models. Paper 2 is valuable and more biologically motivated, but its scope is narrower (scaling FA in vision nets) and the proposed fixes (orthogonalization/normalization) are more incremental.
The survey on unifying data, memory, and compute efficiency in LLM training addresses a broadly impactful topic at the center of current AI research. Its constraint-centric framework synthesizing data efficiency, memory optimization, and compute budgeting for LLMs has wide applicability across industry and academia. While Paper 2 presents interesting mechanistic insights about feedback alignment's rank collapse and proposes remedies, it addresses a more niche problem (biologically plausible learning) with limited practical adoption compared to backpropagation. The LLM efficiency survey's timeliness, breadth, and practical relevance give it higher potential impact.
Paper 1 likely has higher scientific impact due to strong timeliness and real-world relevance (global 3D hydrometeor/extreme-weather prediction), clear application pathways in operational forecasting and climate analysis, and breadth across ML, meteorology, and remote sensing. The physics-guided architecture plus spectral/adversarial supervision targets a well-known failure mode (oversmoothing, long tails) and reports comparisons against major baselines including GFS and GPM consistency, suggesting practical rigor and adoption potential. Paper 2 is novel for biologically plausible learning and improves FA scaling, but its impact may remain narrower and more contingent on broader uptake beyond specialized deep learning theory.
Paper 2 likely has higher impact: it targets a central, timely question in deep learning and neuroscience—scaling biologically plausible alternatives to backprop—while identifying a clear bottleneck (rank collapse) and demonstrating sizable gains on standard benchmarks and modern architectures. The proposed remedies (orthogonalised updates, activity normalization) are broadly applicable beyond feedback alignment, potentially influencing optimizer/normalization design and theory of gradient dynamics. Paper 1 is novel and rigorous, but its applications are narrower (covariance estimation) and empirical validation appears more limited, suggesting smaller cross-field uptake.
Paper 1 addresses a high-impact industrial problem (semiconductor manufacturing control) with a novel event-driven RL framework validated on industry-real scenarios, demonstrating significant practical gains in throughput and utilization. Its contributions span RL methodology, manufacturing systems, and complex adaptive systems. Paper 2 makes a solid contribution to understanding feedback alignment's limitations (rank collapse) and proposes remedies, but it addresses a more niche problem in biologically plausible learning that has struggled to gain practical traction. Paper 1's real-world applicability to a critical industry and methodological generality give it broader potential impact.
Paper 2 introduces a novel amortization strategy for Bayesian optimization acquisition functions that achieves >50x speedups while maintaining competitive performance. It addresses a widely recognized computational bottleneck in BO with broad applicability across optimization, AutoML, and experimental design. Paper 1 makes a solid contribution to understanding feedback alignment's scaling limitations, but addresses a more niche problem (biologically plausible learning) with incremental improvements. Paper 2's practical speedups, methodological novelty (PFN-based amortization), and broader applicability give it higher potential impact.
Paper 1 addresses a foundational limitation in uncertainty estimation by proposing a novel theoretical framework and metric (EECE) for epistemic calibration. This has broad, immediate real-world applications in high-stakes AI deployments (e.g., healthcare, autonomous driving) where reliable uncertainty quantification is critical. Paper 2 is highly innovative but focuses on biologically plausible alternatives to backpropagation, a more niche subfield with less immediate practical applicability compared to the ubiquitous need for trustworthy uncertainty estimation across all ML domains.
Paper 2 likely has higher impact: it tackles a central, timely problem—scaling biologically plausible alternatives to backprop in deep networks—and provides an empirically validated diagnosis (rank collapse) plus practical interventions that improve performance on standard benchmarks and architectures (e.g., ResNet-18 on CIFAR100). This has clear real-world relevance for training methods and neuroscience-inspired learning, and could influence multiple fields (deep learning optimization, computational neuroscience). Paper 1 is novel and rigorous but narrower (linear Q-learning, deterministic analysis) and less immediately applicable to modern deep RL practice.