Overcoming Rank Collapse in Feedback Alignment

Gauthier Boeshertz, Razvan Pascanu, Claudia Clopath

Jun 9, 2026 · arXiv:2606.11123v1

cs.LG

#3066 of 5669 · cs.LG

Tournament Score: 1392±43

Win Rate: 50%

Rating: 5.5 / 10

Significance 5.5 · Rigor 6.5 · Novelty 5 · Clarity 7.5

Abstract

Backpropagation (BP) is widely viewed as biologically implausible, in part because it requires feedback weights to be the transpose of forward weights for error propagation. Interestingly, when training a network with fixed random feedback weights to circumvent this issue, learning aligns the forward weights with the feedback weights, leading the backpropagated error signal to become an approximation of the standard gradient used by BP. This process, called Feedback Alignment (FA), occurs in MLPs and very shallow CNNs but does not scale well to deeper architectures. In this work, we first investigated differences between BP and FA models, trained on CIFAR10, specifically focusing on the effective rank of the signal. We found that the FA error has a considerably lower rank and hence is constrained to a lower-dimensional subspace compared to BP, limiting exploration of the parameter space. Motivated by this observation, we evaluated two mechanisms for increasing the effective dimensionality of FA: Muon, an optimiser that orthogonalises weight updates; and hidden activity normalisation, which promotes activation orthogonality. Across larger architectures and benchmarks, we find that these methods consistently improve over FA baselines; for example, on CIFAR100 with a ResNet-18, accuracy increases by 9 percentage points. Our results identify low-dimensional gradient dynamics as a key obstacle to scaling FA and suggest that inducing higher-dimensional update geometry is a promising route toward scaling alternatives to backpropagation.
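To make the mechanism described in the abstract concrete, the following is a minimal NumPy sketch, not the authors' code. It contrasts the hidden-layer update under BP (error sent back through `W2.T`) with the update under FA (error sent back through a fixed random matrix `B2`) and reports the cosine between the two. The network sizes, the tanh nonlinearity, the squared-error loss, the linear teacher `T` and the learning rate are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy network x -> h = tanh(W1 x) -> y = W2 h. All sizes are illustrative.
n_in, n_hid, n_out, batch = 20, 32, 5, 64
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))
B2 = rng.normal(0, 0.1, (n_hid, n_out))  # fixed random feedback, stands in for W2.T


def cosine(a, b):
    """Cosine similarity between two arrays treated as flat vectors."""
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))


def hidden_grads(W1, W2, B, x, target):
    """Return hidden activity, output error, and the BP and FA updates for W1."""
    n = x.shape[1]
    h = np.tanh(W1 @ x)
    err = W2 @ h - target              # dL/dy for L = 0.5 * ||y - target||^2
    dphi = 1.0 - h ** 2                # derivative of tanh
    delta_bp = (W2.T @ err) * dphi     # BP: error travels through the transpose
    delta_fa = (B @ err) * dphi        # FA: error travels through fixed B
    return h, err, delta_bp @ x.T / n, delta_fa @ x.T / n


x = rng.normal(size=(n_in, batch))
T = rng.normal(0, 0.3, (n_out, n_in))  # hypothetical linear teacher
target = T @ x

lr = 0.01
_, _, g_bp, g_fa = hidden_grads(W1, W2, B2, x, target)
align_start = cosine(g_bp, g_fa)
for _ in range(200):
    h, err, g_bp, g_fa = hidden_grads(W1, W2, B2, x, target)
    W2 -= lr * (err @ h.T) / batch     # output layer: identical under BP and FA
    W1 -= lr * g_fa                    # hidden layer: FA update, never reads W2.T
align_end = cosine(g_bp, g_fa)
print(f"gradient alignment: start {align_start:.3f}, end {align_end:.3f}")
```

Passing `W2.T` as the feedback matrix makes the two updates coincide, which is a quick sanity check that the only difference between the two rules in this sketch is the feedback pathway.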

AI Impact Assessments

(1 model)

Scientific Impact Assessment: "Overcoming Rank Collapse in Feedback Alignment"

1. Core Contribution

This paper provides a diagnosis and partial remedy for a well-known problem: Feedback Alignment (FA) fails to scale to deeper architectures. The core insight is that FA gradients suffer from rank collapse — they become confined to a low-dimensional subspace compared to backpropagation (BP) gradients, limiting the parameter space exploration needed for weight-feedback alignment to continue. The paper proposes two complementary interventions: (1) the Muon optimizer, which orthogonalizes momentum updates to flatten the singular value spectrum, and (2) Batch Normalization (BN), which promotes orthogonal hidden representations. Combined, these yield substantial improvements: e.g., ResNet-18 on CIFAR-100 jumps from 1.4% (baseline FA) to 46.1% (Muon+BN).
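The two quantities in this paragraph, effective rank and orthogonalised updates, can be illustrated with a short sketch. It assumes the entropy-based definition of effective rank (exponential of the Shannon entropy of the normalised singular values); the review does not state which definition the paper uses. The orthogonalisation is a plain cubic Newton-Schulz iteration, used here as a simplified stand-in for the tuned polynomial in Muon, and the "collapsed" matrix `G` is synthetic.

```python
import numpy as np


def effective_rank(M, eps=1e-12):
    """exp(Shannon entropy) of the normalised singular values (assumed definition)."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))


def newton_schulz_orthogonalise(G, steps=40):
    """Cubic Newton-Schulz iteration towards a semi-orthogonal matrix.

    Simplified stand-in: Muon uses a tuned polynomial and only a few steps.
    Assumes G has at least as many columns as rows and full row rank.
    """
    X = G / (np.linalg.norm(G) + 1e-12)    # Frobenius scaling: singular values <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X  # pushes every singular value towards 1
    return X


rng = np.random.default_rng(0)
# Synthetic "collapsed" update: one dominant direction plus small full-rank noise.
u = rng.normal(size=(16, 1))
v = rng.normal(size=(1, 64))
G = u @ v + 0.05 * rng.normal(size=(16, 64))

O = newton_schulz_orthogonalise(G)
er_before = effective_rank(G)
er_after = effective_rank(O)
print(f"effective rank before: {er_before:.2f}, after: {er_after:.2f}")
```

Because the iteration drives every non-zero singular value towards 1, the spectrum of the orthogonalised update is flat and its effective rank approaches the number of rows, which is the sense in which orthogonalisation raises update dimensionality.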

The contribution is primarily diagnostic rather than algorithmic — neither Muon nor BN is novel, but the paper reinterprets their utility through the lens of gradient dimensionality in the FA setting, which is a genuinely new perspective.

2. Methodological Rigor

The experimental methodology is generally sound, with systematic controlled comparisons:

Strengths in methodology:

Clean ablation structure varying depth (1-4 layers), optimizers (SGD, AdamW, Muon), and normalization (with/without BN)

Multiple complementary metrics: weight alignment, gradient alignment, effective rank of gradients, gradient trajectory dimensionality, feature dimensionality

The low-rank SGD control experiment (Figure 3) is well-designed — showing that deliberately reducing update dimensionality catastrophically harms FA while leaving BP relatively unaffected

The noise injection experiment (Appendix D) rules out the alternative hypothesis that arbitrary high-dimensional perturbations suffice

The Freon interpolation (Appendix C) provides a smooth parameterization between SGD and Muon, showing monotonic improvement
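The low-rank SGD control listed among these strengths can be pictured as follows. This is a hedged sketch of one way to restrict update dimensionality, by keeping only the top `k` singular directions of each weight update. The review does not describe the paper's exact procedure, so `truncate_rank` and `low_rank_sgd_step` are hypothetical helper names rather than the authors' implementation.

```python
import numpy as np


def truncate_rank(G, k):
    """Best rank-k approximation of G (in Frobenius norm) via truncated SVD."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]


def low_rank_sgd_step(W, G, lr, k):
    """Plain SGD step whose update is confined to a k-dimensional subspace."""
    return W - lr * truncate_rank(G, k)


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
G = rng.normal(size=(16, 32))   # stand-in for a BP or FA weight gradient

W_next = low_rank_sgd_step(W, G, lr=0.1, k=3)
update = W - W_next
print("rank of applied update:", np.linalg.matrix_rank(update))
```

Sweeping `k` from 1 up to the full rank, separately for BP and FA gradients, would be the kind of comparison the control experiment describes.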

Weaknesses in methodology:

Using only 2 random seeds per experiment is minimal, though the authors claim the results are stable

The causal direction of the rank collapse → poor performance link is not definitively established; it could be correlational (poor learning → low-rank gradients)

The paper lacks formal theoretical analysis of why FA gradients should be lower rank. The observation is empirical, and a theoretical treatment would strengthen the claims significantly

CIFAR-10/100 and Tiny ImageNet are relatively small-scale benchmarks; the gap with BP remains very large (e.g., 46.1% vs 75.2% on CIFAR-100 with ResNet-18)

3. Potential Impact

Within the FA/bio-plausible learning community: This paper provides actionable insights. Identifying rank collapse as a bottleneck gives the community a concrete target for future methods. The observation that gradient geometry, not just alignment, matters is a useful reframing.

Within optimization theory: The finding that Muon's benefits are much more pronounced in FA than BP is interesting — it suggests FA is a useful "stress test" for understanding optimizer behavior when learning signals are approximate.

Biological plausibility: As the authors acknowledge, neither Muon nor BN is biologically plausible in its current form. However, the connection to divisive normalization and homeostatic regulation (discussed in Section 6) opens interesting bridges to neuroscience. The suggestion that maintaining high-dimensional representations is crucial for learning with approximate error signals could inform neural coding theories.

Practical impact: Limited. FA still substantially underperforms BP, and the paper does not claim to close this gap. The practical utility of FA itself remains unclear outside neuroscience-motivated research.

4. Timeliness & Relevance

The paper is timely in two respects:

Muon is a very recent optimizer gaining attention in the LLM community; applying it to FA is novel and reveals properties of Muon beyond its standard use case

Biologically plausible learning remains an active research area, and understanding FA's failure modes has been an open question since Bartunov et al. (2018) highlighted scaling issues

However, the field has somewhat moved beyond pure FA toward methods that adapt feedback weights (Akrout et al., 2019; Kunin et al., 2020) or use other credit assignment schemes. The paper's self-imposed constraint of not adapting feedback weights limits its relevance to a subset of the bio-plausible learning literature.

5. Strengths & Limitations

Key strengths:

Clear, well-structured narrative from diagnosis to intervention to validation

The effective rank analysis provides a compelling geometric explanation for FA failure

Comprehensive set of control experiments (low-rank SGD, noise injection, Freon interpolation) that strengthen the causal interpretation

Complementarity of Muon and BN is demonstrated, suggesting they address related but distinct aspects of the problem

Extensive appendix with per-layer, per-depth analyses

Notable limitations:

The gap between FA and BP remains large (often 20-30+ percentage points), limiting practical significance

No theoretical justification for why FA gradients exhibit rank collapse — the paper is entirely empirical

The paper doesn't compare to more recent FA variants (sign-concordant feedback, learned feedback) that partially close the BP-FA gap

BN's effects on FA could be confounded by its many other benefits (smoother optimization landscape, learning rate robustness), and the paper does not disentangle these

Scale remains modest — no experiments on ImageNet-scale or with modern architectures (ViTs, etc.)

The local loss approach (Appendix B) is acknowledged to not scale, which weakens the "dimensionality is the key" narrative somewhat

Summary

This is a well-executed diagnostic study that identifies gradient rank collapse as a key failure mode of FA and demonstrates that two existing techniques (Muon, BN) can partially mitigate it. The paper's main value lies in its geometric analysis rather than in proposing new methods. While the improvements are substantial in relative terms, the absolute gap with BP remains large, and the lack of theoretical grounding limits the depth of the contribution. The work will be useful to the bio-plausible learning community but has limited broader impact.

Rating: 5.5 / 10

Significance 5.5 · Rigor 6.5 · Novelty 5 · Clarity 7.5

Generated Jun 10, 2026

Comparison History (20)

Won vs. Disparate Impact in Synthetic Data Generation

Paper 2 has higher estimated impact: it tackles a central, timely problem in deep learning—scaling biologically plausible alternatives to backprop—introduces a clear mechanistic diagnosis (rank collapse of the FA error signal), and demonstrates consistent performance gains on modern architectures/benchmarks (e.g., ResNet-18 on CIFAR100). The insight about low-dimensional gradient dynamics can influence optimization, learning theory, and neuroscience-inspired ML. Paper 1 is valuable for fairness in synthetic data, but its contributions are more niche and method-focused (PGM SDG, group-wise models) with narrower cross-field reach.

gpt-5.2 · Jun 12, 2026

Lost vs. Exploring the Design Space of Reward Backpropagation for Flow Matching

Paper 2 addresses a highly timely and impactful problem—aligning state-of-the-art flow matching models (e.g., FLUX, SD3.5) with human preferences. Its framework solves critical memory and gradient scaling issues in modern generative AI. While Paper 1 tackles an interesting foundational problem (biologically plausible learning), its empirical validation is limited to older architectures (ResNet-18 on CIFAR100), whereas Paper 2 demonstrates immediate applicability and scalability to large-scale, cutting-edge models, guaranteeing broader immediate adoption.

gemini-3.1-pro-preview · Jun 10, 2026

Lost vs. Recoverable but Not Stationary: Local Linear Structures in Weights and Activations

Paper 1 has higher likely impact due to stronger novelty and broader relevance to current LLM practice: it unifies and tests hypotheses behind task vectors/LoRA/activation steering, introduces an explicit “non-stationary local linear geometry” picture, and adds theory explaining why random search can work in high dimensions. Its applications span model editing, steering, fine-tuning, and interpretability across many pretrained models. Paper 2 is valuable and more biologically motivated, but its scope is narrower (scaling FA in vision nets) and the proposed fixes (orthogonalization/normalization) are more incremental.

gpt-5.2 · Jun 10, 2026

Lost vs. Unifying Data, Memory, and Compute Efficiency in LLM training: A Survey

The survey on unifying data, memory, and compute efficiency in LLM training addresses a broadly impactful topic at the center of current AI research. Its constraint-centric framework synthesizing data efficiency, memory optimization, and compute budgeting for LLMs has wide applicability across industry and academia. While Paper 2 presents interesting mechanistic insights about feedback alignment's rank collapse and proposes remedies, it addresses a more niche problem (biologically plausible learning) with limited practical adoption compared to backpropagation. The LLM efficiency survey's timeliness, breadth, and practical relevance give it higher potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

Paper 1 likely has higher scientific impact due to strong timeliness and real-world relevance (global 3D hydrometeor/extreme-weather prediction), clear application pathways in operational forecasting and climate analysis, and breadth across ML, meteorology, and remote sensing. The physics-guided architecture plus spectral/adversarial supervision targets a well-known failure mode (oversmoothing, long tails) and reports comparisons against major baselines including GFS and GPM consistency, suggesting practical rigor and adoption potential. Paper 2 is novel for biologically plausible learning and improves FA scaling, but its impact may remain narrower and more contingent on broader uptake beyond specialized deep learning theory.

gpt-5.2 · Jun 10, 2026

Won vs. Covariance Shrinkage via Stochastic Interpolation

Paper 2 likely has higher impact: it targets a central, timely question in deep learning and neuroscience—scaling biologically plausible alternatives to backprop—while identifying a clear bottleneck (rank collapse) and demonstrating sizable gains on standard benchmarks and modern architectures. The proposed remedies (orthogonalised updates, activity normalization) are broadly applicable beyond feedback alignment, potentially influencing optimizer/normalization design and theory of gradient dynamics. Paper 1 is novel and rigorous, but its applications are narrower (covariance estimation) and empirical validation appears more limited, suggesting smaller cross-field uptake.

gpt-5.2 · Jun 10, 2026

Lost vs. Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

Paper 1 addresses a high-impact industrial problem (semiconductor manufacturing control) with a novel event-driven RL framework validated on industry-real scenarios, demonstrating significant practical gains in throughput and utilization. Its contributions span RL methodology, manufacturing systems, and complex adaptive systems. Paper 2 makes a solid contribution to understanding feedback alignment's limitations (rank collapse) and proposes remedies, but it addresses a more niche problem in biologically plausible learning that has struggled to gain practical traction. Paper 1's real-world applicability to a critical industry and methodological generality give it broader potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. $α$-PFN: Fast Entropy Search via In-Context Learning

Paper 2 introduces a novel amortization strategy for Bayesian optimization acquisition functions that achieves >50x speedups while maintaining competitive performance. It addresses a widely recognized computational bottleneck in BO with broad applicability across optimization, AutoML, and experimental design. Paper 1 makes a solid contribution to understanding feedback alignment's scaling limitations, but addresses a more niche problem (biologically plausible learning) with incremental improvements. Paper 2's practical speedups, methodological novelty (PFN-based amortization), and broader applicability give it higher potential impact.

claude-opus-4-6 · Jun 10, 2026

Lost vs. Can we trust our models? Epistemic calibration in second-order classification

Paper 1 addresses a foundational limitation in uncertainty estimation by proposing a novel theoretical framework and metric (EECE) for epistemic calibration. This has broad, immediate real-world applications in high-stakes AI deployments (e.g., healthcare, autonomous driving) where reliable uncertainty quantification is critical. Paper 2 is highly innovative but focuses on biologically plausible alternatives to backpropagation, a more niche subfield with less immediate practical applicability compared to the ubiquitous need for trustworthy uncertainty estimation across all ML domains.

gemini-3.1-pro-preview · Jun 10, 2026

Won vs. Geometrically Averaged Hard Target Updates for Linear Q-Learning

Paper 2 likely has higher impact: it tackles a central, timely problem—scaling biologically plausible alternatives to backprop in deep networks—and provides an empirically validated diagnosis (rank collapse) plus practical interventions that improve performance on standard benchmarks and architectures (e.g., ResNet-18 on CIFAR100). This has clear real-world relevance for training methods and neuroscience-inspired learning, and could influence multiple fields (deep learning optimization, computational neuroscience). Paper 1 is novel and rigorous but narrower (linear Q-learning, deterministic analysis) and less immediately applicable to modern deep RL practice.

gpt-5.2 · Jun 10, 2026
