Ayushman Trivedi, Bhavika Melwani
Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.
This paper investigates the geometric structure of recoverability in continual learning, building on the authors' prior "Accessibility Collapse" framework. The central contribution is the Stable Recovery Manifold (SRM) hypothesis: forgotten task knowledge persists in a compact, low-dimensional subspace of approximately 8 dimensions (out of 512) that does not expand as more tasks are learned. The paper introduces Recovery Subspace Dimensionality (k_t) as a metric, falsifies the authors' own "Recoverability Diffusion" hypothesis, and shows that a simple three-variable geometric model explains 82.2% of recoverability variance, with principal-angle drift as the dominant predictor (r = −0.862).
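To make the k_t definition concrete, here is a minimal sketch under stated assumptions. It takes the "singular directions" to be the top right-singular vectors of the centered feature matrix and uses a ridge least-squares probe; the paper's actual probe and its choice of directions are not described in this summary. All function names and the synthetic data are hypothetical.

```python
import numpy as np

def probe_accuracy(train_x, train_y, test_x, test_y, n_classes, ridge=1e-3):
    """Test accuracy of a ridge least-squares probe fitted on one-hot targets."""
    a = np.column_stack([train_x, np.ones(len(train_x))])  # append bias column
    targets = np.eye(n_classes)[train_y]
    w = np.linalg.solve(a.T @ a + ridge * np.eye(a.shape[1]), a.T @ targets)
    b = np.column_stack([test_x, np.ones(len(test_x))])
    return float(np.mean(np.argmax(b @ w, axis=1) == test_y))

def accuracy_by_rank(train_x, train_y, test_x, test_y, n_classes):
    """Probe accuracy after projecting onto the top-k singular directions, k = 1..rank."""
    mean = train_x.mean(axis=0)
    _, _, vt = np.linalg.svd(train_x - mean, full_matrices=False)
    full_acc = probe_accuracy(train_x, train_y, test_x, test_y, n_classes)
    accs = []
    for k in range(1, vt.shape[0] + 1):
        basis = vt[:k].T  # (d, k) matrix of the k leading directions
        accs.append(probe_accuracy((train_x - mean) @ basis, train_y,
                                   (test_x - mean) @ basis, test_y, n_classes))
    return accs, full_acc

def recovery_dimensionality(accs, full_acc, threshold=0.9):
    """Smallest k whose probe accuracy reaches threshold * full-probe accuracy."""
    for k, acc in enumerate(accs, start=1):
        if acc >= threshold * full_acc:
            return k
    return len(accs)

# Synthetic stand-in for penultimate-layer features (assumption: not the paper's data).
rng = np.random.default_rng(0)
n_classes, dim = 4, 32
class_means = rng.normal(scale=5.0, size=(n_classes, 3))

def make_split(n):
    labels = rng.integers(0, n_classes, size=n)
    x = rng.normal(scale=0.1, size=(n, dim))
    x[:, :3] += class_means[labels]  # class signal lives in 3 dimensions
    return x, labels

train_x, train_y = make_split(400)
test_x, test_y = make_split(200)
accs, full_acc = accuracy_by_rank(train_x, train_y, test_x, test_y, n_classes)

# Sweep the threshold to see how strongly k_t depends on the 90% cutoff.
k_by_threshold = {t: recovery_dimensionality(accs, full_acc, t)
                  for t in (0.85, 0.9, 0.95, 0.99)}
print(k_by_threshold)
```

Because the threshold is an explicit parameter, the same accuracy curve can be reused to report k_t at several cutoffs rather than at 90% alone.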
The conceptual reframing—forgetting as geometric misalignment rather than information destruction—is not entirely novel (Davari et al. 2022 established that probing accuracy exceeds task accuracy under forgetting), but the paper provides the most thorough geometric characterization of this phenomenon to date.
Strengths in design: The paper tests and falsifies its own hypothesis (Recoverability Diffusion), which is commendable scientific practice. The six-experiment structure is systematic, and the metrics (k_t, principal-angle drift, projection energy, CKA, participation ratio) are well-chosen and complementary.
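For readers unfamiliar with two of the metrics listed, the following is a minimal sketch of their standard textbook forms: linear CKA between two feature matrices and the participation ratio of a feature covariance spectrum. Whether the paper uses exactly these variants is an assumption, and the function names are illustrative.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between feature matrices of shape (n_samples, d1) and (n_samples, d2)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))

def participation_ratio(x):
    """(sum of eigenvalues)^2 / (sum of squared eigenvalues) of the feature covariance."""
    x = x - x.mean(axis=0)
    # Squared singular values are proportional to covariance eigenvalues;
    # the ratio is scale-invariant, so the constant factor drops out.
    eig = np.linalg.svd(x, compute_uv=False) ** 2
    return eig.sum() ** 2 / np.sum(eig ** 2)

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix

cka_rotated = linear_cka(feats, feats @ rotation)       # invariant to rotation
pr_rank_one = participation_ratio(np.outer(rng.normal(size=200), rng.normal(size=16)))
pr_isotropic = participation_ratio(feats)
print(cka_rotated, pr_rank_one, pr_isotropic)
```

Linear CKA is invariant to orthogonal rotations of the feature space, which is why it can remain high while principal-angle drift in a fixed basis grows; the two metrics answer different questions.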
1. Single architecture, single dataset, single run (implied). All results come from ResNet-18 on Split CIFAR-100 with 10 tasks. No error bars from multiple random seeds are reported. The k_t values fluctuating in {7, 8, 9} could easily be within noise of a single run. Without confidence intervals, the claim of "stability" at exactly 8 is not statistically grounded.
2. Small sample regression. The regression (R² = 0.82) is fitted on only 10 data points with 3 predictors, leaving just 6 residual degrees of freedom once the intercept is counted. A fit this underpowered is at high risk of overfitting, so the R² value says little about how well the model would generalize. No cross-validation or leave-one-out analysis is reported.
3. The 90% threshold for k_t is arbitrary. The choice of 90% of full probe performance as the cutoff directly determines k_t. A sensitivity analysis showing how k_t changes at 85%, 95%, or 99% thresholds would be essential to establish robustness.
4. Only Task 0 is tracked for recoverability. All recoverability measurements concern Task 0 only. This limits conclusions about whether the SRM generalizes across different forgotten tasks.
5. The "naive" training protocol (no regularization, no replay) is useful for studying raw forgetting but limits practical relevance. The interaction between the SRM and existing continual learning methods is only speculated upon.
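The small-sample concern in point 2 can be illustrated with a short sketch. It computes the adjusted R² implied by the reported figures (R² = 0.822, 10 points, 3 predictors) and then compares in-sample R² with a leave-one-out estimate on pure-noise synthetic data of the same shape. The synthetic data are an assumption for illustration only and say nothing about the paper's actual measurements.

```python
import numpy as np

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept fitted)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def r2_fit_and_loo(x, y):
    """In-sample R^2 and leave-one-out (PRESS-based) R^2 for ordinary least squares."""
    n = len(y)
    a = np.column_stack([np.ones(n), x])  # design matrix with intercept
    tss = np.sum((y - y.mean()) ** 2)
    beta, *_ = np.linalg.lstsq(a, y, rcond=None)
    rss = np.sum((y - a @ beta) ** 2)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i  # refit without observation i
        b, *_ = np.linalg.lstsq(a[keep], y[keep], rcond=None)
        press += (y[i] - a[i] @ b) ** 2
    return 1.0 - rss / tss, 1.0 - press / tss

# Adjusted R^2 from the figures quoted in the review: 1 - 0.178 * 9 / 6 = 0.733.
adj = adjusted_r2(0.822, n=10, p=3)

# Ten points, three predictors, and a target unrelated to them (pure noise).
rng = np.random.default_rng(0)
x_noise = rng.normal(size=(10, 3))
y_noise = rng.normal(size=10)
r2_fit, r2_loo = r2_fit_and_loo(x_noise, y_noise)
print(f"adjusted R^2 = {adj:.3f}, noise fit R^2 = {r2_fit:.3f}, LOO R^2 = {r2_loo:.3f}")
```

The leave-one-out value can never exceed the in-sample value for ordinary least squares, and with this few points the gap is often large, which is why reporting it alongside R² would strengthen the regression claim.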
The paper's strongest potential impact lies in two areas:
Theoretical reframing: If the geometric accessibility interpretation holds broadly, it could redirect continual learning research toward orientation-preserving methods rather than information-preserving ones. The proposed "Manifold Anchor Regulariser" (penalizing rotation of 8 singular directions while allowing plasticity in remaining 504 dimensions) is an elegant and testable idea.
Practical efficiency: The claim that storing only 8 × 512 = 4,096 parameters per task could maintain recoverability is provocative and connects well to subspace methods like GPM. If validated, this could substantially reduce overhead in continual learning systems.
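The storage figure follows from simple arithmetic. The byte count below assumes 32-bit floats, and the ten-task total assumes one 8-direction basis is stored for each Split CIFAR-100 task; neither assumption is stated in the paper summary.

```latex
\begin{aligned}
k \times d &= 8 \times 512 = 4096 \ \text{parameters per task} \\
4096 \times 4 \ \text{bytes} &= 16384 \ \text{bytes} = 16 \ \text{KiB per task (float32, assumed)} \\
10 \times 4096 &= 40960 \ \text{parameters for ten tasks}
\end{aligned}
```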
However, the impact is heavily contingent on generalization beyond the single experimental setting. The authors acknowledge this but provide no evidence toward it.
Continual learning remains highly relevant, and the geometric/mechanistic understanding of forgetting is an active area. The paper connects to emerging interest in understanding neural network representations geometrically (task vectors, mode connectivity, loss landscape geometry). The timing is appropriate, though the narrow experimental scope limits immediate applicability.
1. Clean experimental narrative: The hypothesis-falsification-revision structure is well-executed and intellectually honest.
2. Novel metric (k_t): Recovery Subspace Dimensionality is a simple, interpretable, and potentially useful measure.
3. Depth stratification finding: The observation that early layers become more distributed while late layers concentrate is mechanistically informative and connects nicely to the layer-wise retention hierarchy.
4. Strong conceptual clarity: The paper is well-written, with clear notation and logical progression.
5. The principal-angle drift correlation (r = −0.862) is a compelling finding that provides a geometric "clock" for forgetting.
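The principal-angle drift in point 5 can be made concrete with a minimal sketch. Principal angles between two subspaces are obtained from the singular values of the product of their orthonormal bases. Summarizing drift as the mean principal angle is an assumption here, since the paper's exact drift definition is not given in this review; the function names are illustrative.

```python
import numpy as np

def principal_angles(a, b):
    """Principal angles (radians, ascending) between the column spaces of a and b."""
    qa, _ = np.linalg.qr(np.asarray(a, dtype=float))
    qb, _ = np.linalg.qr(np.asarray(b, dtype=float))
    cosines = np.linalg.svd(qa.T @ qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))  # clip guards against round-off

def mean_drift(a, b):
    """Assumed drift summary: mean principal angle between two subspaces."""
    return float(np.mean(principal_angles(a, b)))

# A one-dimensional subspace rotated by a known angle in the (e1, e2) plane.
theta = 0.3
before = np.array([[1.0], [0.0], [0.0]])
after = np.array([[np.cos(theta)], [np.sin(theta)], [0.0]])
rotated_angle = principal_angles(before, after)[0]

# Two mutually orthogonal planes in R^4: every principal angle is pi / 2.
identity = np.eye(4)
orthogonal_drift = mean_drift(identity[:, :2], identity[:, 2:])
print(rotated_angle, orthogonal_drift)
```

In a setting like the paper's, `a` and `b` would be the 512 × 8 bases of the Task 0 recovery subspace before and after training on a later task, and the resulting drift values could then be correlated with recoverability.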
1. Extreme narrowness of empirical validation. One architecture, one dataset, one split, apparently one seed. Claims about "geometric principles" are far stronger than what the evidence supports.
2. Insufficient statistical power. The regression with 10 points and 3 predictors, and the k_t stability claim without confidence intervals, are the paper's most serious methodological weaknesses.
3. No comparison to existing geometric analyses. How does this relate to Ramasesh et al.'s findings on layer susceptibility in transformers, or to Fort and Ganguli's loss landscape geometry?
4. The "approximately 8" claim lacks theoretical grounding. Why 8? Is it related to the number of classes per task (10)? To the architecture? No explanation is offered.
5. Reference [1] is the authors' own unpublished arXiv preprint from 2026, making this a two-paper sequence whose foundational result has not yet been peer-reviewed.
6. The proposed regularizer (Section IX-A) is purely speculative with no preliminary results, reducing the paper's practical contribution.
The paper's framing as extending an arXiv preprint from 2026 raises questions about the maturity of this research program. The k_t ≈ 8 result, while intriguing, requires substantial additional validation before it can be considered a reliable finding. The depth-stratification result (Experiment 6) may actually be the most robust and novel finding, as it relies less on arbitrary thresholds and small-sample statistics.
The paper would benefit enormously from: (1) multiple random seeds with error bars, (2) at least one additional architecture, (3) threshold sensitivity analysis for k_t, and (4) preliminary results on the proposed manifold anchor regularizer.
Generated Jun 12, 2026
Paper 1 has higher estimated impact: it introduces a highly practical, low-code-change position-independent caching design for vLLM with clear, immediate real-world benefits (large throughput/TTFT gains) and broad relevance to rapidly growing RAG/agentic inference workloads. The approach is novel in its minimalist integration (unrotated K + in-attention RoPE + user primitives) and is likely to be adopted by industry/OSS stacks, amplifying impact. Paper 2 offers interesting geometric analysis and metrics for continual learning, but is primarily explanatory on limited benchmarks and less directly enabling.
Paper 2 addresses catastrophic forgetting, a fundamental and pervasive challenge in continual learning, with broad real-world implications. By demonstrating that forgotten knowledge is not destroyed but merely misaligned in a stable recovery manifold, it offers a major conceptual shift that could inspire novel algorithms. While Paper 1 provides rigorous theoretical insights into grokking, Paper 2's findings are likely to have a wider, more immediate impact across applied and theoretical AI research by reframing how we approach sequential learning.
Paper 1 addresses catastrophic forgetting, a critical bottleneck in artificial intelligence. By reframing forgetting as an accessibility issue rather than information destruction, it provides a conceptual breakthrough that could significantly advance lifelong learning in neural networks. While Paper 2 presents a rigorous and broadly applicable mathematical tool for continuous-time event data, Paper 1's insights into deep learning geometry have a higher potential to drive immediate, transformative impact in the rapidly moving field of AI.
Paper 2 has higher estimated impact due to a more broadly applicable and methodologically rigorous contribution: a novel formulation of matrix completion with distribution-valued entries, a principled low-rank notion (Tucker rank) in RKHS via functional unfoldings, and non-asymptotic error bounds plus experiments and a real application. This advances statistical learning theory and practical imputation for uncertainty-aware data across domains (recommenders, healthcare, sensing). Paper 1 is insightful for continual learning geometry but appears more empirical/specific (Split CIFAR-100, ResNet-18) with narrower immediate applicability.
Paper 2 has higher potential impact due to its more novel, cross-disciplinary framing (developmental emergence of agency/self-models from prediction), clearer real-world relevance to autonomous agents and robotics, and broader conceptual reach across ML, cognitive science, and neuroscience. Its 40-experiment developmental sequence with ordered necessary conditions, falsified hypotheses, and ablations suggests stronger methodological rigor and theory-building. Paper 1 is timely and solid but more incremental within continual learning, focused on a specific geometric characterization on limited benchmarks, with narrower downstream applicability.
Paper 2 addresses a fundamental and pervasive challenge in artificial intelligence—catastrophic forgetting in continual learning. By proposing a conceptual shift from 'information destruction' to an 'accessibility and manifold-alignment problem' backed by geometric evidence, it has the potential to reshape theoretical understanding and algorithmic design across deep learning. In contrast, Paper 1 offers a valuable but more incremental architectural improvement (adaptive gating) constrained to the specific subfield of neural operators for PDE solving, yielding narrower overall scientific impact.
Paper 2 introduces a novel geometric framework (Stable Recovery Manifold hypothesis) that fundamentally recharacterizes catastrophic forgetting as an accessibility/alignment problem rather than information destruction. This conceptual reframing has broad implications across continual learning, neuroscience-inspired AI, and lifelong learning systems. The clean theoretical insight (k_t stability, principal-angle drift predicting recoverability) is elegant and generalizable. Paper 1, while technically thorough with its factorial evaluation of conformal adaptation for safety classifiers, addresses a more narrowly scoped engineering problem and reveals significant limitations (ESS collapse for most classifiers), reducing its practical impact.
Paper 1 fundamentally challenges the traditional view of catastrophic forgetting by demonstrating that forgotten knowledge is not destroyed but remains geometrically recoverable. This paradigm-shifting theoretical insight has the potential to broadly impact how neural network representations and continual learning are understood and developed, offering deeper long-term scientific value compared to the highly practical but more narrowly focused LLM evaluation method in Paper 2.
Paper 2 introduces a novel geometric framework (Stable Recovery Manifold hypothesis) that fundamentally reframes catastrophic forgetting as an accessibility/alignment problem rather than information destruction. This conceptual shift has broader implications for continual learning, a major challenge in AI. The introduction of Recovery Subspace Dimensionality as a quantitative measure and the strong predictive relationships discovered (r=-0.862) provide actionable insights. Paper 1, while practically useful, primarily confirms that simple noise injection suffices—a somewhat incremental finding that narrows rather than expands understanding. Paper 2's theoretical contribution has greater potential to redirect research in continual learning.
Paper 2 offers a highly scalable, linear-time algorithm that solves a pervasive real-world problem (simultaneous forecasting of interacting systems) with a massive 10-70x speedup. Its demonstrated applicability across diverse fields like economics and epidemiology highlights immediate, cross-disciplinary impact. While Paper 1 provides valuable theoretical insights into continual learning, Paper 2's combination of algorithmic innovation, exceptional efficiency gains, and broad real-world utility gives it a higher potential for widespread scientific and societal impact.