The Geometry of Forgetting: Temporal Knowledge Drift as an Independent Axis in LLM Representations

Rania Elbadry, Ahmed Heakl, Fan Zhang, Dani Bouch, Yuxia Wang, Preslav Nakov, Zhuohan Xie

#70 of 2292 · Artificial Intelligence
Tournament Score: 1555±47
Win Rate: 94% (17 wins, 1 loss, 18 matches)
Rating: 7.8/10

Abstract

Large language models confidently produce outdated answers, and no existing method can detect them. We show this is not an engineering failure but a structural one: temporal drift, whether a stored fact has changed since training, is encoded as a direction in the residual stream geometrically orthogonal to both correctness and uncertainty. Any method operating on correctness or uncertainty signals is therefore blind to drift by construction. We verify this across six instruction-tuned models. A linear probe trained directly on drift labels achieves AUROC 0.83–0.95; methods based on token entropy, semantic entropy, CCS, and SAPLMA all remain near chance (0.49–0.57). Five tests confirm the geometric orthogonality: weight cosines (|cos| ≤ 0.14), score correlations (|r| ≤ 0.20), bidirectional null-space projection (|Δ| ≤ 0.008), iterative null-space projection with k = 10, and difference-of-means dissociation. Mechanistically, the MLP retrieval circuit produces identical dynamics for stale recall and confabulation (r > 0.81, six models), explaining why output confidence cannot separate them. A cross-cutoff experiment holds inputs constant and varies only the model: the probe fires on the model whose training predates the fact's transition and stays silent otherwise (P(A>B) = 0.975–0.998, twelve model pairs), confirming it reads model-internal knowledge state rather than input properties. Our code and datasets will be publicly released.
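As a rough illustration of what "a linear probe trained directly on drift labels" involves, the sketch below fits a logistic-regression probe on hidden-state vectors and scores it by AUROC on held-out data. It is a minimal sketch under stated assumptions, not the authors' code: the hidden states are synthetic (drifted facts are simply shifted along one hypothetical coordinate) and all helper names are illustrative. In practice the vectors would be residual-stream activations extracted at a chosen layer.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def train_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe (w, b) by full-batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, label in zip(X, y):
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - label
            for i in range(dim):
                grad_w[i] += err * h[i]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def synthetic_states(n, dim=8, shift=1.5, seed=0):
    """Hypothetical hidden states: drifted facts (label 1) are shifted along coordinate 0."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += shift * label
        X.append(h)
        y.append(label)
    return X, y


X_train, y_train = synthetic_states(400, seed=1)
X_test, y_test = synthetic_states(200, seed=2)
w, b = train_linear_probe(X_train, y_train)
scores = [sum(wi * hi for wi, hi in zip(w, h)) + b for h in X_test]
print(f"held-out drift AUROC: {auroc(scores, y_test):.2f}")
```

The toy AUROC says nothing about the paper's 0.83–0.95; it only shows the mechanics of fitting on one split and scoring on another.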

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper identifies temporal knowledge drift — whether a fact stored in an LLM has become outdated since training — as a geometrically independent representational axis in the residual stream, orthogonal to both correctness and uncertainty. The central claim is that this orthogonality is why existing hallucination detection methods (entropy-based, semantic entropy, CCS, SAPLMA) fundamentally cannot detect stale outputs: they operate in a subspace where drift has near-zero variance. A linear probe trained directly on drift labels achieves AUROC 0.83–0.95, while all baselines remain at 0.49–0.57.
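The "blind by construction" argument can be checked on a toy example. Assume a drift direction d and a correctness/uncertainty direction c that are orthogonal, with drift and correctness labels drawn independently; then any score that reads only along c carries no information about drift. Everything here (dimensions, directions, shift sizes) is hypothetical, and note that orthogonal directions alone would not suffice if the two label sets were themselves correlated.

```python
import random


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Hypothetical unit directions in a 6-d "residual stream"; d and c are orthogonal.
d = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # drift direction (assumed)
c = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # correctness/uncertainty direction (assumed)

rng = random.Random(0)
states, drift, correct = [], [], []
for _ in range(600):
    y_d, y_c = rng.randint(0, 1), rng.randint(0, 1)  # independent toy labels
    h = [rng.gauss(0.0, 1.0) + 2.0 * y_d * di + 2.0 * y_c * ci for di, ci in zip(d, c)]
    states.append(h)
    drift.append(y_d)
    correct.append(y_c)

score_c = [dot(h, c) for h in states]  # what a correctness/uncertainty detector reads
score_d = [dot(h, d) for h in states]  # what a drift probe reads
print(f"correctness score -> correctness AUROC: {auroc(score_c, correct):.2f}")
print(f"correctness score -> drift AUROC:       {auroc(score_c, drift):.2f}")
print(f"drift score       -> drift AUROC:       {auroc(score_d, drift):.2f}")
```

With these settings the score along c should separate correct from incorrect states well yet sit near 0.5 on drift, while the score along d separates drifted facts.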

This reframes temporal staleness from an engineering failure to a structural-geometric property of LLM representations, which is a conceptually significant shift. The paper constructs a five-cell taxonomy distinguishing STALE-RECALL from CONFABULATION — two failure modes that existing hallucination taxonomies conflate — and shows they are linearly separable in hidden states despite being indistinguishable at output level.

Methodological Rigor

The methodology is notably thorough and multi-layered:

Confound control. The paper identifies and addresses the calendar-token confound (drifted facts concentrate in post-cutoff years) by restricting evaluation to post-cutoff queries only. This controlled protocol lowers AUROC by 0.07 on average, which bounds the confound's contribution, and shifts the peak layer in 5 of 6 models, indicating that the confound also affects where in the network the drift signal is strongest.
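A sketch of the post-cutoff control, assuming each query record carries the year mentioned in the query, a binary drift label, and a probe score. The toy data deliberately includes a hypothetical calendar leak (the score rises for post-cutoff years), so the all-query AUROC comes out inflated relative to the post-cutoff-only AUROC. The cutoff year, rates, and effect sizes are invented for illustration and are unrelated to the paper's 0.07 average drop.

```python
import random


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def controlled_auroc(records, cutoff_year):
    """AUROC on all queries vs. only post-cutoff queries (calendar-token control)."""
    post = [r for r in records if r["year"] > cutoff_year]
    full = auroc([r["score"] for r in records], [r["drifted"] for r in records])
    controlled = auroc([r["score"] for r in post], [r["drifted"] for r in post])
    return full, controlled


# Hypothetical toy data in which the probe score partly "reads the calendar".
CUTOFF = 2023  # assumed training-cutoff year
rng = random.Random(0)
records = []
for _ in range(800):
    year = rng.randint(2018, 2025)
    post_cutoff = year > CUTOFF
    # Drifted facts concentrate in post-cutoff years, as the confound describes.
    drifted = int(rng.random() < (0.7 if post_cutoff else 0.05))
    # Score = genuine drift signal + a leak from calendar tokens + noise.
    score = 1.0 * drifted + 2.0 * post_cutoff + rng.gauss(0.0, 1.0)
    records.append({"year": year, "drifted": drifted, "score": score})

full, controlled = controlled_auroc(records, CUTOFF)
print(f"all queries: {full:.2f}   post-cutoff only: {controlled:.2f}")
```

Within the post-cutoff subset the leak term is the same for every query, so it can no longer separate drifted from non-drifted facts; only the genuine signal remains.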

Five independence tests. The convergence of weight cosines (|cos| ≤ 0.14), score correlations (|r| ≤ 0.20), bidirectional null-space projection (|Δ| ≤ 0.008), INLP (iterative null-space projection) with k = 10, and difference-of-means dissociation is methodologically strong. Each test covers a failure mode that the others could miss (e.g., sparsity artifacts, structured correlations in the data, multi-dimensional confounds). The random-direction baselines for null-space projection add calibration.
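Two of the five tests, the weight cosine and the bidirectional null-space projection, can be sketched in a few lines. This sketch swaps in difference-of-means directions as the probes (the paper's trained probes may differ), uses synthetic states with independent labels, and projects out a single direction rather than running the iterative INLP procedure.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_diff_direction(X, y):
    """Difference-of-means direction: mean(h | y=1) - mean(h | y=0)."""
    pos = [h for h, t in zip(X, y) if t == 1]
    neg = [h for h, t in zip(X, y) if t == 0]
    return [sum(h[i] for h in pos) / len(pos) - sum(h[i] for h in neg) / len(neg)
            for i in range(len(X[0]))]


def project_out(X, u):
    """Map each state onto the null space of u: h - (h.u / u.u) u."""
    uu = dot(u, u)
    return [[hi - dot(h, u) / uu * ui for hi, ui in zip(h, u)] for h in X]


def probe_auroc(X, y, train, test):
    """Fit a difference-of-means probe on `train`, report AUROC on `test`."""
    w = mean_diff_direction(X[train], y[train])
    return auroc([dot(w, h) for h in X[test]], y[test])


# Hypothetical toy states: drift on axis 0, correctness on axis 1, independent labels.
rng = random.Random(0)
X, drift, correct = [], [], []
for _ in range(600):
    yd, yc = rng.randint(0, 1), rng.randint(0, 1)
    h = [rng.gauss(0.0, 1.0) for _ in range(6)]
    h[0] += 1.5 * yd
    h[1] += 1.5 * yc
    X.append(h)
    drift.append(yd)
    correct.append(yc)
train, test = slice(0, 400), slice(400, 600)

# Weight cosine between the two probe directions.
w_drift = mean_diff_direction(X[train], drift[train])
w_corr = mean_diff_direction(X[train], correct[train])
cos_wd_wc = cosine(w_drift, w_corr)

# Bidirectional null-space projection: remove one direction, refit the other probe.
delta_drift = (probe_auroc(project_out(X, w_corr), drift, train, test)
               - probe_auroc(X, drift, train, test))
delta_corr = (probe_auroc(project_out(X, w_drift), correct, train, test)
              - probe_auroc(X, correct, train, test))
print(f"|cos| = {abs(cos_wd_wc):.3f}, dAUROC(drift) = {delta_drift:+.3f}, "
      f"dAUROC(correct) = {delta_corr:+.3f}")
```

Adding a control that projects out random directions, as the paper does, would show how large a Δ to expect by chance alone.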

Cross-cutoff experiment. This is perhaps the strongest methodological contribution: holding inputs byte-identical while varying only the model (whose training cutoff determines whether a fact has drifted) yields P(A>B) = 0.975–0.998 across twelve model pairs. This definitively rules out input-level confounds (entity familiarity, query year, relation type) and confirms the probe reads model-internal knowledge state.
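One way to compute P(A>B), assuming it is the paired probability that, on the same input, the drift-probe score from model A (trained before the fact's transition) exceeds the score from model B. The paper may define it differently, for example as an unpaired Mann-Whitney statistic over all score pairs; the scores below are made up.

```python
def prob_a_greater(scores_a, scores_b):
    """Paired P(A > B) over byte-identical inputs; ties count 1/2.

    Assumes scores_a[i] and scores_b[i] are probe scores for the same input,
    from model A (cutoff before the transition) and model B (cutoff after).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need non-empty scores paired over the same inputs")
    wins = sum((a > b) + 0.5 * (a == b) for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)


# Hypothetical probe scores on four identical queries.
model_a = [0.91, 0.78, 0.66, 0.40]
model_b = [0.12, 0.78, 0.30, 0.55]
print(prob_a_greater(model_a, model_b))  # → 0.625
```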

Mechanistic analysis. The DLA trajectory correlation (r > 0.81) between STALE-RECALL and CONFABULATION provides a compelling mechanistic explanation for why output confidence cannot separate them. The causal steering experiments, which show that the drift direction is "latent but activatable" (ablating it changes nothing, while amplifying it redistributes logits in a structured way among specific competing holders), are a nuanced and informative finding.
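A sketch of comparing direct logit attribution (DLA) trajectories across the two failure modes, assuming per-layer MLP outputs at the final token position and the answer token's unembedding vector are available. Here the correlation is taken between mean trajectories, which is only one plausible reading of the reported r > 0.81; the final LayerNorm is ignored for simplicity, and all arrays are toy values.

```python
import math


def dla_trajectory(mlp_outputs, unembed_col):
    """Per-layer direct logit attribution of MLP outputs to one answer token.

    mlp_outputs: one vector per layer (MLP output at the final token position).
    unembed_col: unembedding vector for the answer token.
    Simplification: the final LayerNorm scale is ignored (or assumed folded in).
    """
    return [sum(m * u for m, u in zip(out, unembed_col)) for out in mlp_outputs]


def mean_trajectory(trajectories):
    n_layers = len(trajectories[0])
    return [sum(t[layer] for t in trajectories) / len(trajectories)
            for layer in range(n_layers)]


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy data: 2 examples per category, 4 layers, 3-d residual stream.
u_answer = [1.0, 0.0, 0.5]
stale_recall = [
    [[0.1, 0.0, 0.0], [0.2, 0.1, 0.2], [0.9, 0.0, 0.4], [1.5, 0.2, 0.6]],
    [[0.0, 0.3, 0.0], [0.3, 0.0, 0.0], [1.1, 0.1, 0.2], [1.2, 0.0, 1.0]],
]
confabulation = [
    [[0.0, 0.1, 0.0], [0.1, 0.0, 0.2], [0.8, 0.3, 0.2], [1.4, 0.0, 0.4]],
    [[0.2, 0.0, 0.0], [0.2, 0.2, 0.0], [1.0, 0.0, 0.0], [1.3, 0.1, 0.6]],
]
stale_mean = mean_trajectory([dla_trajectory(ex, u_answer) for ex in stale_recall])
confab_mean = mean_trajectory([dla_trajectory(ex, u_answer) for ex in confabulation])
print(f"trajectory correlation r = {pearson(stale_mean, confab_mean):.3f}")
```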

Weaknesses in rigor. The entity-disjoint split is appropriate, but the dataset covers only four Wikidata relations, which limits diversity. Bootstrap CIs for some models (Qwen-2.5, Gemma) are wide, reflecting limited post-cutoff data. The 2–9B parameter range leaves scaling behavior unknown.
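The two evaluation ingredients mentioned here, an entity-disjoint split and bootstrap confidence intervals for AUROC, can be sketched as follows. The record fields (`entity`, `drifted`, `score`), the test fraction, and the number of resamples are assumptions; the toy comparison only shows why a smaller evaluation set, such as scarce post-cutoff data, yields a wider interval.

```python
import random


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def entity_disjoint_split(records, test_frac=0.3, seed=0):
    """Assign whole entities to train or test so that no entity appears in both."""
    entities = sorted({r["entity"] for r in records})
    random.Random(seed).shuffle(entities)
    test_entities = set(entities[: max(1, int(len(entities) * test_frac))])
    train = [r for r in records if r["entity"] not in test_entities]
    test = [r for r in records if r["entity"] in test_entities]
    return train, test


def bootstrap_auroc_ci(scores, labels, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC; resamples containing one class are redrawn."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to compute AUROC")
    rng = random.Random(seed)
    n, stats = len(scores), []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:
            stats.append(auroc([scores[i] for i in idx], ys))
    stats.sort()
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


def toy_scores(n, seed):
    """Hypothetical probe scores: alternating labels, score = label + Gaussian noise."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n)]
    return [y + rng.gauss(0.0, 1.0) for y in labels], labels


# Toy records: 10 hypothetical entities with 5 queries each.
scores, labels = toy_scores(50, seed=1)
records = [{"entity": f"e{i // 5}", "drifted": y, "score": s}
           for i, (s, y) in enumerate(zip(scores, labels))]
train, test = entity_disjoint_split(records)

# Smaller evaluation sets give wider intervals.
small = bootstrap_auroc_ci(*toy_scores(30, seed=2))
large = bootstrap_auroc_ci(*toy_scores(200, seed=3))
print(f"n=30: [{small[0]:.2f}, {small[1]:.2f}]   n=200: [{large[0]:.2f}, {large[1]:.2f}]")
```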

Potential Impact

Practical deployment. The drift probe could serve as a lightweight filter in production LLM systems, flagging outputs likely to be stale before expensive retrieval-augmented generation is invoked. The paper notes this explicitly: triggering retrieval on an internal drift signal rather than output uncertainty could reduce retrieval cost while maintaining accuracy.
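A minimal sketch of probe-gated retrieval, assuming a hidden state for the query has already been extracted at the probe's layer. `generate` and `retrieve_and_generate` are hypothetical stand-ins for a plain model call and a RAG pipeline, the probe weights are invented, and the 0.5 threshold is arbitrary; in deployment it would be tuned against the cost of retrieval and the tolerance for stale answers.

```python
import math
from typing import Callable, Sequence, Tuple


def drift_probability(hidden: Sequence[float], w: Sequence[float], b: float) -> float:
    """Numerically stable sigmoid of a linear drift probe applied to one hidden state."""
    z = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_answer(
    query: str,
    hidden: Sequence[float],
    w: Sequence[float],
    b: float,
    generate: Callable[[str], str],
    retrieve_and_generate: Callable[[str], str],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Call the expensive retrieval path only when the probe flags likely drift.

    Returns (answer, used_retrieval).
    """
    if drift_probability(hidden, w, b) >= threshold:
        return retrieve_and_generate(query), True
    return generate(query), False


# Usage with stand-in callables and a hypothetical 2-d probe.
w, b = [2.0, 0.0], -1.0
print(gated_answer("Who chairs X?", [1.5, 0.3], w, b,
                   generate=lambda q: "parametric answer",
                   retrieve_and_generate=lambda q: "retrieved answer"))
# → ('retrieved answer', True)
```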

Hallucination detection. The finding that temporal drift is orthogonal to correctness/uncertainty challenges the assumption that a single hallucination detector can catch all failure modes. This has implications for the design of multi-axis detection systems.

Mechanistic interpretability. The discovery of a third independent representational axis (beyond correctness and uncertainty) extends the linear representation hypothesis and the "geometry of truth" line of work. The latent-but-activatable characterization adds a new category to the taxonomy of representational features.

Knowledge editing. Understanding that drift is encoded as a specific direction could inform targeted knowledge editing approaches (ROME, MEMIT), potentially enabling more precise updates.

Timeliness & Relevance

This paper addresses a critical and timely problem. As LLMs are deployed in rapidly-changing domains (politics, sports, corporate governance), confident production of outdated information is a major reliability concern. The finding that existing detection methods are *structurally* blind to this failure mode — not just insufficiently calibrated — is both alarming and actionable. The paper also connects to the growing mechanistic interpretability community's interest in linear representations.

Strengths

1. Exceptional experimental design: The cross-cutoff experiment is a clean causal identification strategy rarely seen in interpretability work.

2. Convergent evidence: Five independence tests, mechanistic analysis, and cross-model validation create a robust evidentiary structure.

3. Conceptual clarity: The geometric framing is intuitive and the implications are clearly stated.

4. Practical relevance: The drift probe is a deployable artifact, not just a theoretical observation.

5. Comprehensive baselines: Ten output-based and representation-based methods are evaluated, and all of them fail to detect drift, staying near chance.

Limitations

1. Four relations only: Head of government, head coach, chair, and owned_by are all discrete leadership/ownership roles; generalization to continuous quantities, scientific facts, or gradual shifts is unknown.

2. Scale: 2–9B parameters only; whether orthogonality holds at frontier scale (70B+) is open.

3. Supervised probe: Requires drift-labeled training data, limiting out-of-the-box deployment; the paper acknowledges unsupervised variants as future work.

4. Linear assumption: The independence tests all assume linear structure; nonlinear entanglement could exist but remain undetected.

5. Wikidata dependency: The ground truth relies entirely on Wikidata's accuracy and completeness.
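Limitation 4 can be made concrete with a toy case. If drift depended on a correctness score only through its magnitude (a hypothetical nonlinear relationship chosen purely for illustration), both a linear correlation and a linear readout would look like independence, while a simple nonlinear feature would recover drift perfectly.

```python
import math
import random


def auroc(scores, labels):
    """AUROC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy: drift depends on a correctness score only through its magnitude.
rng = random.Random(0)
c_scores = [rng.gauss(0.0, 1.0) for _ in range(400)]
drift = [int(abs(c) > 1.0) for c in c_scores]

print(f"Pearson r(score, drift): {pearson(c_scores, drift):+.2f}")
print(f"AUROC, linear readout:   {auroc(c_scores, drift):.2f}")
print(f"AUROC, |score| readout:  {auroc([abs(c) for c in c_scores], drift):.2f}")
```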

Overall Assessment

This is a well-executed paper with a clear, important finding supported by unusually thorough methodology. The geometric independence claim is supported by five complementary tests, a strong causal experiment, and mechanistic analysis across six models and four architecture families. The practical implications for deployed systems and the conceptual contribution to understanding LLM representations are both significant. The main limitations — narrow domain coverage and small model scale — are acknowledged and do not undermine the core claims within the stated scope.

Rating: 7.8/10
Significance 8 · Rigor 8.5 · Novelty 8 · Clarity 8.5

Generated May 12, 2026

Comparison History (18)

vs. Imperfect World Models are Exploitable
gpt-5.2 · 5/19/2026

Paper 1 offers a concrete, empirically supported mechanistic claim about LLMs: temporal knowledge drift is encoded along a representation axis orthogonal to correctness/uncertainty, explaining why common uncertainty-based detectors fail. It demonstrates strong cross-model generalization, multiple orthogonality tests, and provides a practical detection tool (high AUROC) with immediate applications to reliability, evaluation, and safety. Its impact likely spans interpretability, model auditing, and deployment practices. Paper 2 is theoretically valuable in RL safety, but appears more abstract and may have slower, narrower uptake absent direct empirical demonstrations or tooling.

vs. SGA-MCTS: Decoupling Planning from Execution via Training-Free Atomic Experience Retrieval
gemini-3.1 · 5/16/2026

Paper 2 uncovers a fundamental structural property of LLM representations—that temporal drift is geometrically orthogonal to correctness and uncertainty. This discovery challenges existing assumptions about hallucination detection and offers deep mechanistic insights, paving the way for fundamentally new approaches to model interpretability and knowledge editing. While Paper 1 offers a strong, practical engineering solution for LLM planning efficiency, Paper 2 provides a novel theoretical framework with broader, foundational implications for our understanding of AI systems.

vs. A Foundation Model for Zero-Shot Logical Rule Induction
claude-opus-4.6 · 5/16/2026

Paper 1 discovers a fundamental geometric property of LLM representations—that temporal knowledge drift is encoded orthogonally to correctness and uncertainty—which explains a critical failure mode affecting all deployed LLMs. This finding has immediate implications for AI safety, hallucination detection, and knowledge maintenance, with rigorous verification across multiple models and methods. Paper 2 presents a solid contribution to ILP with a zero-shot foundation model approach, but addresses a narrower community. Paper 1's insight that existing uncertainty methods are structurally blind to staleness is both surprising and broadly consequential.

vs. Geometric Metrics for MoE Specialization: From Fisher Information to Early Failure Detection
gpt-5.2 · 5/16/2026

Paper 2 has higher likely impact: it introduces a theoretically grounded, parameterization-invariant information-geometric framework (Fisher metric) with formal theorems and broadly applicable metrics for MoE models, a rapidly growing architecture class. It demonstrates strong empirical correlations and practical early-failure detection with substantial compute savings, plus actionable interventions across NLP and vision and multiple scales—supporting real-world deployment and cross-field uptake. Paper 1 is novel and timely for LLM temporal drift, but its applications are narrower and rely on supervised drift labels/probes, potentially limiting generality.

vs. IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling
gpt-5.2 · 5/16/2026

Paper 2 has higher potential scientific impact due to a novel, general claim about LLM representation geometry: temporal knowledge drift is an independent axis orthogonal to correctness/uncertainty, implying broad limitations of existing detection approaches. It offers mechanistic evidence across multiple models with diverse, rigorous tests and a strong probe baseline, and is timely given widespread reliance on LLMs for factual answers. Its implications span interpretability, calibration, auditing, safety, and continual learning. Paper 1 is strong and application-ready, but its impact is more domain-specific (sensor scheduling/IoT) and less likely to generalize across ML subfields.

vs. Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
gpt-5.2 · 5/16/2026

Paper 1 is more conceptually novel: it proposes a mechanistic/geometric claim (temporal drift as an independent, orthogonal representation axis) with multiple corroborating tests across models and strong predictive probes, implying a fundamental limitation of uncertainty/correctness-based detectors. This reframes “outdated answers” as an intrinsic representational property, likely influencing interpretability, model auditing, and time-aware alignment. Paper 2 is valuable and timely as a benchmark/desiderata contribution, but benchmark papers often have narrower long-term impact unless they become dominant standards; its core methodological novelty is more incremental.

vs. Towards Conversational Medical AI with Eyes, Ears and a Voice
gemini-3.1 · 5/16/2026

Paper 1 provides a foundational, mechanistic insight into how LLMs encode outdated information, revealing a fundamental geometric structure in their representations. This deep theoretical and empirical contribution has broad implications for AI interpretability, safety, and continual learning across all domains. While Paper 2 presents an impressive applied system with high real-world potential in healthcare, Paper 1's methodological rigor and discovery of a fundamental structural property of LLMs offer a wider, more paradigm-shifting scientific impact.

vs. Distribution-Aware Algorithm Design with LLM Agents
claude-opus-4.6 · 5/16/2026

Paper 1 identifies a fundamentally new geometric structure in LLM representations—temporal knowledge drift as an independent axis orthogonal to correctness and uncertainty—which is a profound conceptual discovery. It explains why all existing hallucination/uncertainty detection methods fail at detecting outdated knowledge, opening an entirely new research direction. The rigorous geometric verification across multiple models and the mechanistic insight (stale recall vs confabulation indistinguishability) have broad implications for AI safety, trustworthiness, and deployment. Paper 2 is strong applied work but more incremental in combining LLMs with algorithm design. Paper 1's foundational insight will likely reshape how the field thinks about knowledge representation in LLMs.

vs. Agentic Systems as Boosting Weak Reasoning Models
claude-opus-4.6 · 5/16/2026

Paper 2 identifies a fundamentally new geometric structure in LLM representations—temporal knowledge drift as an independent axis orthogonal to correctness and uncertainty. This is a deeper scientific discovery with broad implications: it explains why existing hallucination/uncertainty methods fail for outdated facts, opens a new research direction in mechanistic interpretability, and has immediate practical applications for detecting stale knowledge. Paper 1, while strong engineering work showing weak models can match strong ones via committee search, is more incremental—formalizing known intuitions about inference-time scaling. Paper 2's novelty and cross-field implications (interpretability, safety, epistemology of LLMs) give it higher impact.

vs. OmniDiagram: Advancing Unified Diagram Code Generation via Visual Interrogation Reward
claude-opus-4.6 · 5/12/2026

Paper 1 reveals a fundamental geometric property of LLM representations—that temporal knowledge drift is encoded orthogonally to correctness and uncertainty—which explains a deep structural limitation of current hallucination detection methods. This finding has broad implications for AI safety, factuality, and model interpretability, and opens an entirely new research direction. Paper 2, while solid engineering work on diagram code generation with a novel reward strategy, addresses a narrower application domain with more incremental contributions. Paper 1's mechanistic insight and cross-model validation suggest wider and more lasting scientific influence.

vs. RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
claude-opus-4.6 · 5/12/2026

Paper 1 identifies a fundamentally new geometric structure in LLM representations—temporal knowledge drift as an independent axis orthogonal to correctness and uncertainty. This is a deeper scientific insight with broad implications for AI safety, hallucination detection, and mechanistic interpretability. The rigorous verification across multiple models with five orthogonality tests and the explanation of why existing methods structurally fail represents a paradigm-shifting finding. Paper 2, while useful, addresses a more incremental engineering problem (optimizing multi-agent communication topology) with narrower theoretical contribution.

vs. When Do We Need LLMs? A Diagnostic for Language-Driven Bandits
claude-opus-4.6 · 5/12/2026

Paper 1 reveals a fundamental structural property of LLM representations—that temporal knowledge drift is geometrically orthogonal to correctness and uncertainty—which has broad implications for AI safety, hallucination detection, and model evaluation. The discovery that existing uncertainty methods are blind to drift 'by construction' reframes a major open problem. The rigorous geometric verification across six models and multiple statistical tests, combined with mechanistic explanations, makes this highly novel and impactful. Paper 2 offers useful practical guidance for bandit algorithms but addresses a narrower application domain with more incremental contributions.

vs. Open Ontologies: Tool-Augmented Ontology Engineering with Stable Matching Alignment
gemini-3.1 · 5/12/2026

Paper 1 uncovers a fundamental structural property of LLM representations, providing novel insights into mechanistic interpretability and why models hallucinate outdated facts. This theoretical breakthrough has broad implications for AI safety and reliability, whereas Paper 2 offers a more applied engineering contribution specific to ontology alignment.

vs. Dsat: A Native SAT Solver for Discrete Logic
gpt-5.2 · 5/12/2026

Paper 2 has higher likely impact: it identifies a novel, mechanistic geometric axis (“temporal drift”) in LLM representations, supported by extensive multi-model experiments and multiple orthogonality tests. The findings are timely and broadly relevant across interpretability, evaluation, safety, and deployment, with clear real-world applications (detecting stale knowledge). Paper 1 is valuable—native discrete SAT could improve efficiency/semantics over binarization—but its impact is more specialized within constraint solving and depends heavily on empirical performance versus mature SAT/CSP ecosystems.

vs. SciIntegrity-Bench: A Benchmark for Evaluating Academic Integrity in AI Scientist Systems
claude-opus-4.6 · 5/12/2026

Paper 1 makes a fundamental discovery about LLM internal representations—that temporal knowledge drift is geometrically orthogonal to correctness and uncertainty, forming an independent axis. This has deep implications for AI safety, hallucination detection, and model interpretability, proving a structural limitation of existing uncertainty methods. The rigorous geometric analysis across six models with multiple verification tests represents a novel mechanistic insight. Paper 2, while timely and practically useful, is primarily a benchmark/evaluation contribution documenting known tendencies (completion bias, fabrication) without comparable mechanistic depth or broad theoretical impact.

vs. Enhancing Tabular Anomaly Detection via Pseudo-Label-Guided Generation
gemini-3.1 · 5/12/2026

Paper 2 addresses a critical and highly timely issue in LLMs (temporal knowledge drift) through a novel, structurally profound approach using mechanistic interpretability. Its discovery of an orthogonal axis for temporal drift offers broad implications for AI safety, factuality, and model updating. In contrast, Paper 1 offers an incremental methodological improvement in tabular anomaly detection, which has a narrower scope and less transformative potential across the broader AI field.

vs. E-TCAV: Formalizing Penultimate Proxies for Efficient Concept Based Interpretability
claude-opus-4.6 · 5/12/2026

Paper 1 identifies a fundamentally new geometric structure in LLM representations—temporal drift as an independent axis orthogonal to correctness and uncertainty—which explains a critical failure mode (confident outdated answers) that no existing method addresses. This is highly novel, methodologically rigorous (verified across six models with five orthogonality tests), and has immediate practical implications for LLM reliability. Paper 2 offers useful efficiency improvements to TCAV but is incremental in nature, optimizing an existing interpretability method rather than discovering a new structural phenomenon.

vs. GenTac: Generative Modeling and Forecasting of Soccer Tactics
gpt-5.2 · 5/12/2026

Paper 1 offers a broadly applicable, mechanistic insight into LLM behavior: temporal knowledge drift is encoded as a representation axis orthogonal to correctness/uncertainty, implying a fundamental limitation of many existing calibration-style detectors. It supports this with multiple rigorous geometric tests, cross-model cutoff experiments, and strong probe performance across several models, and it directly targets a timely, high-stakes reliability problem affecting many downstream applications. Paper 2 is innovative and useful for sports analytics, but its impact is more domain-bounded despite methodological strength.