Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
Qiyao Liang, Risto Miikkulainen, Ila Fiete
Abstract
Language models draw on two knowledge sources: facts baked into weights (parametric memory, PM) and information in context (working memory, WM). We study two mechanistically distinct failure modes--conflict, when PM and WM disagree and interfere; and hallucination, when the queried fact was never learned. Both produce confident output regardless, making output-based monitoring blind by design. We show both failures share a unified geometric account. In the hidden-state space of autoregressive generation, learned facts form attractor basins. Conflict is basin competition: WM disrupts convergence to the correct basin without raising output entropy. Hallucination is basin absence: the hidden state drifts freely when no memorized basin exists. The frozen LM head, designed for next-token prediction, cannot distinguish these cases and fires confidently either way. We verify this account in a controlled synthetic task--entity identifiers mapped to unique codes with PM installed via LoRA adapters--where ground truth is exact and component roles can be causally isolated through targeted adapter placement. Geometric margin--the hidden state's distance to the nearest memorized basin--reads this geometry directly and separates correct recall from hallucination far more cleanly than output entropy, with zero false refusals where entropy-based detection cannot avoid rejecting the vast majority of correct outputs. The separation holds on natural-language factual queries from the pretrained model with no adaptation, confirming attractor geometry is structural rather than a fine-tuning artifact. The fraction of confident hallucinations follows a scaling law, growing with scale even as overall error rates fall. Hidden states reliably encode epistemic state; the frozen output head systematically erases it--and this erasure worsens with scale.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper proposes a unified attractor geometry framework to explain two distinct failure modes in language models: (1) conflict, when parametric memory (PM) and working memory (WM) disagree, and (2) hallucination, when the model generates confident answers about facts it never learned. The key insight is that learned facts form attractor basins in hidden-state space—conflict is basin competition, hallucination is basin absence—and the frozen LM head cannot distinguish between these states and correct recall, producing confident outputs regardless. The paper introduces geometric margin (distance to nearest memorized basin center) as a diagnostic that separates correct recall from hallucination with zero false refusals, dramatically outperforming output entropy.
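The geometric margin described above can be illustrated with a minimal sketch. It assumes that a basin center is the mean of the hidden states recorded for one memorized fact and that distance is Euclidean; the paper's exact definitions may differ. The names `basin_center` and `geometric_margin` and the 2-D toy vectors are hypothetical, chosen only for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def basin_center(states: Sequence[Vector]) -> List[float]:
    """Assumption: a basin center is the per-dimension mean of the
    hidden states collected for one memorized fact."""
    n = len(states)
    return [sum(column) / n for column in zip(*states)]


def geometric_margin(hidden: Vector, centers: Sequence[Vector]) -> float:
    """Distance from a hidden state to the nearest basin center.
    Small values suggest the state sits in a memorized basin;
    large values suggest no memorized basin is nearby."""
    return min(euclidean(hidden, center) for center in centers)


# Toy 2-D hidden states for two memorized facts (illustrative values only).
centers = [
    basin_center([[0.0, 0.0], [0.2, 0.0]]),
    basin_center([[4.0, 4.0], [4.0, 4.2]]),
]

near_state = [0.1, 0.1]   # close to the first basin center
far_state = [2.0, -3.0]   # far from both basin centers

print(geometric_margin(near_state, centers))
print(geometric_margin(far_state, centers))
```

In this toy setup the state near a center yields a small margin and the distant state a large one, which is the contrast the review attributes to correct recall versus hallucination.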
The conceptual unification is compelling: two seemingly distinct failure modes are explained as different geometric configurations in a single attractor landscape. This is more than a detection method—it's a mechanistic account of *why* the LM head is epistemically blind and *why* this blindness worsens with scale.
Methodological Rigor
The experimental design is well-structured with three orthogonal degrees of freedom (what is memorized, how strongly, and where via LoRA placement). The synthetic task—mapping entity identifiers to 5-digit codes—provides exact ground truth, which is a genuine methodological strength for mechanistic studies where natural-language tasks confound too many variables.
Strengths in rigor:
Weaknesses in rigor:
Potential Impact
Detection applications: The zero-false-refusal property of geometric margin is practically significant. If scalable basin center computation can be solved, this could enable deployment-time hallucination monitoring that is fundamentally more reliable than entropy-based methods, especially for larger models where the problem worsens.
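One way to make the zero-false-refusal property concrete is the sketch below. It assumes a detector that refuses any output whose score is at or above a threshold, with the threshold set just low enough to refuse every hallucination. The function name `false_refusal_rate` and all scores are hypothetical toy values, not results from the paper.

```python
from typing import Sequence


def false_refusal_rate(correct_scores: Sequence[float],
                       halluc_scores: Sequence[float]) -> float:
    """Fraction of correct outputs refused when the threshold is set
    to refuse every hallucination. Higher score = more suspicious."""
    threshold = min(halluc_scores)  # lowest score that still catches all hallucinations
    refused = sum(1 for score in correct_scores if score >= threshold)
    return refused / len(correct_scores)


# Illustrative toy scores only (assumed, not taken from the paper).
# Margin: correct and hallucinated outputs are cleanly separated.
margin_correct = [0.10, 0.20, 0.15]
margin_halluc = [3.0, 2.5]

# Entropy: both groups are confident, so the score ranges overlap.
entropy_correct = [0.02, 0.05, 0.01, 0.04]
entropy_halluc = [0.03, 0.06]

print(false_refusal_rate(margin_correct, margin_halluc))    # 0.0
print(false_refusal_rate(entropy_correct, entropy_halluc))  # 0.5
```

When the two score distributions do not overlap, the rate is zero; when they overlap, catching every hallucination necessarily refuses some correct outputs.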
Architectural implications: The finding that the LM head is an "epistemic bottleneck"—erasing geometric information that reliably encodes knowledge state—has design implications. The negative result on end-to-end metacognitive heads (Appendix H) and the temporal paradox (the signal arrives too late to influence routing) point toward multi-pass architectures, which could influence how future models are designed for reliability.
Scaling insight: The demonstration that confident hallucination fraction *grows* with scale (C rising from 52% to 99% as width increases 16×) while geometric separation simultaneously improves (1.2× to 153×) is an important warning for the field—larger models become worse at self-monitoring through their outputs even as their internal representations become more informative.
Theoretical contribution: The connection between dynamical systems theory and practical LM failure modes, grounded in empirical Jacobian analysis, advances mechanistic interpretability beyond circuit-level descriptions toward global geometric accounts.
Timeliness & Relevance
This addresses a critical bottleneck: as LMs are deployed in high-stakes settings, confident hallucination is perhaps the most dangerous failure mode. The finding that this problem *worsens* with scale—contradicting the hope that scaling alone will solve reliability—is highly relevant. The paper arrives as the field grapples with the inadequacy of output-based uncertainty estimation.
Key Strengths
1. Unified mechanistic account connecting conflict and hallucination through attractor geometry
2. Clean causal dissection via targeted LoRA placement with predictions confirmed
3. Scaling law with clean empirical verification across model families
4. Honest limitation reporting (TruthfulQA failures, conflict detection limits, metacognitive head negative results)
5. Cross-validation on pretrained knowledge rules out fine-tuning artifacts
Key Limitations
1. Scalability gap: Basin center computation doesn't scale to real deployment
2. Synthetic-to-natural gap: 5-digit codes are far from real-world knowledge complexity
3. Conflict detection is weak: Static geometry achieves only modest AUROC for the conflict failure mode
4. Limited model scale: Experiments are primarily on a 3B model; scaling analyses cover up to 14B but not frontier-scale models
5. Adversarial brittleness: Framework fails on reasoning-based errors (TruthfulQA cross-domain AUROC=0.476)
Overall Assessment
This is a theoretically rich paper that provides a compelling geometric framework for understanding LM failure modes, with strong controlled experiments and honest evaluation of boundaries. The core insight—that hidden states encode epistemic state but the LM head erases it, and this erasure worsens with scale—is important and well-supported. The practical impact is currently limited by scalability concerns, but the theoretical contribution and the scaling law are substantial. The work advances mechanistic interpretability in a direction with clear practical relevance.
Generated May 8, 2026
Comparison History (21)
Paper 1 provides fundamental mechanistic insights into LLM hallucinations and memory conflicts using attractor geometry. It addresses a critical, widely studied problem in AI reliability and safety, offering a novel detection metric and revealing scaling laws. Paper 2, while presenting an efficient architectural improvement for memory, focuses more on engineering optimization and has narrower theoretical implications compared to the deep interpretability contributions of Paper 1.
Paper 1 offers a more novel, mechanistic account—unifying conflict and hallucination via attractor-basin geometry—and proposes an interpretable internal-state metric (geometric margin) that outperforms entropy for detection, with controlled causal validation and evidence on natural queries plus a scaling law. This combination of theory, measurement, and scaling relevance can influence interpretability, evaluation, and mitigation across many LLM settings. Paper 2 is timely and application-relevant for agent safety, but is primarily an empirical dataset finding with a prompt-induced effect that may be more contingent on deployment conventions and mitigations.
Paper 1 offers a mechanistic, geometry-based theory unifying conflict and hallucination in transformers, proposes a concrete internal metric (geometric margin) with strong empirical separation beyond entropy, and suggests a scaling law—advances likely to influence interpretability, reliability, and model design broadly. It combines causal interventions (LoRA placement) with transfer to natural queries, strengthening rigor and generality. Paper 2 is timely and practically valuable as a large-scale evaluation and dataset on metacognitive degradation, but it is more diagnostic than mechanistic and may generalize less across architectures/training regimes.
Paper 2 offers a more novel, general mechanistic theory (attractor-basin geometry) unifying conflict and hallucination, with a concrete diagnostic (geometric margin) that outperforms entropy and appears to transfer from controlled causal experiments to natural queries. Its claims are timely (hallucinations, monitoring) and potentially broad-impact across interpretability, safety, scaling laws, and model design. Paper 1 is rigorous and practically valuable for mobile GUI agents, but its contributions are more domain-specific and incremental (modalities, benchmarks, guidance efficacy) with narrower cross-field reach.
Paper 2 offers a fundamental mechanistic insight into transformer behavior—explaining both conflict and hallucination through attractor geometry—with broad implications across all LLM research. The discovery that the output head systematically erases epistemic state information, and that this worsens with scale (following a precise scaling law), is a deep theoretical contribution that could reshape how the community approaches hallucination detection, uncertainty quantification, and model interpretability. Paper 1, while valuable as a benchmark for disaster response agents, is more domain-specific and incremental in nature, primarily revealing performance gaps rather than new mechanistic understanding.
Paper 2 provides a fundamental mechanistic understanding of transformer memory, offering a unified geometric account of conflict and hallucination. This foundational insight into LLM behavior has broader implications for interpretability, model design, and mitigating hallucinations across the field, whereas Paper 1 focuses on a more specific, albeit practical, engineering solution for safety guardrails.
Paper 2 offers a mechanistic, unifying theory (attractor-basin geometry) connecting conflict and hallucination, plus a new internal metric (geometric margin) that outperforms entropy and appears to generalize from a controlled causal setup to natural queries, with an explicit scaling law. This combination of conceptual novelty, methodological rigor, and broad relevance to interpretability, safety, and monitoring across many transformer uses suggests wide cross-field impact. Paper 1 is highly practical and timely for benchmark security, but is narrower in scope and more engineering/auditing-oriented.
Paper 1 offers a fundamental mechanistic understanding of transformer memory and hallucination through attractor geometry. This theoretical foundation has profound implications for understanding model behavior, detecting hallucinations, and influencing future architectures, representing a deeper scientific contribution than the specific, albeit practical, alignment optimization technique presented in Paper 2.
Paper 2 likely has higher impact: it targets a high-value, timely problem (scalable post-training of vision-language-action agents) and proposes a broadly applicable paradigm—task-agnostic world models plus VLM-derived rewards—for zero-shot imagination-based RL, with claimed gains in both simulation and real-world settings. The approach has clearer near-term real-world applications (robotics/embodied AI) and wider cross-field relevance (RL, world models, VLMs, robotics). Paper 1 is novel and mechanistic, but its immediate practical leverage may be narrower and more diagnostic than enabling.
Paper 1 offers a mechanistic, geometry-based theory unifying conflict and hallucination in transformer hidden states, proposes a measurable diagnostic (geometric margin), provides causal isolation via controlled LoRA experiments, and reports scaling behavior—advances likely to influence core LM interpretability, reliability, and safety across many domains. Paper 2 is highly valuable and timely as an ecologically valid benchmark with strong stakeholder grounding, but its primary contribution is evaluative infrastructure for one profession; broader scientific generalization and conceptual novelty are more limited compared to Paper 1’s potential to reshape understanding and mitigation of hallucinations system-wide.
Paper 1 offers a profound mechanistic understanding of LLM hallucinations and memory conflicts using attractor geometry, addressing a critical and universal flaw in modern AI. By providing a structural explanation and a superior detection metric (geometric margin) with derived scaling laws, it advances foundational AI science. In contrast, Paper 2 provides a valuable but more narrowly focused engineering and dataset contribution for mobile agents. Paper 1's theoretical depth and broader applicability across all LLM research give it a significantly higher potential scientific impact.
Paper 1 provides fundamental, mechanistic insights into how transformers process memory and why they hallucinate, proposing a novel geometric framework of 'attractor basins'. This theoretical depth addresses a core challenge in AI reliability with broad implications across the field. In contrast, Paper 2 presents a practical, application-specific optimization for inference-time budget control in search agents. Paper 1's structural discoveries and scaling laws regarding epistemic state erasure offer greater potential for paradigm-shifting scientific impact.
Paper 2 investigates the fundamental geometric mechanics of Transformer memory, offering a profound theoretical explanation for hallucinations and memory conflicts. By identifying attractor basins in hidden states, it provides a highly novel and robust method for hallucination detection that outperforms traditional output entropy. Its insights into scaling laws and mechanistic interpretability give it much broader impact across AI safety, alignment, and core architectural research compared to Paper 1's more specialized, albeit useful, algorithmic optimization for search agents.
Paper 1 is more novel and timely: it offers a mechanistic, geometry-based unification of conflict vs hallucination in transformers, proposes a concrete hidden-state metric (geometric margin) that outperforms entropy-based monitoring, and reports a scaling-law insight about confident hallucinations. The work is methodologically strong (controlled synthetic setup with causal isolation via LoRA placement plus validation on natural queries) and broadly relevant across interpretability, reliability, alignment, and scaling of LMs. Paper 2 is solid applied integration of DRL, model learning, and planning with a useful benchmark tool, but its core ideas are more incremental and impact is narrower.
Paper 1 offers a highly novel geometric framework unifying two critical LLM failure modes (conflict and hallucination) through attractor basin dynamics, revealing a scaling law for confident hallucinations and demonstrating that hidden states encode epistemic information that the output head systematically erases. This has broad implications for AI safety, interpretability, and LLM reliability—topics of immense current relevance. Paper 2 makes a solid but incremental contribution to heterogeneous federated learning with a structural alignment method yielding modest improvements. Paper 1's mechanistic insights, scaling laws, and cross-cutting relevance give it substantially higher impact potential.
Paper 1 offers a more novel, mechanistic account of LLM failure modes via attractor geometry, unifying conflict and hallucination and proposing an internal-state metric (geometric margin) that outperforms entropy-based monitoring. It combines causal isolation in a controlled synthetic setup with validation on natural-language queries and posits a scaling law, making it timely and broadly relevant to interpretability, reliability, and safety across LLM applications. Paper 2 is solid and practical with formal guarantees for multi-agent code systems, but its impact is narrower (software-agent orchestration) and more incremental relative to existing routing/retrieval frameworks.
Paper 1 offers a mechanistic, geometry-based unification of conflict vs. hallucination in transformer generation, proposes a measurable internal signal (geometric margin) that outperforms entropy, validates via controlled causal LoRA setups and natural queries, and highlights a scaling-law implication—broadly relevant to interpretability, reliability, safety, and monitoring of LLMs. Its conceptual novelty and cross-domain impact potential are higher. Paper 2 has strong applied value for traffic control, but is more domain-specific and depends on LLM-guided heuristic evolution in simulation, with narrower methodological/theoretical generality.
Paper 2 investigates fundamental failure modes of LLMs (hallucinations and memory conflicts) through a novel mechanistic lens (attractor geometry). This deep theoretical insight into working vs. parametric memory has broad implications for interpretability, AI safety, and hallucination mitigation across all LLM applications. Paper 1, while rigorously proposing an architecture for multi-agent code generation, addresses a much narrower domain and technical problem, limiting its broader scientific impact compared to the foundational discoveries in Paper 2.
Paper 2 provides a fundamental mechanistic insight into transformer behavior—showing that conflict and hallucination share unified attractor geometry, that the LM head systematically erases epistemic signals, and that confident hallucinations follow a scaling law worsening with model size. This is a deep theoretical contribution with broad implications for interpretability, hallucination detection, and scaling research. Paper 1 makes a solid engineering contribution on reasoning-trace safety with practical mitigations, but is more incremental (extending safety evaluation to reasoning chains). Paper 2's geometric framework and scaling law discovery have broader cross-field impact and longer-lasting theoretical significance.
Paper 2 addresses fundamental issues in LLMs (hallucinations and memory conflicts) using mechanistic interpretability, providing insights into attractor geometry that apply broadly across AI. Paper 1 offers an innovative neuro-symbolic approach but is restricted to the specific domain of traffic signal control, resulting in a narrower scope of scientific impact compared to Paper 2's foundational contributions.