When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction
Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür
Abstract
Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper proposes a channel-transition framework to mechanistically explain why LLMs lose instruction-following ability over multi-turn conversations — a well-documented behavioral phenomenon lacking internal explanation. The framework decomposes goal-state propagation into two channels: (1) direct attention to goal-defining tokens (measured by the novel Goal Accessibility Ratio, GAR), and (2) residual-stream representations that persist independently of attention. The key insight is that as conversations grow, attention to system-prompt tokens decays monotonically, and what survives after this "attention closure" depends on architecture-specific residual encoding properties.
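To make the GAR idea concrete, here is a minimal sketch. The abstract only says that GAR measures attention from generated tokens to goal tokens, so the exact normalization and the layer and head aggregation used by the authors are not known here. The function name `goal_accessibility_ratio`, the input layout (one attention row per generated token, already averaged over heads and layers), and the choice of a mean attention share are all assumptions for illustration.

```python
from typing import Iterable, Sequence


def goal_accessibility_ratio(
    attn_rows: Sequence[Sequence[float]],
    goal_positions: Iterable[int],
) -> float:
    """Mean share of attention mass that generated tokens place on goal tokens.

    attn_rows[i][j] is the (assumed, pre-aggregated) attention weight from the
    i-th generated token to context position j. goal_positions are the indices
    of the task-defining tokens, e.g. the system prompt.
    """
    goal = set(goal_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip degenerate rows rather than divide by zero
        goal_mass = sum(w for j, w in enumerate(row) if j in goal)
        shares.append(goal_mass / total)
    return sum(shares) / len(shares) if shares else 0.0


# Hypothetical toy input: two generated tokens, three context positions,
# with positions 0 and 1 holding the goal tokens.
rows = [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]
gar = goal_accessibility_ratio(rows, goal_positions=[0, 1])
```

Under this reading, a value near zero means generated tokens place almost no attention on the goal tokens, which is the situation the review describes as the attention channel closing.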
The paper introduces three complementary instruments: GAR as a diagnostic, sliding-window (SW) ablations as causal interventions, and linear residual-stream probes as measurements of the second channel. Together, these yield parametric predictions of when multi-turn failures occur.
Methodological Rigor
The experimental design is notably careful and well-structured:
Strengths in design: The four-task suite (information retention, controlled complexity, persona compliance, policy compliance) isolates different types of goal conditioning (lexical facts, stylistic rules, conditional policies). The sliding-window intervention is elegant — it structurally forces attention closure at a geometrically predictable turn, enabling clean pre/post-crossover comparisons within single inference runs. The dose-response relationship between window size and crossover turn (R² > 0.999) is compelling.
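The "geometrically predictable turn" can be illustrated with a simplified model. This is a sketch under stated assumptions, not the paper's fitted formula: the goal tokens sit at the very start of the context, every turn adds the same number of tokens (`tokens_per_turn`, hypothetical), and a token at position n attends only to the `window` most recent positions, itself included.

```python
def crossover_turn(window: int, tokens_per_turn: int) -> int:
    """First turn (1-indexed) whose first token can no longer attend to any goal token.

    Assumed model: goal tokens occupy positions 0..P-1, turn t starts at
    position P + (t - 1) * tokens_per_turn, and a token at position n sees
    positions n - window + 1 .. n. The goal is fully outside the window once
    (t - 1) * tokens_per_turn >= window - 1, so the prompt length P cancels.
    """
    if window < 1 or tokens_per_turn < 1:
        raise ValueError("window and tokens_per_turn must be positive")
    turns_needed = -(-(window - 1) // tokens_per_turn)  # integer ceiling
    return 1 + turns_needed


# Hypothetical settings: larger windows push the crossover later.
for w in (512, 1024, 2048):
    print(w, crossover_turn(w, tokens_per_turn=100))
```

Under these assumptions the crossover turn grows in proportion to the window size, up to rounding, which is one simple way a very tight dose-response fit between window size and crossover turn could arise.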
Statistical rigor: Mann-Kendall tests across 10 architectures for monotonic GAR decline, permutation nulls for probe significance, bootstrap confidence intervals, and Wilson intervals for proportions are all appropriate. The 5,483 episodes provide reasonable statistical power.
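For readers unfamiliar with two of the tests named above, the following sketch shows a Wilson score interval for a proportion and a basic Mann-Kendall trend test. It assumes no tied values in the series (so no tie correction is applied) and uses the normal approximation; the numbers in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mann_kendall(series):
    """Mann-Kendall S statistic, z-score, and two-sided p-value (no tie correction)."""
    n = len(series)
    if n < 3:
        raise ValueError("need at least 3 observations")
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)  # sign of the pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p_value


# Hypothetical per-turn GAR values that decline monotonically.
gar_by_turn = [0.30, 0.27, 0.25, 0.22, 0.20, 0.17, 0.15, 0.12, 0.10, 0.08]
s_stat, z_score, p_val = mann_kendall(gar_by_turn)
low, high = wilson_interval(50, 100)
```

A negative S with a small p-value indicates a monotonic decline, which is the pattern the review reports for GAR across architectures.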
Important caveats the authors acknowledge: The causal interpretation is strongest for Mistral (natively trained with SW attention); for LLaMA and Qwen, the SW intervention is out-of-distribution, weakening causal claims. The activation patching experiment (Appendix D.5) is a notable negative result — residual encodings predict outcomes with AUC up to 0.99 but are not causally sufficient when patched, meaning the probes identify *where* information persists but not *how* it's used. This is a significant gap between encoding and causal mechanism.
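The probe results are reported as AUC against a permutation null. A minimal sketch of that evaluation step follows, assuming the probe scores are already computed. It is a simplification: it shuffles labels against fixed scores, whereas a full permutation null would typically refit the probe on each shuffled labeling. The function names and the toy data are hypothetical.

```python
import random


def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_perm=200, seed=0):
    """Observed AUC and the share of label shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (1 + hits) / (1 + n_perm)


# Hypothetical probe scores: 10 failed episodes (label 0), 10 successful (label 1).
toy_scores = [i / 20 for i in range(20)]
toy_labels = [0] * 10 + [1] * 10
observed_auc, p_value = permutation_p_value(toy_scores, toy_labels)
```

A high AUC with a small permutation p-value shows only that the outcome is decodable from the representation. As the activation-patching null discussed above indicates, it does not show that the model uses that encoding.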
Concerns: The 50-turn conversation limit is relatively short, and under default attention, GAR never reaches the closed-channel floor — the SW intervention is needed to study post-closure behavior within practical lengths. This means the framework's relevance to natural multi-turn degradation (without forced closure) remains somewhat indirect. The linear probing methodology, while standard, has known limitations regarding conflating encoding with use, which the authors partially address but cannot fully resolve.
Potential Impact
Diagnostic utility: GAR is immediately useful as a monitoring tool for deployed multi-turn systems. It provides a per-architecture, per-turn measurement of whether the model can still "see" its instructions, enabling runtime detection of impending failures.
Architectural insights: The finding that residual encoding depth varies dramatically (layer 2 to layer 27) across architectures, and that this correlates with post-closure behavioral survival, has implications for architecture design. The non-monotonic scaling with parameter count within the Qwen family challenges naive assumptions about bigger-is-better for multi-turn reliability.
Safety implications: The finding that forced channel closure raises persona violations *above adversarial-pressure baselines* without any user pressure is concerning for deployed systems and suggests a vector for exploitation in very long conversations.
Limitations on impact: The framework currently applies to structured, fixed-goal conversations. Real multi-turn degradation involves goal evolution, implicit instructions, and more complex dynamics. The paper doesn't directly bridge to natural degradation scenarios.
Timeliness & Relevance
This work addresses a timely and practical bottleneck. As LLMs are deployed as persistent conversational agents, customer service bots, and agentic systems, multi-turn reliability becomes critical. Multiple recent papers (Laban et al. 2025, He et al. 2024, Jia et al. 2025) have documented the behavioral phenomenon; this paper is among the first to provide a mechanistic account. The combination of behavioral measurement, internal mechanistic analysis, and causal intervention fills a clear gap.
Strengths & Limitations
Key strengths:
Notable weaknesses:
Overall assessment: This is a well-executed mechanistic interpretability study that introduces useful tools (GAR, the two-channel framework) and reveals meaningful architectural differences in multi-turn robustness. The work is careful about its claims, though the title and framing somewhat overstate the mechanistic completeness of the account given the activation-patching null. The contribution is primarily diagnostic and descriptive rather than providing a complete causal mechanism, but the diagnostics themselves are valuable.
Generated May 14, 2026
Comparison History (21)
Paper 2 addresses one of the most critical challenges in LLMs (hallucinations) with a training-free, universal, inference-time algorithm. Its demonstrated effectiveness across 15 models and 8 families without requiring external data or fine-tuning offers immense practical utility and immediate real-world applicability, likely leading to broader adoption and higher scientific impact compared to the mechanistic diagnostic focus of Paper 1.
Paper 2 offers a novel mechanistic explanation for a widely observed failure mode in LLMs (multi-turn instruction degradation), introduces a new diagnostic metric (GAR), and provides causal evidence through ablation studies. Its breadth of impact is higher—it applies across LLM architectures and has immediate implications for AI safety, alignment, and system design. Paper 1 addresses a valuable but narrower clinical niche (ECG simulation under interventions). While rigorous, its impact is more domain-specific. Paper 2's timeliness in the era of widespread LLM deployment and its foundational mechanistic insights give it broader and more transformative potential.
Paper 1 provides a foundational, mechanistic explanation for a ubiquitous behavioral failure in LLMs (losing context over multi-turn interactions). By introducing a novel metric (GAR) and demonstrating causal mechanisms across different architectures, it offers deep scientific insights that could fundamentally influence future architectural designs and context-handling techniques. While Paper 2 addresses a critical applied issue in AI safety, Paper 1's methodological rigor and contribution to mechanistic interpretability promise broader, more fundamental scientific impact.
Paper 2 provides a novel mechanistic explanation for a fundamental and widely observed failure mode of LLMs (losing context in multi-turn interactions). It introduces a new diagnostic metric (GAR) and a general theoretical framework (channel-transition account), and demonstrates findings across multiple architectures. This has broader impact across the entire LLM research community and practical applications in any multi-turn system. Paper 1, while rigorous and clinically relevant, addresses a narrower domain (sepsis management) with an incremental combination of existing techniques (world models + LLM agents + RL), limiting its breadth of impact.
Paper 2 has higher impact potential due to a more mechanistic, broadly applicable account of a central LLM limitation (instruction/goal drift in multi-turn dialogue). It introduces a general diagnostic metric (GAR), combines causal ablations and probing across multiple architectures, and yields predictive claims about failure timing—strong methodological rigor and transfer across models. Real-world relevance is immediate for deploying agents/chatbots and for model design (memory, attention, alignment). Paper 1 is novel and useful for safety policy/annotation workflows, but its impact is narrower and more domain-specific.
Paper 1 offers a more novel mechanistic framework for multi-turn instruction degradation in LLMs, introducing a concrete diagnostic (GAR) plus causal interventions and cross-architecture comparisons that can generalize broadly to interpretability, alignment, and long-context reliability. Its methodology is relatively rigorous (behavioral measures, attention/residual analyses, ablations, probing with strong predictive performance) and timely given widespread deployment of chat models. Paper 2 has clear real-world utility in traffic signal control, but its core contributions (reward shaping/regularization for RL fine-tuning) are more incremental and domain-specific, with narrower cross-field impact.
Paper 1 is more likely to have higher scientific impact: it introduces a broadly applicable learning-theoretic framework for distribution-aware synthesis of executable solvers, with provable generalization of both correctness and runtime—an important and under-addressed objective. It also demonstrates large empirical gains across multiple combinatorial optimization classes and competitive performance on real benchmark instances (PACE), indicating near-term practical adoption in operations research, systems, and automated algorithm design. Paper 2 offers strong mechanistic interpretability tools (GAR) and insights, but its immediate real-world leverage is narrower and more diagnostic than transformative.
Paper 1 provides a novel mechanistic explanation for a widely observed but poorly understood phenomenon—multi-turn degradation in LLMs. It introduces a new diagnostic metric (GAR), a causal framework (channel-transition), and demonstrates cross-architecture generalizability with rigorous ablation studies and probing experiments. The mechanistic insights have broad implications for LLM architecture design, safety, and alignment. Paper 2 introduces a useful engineering contribution (reasoning graphs) for agent accuracy, but is more narrowly scoped as a prompting/context-engineering technique without retraining, and lacks empirical results on benchmarks in the abstract.
Paper 1 offers higher scientific impact by addressing a fundamental, ubiquitous limitation of LLMs (multi-turn context degradation) through rigorous mechanistic interpretability. By introducing the Goal Accessibility Ratio and uncovering how attention channels transition to residual streams, it provides actionable, causal insights into architectural flaws. This will directly influence foundational model design and evaluation. While Paper 2 presents a highly innovative agent architecture, it relies on a specific Lisp-based paradigm. Paper 1's generalizable diagnostics and mechanistic explanations of a critical failure mode ensure broader and more immediate relevance across the AI field.
Paper 2 has higher likely impact due to broader relevance and mechanistic depth: it introduces a general framework (channel-transition) and a quantitative diagnostic (GAR) for multi-turn failure in LLMs, validated across architectures with causal interventions and predictive signals. This can inform model design, evaluation, and safety/reliability across many applications, beyond any single training method. Paper 1 is novel and useful for multimodal RL optimization, but its scope is narrower (LVLM RLVR/GRPO) and depends on structured token role decomposition, limiting cross-field breadth relative to Paper 2.
Paper 1 offers a more mechanistic, diagnostic contribution: it proposes a concrete channel-transition account for multi-turn instruction failure, introduces GAR, combines attention ablations with residual-stream probing, and provides causal interventions plus strong cross-architecture results and predictive timing. This advances interpretability, evaluation, and reliability across many LLM use cases and architectures, making its impact broad and timely. Paper 2 is practically useful for RL fine-tuning, but its stability metrics (ACF, PE) are relatively heuristic and likely narrower in scope and novelty than Paper 1’s mechanistic framework and diagnostics.
Paper 2 offers fundamental mechanistic insights into a widespread limitation of LLMs (context loss in multi-turn interactions) rather than just an optimization technique. By introducing novel metrics (GAR), causal ablations, and cross-architecture analysis, it provides a foundational understanding that can drive future architectural designs and broad downstream research. Paper 1 presents a solid but arguably more incremental training optimization (curriculum learning for RLHF), making Paper 2 more likely to have a profound and lasting scientific impact across the field of AI.
Paper 2 provides a novel mechanistic explanation for a widely observed but poorly understood phenomenon—instruction degradation in multi-turn LLM interactions. It introduces a new diagnostic metric (GAR), a causal framework (channel-transition account), and demonstrates cross-architecture generalizability with causal ablation experiments. This mechanistic understanding has broader implications for LLM architecture design, safety, and alignment. While Paper 1 offers a solid engineering contribution with its memory framework, Paper 2 advances fundamental understanding of transformer behavior, which is likely to inspire more follow-on research and have deeper scientific impact.
Paper 1 offers a novel mechanistic framework (GAR, channel-transition account) explaining a fundamental and widely observed failure mode of LLMs: multi-turn degradation. It provides causal ablation evidence, cross-architecture analysis, and predictive diagnostics applicable broadly to LLM research. Its contributions are foundational and could influence LLM design, safety, and alignment across many domains. Paper 2 addresses a more specialized problem (RS tool selection) with a useful but incremental engineering contribution (hierarchical skill trees) within a narrower application domain, limiting its broader scientific impact.
Paper 2 offers a novel mechanistic explanation for a widely observed but poorly understood phenomenon in LLMs—instruction degradation over multi-turn interactions. It introduces a new diagnostic metric (GAR), a theoretical framework (channel-transition account), and provides causal evidence across multiple architectures. This has broader impact: it informs LLM architecture design, safety/alignment research, and deployment practices. Paper 1 addresses a practical but more incremental engineering problem (agent routing) with a narrower scope, while Paper 2 advances fundamental understanding of transformer behavior with cross-architecture generality.
Paper 2 offers a novel mechanistic explanation for a widely observed but poorly understood phenomenon (multi-turn degradation in LLMs), introduces a new diagnostic metric (GAR), and provides causal evidence through ablation studies across multiple architectures. Its mechanistic insights into attention dynamics have broad implications for LLM architecture design, safety, and alignment. Paper 1 proposes a prompting strategy (HCoT) that, while practical, is more incremental: it builds on existing prompting paradigms (CoT, ToT) and addresses a narrower problem with less fundamental insight into LLM behavior.
Paper 2 provides deep mechanistic insights into a fundamental and widespread limitation of LLMs (context loss in multi-turn interactions). By introducing novel diagnostic metrics and causal ablations, it informs foundational architectural improvements. This level of mechanistic interpretability generally yields higher scientific impact than the application-specific prompting framework presented in Paper 1.
Paper 2 offers a novel mechanistic explanation for a fundamental LLM limitation (multi-turn instruction degradation), introduces a new diagnostic metric (GAR), and provides causal evidence across multiple architectures. Its contributions—mechanistic framework, diagnostic tools, and predictive model of failure timing—have broad implications for LLM architecture design, safety, and alignment. Paper 1, while practically useful for depression detection, represents a more incremental application of existing LLM capabilities to a specific clinical domain with less potential for cross-field impact.
Paper 1 addresses a fundamental, widely recognized limitation of LLMs (context loss in multi-turn interactions) by providing a novel mechanistic explanation and diagnostic tool (GAR). Its insights into attention channels and residual representations have broad implications for model architecture and interpretability across the AI field. Paper 2 presents a solid but more specialized application of LLMs to Graph Contrastive Learning, which, while methodologically rigorous, has narrower potential impact than the foundational architectural insights offered by Paper 1.
Paper 2 has higher likely scientific impact due to a broadly relevant, timely problem (multi-turn instruction drift) and a mechanistic, testable framework (channel-transition account) with a clear quantitative metric (GAR), causal interventions (forced attention closure), and cross-architecture validation. Its methods (attention analysis, ablations, residual probes) are rigorous and likely reusable as diagnostics across labs and model families, affecting alignment, interpretability, and agent design. Paper 1 is applied and useful, but is more domain/product-oriented and may have narrower academic uptake and harder-to-validate novelty claims.