When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction

Vardhan Dongre, Joseph Hsieh, Viet Dac Lai, Seunghyun Yoon, Trung Bui, Dilek Hakkani-Tür

Rank: #123 of 2292 · Artificial Intelligence
Tournament Score: 1535±46
Win Rate: 90% (19 wins, 2 losses, 21 matches)
Rating: 6.8/10 (Significance, Rigor, Novelty, Clarity)

Abstract

Large language models can follow complex instructions in a single turn, yet over long multi-turn interactions they often lose the thread of instructions, persona, and rules. This degradation has been measured behaviorally but not mechanistically explained. We propose a channel-transition account: goal-defining tokens become less accessible through attention, while goal-related information may persist in residual representations. We introduce the Goal Accessibility Ratio (GAR), measuring attention from generated tokens to task-defining goal tokens, and combine it with sliding-window ablations and residual-stream probes. When attention to instructions closes, what survives reveals architecture. Across architectures, the transition yields qualitatively distinct failure modes: some models preserve goal-conditioned behavior at vanishing attention, others fail despite decodable residual goal information, and the layer at which this encoding emerges varies from 2 to 27. A within-model causal ablation that force-closes the attention channel in Mistral collapses recall from near-perfect to 11% on a 20-fact retention task and raises persona-constraint violations above an adversarial-pressure baseline without user pressure, with both effects emerging at the predictable crossover turn. Linear probes recover per-episode recall outcomes from residual representations with AUC up to 0.99 across all four primary architectures, while input embeddings remain at chance. Across architectures and model scales, the gap between attention loss and residual decodability predicts whether goal-conditioned behavior survives channel closure. We contribute GAR as a diagnostic, the channel-transition framework as a controlled mechanistic account, and a parametric prediction of failure timing under windowed attention closure.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper proposes a channel-transition framework to mechanistically explain why LLMs lose instruction-following ability over multi-turn conversations — a well-documented behavioral phenomenon lacking internal explanation. The framework decomposes goal-state propagation into two channels: (1) direct attention to goal-defining tokens (measured by the novel Goal Accessibility Ratio, GAR), and (2) residual-stream representations that persist independently of attention. The key insight is that as conversations grow, attention to system-prompt tokens decays monotonically, and what survives after this "attention closure" depends on architecture-specific residual encoding properties.
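As a rough illustration of the first channel, a GAR-style quantity can be sketched as the share of attention mass that generated tokens place on goal-defining tokens. This is a minimal sketch under assumptions: the function name, the input layout, and the choice to average per-token fractions are illustrative, and the paper's exact normalization and its aggregation over layers and heads are not stated in this assessment.

```python
def goal_accessibility_ratio(attn_rows, goal_positions):
    """Mean fraction of attention mass that lands on goal tokens.

    attn_rows: one attention distribution per generated token, each a list
               of non-negative weights over context positions (assumed to be
               already averaged over heads/layers; that choice is hypothetical).
    goal_positions: indices of the goal-defining (e.g. system-prompt) tokens.
    """
    goal = set(goal_positions)
    fractions = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip degenerate rows rather than divide by zero
        on_goal = sum(w for i, w in enumerate(row) if i in goal)
        fractions.append(on_goal / total)
    return sum(fractions) / len(fractions) if fractions else 0.0


if __name__ == "__main__":
    # Two generated tokens attending over a 3-token context; tokens 0 and 1
    # are treated as the goal span. Values are made up for illustration.
    rows = [[0.5, 0.25, 0.25], [0.1, 0.1, 0.8]]
    print(goal_accessibility_ratio(rows, {0, 1}))  # approximately 0.475
```

Under this reading, a ratio near zero would correspond to the "closed" attention channel the review describes.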

The paper introduces three complementary instruments: GAR as a diagnostic, sliding-window (SW) ablations as causal interventions, and linear residual-stream probes as measurements of the second channel. Together, these yield parametric predictions of when multi-turn failures occur.

Methodological Rigor

The experimental design is notably careful and well-structured:

Strengths in design: The four-task suite (information retention, controlled complexity, persona compliance, policy compliance) isolates different types of goal conditioning (lexical facts, stylistic rules, conditional policies). The sliding-window intervention is elegant — it structurally forces attention closure at a geometrically predictable turn, enabling clean pre/post-crossover comparisons within single inference runs. The dose-response relationship between window size and crossover turn (R² > 0.999) is compelling.
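The "geometrically predictable" crossover can be illustrated with a toy model of a causal sliding window. The sketch below assumes the goal tokens sit at the start of the context and that every turn adds a fixed number of tokens; both are simplifications, and `visible_goal_fraction` and `crossover_turn` are hypothetical helper names rather than the paper's parametrization.

```python
def visible_goal_fraction(query_pos, window, goal_len):
    """Fraction of goal tokens (positions 0..goal_len-1) that fall inside a
    causal sliding window of `window` keys ending at `query_pos`."""
    first_visible = max(0, query_pos - window + 1)
    visible = max(0, min(goal_len, query_pos + 1) - first_visible)
    return visible / goal_len


def crossover_turn(window, goal_len, tokens_per_turn, max_turns=50):
    """First 1-indexed turn whose opening token sees no goal tokens at all,
    or None if that never happens within `max_turns`."""
    for turn in range(1, max_turns + 1):
        # Position of the first token of this turn (assumed fixed-size turns).
        start = goal_len + (turn - 1) * tokens_per_turn
        if visible_goal_fraction(start, window, goal_len) == 0.0:
            return turn
    return None


if __name__ == "__main__":
    # Illustrative numbers only: 100 goal tokens, 100 tokens per turn.
    for w in (1024, 2048, 4096, 8192):
        print(w, crossover_turn(w, goal_len=100, tokens_per_turn=100))
```

In this toy setting the crossover turn grows linearly with window size, which is the kind of dose-response relationship the review refers to; with the largest window no closure occurs within 50 turns.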

Statistical rigor: Mann-Kendall tests across 10 architectures for monotonic GAR decline, permutation nulls for probe significance, bootstrap confidence intervals, and Wilson intervals for proportions are all appropriate. The 5,483 episodes provide reasonable statistical power.
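For readers unfamiliar with two of the tests named above, the following sketch shows the standard Wilson score interval for a proportion and the Mann-Kendall S statistic with its normal approximation. It is a textbook version under stated assumptions (no tie correction in the variance, a fixed z of 1.96), not a reconstruction of the authors' analysis code, and the example inputs are invented.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mann_kendall(series):
    """Return (S, z) for a monotonic-trend test.

    S sums sign(x_j - x_i) over all pairs i < j; z uses the normal
    approximation with continuity correction and NO tie correction.
    """
    n = len(series)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            diff = series[j] - series[i]
            s += (diff > 0) - (diff < 0)
    var = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var)
    elif s < 0:
        z = (s + 1) / math.sqrt(var)
    else:
        z = 0.0
    return s, z


if __name__ == "__main__":
    print(wilson_interval(5, 10))                       # hypothetical counts
    print(mann_kendall([0.40, 0.31, 0.22, 0.15, 0.09]))  # made-up declining GAR
```

A strongly negative z on a per-turn GAR series is what a monotonic decline would look like under this test.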

Important caveats the authors acknowledge: The causal interpretation is strongest for Mistral (natively trained with SW attention); for LLaMA and Qwen, the SW intervention is out-of-distribution, weakening causal claims. The activation patching experiment (Appendix D.5) is a notable negative result — residual encodings predict outcomes with AUC up to 0.99 but are not causally sufficient when patched, meaning the probes identify *where* information persists but not *how* it's used. This is a significant gap between encoding and causal mechanism.
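The encoding side of this gap, a probe whose scores separate successful from failed episodes, is typically quantified with AUC and checked against a label-permutation null. The sketch below shows only that evaluation step, assuming probe scores have already been produced by some linear probe; the probe training is omitted, the function names are illustrative, and the data are invented. Nothing here addresses causal use, which is what the patching result speaks to.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def permutation_p_value(scores, labels, n_perm=1000, seed=0):
    """Observed AUC and the share of label shuffles that match or beat it."""
    rng = random.Random(seed)
    observed = auc(scores, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(scores, shuffled) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an exact zero.
    return observed, (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical probe scores for six episodes (1 = recall succeeded).
    scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
    labels = [1, 1, 1, 0, 0, 0]
    print(permutation_p_value(scores, labels))
```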

Concerns: The 50-turn conversation limit is relatively short, and under default attention, GAR never reaches the closed-channel floor — the SW intervention is needed to study post-closure behavior within practical lengths. This means the framework's relevance to natural multi-turn degradation (without forced closure) remains somewhat indirect. The linear probing methodology, while standard, has known limitations regarding conflating encoding with use, which the authors partially address but cannot fully resolve.

Potential Impact

Diagnostic utility: GAR is immediately useful as a monitoring tool for deployed multi-turn systems. It provides a per-architecture, per-turn measurement of whether the model can still "see" its instructions, enabling runtime detection of impending failures.
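One way such runtime monitoring might look is a simple rule that raises an alert once per-turn GAR stays below a floor for several consecutive turns. This is a hypothetical sketch: the floor of 0.05, the patience of 2 turns, and the function name are assumptions chosen for illustration, not values or procedures reported in the paper.

```python
def first_alert_turn(gar_by_turn, floor=0.05, patience=2):
    """Return the first 1-indexed turn at which GAR has been below `floor`
    for `patience` consecutive turns, or None if that never happens."""
    run = 0
    for turn, gar in enumerate(gar_by_turn, start=1):
        run = run + 1 if gar < floor else 0  # reset on any recovery
        if run >= patience:
            return turn
    return None


if __name__ == "__main__":
    # Made-up per-turn GAR values with a brief recovery at turn 4.
    print(first_alert_turn([0.40, 0.20, 0.04, 0.06, 0.03, 0.02]))
```

Requiring consecutive low turns is one design choice for avoiding alerts on a single noisy measurement; a deployed system would need to calibrate the floor per architecture.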

Architectural insights: The finding that residual encoding depth varies dramatically (layer 2 to layer 27) across architectures, and that this correlates with post-closure behavioral survival, has implications for architecture design. The non-monotonic scaling with parameter count within the Qwen family challenges naive assumptions about bigger-is-better for multi-turn reliability.

Safety implications: The finding that forced channel closure raises persona violations *above adversarial-pressure baselines* without any user pressure is concerning for deployed systems and suggests a vector for exploitation in very long conversations.

Limitations on impact: The framework currently applies to structured, fixed-goal conversations. Real multi-turn degradation involves goal evolution, implicit instructions, and more complex dynamics. The paper doesn't directly bridge to natural degradation scenarios.

Timeliness & Relevance

This work addresses a timely and practical bottleneck. As LLMs are deployed as persistent conversational agents, customer service bots, and agentic systems, multi-turn reliability becomes critical. Multiple recent papers (Laban et al. 2025, He et al. 2024, Jia et al. 2025) have documented the behavioral phenomenon; this paper is among the first to provide a mechanistic account. The combination of behavioral measurement, internal mechanistic analysis, and causal intervention fills a clear gap.

Strengths & Limitations

Key strengths:

  • Clean experimental isolation of the attention channel via SW intervention with parametric predictability
  • Cross-architectural comparison revealing qualitatively distinct failure modes rather than uniform degradation
  • The negative activation-patching result is honestly reported and informative — it prevents overclaiming about causal mechanisms
  • Comprehensive appendices with full methodological detail supporting reproducibility
  • Policy compliance evaluation carefully validated against human ground-truth

Notable weaknesses:

  • The gap between encoding (high AUC probes) and causal use (null patching result) is the paper's central unresolved tension. The framework describes *where* information persists but not *how* the model does or doesn't use it
  • The SW intervention for non-natively-trained models conflates channel closure with distributional shift
  • Only 50-turn conversations — natural degradation at this length may be minimal under default attention
  • Four structured tasks with binary/categorical outcomes; generalization to open-ended conversation is undemonstrated
  • The "channel-transition" framing may overstate mechanistic understanding given the patching null; what's established is correlation between attention decay and behavioral failure, with residual encoding as a descriptive observation

Overall assessment: This is a well-executed mechanistic interpretability study that introduces useful tools (GAR, the two-channel framework) and reveals meaningful architectural differences in multi-turn robustness. The work is careful about its claims, though the title and framing somewhat overstate the mechanistic completeness of the account given the activation-patching null. The contribution is primarily diagnostic and descriptive rather than providing a complete causal mechanism, but the diagnostics themselves are valuable.

    Rating: 6.8/10
    Significance 7 · Rigor 7.5 · Novelty 7 · Clarity 7.5

    Generated May 14, 2026

    Comparison History (21)

    vs. TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction
    gemini-3.1 · 5/19/2026

    Paper 2 addresses one of the most critical challenges in LLMs (hallucinations) with a training-free, universal, inference-time algorithm. Its demonstrated effectiveness across 15 models and 8 families without requiring external data or fine-tuning offers immense practical utility and immediate real-world applicability, likely leading to broader adoption and higher scientific impact compared to the mechanistic diagnostic focus of Paper 1.

    vs. ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
    claude-opus-4.6 · 5/19/2026

    Paper 2 offers a novel mechanistic explanation for a widely observed failure mode in LLMs (multi-turn instruction degradation), introduces a new diagnostic metric (GAR), and provides causal evidence through ablation studies. Its breadth of impact is higher—it applies across LLM architectures and has immediate implications for AI safety, alignment, and system design. Paper 1 addresses a valuable but narrower clinical niche (ECG simulation under interventions). While rigorous, its impact is more domain-specific. Paper 2's timeliness in the era of widespread LLM deployment and its foundational mechanistic insights give it broader and more transformative potential.

    vs. Internalizing Safety Understanding in Large Reasoning Models via Verification
    gemini-3.1 · 5/16/2026

    Paper 1 provides a foundational, mechanistic explanation for a ubiquitous behavioral failure in LLMs (losing context over multi-turn interactions). By introducing a novel metric (GAR) and demonstrating causal mechanisms across different architectures, it offers deep scientific insights that could fundamentally influence future architectural designs and context-handling techniques. While Paper 2 addresses a critical applied issue in AI safety, Paper 1's methodological rigor and contribution to mechanistic interpretability promise broader, more fundamental scientific impact.

    vs. Agentifying Patient Dynamics within LLMs through Interacting with Clinical World Model
    claude-opus-4.6 · 5/16/2026

    Paper 2 provides a novel mechanistic explanation for a fundamental and widely-observed failure mode of LLMs (losing context in multi-turn interactions). It introduces a new diagnostic metric (GAR), a general theoretical framework (channel-transition account), and demonstrates findings across multiple architectures. This has broader impact across the entire LLM research community and practical applications in any multi-turn system. Paper 1, while rigorous and clinically relevant, addresses a narrower domain (sepsis management) with an incremental combination of existing techniques (world models + LLM agents + RL), limiting its breadth of impact.

    vs. Understanding Annotator Safety Policy with Interpretability
    gpt-5.2 · 5/16/2026

    Paper 2 has higher impact potential due to a more mechanistic, broadly applicable account of a central LLM limitation (instruction/goal drift in multi-turn dialogue). It introduces a general diagnostic metric (GAR), combines causal ablations and probing across multiple architectures, and yields predictive claims about failure timing—strong methodological rigor and transfer across models. Real-world relevance is immediate for deploying agents/chatbots and for model design (memory, attention, alignment). Paper 1 is novel and useful for safety policy/annotation workflows, but its impact is narrower and more domain-specific.

    vs. OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
    gpt-5.2 · 5/16/2026

    Paper 1 offers a more novel mechanistic framework for multi-turn instruction degradation in LLMs, introducing a concrete diagnostic (GAR) plus causal interventions and cross-architecture comparisons that can generalize broadly to interpretability, alignment, and long-context reliability. Its methodology is relatively rigorous (behavioral measures, attention/residual analyses, ablations, probing with strong predictive performance) and timely given widespread deployment of chat models. Paper 2 has clear real-world utility in traffic signal control, but its core contributions (reward shaping/regularization for RL fine-tuning) are more incremental and domain-specific, with narrower cross-field impact.

    vs. Distribution-Aware Algorithm Design with LLM Agents
    gpt-5.2 · 5/16/2026

    Paper 1 is more likely to have higher scientific impact: it introduces a broadly applicable learning-theoretic framework for distribution-aware synthesis of executable solvers, with provable generalization of both correctness and runtime—an important and under-addressed objective. It also demonstrates large empirical gains across multiple combinatorial optimization classes and competitive performance on real benchmark instances (PACE), indicating near-term practical adoption in operations research, systems, and automated algorithm design. Paper 2 offers strong mechanistic interpretability tools (GAR) and insights, but its immediate real-world leverage is narrower and more diagnostic than transformative.

    vs. Reasoning Graphs: Deterministic Agent Accuracy through Evidence-Centric Chain-of-Thought Feedback
    claude-opus-4.6 · 5/16/2026

    Paper 1 provides a novel mechanistic explanation for a widely observed but poorly understood phenomenon—multi-turn degradation in LLMs. It introduces a new diagnostic metric (GAR), a causal framework (channel-transition), and demonstrates cross-architecture generalizability with rigorous ablation studies and probing experiments. The mechanistic insights have broad implications for LLM architecture design, safety, and alignment. Paper 2 introduces a useful engineering contribution (reasoning graphs) for agent accuracy, but is more narrowly scoped as a prompting/context-engineering technique without retraining, and lacks empirical results on benchmarks in the abstract.

    vs. Self-Programmed Execution for Language-Model Agents
    gemini-3.1 · 5/16/2026

    Paper 1 offers higher scientific impact by addressing a fundamental, ubiquitous limitation of LLMs (multi-turn context degradation) through rigorous mechanistic interpretability. By introducing the Goal Accessibility Ratio and uncovering how attention channels transition to residual streams, it provides actionable, causal insights into architectural flaws. This will directly influence foundational model design and evaluation. While Paper 2 presents a highly innovative agent architecture, it relies on a specific Lisp-based paradigm. Paper 1's generalizable diagnostics and mechanistic explanations of a critical failure mode ensure broader and more immediate relevance across the AI field.

    vs. Structured Role-Aware Policy Optimization for Multimodal Reasoning
    gpt-5.2 · 5/16/2026

    Paper 2 has higher likely impact due to broader relevance and mechanistic depth: it introduces a general framework (channel-transition) and a quantitative diagnostic (GAR) for multi-turn failure in LLMs, validated across architectures with causal interventions and predictive signals. This can inform model design, evaluation, and safety/reliability across many applications, beyond any single training method. Paper 1 is novel and useful for multimodal RL optimization, but its scope is narrower (LVLM RLVR/GRPO) and depends on structured token role decomposition, limiting cross-field breadth relative to Paper 2.

    vs. StaRPO: Stability-Augmented Reinforcement Policy Optimization
    gpt-5.2 · 5/16/2026

    Paper 1 offers a more mechanistic, diagnostic contribution: it proposes a concrete channel-transition account for multi-turn instruction failure, introduces GAR, combines attention ablations with residual-stream probing, and provides causal interventions plus strong cross-architecture results and predictive timing. This advances interpretability, evaluation, and reliability across many LLM use cases and architectures, making its impact broad and timely. Paper 2 is practically useful for RL fine-tuning, but its stability metrics (ACF, PE) are relatively heuristic and likely narrower in scope and novelty than Paper 1’s mechanistic framework and diagnostics.

    vs. SPARD: Self-Paced Curriculum for RL Alignment via Integrating Reward Dynamics and Data Utility
    gemini-3.1 · 5/16/2026

    Paper 2 offers fundamental mechanistic insights into a widespread limitation of LLMs (context loss in multi-turn interactions) rather than just an optimization technique. By introducing novel metrics (GAR), causal ablations, and cross-architecture analysis, it provides a foundational understanding that can drive future architectural designs and broad downstream research. Paper 1 presents a solid but arguably more incremental training optimization (curriculum learning for RLHF), making Paper 2 more likely to have a profound and lasting scientific impact across the field of AI.

    vs. GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
    claude-opus-4.6 · 5/14/2026

    Paper 2 provides a novel mechanistic explanation for a widely observed but poorly understood phenomenon—instruction degradation in multi-turn LLM interactions. It introduces a new diagnostic metric (GAR), a causal framework (channel-transition account), and demonstrates cross-architecture generalizability with causal ablation experiments. This mechanistic understanding has broader implications for LLM architecture design, safety, and alignment. While Paper 1 offers a solid engineering contribution with its memory framework, Paper 2 advances fundamental understanding of transformer behavior, which is likely to inspire more follow-on research and have deeper scientific impact.

    vs. RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
    claude-opus-4.6 · 5/14/2026

    Paper 1 offers a novel mechanistic framework (GAR, channel-transition account) explaining a fundamental and widely-observed failure mode of LLMs—multi-turn degradation. It provides causal ablation evidence, cross-architecture analysis, and predictive diagnostics applicable broadly to LLM research. Its contributions are foundational and could influence LLM design, safety, and alignment across many domains. Paper 2 addresses a more specialized problem (RS tool selection) with a useful but incremental engineering contribution (hierarchical skill trees) within a narrower application domain, limiting its broader scientific impact.

    vs. AgentGate: A Lightweight Structured Routing Engine for the Internet of Agents
    claude-opus-4.6 · 5/14/2026

    Paper 2 offers a novel mechanistic explanation for a widely observed but poorly understood phenomenon in LLMs—instruction degradation over multi-turn interactions. It introduces a new diagnostic metric (GAR), a theoretical framework (channel-transition account), and provides causal evidence across multiple architectures. This has broader impact: it informs LLM architecture design, safety/alignment research, and deployment practices. Paper 1 addresses a practical but more incremental engineering problem (agent routing) with a narrower scope, while Paper 2 advances fundamental understanding of transformer behavior with cross-architecture generality.

    vs. Heuristic Classification of Thoughts Prompting (HCoT): Integrating Expert System Heuristics for Structured Reasoning into Large Language Models
    claude-opus-4.6 · 5/14/2026

    Paper 2 offers a novel mechanistic explanation for a widely-observed but poorly understood phenomenon (multi-turn degradation in LLMs), introduces a new diagnostic metric (GAR), and provides causal evidence through ablation studies across multiple architectures. Its mechanistic insights into attention dynamics have broad implications for LLM architecture design, safety, and alignment. Paper 1 proposes a prompting strategy (HCoT) that, while practical, is more incremental—building on existing prompting paradigms (CoT, ToT)—and addresses a narrower problem with less fundamental insight into LLM behavior.

    vs. Transferable Expertise for Autonomous Agents via Real-World Case-Based Learning
    gemini-3.1 · 5/14/2026

    Paper 2 provides deep mechanistic insights into a fundamental and widespread limitation of LLMs (context loss in multi-turn interactions). By introducing novel diagnostic metrics and causal ablations, it informs foundational architectural improvements. This level of mechanistic interpretability generally yields higher scientific impact than the application-specific prompting framework presented in Paper 1.

    vs. Dynamic Summary Generation for Interpretable Multimodal Depression Detection
    claude-opus-4.6 · 5/14/2026

    Paper 2 offers a novel mechanistic explanation for a fundamental LLM limitation (multi-turn instruction degradation), introduces a new diagnostic metric (GAR), and provides causal evidence across multiple architectures. Its contributions—mechanistic framework, diagnostic tools, and predictive model of failure timing—have broad implications for LLM architecture design, safety, and alignment. Paper 1, while practically useful for depression detection, represents a more incremental application of existing LLM capabilities to a specific clinical domain with less potential for cross-field impact.

    vs. Disentangle-then-Refine: LLM-Guided Decoupling and Structure-Aware Refinement for Graph Contrastive Learning
    gemini-3.1 · 5/14/2026

    Paper 1 addresses a fundamental, widely-recognized limitation of LLMs (context loss in multi-turn interactions) by providing a novel mechanistic explanation and diagnostic tool (GAR). Its insights into attention channels and residual representations have broad implications for model architecture and interpretability across the AI field. Paper 2 presents a solid but more specialized application of LLMs to Graph Contrastive Learning, which, while methodologically rigorous, has a narrower potential impact compared to the foundational architectural insights offered by Paper 1.

    vs. IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation
    gpt-5.2 · 5/14/2026

    Paper 2 has higher likely scientific impact due to a broadly relevant, timely problem (multi-turn instruction drift) and a mechanistic, testable framework (channel-transition account) with a clear quantitative metric (GAR), causal interventions (forced attention closure), and cross-architecture validation. Its methods (attention analysis, ablations, residual probes) are rigorous and likely reusable as diagnostics across labs and model families, affecting alignment, interpretability, and agent design. Paper 1 is applied and useful, but is more domain/product-oriented and may have narrower academic uptake and harder-to-validate novelty claims.