Where does Absolute Position come from in decoder-only Transformers?

Valeria Ruscio, Umberto Nanni, Fabrizio Silvestri

#834 of 3404 · Artificial Intelligence

Tournament Score: 1456±46
Win Rate: 65% (13 wins, 7 losses, 20 matches)

Rating: 6.5/10

Abstract

RoPE-trained transformers distinguish absolute position in their attention patterns, even though RoPE encodes only relative offsets in the inner product. We trace this leakage to two architectural components. The causal mask is responsible for the first: its per-query softmax denominator depends on the absolute query position by construction. The residual stream supplies the second. Under causal attention the activation at position 0 attends only to itself and runs as a closed dynamical system from the embedding of the token at that position; downstream attention reads this trajectory through sink-reading heads. Both components appear in all three architectures we study, in architecturally specific balance: NTK scaling suppresses the residual-stream component, sliding-window attention allows it to accumulate with depth, and standard RoPE sits between. Replacing the BOS embedding before the forward pass removes 40% of the residual-stream component at early queries. Attention sinks are token-anchored stabilizers that pass forward a deterministic fingerprint of the token at position 0, constant across inputs when that token is the auto-prepended BOS and varying with it otherwise.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper addresses a subtle but important puzzle in transformer architecture theory: RoPE encodes only relative positional offsets in the attention inner product, yet RoPE-trained models demonstrably distinguish absolute positions in their attention patterns. The paper identifies two concrete architectural sources of this "leakage":

1. The causal mask: The softmax denominator Z_i sums over i+1 terms for query position i, creating an inherent absolute-position dependence regardless of what the logits encode. This is an elegant mathematical observation that, once stated, is almost obvious, but it had not been clearly articulated before.

2. The residual stream: Position 0 under causal attention attends only to itself, creating a closed dynamical system from the BOS embedding. This deterministic trajectory propagates through sink-reading heads, providing a fixed reference signal that downstream attention can read.

The paper cleanly separates these two components using bidirectional ablation (which removes the causal-mask contribution) and demonstrates their relative balance across three architectures (Llama, Qwen, Mistral).
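The causal-mask component can be illustrated with a toy calculation. The sketch below is not the paper's code: it builds a logit matrix that depends only on the offset i - j (the scoring function and the size `n` are arbitrary assumptions) and applies a causal softmax, whose denominator Z_i = sum over j <= i of exp(logit_ij) has i+1 terms. Even though the logits are purely relative, the weight a query places on a fixed offset changes with the query's absolute position.

```python
import math

def causal_softmax_row(row_logits, i):
    """Softmax for query i over the keys it may see under a causal mask (0..i)."""
    visible = row_logits[: i + 1]
    peak = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in visible]
    z_i = sum(exps)  # the denominator Z_i: exactly i + 1 terms
    return [e / z_i for e in exps]

def relative_only_logits(n, score):
    """Logit matrix whose (i, j) entry depends only on the offset i - j."""
    return [[score(i - j) for j in range(n)] for i in range(n)]

# Hypothetical toy setup: 6 positions, logits decay linearly with offset.
n = 6
logits = relative_only_logits(n, lambda offset: -0.5 * abs(offset))

# Attention that each query i >= 1 places on the previous token (offset 1).
prev_token_weight = [causal_softmax_row(logits[i], i)[i - 1] for i in range(1, n)]

for i, w in enumerate(prev_token_weight, start=1):
    print(f"query {i}: weight on offset 1 = {w:.4f}")
```

In this toy case the weight on offset 1 shrinks as the query index grows, because later queries normalize over more keys. The logits never see the absolute index; only the mask does.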

Methodological Rigor

The methodology is well-designed and appropriately cautious. Several strengths stand out:

  • The saturated baseline addresses a genuine statistical concern: naive linear baselines underfit the relative-offset contribution (which involves cosines at multiple frequencies), potentially inflating absolute-position attribution. The one-free-parameter-per-offset approach is a principled solution.
  • Cross-architecture validation across Llama, Qwen, and Mistral with different RoPE variants (standard, NTK-scaled, sliding-window) strengthens the generality claims.
  • The decomposition's internal consistency is checked through identity-RoPE and scrambled-RoPE producing nearly identical ratios (Table 6), supporting the additivity assumption in Equation 5.
  • Statistical rigor: permutation nulls with FDR correction, bootstrap confidence intervals, and stratified sampling.

However, the additivity assumption in Equation 5 deserves more scrutiny than the authors provide. The bidirectional ablation changes the forward pass (even with fixed weights), so the "residual-stream content" being measured is not identical to what exists under causal attention. The authors acknowledge this in a footnote but treat it as approximate without quantifying the approximation error.
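To make the saturated-baseline point from the list above concrete, here is a minimal sketch under stated assumptions. It reads "one free parameter per offset" as fitting a separate mean for every offset i - j, and it uses the leftover variance fraction as an illustrative stand-in for absolute-position attribution; the paper's actual estimator may differ. The toy logits, the cosine frequencies, and all function names are hypothetical.

```python
import math
from collections import defaultdict

def causal_pairs(logits):
    """(offset, value) for every visible entry (query i, key j <= i)."""
    n = len(logits)
    return [(i - j, logits[i][j]) for i in range(n) for j in range(i + 1)]

def total_variance(pairs):
    values = [v for _, v in pairs]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def saturated_residual_fraction(logits):
    """Variance fraction left after fitting one free mean per offset."""
    pairs = causal_pairs(logits)
    groups = defaultdict(list)
    for d, v in pairs:
        groups[d].append(v)
    fitted = {d: sum(vs) / len(vs) for d, vs in groups.items()}
    resid = sum((v - fitted[d]) ** 2 for d, v in pairs)
    return resid / total_variance(pairs)

def linear_residual_fraction(logits):
    """Variance fraction left after a least-squares line in the offset."""
    pairs = causal_pairs(logits)
    m = len(pairs)
    mean_d = sum(d for d, _ in pairs) / m
    mean_v = sum(v for _, v in pairs) / m
    slope = (sum((d - mean_d) * (v - mean_v) for d, v in pairs)
             / sum((d - mean_d) ** 2 for d, _ in pairs))
    resid = sum((v - mean_v - slope * (d - mean_d)) ** 2 for d, v in pairs)
    return resid / total_variance(pairs)

def toy_logits(n, absolute_strength):
    """Multi-frequency relative term plus an optional absolute-position term."""
    def rel(d):
        return math.cos(0.3 * d) + 0.5 * math.cos(1.1 * d)
    return [[rel(i - j) + absolute_strength * i / n for j in range(n)]
            for i in range(n)]

relative_only = toy_logits(32, 0.0)
with_absolute = toy_logits(32, 0.5)

lin_rel = linear_residual_fraction(relative_only)     # linear baseline underfits
sat_rel = saturated_residual_fraction(relative_only)  # saturated baseline fits fully
sat_abs = saturated_residual_fraction(with_absolute)  # leftover reflects query index
print(lin_rel, sat_rel, sat_abs)
```

On purely relative logits the linear baseline leaves a sizeable residual that could be misread as absolute-position signal, while the per-offset fit leaves essentially none. Once a term depending on the query index is added, the saturated fit can no longer absorb it.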

    The position-0 trajectory verification is compelling: cross-prompt standard deviation of exactly 0.0000 across all layers tested confirms the closed dynamical system prediction. This is a clean theoretical prediction with exact empirical confirmation.
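The closed-system argument can be checked on a toy model. The sketch below is an illustration, not the authors' setup: it stacks a few attention-only causal layers with residual connections, using random weights and a random embedding table (all sizes and names are assumptions). RoPE is omitted because the rotation at position 0 is the identity, and MLPs and norms are omitted because they act per position and so would not break the closure.

```python
import math
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def causal_layer(xs, Wq, Wk, Wv):
    """One toy causal self-attention layer with a residual connection."""
    qs = [matvec(Wq, x) for x in xs]
    ks = [matvec(Wk, x) for x in xs]
    vs = [matvec(Wv, x) for x in xs]
    out = []
    for i, q in enumerate(qs):
        # Query i only scores keys 0..i; position 0 therefore sees only itself.
        scores = [sum(a * b for a, b in zip(q, ks[j])) for j in range(i + 1)]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        z = sum(exps)
        mixed = [sum(exps[j] / z * vs[j][d] for j in range(i + 1))
                 for d in range(len(q))]
        out.append([x_d + m_d for x_d, m_d in zip(xs[i], mixed)])
    return out

def position0_trajectory(tokens, embed, layers):
    """Position-0 activation after the embedding and after each layer."""
    xs = [embed[t] for t in tokens]
    traj = [xs[0]]
    for Wq, Wk, Wv in layers:
        xs = causal_layer(xs, Wq, Wk, Wv)
        traj.append(xs[0])
    return traj, xs

rng = random.Random(0)
DIM, VOCAB, DEPTH = 4, 10, 3  # hypothetical toy sizes

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

embed = rand_matrix(VOCAB, DIM)
layers = [(rand_matrix(DIM, DIM), rand_matrix(DIM, DIM), rand_matrix(DIM, DIM))
          for _ in range(DEPTH)]

# Same first token, different continuations.
traj_a, final_a = position0_trajectory([0, 3, 5, 2], embed, layers)
traj_b, final_b = position0_trajectory([0, 7, 1, 4], embed, layers)
# Different first token, same continuation as the first prompt.
traj_c, _ = position0_trajectory([9, 3, 5, 2], embed, layers)

print("same first token, same trajectory:", traj_a == traj_b)
print("different first token, same trajectory:", traj_a == traj_c)
```

Prompts that share the first token give bit-identical position-0 activations at every layer, which is the toy analogue of the zero cross-prompt standard deviation. Changing the first token changes the trajectory, matching the review's distinction between BOS-prepending models and models whose position-0 token varies.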

    Potential Impact

    Mechanistic interpretability: The paper provides a concrete, testable account of how absolute position information enters attention computations despite RoPE's relative-only design. This has implications for understanding length generalization failures—if models rely on absolute position cues that RoPE was designed to eliminate, interventions targeting only RoPE parameters will be insufficient.

    Attention sink understanding: The paper reframes attention sinks from potential "content aggregators" to "token-anchored stabilizers," providing evidence that sinks carry no decodable content but instead propagate a deterministic fingerprint of the position-0 token. This resolves an ambiguity in the literature and unifies seemingly different behaviors across architectures (constant output in BOS-prepending models vs. varying output in Qwen).

    Architecture design: The finding that NTK scaling suppresses the residual-stream component while sliding-window attention allows it to accumulate with depth provides actionable insight for architecture designers concerned with length generalization.

    Practical applications: Understanding the sources of absolute-position leakage could inform better strategies for context-window extension, improved position interpolation methods, and more principled approaches to length generalization.

    Timeliness & Relevance

    This work is highly timely. RoPE has become the dominant positional encoding in open-source LLMs, and context-window extension is an active area of engineering and research. Understanding why RoPE's relative-only design property fails to prevent absolute-position dependence is directly relevant to current bottlenecks in deploying LLMs at varying sequence lengths. The attention sink phenomenon has received growing attention since StreamingLLM (Xiao et al., 2024), and this paper provides the most mechanistically precise account to date.

    Strengths

    1. Clean theoretical framework: The causal-mask component follows directly from Equation 6—a simple, elegant observation. The position-0 dynamical system is equally clean.

    2. Quantitative decomposition: The bidirectional ablation provides a concrete, reproducible way to separate the two components.

    3. Cross-architecture consistency: All findings replicate across three distinct architectures with interpretable architectural differences.

    4. Unifying account: The paper explains why sink-reading heads in Llama/Mistral produce constant output while Qwen's vary—same mechanism, different inputs at position 0.

    5. Scale analysis: The composition shift from causal-mask-dominated (82% at 1B) to roughly balanced (64/36 at 3B and 8B) adds depth.

    Limitations

    1. No downstream task impact: The paper measures leakage in attention logits but does not connect it to generation quality, length generalization failure, or any behavioral consequence. This significantly limits practical impact.

    2. Small corpus: 64 Wikipedia chunks at 256 tokens is modest. The content-independence claim (natural text vs. random tokens) partially mitigates this.

    3. 60% unexplained: The position-0 embedding intervention accounts for only 40% of the residual-stream component. The remaining sources (layer-norm interactions, accumulated rotated writes) are acknowledged but uncharacterized.

    4. No training-time analysis: All interventions are inference-time. Whether the two-component decomposition holds during training or in bidirectionally-trained models is unknown.

    5. Limited model scale: The largest model tested is 8B parameters. Whether the saturation of the composition shift holds at larger scales is untested.

    6. The "replace all" intervention increases leakage (Table 9, 128% of baseline), which complicates the narrative and suggests the decomposition may have non-trivial interaction effects.

    Overall Assessment

    This is a solid mechanistic analysis paper that provides genuine insight into an underexplored property of modern transformer architectures. The main contributions—identifying the causal-mask softmax denominator and the position-0 dynamical system as sources of absolute-position leakage—are clean, well-supported, and novel. The cross-architecture analysis adds meaningful breadth. The primary weakness is the absence of any connection to downstream task performance, which limits the practical significance of the findings. The paper advances understanding of transformer mechanics without yet demonstrating why that understanding matters for model behavior.

    Rating: 6.5/10
    Significance 6.5 · Rigor 7.5 · Novelty 7 · Clarity 7.5

    Generated Jun 5, 2026

    Comparison History (20)

    vs. When Should Memory Stay Silent: Measuring Memory-Use Boundaries in Memory-Augmented Conversational Agents
    claude-opus-4.6 · 6/6/2026

    Paper 1 offers a deeper mechanistic understanding of a fundamental architectural phenomenon in transformers—how absolute position information emerges in RoPE-based models despite only relative encoding. This addresses a core theoretical question relevant to the entire transformer research community, with implications for positional encoding design, attention sink understanding, and context length generalization. Paper 2 addresses an important but narrower applied problem (memory safety in conversational agents) with an empirical benchmark study. While timely, its contributions are more incremental compared to the mechanistic insights and broader architectural implications of Paper 1.

    vs. MAVEN-T: Reinforced Heterogeneous Distillation for Real-Time Multi-Agent Trajectory Prediction
    claude-opus-4.6 · 6/6/2026

    Paper 2 provides fundamental mechanistic insight into how RoPE-based transformers encode absolute position despite only explicitly encoding relative offsets. This addresses a core theoretical question about the dominant architecture (decoder-only transformers with RoPE) used across modern LLMs. The finding that causal masks and residual streams leak absolute position, along with the explanation of attention sinks, has broad implications for architecture design, positional encoding research, and context-length generalization. Paper 1, while technically solid, is a more incremental engineering contribution combining known techniques (distillation, PPO, MoE) for trajectory prediction efficiency.

    vs. Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo
    gemini-3.1 · 6/5/2026

    Paper 1 tackles the critical issue of reliability and safety in generative AI. By introducing a comprehensive framework for knowledge infusion and demonstrating a ~71% reduction in knowledge-violating outputs, it offers immediate, highly practical applications across domains requiring factual precision. While Paper 2 provides excellent mechanistic insights into Transformer architectures, Paper 1 directly addresses urgent real-world alignment and safety challenges, giving it broader multidisciplinary impact and higher potential for immediate adoption in applied AI systems.

    vs. Retrospective Harness Optimization: Improving LLM Agents via Self-Preference over Trajectory Rollouts
    gpt-5.2 · 6/5/2026

    Paper 1 likely has higher impact due to strong practical relevance and immediate applicability: it proposes a self-supervised optimization loop for deployed LLM agents without labeled validation data and reports a large improvement on a major benchmark (SWE-Bench Pro 59%→78%). This method could generalize across many agentic systems and organizations, affecting tooling, workflows, and continual improvement practices. Paper 2 offers valuable mechanistic insight into positional information in RoPE transformers, but its contributions are more explanatory/diagnostic and may translate into fewer near-term system-level gains than a broadly deployable optimization procedure.

    vs. Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study
    claude-opus-4.6 · 6/5/2026

    Paper 1 provides fundamental mechanistic insights into how absolute positional information emerges in RoPE-based decoder-only transformers, a widely-used architecture (e.g., LLaMA, GPT variants). Understanding attention sinks and positional leakage has broad implications for transformer design, context extension, and interpretability research. Paper 2 addresses a narrower applied problem—traffic sign defect detection via image difference classification—with limited generalizability beyond infrastructure inspection. Paper 1's findings are more likely to influence a larger research community and inspire follow-up work in mechanistic interpretability and architecture design.

    vs. Closing the Loop on Latent Reasoning via Test-Time Reconstruction
    gpt-5.2 · 6/5/2026

    Paper 1 likely has higher impact due to a more novel, actionable method (test-time self-supervised reconstruction to anchor latent reasoning) with clear, broad real-world applications across reasoning/QA/code and strong reported gains on widely used benchmarks/models. It introduces a general mechanism for improving reliability of latent reasoning, relevant to current trends in efficient inference and interpretability. Paper 2 provides valuable mechanistic insight into positional information leakage in RoPE models, but is more explanatory/diagnostic and may translate less directly into immediate performance or application gains, narrowing near-term impact.

    vs. Knowledge Activation: AI Skills as the Institutional Knowledge Primitive for Agentic Software Development
    claude-opus-4.6 · 6/5/2026

    Paper 2 offers fundamental theoretical insight into how absolute positional information emerges in RoPE-based transformers despite only relative offsets being explicitly encoded. This mechanistic understanding of attention sinks and positional leakage has broad implications for transformer architecture design, long-context modeling, and interpretability research. Paper 1 presents an applied enterprise framework (Knowledge Activation/AKUs) with a useful but narrow deployment study at Yahoo. While practically relevant, it is more incremental in nature—combining existing concepts (knowledge graphs, AI skills, RAG) into an enterprise workflow—and its impact is limited to software engineering practices rather than advancing foundational understanding of AI systems.

    vs. Evaluating Agentic Configuration Repair for Computer Networks
    gemini-3.1 · 6/5/2026

    Paper 2 provides fundamental insights into Transformer mechanics, explaining how absolute positional information emerges in models using only relative encodings. This deep understanding of attention sinks and architecture directly impacts the broader AI community's approach to designing and interpreting foundation models. While Paper 1 presents a valuable and practical application of LLMs for network configuration, its scientific impact is narrower and more localized to the systems and networking fields compared to the foundational nature of Paper 2.

    vs. MIRAGE: Mobile Agents with Implicit Reasoning and Generative World Models
    gpt-5.2 · 6/5/2026

    Paper 2 has higher estimated impact: it proposes a broadly applicable framework (latent internal reasoning + generative world-model alignment) with clear, timely real-world utility for efficient mobile UI agents and measurable gains (token reduction, performance improvements) on established benchmarks. Its contribution could transfer across agentic RL, multimodal modeling, and deployment-constrained settings. Paper 1 is novel and methodologically insightful for transformer interpretability/architecture, but is more specialized and primarily diagnostic, with less immediate application breadth and product-facing impact than MIRAGE.

    vs. Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories
    claude-opus-4.6 · 6/5/2026

    Paper 2 provides a fundamental mechanistic understanding of how absolute position information emerges in RoPE-based transformers despite only relative positional encoding being explicitly applied. This addresses a core architectural mystery with broad implications for transformer design, context window extension, and interpretability research. Paper 1 addresses an important but more incremental safety concern—extending the understanding of inference-time vulnerabilities beyond shallow safety. While practically relevant, Paper 2's insights are more foundational, affecting how the community understands and designs transformer architectures, giving it broader and longer-lasting impact.

    vs. SCI-PRM: A Tool Aware Process Reward Model for Scientific Reasoning Verification
    claude-opus-4.6 · 6/5/2026

    SCI-PRM addresses a significant gap in applying process reward models to scientific reasoning with tool use, introducing both a large-scale dataset (SCIPRM70K) and a novel reward model with demonstrated improvements in test-time scaling and reinforcement learning. This has broad practical impact across multiple scientific domains. Paper 2, while providing valuable mechanistic interpretability insights about positional encoding in transformers, is more narrowly focused on understanding an existing phenomenon rather than enabling new capabilities. SCI-PRM's combination of dataset contribution, methodological innovation, and practical applicability across sciences gives it higher potential impact.

    vs. Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?
    gemini-3.1 · 6/5/2026

    Paper 2 addresses a critical, highly timely issue in AI safety and human-agent collaboration. Through a rigorous, large-scale human study, it reveals significant vulnerabilities in how developers interact with AI agents, extending beyond theoretical AI-only settings. Its findings have immediate, broad implications across software engineering, HCI, and AI alignment, likely driving significant future research and policy changes. Paper 1 offers valuable mechanistic insights into Transformer architectures, but its impact is relatively niche compared to the systemic security and safety concerns raised by Paper 2.

    vs. Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models
    claude-opus-4.6 · 6/5/2026

    Paper 2 provides fundamental mechanistic insights into how absolute position information emerges in RoPE-based transformers despite only encoding relative offsets. This addresses a deep theoretical puzzle about widely-used architectures, with implications for positional encoding design, attention sink phenomena, and length generalization. Its findings are broadly applicable across transformer architectures. Paper 1, while practically useful for KV cache efficiency, represents a more incremental contribution in the well-explored space of inference optimization, building on known observations about token importance in reasoning traces.

    vs. Breaking the Chain: A Causal Analysis of LLM Faithfulness to Intermediate Structures
    claude-opus-4.6 · 6/5/2026

    Paper 2 provides a fundamental mechanistic understanding of how absolute position information emerges in RoPE-based transformers despite only relative offsets being encoded. This addresses a core theoretical puzzle affecting virtually all modern LLMs (GPT, Llama, etc.), with implications for architecture design, context extension, and interpretability. Paper 1 makes a valuable empirical contribution about faithfulness in schema-guided reasoning pipelines, but its scope is narrower—focused on a specific pipeline paradigm. Paper 2's insights into attention sinks, causal mask effects, and positional encoding have broader downstream impact across the transformer research community.

    vs. WorldFly: A World-Model-Based Vision-Language-Action Model for UAV Navigation
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher scientific impact: it offers a mechanistic explanation of how absolute position information emerges in widely used decoder-only Transformers despite RoPE’s relative design, attributing it to the causal mask and residual-stream dynamics and connecting to attention sinks across multiple architectures. This is broadly relevant to interpretability, architecture design, and training/inference behavior across many NLP/LLM systems, with immediate implications. Paper 1 is innovative and application-driven, but its impact is narrower (UAV navigation benchmarks/models) and more contingent on domain adoption.

    vs. PerceptUI: LLM Agents as Human-Aligned Synthetic Users for UI/UX Evaluation
    claude-opus-4.6 · 6/5/2026

    Paper 2 addresses a fundamental question about how transformer architectures work internally—specifically how absolute position information emerges in RoPE-based models despite only relative offsets being explicitly encoded. This mechanistic insight into attention sinks, causal masks, and residual streams has broad implications for transformer architecture design, positional encoding research, and interpretability. Paper 1, while practically useful for UI/UX evaluation, is more application-specific and incremental. Paper 2's findings are likely to influence a wider range of downstream research in NLP, architecture design, and mechanistic interpretability.

    vs. PLAN-S: Bridging Planning with Latent Style Dynamics for Autonomous Driving World Models
    gemini-3.1 · 6/5/2026

    Paper 2 investigates fundamental mechanistic properties of decoder-only Transformers (RoPE and attention sinks), the foundational architecture for modern LLMs. Insights here broadly impact LLM design, long-context scaling, and interpretability across all of AI. Paper 1 presents a strong, innovative approach for autonomous driving world models, but its impact is more narrowly confined to robotics and vehicle planning.

    vs. When AI Says It Feels
    gemini-3.1 · 6/5/2026

    Paper 1 offers fundamental insights into the inner workings of decoder-only Transformers by explaining how absolute position information leaks into models using relative positional encodings like RoPE. This mechanistic interpretability is highly impactful for understanding, debugging, and improving widely used LLM architectures. While Paper 2 presents an interesting behavioral experiment regarding AI alignment and expressions of feeling, Paper 1 addresses a core architectural mechanism that directly impacts foundation model design, scaling laws, and context window extension strategies, granting it broader and more immediate scientific relevance.

    vs. QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving
    gpt-5.2 · 6/5/2026

    Paper 2 likely has higher impact due to strong real-world applicability and timeliness: it targets a key deployment bottleneck (RAG prefill cost) and proposes an implementable systems method with concrete speedups at matched quality, evaluated across multiple LLMs/datasets and integrated into SGLang. This makes it broadly useful to industry and research and immediately actionable. Paper 1 is novel mechanistic interpretability work with potential theoretical importance, but its direct downstream applications and breadth are less immediate compared to a scalable serving optimization.

    vs. SciVisAgentSkills: Design and Evaluation of Agent Skills for Scientific Data Analysis and Visualization
    claude-opus-4.6 · 6/5/2026

    Paper 1 provides fundamental mechanistic insight into how absolute positional information emerges in RoPE-based transformers despite only relative encodings being used—a surprising and broadly relevant finding for the transformer/LLM community. It identifies specific architectural mechanisms (causal mask softmax denominators, residual stream dynamics, attention sinks) with rigorous analysis across multiple architectures. This deepens theoretical understanding of widely-deployed models. Paper 2 is a useful engineering contribution (reusable agent skills for scientific visualization) but is narrower in scope, more incremental, and primarily an applied systems/benchmark paper with limited conceptual novelty.