How LLMs Are Persuaded: A Few Attention Heads, Rerouted

Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang

#44 of 2292 · Artificial Intelligence
Tournament Score: 1571±44
Win Rate: 88% (22 wins, 3 losses, 25 matches)
Rating: 7.4/10

Abstract

Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.

AI Impact Assessments

(1 models)

Scientific Impact Assessment

Core Contribution

This paper provides the first circuit-level mechanistic explanation of how persuasive text overrides factual knowledge in large language models. The key finding is that persuasion operates through a remarkably compact internal pathway: (1) a sparse set of mid-layer "decision heads" encode answer options as vertices of a tetrahedron in a low-dimensional subspace; (2) these heads are simple copy mechanisms that attend to whichever option token a one-dimensional "routing feature" selects; (3) persuasion works by redirecting this routing feature, causing a discrete vertex jump rather than gradual confidence erosion; and (4) shallower attention heads (layers 8–12) construct the routing feature from persuasive keywords. The contribution is distinctive because persuasion is fundamentally different from static untruthfulness or stored factual errors—the same model with identical weights answers correctly without persuasive context and incorrectly with it. Prior work (ITI, ROME/MEMIT) targeted fixed knowledge representations, whereas this paper addresses a dynamic, input-dependent override mechanism.
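The four-step pathway above can be illustrated with a deliberately tiny toy model. This is a minimal sketch under stated assumptions, not the paper's implementation: the names `decision_head`, `base_scores`, `key_coords`, and `routing_feature`, and all numeric values, are hypothetical. A single scalar `routing_feature` multiplies a per-option key coordinate, which stands in for a rank-one contribution to the attention scores, and the head then copies whichever option receives the most attention.

```python
import math

OPTIONS = ["A", "B", "C", "D"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decision_head(base_scores, key_coords, routing_feature):
    """Toy copy head (hypothetical, for illustration only).

    score_i = base_i + routing_feature * key_coord_i, so the routing term
    is a product of one query-side scalar and one key-side scalar.
    The head "copies" the option token with the largest attention weight.
    """
    scores = [b + routing_feature * k for b, k in zip(base_scores, key_coords)]
    attention = softmax(scores)
    copied = max(range(len(OPTIONS)), key=lambda i: attention[i])
    return OPTIONS[copied], attention


# Assumed values: the clean prompt favors the correct option A,
# and the persuasive text marks option C as the target.
base_scores = [4.0, 0.0, 0.0, 0.0]
key_coords = [0.0, 0.0, 1.0, 0.0]

clean_answer, clean_attention = decision_head(base_scores, key_coords, 0.0)
persuaded_answer, persuaded_attention = decision_head(base_scores, key_coords, 8.0)
print(clean_answer, persuaded_answer)
```

In this toy the answer changes from one option to another as soon as the routing term outweighs the base score, rather than drifting gradually, which is the qualitative behavior the review describes as a discrete vertex jump.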

Methodological Rigor

The methodology is strong and follows mechanistic interpretability best practices. The paper builds its argument through a layered series of causal interventions, each validated independently:

1. Activation patching to localize decision heads via restoration scores, with appropriate controls (length-matched clean/persuasive prompts, filtering for examples the model answers correctly without persuasion).

2. PCA showing the tetrahedral geometry, with 75.84% of the variance explained by the first 3 components and a sharp drop at PC4.

3. OV circuit decomposition via SVD showing decision heads copy rather than compute, with cosine similarity matrices confirming faithful copying (>0.94 diagonal, negative off-diagonal).

4. Rank-1 QK approximation with cross-validation (10-fold CV: 0.0339±0.0027), followed by causal validation through direct feature injection.

5. Layer-window patching with both denoising and noising directions to localize routing feature construction.

6. Composition score analysis providing geometric alignment evidence that corroborates the patching results.
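To make the geometric claim in item 2 concrete, the following sketch checks two properties of an idealized regular tetrahedron. It uses textbook coordinates as an assumption and does not use the paper's activations. Four points centered at the origin sum to zero, so they span at most three dimensions, which is consistent with variance concentrating in three principal components. Every pair of vertices also has cosine similarity -1/3, so distinct options point in mutually opposed directions.

```python
import math
from itertools import combinations

# Idealized regular tetrahedron centered at the origin.
# These coordinates are an illustrative assumption, not measured data.
VERTICES = [
    (1.0, 1.0, 1.0),
    (1.0, -1.0, -1.0),
    (-1.0, 1.0, -1.0),
    (-1.0, -1.0, 1.0),
]


def cosine(u, v):
    """Cosine similarity between two 3-vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# The centroid is the origin, so the four vertices are linearly dependent
# and span at most three dimensions.
centroid = tuple(sum(v[i] for v in VERTICES) / len(VERTICES) for i in range(3))

# All six vertex pairs share the same negative cosine similarity.
pairwise = [cosine(u, v) for u, v in combinations(VERTICES, 2)]

print(centroid)
print([round(c, 4) for c in pairwise])
```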

The attention-pattern patching experiment (restoring 36.3% vs 41.3% for full patching) provides compelling evidence that attention rerouting is the primary mechanism. The paper is careful about potential confounds like the Hydra effect and copy suppression.

One methodological concern is that the rank-1 approximation is the only optimization-based step. Although it is cross-validated, the approximation error analysis could be more thorough: it is unclear how much of the full QK behavior is captured. In addition, attention-pattern patching restores 36.3% against 41.3% for full patching, a gap of 5.0 points, which suggests that some persuasion effects operate through value vectors rather than pure attention rerouting.
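The size of that gap can be worked out directly from the two reported figures. The sketch below is only arithmetic on the numbers quoted in this review. The variable names are illustrative, and reading the ratio as a "share of the effect" assumes the two restoration percentages are measured on the same scale.

```python
# Restoration figures quoted in the review (percent).
attention_only = 36.3   # patching attention patterns alone
full_patching = 41.3    # patching the full head outputs

# Absolute gap in percentage points.
gap_points = full_patching - attention_only

# Fraction of the fully patched effect that attention patterns alone recover,
# and the remainder that must travel through some other channel.
share_via_attention = attention_only / full_patching
share_other = 1.0 - share_via_attention

print(f"gap: {gap_points:.1f} points")
print(f"recovered by attention patterns: {share_via_attention:.1%}")
print(f"not recovered: {share_other:.1%}")
```

Under that assumption, attention patterns account for roughly 88% of what full patching restores, leaving roughly 12% unexplained.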

Potential Impact

Defensive applications: The compact circuit description enables targeted runtime monitors that could detect persuasion by tracking the routing feature or decision-head attention patterns. This is practically valuable for RAG systems, search engines, and any LLM pipeline processing untrusted content.
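A runtime monitor of the kind described could, in its simplest form, project an activation onto a routing direction and compare the result to a threshold. The sketch below is hypothetical throughout: the paper does not implement a defense, and `routing_score`, `flag_persuasion`, the direction vector, the example activations, and the threshold are all assumptions chosen for illustration.

```python
import math


def routing_score(activation, direction):
    """Scalar projection of an activation onto a (not necessarily unit) direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm


def flag_persuasion(activation, direction, threshold):
    """Return True when the projection exceeds the chosen threshold."""
    return routing_score(activation, direction) > threshold


# Hypothetical 3-dimensional stand-ins for residual-stream activations.
direction = [0.0, 3.0, 4.0]        # assumed routing direction, norm 5
clean_activation = [1.0, 0.1, -0.2]
persuaded_activation = [1.0, 2.0, 3.0]
threshold = 1.0                    # assumed; would need calibration on held-out data

print(routing_score(clean_activation, direction))
print(routing_score(persuaded_activation, direction))
print(flag_persuasion(clean_activation, direction, threshold),
      flag_persuasion(persuaded_activation, direction, threshold))
```

In practice the direction would have to be estimated from a real model and the threshold calibrated against false positives. Neither step is evaluated in the paper.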

Mechanistic interpretability: The paper contributes a clean, complete example of circuit-level analysis for a safety-relevant behavior. The tetrahedral geometry finding is particularly elegant and may inspire similar geometric analyses of other discrete choice behaviors in transformers.

GEO defense: The demonstration that Generative Engine Optimization exploits the same circuit is immediately relevant to search engine and information retrieval communities.

Limitations on impact: The restriction to multiple-choice settings is significant. Real-world persuasion in LLMs often occurs in open-ended generation where there are no discrete option tokens to attend to. The mechanism may not transfer. Additionally, no defense is actually implemented or evaluated—the paper stops at mechanism identification.

Timeliness & Relevance

This paper addresses a critical and timely problem. LLM sycophancy and susceptibility to persuasion are among the most pressing deployment concerns, especially as RAG systems proliferate and GEO becomes commercially exploited. The mechanistic interpretability community has been calling for safety-relevant circuit discoveries beyond toy tasks, and this paper delivers one. The connection to GEO makes it particularly relevant to the search and information retrieval communities currently grappling with adversarial content optimization.

Strengths

  • Complete causal chain: Unlike many mechanistic interpretability papers that identify correlates, this traces a full input-to-output pathway with intervention at every link.
  • Cross-model generalization: Results replicate across LLAMA-3, QWEN-3, GEMMA-2, and GEMMA-3, suggesting the mechanism is architectural rather than model-specific.
  • Elegant geometric finding: The tetrahedral choice geometry and discrete vertex jumping provide a crisp, falsifiable characterization of persuasion's internal signature.
  • Practical relevance: Direct connection to GEO and realistic poisoning scenarios elevates this beyond a purely academic exercise.
  • Reproducibility: Single A100, code promised, deterministic evaluations.

Limitations & Weaknesses

  • Restricted to multiple-choice: The mechanism depends heavily on discrete option tokens serving as attention targets. Whether this extends to free-form generation—where persuasion is arguably more dangerous—remains entirely open.
  • No defense evaluation: The paper motivates defenses but doesn't build or test any, leaving the practical safety impact speculative.
  • Single primary head dominance: Heavy reliance on L17H24 in LLAMA-3 raises questions about whether this is a genuine architectural feature or an artifact of the specific models studied.
  • Incomplete attention rerouting story: Attention-pattern patching restores 36.3% against 41.3% for full patching, so about 12% of the effect that full patching recovers appears to travel through non-attention channels, which the paper does not fully explore.
  • Limited dataset diversity: NQ2 and Geo-Bench are useful but narrow; testing on more diverse persuasion types (multi-turn, subtle rhetorical strategies) would strengthen claims.

Overall Assessment

    This is a well-executed mechanistic interpretability paper that provides a compelling and compact explanation of a safety-relevant LLM vulnerability. The layered causal validation methodology is exemplary, and the geometric insights are both novel and aesthetically satisfying. The main limitations—restriction to multiple-choice and absence of implemented defenses—prevent it from achieving maximum impact, but the mechanistic contribution is solid and timely.

    Rating: 7.4/10
    Significance 7.5 · Rigor 7.8 · Novelty 7.5 · Clarity 8.5

    Generated May 12, 2026

    Comparison History (25)

    vs. Towards a General Intelligence and Interface for Wearable Health Data
    claude-opus-4.6 · 5/22/2026

    Paper 1 presents a foundation model for wearable health pretrained on unprecedented scale (1 trillion minutes, 5 million participants), demonstrating systematic scaling improvements across 35 health tasks with clinical validation. It addresses a critical gap in digital health with broad real-world applications spanning cardiovascular, metabolic, sleep, and mental health. Paper 2 provides elegant mechanistic interpretability insights into LLM persuasion, but its scope is narrower. Paper 1's combination of massive scale, diverse clinical applications, novel architecture integrating LLM agents, and clinician-validated health agent gives it broader and deeper potential impact across healthcare and AI.

    vs. Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search
    claude-opus-4.6 · 5/18/2026

    Paper 2 demonstrates a fully autonomous system for infectious disease forecasting that matches or outperforms CDC gold-standard ensembles in real-time prospective evaluation across multiple pathogens. Its practical public health impact is immediate and scalable, addressing a critical bottleneck in pandemic preparedness. While Paper 1 provides elegant mechanistic interpretability insights into LLM persuasion circuits—valuable for AI safety—Paper 2's combination of novel LLM-guided automated scientific discovery, prospective real-world validation, and direct applicability to a pressing global health challenge gives it broader and more immediate cross-disciplinary impact.

    vs. SMCEvolve: Principled Scientific Discovery via Sequential Monte Carlo Evolution
    gpt-5.2 · 5/18/2026

    Paper 2 likely has higher scientific impact due to a more novel, mechanistic-causal account of a central AI safety failure mode (persuasion-induced factual errors), validated via targeted interventions and shown to generalize across models and realistic attack settings (e.g., GEO/poisoning). Its findings are broadly relevant to interpretability, robustness, alignment, and security, with clear actionable implications (monitoring/ablating circuits). Paper 1 is rigorous and useful, but is more incremental within LLM-driven search frameworks and primarily impacts automated discovery/optimization rather than core safety-reliability concerns.

    vs. Recursive Multi-Agent Systems
    gpt-5.2 · 5/16/2026

    Paper 2 offers a highly novel, mechanistic, causally validated explanation of LLM persuasion via a compact circuit (specific heads + rank-one routing feature), with interventions that both induce and block the failure mode across models and realistic attack settings. This is timely for AI safety, broadly relevant to interpretability, robustness, and security, and yields actionable monitoring/mitigation handles. Paper 1 is innovative and potentially useful for efficiency and performance in multi-agent LLM systems, but its impact may be more incremental/engineering-oriented and sensitive to rapidly evolving agent frameworks, whereas Paper 2’s mechanistic insight is likely to generalize and influence multiple subfields.

    vs. PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a critical vulnerability in AI safety (LLM persuasion) by uncovering a specific, generalizable mechanistic circuit within attention heads. Its findings have immediate, broad implications across all LLM deployments and the rapidly growing field of mechanistic interpretability. While Paper 1 is highly innovative in robotics, Paper 2's insights into fundamental LLM reasoning and safety offer a wider breadth of impact and higher relevance to current global AI alignment efforts.

    vs. SimPersona: Learning Discrete Buyer Personas from Raw Clickstreams for Grounded E-Commerce Agents
    gemini-3.1 · 5/16/2026

    Paper 2 offers foundational insights into the mechanistic interpretability of LLMs, addressing critical AI safety concerns regarding how models can be persuaded to abandon facts. Its discovery of a monitorable attention circuit has broad, cross-disciplinary implications for model alignment and security. While Paper 1 presents a strong, highly applicable framework for e-commerce agents, Paper 2's fundamental contributions to understanding LLM behavior offer a wider and more profound scientific impact.

    vs. Learning to Build the Environment: Self-Evolving Reasoning RL via Verifiable Environment Synthesis
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher impact due to a clearer, broadly relevant mechanistic discovery about LLM vulnerability: a compact, intervention-validated causal circuit for persuasion-induced factual errors that generalizes across models and realistic attack settings. This is timely for AI safety, interpretability, and security, with direct monitoring/mitigation implications. Paper 1 is novel and useful for scalable RL via environment synthesis, but the reported gains are modest and applicability depends on robust validation against reward hacking and generalization. Paper 2’s mechanism could influence multiple subfields and defenses.

    vs. BitCal-TTS: Bit-Calibrated Test-Time Scaling for Quantized Reasoning Models
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a critical AI safety issue by uncovering a fundamental, generalizable mechanism behind LLM persuasion using rigorous mechanistic interpretability. Its findings have broad implications for model alignment and security. In contrast, Paper 1 focuses on a niche inference optimization technique and explicitly acknowledges severe statistical limitations due to extremely small evaluation sample sizes.

    vs. OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control
    gemini-3.1 · 5/16/2026

    Paper 2 addresses a fundamental and critical issue in AI safety and mechanistic interpretability, exploring how LLMs can be persuaded to abandon facts. Its insights into the internal workings of LLMs have broad implications across all domains utilizing large language models. In contrast, Paper 1 offers a highly applied, domain-specific improvement for traffic signal control. While valuable, Paper 2's fundamental discoveries regarding LLM vulnerabilities and internal mechanisms present much greater potential for widespread scientific impact and future research.

    vs. Simulating clinical interventions with a generative multimodal model of human physiology
    gemini-3.1 · 5/16/2026

    Paper 1 introduces a generative foundation model for human physiology with profound implications for personalized medicine, diagnostics, and in-silico clinical trials. Its ability to accurately predict disease endpoints across independent cohorts and simulate intervention outcomes offers a highly transformative and broad real-world impact across healthcare and biology. While Paper 2 provides valuable mechanistic insights for AI safety, Paper 1 represents a more paradigm-shifting advancement in a critical applied domain.

    vs. Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
    gpt-5.2 · 5/16/2026

    Paper 2 likely has higher scientific impact due to its mechanistic, causally validated account of a critical AI safety failure mode (persuasion-induced factual errors). It identifies a compact, monitorable circuit (specific attention heads and a rank-one routing feature), demonstrates controllable interventions (steering and blocking), and generalizes across models and realistic threat scenarios (e.g., GEO). This offers broadly useful tools for interpretability, robustness, and alignment, with cross-field relevance to security and mechanistic interpretability. Paper 1 is practical and valuable for cost/performance, but is less fundamentally novel and narrower in scientific breadth.

    vs. Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight
    gpt-5.2 · 5/16/2026

    Paper 1 likely has higher impact due to stronger novelty and mechanistic rigor: it identifies a compact, causal circuit for persuasion-induced factual errors (specific attention heads, low-rank routing feature) validated via interventions and shown across multiple open-source LLMs and realistic poisoning scenarios. This provides broadly useful interpretability and safety insights with clear monitoring/mitigation hooks. Paper 2 is timely and practically relevant for oversight, but is a training/monitoring scheme whose generality and robustness may depend on setup details; it appears less foundational than a cross-model mechanistic explanation.

    vs. Agentic Discovery of Exchange-Correlation Density Functionals
    gemini-3.1 · 5/16/2026

    Paper 2 offers deep mechanistic insights into a critical AI safety vulnerability (persuasion). By mapping the exact causal circuitry and validating via interventions, it provides foundational knowledge applicable across all LLM deployments. While Paper 1 presents a strong domain-specific application of AI in computational chemistry, Paper 2's findings have a broader, immediate impact on the rapidly growing field of AI safety, model robustness, and interpretability.

    vs. Forge: Quality-Aware Reinforcement Learning for NP-Hard Optimization in LLMs
    gemini-3.1 · 5/16/2026

    While Paper 1 presents a highly effective training framework for optimization tasks, Paper 2 provides a profound breakthrough in mechanistic interpretability. Uncovering the exact causal circuit for how LLMs are persuaded to abandon facts has immense implications for AI safety, alignment, and defending against data poisoning. Its rigorous intervention-based validation and discovery of a monitorable, generalizable mechanism across models offer deeper foundational scientific insights compared to the benchmarking and training improvements in Paper 1.

    vs. BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
    gemini-3.1 · 5/16/2026

    Paper 1 offers fundamental insights into the mechanistic interpretability of LLMs, uncovering the specific circuits responsible for persuasion and factual alteration. This contributes deeply to the broader fields of AI safety, alignment, and model robustness. While Paper 2 presents a valuable and practical domain-specific benchmark for AI agents in finance, Paper 1 addresses a core, generalizable vulnerability in foundation models with more profound theoretical and scientific implications.

    vs. RewardHarness: Self-Evolving Agentic Post-Training
    claude-opus-4.6 · 5/16/2026

    Paper 1 offers a fundamental mechanistic understanding of how LLMs can be persuaded to override factual knowledge, uncovering a compact causal circuit (rank-one features, attention rerouting, discrete latent jumps). This has broad implications for AI safety, interpretability, and robustness across all LLM applications. The mechanistic insights generalize across models and attack scenarios. Paper 2, while practically useful for data-efficient reward modeling in image editing, addresses a narrower application domain with incremental methodological contributions (agentic reward frameworks). Paper 1's foundational nature and relevance to AI safety give it substantially broader and longer-lasting impact.

    vs. ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
    gpt-5.2 · 5/12/2026

    Paper 2 likely has higher scientific impact due to stronger novelty and broader relevance: it identifies a compact, causally validated internal circuit for persuasion-induced factual errors across multiple LLMs, with clear intervention-based evidence (feature removal/modification) and immediate AI-safety applications (monitoring, mitigation). Its mechanism-level interpretability generalizes across settings (e.g., poisoning/GEO), making it timely and field-spanning (interpretability, robustness, safety, security). Paper 1 is practically useful for tabular reasoning and explainability, but is more incremental (scaffolded data generation/fine-tuning) and narrower in cross-field impact.

    vs. RADAR: Redundancy-Aware Diffusion for Multi-Agent Communication Structure Generation
    gemini-3.1 · 5/12/2026

    Paper 2 addresses a critical and timely issue in AI safety and mechanistic interpretability. Uncovering a precise causal mechanism for how LLMs are persuaded to abandon facts offers deep, generalizable insights that can directly inform defenses against data poisoning and manipulation. While Paper 1 provides a valuable optimization for multi-agent communication, Paper 2's foundational contribution to understanding and securing core LLM behavior gives it a significantly broader and more profound potential scientific impact.

    vs. Towards Autonomous Railway Operations: A Semi-Hierarchical Deep Reinforcement Learning Approach to the Vehicle Rescheduling Problem
    claude-opus-4.6 · 5/12/2026

    Paper 1 offers a novel mechanistic interpretation of how LLMs are persuaded to abandon factual knowledge, identifying a compact causal circuit with intervention-validated steps. This addresses a critical AI safety problem with broad implications across the rapidly growing LLM field. The discovery of a monitorable, rank-one feature controlling persuasion has immediate applications for robustness and alignment. Paper 2, while solid applied work on railway rescheduling with RL, addresses a narrower domain with incremental advances over existing methods and limited cross-field impact.

    vs. AI Identity: Standards, Gaps, and Research Directions for AI Agents
    gemini-3.1 · 5/12/2026

    Paper 1 offers a highly rigorous, empirical discovery in mechanistic interpretability, isolating a specific causal circuit for LLM persuasion. Its methodology (validated by intervention) and novel technical insights provide actionable mechanisms for AI safety. In contrast, Paper 2 is a conceptual survey and gap analysis. While relevant for governance, Paper 1's fundamental technical breakthrough is likely to drive more immediate and impactful follow-up research in model alignment, safety, and architecture.