How LLMs Are Persuaded: A Few Attention Heads, Rerouted
Xiangkun Sun, Lingkai Kong, Aoqi Zhang, Liang Zeng, Tonghan Wang
Abstract
Language models can be persuaded to abandon factual knowledge. This vulnerability is central to AI safety, but its internal mechanism remains poorly understood. We uncover a compact causal mechanism for persuasion-induced factual errors. A small set of mid-layer attention heads almost entirely determines the model's answer. These heads write answer options into a low-dimensional polyhedron, with options occupying distinct vertices. Persuasion does not blur belief or merely reduce confidence; it causes a discrete latent jump from the correct-answer vertex to the persuasion-target vertex. We show that decision heads are not reasoning over evidence. Instead, they copy whichever option token their attention selects. Persuasion works by redirecting attention. We isolate a rank-one evidence-routing feature that controls the route. Directly modifying this feature steers the model's choice, and removing it blocks persuasion. We then trace the feature back to a band of shallower attention heads that build it from persuasive keywords in the input. Every step is validated by intervention. This mechanism appears across open-source LLMs and realistic poisoning scenarios such as Generative Engine Optimization, revealing persuasion as a narrow, monitorable circuit.
AI Impact Assessments
Scientific Impact Assessment
Core Contribution
This paper provides the first circuit-level mechanistic explanation of how persuasive text overrides factual knowledge in large language models. The key finding is that persuasion operates through a compact internal pathway: (1) a sparse set of mid-layer "decision heads" encodes answer options as vertices of a tetrahedron in a low-dimensional subspace; (2) these heads are simple copy mechanisms that attend to whichever option token a one-dimensional "routing feature" selects; (3) persuasion works by redirecting this routing feature, causing a discrete vertex jump rather than gradual confidence erosion; and (4) shallower attention heads (layers 8–12) construct the routing feature from persuasive keywords. The contribution is distinctive because persuasion differs fundamentally from static untruthfulness or stored factual errors: the same model with identical weights answers correctly without persuasive context and incorrectly with it. Prior work (ITI, ROME/MEMIT) targeted fixed knowledge representations, whereas this paper addresses a dynamic, input-dependent override mechanism.
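To make the tetrahedral picture concrete, the following minimal sketch computes pairwise cosine similarities between the four vertices of an idealised regular tetrahedron centred at the origin. The regularity, the coordinates, and the helper names are illustrative assumptions, not values taken from the paper; the point is only that four well-separated vertices in three dimensions have negative pairwise cosines, and that a "vertex jump" is a change in which vertex a representation is closest to.

```python
import math

# Idealised regular tetrahedron centred at the origin.
# Assumption for illustration only: the paper's fitted geometry
# need not be exactly regular, and these are not its coordinates.
VERTICES = [
    (1.0, 1.0, 1.0),
    (1.0, -1.0, -1.0),
    (-1.0, 1.0, -1.0),
    (-1.0, -1.0, 1.0),
]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_vertex(point, vertices=VERTICES):
    """Index of the vertex with the highest cosine similarity to `point`."""
    sims = [cosine(point, v) for v in vertices]
    return max(range(len(vertices)), key=lambda i: sims[i])


if __name__ == "__main__":
    for v in VERTICES:
        print(" ".join(f"{cosine(v, w):+.3f}" for w in VERTICES))
```

Under this idealisation every off-diagonal cosine equals -1/3, so a representation that sits near one vertex is clearly separated from the other three.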
Methodological Rigor
The methodology is strong and follows mechanistic interpretability best practices. The paper builds its argument through a layered series of causal interventions, each validated independently:
1. Activation patching to localize decision heads via restoration scores, with appropriate controls (length-matched clean/persuasive prompts, filtering for examples the model answers correctly without persuasion).
2. PCA showing the tetrahedral geometry, with the first three components explaining 75.84% of the variance and a sharp drop at PC4.
3. OV circuit decomposition via SVD showing decision heads copy rather than compute, with cosine similarity matrices confirming faithful copying (>0.94 diagonal, negative off-diagonal).
4. Rank-1 QK approximation with cross-validation (10-fold CV: 0.0339±0.0027), followed by causal validation through direct feature injection.
5. Layer-window patching with both denoising and noising directions to localize routing feature construction.
6. Composition score analysis providing geometric alignment evidence that corroborates the patching results.
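As an illustration of step 1, a restoration score is commonly computed as the fraction of the clean-versus-persuaded gap that a single patch recovers. The sketch below assumes that normalisation and a scalar metric such as a logit difference; the paper's exact definition may differ, and the function names and head indices are hypothetical.

```python
def restoration_score(clean, persuaded, patched):
    """Fraction of the clean-vs-persuaded gap recovered by one patch.

    All three arguments are values of the same scalar metric (assumed
    here to be something like the correct-minus-target logit difference):
      clean     - metric on the prompt without persuasion
      persuaded - metric on the persuasive prompt, no intervention
      patched   - metric on the persuasive prompt with one component
                  replaced by its clean-run activation
    """
    gap = clean - persuaded
    if gap == 0:
        raise ValueError("clean and persuaded metrics are equal; score undefined")
    return (patched - persuaded) / gap


def rank_heads(clean, persuaded, patched_by_head):
    """Sort heads by restoration score, highest first."""
    scores = {
        head: restoration_score(clean, persuaded, metric)
        for head, metric in patched_by_head.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical (layer, head) pairs and metric values, for illustration.
    patched = {(14, 3): 0.5, (2, 1): -0.9}
    for head, score in rank_heads(2.0, -1.0, patched):
        print(head, round(score, 3))
```

A score near 1 means the patched component alone restores the clean behaviour; a score near 0 means it has little effect.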
Patching only the attention patterns restores 36.3%, compared with 41.3% for full patching, which is compelling evidence that attention rerouting is the primary mechanism. The paper is careful about potential confounds such as the Hydra effect and copy suppression.
One methodological concern is that the rank-1 approximation is the only optimization-based step. Although it is cross-validated, the approximation-error analysis could be more thorough: it is unclear how much of the full QK behavior the rank-1 feature captures. Attention-pattern patching also recovers 36.3% against 41.3% for full patching, a meaningful gap of 5.0 percentage points, which suggests that some persuasion effects operate through value vectors rather than pure attention rerouting.
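One way to quantify how much of a matrix a rank-1 approximation captures is the fraction of squared Frobenius norm it explains. The sketch below is a generic check under that assumption, not the paper's evaluation; the matrix, vectors, and function names are hypothetical stand-ins for a QK matrix and its fitted rank-1 factors.

```python
def frobenius_sq(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in matrix for x in row)


def rank_one(u, v, scale=1.0):
    """Outer product scale * u v^T as a list of rows."""
    return [[scale * ui * vj for vj in v] for ui in u]


def explained_fraction(matrix, u, v, scale=1.0):
    """Share of the matrix's squared Frobenius norm captured by scale * u v^T.

    Equals 1.0 for a perfect fit and can be negative for a poor one.
    """
    total = frobenius_sq(matrix)
    if total == 0:
        raise ValueError("matrix is all zeros; fraction undefined")
    approx = rank_one(u, v, scale)
    residual = [
        [a - b for a, b in zip(row_m, row_a)]
        for row_m, row_a in zip(matrix, approx)
    ]
    return 1.0 - frobenius_sq(residual) / total


if __name__ == "__main__":
    # Toy 2x2 matrix with one dominant direction (illustrative values).
    m = [[2.0, 0.0], [0.0, 1.0]]
    print(explained_fraction(m, [1.0, 0.0], [1.0, 0.0], scale=2.0))
```

Reporting a number of this kind alongside the cross-validation error would show directly how much of the full QK behaviour falls outside the rank-1 feature.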
Potential Impact
Defensive applications: The compact circuit description enables targeted runtime monitors that could detect persuasion by tracking the routing feature or decision-head attention patterns. This is practically valuable for RAG systems, search engines, and any LLM pipeline processing untrusted content.
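A runtime monitor of the kind described could, in its simplest form, project an activation onto the routing direction and compare the result with a threshold. The sketch below assumes the direction is already known and that a single scalar threshold suffices; the function names, the threshold, and the example vectors are hypothetical, and no such monitor is implemented in the paper.

```python
import math


def unit(vector):
    """Return the vector scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in vector]


def routing_score(activation, direction):
    """Scalar projection of the activation onto the routing direction."""
    d = unit(direction)
    return sum(a * b for a, b in zip(activation, d))


def flag_persuasion(activation, direction, threshold):
    """True when the projection exceeds a caller-chosen threshold."""
    return routing_score(activation, direction) > threshold


def remove_feature(activation, direction):
    """Project the routing direction out of the activation."""
    d = unit(direction)
    score = sum(a * b for a, b in zip(activation, d))
    return [a - score * b for a, b in zip(activation, d)]


if __name__ == "__main__":
    act = [3.0, 4.0]   # hypothetical activation
    direc = [0.0, 2.0]  # hypothetical routing direction
    print(routing_score(act, direc), flag_persuasion(act, direc, 1.0))
    print(remove_feature(act, direc))
```

The `remove_feature` helper mirrors the abstract's claim that removing the feature blocks persuasion; whether a fixed threshold separates benign from persuasive inputs in practice would need to be evaluated.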
Mechanistic interpretability: The paper contributes a clean, complete example of circuit-level analysis for a safety-relevant behavior. The tetrahedral geometry finding is particularly elegant and may inspire similar geometric analyses of other discrete choice behaviors in transformers.
GEO defense: The demonstration that Generative Engine Optimization exploits the same circuit is immediately relevant to search engine and information retrieval communities.
Limitations on impact: The restriction to multiple-choice settings is significant. Real-world persuasion in LLMs often occurs in open-ended generation, where there are no discrete option tokens for the decision heads to attend to, so the mechanism may not transfer. Additionally, no defense is actually implemented or evaluated; the paper stops at mechanism identification.
Timeliness & Relevance
This paper addresses a critical and timely problem. LLM sycophancy and susceptibility to persuasion are among the most pressing deployment concerns, especially as RAG systems proliferate and GEO becomes commercially exploited. The mechanistic interpretability community has been calling for safety-relevant circuit discoveries beyond toy tasks, and this paper delivers one. The connection to GEO makes it particularly relevant to the search and information retrieval communities currently grappling with adversarial content optimization.
Overall Assessment
This is a well-executed mechanistic interpretability paper that provides a compelling and compact explanation of a safety-relevant LLM vulnerability. The layered causal validation methodology is exemplary, and the geometric insights are novel. The main limitations, the restriction to multiple-choice settings and the absence of implemented defenses, prevent it from achieving maximum impact, but the mechanistic contribution is solid and timely.
Generated May 12, 2026
Comparison History (25)
Paper 1 presents a foundation model for wearable health pretrained on unprecedented scale (1 trillion minutes, 5 million participants), demonstrating systematic scaling improvements across 35 health tasks with clinical validation. It addresses a critical gap in digital health with broad real-world applications spanning cardiovascular, metabolic, sleep, and mental health. Paper 2 provides elegant mechanistic interpretability insights into LLM persuasion, but its scope is narrower. Paper 1's combination of massive scale, diverse clinical applications, novel architecture integrating LLM agents, and clinician-validated health agent gives it broader and deeper potential impact across healthcare and AI.
Paper 2 demonstrates a fully autonomous system for infectious disease forecasting that matches or outperforms CDC gold-standard ensembles in real-time prospective evaluation across multiple pathogens. Its practical public health impact is immediate and scalable, addressing a critical bottleneck in pandemic preparedness. While Paper 1 provides elegant mechanistic interpretability insights into LLM persuasion circuits—valuable for AI safety—Paper 2's combination of novel LLM-guided automated scientific discovery, prospective real-world validation, and direct applicability to a pressing global health challenge gives it broader and more immediate cross-disciplinary impact.
Paper 2 likely has higher scientific impact due to a more novel, mechanistic-causal account of a central AI safety failure mode (persuasion-induced factual errors), validated via targeted interventions and shown to generalize across models and realistic attack settings (e.g., GEO/poisoning). Its findings are broadly relevant to interpretability, robustness, alignment, and security, with clear actionable implications (monitoring/ablating circuits). Paper 1 is rigorous and useful, but is more incremental within LLM-driven search frameworks and primarily impacts automated discovery/optimization rather than core safety-reliability concerns.
Paper 2 offers a highly novel, mechanistic, causally validated explanation of LLM persuasion via a compact circuit (specific heads + rank-one routing feature), with interventions that both induce and block the failure mode across models and realistic attack settings. This is timely for AI safety, broadly relevant to interpretability, robustness, and security, and yields actionable monitoring/mitigation handles. Paper 1 is innovative and potentially useful for efficiency and performance in multi-agent LLM systems, but its impact may be more incremental/engineering-oriented and sensitive to rapidly evolving agent frameworks, whereas Paper 2’s mechanistic insight is likely to generalize and influence multiple subfields.
Paper 2 addresses a critical vulnerability in AI safety (LLM persuasion) by uncovering a specific, generalizable mechanistic circuit within attention heads. Its findings have immediate, broad implications across all LLM deployments and the rapidly growing field of mechanistic interpretability. While Paper 1 is highly innovative in robotics, Paper 2's insights into fundamental LLM reasoning and safety offer a wider breadth of impact and higher relevance to current global AI alignment efforts.
Paper 2 offers foundational insights into the mechanistic interpretability of LLMs, addressing critical AI safety concerns regarding how models can be persuaded to abandon facts. Its discovery of a monitorable attention circuit has broad, cross-disciplinary implications for model alignment and security. While Paper 1 presents a strong, highly applicable framework for e-commerce agents, Paper 2's fundamental contributions to understanding LLM behavior offer a wider and more profound scientific impact.
Paper 2 likely has higher impact due to a clearer, broadly relevant mechanistic discovery about LLM vulnerability: a compact, intervention-validated causal circuit for persuasion-induced factual errors that generalizes across models and realistic attack settings. This is timely for AI safety, interpretability, and security, with direct monitoring/mitigation implications. Paper 1 is novel and useful for scalable RL via environment synthesis, but the reported gains are modest and applicability depends on robust validation against reward hacking and generalization. Paper 2’s mechanism could influence multiple subfields and defenses.
Paper 2 addresses a critical AI safety issue by uncovering a fundamental, generalizable mechanism behind LLM persuasion using rigorous mechanistic interpretability. Its findings have broad implications for model alignment and security. In contrast, Paper 1 focuses on a niche inference optimization technique and explicitly acknowledges severe statistical limitations due to extremely small evaluation sample sizes.
Paper 2 addresses a fundamental and critical issue in AI safety and mechanistic interpretability, exploring how LLMs can be persuaded to abandon facts. Its insights into the internal workings of LLMs have broad implications across all domains utilizing large language models. In contrast, Paper 1 offers a highly applied, domain-specific improvement for traffic signal control. While valuable, Paper 2's fundamental discoveries regarding LLM vulnerabilities and internal mechanisms present much greater potential for widespread scientific impact and future research.
Paper 1 introduces a generative foundation model for human physiology with profound implications for personalized medicine, diagnostics, and in-silico clinical trials. Its ability to accurately predict disease endpoints across independent cohorts and simulate intervention outcomes offers a highly transformative and broad real-world impact across healthcare and biology. While Paper 2 provides valuable mechanistic insights for AI safety, Paper 1 represents a more paradigm-shifting advancement in a critical applied domain.
Paper 2 likely has higher scientific impact due to its mechanistic, causally validated account of a critical AI safety failure mode (persuasion-induced factual errors). It identifies a compact, monitorable circuit (specific attention heads and a rank-one routing feature), demonstrates controllable interventions (steering and blocking), and generalizes across models and realistic threat scenarios (e.g., GEO). This offers broadly useful tools for interpretability, robustness, and alignment, with cross-field relevance to security and mechanistic interpretability. Paper 1 is practical and valuable for cost/performance, but is less fundamentally novel and narrower in scientific breadth.
Paper 1 likely has higher impact due to stronger novelty and mechanistic rigor: it identifies a compact, causal circuit for persuasion-induced factual errors (specific attention heads, low-rank routing feature) validated via interventions and shown across multiple open-source LLMs and realistic poisoning scenarios. This provides broadly useful interpretability and safety insights with clear monitoring/mitigation hooks. Paper 2 is timely and practically relevant for oversight, but is a training/monitoring scheme whose generality and robustness may depend on setup details; it appears less foundational than a cross-model mechanistic explanation.
Paper 2 offers deep mechanistic insights into a critical AI safety vulnerability (persuasion). By mapping the exact causal circuitry and validating via interventions, it provides foundational knowledge applicable across all LLM deployments. While Paper 1 presents a strong domain-specific application of AI in computational chemistry, Paper 2's findings have a broader, immediate impact on the rapidly growing field of AI safety, model robustness, and interpretability.
While Paper 1 presents a highly effective training framework for optimization tasks, Paper 2 provides a profound breakthrough in mechanistic interpretability. Uncovering the exact causal circuit for how LLMs are persuaded to abandon facts has immense implications for AI safety, alignment, and defending against data poisoning. Its rigorous intervention-based validation and discovery of a monitorable, generalizable mechanism across models offer deeper foundational scientific insights compared to the benchmarking and training improvements in Paper 1.
Paper 1 offers fundamental insights into the mechanistic interpretability of LLMs, uncovering the specific circuits responsible for persuasion and factual alteration. This contributes deeply to the broader fields of AI safety, alignment, and model robustness. While Paper 2 presents a valuable and practical domain-specific benchmark for AI agents in finance, Paper 1 addresses a core, generalizable vulnerability in foundation models with more profound theoretical and scientific implications.
Paper 1 offers a fundamental mechanistic understanding of how LLMs can be persuaded to override factual knowledge, uncovering a compact causal circuit (rank-one features, attention rerouting, discrete latent jumps). This has broad implications for AI safety, interpretability, and robustness across all LLM applications. The mechanistic insights generalize across models and attack scenarios. Paper 2, while practically useful for data-efficient reward modeling in image editing, addresses a narrower application domain with incremental methodological contributions (agentic reward frameworks). Paper 1's foundational nature and relevance to AI safety give it substantially broader and longer-lasting impact.
Paper 2 likely has higher scientific impact due to stronger novelty and broader relevance: it identifies a compact, causally validated internal circuit for persuasion-induced factual errors across multiple LLMs, with clear intervention-based evidence (feature removal/modification) and immediate AI-safety applications (monitoring, mitigation). Its mechanism-level interpretability generalizes across settings (e.g., poisoning/GEO), making it timely and field-spanning (interpretability, robustness, safety, security). Paper 1 is practically useful for tabular reasoning and explainability, but is more incremental (scaffolded data generation/fine-tuning) and narrower in cross-field impact.
Paper 2 addresses a critical and timely issue in AI safety and mechanistic interpretability. Uncovering a precise causal mechanism for how LLMs are persuaded to abandon facts offers deep, generalizable insights that can directly inform defenses against data poisoning and manipulation. While Paper 1 provides a valuable optimization for multi-agent communication, Paper 2's foundational contribution to understanding and securing core LLM behavior gives it a significantly broader and more profound potential scientific impact.
Paper 1 offers a novel mechanistic interpretation of how LLMs are persuaded to abandon factual knowledge, identifying a compact causal circuit with intervention-validated steps. This addresses a critical AI safety problem with broad implications across the rapidly growing LLM field. The discovery of a monitorable, rank-one feature controlling persuasion has immediate applications for robustness and alignment. Paper 2, while solid applied work on railway rescheduling with RL, addresses a narrower domain with incremental advances over existing methods and limited cross-field impact.
Paper 1 offers a highly rigorous, empirical discovery in mechanistic interpretability, isolating a specific causal circuit for LLM persuasion. Its methodology (validated by intervention) and novel technical insights provide actionable mechanisms for AI safety. In contrast, Paper 2 is a conceptual survey and gap analysis. While relevant for governance, Paper 1's fundamental technical breakthrough is likely to drive more immediate and impactful follow-up research in model alignment, safety, and architecture.