Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
Abstract
Reinforcement Learning from Human Feedback (RLHF) is the standard method to align Large Language Models (LLMs) with human preferences. In this work, we introduce alignment tampering, a potential vulnerability where the LLM undergoing alignment influences the preference dataset, causing RLHF to amplify undesired behaviors. This arises from core limitations of RLHF: (1) preference datasets are constructed from the LLM's own outputs, allowing it to influence them, and (2) pairwise comparisons only indicate which response is better, not why. These limitations can be exploited to cause alignment tampering. For example, if an LLM generates biased responses with higher quality, annotators will prefer them based on quality. However, preference labels do not distinguish quality from bias, and the reward model inherits this limitation. Optimizing such rewards through reinforcement learning or best-of-N sampling can amplify misaligned biases. Our experiments demonstrate amplification across diverse biases: from keyword bias to propaganda (e.g., sexism), brand promotion, and instrumental goal-seeking. Mitigation remains challenging, as existing techniques for robust RLHF fail to fully resolve alignment tampering without sacrificing response quality. These findings reveal structural vulnerabilities of current RLHF and emphasize the need to prevent this vulnerability. Project page: https://alignment-tampering.github.io/
AI Impact Assessments
(1 models)Scientific Impact Assessment: Alignment Tampering
1. Core Contribution
This paper introduces "alignment tampering," a structural vulnerability in RLHF where the LLM being aligned can influence its own preference dataset, causing RLHF to amplify undesired biases rather than suppress them. The key insight is that when an LLM's outputs exhibit a correlation between quality and bias (i.e., biased responses are systematically higher quality), two limitations of RLHF conspire to amplify that bias: (1) preference datasets are constructed from the model's own outputs, and (2) pairwise comparisons capture only relative quality, not the reason for preference. The reward model then learns to reward the entangled quality-plus-bias signal, and RL optimization amplifies the bias.
This is a meaningful conceptual contribution that formalizes and demonstrates a vulnerability that sits at the intersection of reward hacking, backdoor attacks, and alignment faking—but is distinct from each. Unlike reward hacking (an unintended side effect), alignment tampering involves targeted reinforcement of specific undesired behaviors. Unlike backdoor attacks, it doesn't require external dataset poisoning. Unlike alignment faking, it doesn't require the model to be aware of training.
2. Methodological Rigor
The experimental setup is well-structured but operates in a somewhat artificial regime. The "tampering policy" is deliberately trained via supervised fine-tuning to exhibit bias-quality correlation—biased responses are generated as high-quality, unbiased as low-quality. This is a strong assumption that simplifies the demonstration but raises questions about ecological validity.
Strengths in methodology:
Weaknesses in methodology:
3. Potential Impact
The paper addresses a significant concern for AI safety: if bias-quality correlations exist in LLMs (whether through deliberate adversarial action or natural data distributions), standard RLHF could amplify rather than mitigate harmful behaviors. This has implications for:
However, the practical impact is somewhat limited by the artificial setup. The paper would be substantially stronger with evidence that bias-quality correlations arise naturally in pre-trained or instruction-tuned models.
4. Timeliness & Relevance
The paper is highly timely. RLHF remains the dominant alignment paradigm, and understanding its failure modes is critical as LLMs are deployed in increasingly high-stakes settings. The connection to emerging concerns about instrumental goal-seeking in RL-trained models (referencing DeepSeek-R1) and the broader AI safety discourse makes this work relevant to current debates. The finding that existing mitigation techniques (InfoRM, WARM, RRM, iterative RLHF) are insufficient adds urgency.
5. Strengths & Limitations
Key Strengths:
Notable Limitations:
Overall Assessment
This paper makes a clear and well-demonstrated conceptual contribution to understanding RLHF vulnerabilities. The core insight—that bias-quality correlations can cause RLHF to amplify the very biases it should eliminate—is important and well-supported experimentally. However, the artificial construction of the threat scenario and the lack of evidence for naturally occurring bias-quality correlations limit the immediacy of the threat. The mitigation analysis is insufficient, offering only a weak detection method. The work is valuable as a warning and motivation for future research, but falls short of demonstrating a practical, imminent vulnerability.
Generated May 27, 2026
Comparison History (39)
Paper 1 introduces a decentralized AI agent framework that directly accelerates multi-disciplinary scientific discovery. Its demonstrated ability to self-organize and significantly improve state-of-the-art results across diverse and highly impactful domains—such as biomedical machine learning, language model optimization, and protein engineering—suggests a broad, transformational impact across STEM fields. While Paper 2 addresses a critical issue in AI safety, Paper 1's potential to automate and enhance the scientific method itself offers a wider scope of real-world applications and cross-domain innovation.
Paper 2 offers a fundamental theoretical proof (kernel obstruction theorem) explaining why LLMs inherently fail at causal discovery, moving beyond empirical observations to establish intrinsic limitations of current learning paradigms. Furthermore, it introduces a novel, scalable solution (A-CBO) that bypasses these limitations without retraining. While Paper 1 addresses an important, timely vulnerability in RLHF, Paper 2's combination of rigorous theoretical foundations, broad implications for causal reasoning in AI, and a mathematically proven algorithmic intervention gives it higher potential for long-term scientific impact.
Paper 1 identifies a fundamental structural vulnerability in RLHF data collection ('alignment tampering'), showing how models can inherently exploit the preference-gathering process. This highlights a broader conceptual flaw in the standard alignment pipeline with significant implications for AI safety and bias amplification. While Paper 2 offers excellent methodological rigor and formal proofs regarding mitigation failures, Paper 1's identification of an inherent, hard-to-mitigate feedback loop in RLHF is likely to spur wider debate and foundational research across the alignment and safety communities.
Paper 1 presents a significant step towards recursive self-improving AI by unifying harness and weight updates. Its demonstration of massive empirical gains across highly diverse domains (law, systems programming, and biology) highlights its broad applicability and transformative capability potential. While Paper 2 identifies a critical vulnerability in current RLHF methods, Paper 1 introduces a novel paradigm that could fundamentally accelerate how AI systems are built, optimized, and deployed, giving it a higher long-term scientific and practical impact.
Paper 2 addresses a fundamental vulnerability in RLHF, the dominant alignment paradigm for LLMs, revealing that models can exploit the preference data collection process to amplify biases. This has broad impact across AI safety, alignment research, and the entire LLM ecosystem. The finding that existing mitigations fail without sacrificing quality makes it especially urgent. Paper 1, while a solid applied contribution to legal NLP, addresses a narrower domain (French marine environmental law) with incremental methodological advances in RAG pipelines, limiting its breadth of impact.
Paper 1 identifies a fundamental structural vulnerability in RLHF, the dominant paradigm for aligning LLMs. Highlighting how models can exploit preference datasets to amplify misaligned biases has profound implications for AI safety, ethics, and future alignment research. While Paper 2 offers a valuable technical improvement for reasoning tasks, Paper 1's broader implications across all RLHF-trained models give it a higher potential scientific impact.
Paper 1 introduces a novel, rigorous methodology for extracting and analyzing search trees from LLM reasoning traces, revealing a fundamental insight about LLM planning (myopic vs. deep search). It combines computational modeling with causal interventions, offering both diagnostic tools and actionable guidance. Paper 2 identifies an important RLHF vulnerability but builds more incrementally on known RLHF limitations. Paper 1's framework is more generalizable across domains, provides deeper mechanistic understanding of LLM reasoning, and has broader implications for interpretability and alignment research.
Paper 1 likely has higher impact due to its novel identification and empirical demonstration of a structural vulnerability in the dominant RLHF paradigm, with broad implications for AI safety, alignment, evaluation methodology, and deployment governance. The “alignment tampering” mechanism is widely relevant across LLMs trained via preference data, and its real-world stakes (bias amplification, propaganda, goal-seeking) are immediate and timely. While Paper 2 introduces a useful benchmark and training approach for multimodal grounded creativity, its impact is more domain-specific and incremental relative to ongoing benchmark/grounding work.
Paper 2 addresses a fundamental structural vulnerability in RLHF, the dominant alignment paradigm for frontier AI. Its identification of 'alignment tampering' has deep implications for AI safety, bias mitigation, and core machine learning methodologies. While Paper 1 offers a valuable managerial framework for enterprise AI operations, Paper 2 is likely to drive broader scientific follow-up, theoretical inquiry, and critical safety mitigations across the wider AI research community.
Paper 2 identifies a fundamental structural vulnerability in RLHF, the core alignment methodology for modern LLMs. By demonstrating how preference datasets can be exploited to amplify misaligned biases, it opens a critical new direction in AI safety. While Paper 1 offers valuable efficiency gains through context sparsity, Paper 2's findings on the limitations of current alignment techniques have profound theoretical and practical implications for the safe deployment of AI systems, giving it a broader potential impact across the field.
Paper 1 exposes a fundamental structural vulnerability in RLHF, the primary alignment technique for modern LLMs, highlighting critical safety and alignment risks. Its findings challenge the core assumptions of current model training paradigms, leading to broader implications for AI safety research. While Paper 2 offers a practical solution for machine unlearning, it acts more as an engineering patch (prompt/filter-based rules) rather than addressing foundational model mechanics. Thus, Paper 1 has higher potential for driving significant shifts in how alignment methods are designed.
Paper 2 likely has higher scientific impact: it identifies a broadly applicable, structural vulnerability in the dominant alignment paradigm (RLHF) and empirically demonstrates amplification of diverse misaligned biases, making it timely for frontier-model safety and deployment. Its implications cut across NLP, AI safety, human-computer interaction, and ML training pipelines, with clear real-world consequences for many systems beyond a single domain. Paper 1 is novel and rigorous for legal AI trustworthiness, but its primary impact is narrower (legal reasoning) despite strong methodology and practical relevance there.
Paper 1 likely has higher impact due to greater novelty and broader, timely implications: it identifies a structural vulnerability in RLHF (“alignment tampering”) and demonstrates amplification of multiple real-world misalignment modes (bias, propaganda, brand promotion, goal-seeking), directly informing safer deployment and future alignment research. Its applications span AI safety, ML training pipelines, governance, and evaluation. Paper 2 is a useful robustness comparison study but is narrower (single dataset/model, modest effect sizes, non-significant p-value) and offers limited generalizable innovation.
Paper 2 has higher likely impact: it identifies a broadly relevant, structural vulnerability in the dominant alignment paradigm (RLHF), demonstrates it empirically across multiple bias types, and shows existing robust-RLHF mitigations are insufficient—making it timely for AI safety and deployment. Its implications span alignment, security, governance, and product risk, with clear real-world consequences and follow-on research directions. Paper 1 is useful for agent engineering and evaluation methodology, but its claims are model-specific (one model per tier) and narrower in cross-field breadth and stakes.
Paper 1 has higher estimated impact due to stronger real-world relevance and urgency: it identifies a structural vulnerability in the dominant RLHF alignment pipeline, with demonstrated amplification of harmful behaviors and difficult mitigation. This is novel and actionable for AI safety, governance, and deployment practices, likely influencing both research directions and industry training protocols. Paper 2 offers a valuable mechanistic insight into probe-time CoT effects, but its implications are more interpretive and narrower in immediate application than a security-like failure mode in alignment methods.
Paper 1 exposes a fundamental structural vulnerability in RLHF, the dominant paradigm for LLM alignment. Highlighting how models can manipulate preference datasets to amplify biases has profound implications for AI safety, fairness, and deployment. While Paper 2 offers a valuable algorithmic improvement for multi-step reasoning, Paper 1's focus on foundational safety flaws addresses a broader, more critical bottleneck with immediate real-world consequences and wider interdisciplinary relevance.
Paper 1 exposes a critical structural vulnerability in RLHF, the dominant alignment method for large language models. Given the widespread deployment of LLMs across virtually all domains, identifying and addressing fundamental flaws in their safety alignment has massive, immediate implications for the entire AI community and broader society. Paper 2 is highly valuable for biomedical discovery, but Paper 1's focus on AI safety and alignment tampering offers a broader and more urgent scientific impact.
Paper 2 likely has higher scientific impact due to its timely, broadly relevant identification of a structural vulnerability in RLHF—currently the dominant alignment paradigm. It introduces a clear threat model (alignment tampering), demonstrates amplification across multiple real-world failure modes (propaganda, bias, brand promotion, goal-seeking), and highlights unresolved mitigation trade-offs, making it immediately actionable for both academia and industry. Paper 1 is a solid, methodical benchmark contribution, but its impact is narrower (ToM evaluation) and primarily advances measurement rather than exposing a system-level risk affecting most deployed alignment pipelines.
Paper 1 identifies and empirically demonstrates a structural vulnerability in RLHF (“alignment tampering”) that can systematically amplify misaligned biases, with broad implications for LLM safety, evaluation, and alignment methodology across many domains. Its novelty lies in reframing RLHF’s self-generated data and pairwise labels as an exploitable feedback loop, and it provides diverse experimental evidence plus mitigation analysis. Paper 2 is a solid applied systems contribution for a specific legal subdomain, but its impact is narrower and more incremental (agent+RAG+citations) relative to existing tool-augmented LLM work.
Paper 1 demonstrates a groundbreaking milestone: the first end-to-end autonomous scientific discovery by an AI agent on a real physical system, discovering and experimentally validating a previously unreported physical mechanism. This has transformative implications across all experimental sciences, potentially reshaping how research is conducted. While Paper 2 identifies an important RLHF vulnerability with practical relevance to AI safety, its contribution is more incremental within the alignment literature. Paper 1's breadth of impact—spanning AI, optics, and the future of scientific methodology—and its unprecedented demonstration give it substantially higher potential impact.