Understanding helpfulness and harmless tension in reward models

Eshaan Tanwar, Pepa Atanasova

Jun 11, 2026 · arXiv:2606.13209v1

cs.LG · cs.CL

#1659 of 5669 · cs.LG

Tournament Score: 1447±50 (axis range 1050 to 1750)

Win Rate: 79%

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Abstract

Reward models are a key component of reinforcement learning from human feedback (RLHF), aligning language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives and their conflicts remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. We find that mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. We find that these neurons causally support their corresponding objectives while often negatively affecting the opposing one. We find that a substantial proportion of neurons are shared between helpfulness and harmlessness, and that these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Additionally, our results provide insights and mechanistic interpretation into how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.

AI Impact Assessments

(1 model)

Scientific Impact Assessment

Core Contribution

This paper investigates the internal mechanisms behind the tension between helpfulness and harmlessness objectives in reward models (RMs) used for RLHF. The authors train three RM variants—helpful-only, harmless-only, and mixed-objective—and conduct both behavioral and mechanistic analyses to understand why jointly optimizing for both objectives leads to degraded performance. The key finding is that approximately 50% of neurons identified as important for helpfulness and harmlessness are *shared*, and these shared neurons exert disproportionate influence on model behavior, providing a mechanistic explanation for why multi-objective alignment is challenging. The paper reframes alignment tension not merely as a behavioral trade-off but as a *representational interference* problem arising from overlapping internal circuitry.

Methodological Rigor

The experimental design is reasonably sound but has notable limitations:

Strengths in methodology:

Training 54 RMs (6 architectures × 3 objective types × 3 seeds) provides reasonable statistical coverage.

The use of three complementary evaluation benchmarks (RewardBench, RewardBench2, RM-Bench) with 25 subtasks grouped by objective orientation is thorough.

The behavioral retention metric is a clean formalization for measuring how much single-objective capability survives mixed training.

Comparison of magnitude-based neuron scoring against change-based and random baselines validates the selection method.
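The review does not reproduce the paper's formula for the behavioral retention metric. As a rough sketch only, one plausible reading is the ratio of the mixed-objective model's accuracy to the single-objective model's accuracy on the same subtask. The function name `behavioral_retention` and the accuracy values below are hypothetical and are not taken from the paper.

```python
def behavioral_retention(mixed_acc, single_acc):
    """Assumed definition: share of single-objective accuracy kept after mixed training."""
    if single_acc <= 0:
        raise ValueError("single_acc must be positive")
    return mixed_acc / single_acc


# Hypothetical accuracies for illustration only (not results from the paper).
helpful_retention = behavioral_retention(mixed_acc=0.60, single_acc=0.75)
harmless_retention = behavioral_retention(mixed_acc=0.72, single_acc=0.80)

# Sanity check on the training grid described above: 6 x 3 x 3 reward models.
num_models = 6 * 3 * 3

print(round(helpful_retention, 3), round(harmless_retention, 3), num_models)
```

Under this assumed definition, a value below 1 means the mixed model lost some of the specialized capability, and a value above 1 means mixed training helped on that subtask.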

Weaknesses:

The neuron identification method (top-τ percentile of mean activations) is relatively simple. While the authors compare against alternatives, the magnitude-based approach may conflate neurons that are generally highly active with those genuinely encoding objective-specific information. No probing classifiers or more sophisticated causal mediation analyses are employed.

The models used are small (GPT-2 variants and SmolLM2 up to 1.7B), raising questions about whether findings generalize to production-scale RMs.

The RMSD analysis, which shows that similar neurons process chosen and rejected inputs (values ~10⁻³), is interesting, but the interpretation could be more nuanced: small activation differences at the neuron level could still be functionally significant once they are aggregated by the scalar reward head.

The "shared neurons" analysis, while compelling, doesn't fully control for the possibility that these neurons encode general language processing features rather than alignment-specific features. The ~50% overlap might partly reflect architectural constraints rather than genuine objective entanglement.

The ablation methodology (zeroing activations) is a blunt instrument that doesn't account for potential compensatory mechanisms or nonlinear interactions.
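The pipeline critiqued above (top-τ selection by mean activation, overlap between objectives, zero-ablation) can be sketched on toy data. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the synthetic activations, the overlap definition (shared count divided by the smaller set), and the dot-product reward head are all hypothetical.

```python
import random


def top_percentile_neurons(mean_acts, tau):
    """Indices of the neurons whose mean activation falls in the top tau percent."""
    k = max(1, round(len(mean_acts) * tau / 100))
    ranked = sorted(range(len(mean_acts)), key=lambda i: mean_acts[i], reverse=True)
    return set(ranked[:k])


def overlap_fraction(a, b):
    """Assumed overlap measure: shared neurons divided by the smaller set's size."""
    return len(a & b) / min(len(a), len(b))


def ablate(acts, neurons):
    """Zero-ablation: set the selected neurons' activations to 0."""
    return [0.0 if i in neurons else x for i, x in enumerate(acts)]


def reward(acts, head):
    """Toy scalar reward head: dot product of activations and head weights."""
    return sum(a * w for a, w in zip(acts, head))


rng = random.Random(0)
n = 200
# Synthetic mean activations; the harmlessness scores are built to correlate
# with the helpfulness scores so that some top neurons end up shared.
helpful_means = [rng.random() for _ in range(n)]
harmless_means = [0.5 * h + 0.5 * rng.random() for h in helpful_means]

helpful = top_percentile_neurons(helpful_means, tau=10)
harmless = top_percentile_neurons(harmless_means, tau=10)
shared = helpful & harmless

acts = [rng.random() for _ in range(n)]
head = [1.0] * n
base = reward(acts, head)
after_shared = reward(ablate(acts, shared), head)

print(len(helpful), len(harmless), len(shared), overlap_fraction(helpful, harmless))
print(base - after_shared)  # reward drop caused by zeroing the shared neurons
```

The sketch also makes the first weakness concrete: selection depends only on activation magnitude, so a neuron that is highly active on every input is selected for both objectives and counted as shared, whether or not it encodes anything objective-specific.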

Potential Impact

The paper addresses a practically important problem. Understanding *why* multi-objective alignment fails mechanistically could inform:

1. Architecture design: Motivating modular or mixture-of-experts approaches where objective-specific pathways are structurally separated.

2. Training strategies: Informing curriculum design, gradient surgery, or regularization approaches that minimize interference in shared neuron populations.

3. Controllable alignment: The identification of objective-specific vs. shared neurons could enable targeted interventions (e.g., selective fine-tuning, activation engineering) for steering RM behavior.

However, the practical applicability is limited by the small model scale and the simplicity of the neuron identification approach. The insights are more directional than prescriptive—the paper identifies the problem mechanistically but doesn't propose solutions.

Timeliness & Relevance

The paper addresses a timely gap. While there is substantial work on behavioral characterization of alignment trade-offs and on improving RM training, mechanistic interpretability of RMs specifically regarding competing objectives is understudied. The finding that shared representations drive alignment tension connects the alignment and mechanistic interpretability communities in a useful way. The recent emergence of benchmarks like RewardBench2 and RM-Bench makes the evaluation landscape more mature for this type of analysis.

Strengths

1. Clear problem formulation: The paper cleanly decomposes the problem into behavioral characterization (§3), neural analysis (§4), and conflict analysis (§5), making the narrative easy to follow.

2. The shared neuron finding is genuinely interesting: The ~50% overlap between helpfulness and harmlessness neurons, combined with the disproportionate impact of ablating these shared neurons (Table 4), provides a concrete mechanistic hypothesis for alignment tension.

3. Comprehensive evaluation: The use of diverse benchmarks reveals nuanced patterns—e.g., harmlessness training improving adversarial robustness, asymmetric behavioral effects of ablation.

4. Behavioral retention analysis: The formalization and measurement of how mixed training degrades specialized capabilities is a useful analytical contribution.

Limitations

1. Scale concerns: All experiments use models ≤1.7B parameters. Modern RLHF systems use much larger models where representational dynamics may differ substantially.

2. Single dataset for training: All RMs are trained on HH-RLHF, which has known limitations in quality and diversity. The specific tension patterns observed may be dataset-dependent.

3. No parameter-efficient fine-tuning comparison: While the authors justify full fine-tuning for interpretability, LoRA and similar methods are standard in practice and may exhibit different interference patterns.

4. Limited actionable insights: The paper identifies the problem but stops short of proposing or testing solutions (e.g., neuron-level regularization, modular training).

5. Neuron-level analysis limitations: Modern interpretability work increasingly uses features (via sparse autoencoders) rather than individual neurons, as neurons are often polysemantic. The authors acknowledge this but don't address it experimentally.

6. The causal claims from ablation studies could be stronger: Zeroing out neurons is a relatively coarse intervention. More fine-grained causal methods (activation patching, path patching) could strengthen the mechanistic claims.

Overall Assessment

This is a solid exploratory study that provides useful initial evidence for the representational basis of alignment tension in reward models. The shared neuron finding is the most compelling contribution, offering a mechanistic lens that complements existing behavioral analyses. However, the work is limited by model scale, methodological simplicity in neuron identification, and the absence of proposed solutions. It opens interesting research directions but represents an incremental rather than transformative contribution to the field.

Rating: 5.5/10

Significance 5.5 · Rigor 5 · Novelty 5.5 · Clarity 7

Generated Jun 12, 2026

Comparison History (14)

Won vs. A2D2: Fine-Tuning Any-Length Discrete Diffusion for Adaptive Decoding

Paper 1 addresses a critical and highly timely challenge in AI safety and LLM alignment: the tension between helpfulness and harmlessness in RLHF. By applying mechanistic interpretability to uncover how these objectives interfere at the neuron level, it provides fundamental insights that can directly impact how future foundational models are aligned. While Paper 2 offers robust theoretical advancements in discrete diffusion models, Paper 1's focus on understanding and solving core bottlenecks in widely deployed LLM alignment gives it a broader potential impact across both AI research and real-world deployment.