
#107 of 2682 in Artificial Intelligence (All Time)
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
