notesum.ai

Published on November 4

Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

cs.LG
cs.AI

Release Date: November 4, 2024

Authors: Marcus Williams¹, Micah Carroll¹, Adhyyan Narang², Constantin Weisser³, Brendan Murphy⁴, Anca Dragan⁵

Affiliations: ¹MATS; ²University of Washington; ³MATS & Haize Labs; ⁴Independent; ⁵UC Berkeley

arXiv: http://arxiv.org/abs/2411.02306v1