notesum.ai

Published on November 25

Interpreting Language Reward Models via Contrastive Explanations

cs.AI

Release Date: November 25, 2024

Authors: Junqi Jiang (1), Tom Bewley (2), Saumitra Mishra (2), Freddy Lecue (2), Manuela Veloso (2)

Affiliations: (1) Imperial College London; (2) J.P. Morgan AI Research

arXiv: http://arxiv.org/abs/2411.16502v1