notesum.ai
Published on November 25
Interpreting Language Reward Models via Contrastive Explanations
cs.AI
Released Date: November 25, 2024
Authors: Junqi Jiang (1), Tom Bewley (2), Saumitra Mishra (2), Freddy Lecue (2), Manuela Veloso (2)
Affiliations: (1) Imperial College London; (2) J.P. Morgan AI Research

| Dataset | Method | Chosen: CF cov. | Chosen: SF cov. | Rejected: CF cov. | Rejected: SF cov. | Both: CF cov. | Both: SF cov. |
|---|---|---|---|---|---|---|---|
| harmless | PJ | $0.74_{.149}$ | $0.94_{.039}$ | $0.40_{.016}$ | $0.95_{.064}$ | $0.26_{.080}$ | $0.89_{.061}$ |
| harmless | RP | $0.64_{.118}$ | $0.93_{.051}$ | $0.38_{.159}$ | $0.92_{.082}$ | $0.22_{.100}$ | $0.86_{.072}$ |
| harmless | OURS | $0.76_{.086}$ | $0.97_{.031}$ | $0.85_{.109}$ | $0.91_{.059}$ | $0.69_{.111}$ | $0.90_{.061}$ |
| helpful | PJ | $0.97_{.035}$ | $0.87_{.042}$ | $0.14_{.097}$ | $0.99_{.008}$ | $0.13_{.088}$ | $0.87_{.041}$ |
| helpful | RP | $0.84_{.066}$ | $0.81_{.081}$ | $0.23_{.095}$ | $0.99_{.008}$ | $0.21_{.088}$ | $0.81_{.079}$ |
| helpful | OURS | $0.81_{.042}$ | $0.99_{.020}$ | $0.98_{.028}$ | $0.75_{.081}$ | $0.80_{.047}$ | $0.74_{.067}$ |
| hs2 | RP | $0.86_{.035}$ | $0.54_{.060}$ | $0.05_{.038}$ | $0.99_{.008}$ | $0.04_{.035}$ | $0.54_{.060}$ |
| hs2 | OURS | $0.83_{.069}$ | $0.86_{.160}$ | $0.39_{.145}$ | $0.95_{.047}$ | $0.33_{.144}$ | $0.84_{.149}$ |