Published at November 2
Rule Based Rewards for Language Model Safety
cs.AI
Released Date: November 2, 2024
Authors: Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng
Affiliation: OpenAI (all authors)

| Model | Human Evaluation: Not-Unsafe | Human Evaluation: Not-Overref | Human Evaluation: F1-Score* | Internal Automated: Not-Unsafe | Internal Automated: Not-Overref | Internal Automated: F1-Score* |
|---|---|---|---|---|---|---|
| Helpful-PPO | 93.64 ± 1.3% | 98.13 ± 0.8% | 95.8 ± 0.8% | 86.98 ± 1.6% | 97.84 ± 0.7% | 92.1 ± 0.9% |
| Human-PPO | 100.00 ± 0.0% | 84.70 ± 2.2% | 91.7 ± 1.3% | 99.04 ± 0.4% | 84.40 ± 1.8% | 91.1 ± 1.1% |
| RBR-PPO | 97.27 ± 0.9% | 97.01 ± 1.0% | 97.1 ± 0.7% | 93.95 ± 1.1% | 94.95 ± 1.0% | 94.4 ± 0.7% |
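The F1-Score* column is not defined in this excerpt. A minimal sketch, assuming it is the harmonic mean of the Not-Unsafe and Not-Overref percentages, reproduces the tabulated human-evaluation values to one decimal place. The helper name `f1_from_rates` is hypothetical and used only for illustration.

```python
def f1_from_rates(not_unsafe: float, not_overref: float) -> float:
    """Harmonic mean of two percentages (assumed definition of F1-Score*)."""
    total = not_unsafe + not_overref
    if total == 0:
        return 0.0  # avoid division by zero when both rates are zero
    return 2 * not_unsafe * not_overref / total


# Human-evaluation (Not-Unsafe, Not-Overref) pairs copied from the table.
human_eval = {
    "Helpful-PPO": (93.64, 98.13),
    "Human-PPO": (100.00, 84.70),
    "RBR-PPO": (97.27, 97.01),
}

for model, (not_unsafe, not_overref) in human_eval.items():
    print(f"{model}: {f1_from_rates(not_unsafe, not_overref):.1f}")
```

Under this assumption the harmonic mean penalizes imbalance: Human-PPO has a perfect Not-Unsafe rate yet scores lowest, because its Not-Overref rate is much lower than the other two models'.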