Towards Reliable Alignment: Uncertainty-aware RLHF
Categories: cs.AI, cs.LG
Released Date: October 31, 2024
Authors: Debangshu Banerjee¹, Aditya Gopalan¹
Affiliation: ¹Department of Electrical and Communication Engineering, Indian Institute of Science, India

| Model | Average Score | Chat | Chat-Hard | Safety | Reasoning | Prior Sets |
|---|---|---|---|---|---|---|
| GRM-Gemma-2B-sftreg (Yang et al., 2024) | 74.7 | 95.5 | 48.7 | 80.0 | 76.8 | 69.8 |
| Gemma-2B-rewardmodel-baseline | 73.1 | 94.1 | 46.9 | 79.7 | 73.8 | 69.0 |
| Our Model | 69.4 | 95.6 | 44.5 | 55.9 | 81.8 | 69.0 |
| Qwen1.5-72B-Chat (Bai et al., 2023) | 68.2 | 62.3 | 66.0 | 72.0 | 85.5 | 42.3 |
| MiniCPM-2B-dpo-fp32 (Hu et al., 2024) | 66.2 | 89.1 | 49.3 | 52.5 | 82.3 | 49.6 |
| RM-Gemma-2B (Dong et al., 2023) | 64.2 | 94.4 | 40.8 | 44.0 | 76.4 | 66.5 |