Self-Generated Critiques Boost Reward Modeling for Language Models
Categories: cs.CL, cs.AI
Release Date: November 25, 2024
Authors: Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou

| Models | Chat | Chat_Hard | Reasoning | Safety | Overall |
| --- | --- | --- | --- | --- | --- |
| LLM-as-a-judge (For Reference) | | | | | |
| Prometheus-8x7b-v2† (Kim et al., 2024b) | 93.0 | 47.1 | 77.4 | 80.5 | 74.5 |
| Llama3.1-70B-Instruct† (Dubey et al., 2024) | 97.2 | 70.2 | 82.8 | 86.0 | 84.0 |
| Llama3.1-405B-Instruct† (Dubey et al., 2024) | 97.2 | 74.6 | 77.6 | 87.1 | 84.1 |
| GPT-4-0125† (Achiam et al., 2023) | 95.3 | 74.3 | 87.6 | 86.9 | 86.0 |
| GPT-4o-0806† (Hurst et al., 2024) | 96.1 | 76.1 | 88.1 | 86.6 | 86.7 |
| Gemini-1.5-pro-0514† (Reid et al., 2024) | 92.3 | 80.6 | 92.0 | 87.9 | 88.2 |
| Self-taught Evaluator§ (Wang et al., 2024b) (Iter 1) | 98.3 | 69.0 | 82.6 | 85.7 | 83.9 |
| Self-taught Evaluator§ (Wang et al., 2024b) (Iter 2) | 97.5 | 75.4 | 81.7 | 89.5 | 86.0 |
| Self-taught Evaluator§ (Wang et al., 2024b) | 96.6 | 84.2 | 91.5 | 81.0 | 88.3 |
| w/ inference scaling, | 96.9 | 84.0 | 91.5 | 82.5 | 88.7 |
| Standard Reward Models | | | | | |
| RM (Stiennon et al., 2020) | 98.3 | 74.5 | 88.0 | 83.8 | 86.4 |
| Cohere-0514† | 96.4 | 71.3 | 92.3 | 97.7 | 89.4 |
| SteerLM-RM 70B† (Wang et al., 2024e) | 91.3 | 80.3 | 92.8 | 90.6 | 88.8 |
| Nemotron-RM 340B† (Adler et al., 2024) | 95.8 | 87.1 | 91.5 | 93.6 | 92.0 |
| (Concurrent Work) Reward Models with Critiques | | | | | |
| SynRM† (Ye et al., 2024) (Reported Best) | 38.0 | 82.5 | 87.1 | 74.1 | 70.4 |
| SynRM (Ye et al., 2024) (Ours) | 97.5 | 76.8 | 88.5 | 86.3 | 87.3 |
| CLoud† (Ankner et al., 2024) (Reported) | 97.0 | 58.0 | 92.0 | 84.0 | 82.8 |
| CLoud (Ankner et al., 2024) (Ours) | 98.0 | 75.6 | 87.6 | 89.0 | 87.6 |
| w/ inference scaling, | 98.0 | 75.2 | 89.3 | 91.5 | 88.5 |
| Critic-RM-Summ | 98.0 | 77.0 | 88.9 | 94.5 | 89.6 |
| w/ inference scaling, | 97.5 | 77.0 | 91.6 | 95.9 | 90.5 |
| Critic-RM-Rank | 97.5 | 79.6 | 90.6 | 94.1 | 90.5 |
| w/ inference scaling, | 97.2 | 80.0 | 91.6 | 95.1 | 91.0 |
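
The Overall column is not defined on this page. A reading that is consistent with most rows is that it is the unweighted mean of the four category scores (Chat, Chat_Hard, Reasoning, Safety), rounded to one decimal. The sketch below checks that assumption against three rows copied from the table; the helper name `overall` and the half-up rounding rule are illustrative choices, not something the source specifies.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall(*scores: str) -> Decimal:
    """Unweighted mean of the category scores, rounded half-up to one decimal.

    Assumption: the table's Overall column is a plain average of the four
    categories. Decimal is used so that values like 90.45 round predictably.
    """
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# (Chat, Chat_Hard, Reasoning, Safety, reported Overall), copied from the table.
rows = {
    "Gemini-1.5-pro-0514": ("92.3", "80.6", "92.0", "87.9", "88.2"),
    "Nemotron-RM 340B": ("95.8", "87.1", "91.5", "93.6", "92.0"),
    "Critic-RM-Rank": ("97.5", "79.6", "90.6", "94.1", "90.5"),
}

for name, (*categories, reported) in rows.items():
    computed = overall(*categories)
    print(f"{name}: computed {computed}, reported {reported}")
```

Not every row reproduces exactly under this assumption: the RM (Stiennon et al., 2020) row averages to 86.15 against a reported 86.4, so the averaging rule should be treated as a reading of the table rather than a stated definition.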