T-REG: Preference Optimization with Token-Level Reward Regularization
Categories: cs.CL, cs.AI, cs.LG
Release Date: December 3, 2024
Authors: Wenxuan Zhou¹, Shujian Zhang, Lingxiao Zhao, Tao Meng
Affiliation: ¹Zoom Communications

Results on AlpacaEval 2.0 and Arena-Hard. The first three metric columns report Llama-3-Instruct (8B); the last three report Gemma-2-Instruct (9B). "LC WR" is the length-controlled win rate and "WR" the raw win rate on AlpacaEval 2.0; "Arena-Hard WR" is the Arena-Hard win rate. All win rates are measured against GPT-4.

| Method | LC WR (AlpacaEval 2.0) | WR (AlpacaEval 2.0) | Arena-Hard WR | LC WR (AlpacaEval 2.0) | WR (AlpacaEval 2.0) | Arena-Hard WR |
|---|---|---|---|---|---|---|
| SFT | 26.0 | 25.3 | 22.3 | 51.1 | 38.1 | 40.8 |
| RTO | 49.2 | 47.2 | 37.6 | 67.6 | 68.7 | 63.1 |
| SePO | 48.5 | 45.5 | 33.1 | - | - | - |
| TDPO1 | 45.4 | 37.4 | 29.1 | 66.5 | 56.6 | 46.7 |
| TDPO2 | 43.2 | 40.2 | 34.0 | 64.3 | 59.3 | 55.7 |
| DPO | 47.0 | 46.0 | 35.9 | 68.9 | 66.9 | 58.6 |
| SimPO | 52.5 | 47.1 | 33.1 | 73.5 | 70.7 | 63.0 |
| DPO-REG | 50.8 | 51.1 | 40.3 | 70.3 | 66.4 | 60.2 |
| SimPO-REG | 53.8 | 48.8 | 34.4 | 74.5 | 70.5 | 64.2 |