A Comprehensive Survey of Datasets, Theories, Variants, and Applications in Direct Preference Optimization
Subjects: cs.SD, cs.AI, eess.AS
Release Date: October 21, 2024
Authors: Wenyi Xiao¹, Zechuan Wang¹, Leilei Gan¹, Shuai Zhao², Wanggui He³, Luu Anh Tuan², Long Chen³, Hao Jiang³, Zhou Zhao¹, Fei Wu¹
Affiliations: ¹Zhejiang University, China; ²Nanyang Technological University, Singapore; ³Alibaba Group, China
The survey organizes DPO and its variants by the research question (RQ) each one addresses:

| Research Question | Methods |
|---|---|
| RQ0: Why DPO? | PPO (Ouyang et al., 2022a); DPO (Rafailov et al., 2023) |
| RQ1: Generalization ability of the reward model? | IPO (Azar et al., 2023) |
| RQ2: Rewarding signals? | RRHF (Yuan et al., 2023); SPIN (Chen et al., 2024a); Step-DPO (Lai et al., 2024); T-DPO (Zeng et al., 2024); KTO (Ethayarajh et al., 2024) |
| RQ3: Coefficient and reference model? | β-DPO (Wu et al., 2024a); CPO (Xu et al., 2024b); ORPO (Hong et al., 2024) |
| RQ4: Training strategy of DPO? | OPTune (Chen et al., 2024b); IRPO (Pang et al., 2024) |
| RQ5: Reward hacking? | R-DPO (Park et al., 2024); LD-DPO (Liu et al., 2024d); SimPO (Meng et al., 2024) |
| RQ6: Alignment tax? | SPO (Lou et al., 2024a) |
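
For context, all of these variants build on the original DPO objective (Rafailov et al., 2023); reproducing that standard formula here (it is not quoted from the survey text above) makes the RQ axes concrete:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
$$

where $y_w$ and $y_l$ are the preferred and dispreferred responses to prompt $x$. Each RQ targets a different component of this objective: RQ1 and RQ2 concern the implicit reward $\beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ and the granularity of the preference signal, RQ3 concerns the coefficient $\beta$ and the frozen reference model $\pi_{\mathrm{ref}}$, and RQ5 methods such as SimPO counter length-driven reward hacking by dropping $\pi_{\mathrm{ref}}$ in favor of a length-normalized reward.

As a concrete illustration (a minimal sketch assuming per-sequence log-probabilities are already computed; the function name and signature are illustrative, not any particular library's API), the loss reduces to a log-sigmoid over the reward margin:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Negative log-sigmoid of the chosen-minus-rejected reward margin,
    # averaged over the batch.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```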