Towards Improved Preference Optimization Pipeline: from Data Generation to Budget-Controlled Regularization
Categories: cs.LG, cs.AI, cs.CL
Released Date: November 7, 2024
Authors: Zhuotong Chen, Fang Liu, Jennifer Zhu, Wanyu Du, Yanjun Qi
Affiliation: AWS Bedrock Science

| Llama-3.1-Instruct (8B) | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Alpaca-Eval (Base Model: 47.64) | | | | | | | | |
| | DPO | IPO | SimPO | DPO-BCR | IPO-BCR | SimPO-BCR | CPO | DPOP |
| Armo Llama3 | 58.07 | 57.00 | 65.16 | / | / | / | 55.71 | 48.94 |
| IPR(Llama8B) | 72.86 | 69.94 | 66.77 | / | / | / | 82.86 | 54.66 |
| IPR(Llama70B) | 73.11 | 71.30 | 85.32 | 74.35 | 72.92 | 85.90 | 79.69 | 54.16 |
| Arena-Hard (Base Model: 71.44) | | | | | | | | |
| | DPO | IPO | SimPO | DPO-BCR | IPO-BCR | SimPO-BCR | CPO | DPOP |
| Armo Llama3 | 79.90 | 78.10 | 84.10 | / | / | / | 74.00 | 71.30 |
| IPR(Llama8B) | 80.70 | 82.40 | 80.00 | / | / | / | 85.90 | 71.60 |
| IPR(Llama70B) | 80.50 | 80.40 | 89.30 | 79.30 | 79.50 | 89.30 | 83.37 | 73.90 |
| Mistral-Instruct (7B) | | | | | | | | |
| Alpaca-Eval (Base Model: 25.03) | | | | | | | | |
| | DPO | IPO | SimPO | DPO-BCR | IPO-BCR | SimPO-BCR | CPO | DPOP |
| Armo Llama3 | | | | / | / | / | 28.79 | 28.70 |
| IPR(Llama8B) | 60.34 | 58.30 | 57.35 | / | / | / | 47.39 | 41.98 |
| IPR(Llama70B) | 67.75 | 65.49 | 61.06 | 67.40 | 65.52 | 64.99 | 48.63 | 41.28 |
| Arena-Hard (Base Model: 56.70) | | | | | | | | |
| | DPO | IPO | SimPO | DPO-BCR | IPO-BCR | SimPO-BCR | CPO | DPOP |
| Armo Llama3 | | | | / | / | / | 62.00 | 62.90 |
| IPR(Llama8B) | 68.70 | 65.20 | 67.40 | / | / | / | 67.20 | 66.93 |
| IPR(Llama70B) | 71.80 | 71.70 | 70.84 | 71.53 | 71.20 | 63.10 | 71.10 | 65.40 |