Weighted-Reward Preference Optimization for Implicit Model Fusion
cs.CL
Released Date: December 4, 2024
Authors: Ziyi Yang, Fanqi Wan, Longguang Zhong, Tianyuan Shi, Xiaojun Quan
Affiliation: School of Computer Science and Engineering, Sun Yat-sen University, China

Evaluated with GPT-4-1106-Preview (AlpacaEval-2, Arena-Hard) and GPT-4-0125-Preview (MT-Bench).

| Model | Size | AlpacaEval-2 LC (%) | AlpacaEval-2 WR (%) | Arena-Hard WR (%) | MT-Bench T1 | MT-Bench T2 | MT-Bench Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Source & Target LLMs | | | | | | | |
| Target | 8B | 26.0 | 25.3 | 20.6 | 7.41 | 7.04 | 7.23 |
| Mistral-Large-Instruct-2407 | 123B | 54.3 | 46.8 | 70.4 | 8.83 | 8.31 | 8.57 |
| Gemma2-27B-IT | 27B | 55.5 | 41.0 | 57.5 | 8.34 | 8.03 | 8.19 |
| Qwen2-72B-Instruct | 72B | 38.1 | 29.9 | 46.9 | 8.44 | 7.84 | 8.15 |
| LLaMA3-70B-Instruct | 70B | 34.4 | 33.2 | 46.6 | 8.61 | 7.77 | 8.19 |
| Gemma2-9B-IT | 9B | 51.1 | 38.1 | 40.8 | 8.27 | 7.44 | 7.86 |
| Internlm2.5-20B-Chat | 20B | 37.4 | 45.3 | 31.2 | 8.03 | 7.23 | 7.64 |
| DeepSeek-V2-Chat | 236B | 51.4 | 51.3 | 68.3 | 8.65 | 7.96 | 8.31 |
| DeepSeek-Coder-V2-Instruct | 236B | 50.7 | 54.0 | 66.3 | 8.80 | 7.42 | 8.13 |
| Yi-1.5-34B-Chat | 34B | 37.5 | 44.5 | 42.6 | 7.99 | 7.64 | 7.81 |
| Phi-3-Medium-4K-Instruct | 14B | 29.8 | 24.2 | 33.4 | 8.63 | 7.46 | 8.04 |
| Collective LLMs | | | | | | | |
| PackLLM-Top1-PPL | 849B | 49.1 | 48.0 | 64.8 | 8.29 | 8.20 | 8.25 |
| LLM-Blender-Top1 | 849B | 46.2 | 44.3 | 58.2 | 8.69 | 8.06 | 8.38 |
| MOA | 849B | 61.3 | 77.2 | 83.1 | 9.04 | 8.03 | 8.54 |
| Target-FuseLLM | 8B | 36.0 | 33.8 | 32.1 | 7.53 | 7.13 | 7.33 |
| Target-FuseChat | 8B | 38.1 | 35.2 | 32.7 | 7.68 | 7.07 | 7.38 |
| Preference Optimization Methods | | | | | | | |
| Target-DPO | 8B | 48.2 | 47.5 | 35.2 | 7.68 | 7.23 | 7.46 |
| Target-SimPO | 8B | 53.7 | 47.5 | 36.5 | 7.73 | 7.00 | 7.38 |
| Target-IPO | 8B | 46.8 | 42.4 | 36.6 | 7.89 | 7.19 | 7.54 |
| Our Methods | | | | | | | |
| Target-SFT | 8B | 27.2 | 26.0 | 24.7 | 7.69 | 7.03 | 7.36 |
| Target-SFT-DPO | 8B | 50.7 | 53.1 | 40.2 | 7.98 | 7.23 | 7.61 |
| Target-SFT-WRPO-Medium | 8B | 53.5 | 53.8 | 41.6 | 7.80 | 7.03 | 7.42 |
| Target-SFT-WRPO | 8B | 55.9 | 57.6 | 46.2 | 7.95 | 7.31 | 7.63 |