If You Can't Use Them, Recycle Them: Optimizing Merging at Scale Mitigates Performance Tradeoffs
cs.CL
cs.AI
Released Date: December 5, 2024
Authors: Muhammad Khalifa¹, Yi-Chern Tan, Arash Ahmadian, Tom Hosking, Honglak Lee, Lu Wang, Ahmet Üstün, Tom Sherborne, Matthias Gallé
Affiliation: ¹University of Michigan

| Model ID | Info | MBPP | GSM8K | IFEval | MMLUPro | MUSR | MT-Bench | LBPP |
|---|---|---|---|---|---|---|---|---|
| **Supervised Finetuning** | | | | | | | | |
| 1 | Without MuP | 58.4 | 76.6 | 64.1 | 29.4 | 18.1 | 8.01 | 30.4 |
| 2 | Two stage SFT | 54.4 | 75.3 | 65.7 | 29.0 | 17.9 | 7.74 | 25.5 |
| 3 | Academic + Code data only. 2 epochs | 63.0 | 79.0 | 66.6 | 31.1 | 14.6 | 7.42 | 29.2 |
| 4 | Academic + Code data only. | 63.0 | 68.2 | 59.0 | 31.9 | 10.5 | 7.55 | 26.7 |
| 5 | Academic + Code data only. 2 epochs | 64.0 | 75.7 | 56.5 | 31.1 | 17.0 | 7.68 | 32.9 |
| 6 | Two stage SFT | 60.8 | 76.7 | 65.7 | 29.0 | 17.1 | 7.80 | 31.7 |
| 7 | Two stage SFT | 58.2 | 76.6 | 66.9 | 29.0 | 21.1 | 7.59 | 27.3 |
| 8 | Only Code | 63.4 | 37.8 | 32.1 | 26.0 | 0.0 | 4.75 | 30.4 |
| **Preference Optimization** | | | | | | | | |
| 9 | Light offline Pref | 57.0 | 75.0 | 63.0 | 26.8 | 15.0 | 7.90 | 24.8 |
| 10 | Data-filtered. Offline Pref | 57.4 | 74.0 | 63.0 | 27.5 | 16.0 | 7.85 | 23.0 |
| 11 | Different Preamble | 60.4 | 77.0 | 66.0 | 28.3 | 18.0 | 8.08 | 24.8 |
| 12 | Light offline Pref, with different margin scaling | 56.8 | 81.0 | 72.0 | 28.0 | 19.0 | 8.50 | 28.0 |
| 13 | Full offline Pref | 59.0 | 75.0 | 66.0 | 28.9 | 20.0 | 8.22 | 24.2 |
| 14 | Specific data mix. Offline Pref | 58.8 | 77.0 | 69.0 | 28.5 | 20.0 | 8.04 | 26.1 |
| 15 | Specific data mix. Offline Pref. With warmup | 58.6 | 75.0 | 66.0 | 28.9 | 21.0 | 8.41 | 24.2 |
| 16 | Specific data mix. Offline Pref | 58.6 | 74.0 | 67.0 | 29.0 | 19.0 | 8.37 | 24.2 |