Instruction Tuning With Loss Over Instructions
NeurIPS
Released Date: May 11, 2024
Authors: Zhengyan Shi (1), Adam X. Yang (2), Bin Wu (1), Laurence Aitchison (2), Emine Yilmaz (1), Aldo Lipani (1)
Affiliations: (1) University College London; (2) University of Bristol
Paper (OpenReview): https://openreview.net/pdf/2489a69de5d25fdd19c9250d4ef90033722cebc1.pdf

Columns from Understanding & Knowledge through Mean are NLP benchmarks; MT-Bench, AlpacaEval 1.0, and AlpacaEval 2.0 are LLM-based evaluations. Values in parentheses are differences from the IT row of the same dataset.

| Method | Understanding & Knowledge | Multilinguality | Commonsense Reasoning | Math & Code Reasoning | BBH | Safety & Helpfulness | Mean | MT-Bench | AlpacaEval 1.0 | AlpacaEval 2.0 |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama-2-Base | 63.91 | 61.99 | 75.86 | 13.32 | 38.80 | 42.03 | 49.32 | 1.16 | 0.01 | 0.01 |
| Llama-2-Chat | 63.42 | 55.15 | 70.28 | 15.33 | 38.92 | 51.79 | 49.15 | 6.63 | 79.04 | 6.48 |
| Alpagasus Alpaca 5k (5,305 training examples) | | | | | | | | | | |
| IT | 64.98 | 57.24 | 66.06 | 8.93 | 26.80 | 47.74 | 45.29 | 3.62 | 16.29 | 2.46 |
| Neftune | 65.18 | 56.88 | 66.45 | 10.24 | 29.53 | 45.46 | 45.62 (+0.33) | 3.50 (-0.12) | 21.37 (+5.08) | 2.37 (-0.09) |
| IM (ours) | 64.01 | 56.63 | 72.47 | 11.58 | 35.52 | 44.62 | 47.47 (+2.18) | 3.48 (-0.14) | 19.52 (+3.23) | 3.29 (+0.83) |
| Alpagasus Dolly 3k (2,996 training examples) | | | | | | | | | | |
| IT | 65.81 | 57.46 | 67.55 | 11.96 | 33.02 | 43.70 | 46.58 | 4.23 | 13.42 | 2.00 |
| Neftune | 65.90 | 57.79 | 67.28 | 11.64 | 35.43 | 44.36 | 47.07 (+0.49) | 4.42 (+0.19) | 14.04 (+0.62) | 2.03 (+0.03) |
| IM (ours) | 65.66 | 57.47 | 73.24 | 14.57 | 37.48 | 45.29 | 48.95 (+2.37) | 4.06 (-0.17) | 15.11 (+1.69) | 2.44 (+0.44) |
| Alpagasus Dolly 9k (9,229 training examples) | | | | | | | | | | |
| IT | 64.10 | 56.62 | 69.70 | 7.96 | 32.19 | 42.65 | 45.54 | 4.33 | 21.54 | 2.28 |
| Neftune | 64.20 | 56.69 | 69.51 | 8.99 | 33.91 | 42.62 | 45.99 (+0.45) | 4.21 (-0.12) | 31.61 (+10.07) | 2.84 (+0.56) |
| IM (ours) | 64.67 | 55.32 | 74.87 | 12.50 | 36.69 | 43.96 | 48.00 (+2.46) | 4.55 (+0.22) | 30.77 (+9.23) | 2.67 (+0.39) |
| Less Tydiqa (13,533 training examples) | | | | | | | | | | |
| IT | 64.01 | 56.81 | 64.77 | 12.06 | 36.54 | 55.09 | 48.21 | 4.08 | 5.12 | 1.88 |
| Neftune | 64.03 | 55.09 | 64.02 | 13.84 | 36.65 | 51.21 | 47.47 (-0.74) | 4.19 (+0.11) | 8.35 (+3.23) | 2.58 (+0.70) |
| IM (ours) | 64.28 | 56.10 | 65.70 | 17.15 | 34.86 | 54.09 | 48.70 (+0.49) | 4.36 (+0.28) | 10.10 (+4.98) | 2.88 (+1.00) |
| Less MMLU Chat (13,533 training examples) | | | | | | | | | | |
| IT | 64.74 | 57.42 | 62.94 | 9.53 | 33.13 | 55.35 | 47.18 | 3.86 | 4.42 | 1.20 |
| Neftune | 65.21 | 57.43 | 63.14 | 9.45 | 35.89 | 55.32 | 47.74 (+0.56) | 4.06 (+0.20) | 6.22 (+1.80) | 1.06 (-0.14) |
| IM (ours) | 63.95 | 56.34 | 64.76 | 12.52 | 36.94 | 52.55 | 47.84 (+0.66) | 4.54 (+0.68) | 9.78 (+5.36) | 1.93 (+0.73) |
| Less BBH ICL (13,533 training examples) | | | | | | | | | | |
| IT | 63.83 | 62.04 | 75.92 | 6.90 | 38.93 | 42.07 | 48.28 | 4.78 | 36.20 | 2.36 |
| Neftune | 63.88 | 58.83 | 67.97 | 13.54 | 38.63 | 51.33 | 49.03 (+0.75) | 5.05 (+0.27) | 39.81 (+3.61) | 2.87 (+0.51) |
| IM (ours) | 64.14 | 56.72 | 71.12 | 13.56 | 39.03 | 50.34 | 49.15 (+0.87) | 5.03 (+0.25) | 44.15 (+7.95) | 3.56 (+1.20) |
| LIMA (1,030 training examples, 10 epochs) | | | | | | | | | | |
| IT | 63.92 | 58.29 | 71.96 | 16.01 | 39.27 | 43.29 | 48.79 | 4.77 | 33.06 | 2.58 |
| Neftune | 63.66 | 57.67 | 73.03 | 15.95 | 38.77 | 43.14 | 48.70 (-0.09) | 4.79 (+0.02) | 30.51 (-2.55) | 2.43 (-0.15) |
| IM (ours) | 64.49 | 58.21 | 75.55 | 17.06 | 38.84 | 43.45 | 49.60 (+0.81) | 4.83 (+0.06) | 32.94 (-0.12) | 2.47 (-0.11) |
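
The arithmetic behind the Mean column, and behind the difference value reported alongside each Neftune and IM score, can be checked with a short sketch. It assumes that Mean is the unweighted average of the six NLP benchmark columns and that each difference is taken against the IT row of the same dataset; both assumptions are consistent with the numbers in the table, but neither is stated in this excerpt. The names `nlp_scores`, `mean_score`, and `delta_vs_it` are hypothetical helpers, not part of the paper's code.

```python
# Six NLP benchmark scores per method, copied from the
# Alpagasus Alpaca 5k rows of the table (Understanding & Knowledge,
# Multilinguality, Commonsense, Math & Code, BBH, Safety & Helpfulness).
nlp_scores = {
    "IT":      [64.98, 57.24, 66.06, 8.93, 26.80, 47.74],
    "Neftune": [65.18, 56.88, 66.45, 10.24, 29.53, 45.46],
    "IM":      [64.01, 56.63, 72.47, 11.58, 35.52, 44.62],
}


def mean_score(scores):
    """Unweighted average of the benchmark scores, rounded to two decimals."""
    return round(sum(scores) / len(scores), 2)


def delta_vs_it(method):
    """Signed difference between a method's Mean and the IT baseline's Mean.

    Assumption: the difference is computed from the rounded Mean values.
    """
    return round(mean_score(nlp_scores[method]) - mean_score(nlp_scores["IT"]), 2)


for method in nlp_scores:
    print(method, mean_score(nlp_scores[method]), f"{delta_vs_it(method):+.2f}")
```

Under these assumptions the sketch reproduces the Mean values 45.29, 45.62, and 47.47 for this dataset, and the differences of 0.33 (Neftune) and 2.18 (IM) relative to IT.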