MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
Subjects: cs.LG, cs.AI
Release Date: November 1, 2024
Authors: Cheng Yang$^1$, Yang Sui$^2$, Jinqi Xiao$^1$, Lingyi Huang$^1$, Yu Gong$^1$, Yuanlin Duan$^1$, Wenqi Jia$^3$, Miao Yin$^3$, Yu Cheng$^4$, Bo Yuan$^1$
Affiliations: $^1$Rutgers University; $^2$Rice University; $^3$The University of Texas at Arlington; $^4$The Chinese University of Hong Kong

Zero-shot accuracy (%) on seven benchmarks. "Param Reduction" is the fraction of model parameters removed (0% = uncompressed baseline); P = inter-expert pruning, F = fine-tuning, D = intra-expert low-rank decomposition.

| Model | Method | Param Reduction | ARC-c | ARC-e | BoolQ | HellaSwag | OBQA | RTE | WinoGrande | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Mixtral-8×7B | baseline | 0% | 57.17 | 84.01 | 85.35 | 64.88 | 35.00 | 70.40 | 75.93 | 67.53 |
| Mixtral-8×7B | P | 25% | 51.79 | 81.36 | 84.07 | 61.99 | 32.80 | 71.12 | 75.85 | 65.57 |
| Mixtral-8×7B | P+F | 25% | 56.23 | 82.49 | 86.42 | 64.48 | 36.00 | 72.92 | 74.98 | 67.65 |
| Mixtral-8×7B | P+D | 51.79% | 40.70 | 71.51 | 67.83 | 45.34 | 26.00 | 61.37 | 67.56 | 54.33 |
| Mixtral-8×7B | MoE-I$^2$ | 51.79% | 52.20 | 78.22 | 82.62 | 61.07 | 34.00 | 72.20 | 71.50 | 64.55 |
| Qwen | baseline | 0% | 41.89 | 73.11 | 79.76 | 57.90 | 30.40 | 70.04 | 68.67 | 60.25 |
| Qwen | P | 25% | 38.57 | 70.37 | 73.30 | 55.84 | 29.80 | 64.98 | 67.25 | 57.16 |
| Qwen | P+F | 25% | 45.14 | 75.93 | 78.01 | 57.83 | 32.80 | 71.12 | 68.51 | 61.33 |
| Qwen | P+D | 53.98% | 37.71 | 65.91 | 71.41 | 49.34 | 29.40 | 64.26 | 67.88 | 55.13 |
| Qwen | MoE-I$^2$ | 53.98% | 41.13 | 71.68 | 75.08 | 53.08 | 30.80 | 66.43 | 66.54 | 57.82 |
| DeepSeek | baseline | 0% | 46.93 | 78.37 | 79.82 | 58.70 | 34.60 | 60.65 | 71.35 | 61.49 |
| DeepSeek | P | 25% | 45.31 | 74.62 | 67.95 | 57.38 | 33.20 | 59.93 | 70.01 | 58.34 |
| DeepSeek | P+F | 25% | 47.44 | 78.16 | 79.79 | 60.32 | 35.40 | 74.56 | 71.35 | 63.86 |
| DeepSeek | P+D | 53.98% | 38.48 | 71.42 | 70.09 | 48.15 | 27.80 | 60.65 | 65.98 | 54.65 |
| DeepSeek | MoE-I$^2$ | 53.98% | 42.58 | 71.80 | 76.79 | 55.16 | 32.60 | 70.76 | 67.64 | 59.62 |
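
The "Param Reduction" ratios above come from two compounding steps: dropping whole experts (inter-expert pruning) and factorizing each remaining expert's weight matrices (intra-expert low-rank decomposition). Below is a minimal sketch of the low-rank step, assuming a plain truncated SVD and toy layer shapes; the paper's actual decomposition and rank-selection scheme may differ.

```python
import torch

# Hypothetical shapes for one expert's up-projection (illustrative, not from the paper).
d_model, d_ff, rank = 1024, 4096, 256

W = torch.randn(d_ff, d_model)  # the expert weight to compress

# Truncated SVD: W ≈ (U_r * S_r) @ V_r, keeping the top-`rank` singular values.
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * S[:rank]   # (d_ff, rank) — singular values folded into the left factor
B = Vh[:rank, :]             # (rank, d_model)

# Random matrices compress poorly; trained weight matrices are typically
# far closer to low-rank, so this error overstates the real-world loss.
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)

# Parameter arithmetic behind a "Param Reduction" figure for this single layer.
params_dense = W.numel()                 # d_ff * d_model
params_lowrank = A.numel() + B.numel()   # rank * (d_ff + d_model)
print(f"relative Frobenius error: {rel_err:.3f}")
print(f"param reduction: {1 - params_lowrank / params_dense:.2%}")
```

For a `d_ff × d_model` matrix, the factorized form stores `rank * (d_ff + d_model)` parameters instead of `d_ff * d_model`, so the per-layer reduction is set entirely by the chosen rank; the table's overall ratios (51.79%, 53.98%) fold in the expert pruning as well.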