Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation
cs.AI
Release Date: November 10, 2024
Authors: Yu-Liang Zhan¹, Zhong-Yi Lu², Hao Sun¹, Ze-Feng Gao²
Affiliations: ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²School of Physics, Renmin University of China

| Method | RTE Acc. | MRPC F1/Acc. | STS-B Corr. | CoLA Mcc. | SST-2 F1/Acc. | QNLI Acc. | QQP F1/Acc. | MNLI Acc. | Avg. | # Train Params (M) | # Inference Params (M) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BERT-base [2] | 70.5 | 86.5/81.8 | 86.6 | 54.2 | 92.0 | 91.2 | 88.0/91.0 | 84.2 | 83.4 | 110 | 110 |
| BERT-of-Theseus [56] | |||||||||||
| None | 65.5 | 85.3/79.6 | 86.2 | 39.2* | 90.4 | 88.7 | 86.1/89.6 | 81.5 | 79.2 | 66 | 66 |
| +SVD | 65.5 | 85.4/80.0 | 86.5 | 43.1 | 90.6 | 88.6 | 86.2/89.7 | 80.3 | 79.6 | 90 | 66 |
| +OPDF (Ours) | 66.2 | 85.9/80.5 | 88.6 | 45.2 | 91.3 | 89.0 | 86.8/90.2 | 81.4 | 80.5 | 160 | 66 |
| LGTM [44] | |||||||||||
| None | 63.3 | 86.3/80.1 | 82.9* | 33.9* | 91.1 | 89.3 | 88.0/91.1 | 82.2 | 78.8 | 67 | 67 |
| +SVD | 64.7 | 86.8/81.9 | 83.1 | 37.4 | 91.2 | 88.6 | 86.5/89.4 | 79.3 | 78.9 | 91 | 67 |
| +OPDF (Ours) | 66.9 | 87.8/82.4 | 83.3 | 38.9 | 91.5 | 88.7 | 87.0/90.2 | 80.9 | 79.8 | 163 | 67 |
| DBKD [45] | |||||||||||
| None | 61.2 | 83.3/75.5 | / | 25.2 | 88.1 | 86.1 | 85.3/88.7 | 76.1 | 74.4 | 53 | 53 |
| +SVD | 64.7 | 86.5/78.6 | / | 26.4 | 88.8 | 85.8 | 85.5/89.0 | 76.5 | 75.8 | 69 | 53 |
| +OPDF (Ours) | 69.1 | 88.4/83.3 | / | 27.2 | 89.8 | 86.5 | 86.9/90.2 | 77.7 | 77.6 | 83 | 53 |
| AD-KD [46] | |||||||||||
| None | 68.8 | 88.7/84.3 | 89.3 | 53.1 | 91.5 | 90.8 | 85.9/89.5 | 81.7 | 82.4 | 67 | 67 |
| +SVD | 69.4 | 89.3/85.8 | 88.8 | 53.5 | 89.9 | 90.1 | 86.4/89.8 | 81.5 | 82.6 | 91 | 67 |
| +OPDF (Ours) | 71.7 | 90.3/86.8 | 88.9 | 55.0 | 91.3 | 91.1 | 86.8/90.0 | 82.1 | 83.4 | 182 | 67 |
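
In the table above, the +SVD and +OPDF rows list more training parameters than inference parameters, while the inference count matches the unmodified student. One way to read this gap is sketched below. It is a minimal illustration and not the authors' implementation. It assumes that a dense weight matrix is factored with a plain SVD, that the factors are the trainable tensors, and that they are contracted back into a single matrix before inference. The function names and the 64×32 layer shape are hypothetical.

```python
import numpy as np

def overparameterize_svd(weight):
    """Factor a dense weight matrix into three factors (assumed trainable)."""
    # Reduced SVD: weight = u @ diag(s) @ vt
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    return u, s, vt

def merge_factors(u, s, vt):
    """Contract the factors back into one dense matrix for inference."""
    return (u * s) @ vt  # scales the columns of u by s, then multiplies by vt

def count_params(*arrays):
    """Total number of scalar entries across the given arrays."""
    return sum(a.size for a in arrays)

rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 32))  # hypothetical layer shape

u, s, vt = overparameterize_svd(weight)
train_params = count_params(u, s, vt)    # 64*32 + 32 + 32*32 entries
merged = merge_factors(u, s, vt)
inference_params = count_params(merged)  # 64*32 entries, same as the original

print(train_params, inference_params)
```

Under these assumptions the factored form holds more parameters during training, and the merged matrix has exactly the original shape, so the inference cost is unchanged. The much larger training counts in the +OPDF rows suggest a decomposition with more factors than a plain SVD, which this sketch does not attempt to reproduce.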