Published on November 29
Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings
Categories: cs.CV, cs.CL, cs.LG, cs.MM
Released Date: November 29, 2024
Authors: Qiong Wu¹, Wenhao Lin¹, Weihao Ye¹, Yiyi Zhou¹, Xiaoshuai Sun¹, Rongrong Ji¹
Aff.: ¹Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

| Method | SEED Accuracy | SEED TFLOPs | MME Score | MME TFLOPs | MMB Accuracy | MMB TFLOPs | POPE Accuracy | POPE TFLOPs | MM-Vet Accuracy | MM-Vet TFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Eagle-X5 7B [45] | 73.9 | 47.8 | 1528.0 | 27.8 | 68.4 | 29.6 | 88.8 | 27.7 | 37.4 | 27.6 |
| Eagle-DyVTE 7B | 73.6 (-0.4%) | 43.0 (-10.0%) | 1581.7 (+3.5%) | 20.3 (-27.0%) | 68.8 (+0.6%) | 23.7 (-19.9%) | 88.4 (-0.5%) | 20.0 (-27.8%) | 37.8 (+1.1%) | 23.5 (-14.9%) |
| VILA 7B [37] | 61.7 | 9.2 | 1489.2 | 8.9 | 69.9 | 9.5 | 86.3 | 8.8 | 36.3 | 8.7 |
| VILA-DyVTE 7B | 61.8 (+0.2%) | 5.9 (-35.9%) | 1503.1 (+0.1%) | 4.6 (-48.3%) | 69.8 (-0.1%) | 6.0 (-36.8%) | 85.6 (-0.8%) | 4.5 (-48.9%) | 36.7 (+1.1%) | 6.6 (-24.1%) |
| InternVL 7B [11] | 59.2 | 16.0 | 1525.1 | 15.5 | 64.6 | 16.2 | 86.4 | 15.4 | 31.2 | 15.4 |
| Intern-DyVTE 7B | 59.1 (-0.2%) | 11.9 (-25.6%) | 1474.1 (-3.3%) | 10.9 (-29.7%) | 64.4 (-0.3%) | 12.0 (-25.9%) | 81.3 (-5.9%) | 10.9 (-29.2%) | 29.5 (-5.4%) | 13.0 (-15.6%) |
| LLaVA-1.5 7B [40] | 58.6 | 9.2 | 1510.7 | 8.9 | 64.3 | 9.6 | 85.9 | 8.8 | 30.5 | 8.7 |
| LLaVA-DyVTE 7B | 58.6 (0.0%) | 5.0 (-45.7%) | 1491.4 (-1.3%) | 6.3 (-29.2%) | 64.7 (+0.6%) | 5.4 (-43.8%) | 81.6 (-5.0%) | 4.1 (-53.4%) | 31.9 (+4.6%) | 6.3 (-27.6%) |
| LLaVA-1.5 13B [40] | 61.6 | 17.6 | 1531.3 | 16.9 | 67.7 | 18.3 | 85.9 | 16.8 | 36.1 | 16.7 |
| LLaVA-DyVTE 13B | 59.3 (-3.7%) | 7.1 (-59.7%) | 1546.4 (+1.0%) | 7.2 (-57.4%) | 66.0 (-2.5%) | 7.8 (-57.4%) | 84.8 (-1.3%) | 7.6 (-54.8%) | 34.8 (-3.6%) | 10.6 (-36.5%) |
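
The percentages in parentheses appear to be relative changes of each DyVTE variant against its baseline row. The page does not state the formula, so the following is a minimal sketch assuming `(value - baseline) / baseline * 100`, rounded to one decimal place. The function name `relative_change` and the dictionary `llava_7b` are illustrative names, not taken from the paper; the numbers are copied from the LLaVA-1.5 7B and LLaVA-DyVTE 7B rows.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percent change of `value` relative to `baseline`, rounded to one decimal.

    Assumed formula (not stated on the page): (value - baseline) / baseline * 100.
    """
    return round((value - baseline) / baseline * 100.0, 1)


# (baseline, DyVTE) pairs copied from the LLaVA-1.5 7B rows of the table.
llava_7b = {
    "SEED accuracy": (58.6, 58.6),
    "SEED TFLOPs": (9.2, 5.0),
    "POPE TFLOPs": (8.8, 4.1),
    "MM-Vet accuracy": (30.5, 31.9),
}

for name, (base, dyvte) in llava_7b.items():
    # The "+" format flag prints an explicit sign, matching the table's style.
    print(f"{name}: {relative_change(base, dyvte):+.1f}%")
```

Because the table's inputs are themselves rounded, a recomputed percentage can differ from the printed one by 0.1 when the exact value falls on a rounding boundary.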