Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models
cs.CV
Released Date: December 9, 2024
Authors: Wei Suo¹, Ji Ma¹, Mengyang Sun¹, Lin Yuanbo Wu², Peng Wang¹, Yanning Zhang¹
Affiliations: ¹Northwestern Polytechnical University; ²Swansea University

LVLMs Are Retrained (Not Directly Comparable)

| Method | TFLOPs | Average | AOKVQA | SQA | MME | POPE | MMB | MMB-CN | LLaVA-W | SEED-I |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B [26] | 11.05 (100%) | 70.5 | 77.8 | 70.8 | 1467 | 86.1 | 65.3 | 59.4 | 65.5 | 66.7 |
| + RoE-LLaVA [44] | 8.29 (75.0%) | - | - | 68.7 | - | - | 64.6 | - | - | 57.8 |
| + TokenPacker [23] | 6.12 (55.4%) | - | - | - | - | 87.0 | 65.1 | - | - | - |
| + LLaVolta [7] | 5.78 (51.4%) | 71.0 | 77.7∗ | 70.5 | 1472 | 86.3 | 65.6 | 59.9 | 68.2 | 66.1 |
| + [6] | 4.93 (44.6%) | - | - | - | - | 85.5 | 64.8 | - | - | 58.0 |
| + PruMerge [36] | 4.88 (44.2%) | - | - | 68.5 | 1350 | 76.3 | 60.9 | - | - | - |

LVLMs Are Frozen (Comparable Results)

| Method | TFLOPs | Average | AOKVQA | SQA | MME | POPE | MMB | MMB-CN | LLaVA-W | SEED-I |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5-7B [26] (T=576, L=32) | 11.05 (100%) | 70.5 | 77.8 | 70.8 | 1467 | 86.1 | 65.3 | 59.4 | 65.5 | 66.7 |
| + Random Dropping [7] | 5.78 (51.4%) | 53.5 | 72.7∗ | 69.3 | 1142 | 55.8 | 39.7 | 33.3 | 47.6 | 52.2 |
| + ShortGPT [30]∗ | 8.30 (75.1%) | 53.9 | 74.2 | 64.6 | 964 | 69.7 | 50.4 | 37.5 | 52.9 | 54.3 |
| + LLaVolta [7] (test) | 5.78 (51.4%) | 60.8 | 74.9∗ | 69.4 | 1150 | 70.1 | 56.4 | 46.5 | 55.6 | 55.7 |
| + FastV [8] | 5.78 (51.4%) | 62.5 | 75.5 | 69.4 | 1298 | 65.6 | 60.1 | 53.0 | 54.8 | 56.3 |
| + Ours (T=272, L=30) | 7.31 (66.2%) | 70.0 (+7.5) | 78.0 (+2.5) | 70.7 (+1.3) | 1448 (+150) | 85.9 (+15.8) | 64.8 (+4.7) | 56.5 (+3.5) | 65.3 (+9.7) | 66.4 (+10.1) |
| + Ours (T=224, L=28) | 6.40 (57.9%) | 67.9 (+5.4) | 77.7 (+2.2) | 70.4 (+1.0) | 1351 (+53) | 80.6 (+10.5) | 62.6 (+2.5) | 54.5 (+1.5) | 64.8 (+9.2) | 65.2 (+8.9) |
| + Ours (T=144, L=28) | 5.67 (51.3%) | 66.4 (+3.9) | 77.7 (+2.2) | 70.0 (+0.6) | 1300 (+2) | 76.1 (+6.0) | 62.2 (+2.1) | 53.8 (+0.8) | 63.8 (+8.2) | 62.7 (+6.4) |
| + Ours (T=176, L=24) | 5.16 (46.7%) | 65.7 (+3.2) | 77.6 (+2.1) | 69.8 (+0.4) | 1292 (−6) | 75.9 (+5.8) | 61.1 (+1.0) | 53.1 (+0.1) | 63.6 (+8.0) | 60.2 (+3.9) |
| + Ours (T=128, L=24) | 4.78 (43.3%) | 64.6 (+2.1) | 76.4 (+0.9) | 69.6 (+0.2) | 1286 (−12) | 73.6 (+3.5) | 60.2 (+0.1) | 50.7 (−2.3) | 62.5 (+6.9) | 59.4 (+3.1) |
| Qwen-VL-Chat-9B [3] (T=256, L=32) | 9.27 (100%) | 70.1 | 75.6 | 68.2 | 1487 | 86.5 | 60.6 | 56.7 | 73.5 | 65.4 |
| + Random Dropping [7]∗ | 7.43 (80.1%) | 59.3 | 70.1 | 64.9 | 1138 | 80.2 | 44.3 | 37.6 | 62.5 | 57.7 |
| + ShortGPT [30]∗ | 8.39 (90.5%) | 58.8 | 63.5 | 52.6 | 1398 | 81.1 | 46.6 | 39.2 | 61.3 | 56.0 |
| + LLaVolta [7]∗ (test) | 7.43 (80.1%) | 63.2 | 71.6 | 65.3 | 1336 | 80.8 | 51.1 | 45.8 | 64.0 | 59.8 |
| + FastV [8]∗ | 7.43 (80.1%) | 64.7 | 72.2 | 65.9 | 1405 | 81.4 | 53.5 | 49.1 | 64.8 | 60.1 |
| + Ours (T=128, L=30) | 7.63 (82.3%) | 68.5 (+3.8) | 76.1 (+3.9) | 67.0 (+1.1) | 1474 (+69) | 83.3 (+1.9) | 58.9 (+5.4) | 55.6 (+6.5) | 71.2 (+6.4) | 62.3 (+2.2) |
| + Ours (T=102, L=30) | 7.39 (79.7%) | 67.6 (+2.9) | 74.7 (+2.5) | 66.8 (+0.9) | 1464 (+59) | 82.5 (+1.1) | 57.3 (+3.8) | 55.1 (+6.0) | 69.3 (+4.5) | 61.7 (+1.6) |
| + Ours (T=72, L=28) | 6.79 (73.3%) | 65.8 (+1.1) | 73.5 (+1.3) | 66.3 (+0.4) | 1425 (+20) | 81.9 (+0.5) | 54.3 (+0.8) | 50.3 (+1.2) | 68.5 (+3.7) | 60.2 (+0.1) |
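
The table does not say how its two derived quantities are computed, so the following is an inference from the numbers, not a statement from the paper. The small number after each score in the "Ours" rows is consistent with the gap to the best frozen baseline in the same column, and the percentage after a TFLOPs value appears to be its ratio to the unpruned model's TFLOPs. A minimal Python sketch reproduces both for the LLaVA-1.5-7B row with T=272, L=30, using five of the columns. The function names `relative_cost` and `delta_vs_best_baseline` are illustrative only.

```python
# Values copied from the frozen-LVLM table for LLaVA-1.5-7B (subset of columns).
FULL_TFLOPS = 11.05  # unpruned LLaVA-1.5-7B

COLUMNS = ["Average", "AOKVQA", "SQA", "MME", "POPE"]

FROZEN_BASELINES = {
    "Random Dropping": [53.5, 72.7, 69.3, 1142, 55.8],
    "ShortGPT": [53.9, 74.2, 64.6, 964, 69.7],
    "LLaVolta (test)": [60.8, 74.9, 69.4, 1150, 70.1],
    "FastV": [62.5, 75.5, 69.4, 1298, 65.6],
}

# "Ours (T=272, L=30)" row.
OURS = {"tflops": 7.31, "scores": [70.0, 78.0, 70.7, 1448, 85.9]}


def relative_cost(tflops, full=FULL_TFLOPS):
    """Percentage of the unpruned model's TFLOPs, rounded to one decimal."""
    return round(100.0 * tflops / full, 1)


def delta_vs_best_baseline(scores, baselines):
    """Per-column gap to the strongest baseline (assumed interpretation)."""
    deltas = {}
    for i, name in enumerate(COLUMNS):
        best = max(row[i] for row in baselines.values())
        deltas[name] = round(scores[i] - best, 1)
    return deltas


if __name__ == "__main__":
    print(relative_cost(OURS["tflops"]))
    print(delta_vs_best_baseline(OURS["scores"], FROZEN_BASELINES))
```

Under this reading a negative gap means the pruned model falls below the best baseline on that benchmark, as happens for MME in the T=128, L=24 row (1286 against FastV's 1298).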