AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning
cs.CV
cs.AI
cs.CL
Released Date: December 4, 2024
Authors: Yiwu Zhong¹, Zhuoming Liu², Yin Li², Liwei Wang¹
Affiliations: ¹The Chinese University of Hong Kong; ²University of Wisconsin-Madison

| Model | FLOPs (TB) | Prefill Time (ms) | VideoMME (wo / w-subs) | MVBench (test) | MLVU (m-avg) | EgoSchema (test) | NextQA (mc) | PerceptionTest (val) |
|---|---|---|---|---|---|---|---|---|
| Video LLMs | | | | | | | | |
| VILA-40B [37] | - | - | 60.1 / 61.1 | - | - | 58.0 | 67.9 | 54.0 |
| PLLaVA-34B [77] | - | - | - | 58.1 | - | - | - | - |
| LLaVA-N-Video-32B [91] | - | - | 60.2 / 63.0 | - | 65.5 | 60.9 | 77.3 | 59.4 |
| IXC-2.5-7B [87] | - | - | 55.8 / 58.8 | 69.1 | 37.3 | - | 71.0 | 34.4 |
| LongVA-7B [88] | 381.09 | 2186.04 | 52.6 / 54.3 | - | 56.3 | - | 68.3 | - |
| LLaVA-OV-7B [30] | 99.63 | 439.58 | 58.2 / 61.5 | 56.7 | 64.7 | 60.1 | 79.4 | 57.1 |
| Training-free Method Applied during Inference | | | | | | | | |
| FastV [5] | 21.24 | 79.56 | 55.9 / 60.0 | 55.9 | 61.1 | 57.5 | 77.5 | 56.3 |
| LLaVA-Prumerge [59] | 23.65 | 86.89 | 57.0 / 59.9 | 56.5 | 60.6 | 61.0 | 77.6 | 55.8 |
| Ours | 14.76 | 55.03 | 58.2 / 61.3 | 57.1 | 63.7 | 59.6 | 78.4 | 56.0 |
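
To make the efficiency columns easier to compare, the following minimal sketch computes how much each training-free method reduces FLOPs and prefill time. It assumes that the LLaVA-OV-7B row is the baseline the training-free methods are applied to, which the table does not state explicitly. The helper name `reduction_factors` is hypothetical, and the numbers are copied from the table.

```python
# FLOPs (TB) and prefill time (ms), copied from the table above.
rows = {
    "LLaVA-OV-7B": (99.63, 439.58),
    "FastV": (21.24, 79.56),
    "LLaVA-Prumerge": (23.65, 86.89),
    "Ours": (14.76, 55.03),
}


def reduction_factors(rows, baseline="LLaVA-OV-7B"):
    """Return (FLOPs ratio, prefill-time ratio) of the baseline over each other row.

    Assumption: `baseline` is the model the training-free methods were
    applied to; the table itself does not say so.
    """
    base_flops, base_ms = rows[baseline]
    return {
        name: (base_flops / flops, base_ms / ms)
        for name, (flops, ms) in rows.items()
        if name != baseline
    }


ratios = reduction_factors(rows)
for name, (flops_ratio, time_ratio) in ratios.items():
    print(f"{name}: {flops_ratio:.2f}x fewer FLOPs, {time_ratio:.2f}x faster prefill")
```

Under that assumption, the "Ours" row corresponds to roughly a 6.75x reduction in FLOPs relative to LLaVA-OV-7B, with the benchmark scores in the same row staying within about one point of the baseline.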