FocusLLaVA: A Coarse-to-Fine Approach for Efficient and Effective Visual Token Compression
cs.CV
Released Date: November 21, 2024
Authors: Yuke Zhu¹, Chi Xie², Shuang Liang², Bo Zheng¹, Sheng Guo¹
Affiliations: ¹Mybank, Ant Group; ²Tongji University

| Method | LLM | VQA^T | SQA | GQA | POPE | MM-Vet | LLaVA^W | MME^P | MME^C | MMB | MMB^C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Instruct-BLIP [14] | Vicuna-7B | 50.1 | - | 49.2 | - | 26.2 | 60.9 | 1084 | 229 | 36.0 | 23.7 |
| Qwen-VL [3] | Qwen-7B | 63.8 | - | 59.3 | - | - | - | 1487.6 | - | 60.6 | 7.4 |
| LLaVA-1.5 [34] | Vicuna-7B | 58.2 | - | 62.0 | 85.9 | 30.5 | 65.4 | 1510 | - | 64.3 | 58.3 |
| LLaVA-1.5 | Llama3-8B | 58.9 | - | 61.9 | 85.1 | 34.8 | 70.5 | 1544 | 328 | 72.9 | 67.7 |
| mPlugOwl3 [49] | Qwen-8B | 69.0 | - | 65.0 | 88.2 | 40.1 | - | - | - | 77.6 | 74.3 |
| Otter-HD [24] | Fuyu-8B | - | - | - | 86.0 | - | - | 1223 | 331 | 58.3 | - |
| LLaVA-NeXT [35] | Vicuna-7B | 64.9 | 70.1 | 64.2 | 86.5 | - | - | 1519 | 332 | 67.4 | 60.6 |
| Mini-Gemini-HD [27] | Vicuna-7B | 68.4 | - | - | - | 41.3 | - | 1546 | 319 | 65.8 | - |
| SliME [54] | Llama3-8B | 64.7 | 84.2 | 63.9 | - | 37.4 | 73.9 | 1578 | 337 | 75.0 | 71.8 |
| LLaVA-Prumerge+ [43] | Vicuna-7B | 57.1 | 68.3 | - | 84.0 | - | - | 1462 | - | 64.9 | - |
| Trim [45] | Vicuna-7B | - | 69.1 | 61.4 | 85.3 | 28.0 | 58.7 | 1461 | - | 67.4 | 54.9 |
| HiRED [2] | Vicuna-13B | 65.2 | 73.2 | - | 87.7 | - | - | 1570 | - | - | - |
| LLaVA-NeXT (Ours) | Llama3-8B | 69.4 | 77.3 | 65.7 | 86.9 | 40.6 | 64.7 | 1558 | 334 | 74.2 | 70.1 |
| FocusLLaVA (Ours) | Llama3-8B | 70.0 | 79.0 | 66.0 | 87.7 | 41.3 | 65.6 | 1600 | 328 | 74.7 | 70.3 |
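
To make the comparison between the last two rows easier to read, the per-benchmark differences can be computed directly. The following is a minimal sketch, not part of the paper: the scores are copied from the table in column order, the benchmark labels are written with `^` for the superscripts, and the helper name `score_deltas` is a hypothetical name chosen for illustration.

```python
# Scores copied from the last two table rows (both use Llama3-8B), in column order.
benchmarks = ["VQA^T", "SQA", "GQA", "POPE", "MM-Vet",
              "LLaVA^W", "MME^P", "MME^C", "MMB", "MMB^C"]
llava_next = [69.4, 77.3, 65.7, 86.9, 40.6, 64.7, 1558, 334, 74.2, 70.1]
focusllava = [70.0, 79.0, 66.0, 87.7, 41.3, 65.6, 1600, 328, 74.7, 70.3]


def score_deltas(names, baseline, candidate):
    """Return {benchmark: candidate - baseline}, rounded to one decimal place."""
    return {n: round(c - b, 1) for n, b, c in zip(names, baseline, candidate)}


deltas = score_deltas(benchmarks, llava_next, focusllava)
for name, delta in deltas.items():
    # Explicit sign so gains and losses are easy to tell apart.
    print(f"{name:8s} {delta:+.1f}")
```

On these numbers, FocusLLaVA scores higher than the LLaVA-NeXT (Ours) row on nine of the ten benchmarks and lower only on MME^C (328 vs. 334).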