VisionZip: Longer is Better but Not Necessary in Vision Language Models
Categories: cs.CV, cs.AI, cs.CL, cs.LG
Release Date: December 5, 2024
Authors: Senqiao Yang¹, Yukang Chen¹, Zhuotao Tian², Chengyao Wang¹, Jingyao Li¹, Bei Yu¹, Jiaya Jia³
Affiliations: ¹CUHK; ²HITSZ; ³HKUST

| Method | GQA | MMB | MME | POPE | SQA | VQAv2 | TextVQA | MMMU | SEED | MMVet | LLaVA-B | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%) | | | | | | | | | | | | |
| Vanilla (CVPR24) | 61.9 | 64.7 | 1862 | 85.9 | 69.5 | 78.5 | 58.2 | 36.3 | 58.6 | 31.1 | 66.8 | 100% |
| | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | |
| Retain 192 Tokens | | | | | | | | | | | | |
| FastV (ECCV24) | 52.7 | 61.2 | 1612 | 64.8 | 67.3 | 67.1 | 52.5 | 34.3 | 57.1 | 27.7 | 49.4 | 88.2% |
| | 85.1% | 94.6% | 86.6% | 75.4% | 96.8% | 85.5% | 90.2% | 94.5% | 97.4% | 89.7% | 74.0% | |
| SparseVLM (2024.10) | 57.6 | 62.5 | 1721 | 83.6 | 69.1 | 75.6 | 56.1 | 33.8 | 55.8 | 31.5 | 66.1 | 96.4% |
| | 93.1% | 96.6% | 92.4% | 97.3% | 99.4% | 96.3% | 96.4% | 93.1% | 95.2% | 101.3% | 99.0% | |
| VisionZip | 59.3 | 63.0 | 1782.6 | 85.3 | 68.9 | 76.8 | 57.3 | 36.6 | 56.4 | 31.7 | 67.7 | 98.5% |
| | 95.8% | 97.4% | 95.7% | 99.3% | 99.1% | 97.8% | 98.5% | 100.8% | 96.2% | 101.9% | 101.3% | |
| VisionZip ‡ | 60.1 | 63.4 | 1834 | 84.9 | 68.2 | 77.4 | 57.8 | 36.2 | 57.1 | 32.6 | 66.7 | |
| | 97.1% | 98.0% | 98.5% | 98.8% | 98.1% | 98.6% | 99.3% | 99.7% | 97.4% | 104.8% | 99.9% | |
| Retain 128 Tokens | | | | | | | | | | | | |
| FastV (ECCV24) | 49.6 | 56.1 | 1490 | 59.6 | 60.2 | 61.8 | 50.6 | 34.9 | 55.9 | 28.1 | 52.0 | 83.5% |
| | 80.1% | 86.7% | 80.0% | 69.4% | 86.6% | 78.7% | 86.9% | 96.1% | 95.4% | 90.9% | 77.8% | |
| SparseVLM (2024.10) | 56.0 | 60.0 | 1696 | 80.5 | 67.1 | 73.8 | 54.9 | 33.8 | 53.4 | 30.0 | 62.7 | 93.4% |
| | 90.5% | 92.7% | 91.1% | 93.7% | 96.5% | 94.0% | 94.3% | 93.1% | 91.1% | 96.5% | 93.9% | |
| VisionZip | 57.6 | 62.0 | 1761.7 | 83.2 | 68.9 | 75.6 | 56.8 | 37.9 | 54.9 | 32.6 | 64.8 | 97.6% |
| | 93.1% | 95.8% | 94.6% | 96.9% | 99.1% | 96.3% | 97.6% | 104.4% | 93.7% | 104.8% | 97.6% | |
| VisionZip ‡ | 58.9 | 62.6 | 1823 | 83.7 | 68.3 | 76.6 | 57.0 | 37.3 | 55.8 | 32.9 | 64.8 | |
| | 95.2% | 96.8% | 97.9% | 97.4% | 98.3% | 97.6% | 97.9% | 102.8% | 95.2% | 105.8% | 97.0% | |
| Retain 64 Tokens | | | | | | | | | | | | |
| FastV (ECCV24) | 46.1 | 48.0 | 1256 | 48.0 | 51.1 | 55.0 | 47.8 | 34.0 | 51.9 | 25.8 | 46.1 | 75.6% |
| | 74.5% | 74.2% | 67.5% | 55.9% | 73.5% | 70.1% | 82.1% | 93.7% | 88.6% | 83.0% | 69.0% | |
| SparseVLM (2024.10) | 52.7 | 56.2 | 1505 | 75.1 | 62.2 | 68.2 | 51.8 | 32.7 | 51.1 | 23.3 | 57.5 | 85.8% |
| | 85.1% | 86.9% | 80.8% | 87.4% | 89.4% | 86.9% | 89.0% | 90.1% | 87.2% | 74.5% | 86.1% | |
| VisionZip | 55.1 | 60.1 | 1690 | 77.0 | 69.0 | 72.4 | 55.5 | 36.2 | 52.2 | 31.7 | 62.9 | 94.0% |
| | 89.0% | 92.9% | 90.8% | 89.6% | 99.3% | 92.2% | 95.4% | 99.7% | 89.1% | 101.9% | 94.2% | |
| VisionZip ‡ | 57.0 | 61.5 | 1756 | 80.9 | 68.8 | 74.2 | 56.0 | 35.6 | 53.4 | 30.2 | 63.6 | |
| | 92.1% | 95.1% | 94.3% | 94.2% | 99.0% | 94.5% | 96.2% | 98.1% | 91.1% | 101.9% | 95.2% | |
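
The percentage row under each method appears to be that method's score divided by the vanilla 576-token score on the same benchmark, and Avg. appears to be the plain mean of those eleven ratios. This reading is inferred from the numbers in the table, not stated in the source. The sketch below reproduces the VisionZip row at 192 retained tokens under that assumption. The function names and the benchmark labels `VQAv2` and `TextVQA` for the two VQA columns are illustrative assumptions.

```python
# Scores copied from the table: vanilla baseline (576 tokens) and
# VisionZip with 192 retained tokens. The two VQA columns are assumed
# to be VQAv2 and TextVQA.
BENCHMARKS = ["GQA", "MMB", "MME", "POPE", "SQA", "VQAv2",
              "TextVQA", "MMMU", "SEED", "MMVet", "LLaVA-B"]
VANILLA = [61.9, 64.7, 1862, 85.9, 69.5, 78.5, 58.2, 36.3, 58.6, 31.1, 66.8]
VISIONZIP_192 = [59.3, 63.0, 1782.6, 85.3, 68.9, 76.8, 57.3, 36.6, 56.4, 31.7, 67.7]


def relative_scores(scores, baseline):
    """Return each score as a percentage of the matching baseline score."""
    return [100.0 * s / b for s, b in zip(scores, baseline)]


def average_retention(scores, baseline):
    """Mean of the per-benchmark percentages (assumed meaning of 'Avg.')."""
    rel = relative_scores(scores, baseline)
    return sum(rel) / len(rel)


if __name__ == "__main__":
    for name, pct in zip(BENCHMARKS, relative_scores(VISIONZIP_192, VANILLA)):
        print(f"{name:8s} {pct:6.1f}%")
    print(f"Avg.     {average_retention(VISIONZIP_192, VANILLA):6.1f}%")
    # Fraction of visual tokens kept: 192 of 576.
    print(f"Tokens   {100.0 * 192 / 576:6.1f}%")
```

Under this reading, keeping one third of the visual tokens (192 of 576) retains about 98.5% of the vanilla average, which agrees with the Avg. cell in the table.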