xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Categories: cs.LG, cs.AI, cs.CV, cs.IT, math.IT, 68T07
Released Date: October 21, 2024
Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

| Method | Size | #tokens | MSVD-QA | MSRVTT-QA | ActivityNet-QA | TGIF-QA |
|---|---|---|---|---|---|---|
| VideoChat (Li et al., 2023b) | 7B | 32 | 56.3 / 2.8 | 45.0 / 2.5 | - / 2.2 | 34.4 / 2.3 |
| Video-LLaMA (Zhang et al., 2023) | 7B | 32 | 51.6 / 2.5 | 29.6 / 1.8 | 12.4 / 1.1 | - / - |
| Video-ChatGPT (Maaz et al., 2024) | 7B | 264+ | 64.9 / 3.3 | 49.3 / 2.8 | 34.2 / 2.8 | 51.4 / 3.0 |
| Chat-UniVi (Jin et al., 2024) | 7B | 112 | 69.3 / 3.7 | 55.0 / 3.1 | 46.1 / 3.3 | 69.0 / 3.8 |
| LLaMA-VID (Li et al., 2024c) | 7B | 32 | 69.7 / 3.7 | 57.7 / 3.2 | 47.4 / 3.3 | - / - |
| LLaMA-VID (Li et al., 2024c) | 13B | 32 | 70.0 / 3.7 | 58.9 / 3.3 | 47.5 / 3.3 | - / - |
| Video-LLaVA (Lin et al., 2023) | 7B | 2048 | 71.8 / 3.9 | 59.2 / 3.5 | 45.3 / 3.3 | 70.0 / 4.0 |
| MiniGPT4-Video (Ataallah et al., 2024) | 7B | 2880+ | 73.9 / 4.1 | 59.7 / 3.3 | 46.3 / 3.4 | 72.2 / 4.1 |
| PLLaVA (Xu et al., 2024a) | 7B | 576+ | 76.6 / 4.1 | 62.0 / 3.5 | 56.3 / 3.5 | 77.5 / 4.1 |
| SlowFast-LLaVA (Xu et al., 2024b) | 7B | 3680 | 79.1 / 4.1 | 65.8 / 3.6 | 56.3 / 3.4 | 78.7 / 4.2 |
| LLaVA-Hound-DPO (Zhang et al., 2024b) | 7B | 2048 | 80.7 / 4.1 | 70.2 / 3.7 | - / - | 61.4 / 3.5 |
| LLaVA-OneVision∗ (Wang et al., 2024a) | 7B | 1568 | 72.9 / 3.9 | 57.8 / 3.4 | 55.3 / 3.6 | 41.1 / 3.1 |
| Tarsier (Wang et al., 2024a) | 7B | 4608+ | 77.0 / 4.1 | 62.0 / 3.5 | 59.5 / 3.6 | 79.2 / 4.2 |
| Tarsier∗ (Wang et al., 2024a) | 7B | 4608 | 74.4 / 4.0 | 59.1 / 3.4 | 54.3 / 3.5 | - / - |
| PLLaVA (Xu et al., 2024a) | 34B | 576+ | 79.9 / 4.2 | 68.7 / 3.8 | 60.9 / 3.7 | 80.6 / 4.3 |
| LLaVA-NeXT-Video∗ (Li et al., 2024b) | 32B | 1152 | 73.6 / 4.0 | 56.8 / 3.4 | 58.4 / 3.6 | 73.5 / 4.1 |
| Tarsier (Wang et al., 2024a) | 34B | 4608+ | 80.3 / 4.2 | 66.4 / 3.7 | 61.6 / 3.7 | 82.5 / 4.4 |
| Tarsier∗ (Wang et al., 2024a) | 34B | 4608 | 79.3 / 4.1 | 62.2 / 3.5 | 61.5 / 3.7 | - / - |
| BLIP-3-Video | 4B | 32 | 77.7 / 4.2 | 60.0 / 3.6 | 55.7 / 3.5 | 76.5 / 4.3 |
| BLIP-3-Video | 4B | 128 | 77.9 / 4.3 | 59.7 / 3.6 | 56.9 / 3.6 | 77.1 / 4.3 |
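
To make the token-efficiency comparison in the table concrete, the sketch below takes a few rows and computes how many times more tokens each method uses than the 32-token BLIP-3-Video, alongside the gap in the first MSVD-QA number. It assumes the first value in each cell is accuracy in percent (the table does not label the two values) and treats token counts marked "+" as lower bounds. The function name `compare_to_baseline` is hypothetical, introduced only for illustration.

```python
# Rows copied from the table above: (method, size, #tokens, first MSVD-QA value).
# Assumption: the first value in each "x / y" cell is accuracy in percent.
# Token counts shown with "+" in the table are treated as lower bounds.
rows = [
    ("Video-LLaVA", "7B", 2048, 71.8),
    ("PLLaVA", "7B", 576, 76.6),
    ("SlowFast-LLaVA", "7B", 3680, 79.1),
    ("Tarsier", "7B", 4608, 77.0),
    ("BLIP-3-Video", "4B", 32, 77.7),
]


def compare_to_baseline(rows, baseline="BLIP-3-Video"):
    """Return (method, token_ratio, accuracy_gap) for each non-baseline row."""
    base = next(r for r in rows if r[0] == baseline)
    _, _, base_tokens, base_acc = base
    out = []
    for name, _, tokens, acc in rows:
        if name == baseline:
            continue
        # token_ratio > 1 means the method uses more tokens than the baseline;
        # accuracy_gap > 0 means it scores higher than the baseline.
        out.append((name, tokens / base_tokens, round(acc - base_acc, 1)))
    return out


for name, ratio, gap in compare_to_baseline(rows):
    print(f"{name}: {ratio:.0f}x the tokens, {gap:+.1f} on MSVD-QA vs BLIP-3-Video")
```

Read this way, the selected 7B models use between 18x and 144x as many tokens as the 32-token BLIP-3-Video, while their first MSVD-QA value differs from it by at most a few points in either direction.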