LinVT: Empower Your Image-level Large Language Model to Understand Videos
cs.CV
cs.LG
cs.MM
Released Date: December 6, 2024
Authors: Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao

| Method | Size | MSVD-QA | MSRVTT-QA | ActivityNet-QA | TGIF-QA |
|---|---|---|---|---|---|
| VideoChat [30] | 7B | 56.3 / 2.8 | 45.0 / 2.5 | - / 2.2 | 34.4 / 2.3 |
| Video-LLaMA [80] | 7B | 51.6 / 2.5 | 29.6 / 1.8 | 12.4 / 1.1 | - / - |
| Video-LLaMA2 [9] | 7B | 71.7 / 3.9 | - / - | 49.9 / 3.3 | - / - |
| Video-ChatGPT [46] | 7B | 64.9 / 3.3 | 49.3 / 2.8 | 34.2 / 2.8 | 51.4 / 3.0 |
| Chat-UniVi [23] | 7B | 69.3 / 3.7 | 55.0 / 3.1 | 46.1 / 3.3 | 69.0 / 3.8 |
| LLaMA-VID [35] | 7B | 69.7 / 3.7 | 57.7 / 3.2 | 47.4 / 3.3 | - / - |
| Video-LLaVA [37] | 7B | 71.8 / 3.9 | 59.2 / 3.5 | 45.3 / 3.3 | 70.0 / 4.0 |
| MiniGPT4-Video [2] | 7B | 73.9 / 4.1 | 59.7 / 3.3 | 46.3 / 3.4 | 72.2 / 4.1 |
| PLLaVA [71] | 7B | 76.6 / 4.1 | 62.0 / 3.5 | 56.3 / 3.5 | 77.5 / 4.1 |
| SlowFast-LLaVA [72] | 7B | 79.1 / 4.1 | 65.8 / 3.6 | 56.3 / 3.4 | 78.7 / 4.2 |
| Tarsier [64] | 7B | 77.0 / 4.1 | 62.0 / 3.5 | 59.5 / 3.6 | 79.2 / 4.2 |
| BLIP-3-Video [56] | 4B | 77.7 / 4.2 | 60.0 / 3.6 | 55.7 / 3.5 | 76.5 / 4.3 |
| LinVT-Mipha | 1.6B | 71.2 / 3.8 | 55.3 / 3.0 | 47.5 / 3.1 | 71.1 / 3.9 |
| LinVT-Aquila | 2B | 74.6 / 4.1 | 58.4 / 3.2 | 51.1 / 3.3 | 73.6 / 4.0 |
| LinVT-BLIP-3 | 4B | 79.1 / 4.4 | 61.5 / 3.9 | 58.9 / 3.6 | 78.7 / 4.3 |
| LinVT-Molmo | 7B | 78.1 / 4.3 | 60.3 / 3.7 | 59.6 / 3.7 | 79.3 / 4.2 |
| LinVT-Qwen2-VL | 7B | 80.2 / 4.4 | 66.2 / 4.0 | 60.1 / 3.6 | 81.3 / 4.3 |
| LinVT-InternVL2 | 8B | / 4.4 | / 4.0 | / 3.7 | / 4.3 |