NVILA: Efficient Frontier Visual Language Models
cs.CV
Released Date: December 5, 2024
Authors: Zhijian Liu¹, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, Yao Lu
Affiliation: ¹NVIDIA
![[Uncaptioned image]](https://arxiv.org/html/2412.04468v1/x1.png)
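| Model | Size | #Frames | ActivityNet-QA (acc.) | ActivityNet-QA (score) | LongVideoBench (val) | LongVideoBench (test) | MLVU (m-avg) | MVBench (test) | NExT-QA (mc) | Video-MME (w/o sub.) | Video-MME (w/ sub.) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |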
| GPT-4o mini | – | – | – | – | 56.5 | 58.8 | – | – | – | 64.8 | 68.9 |
| GPT-4o | – | – | 61.9 | – | 66.7 | 66.7 | 64.6 | – | – | 71.9 | 77.2 |
| LLaVA-NeXT-Video | 7B | 32 | 53.5 | 3.2 | 43.5 | 43.5 | – | 33.7 | – | 46.5 | – |
| Video-XL | 7B | 2048 | – | – | 49.5 | 51.3 | 64.9 | 55.3 | 77.2 | 55.5 | 61.0 |
| LLaVA-OneVision | 7B | 32 | 56.6 | – | 56.5 | – | 64.7 | 56.7 | 79.4 | 58.2 | 61.5 |
| Oryx-1.5 | 7B | 128 | – | – | 56.3 | – | 67.5 | 67.6 | 81.8 | 58.8 | 64.2 |
| LongVILA | 7B | 256 | 59.5 | – | 57.1 | – | – | 67.1 | 80.7 | 60.1 | 65.1 |
| LongVU | 7B | 1fps | – | – | – | – | 65.4 | 66.9 | – | 60.6 | – |
| Qwen2-VL | 7B | 2fps | – | – | 55.6 | 56.8 | – | 67.0 | – | 63.3 | 69.0 |
| NVILA-Lite | 8B | 256 | – | – | – | – | 69.2 | – | 79.6 | 60.9 | 68.1 |
| NVILA | 8B | 256 | 60.9 | 3.7 | 57.7 | 58.7 | 70.1 | 68.1 | 82.2 | 64.2 | 70.0 |