Published on November 25
SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context
cs.CV
Released Date: November 25, 2024
Authors: Jungang Li¹, Sicheng Tao¹, Yibo Yan¹, Xiaojie Gu¹, Haodong Xu¹, Xu Zheng², Yuanhuiyi Lyu², Linfeng Zhang³, Xuming Hu²
Affiliations: ¹The Hong Kong University of Science and Technology (Guangzhou); ²The Hong Kong University of Science and Technology; ³Shanghai Jiao Tong University

| Models | #Parameters | #Frames | Audio (Supported Modality) | Video (Supported Modality) | Music-AVQA (Audio-Visual Task) | Video-MME (Long-Video Task) |
|---|---|---|---|---|---|---|
| Closed Source | ||||||
| GPT-4o [50] | - | 1fps | ✗ | ✓ | - | 71.00 |
| Gemini 1.5 Pro [25] | - | 1fps | ✓ | ✓ | - | 75.00 |
| Open Source Video Model | ||||||
| LLaVA-NeXT-Video [86] | 7B | 32 | ✗ | ✓ | - | 46.50 |
| VideoLLaMA2 [12] | 7B | 32 | ✓ | ✓ | - | 46.60 |
| LongVA [85] | 7B | 128 | ✗ | ✓ | - | 52.60 |
| OneLLM [27] | 7B | 15 | ✓ | ✓ | 47.60 | - |
| NExT-GPT [71] | 7B | 24 | ✓ | ✓ | 79.84 | 42.64 |
| CREMA [82] | 4B | 4 | ✓ | ✓ | 75.60 | - |
| PandaGPT [57] | 7B | 10 | ✓ | ✓ | 81.85 | 43.45 |
| SAVEnVideo (w/o SAVEn-Vid) | 7B | 16 | ✓ | ✓ | 74.80 | 53.60 |
| SAVEnVideo | 7B | 16 | ✓ | ✓ | 83.14 | 56.21 |