notesum.ai

Published on November 25

SAVEn-Vid: Synergistic Audio-Visual Integration for Enhanced Understanding in Long Video Context

cs.CV

Release Date: November 25, 2024

Authors: Jungang Li¹, Sicheng Tao¹, Yibo Yan¹, Xiaojie Gu¹, Haodong Xu¹, Xu Zheng², Yuanhuiyi Lyu², Linfeng Zhang³, Xuming Hu²

Affiliations: ¹The Hong Kong University of Science and Technology (Guangzhou); ²The Hong Kong University of Science and Technology; ³Shanghai Jiao Tong University

arXiv: http://arxiv.org/abs/2411.16213v1

Models	#Parameters	#Frames	Supports Audio	Supports Video	Music-AVQA (Audio-Visual Task)	Video-MME (Long-Video Task)
Closed Source
GPT-4o [50]	-	1 fps	✗	✓	-	71.00
Gemini 1.5 Pro [25]	-	1 fps	✓	✓	-	75.00
Open Source Video Model
LLaVA-NeXT-Video [86]	7B	32	✗	✓	-	46.50
VideoLLaMA2 [12]	7B	32	✓	✓	-	46.60
LongVA [85]	7B	128	✗	✓	-	52.60
OneLLM [27]	7B	15	✓	✓	47.60	-
NExT-GPT [71]	7B	24	✓	✓	79.84	42.64
CREMA [82]	4B	4	✓	✓	75.60	-
PandaGPT [57]	7B	10	✓	✓	81.85	43.45
SAVEnVideo (w/o SAVEn-Vid)	7B	16	✓	✓	74.80	53.60
SAVEnVideo	7B	16	✓	✓	83.14	56.21