notesum.ai

Published on November 29

Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing

cs.CV

cs.AI

cs.LG

Released Date: November 29, 2024

Authors: Hosu Lee¹, Junho Kim¹, Hyunjun Kim¹, Yong Man Ro¹

Aff.: ¹Integrated Vision and Language Lab, KAIST, South Korea

arXiv: http://arxiv.org/pdf/2411.19460v1

		LongVideoBench
Model	Size	8-15s	15-60s	180-600s	900-3600s	test set	val set
GPT-4o [33]	-	71.6	76.8	66.7	61.6	66.7	66.7
Gemini 1.5 Pro [38]	-	70.2	75.3	65.0	59.1	64.4	64.0
GPT-4-Turbo [31]	-	66.4	71.1	61.7	54.5	60.7	59.1
VideoChat2 [21]	7B	38.1	40.5	33.5	33.6	35.1	36.0
VideoLLaVA [23]	8B	43.1	44.6	36.4	34.4	37.6	39.1
PLLaVA [45]	7B	45.3	47.3	38.5	35.2	39.2	40.2
LLaVA-1.5 [25]	7B	45.0	47.4	40.1	37.0	40.4	40.3
ShareGPT4Video [4]	7B	46.9	50.1	40.0	38.7	41.8	39.7
Video-Ma²mba-0.7B	0.7B	43.3	45.4	33.3	28.5	34.2	34.0
Video-Ma²mba-1.8B	1.8B	48.4	49.5	39.6	34.1	39.8	38.0
Video-Ma²mba-3.1B	3.1B	55.4	55.6	42.4	38.5	44.2	43.0