notesum.ai

Published on November 22

What's in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning

cs.CV

cs.CL

Release Date: November 22, 2024

Authors: AJ Piergiovanni¹, Dahun Kim¹, Michael S. Ryoo¹, Isaac Noble¹, Anelia Angelova¹

Aff.: ¹Google DeepMind

Arxiv: http://arxiv.org/abs/2411.14688v1

		ViTT				YouCook2				ActivityNet
Method	Pretraining	S	C	M	F1	S	C	M	F1	S	C	M	F1
E2ESG [77]	No	-	-	-	-	-	25.0	3.5	-	-	-	-	-
MT [74]	No	-	-	-	-	-	6.1	3.2	-	-	9.3	5.0	-
PDVC [56]	No	-	-	-	-	4.9	28.9	5.7	-	6.0	29.3	7.6	-
TimeChat [42]	No	-	-	-	-	3.4	11.0	-	19.5	-	-	-	-
GIT [53]	Multiple[53]*	7.1	15.1	3.4	32.5	3.1	12.1	3.4	17.7	5.7	29.8	7.8	50.6
OmniViD [54]	Kinetics-400	-	-	-	-	-	-	-	-	-	26.0	7.5	-
Vid2Seq† [65]	YT-Temporal-1B	9.8	23.0	5.0	37.7	5.7	25.3	6.4	23.5	5.9	30.2	8.5	51.8
DIBS [62]	Custom HowTo[62]*	-	-	-	-	6.4	44.4	7.5	31.4	5.9	31.9	8.9	55.6
Zhou et al. [75]	WebLI [8]*	10.0	25.2	5.8	35.4	6.0	32.9	7.1	24.1	6.2	37.8	10.0	52.9
Ours	HowTo	10.2	37.2	18.9	47.4	11.3	57.6	27.7	33.6	12.3	18.4	19.5	53.3