notesum.ai

Published on October 21

xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

cs.LG
cs.AI
cs.CV
cs.IT
math.IT
68T07

Release Date: October 21, 2024

Authors: Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles

arXiv: https://arxiv.org/abs/2410.16267v1