VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection

Categories: cs.CV, cs.AI, cs.CL

Release Date: November 22, 2024

Authors: Songhao Han¹, Wei Huang², Hairong Shi¹, Le Zhuo³, Xiu Su⁴, Shifeng Zhang⁵, Xu Zhou⁵, Xiaojuan Qi², Yue Liao⁶, Si Liu¹

Affiliations: ¹Beihang University; ²The University of Hong Kong; ³Shanghai AI Lab; ⁴Central South University; ⁵Sangfor Technologies Inc.; ⁶CUHK

arXiv: http://arxiv.org/abs/2411.14794v1

Models                Logic   Factuality  Accuracy  Conciseness  Overall
Closed-source LVLMs
GPT-4o                73.15   63.11       61.66     70.02        66.13
Qwen-VL-Max           62.46   50.33       48.43     60.21        53.37
Open-source LVLMs
LLaVA 1.5             60.53   49.56       49.93     62.10        52.12
InternVL2             70.64   56.32       54.53     66.76        60.05
LLaVA-N-inter         63.27   52.34       48.45     66.78        55.16
Qwen2-VL-7B           66.31   53.67       50.84     68.88        57.66
LongVA-7B-DPO         67.98   54.72       52.78     58.38        57.19
mPLUG-Owl3            66.14   53.05       50.97     67.30        57.14
LLaVA-N-Video         63.42   54.11       49.55     63.31        56.43
Ours                  72.25   61.28       59.68     75.73        65.84