
Published on November 11

StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification

cs.CV

cs.AI

Released Date: November 11, 2024

Authors: Yichen He¹, Yuan Lin¹, Jianchao Wu¹, Hanchong Zhang², Yuchen Zhang¹, Ruicheng Le³

Aff.: ¹ByteDance Research; ²Shanghai Jiao Tong University; ³Peking University

arXiv: http://arxiv.org/abs/2411.07076v1


MovieQA Accuracy
Model	Character	Action	Plot	Total
Gemini-1.5-pro	0.578	0.501	0.534	0.544
GPT-4o	0.517	0.479	0.528	0.507
VILA1.5-8B	0.561	0.459	0.540	0.524
LLaVA-OneVision-7B	0.557	0.454	0.540	0.520
Qwen2-VL-7B	0.549	0.468	0.549	0.523
InternVL2-8B	0.535	0.448	0.506	0.501
StoryTeller	0.676	0.583	0.644	0.639