StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification
cs.CV
cs.AI
Release Date: November 11, 2024
Authors: Yichen He¹, Yuan Lin¹, Jianchao Wu¹, Hanchong Zhang², Yuchen Zhang¹, Ruicheng Le³
Affiliations: ¹ByteDance Research; ²Shanghai Jiao Tong University; ³Peking University

MovieQA accuracy by question type:

| Model | Character | Action | Plot | Total |
|---|---|---|---|---|
| Gemini-1.5-pro | 0.578 | 0.501 | 0.534 | 0.544 |
| GPT-4o | 0.517 | 0.479 | 0.528 | 0.507 |
| VILA1.5-8B | 0.561 | 0.459 | 0.540 | 0.524 |
| LLaVA-OneVision-7B | 0.557 | 0.454 | 0.540 | 0.520 |
| Qwen2-VL-7B | 0.549 | 0.468 | 0.549 | 0.523 |
| InternVL2-8B | 0.535 | 0.448 | 0.506 | 0.501 |
| StoryTeller | 0.676 | 0.583 | 0.644 | 0.639 |