notesum.ai
Published on November 22
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Categories: cs.CV, cs.AI, cs.CL
Release Date: November 22, 2024
Authors: Songhao Han¹, Wei Huang², Hairong Shi¹, Le Zhuo³, Xiu Su⁴, Shifeng Zhang⁵, Xu Zhou⁵, Xiaojuan Qi², Yue Liao⁶, Si Liu¹
Affiliations: ¹Beihang University; ²The University of Hong Kong; ³Shanghai AI Lab; ⁴Central South University; ⁵Sangfor Technologies Inc.; ⁶CUHK
![Uncaptioned image](https://arxiv.org/html/2411.14794v1/x1.png)
| Models | Log. | Fac. | Acc. | Con. | Overall |
|---|---|---|---|---|---|
| Closed-source LVLMs | | | | | |
| GPT-4o | 73.15 | 63.11 | 61.66 | 70.02 | 66.13 |
| Qwen-VL-Max | 62.46 | 50.33 | 48.43 | 60.21 | 53.37 |
| Open-source LVLMs | | | | | |
| LLaVA 1.5 | 60.53 | 49.56 | 49.93 | 62.10 | 52.12 |
| InternVL2 | 70.64 | 56.32 | 54.53 | 66.76 | 60.05 |
| LLaVA-N-inter | 63.27 | 52.34 | 48.45 | 66.78 | 55.16 |
| Qwen2-VL-7B | 66.31 | 53.67 | 50.84 | 68.88 | 57.66 |
| LongVA-7B-DPO | 67.98 | 54.72 | 52.78 | 58.38 | 57.19 |
| mPLUG-Owl3 | 66.14 | 53.05 | 50.97 | 67.30 | 57.14 |
| LLaVA-N-Video | 63.42 | 54.11 | 49.55 | 63.31 | 56.43 |
| Ours | 72.25 | 61.28 | 59.68 | 75.73 | 65.84 |