Published on November 29
LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
cs.CV
cs.CL
cs.LG
cs.MM
Released Date: November 29, 2024
Authors: Tiantian Geng¹, Jinrui Zhang², Qingni Wang³, Teng Wang², Jinming Duan¹, Feng Zheng²
Affiliations: ¹University of Birmingham; ²Southern University of Science and Technology; ³University of Electronic Science and Technology of China
![[Uncaptioned image]](https://arxiv.org/html/2411.19772v1/extracted/6033341/figures/fig1_new3.png)
| Model | A&V | TU | Omni-TVG R@0.3 | Omni-TVG R@0.5 | Omni-TVG R@0.7 | Omni-TVG mIoU | Omni-DVC S | Omni-DVC C | Omni-DVC M | Omni-SC B | Omni-SC R | Omni-SC C | Omni-SC M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| VideoChat (7B) [28] | ✗ | ✗ | 2.2 | 0.9 | 0.4 | 3.0 | 0.7 | 0.2 | 0.9 | 0.5 | 9.6 | 0.0 | 8.2 |
| VideoChatGPT (7B) [37] | ✗ | ✗ | 4.9 | 2.0 | 0.9 | 5.0 | 0.7 | 0.1 | 0.9 | 0.4 | 14.0 | 0.9 | 5.9 |
| VideoLLaMA (7B) [61] | ✓ | ✗ | 2.5 | 1.1 | 0.3 | 1.9 | 0.6 | 0.6 | 0.9 | 0.9 | 11.5 | 0.1 | 8.9 |
| PandaGPT (7B) [49] | ✓ | ✗ | 2.5 | 1.0 | 0.3 | 2.2 | 0.5 | 0.0 | 0.6 | 0.6 | 14.9 | 0.3 | 8.9 |
| NExT-GPT (7B) [56] | ✓ | ✗ | 4.3 | 1.9 | 0.7 | 4.0 | 0.2 | 0.1 | 0.3 | 0.4 | 10.2 | 0.0 | 8.1 |
| TimeChat (7B) [46] | ✗ | ✓ | 5.8 | 2.6 | 1.1 | 5.2 | 1.6 | 0.1 | 1.4 | 1.2 | 16.1 | 1.6 | 10.0 |
| VTimeLLM (7B) [20] | ✗ | ✓ | 7.5 | 3.4 | 1.3 | 6.4 | 2.4 | 0.2 | 2.0 | 1.0 | 14.5 | 1.6 | 5.5 |
| LongVALE-LLM (7B) (ours) | ✓ | ✓ | 15.7 | 8.6 | 3.9 | 11.0 | 2.8 | 7.9 | 4.7 | 5.6 | 22.4 | 20.3 | 10.9 |
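
The Omni-TVG columns report R@0.3, R@0.5, R@0.7, and mIoU. This page does not define them, so the sketch below assumes the definitions commonly used in temporal video grounding: R@t is the percentage of queries whose predicted segment overlaps the ground-truth segment with a temporal IoU of at least t, and mIoU is the mean IoU over all queries, scaled to a percentage. The names `temporal_iou` and `grounding_metrics` are hypothetical and do not come from the LongVALE code base.

```python
from typing import Dict, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec), assumed start <= end


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def grounding_metrics(
    preds: List[Segment],
    gts: List[Segment],
    thresholds: Sequence[float] = (0.3, 0.5, 0.7),
) -> Dict[str, float]:
    """Recall at each IoU threshold and mean IoU, both as percentages.

    Assumption: one predicted segment per query, and a prediction counts
    as a hit when its IoU is greater than or equal to the threshold.
    """
    if not preds or len(preds) != len(gts):
        raise ValueError("preds and gts must be non-empty and of equal length")
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    n = len(ious)
    metrics = {
        f"R@{t}": 100.0 * sum(iou >= t for iou in ious) / n for t in thresholds
    }
    metrics["mIoU"] = 100.0 * sum(ious) / n
    return metrics


if __name__ == "__main__":
    # Toy data: one exact match, one partial overlap, one complete miss.
    preds = [(0.0, 10.0), (5.0, 15.0), (0.0, 4.0)]
    gts = [(0.0, 10.0), (10.0, 20.0), (10.0, 20.0)]
    print(grounding_metrics(preds, gts))
```

On the toy data the three IoUs are 1, 1/3, and 0, so two of three queries clear the 0.3 threshold while only one clears 0.5 and 0.7. This illustrates why recall in the table drops as the threshold tightens.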