Published at November 27
TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
cs.CV
cs.AI
Released Date: November 27, 2024
Authors: Shimin Chen¹, Xiaohan Lan¹, Yitian Yuan¹, Zequn Jie¹, Lin Ma¹
Aff.: ¹Meituan Inc.

| Model Name | LLM | VideoMME-long (w/o subs), 30-60min | LVBench, 68min on avg. | LongVideoBench (dev), 8s-60min | MLVU (test), 3min-2hour |
|---|---|---|---|---|---|
| Open-Source Video MLLMs (>8B) | | | | | |
| VILA-1.5 [33] | 34B | 53.8 | - | - | 44.2 |
| VideoLLaMA2-72B [9] | 72B | 57.6 | - | - | 45.6 |
| Qwen2-VL-72B [55] | 72B | 62.2 | 41.3 | - | - |
| LLaVA-Video [71] | 72B | 61.5 | - | - | - |
| InternVL-2 [8] | 34B | 52.6 | 39.6 | 59.3 | 45.7 |
| InternVL-1.5 [8] | 20B | 45.6 | - | 51.2 | 37.3 |
| VITA [15] | 8x7B | 48.6 | - | - | - |
| PLLaVA-34B [62] | 34B | - | 26.1 | 53.2 | - |
| Open-Source Video MLLMs (≤8B) | | | | | |
| LLaVA-Next-Video [70] | 7B | - | - | 43.5 | - |
| PLLaVA-7B [62] | 7B | - | - | 40.2 | - |
| VideoChat2-HD [28] | 7B | - | - | - | 35.1 |
| VideoLLaMA2-7B [9] | 7B | - | - | - | - |
| LongVA [68] | 7B | 46.2 | - | - | 41.1 |
| Video-XL [51] | 7B | 49.2 | - | 49.5 | 45.5 |
| LongVU [50] | 7B | - | - | - | - |
| LongLLaVA [57] | 7B | 45.4 | - | - | - |
| Kangaroo [38] | 8B | 46.6 | 39.4 | 54.2 | - |
| TimeMarker (Ours) | 8B | 46.4 | 41.3 | 56.3 | 49.2 |