Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
cs.CV
Released Date: December 5, 2024
Authors: Yuying Ge¹, Yizhuo Li, Yixiao Ge, Ying Shan
Aff.: ¹ARC Lab, Tencent PCG

| Model | LLM size | Video-Gen | EgoSchema | Perception-Test | MVBench | MSVD | ActivityNet |
|---|---|---|---|---|---|---|---|
| Gemini 1.0 Pro [58] | - | | 55.7 | 51.1 | - | - | 49.8 |
| Gemini 1.5 Pro [59] | - | | 63.2 | - | - | - | 56.7 |
| GPT4-V [46] | - | | 55.6 | - | 43.7 | - | 59.5 |
| GPT4-O [47] | - | | 72.2 | - | - | - | 61.9 |
| LLaMA-VID [35] | 7B | | 38.5 | 44.6 | 41.9 | 69.7 | 47.4 |
| Video-ChatGPT [43] | 7B | | - | - | - | 64.9 | 35.2 |
| Video-LLaVA [37] | 7B | | 38.4 | 44.3 | 41.0 | 70.7 | 45.3 |
| VideoChat2 [31] | 7B | | 42.2 | 47.3 | 51.1 | 70.0 | 49.1 |
| LLaVA-NeXT-Video [38] | 7B | | 43.9 | 48.8 | 46.5 | 67.8 | 53.5 |
| LLaVA-NeXT-Video [38] | 32B | | 60.9 | - | - | - | 54.3 |
| PLLaVA [81] | 34B | | - | 58.1 | - | - | 60.9 |
| LLaVA-OneVision [30] | 72B | | 62.0 | - | - | - | 62.3 |
| VideoLLaMA2 [10] | 7B | | 51.7 | 51.4 | 54.6 | 70.9 | 50.2 |
| VideoLLaMA2 [10] | 72B | | 63.9 | 57.5 | 62.0 | 71.0 | 55.2 |
| LWM [40] | 7B | ✓ | - | - | - | 55.9 | - |
| Video-LaVIT [26] | 7B | ✓ | 37.3 | 47.9 | - | 73.2 | 50.1 |
| VILA-U [74] | 7B | ✓ | - | - | - | 75.3 | 52.7 |
| Divot-LLM | 7B | ✓ | 46.5 | 58.3 | 52.1 | 76.4 | 55.8 |