Published at November 25
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
cs.CV
cs.AI
Released Date: November 25, 2024
Authors: Junho Kim¹, Hyunjun Kim¹, Hosu Lee¹, Yong Man Ro¹
Aff.: ¹KAIST, South Korea

| Model | #param | Video-MME Short | Video-MME Medium | Video-MME Long | Video-MME Overall | LVBench Acc. (val) |
| --- | --- | --- | --- | --- | --- | --- |
| Proprietary LMMs | | | | | | |
| GPT-4V [45] | n/a | 70.5 | 55.8 | 53.5 | 59.9 | - |
| GPT-4o [46] | n/a | 80.0 | 70.3 | 65.3 | 71.9 | 66.7 |
| Gemini 1.5 Pro [50] | n/a | 81.7 | 74.3 | 67.4 | 75.0 | 64.0 |
| Open-sourced LMMs | | | | | | |
| ST-LLM [39] | 7B | 45.7 | 36.8 | 31.3 | 37.9 | - |
| VideoChat2 [32] | 7B | 48.3 | 37.0 | 33.2 | 39.5 | 39.3 |
| ShareGPT4Video [8] | 8B | 48.3 | 36.3 | 35.0 | 39.9 | 39.7 |
| Video-LLaVA [35] | 7B | 45.3 | 38.0 | 36.2 | 39.9 | 39.1 |
| Chat-UniVi-V1.5 [26] | 7B | 45.7 | 40.3 | 35.8 | 40.6 | - |
| Qwen-VL-Chat [3] | 7B | 46.9 | 38.7 | 37.8 | 41.1 | - |
| ShareGemini [51] | 7B | 49.1 | 41.3 | 39.1 | 43.2 | - |
| SliME [72] | 8B | 53.3 | 42.7 | 39.8 | 45.3 | - |
| PLLaVA [59] | 7B | - | - | - | - | 40.2 |
| VideoLLaMA2 [12] | 8B | 56.0 | 45.4 | 42.1 | 47.9 | - |
| Ours | | | | | | |
| SALOVA-Llama | 3B | 48.3 | 46.3 | 41.1 | 45.3 | 41.4 |
| SALOVA-Phi | 3.8B | 47.1 | 48.8 | 44.1 | 46.7 | 41.6 |
| SALOVA-Qwen | 7B | 52.3 | 50.9 | 46.8 | 50.0 | 43.5 |
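
As a quick sanity check on the table, the sketch below tests whether each Video-MME "Overall" score matches the unweighted mean of the Short, Medium, and Long scores. This rests on an assumption the page does not state: that the three duration splits are equally sized, so the overall accuracy is a plain average. The names `rows`, `mean_of_splits`, and `check` are hypothetical helpers made up for this illustration; the numbers are copied from the SALOVA rows above.

```python
# Video-MME accuracies copied from the table: (Short, Medium, Long, reported Overall).
rows = {
    "SALOVA-Llama": (48.3, 46.3, 41.1, 45.3),
    "SALOVA-Phi": (47.1, 48.8, 44.1, 46.7),
    "SALOVA-Qwen": (52.3, 50.9, 46.8, 50.0),
}


def mean_of_splits(short, medium, long):
    """Unweighted mean of the three duration splits.

    ASSUMPTION: the splits are equally sized, so Overall is a plain average.
    """
    return (short + medium + long) / 3


def check(rows, tol=0.1):
    """Return {model: (estimated overall, within tolerance?)}.

    Each score is rounded to one decimal in the table, so the mean of the
    rounded splits can differ from the reported Overall by up to about 0.1.
    """
    result = {}
    for model, (short, medium, long, overall) in rows.items():
        est = mean_of_splits(short, medium, long)
        result[model] = (round(est, 2), abs(est - overall) <= tol)
    return result


if __name__ == "__main__":
    for model, (est, ok) in check(rows).items():
        print(f"{model}: mean of splits = {est}, consistent with table = {ok}")
```

If a row fails this check, that points either to a transcription error in the table or to an overall score that is not a plain average of the splits.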