Published on November 20
Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension
cs.CV
cs.AI
Release Date: November 20, 2024
Authors: Yongdong Luo¹, Xiawu Zheng¹, Xiao Yang¹, Guilin Li¹, Haojia Lin¹, Jinfa Huang², Jiayi Ji¹, Fei Chao¹, Jiebo Luo², Rongrong Ji¹
Affiliations: ¹Xiamen University; ²University of Rochester

| Model | #Text | LLM Params | Frames | Short | Medium | Long | Overall | Gain |
|---|---|---|---|---|---|---|---|---|
| Proprietary LVLMs | | | | | | | | |
| GPT-4o [28] | - | - | 384 | 80.0 | 70.3 | 65.3 | 71.9 | - |
| Gemini-1.5-Pro [31] | - | - | 0.5 fps | 81.7 | 74.3 | 67.4 | 75.0 | - |
| Open-Source LVLMs | | | | | | | | |
| Video-LLaVA [17] | - | 7B | 8 | 44.6 | 38.3 | 35.8 | 39.6 | - |
| Video-LLaVA + Video-RAG | 2.0K | 7B | 8 | 49.5 | 43.0 | 35.5 | 45.0 | +5.4 |
| LLaVA-NeXT-Video [48] | - | 7B | 16 | 49.4 | 43.0 | 36.7 | 43.0 | - |
| LLaVA-NeXT-Video + Video-RAG | 2.0K | 7B | 16 | 56.6 | 47.4 | 39.1 | 50.0 | +7.0 |
| LongVA [47] | - | 7B | 32 | 60.9 | 49.3 | 44.0 | 51.4 | - |
| LongVA + Video-RAG | 1.8K | 7B | 32 | 65.4 | 59.1 | 55.7 | 60.1 | +8.7 |
| Long-LLaVA [43] | - | 7B | 32 | 60.3 | 51.4 | 44.1 | 52.0 | - |
| Long-LLaVA + Video-RAG | 1.9K | 7B | 32 | 66.4 | 60.2 | 59.8 | 62.1 | +10.1 |
| Qwen2-VL [38] | - | 72B | 32 | 75.0 | 63.3 | 56.3 | 64.9 | - |
| Qwen2-VL + Video-RAG | 2.1K | 72B | 32 | 77.4 | 70.2 | 71.0 | 72.9 | +8.0 |
| LLaVA-Video [49] | - | 72B | 32 | 78.0 | 63.7 | 59.6 | 67.1 | - |
| LLaVA-Video + Video-RAG | 2.1K | 72B | 32 | 81.1 | 72.9 | 73.1 | 75.7 | +8.6 |
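
The Gain column appears to be the absolute difference, in percentage points, between a model's Overall score with Video-RAG and its baseline Overall score. The sketch below recomputes that difference from the Overall values in the table. The dictionary layout and the `gain` helper are illustrative assumptions, not part of the paper.

```python
# Overall accuracy (%) copied from the table above: (baseline, with Video-RAG).
overall = {
    "Video-LLaVA": (39.6, 45.0),
    "LLaVA-NeXT-Video": (43.0, 50.0),
    "LongVA": (51.4, 60.1),
    "Long-LLaVA": (52.0, 62.1),
    "Qwen2-VL": (64.9, 72.9),
    "LLaVA-Video": (67.1, 75.7),
}


def gain(baseline: float, augmented: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place.

    Hypothetical helper: assumes Gain = Overall(with Video-RAG) - Overall(baseline).
    """
    return round(augmented - baseline, 1)


# Recompute the Gain column for every open-source model pair.
gains = {model: gain(base, aug) for model, (base, aug) in overall.items()}

for model, g in gains.items():
    print(f"{model}: +{g:.1f}")
```

Under this assumption the recomputed values match the Gain column for all six open-source models.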