Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing
Categories: cs.CV, cs.AI, cs.LG
Released Date: November 29, 2024
Authors: Hosu Lee$^1$, Junho Kim$^1$, Hyunjun Kim$^1$, Yong Man Ro$^1$
Aff.: $^1$Integrated Vision and Language Lab, KAIST, South Korea

LongVideoBench

| Model | Size | 8-15s | 15-60s | 180-600s | 900-3600s | test set | val set |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o [33] | - | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 | 66.7 |
| Gemini 1.5 Pro [38] | - | 70.2 | 75.3 | 65.0 | 59.1 | 64.4 | 64.0 |
| GPT-4-Turbo [31] | - | 66.4 | 71.1 | 61.7 | 54.5 | 60.7 | 59.1 |
| VideoChat2 [21] | 7B | 38.1 | 40.5 | 33.5 | 33.6 | 35.1 | 36.0 |
| VideoLLaVA [23] | 8B | 43.1 | 44.6 | 36.4 | 34.4 | 37.6 | 39.1 |
| PLLaVA [45] | 7B | 45.3 | 47.3 | 38.5 | 35.2 | 39.2 | 40.2 |
| LLaVA-1.5 [25] | 7B | 45.0 | 47.4 | 40.1 | 37.0 | 40.4 | 40.3 |
| ShareGPT4Video [4] | 7B | 46.9 | 50.1 | 40.0 | 38.7 | 41.8 | 39.7 |
| Video-Ma$^2$mba-0.7B | 0.7B | 43.3 | 45.4 | 33.3 | 28.5 | 34.2 | 34.0 |
| Video-Ma$^2$mba-1.8B | 1.8B | 48.4 | 49.5 | 39.6 | 34.1 | 39.8 | 38.0 |
| Video-Ma$^2$mba-3.1B | 3.1B | 55.4 | 55.6 | 42.4 | 38.5 | 44.2 | 43.0 |