Published on November 21
Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
cs.CV
Released Date: November 21, 2024
Authors: Yiming Zhang¹, Zhuokai Zhao², Zhaorun Chen², Zenghui Ding³, Xianjun Yang¹, Yining Sun¹
Affiliations: ¹ HFIPS, Chinese Academy of Sciences / University of Science and Technology of China; ² University of Chicago; ³ HFIPS, Chinese Academy of Sciences

| Method | LLM Size | Vision Encoder | Video Trained | NExTQA acc. | EgoSchema acc. | IntentQA acc. | STAR acc. | VideoMME acc. |
|---|---|---|---|---|---|---|---|---|
| InternVideo | 7B | CLIP-L | O | 59.1 | 32.1 | - | 41.6 | - |
| VideoChat2 | 7B | UMT-L | O | 61.7 | - | - | 59.0 | - |
| SeViLA | 7B | CLIP-L | O | 63.6 | - | 60.9 | 44.6 | - |
| Video-LLaVA | 7B | ViT-L | O | - | - | - | - | 39.9 |
| Chat-UniVi-v1.5 | 7B | CLIP-G | O | - | - | - | - | 40.6 |
| DeepStack-L | 7B | CLIP-L | X | 61.0 | 38.4 | - | - | - |
| IG-VLM | 7B | CLIP-L | X | 63.1 | 35.8 | 60.1 | 49.6 | 39.8 |
| SlowFast-LLaVA | 7B | CLIP-L | X | 64.2 | 45.5 | 60.1 | 49.0 | 40.4 |
| DyTo | 7B | CLIP-L | X | 65.7 | 48.6 | 61.6 | 50.7 | 41.2 |