notesum.ai
Published on December 9
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
cs.CV
Released Date: December 9, 2024
Authors: Lianyu Hu¹, Fanhua Shang, Liang Wan, Wei Feng
Aff.: ¹College of Intelligence and Computing, Tianjin University, China

Video benchmarks

| Model | ActivityNet-QA (test) | Egoschema (test) | MLVU (m-avg) | NextQA (mc) | PerceptionTest (val) | SeedBench (video) | VideoChatGPT (test) | VideoDC (test) | VideoMME (wo/w-subs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-OneVision 7B | 56.6 | 60.1 | 64.7 | 79.4 | 57.1 | 56.9 | 3.51 | 3.75 | 58.2/61.5 |
| iLLaVA 7B | 56.4 (-0.2%) | 60.2 (+0.1%) | 64.4 (-0.3%) | 79.0 (-0.5%) | 56.8 (-0.1%) | 56.5 (-0.4%) | 3.50 (-0.01) | 3.73 (-0.02) | 58.2/61.4 (-0.0%/-0.1%) |
| QWen2-VL 7B | - | 66.7 | - | - | 62.3 | - | - | - | 63.2/68.8 |
| iLLaVA 7B | - | 66.3 (-0.3%) | - | - | 61.8 (-0.5%) | - | - | - | 62.9/68.5 (-0.3%/-0.3%) |