notesum.ai
Published on December 4
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning
cs.CV
Released Date: December 4, 2024
Authors: Wujian Peng¹, Lingchen Meng, Yitong Chen, Yiweng Xie, Yang Liu, Tao Gui, Hang Xu, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang
Aff.: ¹Fudan University

| Method | LLM | Vision Encoder | AI2D [25] (test) | MMMU [87] (val) | POPE [34] (test F1) | GQA [24] (val) | MM-Vet [85] (test) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLaVA-1.5 [40] | Vicuna-7B [15] | CLIP-ViT-Large [61] | 54.8 | 35.3 | 85.9 | 62.0 | 30.5 |
| LLaVA-Next [41] | Vicuna-7B [15] | CLIP-ViT-Large [61] | 66.6 | 35.1 | 86.4 | 64.2 | 44.1 |
| DeepStack-L [54] | Vicuna-7B [15] | CLIP-ViT-Large [61] | - | 35.7 | 86.7 | 63.1 | 29.9 |
| DeepStack-L-HD [54] | Vicuna-7B [15] | CLIP-ViT-Large [61] | - | 35.6 | 86.5 | 65.2 | 37.5 |
| VILA [38] | Vicuna-7B [15] | CLIP-ViT-Large [61] | - | - | 85.5 | 62.3 | 34.9 |
| ShareGPT4V [10] | Vicuna-7B [15] | CLIP-ViT-Large [61] | - | - | - | - | 37.6 |
| MM-1.5 [91] | MM-LLM-7B [91] | MM-CLIP [91] | 72.0 | 41.8 | 88.6 | - | 42.2 |
| InternVL2 [12] | InternLM-7B [68] | InternViT-300M [12] | 83.8 | 49.3 | - | - | 60.0 |
| LLaVA-OV (SI) [29] | Qwen2-7B [80] | SigLIP-SO400M [89] | 81.6 | 47.3 | - | - | 58.8 |
| LLaVA-OV [29] | Qwen2-7B [80] | SigLIP-SO400M [89] | 81.4 | 48.8 | - | - | 57.5 |
| Qwen2-VL-Instruct [74] | Qwen2-7B [80] | DFN-CLIP-H [20] | 83.0 | 54.1 | - | - | 62.0 |
| LLaVA-Next-Inst-IT | Vicuna-7B [15] | CLIP-ViT-Large [61] | 71.0 | 37.4 | 87.2 | 65.9 | 38.1 |
| LLaVA-Next-Inst-IT | Qwen2-7B [80] | SigLIP-SO400M [89] | 78.7 | 42.7 | 87.6 | 65.5 | 44.7 |