Published at November 21
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
cs.CV
Released Date: November 21, 2024
Authors: Yuhao Dong¹, Zuyan Liu², Hai-Long Sun², Jingkang Yang¹, Winston Hu², Yongming Rao³, Ziwei Liu¹
Affiliations: ¹S-Lab, NTU; ²Tencent; ³Tencent, Tsinghua University

| Model | Size | MMMU | MMMU-Pro | MMBench | MME | ChartQA | MMStar | MathVista | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-VL [29] | 7B | 35.4 | - | 73.5 | -/- | 59.1 | 37.1 | 36.1 | - |
| VILA-1.5 [20] | 8B | 38.6 | - | 75.3 | 1634.9/- | - | 39.7 | - | - |
| Cambrian-1 [43] | 8B | 42.7 | - | 75.9 | 1547.1/- | 73.3 | - | 49.0 | - |
| InternLM-XComposer2 [7] | 7B | 41.1 | - | 77.6 | 2220.4 | 71.8 | 56.2 | 59.5 | - |
| POINTS [26] | 7B | 51.4 | - | 78.0 | 2184.1 | - | 60.9 | 63.0 | - |
| IXC-2.5 [55] | 7B | 42.9 | - | 79.4 | 2233.1 | 82.2 | 59.9 | 63.7 | - |
| Bunny-LLaMA3 [12] | 8B | 43.4 | - | 77.2 | 1588.9/321.1 | - | - | 34.4 | - |
| MM-1.5 [54] | 7B | 41.8 | - | - | 1514.9/346.4 | 78.6 | - | 47.6 | - |
| MiniCPM-LLaMA3-V 2.5 [49] | 8B | 45.8 | 19.6 | 77.2 | 2024.6 | - | 51.8 | 54.3 | - |
| MiniCPM-V-2.6 [50] | 7B | 49.8 | 27.2 | 78.0 | 2268.7 | - | 57.5 | 60.6 | - |
| Qwen2-VL [38] | 7B | 53.7 | - | 81.0 | - | 83.0 | 60.7 | 61.4 | - |
| Idefics3-LLaMA3 [14] | 8B | 46.6 | 22.9 | 77.5 | 1937.4 | 74.8 | 55.9 | 58.4 | 48.1 |
| Ovis1.5-LLaMA3 [31] | 8B | 48.3 | 23.6 | 76.6 | 1948.5 | 76.4 | 57.3 | 63.0 | 49.4 |
| LLaVA-NeXT-LLaMA3 [22] | 8B | 36.9 | 13.2 | 72.3 | 1611.1/346.0 | 69.4 | 43.1 | 45.9 | 40.2 |
| + Multi-Agent | 8B | 40.8 | 17.8 | 77.6 | 1603.7/469.3 | 74.6 | 52.6 | 47.4 | 44.5 |
| + Iterative DPO (Insight-V-LLaVA) | 8B | 42.0 | 21.0 | 81.7 | 1583.9/485.4 | 77.4 | 57.4 | 49.8 | 47.2 (+7.0) |
| Our Base Model | 7B | 47.1 | 22.6 | 81.3 | 1573.7/482.5 | 75.7 | 57.0 | 56.9 | 48.7 |
| + Multi-Agent | 7B | 49.7 | 23.8 | 82.2 | 1662.2/629.3 | 81.2 | 58.6 | 58.7 | 50.7 |
| + Iterative DPO (Insight-V) | 7B | 50.2 | 24.9 | 82.3 | 1685.1/627.0 | 81.5 | 61.5 | 59.9 | 51.6 (+2.9) |
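
The parenthesized gains in the Average column, (+7.0) and (+2.9), appear to be the difference between each final row and its base-model row. The sketch below checks that arithmetic using only the Average values copied from the table. The helper name `gain_over_base` and the stage keys are hypothetical, introduced here for illustration. How the Average itself is computed from the individual benchmarks is not stated in this excerpt, so the sketch does not attempt to reproduce it.

```python
# Average scores copied from the last column of the table above.
# The stage keys ("base", "multi_agent", "iterative_dpo") are illustrative labels.
averages = {
    "LLaVA-NeXT-LLaMA3": {"base": 40.2, "multi_agent": 44.5, "iterative_dpo": 47.2},
    "Our Base Model": {"base": 48.7, "multi_agent": 50.7, "iterative_dpo": 51.6},
}


def gain_over_base(stages):
    """Hypothetical helper: difference of each later stage from the base row,
    rounded to one decimal place to match the table's precision."""
    base = stages["base"]
    return {
        name: round(score - base, 1)
        for name, score in stages.items()
        if name != "base"
    }


gains = {model: gain_over_base(stages) for model, stages in averages.items()}

for model, per_stage in gains.items():
    print(model, per_stage)
```

Under this reading, the iterative DPO rows come out at 7.0 and 2.9 points over their respective base models, matching the parenthesized values in the table.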