MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Categories: cs.CL, cs.AI, cs.CV
Release Date: October 23, 2024
Authors: Ziyu Liu¹, Yuhang Zang², Xiaoyi Dong², Pan Zhang², Yuhang Cao², Haodong Duan², Conghui He², Yuanjun Xiong³, Dahua Lin⁴, Jiaqi Wang²
Affiliations: ¹SJTU; ²Shanghai AI Laboratory; ³MThreads, Inc.; ⁴CUHK

| Models | Parameter | MMStar | SQA | MMVet | POPE | MMB | Math | AI2D | Average |
|---|---|---|---|---|---|---|---|---|---|
| LLaVA-v1.6 (Li et al., 2024b) | 7B | 37.6 | 87.5 | 40.2 | 70.3 | 69.8 | 31.5 | 67.0 | 57.7 |
| Qwen-VL-Chat (Bai et al., 2023) | 7B | 34.5 | 68.8 | 47.3 | 74.9 | 61.8 | 15.5 | 63.0 | 52.3 |
| Idefics2 (Laurençon et al., 2024b) | 8B | 49.5 | 88.7 | 34.0 | 86.2 | 75.7 | 51.4 | 72.3 | 65.4 |
| OpenFlamingo (Awadalla et al., 2023b) | 9B | 36.9 | 44.8 | 23.2 | 52.6 | 32.4 | 18.6 | 31.7 | 34.3 |
| InstructBLIP (Dai et al., 2023) | 13B | 32.7 | 54.1 | 33.1 | 86.1 | 38.3 | 24.4 | 40.6 | 44.2 |
| CogVLM (Wang et al., 2023) | 17B | 39.9 | 66.2 | 54.5 | 88.0 | 65.8 | 35.0 | 63.3 | 58.9 |
| Emu2-Chat (Sun et al., 2024) | 37B | 40.7 | 68.2 | 31.0 | 88.0 | 63.4 | 30.7 | 49.7 | 53.1 |
| LLaVA-v1.5 (Liu et al., 2024a) | 7B | 32.9 | 66.6 | 30.5 | 85.9 | 64.3 | 25.4 | 55.5 | 51.6 |
| LLaVA-RLHF (Sun et al., 2023) | 7B | 31.6 | 64.0 | 27.8 | 80.8 | 60.1 | 23.5 | 47.9 | 48.0 |
| HA-DPO (Zhao et al., 2023) | 7B | 33.5 | 67.3 | 29.1 | 84.3 | 64.9 | 25.8 | 53.9 | 51.3 |
| POVID (Zhou et al., 2024) | 7B | 36.2 | 68.8 | 31.8 | 86.3 | 64.9 | 24.4 | 55.2 | 52.5 |
| LLaVA-v1.5 + MIA-DPO (ours) | 7B | 32.9 | 67.6 | 32.1 | 87.2 | 63.1 | 24.4 | 54.7 | 51.7 |
| InternLM-XC2.5 (Zhang et al., 2024) | 7B | 59.7 | 96.3 | 48.7 | 87.9 | 81.9 | 63.3 | 81.5 | 74.2 |
| InternLM-XC2.5 + MIA-DPO (ours) | 7B | 61.1 | 96.2 | 46.7 | 86.9 | 80.4 | 61.7 | 81.6 | 73.5 |
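
The Average column appears to be the unweighted mean of the seven benchmark scores, rounded to one decimal place. The page does not state this, so treat it as an assumption. The minimal sketch below recomputes the averages for the two base models and their MIA-DPO counterparts. The row labels (`"LLaVA-v1.5 + MIA-DPO"`, `"InternLM-XC2.5 + MIA-DPO"`) and the `average` helper are illustrative names, and pairing each MIA-DPO row with a base model is inferred from row placement in the table.

```python
# Scores copied from the table, in column order:
# MMStar, SQA, MMVet, POPE, MMB, Math, AI2D
scores = {
    "LLaVA-v1.5": [32.9, 66.6, 30.5, 85.9, 64.3, 25.4, 55.5],
    # Assumption: the first MIA-DPO row is built on LLaVA-v1.5.
    "LLaVA-v1.5 + MIA-DPO": [32.9, 67.6, 32.1, 87.2, 63.1, 24.4, 54.7],
    "InternLM-XC2.5": [59.7, 96.3, 48.7, 87.9, 81.9, 63.3, 81.5],
    # Assumption: the second MIA-DPO row is built on InternLM-XC2.5.
    "InternLM-XC2.5 + MIA-DPO": [61.1, 96.2, 46.7, 86.9, 80.4, 61.7, 81.6],
}


def average(row):
    """Unweighted mean of one table row, rounded to one decimal place."""
    return round(sum(row) / len(row), 1)


averages = {name: average(row) for name, row in scores.items()}

for name, value in averages.items():
    print(f"{name}: {value}")
```

Under that assumption, the recomputed values match the Average column for these four rows (51.6, 51.7, 74.2, 73.5). On these single-image benchmarks the average therefore moves by about +0.1 for the LLaVA-v1.5 pair and about -0.7 for the InternLM-XC2.5 pair.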