notesum.ai
Published on November 22
Continual SFT Matches Multimodal RLHF with Negative Supervision
cs.AI
cs.CL
cs.CV
Release Date: November 22, 2024
Authors: Ke Zhu¹, Yu Wang², Yanpeng Sun³, Qiang Chen², Jiangjiang Liu², Gang Zhang², Jingdong Wang²
Affiliations: ¹Nanjing University; ²Baidu VIS; ³Nanjing University of Science and Technology

| Alignment Data | Method | SQA | GQA | VQA | Traditional VQA total | MMVet | MME | MMB | MM Comprehension total | POPE | CHAIR | MMHal | Hallucinations total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OCRVQA [21] | baseline | 66.8 | 62.0 | 58.0 | (+0.0) | 30.5 | 1510 | 64.3 | (+0.0) | 85.9 | 32.0 | 2.80 | (+0.0) |
| | GT-DPO | 67.8 | 61.4 | 57.7 | (+0.1) | 32.5 | 1412 | 63.9 | (+1.6) | 84.3 | 31.5 | 2.90 | (+0.6) |
| | SeVa | 67.6 | 62.0 | 57.5 | (+0.3) | 32.5 | 1502 | 64.9 | (+2.6) | 86.6 | 27.3 | 3.00 | (+8.7) |
| | SIMA | 68.0 | 61.9 | 58.2 | (+1.3) | 32.5 | 1486 | 64.8 | (+2.5) | 86.2 | 29.4 | 2.93 | (+5.1) |
| | Cont. SFT | 67.9 | 61.7 | 56.9 | (-0.3) | 33.3 | 1490 | 64.5 | (+3.0) | 87.0 | 34.0 | 2.76 | (-1.6) |
| | nSFT | 68.1 | 62.0 | 58.1 | (+1.4) | 34.0 | 1515 | 64.9 | (+4.1) | 87.1 | 26.5 | 2.93 | (+8.9) |
| TextCaps [27] | baseline | 66.8 | 62.0 | 58.0 | (+0.0) | 30.5 | 1510 | 64.3 | (+0.0) | 85.9 | 32.0 | 2.80 | (+0.0) |
| | GT-DPO | 68.0 | 61.7 | 57.5 | (+0.4) | 34.2 | 1500 | 64.2 | (+3.6) | 86.5 | 29.2 | 2.83 | (+3.9) |
| | SeVa | 68.1 | 61.7 | 57.8 | (+0.8) | 34.6 | 1480 | 65.0 | (+4.8) | 86.3 | 26.3 | 2.90 | (+7.8) |
| | SIMA | 68.0 | 62.1 | 58.0 | (+1.3) | 32.2 | 1473 | 64.9 | (+2.3) | 85.9 | 27.6 | 2.87 | (+5.6) |
| | Cont. SFT | 66.9 | 61.3 | 56.6 | (-2.0) | 31.0 | 1520 | 64.4 | (+0.6) | 86.3 | 30.5 | 2.83 | (+2.4) |
| | nSFT | 68.4 | 62.3 | 58.2 | (+2.1) | 33.7 | 1521 | 65.3 | (+4.2) | 87.2 | 26.2 | 2.97 | (+9.9) |
| LLaVA-150k [14] | baseline | 66.8 | 62.0 | 58.0 | (+0.0) | 30.5 | 1510 | 64.3 | (+0.0) | 85.9 | 32.0 | 2.80 | (+0.0) |
| | GT-DPO | 68.1 | 61.6 | 57.6 | (+0.5) | 33.9 | 1497 | 63.9 | (+3.0) | 85.9 | 30.7 | 2.80 | (+1.3) |
| | SeVa | 67.5 | 61.4 | 58.0 | (+0.1) | 32.5 | 1490 | 64.7 | (+2.4) | 85.6 | 28.2 | 2.94 | (+5.8) |
| | SIMA | 67.9 | 62.2 | 58.2 | (+1.5) | 32.1 | 1511 | 64.9 | (+2.2) | 86.9 | 26.2 | 2.97 | (+9.6) |
| | Cont. SFT | 67.1 | 60.9 | 57.0 | (-1.8) | 31.2 | 1480 | 64.0 | (+0.4) | 86.3 | 29.1 | 2.91 | (+5.1) |
| | nSFT | 68.4 | 62.3 | 58.4 | (+2.3) | 34.2 | 1550 | 65.2 | (+4.6) | 87.4 | 25.4 | 3.02 | (+11.8) |
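
The parenthesized "total" values in the table appear to be summed changes relative to the baseline row. The source does not state the aggregation rule, so the Python sketch below rests on assumptions inferred from the numbers alone: the Traditional VQA total sums the SQA, GQA and VQA deltas; the MM Comprehension total sums the MMVet and MMB deltas (MME seems to be left out); and the Hallucinations total adds the POPE gain, the CHAIR reduction (lower is better) and the MMHal gain rescaled from an assumed 0-6 range to 0-100. The names `BASELINE`, `MMHAL_MAX` and `group_totals` are hypothetical helpers, not part of the paper.

```python
# Baseline scores copied from the "baseline" row of the table.
BASELINE = {
    "SQA": 66.8, "GQA": 62.0, "VQA": 58.0,
    "MMVet": 30.5, "MMB": 64.3,
    "POPE": 85.9, "CHAIR": 32.0, "MMHal": 2.80,
}

# Assumption: MMHal is scored on a 0-6 scale and rescaled to 0-100.
MMHAL_MAX = 6.0


def group_totals(row, base=BASELINE):
    """Return (VQA, comprehension, hallucination) totals relative to base.

    The aggregation rule is inferred from the table, not taken from the paper.
    """
    # Traditional VQA: plain sum of score changes.
    vqa = sum(row[k] - base[k] for k in ("SQA", "GQA", "VQA"))
    # MM Comprehension: MMVet and MMB only (MME assumed excluded).
    comp = sum(row[k] - base[k] for k in ("MMVet", "MMB"))
    # Hallucinations: CHAIR is lower-is-better, so its sign is flipped.
    hall = (
        (row["POPE"] - base["POPE"])
        + (base["CHAIR"] - row["CHAIR"])
        + (row["MMHal"] - base["MMHal"]) * 100.0 / MMHAL_MAX
    )
    return tuple(round(x, 1) for x in (vqa, comp, hall))


# nSFT trained on LLaVA-150k, values copied from the table.
nsft_llava = {
    "SQA": 68.4, "GQA": 62.3, "VQA": 58.4,
    "MMVet": 34.2, "MMB": 65.2,
    "POPE": 87.4, "CHAIR": 25.4, "MMHal": 3.02,
}
print(group_totals(nsft_llava))
```

Under these assumptions the sketch gives the same three totals the table lists for that row, which suggests that the large Hallucinations totals are driven mostly by CHAIR reductions and the rescaled MMHal gains.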