notesum.ai
Published on November 27
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
cs.CV
cs.CL
Released Date: November 27, 2024
Authors: Di Zhang¹, Jingdi Lei², Junxian Li³, Xunzhi Wang⁴, Yujie Liu⁵, Zonglin Yang⁶, Jiatong Li⁷, Weida Wang⁸, Suorong Yang⁹, Jianbo Wu¹⁰, Peng Ye¹, Wanli Ouyang¹¹, Dongzhan Zhou¹¹
Affiliations: ¹Fudan University; ²Beijing Institute of Technology; ³Shanghai Jiaotong University; ⁴Nankai University; ⁵Shanghai University; ⁶Nanyang Technological University; ⁷Hong Kong Polytechnic University; ⁸Tongji University; ⁹Nanjing University; ¹⁰University of California, Merced; ¹¹Shanghai Artificial Intelligence Laboratory

| Model | RealWorldQA [53] | MMStar [6] | MMBench [30] | SEEDBench [23] | ScienceQA [32] | MMT-Bench [58] |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA-V1.5-7B | 50.7 | 32.2 | 68.4 | 65.6 | 60.8 | 36.0 |
| +POVID [66] | 51.8 | 33.6 | 71.6 | 65.4 | 65.0 | 33.4 |
| +CSR [67] | 51.8 | 32.4 | 70.6 | 65.4 | 66.0 | 33.2 |
| +SIMA [50] | 49.3 | 32.6 | 70.6 | 65.2 | 64.2 | 34.0 |
| +SCL [14] | 53.2 | 35.8 | 70.8 | 68.6 | 67.8 | 39.6 |
| +Critic-V (Ours) | 63.5 | 38.4 | 73.8 | 70.1 | 65.2 | 49.7 |
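
The comparison in the table is easier to read as differences from the LLaVA-V1.5-7B baseline. The following is a minimal Python sketch that copies the scores from the table and computes, for any method, its gain over the baseline on each benchmark, plus the top-scoring method per benchmark. The names `BENCHMARKS`, `SCORES`, `gains_over_baseline`, and `best_method_per_benchmark` are illustrative and not taken from the paper.

```python
# Scores transcribed from the table above, in column order.
# All names here are hypothetical helpers for illustration only.
BENCHMARKS = ["RealWorldQA", "MMStar", "MMBench", "SEEDBench", "ScienceQA", "MMT-Bench"]

SCORES = {
    "LLaVA-V1.5-7B": [50.7, 32.2, 68.4, 65.6, 60.8, 36.0],
    "+POVID":        [51.8, 33.6, 71.6, 65.4, 65.0, 33.4],
    "+CSR":          [51.8, 32.4, 70.6, 65.4, 66.0, 33.2],
    "+SIMA":         [49.3, 32.6, 70.6, 65.2, 64.2, 34.0],
    "+SCL":          [53.2, 35.8, 70.8, 68.6, 67.8, 39.6],
    "+Critic-V":     [63.5, 38.4, 73.8, 70.1, 65.2, 49.7],
}


def gains_over_baseline(method, baseline="LLaVA-V1.5-7B"):
    """Return {benchmark: method score minus baseline score}, rounded to 0.1."""
    return {
        bench: round(m - b, 1)
        for bench, m, b in zip(BENCHMARKS, SCORES[method], SCORES[baseline])
    }


def best_method_per_benchmark():
    """Return {benchmark: name of the row with the highest score}."""
    return {
        bench: max(SCORES, key=lambda name: SCORES[name][i])
        for i, bench in enumerate(BENCHMARKS)
    }


if __name__ == "__main__":
    for bench, gain in gains_over_baseline("+Critic-V").items():
        print(f"{bench}: {gain:+.1f}")
    print(best_method_per_benchmark())
```

On these numbers, the largest gains for +Critic-V over the baseline are on MMT-Bench and RealWorldQA, and +Critic-V has the highest score on every benchmark except ScienceQA, where +SCL scores higher.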