EACO: Enhancing Alignment in Multimodal LLMs via Critical Observation
Subjects: cs.CV, cs.AI, cs.CL, cs.LG
Release Date: December 6, 2024
Authors: Yongxin Wang¹, Meng Cao, Haokun Lin, Mingfei Han, Liang Ma, Jin Jiang, Yuhao Cheng², Xiaodan Liang
Affiliations: ¹Mohamed bin Zayed University of Artificial Intelligence; ²Lenovo Research
![Uncaptioned image](https://arxiv.org/html/2412.04903v1/x1.png)
| Dataset | Description | # of Instructions |
| --- | --- | --- |
| LLaVA [25] | Visual Instruction Synthesized by GPT-4 | 14,128 |
| SVIT [51] | Visual Instruction Synthesized by GPT-4 | 12,142 |
| LRV [24] | Robust Visual Instruction | 7,650 |
| ComVint [11] | Complex Visual Reasoning Instruction | 1,476 |
| LLaVAR [50] | Text-rich Image Understanding | 8,524 |
| LLaVAMed [18] | Biomedical Vision-Language Instruction | 3,628 |
| PMC-VQA [49] | Medical Image Question Answering | 1,463 |
| PCA-EVAL [6] | Embodied Decision-making Instruction | 246 |
| RTVLM [22] | Red-Teaming Instructions | 1,317 |
| M3IT [21] | Academic Vision-Language Tasks | 425 |
| Total | Visual Instructions | 51,000 |
| Critic Dataset | Scoring Evaluation Instructions | 137,486 |
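
The table summarizes how the roughly 51,000 visual instructions are drawn from ten source datasets, alongside a separate 137,486-sample critic (scoring-evaluation) set. As a rough illustration of how such a mixture could be assembled, the sketch below samples a fixed number of instructions per source according to the counts above; the file names, record layout, and the `sample_instructions` helper are assumptions for illustration, not the authors' actual data pipeline.

```python
import json
import random

# Per-dataset instruction counts taken from the table above (EACO visual-instruction mixture).
MIXTURE = {
    "llava": 14128,
    "svit": 12142,
    "lrv": 7650,
    "comvint": 1476,
    "llavar": 8524,
    "llava_med": 3628,
    "pmc_vqa": 1463,
    "pca_eval": 246,
    "rtvlm": 1317,
    "m3it": 425,
}

def sample_instructions(path: str, n: int, seed: int = 0) -> list[dict]:
    """Load a JSON array of instruction records and sample n of them.

    The on-disk layout (one JSON array of dicts per source) is an assumption
    for illustration only.
    """
    with open(path) as f:
        records = json.load(f)
    rng = random.Random(seed)
    return rng.sample(records, min(n, len(records)))

def build_mixture(data_dir: str = "data") -> list[dict]:
    """Assemble the combined visual-instruction pool from all sources."""
    pool = []
    for name, count in MIXTURE.items():
        pool.extend(sample_instructions(f"{data_dir}/{name}.json", count))
    random.Random(0).shuffle(pool)
    return pool

if __name__ == "__main__":
    mixture = build_mixture()
    # Expect ~51,000 records if every source file contains at least its listed count.
    print(f"Assembled {len(mixture)} visual instructions")
```

The critic dataset (137,486 scoring-evaluation instructions) would be handled separately from this mixture, since it serves a different training role than the visual-instruction pool.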