Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Venue: NeurIPS
Release Date: May 11, 2024
Authors: Haogeng Liu¹, Quanzeng You², Xiaotian Han², Yongfei Liu², Huaibo Huang¹, Ran He¹, Hongxia Yang¹
Affiliations: ¹MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences; ²ByteDance, Inc.
OpenReview: https://openreview.net/pdf/4041aad3b42874372a267d6990a28307a3c622bb.pdf

| Benchmark | Task description | Metric |
| --- | --- | --- |
| TextVQA (Singh et al., 2019) | QAs about text in images (visual perception) | VQA score (↑) |
| VizWiz VQA (Gurari et al., 2018) | QAs about images taken by blind users (visual perception) | VQA score (↑) |
| GQA (Hudson & Manning, 2019) | QAs on real-world comprehension and complex reasoning | EM (↑) |
| VQAv2 (Goyal et al., 2017) | QAs requiring vision, language, and prior world knowledge | VQA score (↑) |
| POPE (Li et al., 2023) | QAs for object-hallucination evaluation | F1 score (↑) |
| ScienceQA (Img) (Lu et al., 2022) | QAs about science | Accuracy (↑) |
| MME (Fu et al., 2023) | Comprehensive evaluation benchmark for MLLMs | Accuracy (↑) |
| MMBench (Liu et al., 2023) | Comprehensive evaluation benchmark for MLLMs | Accuracy (↑) |
| MM-Vet (Yu et al., 2023) | Integrated-capabilities benchmark for MLLMs | GPT-4 score (↑) |
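
Several of the benchmarks above report the VQA score. For reference, below is a minimal sketch of the commonly cited simplified form of that metric: a prediction earns credit in proportion to how many of the ten human annotators gave the same answer, capped at full credit for three matches. The `vqa_score` helper and its normalization are illustrative, not taken from the paper; the official evaluator additionally averages over leave-one-annotator-out subsets of the ten answers and applies more elaborate answer normalization.

```python
from typing import List

def vqa_score(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: min(#matching human answers / 3, 1).

    A prediction matched by at least 3 of the 10 annotators gets a
    score of 1.0; fewer matches earn partial credit.
    """
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Ten hypothetical annotator answers for one question.
answers = ["stop sign"] * 4 + ["sign"] * 3 + ["red sign"] * 2 + ["octagon"]

print(vqa_score("stop sign", answers))  # 1.0   (4 matches, capped)
print(vqa_score("sign", answers))       # 1.0   (3 matches / 3)
print(vqa_score("red sign", answers))   # 0.667 (2 matches / 3)
```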