mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA
cs.AI
cs.CL
Released Date: November 22, 2024
Authors: Tao Zhang1, Ziqi Zhang2, Zongyang Ma1, Yuxin Chen3, Zhongang Qi4, Chunfeng Yuan2, Bing Li2, Junfu Pu3, Yuxuan Zhao5, Zehua Xie5, Jin Ma5, Ying Shan3, Weiming Hu6
Aff.: 1State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; PCG ARC Lab; School of Artificial Intelligence, University of Chinese Academy of Sciences; 2State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; PeopleAI Inc; 3PCG ARC Lab; Tencent; 4Huawei Noah's Ark Lab; 5Tencent; 6State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; School of Artificial Intelligence, University of Chinese Academy of Sciences; School of Information Science and Technology, ShanghaiTech University

| Model | LLM | #Params | INFOSEEK | | | INFOSEEK | | | INFOSEEK | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | | | | | Overall | | | Overall | | | Overall |
| Retrieved Knowledge | | | | | | | | | | | |
| CLIP→PaLM [9] | PaLM | 540B | 21.9 | 18.6 | 20.1 | 15.6 | 14.9 | 15.2 | 22.7 | 18.5 | 20.4 |
| CLIP→FiD [9] | T5-large | 660M | 20.7 | 18.1 | 19.3 | 18.9 | 17.6 | 18.2 | 23.3 | 19.1 | 20.9 |
| Wiki-LLaVA [7] | Vicuna | 7B | – | – | – | – | – | – | 30.1 | 27.8 | 28.9 |
| LLM-RA [20] | – | – | 26.1 | 20.9 | 23.1 | – | – | – | – | – | – |
| EchoSight [45] | LLaMA3 | 8B | – | – | – | – | – | – | – | – | 31.3 |
| LLaVA-mRAG | Vicuna | 7B | 30.3 | 29.2 | 29.8 | 17.6 | 15.9 | 16.7 | – | – | – |
| LLaVA-SFR | Vicuna | 7B | 20.8 | 19.1 | 19.9 | 18.5 | 17.2 | 17.9 | – | – | – |
| LLaVA-mR$^2$AG | Vicuna | 7B | 39.1 | 38.0 | 38.6 | 30.2 | 27.5 | 28.8 | 40.6 | 39.8 | 40.2 |
| Oracle Knowledge | | | | | | | | | | | |
| Oracle→FiD [9] | T5-large | 660M | – | – | 52.0 | – | – | 45.6 | 52.1 | 53.0 | 52.5 |
| Wiki-LLaVA [7] | Vicuna | 7B | – | – | – | – | – | – | 52.7 | 50.3 | 51.5 |
| AVIS [18] | – | – | 56.4 | 50.7 | 53.4 | – | – | – | – | – | – |
| LLaVA-mRAG | Vicuna | 7B | 55.3 | 56.1 | 55.7 | 32.8 | 28.2 | 30.3 | – | – | – |
| LLaVA-SFR | Vicuna | 7B | 56.6 | 55.6 | 56.1 | 46.9 | 43.3 | 45.0 | – | – | – |
| LLaVA-mR$^2$AG | Vicuna | 7B | 58.3 | 57.9 | 58.1 | 50.4 | 47.2 | 48.7 | 60.8 | 59.3 | 60.0 |