Multi-modal Retrieval Augmented Multi-modal Generation: A Benchmark, Evaluate Metrics and Strong Baselines
cs.CL
Release Date: November 25, 2024
Authors: Zi-Ao Ma¹, Tian Lan¹, Rong-Cheng Tu², Yong Hu³, Heyan Huang¹, Xian-Ling Mao¹
Affiliations: ¹School of Computer Science and Technology, Beijing Institute of Technology, China; ²Nanyang Technological University, Singapore; ³WeChat AI, Tencent Inc., China

Text-modal metrics: Flu., Rel., CP., Faith. Multi-modal metrics: Coher., Help., Ref., Recall.

| Model Type | Approach | Model | Flu. | Rel. | CP. | Faith. | Coher. | Help. | Ref. | Recall | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLMs | Separate | GPT-4o | 0.80 | 0.82 | 0.69 | 0.89 | 0.53 | 0.44 | 0.43 | 0.43 | 0.63 |
| LLMs | Single | GPT-4o | 0.79 | 0.81 | 0.76 | 0.91 | 0.69 | 0.57 | 0.31 | 0.57 | 0.68 |
| LLMs | Single | Llama-3.1-70B-Instruct | 0.75 | 0.79 | 0.75 | 0.85 | 0.57 | 0.47 | 0.18 | 0.22 | 0.57 |
| LLMs | Single | Qwen2.5-72B-Instruct | 0.79 | 0.83 | 0.77 | 0.87 | 0.62 | 0.52 | 0.35 | 0.52 | 0.66 |
| LLMs | Single | Llama-3.1-8B-Instruct | 0.74 | 0.77 | 0.72 | 0.83 | 0.49 | 0.38 | 0.28 | 0.62 | 0.60 |
| LLMs | Single | Qwen2.5-7B-Instruct | 0.71 | 0.73 | 0.73 | 0.91 | 0.52 | 0.41 | 0.22 | 0.63 | 0.61 |
| LLMs | Single | Average | 0.75 | 0.79 | 0.75 | 0.87 | 0.56 | 0.45 | 0.27 | 0.51 | 0.62 |
| LLMs | Multi | GPT-4o | 0.78 | 0.76 | 0.68 | 0.87 | 0.68 | 0.60 | 0.81 | 0.97 | 0.77 |
| LLMs | Multi | Llama-3.1-70B-Instruct | 0.73 | 0.75 | 0.72 | 0.84 | 0.68 | 0.55 | 0.77 | 0.97 | 0.75 |
| LLMs | Multi | Qwen2.5-72B-Instruct | 0.77 | 0.77 | 0.72 | 0.85 | 0.69 | 0.58 | 0.78 | 0.97 | 0.77 |
| LLMs | Multi | Llama-3.1-8B-Instruct | 0.72 | 0.73 | 0.72 | 0.82 | 0.66 | 0.55 | 0.75 | 0.93 | 0.74 |
| LLMs | Multi | Qwen2.5-7B-Instruct | 0.74 | 0.74 | 0.74 | 0.83 | 0.64 | 0.55 | 0.76 | 0.95 | 0.74 |
| LLMs | Multi | Average | 0.75 | 0.75 | 0.72 | 0.84 | 0.67 | 0.57 | 0.77 | 0.96 | 0.75 |
| MLLMs | Single | GPT-4o | 0.78 | 0.82 | 0.76 | 0.90 | 0.64 | 0.53 | 0.24 | 0.60 | 0.66 |
| MLLMs | Single | Llama-3.2-90B-V-Instruct | 0.77 | 0.68 | 0.64 | 0.76 | 0.51 | 0.34 | 0.10 | 0.01 | 0.48 |
| MLLMs | Single | Qwen2-VL-72B-Instruct | 0.77 | 0.78 | 0.72 | 0.88 | 0.43 | 0.32 | 0.15 | 0.16 | 0.53 |
| MLLMs | Single | Llama-3.2-11B-V-Instruct | 0.77 | 0.64 | 0.60 | 0.68 | 0.26 | 0.21 | 0.02 | 0.02 | 0.40 |
| MLLMs | Single | Qwen2-VL-7B-Instruct | 0.66 | 0.64 | 0.71 | 0.83 | 0.39 | 0.30 | 0.23 | 0.15 | 0.49 |
| MLLMs | Single | Average | 0.75 | 0.71 | 0.69 | 0.81 | 0.53 | 0.43 | 0.22 | 0.19 | 0.54 |
| MLLMs | Multi | GPT-4o | 0.77 | 0.78 | 0.72 | 0.86 | 0.65 | 0.56 | 0.81 | 0.97 | 0.76 |
| MLLMs | Multi | Llama-3.2-90B-V-Instruct | 0.72 | 0.70 | 0.67 | 0.74 | 0.55 | 0.43 | 0.62 | 0.94 | 0.67 |
| MLLMs | Multi | Qwen2-VL-72B-Instruct | 0.75 | 0.74 | 0.71 | 0.80 | 0.63 | 0.52 | 0.73 | 0.94 | 0.73 |
| MLLMs | Multi | Llama-3.2-11B-V-Instruct | 0.72 | 0.69 | 0.70 | 0.69 | 0.47 | 0.34 | 0.34 | 0.81 | 0.59 |
| MLLMs | Multi | Qwen2-VL-7B-Instruct | 0.70 | 0.75 | 0.73 | 0.77 | 0.54 | 0.42 | 0.51 | 0.94 | 0.67 |
| MLLMs | Multi | Average | 0.73 | 0.73 | 0.70 | 0.77 | 0.57 | 0.46 | 0.62 | 0.92 | 0.69 |
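For readers checking the table, the Overall column is consistent with the unweighted mean of the eight metric columns (four text-modal plus four multi-modal). This is an inference from the reported numbers, not a definition quoted from the paper, which may specify a different aggregation. A minimal sketch under that assumption:

```python
# Assumption: Overall = unweighted mean of the eight metric scores.
# Example row: LLMs / Separate / GPT-4o.
scores = {
    "Flu.": 0.80, "Rel.": 0.82, "CP.": 0.69, "Faith.": 0.89,      # text-modal
    "Coher.": 0.53, "Help.": 0.44, "Ref.": 0.43, "Recall": 0.43,  # multi-modal
}

overall = sum(scores.values()) / len(scores)
print(round(overall, 2))  # 0.63, matching the reported Overall value for this row
```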