EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Subjects: cs.AI, cs.CV
Release Date: December 5, 2024
Authors: Lu Qiu¹, Yuying Ge², Yi Chen¹, Yixiao Ge², Ying Shan², Xihui Liu¹
Affiliations: ¹The University of Hong Kong; ²ARC Lab, Tencent PCG

Accuracy (%) on EgoPlan-Bench2, broken down by task domain (Daily life, Work, Recreation, Hobbies) and by video length (≤ 30s vs. > 30s), with overall accuracy in the final column:

| Model | Frames | LLM | Daily life | Work | Recreation | Hobbies | ≤ 30s | > 30s | Total Acc |
|---|---|---|---|---|---|---|---|---|---|
| **Image MLLMs** | | | | | | | | | |
| Yi-VL[70] | 8 | Yi-6B | 24.37 | 21.29 | 26.23 | 23.39 | 23.19 | 23.77 | 23.47 |
| MultiModal-GPT[71] | 8 | LLaMA-7B | 26.42 | 23.27 | 23.50 | 26.10 | 23.48 | 26.62 | 24.98 |
| LLaVA1.5[72] | 6 | LLaMA-7B | 29.61 | 21.04 | 27.32 | 24.07 | 23.77 | 27.26 | 25.44 |
| InternVL-1.5[73, 74] | 8 | InternLM2-Chat-1.8B | 28.02 | 24.75 | 23.50 | 24.41 | 23.91 | 27.42 | 25.59 |
| mPLUG-Owl-2[75] | 8 | LLaMA2-7B | 27.79 | 24.75 | 24.04 | 25.42 | 25.36 | 26.31 | 25.81 |
| BLIP-2[35] | 8 | Flan-T5-XL | 24.37 | 23.51 | 30.05 | 30.17 | 24.64 | 27.89 | 26.19 |
| InstructBLIP[76] | 8 | Flan-T5-XL | 27.33 | 23.02 | 26.23 | 29.49 | 25.51 | 27.26 | 26.34 |
| InstructBLIP Vicuna[76] | 8 | Vicuna-7B | 27.56 | 24.26 | 28.42 | 28.14 | 25.65 | 28.05 | 26.80 |
| DeepSeek-VL[77] | 6 | DeepSeek-LLM-7B | 32.12 | 24.75 | 26.23 | 29.83 | 28.55 | 28.53 | 28.54 |
| Qwen-VL-Chat[78] | 8 | Qwen-7B | 32.57 | 27.23 | 27.87 | 28.47 | 30.00 | 28.68 | 29.37 |
| InternVL-2[73, 74] | 8 | InternLM2.5-Chat-7B | 37.81 | 23.76 | 31.69 | 28.14 | 31.01 | 29.95 | 30.51 |
| **Video MLLMs** | | | | | | | | | |
| Video-LLaMA2[50] | 8 | Mistral-v0.2-Instruct-7B | 24.15 | 23.02 | 19.13 | 23.73 | 23.19 | 22.82 | 23.01 |
| LLaVA-NeXT-Video[48] | 16 | Vicuna1.5-7B | 26.42 | 19.55 | 24.59 | 23.05 | 22.61 | 24.09 | 23.32 |
| Video-ChatGPT[40] | 100 | LLaMA-7B | 24.15 | 22.77 | 24.59 | 24.07 | 23.33 | 24.25 | 23.77 |
| Video-LLaVA[79] | 8 | Vicuna1.5-7B | 27.11 | 22.52 | 27.87 | 24.75 | 25.22 | 25.36 | 25.28 |
| ShareGPT4Video[47] | 16 | LLaMA3-Instruct-8B | 25.51 | 23.02 | 26.78 | 27.46 | 25.07 | 25.67 | 25.36 |
| VILA[80] | 6 | LLaMA3-8B | 28.70 | 20.05 | 30.05 | 25.08 | 23.77 | 27.26 | 25.44 |
| LongVA[49] | 32 | Qwen2-Instruct-7B | 27.11 | 23.27 | 26.78 | 29.49 | 27.25 | 25.52 | 26.42 |
| VideoChat2[18] | 16 | Mistral-v0.2-Instruct-7B | 28.93 | 24.75 | 22.95 | 28.47 | 29.13 | 24.09 | 26.72 |
| Valley[44] | 8 | LLaMA-13B | 28.70 | 25.00 | 21.86 | 30.51 | 26.38 | 27.73 | 27.02 |
| **Proprietary** | | | | | | | | | |
| GPT-4V[3] | 8 | - | 36.67 | 27.72 | 33.88 | 32.54 | 33.62 | 31.54 | 32.63 |
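
For readers who want to query the leaderboard programmatically, below is a minimal Python sketch. It is not part of the paper's release; the `results` mapping simply transcribes the Total Acc column from the table above, and the script re-ranks the evaluated models by that number.

```python
# Minimal sketch (not from the paper): re-ranking the table above by
# total accuracy. Values are copied verbatim from the Total Acc column.

results = {
    # Image MLLMs
    "Yi-VL": 23.47,
    "MultiModal-GPT": 24.98,
    "LLaVA1.5": 25.44,
    "InternVL-1.5": 25.59,
    "mPLUG-Owl-2": 25.81,
    "BLIP-2": 26.19,
    "InstructBLIP": 26.34,
    "InstructBLIP Vicuna": 26.80,
    "DeepSeek-VL": 28.54,
    "Qwen-VL-Chat": 29.37,
    "InternVL-2": 30.51,
    # Video MLLMs
    "Video-LLaMA2": 23.01,
    "LLaVA-NeXT-Video": 23.32,
    "Video-ChatGPT": 23.77,
    "Video-LLaVA": 25.28,
    "ShareGPT4Video": 25.36,
    "VILA": 25.44,
    "LongVA": 26.42,
    "VideoChat2": 26.72,
    "Valley": 27.02,
    # Proprietary
    "GPT-4V": 32.63,
}

# Sort models from strongest to weakest by total accuracy (%).
for rank, (model, acc) in enumerate(
    sorted(results.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank:2d}. {model:<22s} {acc:5.2f}")
```

The ranking makes the headline result easy to see: the proprietary GPT-4V leads at 32.63%, the strongest open-source model (InternVL-2) reaches 30.51%, and many models cluster near 25%, which is close to random guessing if the benchmark follows a four-way multiple-choice format.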