EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Subjects: cs.AI, cs.CV
Release Date: December 5, 2024
Authors: Lu Qiu¹, Yuying Ge², Yi Chen¹, Yixiao Ge², Ying Shan², Xihui Liu¹
Affiliations: ¹The University of Hong Kong; ²ARC Lab, Tencent PCG

Accuracy (%) on EgoPlan-Bench2, broken down by task domain (Daily life, Work, Recreation, Hobbies) and by video length (≤ 30s vs. > 30s), with overall accuracy in the final column:

| Model | Frames | LLM | Daily life | Work | Recreation | Hobbies | ≤ 30s | > 30s | Total Acc |
|---|---|---|---|---|---|---|---|---|---|
| **Image MLLMs** | | | | | | | | | |
| Yi-VL[70] | 8 | Yi-6B | 24.37 | 21.29 | 26.23 | 23.39 | 23.19 | 23.77 | 23.47 |
| MultiModal-GPT[71] | 8 | LLaMA-7B | 26.42 | 23.27 | 23.50 | 26.10 | 23.48 | 26.62 | 24.98 |
| LLaVA1.5[72] | 6 | LLaMA-7B | 29.61 | 21.04 | 27.32 | 24.07 | 23.77 | 27.26 | 25.44 |
| InternVL-1.5[73, 74] | 8 | InternLM2-Chat-1.8B | 28.02 | 24.75 | 23.50 | 24.41 | 23.91 | 27.42 | 25.59 |
| mPLUG-Owl-2[75] | 8 | LLaMA2-7B | 27.79 | 24.75 | 24.04 | 25.42 | 25.36 | 26.31 | 25.81 |
| BLIP-2[35] | 8 | Flan-T5-XL | 24.37 | 23.51 | 30.05 | 30.17 | 24.64 | 27.89 | 26.19 |
| InstructBLIP[76] | 8 | Flan-T5-XL | 27.33 | 23.02 | 26.23 | 29.49 | 25.51 | 27.26 | 26.34 |
| InstructBLIP Vicuna[76] | 8 | Vicuna-7B | 27.56 | 24.26 | 28.42 | 28.14 | 25.65 | 28.05 | 26.80 |
| DeepSeek-VL[77] | 6 | DeepSeek-LLM-7B | 32.12 | 24.75 | 26.23 | 29.83 | 28.55 | 28.53 | 28.54 |
| Qwen-VL-Chat[78] | 8 | Qwen-7B | 32.57 | 27.23 | 27.87 | 28.47 | 30.00 | 28.68 | 29.37 |
| InternVL-2[73, 74] | 8 | InternLM2.5-Chat-7B | 37.81 | 23.76 | 31.69 | 28.14 | 31.01 | 29.95 | 30.51 |
| **Video MLLMs** | | | | | | | | | |
| Video-LLaMA2[50] | 8 | Mistral-v0.2-Instruct-7B | 24.15 | 23.02 | 19.13 | 23.73 | 23.19 | 22.82 | 23.01 |
| LLaVA-NeXT-Video[48] | 16 | Vicuna1.5-7B | 26.42 | 19.55 | 24.59 | 23.05 | 22.61 | 24.09 | 23.32 |
| Video-ChatGPT[40] | 100 | LLaMA-7B | 24.15 | 22.77 | 24.59 | 24.07 | 23.33 | 24.25 | 23.77 |
| Video-LLaVA[79] | 8 | Vicuna1.5-7B | 27.11 | 22.52 | 27.87 | 24.75 | 25.22 | 25.36 | 25.28 |
| ShareGPT4Video[47] | 16 | LLaMA3-Instruct-8B | 25.51 | 23.02 | 26.78 | 27.46 | 25.07 | 25.67 | 25.36 |
| VILA[80] | 6 | LLaMA3-8B | 28.70 | 20.05 | 30.05 | 25.08 | 23.77 | 27.26 | 25.44 |
| LongVA[49] | 32 | Qwen2-Instruct-7B | 27.11 | 23.27 | 26.78 | 29.49 | 27.25 | 25.52 | 26.42 |
| VideoChat2[18] | 16 | Mistral-v0.2-Instruct-7B | 28.93 | 24.75 | 22.95 | 28.47 | 29.13 | 24.09 | 26.72 |
| Valley[44] | 8 | LLaMA-13B | 28.70 | 25.00 | 21.86 | 30.51 | 26.38 | 27.73 | 27.02 |
| **Proprietary** | | | | | | | | | |
| GPT-4V[3] | 8 | - | 36.67 | 27.72 | 33.88 | 32.54 | 33.62 | 31.54 | 32.63 |
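
For readers who want to query the leaderboard programmatically, below is a minimal Python sketch. It is not part of the paper's release; the `results` mapping simply transcribes the Total Acc column from the table above, and the script re-ranks the evaluated models by that number.

```python
# Minimal sketch (not from the paper): re-ranking the table above by
# total accuracy. Values are copied verbatim from the Total Acc column.

results = {
    # Image MLLMs
    "Yi-VL": 23.47,
    "MultiModal-GPT": 24.98,
    "LLaVA1.5": 25.44,
    "InternVL-1.5": 25.59,
    "mPLUG-Owl-2": 25.81,
    "BLIP-2": 26.19,
    "InstructBLIP": 26.34,
    "InstructBLIP Vicuna": 26.80,
    "DeepSeek-VL": 28.54,
    "Qwen-VL-Chat": 29.37,
    "InternVL-2": 30.51,
    # Video MLLMs
    "Video-LLaMA2": 23.01,
    "LLaVA-NeXT-Video": 23.32,
    "Video-ChatGPT": 23.77,
    "Video-LLaVA": 25.28,
    "ShareGPT4Video": 25.36,
    "VILA": 25.44,
    "LongVA": 26.42,
    "VideoChat2": 26.72,
    "Valley": 27.02,
    # Proprietary
    "GPT-4V": 32.63,
}

# Sort models from strongest to weakest by total accuracy (%).
for rank, (model, acc) in enumerate(
    sorted(results.items(), key=lambda kv: kv[1], reverse=True), start=1
):
    print(f"{rank:2d}. {model:<22s} {acc:5.2f}")
```

The ranking makes the headline result easy to see: the proprietary GPT-4V leads at 32.63%, the strongest open-source model (InternVL-2) reaches 30.51%, and many models cluster near 25%, which is close to random guessing if the benchmark follows a four-way multiple-choice format.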