notesum.ai

Published on November 29

Accelerating Multimodal Large Language Models via Dynamic Visual-Token Exit and the Empirical Findings

Categories: cs.CV, cs.CL, cs.LG, cs.MM

Released Date: November 29, 2024

Authors: Qiong Wu¹, Wenhao Lin¹, Weihao Ye¹, Yiyi Zhou¹, Xiaoshuai Sun¹, Rongrong Ji¹

Affiliation: ¹Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China

arXiv: http://arxiv.org/pdf/2411.19628v1

Method	SEED Accuracy $\uparrow$	SEED TFLOPs $\downarrow$	MME Score $\uparrow$	MME TFLOPs $\downarrow$	MMB Accuracy $\uparrow$	MMB TFLOPs $\downarrow$	POPE Accuracy $\uparrow$	POPE TFLOPs $\downarrow$	MM-Vet Accuracy $\uparrow$	MM-Vet TFLOPs $\downarrow$
Eagle-X5 7B [45]	73.9	47.8	1528.0	27.8	68.4	29.6	88.8	27.7	37.4	27.6
Eagle-DyVTE 7B	73.6 (-0.4%)	43.0 (-10.0%)	1581.7 (+3.5%)	20.3 (-27.0%)	68.8 (+0.6%)	23.7 (-19.9%)	88.4 (-0.5%)	20.0 (-27.8%)	37.8 (+1.1%)	23.5 (-14.9%)
VILA 7B [37]	61.7	9.2	1489.2	8.9	69.9	9.5	86.3	8.8	36.3	8.7
VILA-DyVTE 7B	61.8 (+0.2%)	5.9 (-35.9%)	1503.1 (+0.1%)	4.6 (-48.3%)	69.8 (-0.1%)	6.0 (-36.8%)	85.6 (-0.8%)	4.5 (-48.9%)	36.7 (+1.1%)	6.6 (-24.1%)
InternVL 7B [11]	59.2	16.0	1525.1	15.5	64.6	16.2	86.4	15.4	31.2	15.4
Intern-DyVTE 7B	59.1 (-0.2%)	11.9 (-25.6%)	1474.1 (-3.3%)	10.9 (-29.7%)	64.4 (-0.3%)	12.0 (-25.9%)	81.3 (-5.9%)	10.9 (-29.2%)	29.5 (-5.4%)	13.0 (-15.6%)
LLaVA-1.5 7B [40]	58.6	9.2	1510.7	8.9	64.3	9.6	85.9	8.8	30.5	8.7
LLaVA-DyVTE 7B	58.6 (0.0%)	5.0 (-45.7%)	1491.4 (-1.3%)	6.3 (-29.2%)	64.7 (+0.6%)	5.4 (-43.8%)	81.6 (-5.0%)	4.1 (-53.4%)	31.9 (+4.6%)	6.3 (-27.6%)
LLaVA-1.5 13B [40]	61.6	17.6	1531.3	16.9	67.7	18.3	85.9	16.8	36.1	16.7
LLaVA-DyVTE 13B	59.3 (-3.7%)	7.1 (-59.7%)	1546.4 (+1.0%)	7.2 (-57.4%)	66.0 (-2.5%)	7.8 (-57.4%)	84.8 (-1.3%)	7.6 (-54.8%)	34.8 (-3.6%)	10.6 (-36.5%)
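
The parenthesized percentages in the table appear to be relative changes of each DyVTE row against the baseline row directly above it. The page does not state the formula, so this reading is an assumption. A minimal sketch under that assumption, using a hypothetical helper named `relative_change` and values copied from the LLaVA-1.5 7B rows:

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` relative to `baseline`, rounded to one decimal.

    Assumption: the table reports (value - baseline) / baseline * 100.
    """
    return round((value - baseline) / baseline * 100.0, 1)


# LLaVA-1.5 7B vs. LLaVA-DyVTE 7B on SEED (numbers taken from the table).
print(relative_change(58.6, 58.6))  # accuracy change: 0.0
print(relative_change(9.2, 5.0))    # TFLOPs change: -45.7
```

Under this reading, a negative TFLOPs percentage is a compute saving, while a negative accuracy or score percentage is a loss in task performance.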