notesum.ai

Published on December 5

Liquid: Language Models are Scalable Multi-modal Generators

cs.CV

Release Date: December 5, 2024

Authors: Junfeng Wu¹, Yi Jiang², Chuofan Ma³, Yuliang Liu¹, Hengshuang Zhao³, Zehuan Yuan², Song Bai², Xiang Bai¹

Aff.: ¹Huazhong University of Science and Technology; ²Bytedance Inc; ³University of Hong Kong

Arxiv: http://arxiv.org/pdf/2412.04332v1


Type	Method	LLM	Visual Token	Res.	VQAv2	GQA	TextVQA	POPE	MME
Und. Only	LLaVA-1.5 [38]	Vicuna-1.5-7B	Continuous	336	78.5*	62.0*	58.2	85.9	1510.7
	VILA [33]	LLaMA-2-7B	Continuous	336	79.9*	62.3*	64.4	85.5	1533.0
	InstructBLIP [11]	Vicuna-7B	Continuous	224	–	49.2	50.1	–	–
	IDEFICS-9B [26]	LLaMA-7B	Continuous	224	50.9	38.4	25.9	–	–
Und. & Gen.	Unified-IO 2 [40]	6.8B from scratch	Continuous	384	79.4*	–	–	87.7	–
	Emu [58]	LLaMA-13B	Continuous	224	52.0	–	–	–	–
	LaVIT [25]	LLaMA-7B	Continuous	224	66.0	46.8	–	–	–
	DreamLLM [13]	Vicuna-7B	Continuous	224	72.9*	–	41.8	–	–
	CM3Leon-7B [72]	7B from scratch	Discrete	256	47.6	–	–	–	–
	LWM [39]	LLaMA-2-7B	Discrete	256	55.8	44.8	18.8	75.2	–
	Show-o [69]	Phi-1.5-1.3B	Discrete	256	59.3*	48.7*	–	73.8	948.4
	VILA-U [39]	LLaMA-2-7B	Discrete	256	75.3*	58.3*	48.3	83.9	1336.2
	Chameleon [59]	34B from scratch	Discrete	512	69.6	–	–	–	–
	Ours	Gemma-7B	Discrete	512	68.0*	56.1*	40.4	81.1	1107.2
	Ours †	Gemma-7B	Discrete	512	71.3*	58.4*	42.4	81.1	1119.3