notesum.ai

Published on November 25

Factorized Visual Tokenization and Generation

cs.CV

Released Date: November 25, 2024

Authors: Zechen Bai¹, Jianxiong Gao², Ziteng Gao¹, Pichao Wang³, Zheng Zhang³, Tong He³, Mike Zheng Shou¹

Aff.: ¹Show Lab, National University of Singapore; ²Fudan University; ³Amazon

Arxiv: http://arxiv.org/abs/2411.16681v1


Method	Downsample Ratio	Codebook Size	Code Dim	rFID $\downarrow$	PSNR $\uparrow$
VQGAN [6]	16	16384	256	4.98	$-$
SD-VQGAN [23]	16	16384	4	5.15	$-$
RQ-VAE [12]	16	16384	256	3.20	$-$
LlamaGen [25]	16	16384	8	2.19	20.79
Titok-B [36]	$-$	4096	12	1.70	$-$
VQGAN-LC [41]	16	100000	8	2.62	23.80
VQ-KD [30]	16	8192	32	3.41	$-$
VILA-U [31]	16	16384	256	1.80	$-$
Open-MAGVIT2 [15]	16	262144	1	1.17	21.90
FQGAN-Dual	16	16384 $\times$ 2	8	0.94	22.02
FQGAN-Triple	16	16384 $\times$ 3	8	0.76	22.73
SD-VAE^† [23]	8	$-$	4	0.74	25.68
SDXL-VAE^† [19]	8	$-$	4	0.68	26.04
ViT-VQGAN [33]	8	8192	32	1.28	$-$
VQGAN^∗ [6]	8	16384	4	1.19	23.38
SD-VQGAN^∗ [23]	8	16384	4	1.14	$-$
OmniTokenizer [29]	8	8192	8	1.11	$-$
LlamaGen [25]	8	16384	8	0.59	25.45
Open-MAGVIT2 [15]	8	262144	1	0.34	26.19
FQGAN-Dual	8	16384 $\times$ 2	8	0.32	26.27
FQGAN-Triple	8	16384 $\times$ 3	8	0.24	27.58