Taming Scalable Visual Tokenizer for Autoregressive Image Generation
cs.CV
cs.AI
Released Date: December 3, 2024
Authors: Fengyuan Shi¹, Zhuoyan Luo², Yixiao Ge³, Yujiu Yang², Ying Shan³, Limin Wang¹
Affiliations: ¹Nanjing University; ²Tsinghua University; ³ARC Lab, Tencent PCG

| Method | Token Type | Tokens | Ratio | Train Resolution | Codebook Size | Codebook Dim | rFID | LPIPS | Codebook Usage |
|---|---|---|---|---|---|---|---|---|---|
| VQGAN [6] | 2D | 16 × 16 | 16 | 256 × 256 | 1,024 | 256 | 7.94 | | 44% |
| VQGAN [6] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 256 | 4.98 | 0.2843 | 5.9% |
| VQGAN∗ [6] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 256 | 3.98 | 0.2873 | 5.3% |
| SD-VQGAN [20] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 8 | 5.15 | | |
| MaskGIT [3] | 2D | 16 × 16 | 16 | 256 × 256 | 1,024 | 256 | 2.28 | | |
| LlamaGen [24] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 256 | 9.21 | | 0.29% |
| LlamaGen [24] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 8 | 2.19 | 0.2281 | 97% |
| VQGAN-LC [36] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 8 | 3.01 | 0.2358 | 99% |
| VQGAN-LC [36] | 2D | 16 × 16 | 16 | 256 × 256 | 100,000 | 8 | 2.62 | 0.2212 | 99% |
| MaskBit [30] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 0 | 1.61 | | |
| Open-MAGVIT2 [16] | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 0 | 1.58 | 0.2261 | 100% |
| Open-MAGVIT2 [16] | 2D | 16 × 16 | 16 | 256 × 256 | 262,144 | 0 | 1.17 | 0.2038 | 100% |
| IBQ (Ours) | 2D | 16 × 16 | 16 | 256 × 256 | 16,384 | 256 | | 0.2235 | 96% |
| IBQ (Ours) | 2D | 16 × 16 | 16 | 256 × 256 | 262,144 | 256 | 1.00 | 0.2030 | 84% |
| Titok-L [33] | 1D | 32 | | 256 × 256 | 4,096 | 16 | 2.21 | | |
| Titok-B [33] | 1D | 64 | | 256 × 256 | 4,096 | 16 | 1.70 | | |
| Titok-S [33] | 1D | 128 | | 256 × 256 | 4,096 | 16 | 1.71 | | |
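
The Tokens, Ratio, and Train Resolution columns are linked by simple arithmetic: for the 2D tokenizers, a 256 × 256 input downsampled by a ratio of 16 gives a 16 × 16 grid, i.e. 256 tokens per image. The Python sketch below illustrates that relationship, along with one common way to measure codebook usage (the fraction of codebook entries selected at least once). The function names `token_grid` and `codebook_usage` are hypothetical, and the usage definition is an assumption; the table does not state how usage was computed.

```python
from typing import Iterable, Tuple


def token_grid(resolution: int, ratio: int) -> Tuple[int, int]:
    """Return (grid side, total tokens) for a square image and a 2D tokenizer.

    Assumes the tokenizer downsamples each spatial axis by `ratio`.
    """
    if resolution % ratio != 0:
        raise ValueError("resolution must be divisible by the downsampling ratio")
    side = resolution // ratio
    return side, side * side


def codebook_usage(indices: Iterable[int], codebook_size: int) -> float:
    """Fraction of codebook entries that appear at least once in `indices`.

    Assumed definition: distinct codes used divided by codebook size.
    """
    return len(set(indices)) / codebook_size


side, total = token_grid(256, 16)
print(side, total)  # 16 256

# Toy example: 3 of 4 codes are selected at least once.
print(codebook_usage([0, 1, 1, 3], 4))  # 0.75
```

Under this assumed definition, a larger codebook with fewer distinct codes selected gives a lower usage figure even if the absolute number of active codes is higher.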