notesum.ai
Published on December 4
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
cs.CV
cs.AI
Release date: December 4, 2024
Authors: Liao Qu¹, Huichao Zhang¹, Yiheng Liu¹, Xu Wang¹, Yi Jiang¹, Yiming Gao¹, Hu Ye¹, Daniel K. Du¹, Zehuan Yuan¹, Xinglong Wu¹
Affiliation: ¹ByteDance

| Model | Resolution | Downsample ratio | # Levels | rFID ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| VQ-GAN [13] | 256 | 16 | 1 | 4.98 | 20.00 | 0.629 |
| LlamaGen [44] | 256 | 16 | 1 | 2.19 | 20.79 | 0.675 |
| RQ-VAE [21] | 256 | 32 | 4 | 3.20 | – | – |
| RQ-VAE [21] | 256 | 16 | 4 | 1.30 | – | – |
| VAR [51] | 256 | 16 | 10 | 1.00 | 22.63 | 0.755 |
| VILA-U [60] | 256 | 16 | 4 | 1.80 | – | – |
| TokenFlow (ours) | 256 | 16 | 9 | 1.37 | 21.41 | 0.687 |
| LlamaGen [44] | 384 | 14.2 | 1 | 0.94 | 21.94 | 0.726 |
| VILA-U [60] | 384 | 14.2 | 16 | 1.25 | – | – |
| VAR [51] | 384 | 16 | 13 | 2.09 | 22.73 | 0.774 |
| TokenFlow (ours) | 384 | 14.2 | 15 | 0.63 | 22.77 | 0.731 |
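The downsample ratio determines how many tokens the tokenizer emits per image: the side of the finest token grid is roughly the resolution divided by the ratio. The Python sketch below illustrates that arithmetic under stated assumptions. It assumes square inputs and rounds to the nearest integer, since the 14.2 entries look like a rounded value (384 / 27 ≈ 14.22). For multi-level tokenizers (# Levels > 1) it counts only the finest scale, because the table does not give the per-level scale schedules. The function name `token_grid` is hypothetical.

```python
def token_grid(resolution: int, ratio: float) -> tuple[int, int]:
    """Return (side, total) for the finest-scale token grid.

    Assumption: square image, grid side = resolution / ratio rounded to the
    nearest integer (e.g. 384 / 14.2 is about 27.04, which rounds to 27).
    Coarser scales of a multi-level tokenizer are not counted.
    """
    side = round(resolution / ratio)
    return side, side * side


# Settings that appear in the table above.
for res, ratio in [(256, 16), (384, 14.2), (384, 16)]:
    side, total = token_grid(res, ratio)
    print(f"{res}px @ ratio {ratio}: {side}x{side} = {total} tokens")
```

Under these assumptions, the 384-pixel rows at ratio 14.2 correspond to a 27×27 grid (729 tokens), while VAR's 384-pixel row at ratio 16 corresponds to 24×24 (576 tokens) at its finest scale.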
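PSNR, one of the reconstruction metrics in the table, follows the standard definition 10 · log10(MAX² / MSE) between the original and reconstructed image. Higher PSNR means a closer reconstruction. The minimal pure-Python sketch below implements that textbook formula. It is not the authors' evaluation code: the exact protocol (value range, color space, per-image averaging) is not stated here, and `max_value = 255` assumes 8-bit pixels. SSIM and rFID are not covered; rFID in particular requires Inception-network features.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value**2 / MSE).

    Both inputs are flat sequences of pixel values of equal length.
    """
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel is off by 10, so MSE = 100 and PSNR = 10 * log10(255**2 / 100).
ref = [100, 150, 200, 50]
rec = [110, 140, 210, 40]
print(f"{psnr(ref, rec):.2f} dB")  # about 28.13 dB
```

Because PSNR is logarithmic in the MSE, the gaps in the table are larger than they look. For example, a gain of about 0.8 dB, as between the two 384-pixel LlamaGen and TokenFlow rows, corresponds to roughly a 17% reduction in mean squared error.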