ENAT: Rethinking Spatial-temporal Interactions in Token-based Image Synthesis
Categories: cs.CV, cs.AI
Release Date: November 11, 2024
Authors: Zanlin Ni¹, Yulin Wang¹, Renping Zhou¹, Yizeng Han¹, Jiayi Guo¹, Zhiyuan Liu¹, Yuan Yao², Gao Huang¹
Affiliations: ¹Tsinghua University; ²National University of Singapore

| Method | Type | #Params | Steps | TFLOPs | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| BigGAN-deep [4] (ICLR’19) | GAN | - | 1 | - | 6.95 | 171.4 |
| StyleGAN-XL [63] (SIGGRAPH’22) | GAN | - | 1 | 1.5 | 2.30 | 265.1 |
| VQVAE-2 [57] (NeurIPS’19) | AR | 13.5B | 5120 | - | 31.1 | 45 |
| VQGAN [13] (CVPR’21) | AR | 1.4B | 256 | - | 15.78 | 78.3 |
| ADM-G [10] (NeurIPS’21) | Diff. | 554M | 250 | 334 | 4.59 | 186.7 |
| LDM [59] (CVPR’22) | Diff. | 400M | 250 | 52.3 | 3.60 | 247.7 |
| LDM† [59] (CVPR’22) | Diff. | 400M | 4 | 1.2 | 11.74 | - |
| | | | 8 | 2.0 | 4.56 | 262.9 |
| U-ViT-H† [3] (CVPR’23) | Diff. | 501M | 4 | 1.4 | 8.45 | - |
| | | | 8 | 2.4 | 3.37 | 235.9 |
| DiT-XL† [49] (ICCV’23) | Diff. | 675M | 4 | 1.3 | 9.71 | - |
| | | | 8 | 2.2 | 5.18 | 213.0 |
| MDT-XL† [16] (ICCV’23) | Diff. | 676M | 4 | 1.3 | 11.36 | - |
| | | | 8 | 2.2 | 4.00 | - |
| USF [38] (ICLR’24) | Diff. | 554M | 8 | 10.7 | 9.72 | - |
| MaskGIT [7] (CVPR’22) | NAT | 227M | 12 | 1.22 | 4.92 | - |
| Token-Critic [33] (ECCV’22) | NAT | 422M | 36 | 1.9 | 4.69 | 174.5 |
| Draft-and-revise [32] (NeurIPS’22) | NAT | 1.4B | 72 | - | 3.41 | 224.6 |
| MAGE [35] (CVPR’23) | NAT | 230M | 20 | 1.0 | 6.93 | - |
| MaskGIT-FSQ [43] (ICLR’24) | NAT | 225M | 12 | 0.8 | 4.53 | - |
| AdaNAT [47] (ECCV’24) | NAT | 206M | 8 | 0.9 | 2.86 | 265.4 |
| ENAT-B | NAT | 219M | 4 | 0.1 | 5.86 | - |
| | | | 8 | 0.2 | 3.53 | 302.4 |
| ENAT-L | NAT | 574M | 4 | 0.2 | 4.13 | - |
| | | | 8 | 0.3 | 2.79 | 326.7 |