notesum.ai
ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer
cs.CV
Released Date: December 10, 2024
Authors: Jinyi Hu¹, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang², Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
Aff.: ¹Tsinghua University; ²ByteDance

| Model | Type | Latent | KV-Cache | Params | FID | IS | Precision | Recall |
|---|---|---|---|---|---|---|---|---|
| ADM [14] (NeurIPS’21) | Diff. | - | - | 554M | 10.94 | 101.0 | 0.69 | 0.63 |
| LDM-4-G [54] (CVPR’22) | Diff. | Cont. | - | 400M | 3.60 | 247.7 | - | - |
| DiT-XL/2 [49] (ICCV’23) | Diff. | Cont. | - | 675M | 2.27 | 278.2 | 0.83 | 0.57 |
| MaskGIT [73] (CVPR’22) | Mask. | Disc. | - | 227M | 6.18 | 316.2 | 0.83 | 0.58 |
| MAGE [38] (CVPR’23) | Mask. | Disc. | - | 230M | 6.93 | 195.8 | - | - |
| VQGAN [19] (CVPR’21) | AR | Disc. | ✓ | 1.4B | 15.78 | 78.3 | - | - |
| RQTran [36] (CVPR’22) | AR | Disc. | ✓ | 3.8B | 7.55 | 134.0 | - | - |
| VAR-d16 [66] (NeurIPS’24) | VAR | Disc. | ✓ | 310M | 3.30 | 274.4 | 0.84 | 0.51 |
| VAR-d20 [66] (NeurIPS’24) | VAR | Disc. | ✓ | 600M | 2.57 | 302.6 | 0.83 | 0.56 |
| LlamaGen-L [64] (arxiv’24) | AR | Disc. | ✓ | 343M | 3.07 | 256.1 | 0.83 | 0.52 |
| LlamaGen-XL [64] (arxiv’24) | AR | Disc. | ✓ | 775M | 2.62 | 244.1 | 0.80 | 0.57 |
| LlamaGen-XXL [64] (arxiv’24) | AR | Disc. | ✓ | 1.4B | 2.34 | 253.9 | 0.80 | 0.59 |
| ImageFolder [40] (arxiv’24) | AR | Disc. | ✓ | 362M | 2.60 | 295.0 | 0.75 | 0.63 |
| MAR-L [66] (NeurIPS’24) | AR | Cont. | ✓ | 479M | 4.07 | 232.4 | - | - |
| MAR-L [66] (NeurIPS’24) | MAR | Cont. | - | 479M | 1.78 | 296.0 | 0.81 | 0.60 |
| ACDiT-L | AR+Diff | Cont. | ✓ | 460M | 2.53 | 262.9 | 0.82 | 0.55 |
| ACDiT-XL | AR+Diff | Cont. | ✓ | 677M | 2.45 | 267.4 | 0.82 | 0.57 |
| ACDiT-H | AR+Diff | Cont. | ✓ | 954M | 2.37 | 273.3 | 0.82 | 0.57 |