High-Resolution Image Synthesis via Next-Token Prediction
cs.CV
cs.AI
Released Date: November 22, 2024
Authors: Dengsheng Chen¹, Jie Hu¹, Tiezhu Yue¹, Xiaoming Wei¹
Aff.: ¹Meituan, Beijing, China

| Model | NTP | #Params | GenEval Overall | GenEval Single Obj. | GenEval Two Obj. | GenEval Counting | GenEval Colors | GenEval Position | GenEval Color Attr. | T2I-CompBench++ Color | T2I-CompBench++ Shape | T2I-CompBench++ Texture |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Small model size | | | | | | | | | | | | |
| PixArt-α [21] | | 0.6B | 0.48 | 0.98 | 0.50 | 0.44 | 0.80 | 0.08 | 0.07 | 0.4232 | 0.3764 | 0.4808 |
| SD v1.x [93] | | 0.9B | 0.43 | 0.97 | 0.38 | 0.35 | 0.76 | 0.04 | 0.06 | 0.3765 | 0.3576 | 0.4156 |
| SD v2.x [93] | | 0.9B | 0.50 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.5065 | 0.4221 | 0.4922 |
| SD 3.0 [36] | | 1.0B | 0.58 | 0.97 | 0.72 | 0.52 | 0.78 | 0.16 | 0.34 | - | - | - |
| Show-o [124] | | 1.3B | 0.53 | 0.95 | 0.52 | 0.49 | 0.82 | 0.11 | 0.28 | - | - | - |
| LDM [93] | | 1.4B | 0.37 | 0.92 | 0.29 | 0.23 | 0.70 | 0.02 | 0.05 | - | - | - |
| Hunyuan-DiT [64] | | 1.5B | 0.57 | 0.96 | 0.67 | 0.59 | 0.83 | 0.11 | 0.26 | 0.6565 | 0.3577 | 0.4718 |
| Mainstream model size | | | | | | | | | | | | |
| Lumina-T2I [136] | | 2.0B | 0.39 | 0.88 | 0.34 | 0.31 | 0.67 | 0.05 | 0.09 | 0.4081 | 0.3008 | 0.4071 |
| SD 3.0 [36] | | 2.0B | 0.62 | 0.98 | 0.74 | 0.63 | 0.67 | 0.34 | 0.36 | 0.8132 | 0.5885 | 0.7334 |
| D-JEPA·T2I | | 2.6B | 0.66 | 0.99 | 0.80 | 0.59 | 0.87 | 0.22 | 0.47 | 0.7585 | 0.5036 | 0.6355 |
| SDXL [83] | | 2.6B | 0.55 | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.5879 | 0.4687 | 0.5299 |
| LlamaGen [109] | | 3.1B | 0.32 | 0.71 | 0.34 | 0.21 | 0.58 | 0.07 | 0.04 | - | - | - |
| SD 3.0 [36] | | 4.0B | 0.64 | 0.96 | 0.80 | 0.65 | 0.73 | 0.33 | 0.37 | - | - | - |
| DALLE 2 [90] | | 4.2B | 0.52 | 0.94 | 0.66 | 0.49 | 0.77 | 0.10 | 0.19 | - | - | - |
| Extensive model size | | | | | | | | | | | | |
| Chameleon [111] | | 7.0B | 0.39 | - | - | - | - | - | - | - | - | - |
| Transfusion [135] | | 7.3B | 0.63 | - | - | - | - | - | - | - | - | - |
| SD 3.0 [36] | | 8.0B | 0.68 | 0.98 | 0.84 | 0.66 | 0.74 | 0.40 | 0.43 | - | - | - |
| Emu3 [122] | | 8.0B | 0.54 | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | - | - | - |
| Fluid [37] | | 10.5B | 0.69 | 0.96 | 0.83 | 0.63 | 0.80 | 0.39 | 0.51 | - | - | - |
| FLUX.1.dev [59] | | 12.0B | - | - | - | - | - | - | - | 0.7407 | 0.5718 | 0.6922 |
| DALLE 3 [12] | | - | 0.67 | 0.96 | 0.87 | 0.47 | 0.83 | 0.43 | 0.45 | 0.7785 | 0.6205 | 0.7036 |
| Midjourney v6 [75] | | - | 0.63 | 0.96 | 0.81 | 0.56 | 0.83 | 0.22 | 0.42 | 0.7503 | 0.6885 | 0.6101 |
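
As a reading aid for the GenEval columns, the sketch below recomputes the Overall score for a few rows of the table. It assumes that Overall is the unweighted mean of the six sub-task scores (Single Obj., Two Obj., Counting, Colors, Position, Color Attr.). That rule is not stated on this page, and the names `GENEVAL_ROWS`, `mean_score`, and `check_overall` are hypothetical helpers introduced only for illustration.

```python
# GenEval scores copied from the table: (reported Overall, six sub-task scores).
# Sub-task order: Single Obj., Two Obj., Counting, Colors, Position, Color Attr.
GENEVAL_ROWS = {
    "SD v1.x (0.9B)": (0.43, [0.97, 0.38, 0.35, 0.76, 0.04, 0.06]),
    "Hunyuan-DiT (1.5B)": (0.57, [0.96, 0.67, 0.59, 0.83, 0.11, 0.26]),
    "SD 3.0 (2.0B)": (0.62, [0.98, 0.74, 0.63, 0.67, 0.34, 0.36]),
    "SDXL (2.6B)": (0.55, [0.98, 0.74, 0.39, 0.85, 0.15, 0.23]),
}


def mean_score(scores):
    """Unweighted mean of the sub-task scores (assumed aggregation rule)."""
    return sum(scores) / len(scores)


def check_overall(rows, tol=0.01):
    """Map each model to True if its reported Overall is within tol of the mean."""
    return {
        name: abs(mean_score(subs) - overall) <= tol
        for name, (overall, subs) in rows.items()
    }


if __name__ == "__main__":
    for name, (overall, subs) in GENEVAL_ROWS.items():
        print(f"{name}: reported {overall:.2f}, recomputed {mean_score(subs):.4f}")
```

The tolerance of 0.01 allows for rounding: the sub-task scores are reported to two decimals, so a recomputed mean can differ slightly from the reported Overall, as it does for the SDXL row.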