Published on October 31
EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching
cs.CV
cs.AI
Release Date: October 31, 2024
Authors: Xinwang Chen¹, Ning Liu¹, Yichen Zhu¹, Feifei Feng¹, Jian Tang²
Affiliations: ¹Midea Group; ²Beijing Innovation Center of Humanoid Robotics

| Model | Training cost (iter × batch size) | Params (M) | Training speed (iter/s) | GFLOPs | Inference speed (step/s) | Memory (MB) | FID |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DiT-S [21] | 400K×256 | 32.90 | 12.50 | 6.06 | 2.70 | 4296 | 68.40 |
| SD-DiT-S [39] | 400K×256 | 32.90 | - | - | - | - | 48.39 |
| EDT-S* (ours) | 400K×256 | 38.30 | 13.20 | 2.66 | 5.50 | 4268 | 38.73 |
| MDTv2-S [33] | 400K×256 | 33.10 | 2.25 | 6.07 | 2.40 | 4902 | 39.50 |
| EDT-S (ours) | 400K×256 | 38.30 | 8.86 | 2.66 | 5.50 | 4268 | 34.27 |
| DiT-B [21] | 400K×256 | 130.30 | 4.30 | 23.01 | 1.11 | 8978 | 43.47 |
| SD-DiT-B [39] | 400K×256 | 130.30 | - | - | - | - | 28.62 |
| EDT-B* (ours) | 400K×256 | 152.00 | 5.80 | 10.20 | 2.20 | 8584 | 23.19 |
| MDTv2-B [33] | 400K×256 | 130.80 | 1.42 | 23.02 | 0.96 | 9212 | 19.55 |
| MDTv2-B [33] | 1600K×256 | 130.80 | 1.42 | 23.02 | 0.96 | 9212 | 13.60 |
| EDT-B (ours) | 400K×256 | 152.00 | 4.03 | 10.20 | 2.20 | 8584 | 19.18 |
| EDT-B (ours) | 1000K×256 | 152.00 | 4.03 | 10.20 | 2.20 | 8584 | 13.58 |
| ADM [4] | 1980K×256 | 554.00 | - | 1120.00 | - | - | 10.94 |
| LDM-4 [11] | 178K×1200 | 400.00 | - | 104.00 | - | - | 10.56 |
| DiT-XL [21] | 400K×256 | 674.80 | 0.93 | 118.64 | 0.25 | 17538 | 19.47 |
| SD-DiT-XL [39] | 1300K×256 | 740.60 | - | - | - | - | 9.01 |
| EDT-XL* (ours) | 400K×256 | 698.40 | 1.49 | 51.83 | 0.51 | 14486 | 10.48 |
| MDTv2-XL [33] | 400K×256 | 675.80 | 0.51 | 118.69 | 0.23 | 23436 | 7.70 |
| EDT-XL (ours) | 400K×256 | 698.40 | 0.98 | 51.83 | 0.51 | 14486 | 7.52 |
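
As a reading aid, the relative savings at each model size can be worked out directly from the table above. The sketch below copies the GFLOPs, inference speed, and FID values for the DiT rows and the EDT rows without the asterisk, all at 400K×256. The names `ROWS` and `compare` are illustrative and do not come from the paper.

```python
# Values copied from the table above, all at 400K×256:
# (GFLOPs, inference speed in step/s, FID).
ROWS = {
    "DiT-S": (6.06, 2.70, 68.40),
    "EDT-S": (2.66, 5.50, 34.27),
    "DiT-B": (23.01, 1.11, 43.47),
    "EDT-B": (10.20, 2.20, 19.18),
    "DiT-XL": (118.64, 0.25, 19.47),
    "EDT-XL": (51.83, 0.51, 7.52),
}


def compare(baseline: str, model: str) -> dict:
    """Return relative compute, speed, and FID differences of model vs. baseline."""
    b_flops, b_speed, b_fid = ROWS[baseline]
    m_flops, m_speed, m_fid = ROWS[model]
    return {
        # Percentage of GFLOPs saved relative to the baseline.
        "gflops_reduction_pct": 100.0 * (1.0 - m_flops / b_flops),
        # Ratio of inference speeds (values above 1 mean the model is faster).
        "inference_speedup": m_speed / b_speed,
        # Negative values mean a lower (better) FID than the baseline.
        "fid_delta": m_fid - b_fid,
    }


if __name__ == "__main__":
    for size in ("S", "B", "XL"):
        r = compare(f"DiT-{size}", f"EDT-{size}")
        print(
            f"EDT-{size} vs DiT-{size}: "
            f"{r['gflops_reduction_pct']:.1f}% fewer GFLOPs, "
            f"{r['inference_speedup']:.2f}x inference speed, "
            f"FID change {r['fid_delta']:+.2f}"
        )
```

These ratios only restate the table and cover the DiT baseline alone. The same helper works for the MDTv2 rows once their values are added to `ROWS`.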