Published on November 5

On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models

cs.CV

cs.AI

Release Date: November 5, 2024

Authors: Tariq Berrada Ifriqi, Pietro Astolfi, Melissa Hall, Reyhane Askari-Hemmat¹, Yohann Benchetrit, Marton Havasi, Matthew Muckley, Karteek Alahari, Adriana Romero-Soriano, Jakob Verbeek, Michal Drozdzal¹

Aff.: ¹FAIR at Meta

arXiv: http://arxiv.org/abs/2411.03177v1

| Model | ImageNet-1k 256, $\text{FID}_{\text{train}}\downarrow$ | ImageNet-1k 512, $\text{FID}_{\text{train}}\downarrow$ | CC12M 256, $\text{FID}_{\text{val}}\downarrow$ | CC12M 256, $\text{CLIP}_{\text{COCO}}\uparrow$ | CC12M 512, $\text{FID}_{\text{val}}\downarrow$ | CC12M 512, $\text{FID}_{\text{COCO}}\downarrow$ | CC12M 512, $\text{CLIP}_{\text{COCO}}\uparrow$ |
|---|---|---|---|---|---|---|---|
| Results taken from references | | | | | | | |
| UNet (SD/LDM-G4) [39] | $3.60$ | — | $17.01$ | 24 | — | $9.62$ | — |
| DiT-XL/2 w/ LN [38] | $2.27$ | $3.04$ | — | — | — | — | — |
| mDT-v2-XL/2 w/ LN [15] | $1.79$ | — | — | — | — | — | — |
| PixArt-$\alpha$-XL/2 [7] | — | — | — | — | — | $10.65$ | — |
| mmDiT-XL/2 (SD3) [14] | — | — | — | 22.4 | — | * | — |
| Our re-implementation of existing architectures | | | | | | | |
| UNet (SDXL) | $2.05$ | $4.81$ | $8.53$ | $\mathbf{25.36}$ | $12.56$ | $7.26$ | $24.79$ |
| DiT-XL/2 w/ LN | $1.95$ | $\mathbf{2.85}$ | — | — | — | — | — |
| DiT-XL/2 w/ Att | $\mathbf{1.71}$ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| mDT-v2-XL/2 w/ LN | $2.51$ | $3.75$ | — | — | — | — | — |
| PixArt-$\alpha$-XL/2 | $2.06$ | $3.05$ | ✗ | ✗ | ✗ | ✗ | ✗ |
| mmDiT-XL/2 (SD3) | $\mathbf{1.71}$ | $3.02$ | $\mathbf{7.54}$ | $24.78$ | $\mathbf{11.24}$ | $\mathbf{6.78}$ | $\mathbf{26.01}$ |
| Our improved architecture and training | | | | | | | |
| mmDiT-XL/2 (ours) | $\mathbf{1.59}$ | $\mathbf{2.76}$ | $\mathbf{6.79}$ | $\mathbf{26.60}$ | $\mathbf{6.27}$ | $\mathbf{6.69}$ | $\mathbf{26.17}$ |