STIV: Scalable Text and Image Conditioned Video Generation
Categories: cs.CV, cs.AI, cs.LG, cs.MM
Released Date: December 10, 2024
Authors: Zongyu Lin¹, Wei Liu¹, Chen Chen¹, Jiasen Lu¹, Wenze Hu¹, Tsu-Jui Fu¹, Jesse Allardice¹, Zhengfeng Lai¹, Liangchen Song¹, Bowen Zhang¹, Cha Chen¹, Yiran Fei¹, Yifan Jiang¹, Lezhi Li¹, Yizhou Sun², Kai-Wei Chang², Yinfei Yang¹
Affiliations: ¹Apple; ²University of California, Los Angeles

| Model | COCO FID↓ | COCO PICK↑ | COCO CLIP↑ | GenEval↑ | DSG Eval↑ | HPSv2 Eval↑ | Image Reward↑ |
|---|---|---|---|---|---|---|---|
| Baseline | 26.17 | 20.91 | 32.03 | 0.358 | 0.571 | 26.33 | -0.25 |
| + QK norm | 25.60 | 20.92 | 32.08 | 0.372 | 0.574 | 26.32 | -0.22 |
| + Sandwich norm | 25.76 | 20.97 | 32.13 | 0.366 | 0.577 | 26.32 | -0.23 |
| + Cond. norm | 25.58 | 21.05 | 32.27 | 0.393 | 0.583 | 26.43 | -0.22 |
| + LR to 2E-4 | 26.35 | 21.03 | 32.28 | 0.379 | 0.586 | 26.40 | -0.12 |
| + Flow | 24.96 | 21.45 | 32.90 | 0.457 | 0.639 | 26.95 | 0.15 |
| + Renorm | 21.16 | 21.46 | 32.93 | 0.471 | 0.668 | 27.27 | 0.32 |
| + AdaFactor | 20.26 | 21.47 | 32.97 | 0.474 | 0.661 | 27.26 | 0.32 |
| + MaskDiT | 23.85 | 21.51 | 33.07 | 0.499 | 0.663 | 27.28 | 0.30 |
| + Shared AdaLN | 22.83 | 21.44 | 33.12 | 0.496 | 0.658 | 27.27 | 0.24 |
| + Micro cond. | 20.02 | 21.50 | 33.09 | 0.498 | 0.673 | 27.27 | 0.41 |
| + RoPE | 18.40 | 21.46 | 33.11 | 0.502 | 0.680 | 27.26 | 0.48 |
| + Internal VAE | 19.57 | 21.79 | 33.26 | 0.492 | 0.668 | 27.26 | 0.52 |
| + Internal CLIP | 17.97 | 21.89 | 33.62 | 0.607 | 0.717 | 27.40 | 0.65 |
| + Synth. captions | 18.04 | 22.10 | 33.65 | 0.685 | 0.751 | 27.65 | 0.81 |