JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
Published on November 12
Categories: cs.CV, cs.AI, cs.CL
Release date: November 12, 2024
Authors: Yiyang Ma¹, Xingchao Liu², Xiaokang Chen², Wen Liu², Chengyue Wu, Zhiyu Wu², Zizheng Pan², Zhenda Xie², Haowei Zhang², Xingkai Yu², Liang Zhao², Yisong Wang³, Jiaying Liu⁴, Chong Ruan²
Affiliations: ¹DeepSeek-AI, Peking University; ²DeepSeek-AI; ³DeepSeek-AI, Tsinghua University; ⁴Peking University

| Exp. ID | REPA | Und. Modules | Gen. Modules | Type | Train. Iter. | POPE | VQAv2 | GQA | FID | CLIP |
|---|---|---|---|---|---|---|---|---|---|---|
| A |  | SigLIP | VAE†+ConvNeXt | Unified | 50,000 | 82.40 | 69.62 | 54.43 | 19.84 | 24.94 |
| B |  | Shared VAE†+ConvNeXt | Shared VAE†+ConvNeXt | Unified | 50,000 | 78.13 | 53.94 | 44.04 | 18.05 | 26.38 |
| C |  | VAE+ConvNeXt | VAE†+ConvNeXt | Unified | 50,000 | 75.30 | 55.41 | 44.44 | 17.53 | 26.32 |
| D |  | SigLIP | - | Und. Only | 13,000 | 85.03 | 69.10 | 54.23 | - | - |
| E |  | - | VAE†+ConvNeXt | Gen. Only | 37,000 | - | - | - | 16.69 | 26.89 |
| F |  | SigLIP | VAE†+ConvNeXt | Unified | 50,000 | 84.73 | 69.20 | 54.83 | 17.61 | 26.40 |
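To make the comparisons in this table explicit, here is a small Python sketch that transcribes the rows and computes per-metric differences between runs. The `RESULTS` dictionary and the `deltas` helper are illustrative names introduced here, not code from the paper. The sketch checks that the understanding-only run D and the generation-only run E together use the same 50,000 iterations as a unified run, compares the unified run F against D and E, and measures how far runs B and C, which use a VAE-based understanding encoder instead of SigLIP, fall behind F.

```python
# Metrics transcribed from the ablation table. None marks a metric the run
# was not evaluated on. POPE, VQAv2, GQA and CLIP are higher-is-better;
# FID is lower-is-better, so a positive FID delta means worse generation.
RESULTS = {
    "A": {"iters": 50_000, "POPE": 82.40, "VQAv2": 69.62, "GQA": 54.43, "FID": 19.84, "CLIP": 24.94},
    "B": {"iters": 50_000, "POPE": 78.13, "VQAv2": 53.94, "GQA": 44.04, "FID": 18.05, "CLIP": 26.38},
    "C": {"iters": 50_000, "POPE": 75.30, "VQAv2": 55.41, "GQA": 44.44, "FID": 17.53, "CLIP": 26.32},
    "D": {"iters": 13_000, "POPE": 85.03, "VQAv2": 69.10, "GQA": 54.23, "FID": None, "CLIP": None},
    "E": {"iters": 37_000, "POPE": None, "VQAv2": None, "GQA": None, "FID": 16.69, "CLIP": 26.89},
    "F": {"iters": 50_000, "POPE": 84.73, "VQAv2": 69.20, "GQA": 54.83, "FID": 17.61, "CLIP": 26.40},
}

UND_METRICS = ("POPE", "VQAv2", "GQA")
GEN_METRICS = ("FID", "CLIP")


def deltas(run: str, baseline: str, metrics: tuple) -> dict:
    """Return run minus baseline for each metric, rounded to 2 decimals."""
    return {m: round(RESULTS[run][m] - RESULTS[baseline][m], 2) for m in metrics}


# The two single-task runs together match the budget of one unified run.
assert RESULTS["D"]["iters"] + RESULTS["E"]["iters"] == RESULTS["F"]["iters"]

# Unified run F versus the task-specific runs D (understanding) and E (generation).
und_gap = deltas("F", "D", UND_METRICS)
gen_gap = deltas("F", "E", GEN_METRICS)

# B shares one VAE+ConvNeXt encoder across tasks; C uses a separate
# VAE+ConvNeXt understanding encoder. Both are compared against F (SigLIP).
encoder_gaps = {run: deltas(run, "F", UND_METRICS) for run in ("B", "C")}

print("F vs D (understanding):", und_gap)
print("F vs E (generation):", gen_gap)
for run, gap in encoder_gaps.items():
    print(f"{run} vs F (understanding):", gap)
```

With these numbers, F is 0.3 points below D on POPE and slightly above it on VQAv2 and GQA, while its FID is 0.92 higher (worse) and its CLIP score 0.49 lower than E. Runs B and C trail F by roughly 10 to 15 points on VQAv2 and GQA.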