Published on October 20
Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
cs.CR
cs.AI
cs.LG
Released Date: October 20, 2024
Authors: Yu Zhao (1), Hao Fei (2), Xiangtai Li (3), Libo Qin (4), Jiayi Ji (2), Hongyuan Zhu (5), Meishan Zhang (6), Min Zhang (6), Jianguo Wei (1)
Affiliations: (1) Tianjin University; (2) National University of Singapore; (3) Bytedance; (4) Central South University; (5) I2R & CFAR, A*STAR; (6) Harbin Institute of Technology (Shenzhen)

| Method | VSDv1 ST2I FID | VSDv1 ST2I IS | VSDv1 ST2I CLIP | VSDv1 SI2T BLEU4 | VSDv1 SI2T SPICE | VSDv2 ST2I FID | VSDv2 ST2I IS | VSDv2 ST2I CLIP | VSDv2 SI2T BLEU4 | VSDv2 SI2T SPICE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| T2I Baselines | | | | | | | | | | |
| DALLE [62] | 32.55 | 17.01 | 62.16 | - | - | 28.52 | 21.18 | 64.58 | - | - |
| Cogview [13] | 32.30 | 17.07 | 61.85 | - | - | 28.17 | 21.74 | 64.76 | - | - |
| LAFITE [97] | 30.73 | 24.39 | | - | - | 25.73 | 25.47 | | - | - |
| VQ-Diffusion [28] | 18.34 | 20.58 | 63.42 | - | - | 15.66 | 24.75 | 66.30 | - | - |
| Frido [15] | 12.86 | 25.92 | 64.65 | - | - | 11.41 | 26.02 | 67.01 | - | - |
| I2T Baselines | | | | | | | | | | |
| 3DVSD [96] | - | - | - | 54.85 | 68.76 | - | - | - | 26.40 | 46.97 |
| MNIC [25] | - | - | - | 34.21 | 66.87 | - | - | - | 20.01 | 43.88 |
| FNIC [23] | - | - | - | 37.03 | 66.50 | - | - | - | 22.62 | 43.52 |
| DiffCap [29] | - | - | - | 34.75 | 66.39 | - | - | - | 20.27 | 43.30 |
| DDCap [98] | - | - | - | 37.93 | 67.10 | - | - | - | 23.14 | 44.07 |
| Singleton | 18.05 | 20.42 | 63.51 | 48.77 | 66.59 | 14.70 | 24.62 | 66.41 | 23.51 | 43.70 |
| Singleton + 3D | 12.56 | 26.92 | 65.62 | 50.05 | 67.20 | 10.43 | 25.62 | 67.29 | 25.37 | 45.13 |
| Vanilla Dual Learning | 11.80 | 27.85 | 67.18 | 51.59 | 67.79 | 11.67 | 27.80 | 68.46 | 26.10 | 46.72 |
| SD3 (Ours) | 11.04 | 29.20 | 68.31 | 56.23 | 68.02 | 10.09 | 29.76 | 71.10 | 27.63 | 48.03 |