Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
cs.CV
cs.MM
cs.SD
eess.AS
Released Date: December 9, 2024
Authors: Kim Sung-Bin¹, Arda Senocak², Hyunwoo Ha¹, Tae-Hyun Oh³
Affiliations: ¹POSTECH, Pohang, Republic of Korea; ²KAIST, Daejeon, Republic of Korea; ³POSTECH, Pohang, Republic of Korea; Yonsei University, Seoul, Republic of Korea

| | Method | Encoder | Generator | VGGSound (50 classes) R@1 | VGGSound (50 classes) R@5 | VGGSound (50 classes) FID (↓) | VGGSound (50 classes) IS (↑) | VEGAS R@1 | VEGAS R@5 |
|---|---|---|---|---|---|---|---|---|---|
| (A) | ICGAN (Casanova et al., 2021) | | | 30.06 | 62.59 | 16.11 | 12.61 | 46.60 | 82.48 |
| (B) | Ours | | | 40.71 | 77.36 | 17.97 | 19.46 | 57.44 | 84.08 |
| (C) | Retrieval | | | 51.28 | 80.37 | - | - | 67.20 | 85.00 |
| (D) | Upper bound | - | - | 57.82 | 85.79 | - | - | 73.60 | 88.2 |
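
The R@1 and R@5 columns report recall at rank K, expressed as a percentage. The exact evaluation protocol is not given here, so the following is only a minimal sketch of how such a score is commonly computed, assuming each query has exactly one correct gallery item and a precomputed query-to-gallery similarity matrix. The function name `recall_at_k` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                targets: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose target gallery index ranks in the top k.

    Assumption: similarity[i][j] scores query i against gallery item j,
    and targets[i] is the single correct gallery index for query i.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if len(similarity) != len(targets):
        raise ValueError("one target index is required per query row")
    hits = 0
    for scores, target in zip(similarity, targets):
        # Rank gallery indices by descending similarity; ties keep index order.
        ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
        if target in ranked[:k]:
            hits += 1
    return hits / len(targets) if targets else 0.0


# Toy data (hypothetical): 3 queries against a gallery of 3 items.
toy_similarity = [
    [0.9, 0.1, 0.3],  # query 0: target 0 is ranked first
    [0.2, 0.4, 0.8],  # query 1: target 1 is ranked second
    [0.7, 0.6, 0.1],  # query 2: target 2 is ranked third
]
toy_targets = [0, 1, 2]

# Scale to percentages, matching how the table reports R@K.
r1 = 100.0 * recall_at_k(toy_similarity, toy_targets, k=1)
r2 = 100.0 * recall_at_k(toy_similarity, toy_targets, k=2)
```

In this toy case only one of three queries ranks its target first, and two of three rank it within the top two, so R@2 is twice R@1. R@K can never decrease as K grows, which is consistent with every R@5 entry in the table exceeding its R@1 counterpart.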