Published on December 9

Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment

Categories: cs.CV, cs.MM, cs.SD, eess.AS

Release Date: December 9, 2024

Authors: Kim Sung-Bin¹, Arda Senocak², Hyunwoo Ha¹, Tae-Hyun Oh³

Affiliations: ¹POSTECH, Pohang, Republic of Korea; ²KAIST, Daejeon, Republic of Korea; ³POSTECH, Pohang, Republic of Korea, and Yonsei University, Seoul, Republic of Korea

arXiv: http://arxiv.org/pdf/2412.06209v1

	Method	Encoder ($V$ / $A$)	Generator ($G$ / $R$)	VGGSound (50 classes) R@1	VGGSound (50 classes) R@5	VGGSound (50 classes) FID ($\downarrow$)	VGGSound (50 classes) IS ($\uparrow$)	VEGAS R@1	VEGAS R@5
(A)	ICGAN (Casanova et al., 2021)	$V$	$G$	30.06	62.59	16.11	12.61	46.60	82.48
(B)	Ours	$A$	$G$	40.71	77.36	17.97	19.46	57.44	84.08
(C)	Retrieval	$A$	$R$	51.28	80.37	-	-	67.20	85.00
(D)	Upper bound	-	-	57.82	85.79	-	-	73.60	88.2