Published on November 25

SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations

Categories: cs.SD, cs.AI, eess.AS

Release Date: November 25, 2024

Authors: Youngjun Sim¹, Jinsung Yoon¹, Young-Joo Suh¹

Affiliation: ¹Graduate School of Artificial Intelligence, POSTECH, Pohang, South Korea

arXiv: http://arxiv.org/abs/2411.16147v1

	seen-to-seen					unseen-to-unseen
Model	MOS $\uparrow$	SMOS $\uparrow$	WER(%) $\downarrow$	CER(%) $\downarrow$	EER(%) $\downarrow$	MOS $\uparrow$	SMOS $\uparrow$	WER(%) $\downarrow$	CER(%) $\downarrow$	EER(%) $\downarrow$
VQMIVC [9]	3.17 $\pm$ 0.17	2.66 $\pm$ 0.16	49.85	29.90	30.1	1.74 $\pm$ 0.13	1.58 $\pm$ 0.12	62.71	39.13	39.0
YourTTS [11]	3.03 $\pm$ 0.19	3.41 $\pm$ 0.15	31.22	16.27	5.0	1.98 $\pm$ 0.13	2.09 $\pm$ 0.15	41.04	23.65	15.9
FreeVC [12]	3.88 $\pm$ 0.14	4.10 $\pm$ 0.13	10.36	4.02	3.4	3.79 $\pm$ 0.13	2.56 $\pm$ 0.16	10.64	5.70	16.4
SKQVC	3.91 $\pm$ 0.13	4.28 $\pm$ 0.12	8.42	3.32	3.1	3.84 $\pm$ 0.14	3.95 $\pm$ 0.16	10.05	5.37	10.8
GT	4.24 $\pm$ 0.15	-	5.53	1.79	-	4.31 $\pm$ 0.13	-	8.09	4.31	-
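
The WER and CER columns above are error rates of the standard kind: an edit distance between a reference transcript and a recognized transcript, divided by the reference length and expressed as a percentage. The following is a minimal sketch of that arithmetic only. The function names are illustrative, whitespace tokenization for words and the removal of spaces for characters are assumptions, and the recognizer and text normalization behind the reported numbers are not stated in this excerpt.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance normalized by reference length, in percent."""
    if not ref_tokens:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate (%), assuming whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate (%), assuming spaces are not counted."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return error_rate(ref_chars, hyp_chars)
```

For example, `wer("the cat sat on the mat", "the cat sat on a mat")` counts one substituted word out of six reference words. Because the denominator is the reference length, a hypothesis with many insertions can push either rate above 100%.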