notesum.ai

Published on October 18

Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech

Categories: cs.CV, cs.AI, cs.CL, cs.IR

Released Date: October 18, 2024

Authors: Shuwei He¹, Rui Liu¹, Haizhou Li²

Affiliations: ¹Inner Mongolia University, Hohhot, China; ²The Chinese University of Hong Kong, Shenzhen, China

arXiv: https://arxiv.org/abs/2410.14101v1

System	Test-Unseen MOS ($\uparrow$)	Test-Unseen RTE ($\downarrow$)	Test-Unseen MCD ($\downarrow$)	Test-Seen MOS ($\uparrow$)	Test-Seen RTE ($\downarrow$)	Test-Seen MCD ($\downarrow$)
GT	4.353 $\pm$ 0.023	/	/	4.348 $\pm$ 0.022	/	/
GT(voc.)	4.149 $\pm$ 0.027	0.0080	1.4600	4.149 $\pm$ 0.023	0.0060	1.4600
ProDiff [20]	3.550 $\pm$ 0.023	0.1341	4.7689	3.647 $\pm$ 0.023	0.1243	4.6711
DiffSpeech [21]	3.649 $\pm$ 0.022	0.1193	4.7923	3.675 $\pm$ 0.011	0.1034	4.6630
VoiceLDM [14]	3.702 $\pm$ 0.020	0.0825	4.8952	3.702 $\pm$ 0.025	0.0714	4.6572
ViT-TTS [1]	3.700 $\pm$ 0.025	0.0759	4.5933	3.804 $\pm$ 0.022	0.0677	4.5535
MS²KU-VTTS	3.875 $\pm$ 0.011	0.0745	4.5544	3.947 $\pm$ 0.022	0.0668	4.5175