
Published on December 4

DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles

Categories: cs.SD, cs.AI, cs.CL, eess.AS

Released Date: December 4, 2024

Authors: Jiaxuan Liu¹, Zhaoci Liu¹, Yajun Hu², Yingying Gao³, Shilei Zhang³, Zhenhua Ling¹

Affiliations: ¹University of Science and Technology of China; ²iFLYTEK CO.LTD.; ³China Mobile Research Institute

arXiv: http://arxiv.org/pdf/2412.03388v1

Model	MOS $\uparrow$	JS Divergence (Pitch) $\downarrow$	JS Divergence (Energy) $\downarrow$	JS Divergence (Duration) $\downarrow$	RTF $\downarrow$
Ground Truth	4.55 $\pm$ 0.05	-	-	-	-
FastSpeech2	3.85 $\pm$ 0.06	0.121	0.037	0.097	0.019
FastSpeech2*	4.11 $\pm$ 0.07	-	-	-	-
Grad-TTS	4.08 $\pm$ 0.07	0.115	0.040	0.088	0.250
Guided-TTS	4.15 $\pm$ 0.07	0.080	0.033	0.050	0.479
DiffProsody	4.10 $\pm$ 0.06	0.083	0.030	0.046	0.063
DiffStyleTTS	4.18 $\pm$ 0.06	0.065	0.030	0.045	0.048
w/o ISC	3.92 $\pm$ 0.07	0.090	0.038	0.078	-
w/o AISC	3.80 $\pm$ 0.06	0.071	0.033	0.051	-
w/o TEC	2.05 $\pm$ 0.24	0.445	0.125	0.390	-
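
For readers unfamiliar with the two objective metrics in the table, the following is a minimal sketch of how a Jensen-Shannon (JS) divergence between two histograms and a real-time factor (RTF) are commonly computed. It is not the authors' evaluation code: the function names `js_divergence` and `real_time_factor`, the use of the natural logarithm, and the example histograms are all assumptions made for illustration. How the paper bins pitch, energy, and duration values, and how it measures synthesis time, is not stated in this page.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two histograms (natural log).

    Assumption: p and q are non-negative counts over the same bins.
    They are normalized to probability distributions before comparison.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive total mass")
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # KL(a || b); bins with a == 0 contribute nothing.
        # b > 0 wherever a > 0 because b is the mixture.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical pitch histograms for reference and synthesized speech.
    reference = [5, 20, 40, 25, 10]
    synthesized = [8, 22, 35, 25, 10]
    print(js_divergence(reference, synthesized))
    # Hypothetical timing: 0.5 s of compute for 10 s of audio.
    print(real_time_factor(0.5, 10.0))
```

Under this definition the JS divergence is 0 for identical distributions and at most ln 2 for distributions with no overlap, so lower values mean the synthesized prosody statistics sit closer to the reference. An RTF below 1 means synthesis runs faster than real time.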