notesum.ai
Published on December 4
DiffStyleTTS: Diffusion-based Hierarchical Prosody Modeling for Text-to-Speech with Diverse and Controllable Styles
cs.SD
cs.AI
cs.CL
eess.AS
Released Date: December 4, 2024
Authors: Jiaxuan Liu¹, Zhaoci Liu¹, Yajun Hu², Yingying Gao³, Shilei Zhang³, Zhenhua Ling¹
Affiliations: ¹University of Science and Technology of China; ²iFLYTEK CO.LTD.; ³China Mobile Research Institute

| Model | MOS | JSD (Pitch) | JSD (Energy) | JSD (Duration) | RTF |
|---|---|---|---|---|---|
| Ground Truth | 4.55 ± 0.05 | - | - | - | - |
| FastSpeech2 | 3.85 ± 0.06 | 0.121 | 0.037 | 0.097 | 0.019 |
| FastSpeech2* | 4.11 ± 0.07 | - | - | - | - |
| Grad-TTS | 4.08 ± 0.07 | 0.115 | 0.040 | 0.088 | 0.250 |
| Guided-TTS | 4.15 ± 0.07 | 0.080 | 0.033 | 0.050 | 0.479 |
| DiffProsody | 4.10 ± 0.06 | 0.083 | 0.030 | 0.046 | 0.063 |
| DiffStyleTTS | 4.18 ± 0.06 | 0.065 | 0.030 | 0.045 | 0.048 |
| w/o ISC | 3.92 ± 0.07 | 0.090 | 0.038 | 0.078 | - |
| w/o AISC | 3.80 ± 0.06 | 0.071 | 0.033 | 0.051 | - |
| w/o TEC | 2.05 ± 0.24 | 0.445 | 0.125 | 0.390 | - |
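
For readers unfamiliar with the two objective metrics in the table, the following is a minimal sketch of how a Jensen-Shannon (JS) divergence between two prosody feature distributions and a real-time factor (RTF) could be computed. It is not the authors' evaluation code: the histogram binning, the base-2 logarithm, and the helper names `histogram`, `js_divergence`, and `real_time_factor` are all assumptions made for illustration, and the paper may use different settings.

```python
import math


def histogram(values, bins, lo, hi):
    """Normalized histogram of a non-empty `values` list over `bins`
    equal-width bins spanning [lo, hi]. (Binning scheme is an assumption.)"""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # clamp out-of-range values into edge bins
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """JS divergence between two discrete distributions of equal length.
    With base-2 logs the value lies in [0, 1]; lower means more similar."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def real_time_factor(synthesis_seconds, audio_seconds):
    """Time spent synthesizing divided by the duration of the generated audio.
    (A common definition of RTF; assumed here.)"""
    return synthesis_seconds / audio_seconds
```

In use, one would build `histogram(...)` over, say, frame-level pitch values from ground-truth speech and from synthesized speech using the same bin range, then pass both to `js_divergence`. Under this reading, a lower JSD means the synthesized prosody distribution is closer to that of the recordings, and an RTF below 1 means synthesis runs faster than real time.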