EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector
Categories: cs.SD, cs.AI, eess.AS
Release Date: November 4, 2024
Authors: Deok-Hyeon Cho, Hyung-Seok Oh, Seung-Bin Kim, Seong-Whan Lee
Subjective evaluation covers nMOS, sMOS, and eMOS; objective evaluation covers the remaining columns. Arrows indicate whether higher (↑) or lower (↓) is better.

| Method | nMOS (↑) | sMOS (↑) | eMOS (↑) | WER_Whis (↓) | WER_w2v (↓) | WER_AVG (↓) | SECS_R (↑) | SECS_W (↑) | SECS_AVG (↑) | ECA (↑) | SVAS (↑) | EECS (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GT | 4.06 ± 0.05 | 4.15 ± 0.05 | 4.11 ± 0.05 | 11.42 | 14.99 | 13.21 | 0.7358 | 0.9264 | 0.8311 | 95.53 | - | 0.9487 |
| BigVGAN [61] | 3.95 ± 0.05 | 4.01 ± 0.06 | 3.98 ± 0.06 | 11.36 | 15.15 | 13.26 | 0.7271 | 0.9231 | 0.8251 | 94.25 | 0.9815 | 0.9389 |
| Mellotron [44] | 3.42 ± 0.07 | 3.95 ± 0.07 | 3.75 ± 0.07 | 14.49 | 18.90 | 16.70 | 0.6981 | 0.8939 | 0.7960 | 46.93 | 0.7838 | 0.5061 |
| Mixedemotion [10] | 3.36 ± 0.07 | 3.89 ± 0.07 | 3.69 ± 0.07 | 20.24 | 26.48 | 23.36 | 0.6920 | 0.8764 | 0.7842 | 63.44 | 0.8669 | 0.6282 |
| YourTTS [46] | 3.52 ± 0.06 | 3.94 ± 0.06 | 3.75 ± 0.07 | 28.71 | 32.79 | 30.75 | 0.7296 | 0.6751 | 0.6811 | 74.84 | 0.8657 | 0.7890 |
| Generspeech [47] | 3.85 ± 0.06 | 3.96 ± 0.06 | 3.86 ± 0.06 | 16.63 | 22.56 | 19.60 | 0.7074 | 0.8888 | 0.7981 | 82.54 | 0.8460 | 0.8366 |
| iEmoTTS [45] | 3.77 ± 0.06 | 3.94 ± 0.06 | 3.79 ± 0.07 | 26.07 | 29.74 | 27.91 | 0.6153 | 0.8208 | 0.7181 | 77.60 | 0.8000 | 0.7474 |
| EmoSphere++ (Proposed) | 3.92 ± 0.06 | 3.97 ± 0.06 | 3.86 ± 0.06 | 15.52 | 18.85 | 17.19 | 0.7314 | 0.9047 | 0.8181 | 93.53 | 0.8717 | 0.9270 |