Published on October 18
Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech
cs.CV, cs.AI, cs.CL, cs.IR
Release Date: October 18, 2024
Authors: Shuwei He¹, Rui Liu¹, Haizhou Li²
Aff.: ¹Inner Mongolia University, Hohhot, China; ²The Chinese University of Hong Kong, Shenzhen, China

| System | Test-Unseen MOS (↑) | Test-Unseen RTE (↓) | Test-Unseen MCD (↓) | Test-Seen MOS (↑) | Test-Seen RTE (↓) | Test-Seen MCD (↓) |
|---|---|---|---|---|---|---|
| GT | 4.353 ± 0.023 | / | / | 4.348 ± 0.022 | / | / |
| GT (voc.) | 4.149 ± 0.027 | 0.0080 | 1.4600 | 4.149 ± 0.023 | 0.0060 | 1.4600 |
| ProDiff [20] | 3.550 ± 0.023 | 0.1341 | 4.7689 | 3.647 ± 0.023 | 0.1243 | 4.6711 |
| DiffSpeech [21] | 3.649 ± 0.022 | 0.1193 | 4.7923 | 3.675 ± 0.011 | 0.1034 | 4.6630 |
| VoiceLDM [14] | 3.702 ± 0.020 | 0.0825 | 4.8952 | 3.702 ± 0.025 | 0.0714 | 4.6572 |
| ViT-TTS [1] | 3.700 ± 0.025 | 0.0759 | 4.5933 | 3.804 ± 0.022 | 0.0677 | 4.5535 |
| MS2KU-VTTS | 3.875 ± 0.011 | 0.0745 | 4.5544 | 3.947 ± 0.022 | 0.0668 | 4.5175 |