Published on November 26
Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis
cs.MM
cs.CV
cs.SD
eess.AS
Released Date: November 26, 2024
Authors: Akshita Gupta¹, Tatiana Likhomanenko², Karren Dai Yang, Richard He Bai, Zakaria Aldeneh², Navdeep Jaitly²
Affiliations: ¹University of Guelph; ²Apple

| Method | Input Modality | GT WER | GT (discrete) WER | WER | Sync Score | TimeSync (s) |
|---|---|---|---|---|---|---|
| TTS | Text | 4.0 ±0.1 | 10.5 ±0.1 | 19.0 | - | - |
| VTTS (VT-ordered) | Video-Text | | | 17.2 | - | - |
| TTS | Text | 2.6 ±0.1 | 10.1 ±0.2 | 14.7 | 1.54 | 0.62 ±0.98 |
| VTTS (TV-streaming) | Text-Video | | | 14.5 | 1.66 | 0.49 ±0.63 |
| VTTS (TV-ordered) | Text-Video | | | 14.1 | 1.67 | 0.44 ±0.65 |
| VTTS (VT-ordered) | Video-Text | | | 12.2 | 1.64 | 0.47 ±0.63 |