V2SFlow: Video-to-Speech Generation with Speech Decomposition and Rectified Flow
cs.CV
cs.SD
eess.AS
Released Date: November 29, 2024
Authors: Jeongsoo Choi¹, Ji-Hoon Kim¹, Jinyu Li², Joon Son Chung¹, Shujie Liu²
Aff.: ¹Korea Advanced Institute of Science and Technology; ²Microsoft

| Method | LRS3-TED UTMOS | LRS3-TED WER | LRS3-TED SECS | LRS3-TED MAE_F0 | LRS3-TED LSE-C | LRS2-BBC UTMOS | LRS2-BBC WER | LRS2-BBC SECS | LRS2-BBC MAE_F0 | LRS2-BBC LSE-C |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ground Truth | 3.519 | 2.5 | – | – | 7.63 | 3.017 | 4.2 | – | – | 8.15 |
| with speaker embedding from audio | | | | | | | | | | |
| SVTS [11] | 1.256 | 78.0 | 0.557 | 0.389 | 6.04 | 1.349 | 80.9 | 0.593 | 0.374 | 7.91 |
| Intelligible [6] | 2.657 | 29.8 | 0.761 | 0.265 | 8.04 | 2.294 | 38.1 | 0.701 | 0.278 | 8.23 |
| V2SFlow-A | 3.624 | 28.5 | 0.851 | 0.245 | 7.97 | 3.393 | 35.2 | 0.819 | 0.263 | 8.28 |
| with speaker embedding from video | | | | | | | | | | |
| DiffV2S [12] | 2.989 | 39.2 | 0.627 | 0.290 | 7.28 | 2.877 | 51.9 | 0.568 | 0.306 | 7.45 |
| LTBS [7] | 2.428 | 79.7 | 0.607 | 0.289 | 7.84 | 2.319 | 86.4 | 0.534 | 0.306 | 7.74 |
| V2SFlow-V | 3.780 | 28.5 | 0.664 | 0.251 | 8.09 | 3.648 | 35.6 | 0.581 | 0.275 | 8.39 |