SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
Categories: cs.CV, cs.LG, cs.SD
Released Date: December 4, 2024
Authors: Yan Li¹, Ziya Zhou¹, Zhiqiang Wang¹, Wei Xue¹, Wenhan Luo¹, Yike Guo¹
Aff.: ¹University

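Metric groups: Video Quality (SSIM, PSNR, CPBD, FVD); Lip Synchronization (LMD, LSE-D, LSE-C); Motion (Diversity, BAS).

| Method | SSIM | PSNR | CPBD | FVD | LMD | LSE-D | LSE-C | Diversity | BAS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |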
| Audio2Head | 0.4896 | 28.281 | 0.4469 | 1089.7 | 75.791 | 9.1998 | 1.2458 | 10.846 | 0.1982 |
| SadTalker | 0.4134 | 29.872 | 0.5509 | 1030.3 | 59.625 | 9.1739 | 1.2454 | 14.253 | 0.1774 |
| MuseTalk | 0.5762 | 30.243 | 0.5266 | 1323.2 | 64.669 | 10.143 | 1.0545 | 1.4802 | 0.2400 |
| AniPortrait | 0.5364 | 29.872 | 0.5509 | 1030.3 | 76.442 | 10.226 | 0.8667 | 5.4195 | 0.2296 |
| Echomimic | 0.4035 | 28.865 | 0.4896 | 1221.7 | 73.163 | 9.8249 | 1.2290 | 11.732 | 0.1483 |
| Hallo | 0.5722 | 29.984 | 0.5486 | 897.65 | 64.346 | 9.1645 | 1.7012 | 8.5367 | 0.1850 |
| Hallo2 | 0.5659 | 30.058 | 0.5558 | 1478.2 | 60.741 | 9.6107 | 1.5967 | 9.2739 | 0.2184 |
| SINGER | 0.6364 | 30.686 | 0.5430 | 503.78 | 53.373 | 9.1269 | 1.6209 | 14.445 | 0.2405 |
| GT | - | - | 0.5338 | 0.0000 | 0.0000 | 8.5541 | 4.5286 | 21.754 | 0.2484 |