PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation
Categories: cs.CV, cs.AI, cs.LG
Released Date: December 10, 2024
Authors: Fatemeh Nazarieh¹, Zhenhua Feng², Diptesh Kanojia¹, Muhammad Awais¹, Josef Kittler¹
Aff.: ¹University of Surrey, UK; ²Jiangnan University, China

Columns prefixed MEAD report results on MEAD (Wang et al. 2020); columns prefixed HDTF report results on HDTF (Zhang et al. 2021).

| Method | MEAD PSNR↑ | MEAD SSIM↑ | MEAD M/F-LMD↓ | MEAD FID↓ | MEAD SyncNet↑ | MEAD ADFD↑ | HDTF PSNR↑ | HDTF SSIM↑ | HDTF M/F-LMD↓ | HDTF FID↓ | HDTF SyncNet↑ | HDTF ADFD↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MakeItTalk (Zhou et al. 2020) | 19.442 | 0.614 | 2.541/2.309 | 37.917 | 5.176 | 0.693 | 21.985 | 0.709 | 2.395/2.182 | 18.730 | 4.753 | 0.726 |
| Wav2Lip (Prajwal et al. 2020) | 19.875 | 0.633 | 1.438/2.138 | 44.510 | 8.774 | 0.584 | 22.323 | 0.727 | 1.759/2.002 | 22.397 | 9.032 | 0.699 |
| Audio2Head (Wang et al. 2021) | 18.764 | 0.586 | 2.053/2.293 | 27.236 | 6.494 | 0.725 | 21.608 | 0.702 | 1.983/2.060 | 29.385 | 7.076 | 0.835 |
| SadTalker (Zhang et al. 2023) | 19.042 | 0.606 | 2.038/2.335 | 39.308 | 7.065 | 0.761 | 21.701 | 0.702 | 1.995/2.147 | 14.261 | 7.414 | 0.726 |
| IP-LAP (Zhong et al. 2023) | 19.832 | 0.627 | 2.140/2.116 | 46.502 | 4.156 | 0.527 | 22.615 | 0.731 | 1.951/1.938 | 19.281 | 3.456 | 0.633 |
| TalkLip (Wang et al. 2023) | 19.492 | 0.623 | 1.951/2.204 | 41.066 | 5.724 | 0.548 | 22.241 | 0.730 | 1.976/1.937 | 23.850 | 1.076 | 0.518 |
| EAMM (Ji et al. 2022) | 18.867 | 0.610 | 2.543/2.413 | 31.268 | 1.762 | 0.847 | 19.866 | 0.626 | 2.910/2.937 | 41.200 | 4.445 | 0.839 |
| EDTalk (Tan et al. 2024a) | 21.628 | 0.722 | 1.537/1.290 | 17.698 | 8.115 | 0.715 | 25.156 | 0.811 | 1.676/1.315 | 13.785 | 7.642 | 0.722 |
| PortraitTalk (Our Model) | 23.097 | 0.873 | 1.206/1.385 | 17.351 | 8.916 | 0.816 | 27.495 | 0.846 | 1.157/1.017 | 11.753 | 8.381 | 0.835 |