Comparative Analysis of ASR Methods for Speech Deepfake Detection
cs.SD
Released Date: November 26, 2024
Authors: Davide Salvi¹, Amit Kumar Singh Yadav², Kratika Bhagtani², Viola Negroni¹, Paolo Bestagini¹, Edward J. Delp²
Affiliations: ¹ Image and Sound Processing Lab (ISPL), Politecnico di Milano, Milano, Italy; ² Video and Image Processing Lab (VIPER), Purdue University, West Lafayette, Indiana, USA

| Model | Param. | LibriSpeech WER (%) | ASVspoof 2019 EER (%) | ASVspoof 2019 AUC (%) | ASVspoof 2021 EER (%) | ASVspoof 2021 AUC (%) | InTheWild EER (%) | InTheWild AUC (%) | TIMIT-TTS EER (%) | TIMIT-TTS AUC (%) | FakeOrReal EER (%) | FakeOrReal AUC (%) | Average EER (%) | Average AUC (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Whisper tiny | 39M | 7.6 | 6.21 | 98.45 | 15.61 | 92.32 | 37.59 | 67.37 | 28.26 | 79.85 | 15.33 | 92.83 | 20.60 | 86.16 |
| Whisper base | 74M | 5.0 | 3.43 | 99.35 | 12.12 | 95.38 | 38.34 | 65.96 | 19.76 | 87.73 | 8.44 | 96.81 | 16.42 | 89.05 |
| Whisper small | 244M | 3.4 | 2.12 | 99.78 | 12.01 | 95.11 | 34.35 | 71.85 | 15.53 | 91.85 | 2.16 | 99.55 | 13.23 | 91.63 |
| Whisper medium | 769M | 2.9 | 1.58 | 99.87 | 12.40 | 92.32 | 32.19 | 73.16 | 21.74 | 87.01 | 12.50 | 93.64 | 16.08 | 89.20 |
| Whisper large | 1550M | 2.7 | 2.00 | 99.82 | 11.68 | 93.82 | 30.40 | 77.02 | 23.16 | 84.70 | 5.08 | 97.43 | 14.46 | 90.56 |
| Wav2Vec 2.0 base | 95M | 3.3 | 3.94 | 99.34 | 15.28 | 93.49 | 39.83 | 64.40 | 24.62 | 83.69 | 19.21 | 88.69 | 20.58 | 85.92 |
| Wav2Vec 2.0 large | 317M | 2.7 | 2.54 | 99.71 | 13.58 | 94.75 | 28.23 | 78.69 | 24.14 | 83.92 | 13.07 | 94.33 | 16.31 | 90.28 |
| Wav2Vec 2.0 xls-r | 317M | 4.5 | 3.93 | 99.29 | 17.36 | 89.12 | 30.32 | 75.99 | 29.07 | 75.58 | 36.12 | 69.12 | 23.36 | 81.82 |
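
The Average columns in the table appear to be the unweighted mean of the EER and AUC values over the five detection datasets (ASVspoof 2019, ASVspoof 2021, InTheWild, TIMIT-TTS, FakeOrReal). The source does not state this explicitly, so treat it as an assumption. The sketch below checks it for three rows; the `SCORES` dictionary and the `average_scores` helper are illustrative names, not part of the original work.

```python
from statistics import mean

# (EER, AUC) pairs copied from the table, in the order:
# ASVspoof 2019, ASVspoof 2021, InTheWild, TIMIT-TTS, FakeOrReal.
# Only three models are included to keep the example short.
SCORES = {
    "Whisper tiny": [
        (6.21, 98.45), (15.61, 92.32), (37.59, 67.37), (28.26, 79.85), (15.33, 92.83),
    ],
    "Whisper small": [
        (2.12, 99.78), (12.01, 95.11), (34.35, 71.85), (15.53, 91.85), (2.16, 99.55),
    ],
    "Wav2Vec 2.0 xls-r": [
        (3.93, 99.29), (17.36, 89.12), (30.32, 75.99), (29.07, 75.58), (36.12, 69.12),
    ],
}


def average_scores(pairs):
    """Return the unweighted mean (EER, AUC) over datasets, rounded to 2 decimals.

    Assumption: every dataset contributes equally, regardless of its size.
    """
    eers = [eer for eer, _ in pairs]
    aucs = [auc for _, auc in pairs]
    return round(mean(eers), 2), round(mean(aucs), 2)


if __name__ == "__main__":
    for model, pairs in SCORES.items():
        avg_eer, avg_auc = average_scores(pairs)
        print(f"{model}: average EER = {avg_eer:.2f}, average AUC = {avg_auc:.2f}")
```

Under this assumption the computed values reproduce the Average columns for the three rows shown (for example, 20.60 EER and 86.16 AUC for Whisper tiny). The LibriSpeech WER column is not part of the average, since it measures transcription quality rather than deepfake detection.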