notesum.ai
Published on November 5
Speaker Emotion Recognition: Leveraging Self-Supervised Models for Feature Extraction Using Wav2Vec2 and HuBERT
cs.SD
cs.AI
cs.LG
eess.AS
Released Date: November 5, 2024
Authors: Pourya Jafarzadeh¹, Amir Mohammad Rostami¹, Padideh Choobdar¹
Affiliation: ¹Lab of Artin, TOSAN TECHNO

| Dataset | Model | Weighted Accuracy (%) | Unweighted Accuracy (%) |
|---|---|---|---|
| SAVEE | Proposed Method (HuBERT) | 91.66 ± 1.2 | 90.48 ± 1.1 |
| SAVEE | Proposed Method (Wav2Vec2) | 83.34 ± 1.3 | 80.95 ± 1.1 |
| SAVEE | Ensemble softmax regression [48] | 51.50 | - |
| SAVEE | eGeMAPs feature set [15] | 42.40 | - |
| SAVEE | DCNN + CFS + MLP [13] | 66.90 | - |
| SAVEE | CNN [33] | 73.6 | - |
| RAVDESS | Proposed Method (HuBERT) | 92.82 ± 1.5 | 93.78 ± 1.6 |
| RAVDESS | Proposed Method (Wav2Vec2) | 97.67 ± 1.2 | 98.02 ± 1.3 |
| RAVDESS | DCNN + CFS + MLP [13] | 73.50 | - |
| RAVDESS | MFCC + Pitch + Energy + ZCR + DWT [28] | - | 85 |
| RAVDESS | CNN [37] | 80 | 79 |
| RAVDESS | GResNets [55] | - | 68.48 |
| RAVDESS | CNN+BLSTM [24] | 69.4 | 68.10 |
| RAVDESS | Bagged SVM [6] | - | 75.69 |
| EMODB | Proposed Method (HuBERT) | 97.83 ± 1.4 | 98.21 ± 1.3 |
| EMODB | Proposed Method (Wav2Vec2) | 99 ± 0.2 | 99 ± 0.3 |
| EMODB | Ensemble softmax regression [48] | 82.40 | - |
| EMODB | eGeMAPs feature set [15] | 76.90 | - |
| EMODB | OpenSmile features + ADAN [54] | 83.74 | - |
| EMODB | RESNET MODEL + Deep BiLSTM [42] | 85.57 | - |
| EMODB | Complementary Features + KELM [14] | 84.49 | - |
| EMODB | ADRNN [34] | 85.39 | - |
| EMODB | DCNN + DTPM [56] | 87.31 | - |
| EMODB | DCNN + CFS + MLP [13] | 90.50 | - |
| EMODB | CNN + LSTM [57] | - | 95.89 |
| EMODB | MSFs [52] | - | 85.5 |
| EMODB | CNN [59] | - | 85.2 |
| EMODB | Fuzzy C-Means [12] | - | 92.2 |
| AESDD | Proposed Method (HuBERT) | 98.36 ± 1.2 | 98.33 ± 1.4 |
| AESDD | Proposed Method (Wav2Vec2) | 98.36 ± 1.1 | 98.33 ± 1.1 |
| SHEMO | Proposed Method (HuBERT) | 83.77 ± 1.7 | 71.38 ± 1.4 |
| SHEMO | Proposed Method (Wav2Vec2) | 95.52 ± 2.1 | 91.21 ± 1.3 |
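
The table reports two metrics without defining them. As a minimal sketch, assuming the convention common in speech emotion recognition work (weighted accuracy is the overall fraction of correctly classified utterances, unweighted accuracy is the mean of per-class recalls), the two can be computed as below. The function names and toy labels are hypothetical and are not taken from the paper, which may define the metrics differently.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: each emotion class counts equally."""
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correct predictions per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Hypothetical, deliberately imbalanced toy labels (not data from the paper).
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.2f}, UA = {ua:.2f}")
```

Under these assumed definitions the two metrics coincide on balanced data and diverge when classes are imbalanced, since the majority class dominates the weighted figure while each class contributes equally to the unweighted one.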