DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection
Categories: cs.SD, cs.AI, cs.LG, cs.MM, eess.AS
Released Date: October 30, 2024
Authors: Yoto Fujita¹, Yoshiaki Bando², Keisuke Imoto³, Masaki Onishi², Kazuyoshi Yoshii¹
Affiliations: ¹Graduate School of Informatics, Kyoto University, Japan; ²National Institute of Advanced Industrial Science and Technology, Japan; ³Faculty of Science and Engineering, Doshisha University, Japan

| Fine-tuning dataset | Pretraining method | ER | F | LE | LR | SELD score |
|---|---|---|---|---|---|---|
| STARSS22Synth | None | 0.53 | 48.9 % | 18.2° | 68.7 % | 0.364 |
| STARSS22Synth | AVC [arandjelovic2017look] | 0.52 | 49.7 % | 17.9° | 69.0 % | 0.359 |
| STARSS22Synth | AV-SSL with DOA-wise contrastive learning | 0.51 | 50.5 % | 17.9° | 70.1 % | 0.351 |
| STARSS22Synth | AV-SSL with recording-wise contrastive learning | 0.51 | 51.6 % | 17.1° | 69.0 % | 0.349 |
| STARSS22 | None | 0.65 | 37.9 % | 22.1° | 58.4 % | 0.452 |
| STARSS22 | AVC [arandjelovic2017look] | 0.63 | 37.7 % | 22.4° | 55.2 % | 0.458 |
| STARSS22 | AV-SSL with DOA-wise contrastive learning | 0.68 | 35.8 % | 22.9° | 53.7 % | 0.478 |
| STARSS22 | AV-SSL with recording-wise contrastive learning | 0.67 | 36.3 % | 23.2° | 56.3 % | 0.467 |
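
The last numeric column appears to be an aggregate of the four before it. As a minimal sketch, assume the five numeric columns are, in order, error rate (ER), F-score (F), localization error in degrees (LE), localization recall (LR), and an aggregate SELD score, and that the aggregate follows the DCASE-style definition (the mean of ER, 1 − F, LE/180°, and 1 − LR). Neither the column meanings nor the formula is stated in the table itself, and the function name `seld_score` is hypothetical. Under those assumptions the first row can be checked as follows:

```python
def seld_score(er, f_percent, le_deg, lr_percent):
    """Aggregate SELD error (lower is better).

    Assumption: DCASE-style definition, i.e. the mean of four error
    terms, with F and LR given in percent and LE given in degrees.
    """
    f = f_percent / 100.0
    lr = lr_percent / 100.0
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0


# First table row: STARSS22Synth, no pretraining.
score = seld_score(er=0.53, f_percent=48.9, le_deg=18.2, lr_percent=68.7)
print(round(score, 3))  # 0.364
```

Applying the same function to the other rows gives values within about 0.001 of the tabulated ones, which is consistent with the inputs having been rounded before publication.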