Published on October 30

DOA-Aware Audio-Visual Self-Supervised Learning for Sound Event Localization and Detection

Categories: cs.SD, cs.AI, cs.LG, cs.MM, eess.AS

Release Date: October 30, 2024

Authors: Yoto Fujita¹, Yoshiaki Bando², Keisuke Imoto³, Masaki Onishi², Kazuyoshi Yoshii¹

Affiliations: ¹Graduate School of Informatics, Kyoto University, Japan; ²National Institute of Advanced Industrial Science and Technology, Japan; ³Faculty of Science and Engineering, Doshisha University, Japan

arXiv: http://arxiv.org/abs/2410.22803v1

Fine-tuning dataset	Pretraining method	$ER_{\leq 20^{\circ}}\downarrow$	$F_{\leq 20^{\circ}}\uparrow$	$LE\downarrow$	$LR\uparrow$	$SELD\downarrow$
STARSS22 + Synth	None	0.53	48.9%	18.2°	68.7%	0.364
STARSS22 + Synth	AVC [arandjelovic2017look]	0.52	49.7%	17.9°	69.0%	0.359
STARSS22 + Synth	AV-SSL with DOA-wise contrastive learning	0.51	50.5%	17.9°	70.1%	0.351
STARSS22 + Synth	AV-SSL with recording-wise contrastive learning	0.51	51.6%	17.1°	69.0%	0.349
STARSS22	None	0.65	37.9%	22.1°	58.4%	0.452
STARSS22	AVC [arandjelovic2017look]	0.63	37.7%	22.4°	55.2%	0.458
STARSS22	AV-SSL with DOA-wise contrastive learning	0.68	35.8%	22.9°	53.7%	0.478
STARSS22	AV-SSL with recording-wise contrastive learning	0.67	36.3%	23.2°	56.3%	0.467
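
The page does not say how the SELD column is derived from the other four metrics. The sketch below assumes the aggregate commonly used in the DCASE SELD evaluations, which is the mean of ER, 1 - F, LE/180°, and 1 - LR. That formula is an assumption and is not stated in this summary. The function name `seld_score` and its parameter names are hypothetical, chosen only for illustration.

```python
def seld_score(er, f_percent, le_deg, lr_percent):
    """Aggregate SELD error (lower is better).

    Assumes the DCASE-style definition: the mean of four error-like terms,
    each roughly in [0, 1]. This is an assumption, not taken from the page.
    """
    f = f_percent / 100.0    # F-score as a fraction
    lr = lr_percent / 100.0  # localization recall as a fraction
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0


# First table row: STARSS22 + Synth, no pretraining.
score = seld_score(er=0.53, f_percent=48.9, le_deg=18.2, lr_percent=68.7)
print(f"{score:.3f}")  # prints 0.364
```

Under this assumption the first row reproduces the reported 0.364. Because the table inputs are themselves rounded, recomputing other rows this way can differ from the reported value in the third decimal place.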