Published on December 6

Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance

Categories: cs.SD, cs.IR, cs.MM, eess.AS

Released Date: December 6, 2024

Authors: Xuchan Bao¹, Judith Yue Li², Zhong Yi Wan², Kun Su², Timo Denk³, Joonseok Lee⁴, Dima Kuzmin², Fei Sha²

Affiliations: ¹University of Toronto; ²Google Research; ³Google DeepMind; ⁴Google Research, Seoul National University

arXiv: http://arxiv.org/pdf/2412.04746v1

Method	Input	MC w/ Images R@100	MC w/ Images R@10	MC w/ Images M2I	MC w/ Images TA	MB R@100	MB R@10	MB M2I	MB TA
Gemini-ImageCap	image	0.215	0.055	89.12	0.488	0.162	0.036	90.32	0.685
Gemini-MusicCap	image	0.210	0.049	84.48	0.521	0.145	0.026	88.09	0.695
Regression	image	0.129	0.026	96.21	0.646	0.165	0.032	95.79	0.724
Diff4Steer (ours)	image	0.334	0.105	89.69	0.778	0.341	0.086	90.28	0.836
Regression (txt)	genre	0.378	0.103	90.63	0.838	0.147	0.016	92.20	0.739
Diff4Steer (ours)	genre	0.389	0.108	88.02	0.855	0.165	0.019	89.65	0.762
Regression (txt)	caption	0.419	0.131	90.72	0.871	0.380	0.086	91.40	0.872
Diff4Steer (ours)	caption	0.435	0.127	87.79	0.877	0.384	0.085	89.67	0.876
Diff4Steer (ours)	image + genre	0.425	0.165	91.91	0.889	0.384	0.090	94.47	0.883
Diff4Steer (ours)	image + caption	0.536	0.184	91.56	0.915	0.488	0.141	93.19	0.916