Diff4Steer: Steerable Diffusion Prior for Generative Music Retrieval with Semantic Guidance
Categories: cs.SD, cs.IR, cs.MM, eess.AS
Released Date: December 6, 2024
Authors: Xuchan Bao¹, Judith Yue Li², Zhong Yi Wan², Kun Su², Timo Denk³, Joonseok Lee⁴, Dima Kuzmin², Fei Sha²
Affiliations: ¹University of Toronto; ²Google Research; ³Google DeepMind; ⁴Google Research, Seoul National University

| Method | Input | MC w/ Images: R@100 | MC w/ Images: R@10 | MC w/ Images: M2I | MC w/ Images: TA | MB: R@100 | MB: R@10 | MB: M2I | MB: TA |
|---|---|---|---|---|---|---|---|---|---|
| Gemini-ImageCap | image | 0.215 | 0.055 | 89.12 | 0.488 | 0.162 | 0.036 | 90.32 | 0.685 |
| Gemini-MusicCap | image | 0.210 | 0.049 | 84.48 | 0.521 | 0.145 | 0.026 | 88.09 | 0.695 |
| Regression | image | 0.129 | 0.026 | 96.21 | 0.646 | 0.165 | 0.032 | 95.79 | 0.724 |
| Diff4Steer (ours) | image | 0.334 | 0.105 | 89.69 | 0.778 | 0.341 | 0.086 | 90.28 | 0.836 |
| Regression (txt) | genre | 0.378 | 0.103 | 90.63 | 0.838 | 0.147 | 0.016 | 92.20 | 0.739 |
| Diff4Steer (ours) | genre | 0.389 | 0.108 | 88.02 | 0.855 | 0.165 | 0.019 | 89.65 | 0.762 |
| Regression (txt) | caption | 0.419 | 0.131 | 90.72 | 0.871 | 0.380 | 0.086 | 91.40 | 0.872 |
| Diff4Steer (ours) | caption | 0.435 | 0.127 | 87.79 | 0.877 | 0.384 | 0.085 | 89.67 | 0.876 |
| Diff4Steer (ours) | image + genre | 0.425 | 0.165 | 91.91 | 0.889 | 0.384 | 0.090 | 94.47 | 0.883 |
| Diff4Steer (ours) | image + caption | 0.536 | 0.184 | 91.56 | 0.915 | 0.488 | 0.141 | 93.19 | 0.916 |