Zero-shot Musical Stem Retrieval with Joint-Embedding Predictive Architectures
cs.SD
cs.AI
eess.AS
Released Date: November 29, 2024
Authors: Alain Riou¹, Antonin Gagneré¹, Gaëtan Hadjeres², Stefan Lattner³, Geoffroy Peeters¹
Affiliations: ¹LTCI, Télécom-Paris, Institut Polytechnique de Paris, France; ²Sony AI, Zurich, Switzerland; ³Sony Computer Science Laboratories - Paris, France

| Model | MUSDB18 R@1 | MUSDB18 R@5 | MUSDB18 R@10 | MUSDB18 Norm. Rank (mean) | MUSDB18 Norm. Rank (median) | MoisesDB (fine conditioning) R@1 | MoisesDB (fine conditioning) R@5 | MoisesDB (fine conditioning) R@10 | MoisesDB (fine conditioning) Norm. Rank (mean) | MoisesDB (fine conditioning) Norm. Rank (median) | MoisesDB (coarse conditioning) R@1 | MoisesDB (coarse conditioning) R@5 | MoisesDB (coarse conditioning) R@10 | MoisesDB (coarse conditioning) Norm. Rank (mean) | MoisesDB (coarse conditioning) Norm. Rank (median) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Contrastive baseline | 2.2 | 72.5 | 86.0 | 1.5 | 0.7 | 2.7 | 31.2 | 43.3 | 7.7 | 0.6 | 2.7 | 31.2 | 43.3 | 7.7 | 0.6 |
| Stem-JEPA [2] | 33.0 | 63.2 | 76.2 | 2.0 | 0.5 | 9.9 | 24.1 | 31.6 | 11.7 | 1.7 | 9.9 | 24.1 | 31.6 | 11.7 | 1.7 |
| +CLAP conditioning | 33.5 | 63.0 | 74.5 | 2.5 | 0.5 | 19.3 | 35.8 | 44.0 | 7.2 | 0.7 | 18.9 | 37.8 | 46.9 | 7.3 | 0.5 |
| +contrastive pretraining | 32.2 | 91.0 | 96.2 | 0.6 | 0.3 | 12.5 | 45.6 | 59.3 | 1.9 | 0.3 | 11.1 | 44.5 | 58.1 | 2.1 | 0.3 |
| +FiLM conditioning | 38.8 | 89.7 | 95.0 | 0.7 | 0.3 | 22.0 | 49.5 | 57.8 | 4.5 | 0.2 | 19.2 | 49.5 | 60.0 | 3.8 | 0.2 |
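
The table reports Recall@k (R@1, R@5, R@10) and the mean and median normalized rank. As a rough illustration of how such retrieval metrics are commonly computed, the sketch below ranks candidate stem embeddings by cosine similarity to each predicted embedding. The function names, the use of cosine similarity, and the percent scaling are assumptions made for illustration; they are not taken from the paper, whose exact evaluation protocol may differ.

```python
import math
from statistics import median


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_metrics(predicted, candidates, ks=(1, 5, 10)):
    """Hypothetical helper: predicted[i] is expected to retrieve candidates[i].

    Returns Recall@k and the mean/median normalized rank, all in percent
    (the percent scaling is an assumption of this sketch).
    """
    n = len(candidates)
    ranks = []
    for i, query in enumerate(predicted):
        scores = [cosine(query, c) for c in candidates]
        # 0-based rank: number of candidates scored strictly above the target.
        ranks.append(sum(s > scores[i] for s in scores))

    metrics = {}
    for k in ks:
        # Recall@k: share of queries whose target is among the top k candidates.
        metrics[f"R@{k}"] = 100.0 * sum(r < k for r in ranks) / len(ranks)

    # Normalized rank: rank of the target divided by the number of candidates.
    norm_ranks = [100.0 * r / n for r in ranks]
    metrics["mean_norm_rank"] = sum(norm_ranks) / len(norm_ranks)
    metrics["median_norm_rank"] = median(norm_ranks)
    return metrics
```

Under these definitions, higher recall and lower normalized rank both indicate better retrieval, which matches how the columns in the table are read.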