Published on December 9
VidMusician: Video-to-Music Generation with Semantic-Rhythmic Alignment via Hierarchical Visual Features
cs.SD
eess.AS
Released Date: December 9, 2024
Authors: Sifei Li¹, Binxin Yang, Chunji Yin, Chong Sun, Yuxin Zhang, Weiming Dong, Chen Li
Aff.: ¹Institute of Automation, Chinese Academy of Sciences

| Method | FD (objective) | KL (objective) | ImageBind (objective) | CLAP (objective) | Rhythm (objective) | Semantic (subjective) | Rhythm (subjective) | Overall (subjective) |
|---|---|---|---|---|---|---|---|---|
| CMT [6] | 61.45 | 1.673 | 0.1103 | 0.3064 | 0.1637 | 11.83 | 12.93 | 11.91 |
| Diff-BGM [25] | 105.5 | 2.243 | 0.1209 | 0.1905 | 0.1257 | 15.57 | 14.46 | 15.39 |
| Video2Music [17] | 101.3 | 2.147 | 0.0783 | 0.2221 | 0.2381 | 15.48 | 25.22 | 17.22 |
| M2ugen [32] | 41.79 | 1.900 | 0.1291 | 0.2968 | 0.2288 | 19.65 | 18.48 | 20.17 |
| M2ugen* | 45.46 | 1.691 | 0.1429 | 0.2913 | 0.1915 | 13.48 | 12.28 | 13.83 |
| VidMuse [40] | 31.43 | 1.260 | 0.1960 | 0.3853 | 0.1722 | 33.30 | 31.74 | 31.83 |
| VidMuse* | 33.85 | 1.521 | 0.1610 | 0.2900 | 0.2305 | 19.74 | 23.37 | 19.57 |
| VidMusician (ours) | 27.46 | 1.112 | 0.2158 | 0.4162 | 0.2516 | 81.56 | 80.22 | 81.44 |