Video-Guided Foley Sound Generation with Multimodal Controls
cs.CV
cs.MM
cs.SD
eess.AS
Released Date: November 26, 2024
Authors: Ziyang Chen¹, Prem Seetharaman², Bryan Russell², Oriol Nieto², David Bourgin², Andrew Owens¹, Justin Salamon²
Affiliations: ¹University of Michigan; ²Adobe Research
![[Uncaptioned image]](https://arxiv.org/html/2411.17698v1/x1.png)
| Method | Variation | CLAP Score | CLAP Acc | AV-Sync |
|---|---|---|---|---|
| FoleyCrafter [98] (w/o semantic adapter) | w/o NegP | 38.4 | 99.4 | 1.34 |
| FoleyCrafter [98] (w/o semantic adapter) | w/ NegP | 35.7 | 99.9 | 1.36 |
| FoleyCrafter [98] | w/o NegP | 31.0 | 79.2 | 1.29 |
| FoleyCrafter [98] | w/ NegP | 33.4 | 94.2 | 1.31 |
| Ours | w/o NegP | 31.4 | 85.5 | 0.81 |
| Ours | w/ NegP | 30.9 | 93.2 | 0.93 |
| Ours – oracle | True category | 4.2 | 1.8 | 0.77 |
| Ours – oracle | T2A | 40.3 | 100 | 1.38 |