Published on November 26

Video-Guided Foley Sound Generation with Multimodal Controls

Categories: cs.CV, cs.MM, cs.SD, eess.AS

Release Date: November 26, 2024

Authors: Ziyang Chen¹, Prem Seetharaman², Bryan Russell², Oriol Nieto², David Bourgin², Andrew Owens¹, Justin Salamon²

Aff.: ¹University of Michigan; ²Adobe Research

arXiv: http://arxiv.org/abs/2411.17698v1

Method	Variation	CLAP Score $\uparrow$	CLAP Acc $\uparrow$	AV-Sync $\downarrow$
FoleyCrafter [98] (w/o semantic adapter)	w/o NegP	38.4	99.4	1.34
FoleyCrafter [98] (w/o semantic adapter)	w/ NegP	35.7	99.9	1.36
FoleyCrafter [98]	w/o NegP	31.0	79.2	1.29
FoleyCrafter [98]	w/ NegP	33.4	94.2	1.31
Ours	w/o NegP	31.4	85.5	0.81
Ours	w/ NegP	30.9	93.2	0.93
Ours – oracle	True category	04.2	01.8	0.77
Ours – oracle	T2A	40.3	100	1.38