Tell What You Hear From What You See -- Video to Audio Generation Through Text
Categories: cs.CV, cs.AI, cs.LG, cs.SD, eess.AS
Released Date: November 8, 2024
Authors: Xiulong Liu, Kun Su, Eli Shlizerman

| Method | KLD ↓ | FAD ↓ | Align Acc (%) ↑ | Speed (s) ↓ |
|---|---|---|---|---|
| SpecVQGAN [21] | 3.78 | 6.63 | 48.79 | 7.2 |
| IM2WAV [22] | 2.54 | 6.32 | 74.31 | 289.5 |
| Diff-Foley [25] | 3.15 | 6.40 | 82.47 | 4.4 |
| FoleyGen [23] | 2.89 | 2.59 | 73.83 | 6.9 |
| V2A-Mapper [26] | 2.78 | 0.99 | 74.37 | 11.54 |
| VATT-LLama (Ours) | 2.39 | 2.38 | 80.32 | 1.1 |
| VATT-Gemma (Ours) | 2.25 | 2.35 | 82.81 | 0.65 |
| VATT-LLama-T (Ours) | 1.41 | 2.54 | 80.16 | 1.2 |
| VATT-Gemma-T (Ours) | 1.66 | 2.98 | 81.48 | 0.76 |
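
To make the comparison easier to read, the short sketch below picks the best method in each column. The values are copied from the table. The direction of each metric (lower is better for KLD, FAD, and Speed; higher is better for Align Acc) is an assumption based on how these metrics are usually reported, and the names `ROWS`, `HIGHER_IS_BETTER`, and `best_per_metric` are illustrative only.

```python
# Rows copied from the table: (method, KLD, FAD, Align Acc, Speed in seconds).
ROWS = [
    ("SpecVQGAN", 3.78, 6.63, 48.79, 7.2),
    ("IM2WAV", 2.54, 6.32, 74.31, 289.5),
    ("Diff-Foley", 3.15, 6.40, 82.47, 4.4),
    ("FoleyGen", 2.89, 2.59, 73.83, 6.9),
    ("V2A-Mapper", 2.78, 0.99, 74.37, 11.54),
    ("VATT-LLama", 2.39, 2.38, 80.32, 1.1),
    ("VATT-Gemma", 2.25, 2.35, 82.81, 0.65),
    ("VATT-LLama-T", 1.41, 2.54, 80.16, 1.2),
    ("VATT-Gemma-T", 1.66, 2.98, 81.48, 0.76),
]

# Assumption: True means a higher value is better for that column.
# Key order matches the column order in ROWS (after the method name).
HIGHER_IS_BETTER = {
    "KLD": False,
    "FAD": False,
    "Align Acc": True,
    "Speed (s)": False,
}


def best_per_metric(rows):
    """Return {metric name: best method name} under the assumed directions."""
    best = {}
    for col, (metric, higher) in enumerate(HIGHER_IS_BETTER.items(), start=1):
        pick = max if higher else min
        best[metric] = pick(rows, key=lambda row: row[col])[0]
    return best


best = best_per_metric(ROWS)
for metric, method in best.items():
    print(f"{metric}: {method}")
```

Under these assumptions, no single method leads every column: the text-conditioned VATT-LLama-T has the lowest KLD, V2A-Mapper has the lowest FAD, and VATT-Gemma has both the highest alignment accuracy and the shortest time.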