DistinctAD: Distinctive Audio Description Generation in Contexts
cs.CV
Released Date: November 27, 2024
Authors: Bo Fang¹, Wenhao Wu², Qiangqiang Wu¹, Yuxin Song³, Antoni B. Chan¹
Affiliations: ¹Department of Computer Science, City University of Hong Kong; ²The University of Sydney; ³Baidu Inc.

| Method | Pub. | VLM | LLM | ROUGE-L | CIDEr | SPICE | R@5/16 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Training-free | | | | | | | |
| VLog [1] | - | - | GPT-4 | 7.5 | 1.3 | 2.1 | 42.3 |
| MM-Vid [38] | ArXiv’23 | GPT-4V | - | 9.8 | 6.1 | 3.8 | 46.1 |
| MM-Narrator [85] | CVPR’24 | CLIP-L14 | GPT-4 | 13.4 | 13.9 | 5.2 | 49.0 |
| LLM-AD [13] | ArXiv’24 | GPT-4V | - | 13.5 | 20.5 | - | - |
| AutoAD-Zero [81] | ACCV’24 | VideoLLaMA2-7B | LLaMA3-8B | - | 22.4 | - | - |
| Partial-fine-tuning | | | | | | | |
| ClipCap [49] | ArXiv’21 | CLIP-B32 | GPT-2 | 8.5 | 4.4 | 1.1 | - |
| CapDec [51] | ArXiv’22 | - | - | 8.2 | 6.7 | 1.4 | - |
| AutoAD-I [20] | CVPR’23 | CLIP-B32 | GPT-2 | 11.9 | 14.3 | 4.4 | 42.1 |
| AutoAD-II [21] | ICCV’23 | CLIP-B32 | GPT-2 | 13.4 | 19.5 | - | 50.8 |
| AutoAD-III [22] | CVPR’24 | EVA-CLIP | OPT-2.7B | - | 22.8 | - | 52.0 |
| AutoAD-III [22] | CVPR’24 | EVA-CLIP | LLaMA2-7B | - | 24.0 | - | 52.8 |
| MovieSeq [39] | ECCV’24 | CLIP-B16 | LLaMA2-7B∗ | 15.5 | 24.4 | 7.0 | 51.6 |
| DistinctAD (Ours) | - | CLIP-B32 | GPT-2 | 15.4 | 24.5 | 6.7 | 49.8 |
| DistinctAD (Ours) | - | CLIPAD-B32 | GPT-2 | 16.4 | 25.5 | 7.4 | 51.7 |
| DistinctAD (Ours) | - | CLIPAD-B16 | LLaMA2-7B | 17.2 | 27.0 | 8.2 | 55.6 |
| DistinctAD (Ours) | - | CLIPAD-B16 | LLaMA3-8B | 17.6 | 27.3 | 8.3 | 56.0 |