Multi-Modal interpretable automatic video captioning
cs.CV
cs.AI
Released Date: November 11, 2024
Authors: Antoine Hanna-Asaad (1), Decky Aspandi (2), Titus Zaharia (2)
Affiliations: (1) Ecole Supérieure des Techniques Aéronautiques et de Construction Automobile (ESTACA), France; (2) Department ARTEMIS, Télécom SudParis, France

| No. | Models | MSR-VTT BLEU@4 | MSR-VTT ROUGE-L | MSR-VTT METEOR | MSR-VTT CIDEr | MSR-VTT AVG | VATEX BLEU@4 | VATEX ROUGE-L | VATEX METEOR | VATEX CIDEr | VATEX AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1. | BLIP [17] | 36.4 | 55.2 | 26.0 | 43.1 | 40.2 | 21.5 | 38.6 | 21.3 | 29.8 | 27.8 |
| 2. | Audio Raw [11] | 4.6 | 28.7 | 16.7 | 2.6 | 13.1 | 4.1 | 29.5 | 10.9 | 1.2 | 11.4 |
| 3. | Vision-Based | 32.9 | 56.4 | 25.7 | 38.1 | 38.2 | 23.2 | 39.8 | 22.1 | 40.8 | 31.4 |
| 4. | Audio-Based | 28.1 | 53.4 | 21.0 | 15.6 | 29.5 | 19.2 | 32.4 | 15.5 | 19.3 | 21.6 |
| 5. | Fusion | 37.2 | 59.3 | 27.2 | 43.0 | 41.6 | 26.1 | 41.1 | 22.7 | 39.5 | 32.3 |
| 6. | MICap | 48.0 | 63.7 | 29.4 | 52.1 | 48.3 | 30.5 | 43.5 | 23.6 | 40.9 | 34.6 |
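
The AVG columns are not defined in this excerpt. The reported values are consistent with an unweighted mean of the four metrics (BLEU@4, ROUGE-L, METEOR, CIDEr) on each dataset, to within 0.1. The sketch below checks that reading against the numbers in the table; the dictionary layout and the function names `mean_of_metrics` and `max_avg_gap` are illustrative assumptions, not part of the paper.

```python
# Scores copied from the table, per model and dataset:
# (BLEU@4, ROUGE-L, METEOR, CIDEr, reported AVG).
REPORTED = {
    "BLIP": {"MSR-VTT": (36.4, 55.2, 26.0, 43.1, 40.2),
             "VATEX": (21.5, 38.6, 21.3, 29.8, 27.8)},
    "Audio Raw": {"MSR-VTT": (4.6, 28.7, 16.7, 2.6, 13.1),
                  "VATEX": (4.1, 29.5, 10.9, 1.2, 11.4)},
    "Vision-Based": {"MSR-VTT": (32.9, 56.4, 25.7, 38.1, 38.2),
                     "VATEX": (23.2, 39.8, 22.1, 40.8, 31.4)},
    "Audio-Based": {"MSR-VTT": (28.1, 53.4, 21.0, 15.6, 29.5),
                    "VATEX": (19.2, 32.4, 15.5, 19.3, 21.6)},
    "Fusion": {"MSR-VTT": (37.2, 59.3, 27.2, 43.0, 41.6),
               "VATEX": (26.1, 41.1, 22.7, 39.5, 32.3)},
    "MICap": {"MSR-VTT": (48.0, 63.7, 29.4, 52.1, 48.3),
              "VATEX": (30.5, 43.5, 23.6, 40.9, 34.6)},
}


def mean_of_metrics(scores):
    """Unweighted mean of the metric values (the assumed definition of AVG)."""
    return sum(scores) / len(scores)


def max_avg_gap(table):
    """Largest absolute gap between the recomputed mean and the reported AVG."""
    gap = 0.0
    for per_dataset in table.values():
        for *metrics, reported_avg in per_dataset.values():
            gap = max(gap, abs(mean_of_metrics(metrics) - reported_avg))
    return gap


if __name__ == "__main__":
    print(f"largest gap between recomputed and reported AVG: {max_avg_gap(REPORTED):.3f}")
```

The small remaining gaps would be expected if the averages were computed from metric values with more decimal places than the table prints.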