Published on November 22
Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
Categories: cs.CV, cs.CL
Released Date: November 22, 2024
Authors: AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova
Affiliation (all authors): Google DeepMind

| Method | Pretraining | ViTT S | ViTT C | ViTT M | ViTT F1 | YouCook2 S | YouCook2 C | YouCook2 M | YouCook2 F1 | ActivityNet S | ActivityNet C | ActivityNet M | ActivityNet F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| E2ESG [77] | No | - | - | - | - | - | 25.0 | 3.5 | - | - | - | - | - |
| MT [74] | No | - | - | - | - | - | 6.1 | 3.2 | - | - | 9.3 | 5.0 | - |
| PDVC [56] | No | - | - | - | - | 4.9 | 28.9 | 5.7 | - | 6.0 | 29.3 | 7.6 | - |
| TimeChat [42] | No | - | - | - | - | 3.4 | 11.0 | - | 19.5 | - | - | - | - |
| GIT [53] | Multiple [53]* | 7.1 | 15.1 | 3.4 | 32.5 | 3.1 | 12.1 | 3.4 | 17.7 | 5.7 | 29.8 | 7.8 | 50.6 |
| OmniViD [54] | Kinetics-400 | - | - | - | - | - | - | - | - | - | 26.0 | 7.5 | - |
| Vid2Seq [65] | YT-Temporal-1B | 9.8 | 23.0 | 5.0 | 37.7 | 5.7 | 25.3 | 6.4 | 23.5 | 5.9 | 30.2 | 8.5 | 51.8 |
| DIBS [62] | Custom HowTo [62]* | - | - | - | - | 6.4 | 44.4 | 7.5 | 31.4 | 5.9 | 31.9 | 8.9 | 55.6 |
| Zhou et al. [75] | WebLI [8]* | 10.0 | 25.2 | 5.8 | 35.4 | 6.0 | 32.9 | 7.1 | 24.1 | 6.2 | 37.8 | 10.0 | 52.9 |
| Ours | HowTo | 10.2 | 37.2 | 18.9 | 47.4 | 11.3 | 57.6 | 27.7 | 33.6 | 12.3 | 18.4 | 19.5 | 53.3 |