Published on November 29
T2Vid: Translating Long Text into Multi-Image is the Catalyst for Video-LLMs
cs.CV
cs.CL
cs.LG
Released Date: November 29, 2024
Authors: Shukang Yin¹, Chaoyou Fu², Sirui Zhao¹, Yunhang Shen³, Chunjiang Ge⁴, Yan Yang², Zuwei Long³, Yuhan Dai¹, Tong Xu¹, Xing Sun³, Ran He⁵, Caifeng Shan², Enhong Chen¹
Aff.: ¹USTC; ²NJU; ³Tencent YouTu Lab; ⁴THU; ⁵CAS

| Setting | S | M | L | Overall |
| --- | --- | --- | --- | --- |
| Zero-shot | 61.3 | 51.8 | 44.3 | 52.5 |
| 30K sampled data | 66.2 (+4.9) | 53.3 (+1.5) | 47.4 (+3.1) | 55.7 (+3.2) |
| 200K full data | 66.7 (+0.5) | 54.2 (+0.9) | 48.1 (+0.7) | 56.3 (+0.6) |
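
The parenthesized gains in the table are consistent with each row being compared against the row directly above it (30K against zero-shot, 200K against 30K), rather than every row against the zero-shot baseline. The sketch below checks that reading by recomputing the differences from the reported scores. The dictionary layout and the helper name `gains_over_previous` are illustrative assumptions, not something taken from the paper.

```python
# Scores copied from the table above; column order is S, M, L, Overall.
rows = {
    "Zero-shot": [61.3, 51.8, 44.3, 52.5],
    "30K sampled data": [66.2, 53.3, 47.4, 55.7],
    "200K full data": [66.7, 54.2, 48.1, 56.3],
}


def gains_over_previous(rows):
    """For each setting after the first, return its per-column gain over the preceding setting.

    Hypothetical helper for illustration; relies on dict insertion order.
    """
    names = list(rows)
    gains = {}
    for prev, cur in zip(names, names[1:]):
        # Round to one decimal to match the table's precision.
        gains[cur] = [round(c - p, 1) for p, c in zip(rows[prev], rows[cur])]
    return gains


gains = gains_over_previous(rows)
for name, deltas in gains.items():
    print(name, deltas)
# 30K sampled data [4.9, 1.5, 3.1, 3.2]
# 200K full data [0.5, 0.9, 0.7, 0.6]
```

Under this reading, going from 30K to 200K samples adds only 0.6 points overall, compared with 3.2 points from zero-shot to 30K.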