GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-grained Video-language Learning
cs.CV
Released Date: December 10, 2024
Authors: Yicheng Wang¹, Zhikang Zhang², Jue Wang², David Fan², Zhenlin Xu², Linda Liu², Xiang Hao², Vimal Bhat², Xinyu Li²
Aff.: ¹Texas A&M University; ²Amazon

| Method | Relation | Speaking | Scene |
| --- | --- | --- | --- |
| VideoBERT [50] | 52.4 | 37.9 | 54.9 |
| Obj. T4mer [62] | 53.1 | 39.4 | 56.9 |
| LST [23] | 52.4 | 37.3 | 62.8 |
| Orthoformer [45] | 50.0 | 38.3 | 66.3 |
| ViS4mer [23] | 57.1 | 40.8 | 67.4 |
| Ours | 61.9 | 42.7 | 70.9 |
| LF-VILA [51] † | 61.5 | 41.3 | 68.0 |
| S5 [57] † | 61.9 | 41.8 | 69.9 |
| S5 [57] † | 66.7 | 41.8 | 73.3 |
| MA-LMM [22] † | 58.2 | 44.8 | 80.3 |

† Row shaded gray in the source table.