Published on December 27
MVTamperBench: Evaluating Robustness of Vision-Language Models
cs.CV
68T37, 68T05, 68Q32, 68T45, 94A08, 68T40, 68Q85
I.2.10; I.2.7; I.5.4; I.4.9; I.4.8; H.5.1
Released Date: December 27, 2024
Authors: Amit Agarwal¹, Srikant Panda², Angeline Charles³, Bhargava Kumar⁴, Hitesh Patel⁵, Priyanranjan Pattnayak, Taki Hasan Rafi⁶, Tejaswini Kumar⁴, Dong-Kyu Chae⁶
Affiliations: ¹Liverpool John Moores University, UK; ²Birla Institute of Technology, India; ³Christ University, India; ⁴Columbia University, USA; ⁵New York University, USA; ⁶Hanyang University, South Korea
| Dataset Name | Primary Scene Type and Unique Characteristics |
|---|---|
| STAR (Wu et al., 2024) | Indoor actions and object interactions |
| PAXION (Wang et al., 2023) | Real-world scenes with nuanced actions |
| Moments in Time (MiT) V1 (Monfort et al., 2019) | Indoor/outdoor scenes across varied contexts |
| FunQA (Xie et al., 2025) | Humor-focused, creative, real-world events |
| CLEVRER (Mao et al., 2022) | Simulated scenes for object movement and reasoning |
| Perception Test (Patraucean et al., 2024) | First/third-person views for object tracking |
| Charades-STA (for AI, 2024) | Indoor human actions and interactions |
| MoVQA (Zhang et al., 2023) | Diverse scenes for scene transition comprehension |
| VLN-CE (Krantz, 2024) | Indoor navigation from agent perspective |
| TVQA (Lei et al., 2018) | TV show scenes for episodic reasoning |