
Published on December 27

MVTamperBench: Evaluating Robustness of Vision-Language Models

Subject: cs.CV

MSC classes: 68T37, 68T05, 68Q32, 68T45, 94A08, 68T40, 68Q85

ACM classes: I.2.10; I.2.7; I.5.4; I.4.9; I.4.8; H.5.1

Released Date: December 27, 2024

Authors: Amit Agarwal¹, Srikant Panda², Angeline Charles³, Bhargava Kumar⁴, Hitesh Patel⁵, Priyanranjan Pattnayak, Taki Hasan Rafi⁶, Tejaswini Kumar⁴, Dong-Kyu Chae⁶

Affiliations: ¹Liverpool John Moores University, UK; ²Birla Institute of Technology, India; ³Christ University, India; ⁴Columbia University, USA; ⁵New York University, USA; ⁶Hanyang University, South Korea

Arxiv: http://arxiv.org/pdf/2412.19794v1

Dataset Name	Primary Scene Type and Unique Characteristics
STAR (Wu et al., 2024)	Indoor actions and object interactions
PAXION (Wang et al., 2023)	Real-world scenes with nuanced actions
Moments in Time (MiT) V1 (Monfort et al., 2019)	Indoor/outdoor scenes across varied contexts
FunQA (Xie et al., 2025)	Humor-focused, creative, real-world events
CLEVRER (Mao et al., 2022)	Simulated scenes for object movement and reasoning
Perception Test (Patraucean et al., 2024)	First/third-person views for object tracking
Charades-STA (for AI, 2024)	Indoor human actions and interactions
MoVQA (Zhang et al., 2023)	Diverse scenes for scene transition comprehension
VLN-CE (Krantz, 2024)	Indoor navigation from agent perspective
TVQA (Lei et al., 2018)	TV show scenes for episodic reasoning