AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
Categories: cs.CV, cs.AI, cs.CL, cs.MM, cs.SD, eess.AS
Released Date: December 3, 2024
Authors: Kaixiong Gong¹, Kaituo Feng¹, Bohao Li², Yibing Wang¹, Mofan Cheng¹, Shijia Yang³, Jiaming Han¹, Benyou Wang², Yutong Bai⁴, Zhuoran Yang⁵, Xiangyu Yue¹
Affiliations: ¹CUHK MMLab; ²CUHK (SZ); ³Stanford University; ⁴UC Berkeley; ⁵Yale University

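The columns Timbre through Intricacy are the audio attributes covered by each benchmark.

| Benchmark / Dataset | Modality | Questions | Answer Type | Customized Question | Timbre | Tone | Melody | Space | Time | Hallucination | Intricacy | Multiple Domains | Interleaved |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |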
| MME Bench [21] | Image | 2194 | Y/N | ✓ | - | - | - | - | - | - | - | ✓ | ✗ |
| MMBench [42] | Image(s) | 2974 | A/B/C/D | ✓ | - | - | - | - | - | - | - | ✓ | ✗ |
| SEED-Bench-2 [32] | Image(s) & Video | 24371 | A/B/C/D | ✓ | - | - | - | - | - | - | - | ✓ | ✓ |
| AVQA Dataset [81] | Video & Audio | 57335 | A/B/C/D | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ |
| Pano-AVQA Dataset [88] | Video & Audio | 51700 | defined words & bbox | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| Music-AVQA Dataset [33] | Video & Audio | 45867 | defined words | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| SAVE Bench [68] | Image & Video & Audio | 4350 | free-form | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ |
| OmniBench [37] | Image & Audio | 1142 | A/B/C/D | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ |
| AV-Odyssey Bench (ours) | Image(s) & Video & Audio(s) | 4555 | A/B/C/D | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |