Evaluating and Advancing Multimodal Large Language Models in Ability Lens
cs.CV
cs.CL
Released Date: November 22, 2024
Authors: Feng Chen¹, Chenhui Gou², Jing Liu², Yang Yang³, Zhaoyang Li⁴, Jiyuan Zhang⁴, Zhenbang Sun⁴, Bohan Zhuang⁵, Qi Wu¹
Affiliations: ¹AIML, University of Adelaide; ²Monash University; ³The Australian National University; ⁴TikTok, Australia; ⁵Zhejiang University

| Model | Counting | OCR | Grounding | Entity | Attribute | Structured Data | Average |
|---|---|---|---|---|---|---|---|
| LLaVA1.5-7b | 36.78 | 29.13 | 28.64 | 73.83 | 46.88 | 22.87 | 39.69 |
| LLaVA1.6-7b | 42.80 | 63.99 | 38.80 | 75.02 | 55.70 | 37.89 | 52.37 |
| LLaVA-OV-0.5b | 32.33 | 64.54 | 17.04 | 62.81 | 44.90 | 23.55 | 40.86 |
| LLaVA-OV-7b | 51.60 | 71.23 | 50.36 | 84.16 | 62.04 | 64.83 | 64.04 |
| LLaVA-OV-SI-7b | 49.44 | 78.17 | 45.24 | 85.55 | 62.22 | 61.40 | 63.67 |
| LLaVA-Video-7b | 37.82 | 53.08 | 32.00 | 68.56 | 60.63 | 33.13 | 47.54 |
| LLaVA-OV-72b | 56.75 | 81.35 | 59.23 | 86.40 | 69.08 | 73.15 | 70.99 |
| QwenVL2-2b | 48.28 | 73.36 | 35.00 | 82.39 | 56.61 | 46.36 | 57.00 |
| QwenVL2-7b | 50.95 | 79.29 | 47.12 | 85.10 | 62.66 | 67.56 | 65.45 |
| QwenVL2-72b | 55.84 | 86.09 | 58.70 | 86.43 | 65.11 | 79.32 | 71.92 |
| InternVL2-8b | 49.23 | 80.47 | 50.89 | 84.30 | 61.21 | 64.92 | 65.17 |
| gpt-4-vision-preview | 39.00 | 71.95 | 47.67 | 77.92 | 58.56 | 67.12 | 60.37 |
| gpt-4o-2024-08-06 | 50.55 | 82.57 | 58.84 | 86.47 | 67.67 | 75.82 | 70.32 |
| claude-3-5-sonnet-2024102 | 51.64 | 79.80 | 58.63 | 85.47 | 61.20 | 77.06 | 68.97 |
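
The Average column appears to be the unweighted mean of the six ability scores in each row. The following is a minimal sketch that checks this for a few rows copied from the table; the helper names `mean_score` and `matches_reported` and the 0.01 tolerance are illustrative assumptions, not part of the original evaluation code.

```python
# Scores copied from the table above, in column order:
# Counting, OCR, Grounding, Entity, Attribute, Structured Data.
# The second element of each tuple is the reported Average.
rows = {
    "LLaVA1.5-7b": ([36.78, 29.13, 28.64, 73.83, 46.88, 22.87], 39.69),
    "LLaVA-OV-7b": ([51.60, 71.23, 50.36, 84.16, 62.04, 64.83], 64.04),
    "QwenVL2-72b": ([55.84, 86.09, 58.70, 86.43, 65.11, 79.32], 71.92),
    "gpt-4o-2024-08-06": ([50.55, 82.57, 58.84, 86.47, 67.67, 75.82], 70.32),
}


def mean_score(scores):
    """Unweighted mean of the per-ability scores (hypothetical helper)."""
    return sum(scores) / len(scores)


def matches_reported(scores, reported, tol=0.01):
    """True if the computed mean is within `tol` of the reported average.

    The tolerance (an assumption) absorbs rounding to two decimal places.
    """
    return abs(mean_score(scores) - reported) < tol


for name, (scores, reported) in rows.items():
    computed = mean_score(scores)
    ok = matches_reported(scores, reported)
    print(f"{name}: computed {computed:.2f}, reported {reported:.2f}, match={ok}")
```

A tolerance-based comparison is used rather than exact equality because the reported averages are rounded to two decimals.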