Published on November 26
What's in the Image? A Deep-Dive into the Vision of Vision Language Models
cs.CV
cs.AI
Released Date: November 26, 2024
Authors: Omri Kaduri, Shai Bagon, Tali Dekel
Affiliation: Weizmann Institute of Science

| Method | Existence ACC | Existence ACC+ | Count ACC | Count ACC+ | Position ACC | Position ACC+ | Color ACC | Color ACC+ | OCR ACC | OCR ACC+ | Poster ACC | Poster ACC+ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Naive (InternVL2) | 98.33 | 96.77 | 81.67 | 63.33 | 80.00 | 60.00 | 86.67 | 76.67 | 67.50 | 35.00 | 90.13 | 84.35 |
| Describe-to-LLM | 90.00 | 80.00 | 75.00 | 73.33 | 66.67 | 46.67 | 86.67 | 80.00 | 77.50 | 55.00 | 86.05 | 80.27 |
| Compressed Context: Query + K=5% | 91.66 | 83.33 | 85.00 | 70.00 | 68.33 | 40.00 | 80.00 | 60.00 | 77.50 | 55.00 | 87.55 | 78.23 |
| Compressed Context: Query + K=2% | 85.00 | 76.67 | 78.33 | 60.00 | 68.33 | 40.00 | 70.00 | 40.00 | 72.50 | 45.00 | 82.39 | 65.49 |
| Compressed Context: Query | 56.67 | 13.33 | 46.67 | 13.33 | 53.33 | 16.67 | 46.67 | 3.33 | 55.00 | 15.00 | 71.08 | 48.29 |
| Compressed Context: K=2% | 65.00 | 30.00 | 56.67 | 30.00 | 56.67 | 30.00 | 50.00 | 10.00 | 52.50 | 10.00 | 64.28 | 33.33 |