An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
Categories: cs.CV, cs.AI
Released Date: November 9, 2024
Authors: Fatemeh Shiri¹, Xiao-Yu Guo², Mona Golestan Far¹, Xin Yu³, Gholamreza Haffari¹, Yuan-Fang Li¹
Affiliations: ¹Department of Data Science & AI, Monash University; ²Australian Institute for Machine Learning, University of Adelaide; ³School of Electrical Engineering and Computer Science, University of Queensland

| Prompt | GPT-4 vision (1_obj) | GPT-4 vision (2_obj) | GPT-4 vision (all) | Gemini (1_obj) | Gemini (2_obj) | Gemini (all) | LLaVA-1.5 (1_obj) | LLaVA-1.5 (2_obj) | LLaVA-1.5 (all) | MiniGPT-v2 (1_obj) | MiniGPT-v2 (2_obj) | MiniGPT-v2 (all) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spatial-Obj | | | | | | | | | | | | |
| stan | 50.80 | 52.85 | 52.18 | 46.63 | 41.53 | 43.18 | 51.59 | 43.58 | 46.18 | 45.73 | 35.83 | 39.99 |
| stan+bbox | 71.34 | 60.21 | 63.82 | 50.81 | 45.55 | 47.26 | 56.97 | 44.62 | 48.63 | 53.61 | 40.87 | 44.93 |
| stan+SG | 60.08 | 61.77 | 61.22 | 53.42 | 55.45 | 54.79 | 51.60 | 50.79 | 51.05 | 49.44 | 41.39 | 43.95 |
| GQA-spatial | | | | | | | | | | | | |
| stan | 66.46 | 15.81 | 56.30 | 19.40 | 16.35 | 18.79 | 27.16 | 15.12 | 24.75 | 22.14 | 8.63 | 19.43 |
| stan+bbox | 79.69 | 40.89 | 71.91 | 33.74 | 37.84 | 33.76 | 68.19 | 42.96 | 63.13 | 33.45 | 21.69 | 31.09 |
| stan+bbox | 85.86 | 64.95 | 81.67 | 42.11 | 48.00 | 43.29 | 80.86 | 50.52 | 74.78 | 50.26 | 39.99 | 48.20 |
| stan+SG | 71.60 | 70.69 | 71.42 | 41.36 | 55.21 | 44.14 | 55.40 | 36.90 | 51.69 | 39.67 | 31.69 | 38.07 |
| stan+SG | 64.90 | 55.40 | 62.99 | 24.77 | 49.25 | 29.68 | 36.99 | 46.55 | 38.91 | 51.77 | 33.90 | 48.19 |