RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics
Published on November 25
Categories: cs.CV, cs.AI, cs.CL, cs.RO
Release Date: November 25, 2024
Authors: Chan Hee Song¹, Valts Blukis², Jonathan Tremblay², Stephen Tyree², Yu Su¹, Stan Birchfield²
Affiliations: ¹The Ohio State University; ²NVIDIA

| Model | Indoor: Configuration | Indoor: Context | Indoor: Compatibility | Tabletop: Configuration | Tabletop: Context | Tabletop: Compatibility | Average: Indoor | Average: Tabletop | Average: Total |
|---|---|---|---|---|---|---|---|---|---|
| Open-source VLMs | |||||||||
| 2D VLMs | |||||||||
| VILA [30] | 54.7 | 18.3 | 56.3 | 45.1 | 13.2 | 53.8 | 43.1 | 37.4 | 40.2 |
| +RoboSpatial | 71.4 | 45.9 | 77.2 | 71.8 | 43.7 | 73.3 | 64.8 | 62.9 | 63.9 |
| LLaVA-Next [34] | 48.9 | 12.5 | 32.7 | 48.3 | 8.4 | 30.9 | 31.4 | 29.2 | 30.3 |
| +RoboSpatial | 69.3 | 41.3 | 70.5 | 70.7 | 44.8 | 66.1 | 60.4 | 60.5 | 60.5 |
| SpaceLLaVA [8] | 52.6 | 15.3 | 49.0 | 66.5 | 12.2 | 60.1 | 38.9 | 46.2 | 43.6 |
| +RoboSpatial | 76.0 | 50.7 | 76.6 | 74.9 | 46.4 | 70.5 | 67.8 | 63.6 | 65.7 |
| RoboPoint [57] | 39.0 | 41.4 | 38.3 | 37.9 | 31.6 | 45.2 | 39.6 | 38.2 | 38.9 |
| +RoboSpatial | 72.2 | 68.9 | 72.1 | 70.3 | 61.7 | 78.4 | 71.0 | 70.1 | 70.6 |
| 3D VLMs | |||||||||
| 3D-LLM [18] | 54.5 | 8.1 | 53.6 | 59.2 | 10.6 | 57.4 | 37.6 | 42.4 | 40.0 |
| +RoboSpatial | 76.3 | 35.4 | 77.5 | 76.2 | 46.8 | 75.0 | 63.1 | 66.0 | 64.6 |
| LEO [20] | 56.1 | 11.3 | 58.3 | 60.8 | 11.1 | 59.3 | 41.9 | 43.7 | 42.8 |
| +RoboSpatial | 80.2 | 56.7 | 82.5 | 78.1 | 55.2 | 78.9 | 73.1 | 70.7 | 71.9 |
| Not available for fine-tuning | |||||||||
| 2D VLMs | |||||||||
| Molmo [12] | 40.6 | 48.2 | 60.0 | 61.5 | 35.8 | 54.6 | 49.6 | 50.6 | 50.1 |
| GPT-4o [40] | 63.5 | 25.1 | 59.4 | 62.3 | 27.9 | 66.8 | 49.3 | 52.3 | 50.8 |
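
The Average columns in the table appear to be unweighted means of the three category scores (Configuration, Context, Compatibility). This is an inference from the numbers, not something the source states. The sketch below checks it for a few Indoor rows copied from the table and reports the gain from RoboSpatial fine-tuning. The helper names `mean` and `gain` are hypothetical and used only for illustration.

```python
# Indoor scores copied from the table: (configuration, context, compatibility).
indoor = {
    "VILA": (54.7, 18.3, 56.3),
    "VILA +RoboSpatial": (71.4, 45.9, 77.2),
    "LEO": (56.1, 11.3, 58.3),
    "LEO +RoboSpatial": (80.2, 56.7, 82.5),
}


def mean(scores):
    """Unweighted mean of the category scores.

    Assumption: this is how the Average columns are aggregated;
    the source does not confirm it.
    """
    return sum(scores) / len(scores)


def gain(base, tuned):
    """Difference in indoor average between a tuned model and its base."""
    return mean(indoor[tuned]) - mean(indoor[base])


if __name__ == "__main__":
    for name, scores in indoor.items():
        print(f"{name}: indoor average = {mean(scores):.1f}")
    print(f"VILA gain: {gain('VILA', 'VILA +RoboSpatial'):.1f} points")
    print(f"LEO gain: {gain('LEO', 'LEO +RoboSpatial'):.1f} points")
```

For these four rows the computed means agree with the reported Average: Indoor values to one decimal place. A few other rows in the table differ from this simple mean by a few tenths of a point or more, so treat the aggregation rule as approximate.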