Grid-augumented vision: A simple yet effective approach for enhanced spatial understanding in multi-modal agents
cs.CV
Release date: November 27, 2024
Authors: Joongwon Chae¹, Zhenyu Wang¹, Peiwu Qin¹
Affiliation: ¹Institute of Biopharmaceutical and Health Engineering, Shenzhen International Graduate School, Tsinghua University, Shenzhen, Guangdong, China
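
Every configuration in the results table is scored with IoU and GIoU. For reference, here is how these two overlap measures are usually defined for axis-aligned bounding boxes: IoU = |A ∩ B| / |A ∪ B|, and GIoU = IoU − |C \ (A ∪ B)| / |C|, where C is the smallest box enclosing both A and B. The Python sketch below assumes boxes in corner format `(x1, y1, x2, y2)`. The helper names (`iou`, `giou`, `_inter_union`) are illustrative, and this is not the authors' evaluation code.

```python
from typing import Tuple

# Axis-aligned box in corner format (x1, y1, x2, y2), assuming x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def _area(b: Box) -> float:
    """Area of a box; degenerate boxes get zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def _inter_union(a: Box, b: Box) -> Tuple[float, float]:
    """Return (intersection area, union area) of two boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = _area(a) + _area(b) - inter
    return inter, union


def iou(a: Box, b: Box) -> float:
    """Intersection over Union, in [0, 1]."""
    inter, union = _inter_union(a, b)
    return inter / union if union > 0 else 0.0


def giou(a: Box, b: Box) -> float:
    """Generalized IoU: IoU minus the fraction of the smallest enclosing
    box C that the union does not cover. Ranges over (-1, 1]."""
    inter, union = _inter_union(a, b)
    iou_val = inter / union if union > 0 else 0.0
    # Area of the smallest axis-aligned box enclosing both a and b.
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if c_area <= 0:
        return iou_val
    return iou_val - (c_area - union) / c_area


if __name__ == "__main__":
    gt, pred = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
    print(f"IoU={iou(gt, pred):.4f} GIoU={giou(gt, pred):.4f}")  # IoU=0.1429 GIoU=-0.0794
```

Because GIoU subtracts a non-negative penalty from IoU, it can never exceed IoU for the same pair of boxes. This is consistent with every row of the table reporting a GIoU no higher than its IoU. Unlike IoU, GIoU still distinguishes between non-overlapping predictions by how far apart they are.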

| Configuration | IoU | GIoU |
|---|---|---|
| Original images + CoT | 0.27 | 0.18 |
| 3×3 - black - 0.1 | 0.33 | 0.24 |
| 5×5 - black - 0.1 | 0.46 | 0.41 |
| 7×7 - black - 0.1 | 0.49 | 0.45 |
| 9×9 - black - 0.1 | 0.53 | 0.49 |
| 20×20 - black - 0.1 | 0.45 | 0.40 |
| 30×30 - black - 0.1 | 0.36 | 0.30 |
| 3×3 - black - 0.3 | 0.43 | 0.38 |
| 5×5 - black - 0.3 | 0.51 | 0.47 |
| 7×7 - black - 0.3 | 0.54 | 0.51 |
| 9×9 - black - 0.3 | 0.56 | 0.53 |
| 20×20 - black - 0.3 | 0.45 | 0.41 |
| 30×30 - black - 0.3 | 0.37 | 0.32 |
| 3×3 - black - 0.5 | 0.38 | 0.29 |
| 5×5 - black - 0.5 | 0.43 | 0.38 |
| 7×7 - black - 0.5 | 0.45 | 0.40 |
| 9×9 - black - 0.5 | 0.48 | 0.43 |
| 20×20 - black - 0.5 | 0.40 | 0.31 |
| 30×30 - black - 0.5 | 0.36 | 0.30 |
| 3×3 - black - 0.7 | 0.39 | 0.31 |