Published on November 25
TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation
cs.CV
cs.AI
cs.RO
Released Date: November 25, 2024
Authors: Linqing Zhong¹, Chen Gao², Zihan Ding¹, Yue Liao³, Si Liu¹
Affiliations: ¹Beihang University; ²National University of Singapore; ³MMLab, CUHK

| Method | Venue | Zero-Shot | Training-Free | Reasoning Domain | MP3D SR | MP3D SPL | HM3D SR | HM3D SPL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SemEXP [3] | NeurIPS 2020 | | | Latent Map | 36.0 | 14.4 | - | - |
| PONI [27] | CVPR 2022 | | | Latent Map | 31.8 | 12.1 | - | - |
| ProcTHOR [8] | NeurIPS 2022 | | | CLIP Embeddings | - | - | 54.4 | 31.8 |
| ProcTHOR-ZS [8] | NeurIPS 2022 | ✓ | | CLIP Embeddings | - | - | 13.2 | 7.7 |
| ZSON [21] | NeurIPS 2022 | ✓ | | CLIP Embeddings | 15.3 | 4.8 | 25.5 | 12.6 |
| PSL [32] | ECCV 2024 | ✓ | | CLIP Embeddings | - | - | 42.4 | 19.2 |
| Pixel-Nav [1] | ICRA 2024 | ✓ | | Linguistic | - | - | 37.9 | 20.5 |
| SGM [46] | CVPR 2024 | ✓ | | Linguistic | 37.7 | 14.7 | 60.2 | 30.8 |
| CoW [10] | CVPR 2023 | ✓ | ✓ | CLIP Embeddings | 7.4 | 3.7 | - | - |
| ESC [49] | ICML 2023 | ✓ | ✓ | Linguistic | 28.7 | 14.2 | 39.2 | 22.3 |
| VoroNav [34] | ICML 2024 | ✓ | ✓ | Linguistic | - | - | 42.0 | 26.0 |
| TopV-Nav (Ours) | | ✓ | ✓ | Top-view Map | 31.9 | 16.1 | 45.9 | 28.0 |