DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction
cs.CV
Released Date: December 9, 2024
Authors: Yunheng Li¹, Yuxuan Li², Quansheng Zeng¹, Wenhai Wang³, Qibin Hou¹, Ming-Ming Cheng¹
Affiliations: ¹VCIP Lab, Computer Science, NKU; ²University College London; ³The Chinese University of Hong Kong
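| Method | COCO Boxes Top1 | COCO Boxes Top5 | COCO Masks-T Top1 | COCO Masks-T Top5 | COCO Masks-S Top1 | COCO Masks-S Top5 | ADE20K Boxes Top1 | ADE20K Boxes Top5 | ADE20K Masks-T Top1 | ADE20K Masks-T Top5 | ADE20K Masks-S Top1 | ADE20K Masks-S Top5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |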
| OpenCLIP [3] | 49.8 | 74.3 | 51.9 | 72.2 | 29.2 | 54.9 | 28.4 | 54.1 | 29.6 | 53.4 | 37.9 | 66.6 |
| DFN [10] | 38.3 | 65.0 | 31.0 | 57.0 | 26.4 | 54.9 | 30.6 | 57.9 | 24.2 | 49.9 | 32.2 | 57.7 |
| SigLIP [55] | 39.9 | 61.4 | 40.4 | 60.1 | 30.3 | 56.4 | 25.9 | 49.2 | 27.3 | 47.6 | 34.5 | 57.3 |
| EVA-CLIP [41] | 44.3 | 68.7 | 44.7 | 66.0 | 26.2 | 51.9 | 33.0 | 57.6 | 33.9 | 56.2 | 36.2 | 62.3 |
| RegionCLIP† [57] | 68.5 | 89.5 | 60.7 | 84.3 | 22.0 | 53.5 | 43.2 | 72.2 | 34.0 | 62.6 | 37.7 | 68.6 |
| CLIPSelf† [46] | 69.1 | 88.2 | 66.7 | 83.0 | 41.7 | 75.2 | 48.1 | 77.7 | 47.5 | 74.2 | 53.7 | 82.8 |
| DenseVLM† (Ours) | 73.4 | 90.5 | 71.0 | 84.8 | 45.6 | 77.8 | 51.3 | 82.2 | 52.1 | 78.0 | 57.8 | 85.5 |