CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation
cs.CV
Released Date: November 21, 2024
Authors: Lin Sun1, Jiale Cao1, Jin Xie2, Xiaoheng Jiang3, Yanwei Pang4
Aff.: 1Tianjin University; 2Chongqing University; 3Zhengzhou University; 4Tianjin University, Shanghai Artificial Intelligence Laboratory

| Method | Encoder | VOC mAP | VOC F1 | VOC P | VOC R | Context mAP | Context F1 | Context P | Context R | Object mAP | Object F1 | Object P | Object R |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [25] | ViT-L/14 | 90.2 | 75.3 | 79.7 | 71.3 | 59.8 | 54.2 | 55.1 | 53.2 | 66.3 | 53.7 | 57.9 | 50.0 |
| MaskCLIP [51] | ViT-L/14 | 87.3 | 63.2 | 56.3 | 72.1 | 58.6 | 48.9 | 48.5 | 49.3 | 67.4 | 48.2 | 47.6 | 52.4 |
| SCLIP [34] | ViT-L/14 | 92.7 | 74.5 | 81.0 | 69.1 | 63.2 | 57.7 | 57.6 | 57.7 | 71.4 | 55.4 | 64.5 | 48.6 |
| ClearCLIP [16] | ViT-L/14 | 92.1 | 74.0 | 80.5 | 68.4 | 63.0 | 57.4 | 52.3 | 63.6 | 70.4 | 54.3 | 61.7 | 48.5 |
| ProxyCLIP [15] | ViT-L/14 | 94.0 | 75.5 | 86.6 | 67.0 | 64.6 | 57.3 | 52.6 | 62.8 | 73.4 | 57.4 | 65.0 | 51.3 |
| CLIPer (Ours) | ViT-L/14 | 94.6 | 86.0 | 86.7 | 85.3 | 68.9 | 63.3 | 63.4 | 64.3 | 77.9 | 62.3 | 69.7 | 56.3 |
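
As a reading aid for the P, R, and F1 columns, the sketch below recomputes F1 from precision and recall for a few VOC rows. It assumes F1 is the standard harmonic mean of P and R, which this page does not state; the helper name `f1_from_pr` is hypothetical. Under that assumption the recomputed values agree with these three reported entries to one decimal, but other cells need not match exactly, for example if the scores are averaged per class before reporting.

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition; inputs in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) copied from the VOC columns of the table above.
voc_rows = {
    "CLIP": (79.7, 71.3, 75.3),
    "MaskCLIP": (56.3, 72.1, 63.2),
    "CLIPer": (86.7, 85.3, 86.0),
}

for method, (p, r, reported) in voc_rows.items():
    recomputed = f1_from_pr(p, r)
    print(f"{method}: recomputed F1 = {recomputed:.1f}, reported F1 = {reported}")
```

On these rows, CLIPer's gain in F1 over CLIP comes mostly from recall (85.3 vs. 71.3) rather than precision (86.7 vs. 79.7).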