Effective SAM Combination for Open-Vocabulary Semantic Segmentation
cs.CV
Released Date: November 22, 2024
Authors: Minhyeok Lee1, Suhwan Cho1, Jungho Lee1, Sunghun Yang1, Heeseung Choi2, Ig-Jae Kim2, Sangyoun Lee1
Aff.: 1Yonsei University; 2Korea Institute of Science and Technology (KIST)
![[Uncaptioned image]](https://arxiv.org/html/2411.14723v1/x1.png)
| Model | Publication | VLM | Additional Backbone | Training Dataset | Additional Dataset | A-847 | PC-459 | A-150 | PC-59 | PAS-20 | PAS-20b |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPNet [32] | CVPR’19 | - | ResNet-101 | PASCAL VOC | ✗ | - | - | - | 24.3 | 18.3 | - |
| ZS3Net [1] | NeurIPS’19 | - | ResNet-101 | PASCAL VOC | ✗ | - | - | - | 19.4 | 38.3 | - |
| LSeg [20] | ICLR’22 | CLIP ViT-B/32 | ResNet-101 | PASCAL VOC-15 | ✗ | - | - | - | - | 47.4 | - |
| LSeg+ [10] | ECCV’22 | ALIGN | ResNet-101 | COCO-Stuff | ✗ | 2.5 | 5.2 | 13.0 | 36.0 | - | 59.0 |
| ZegFormer [6] | CVPR’22 | CLIP ViT-B/16 | ResNet-101 | COCO-Stuff-156 | ✗ | 4.9 | 9.1 | 16.9 | 42.8 | 86.2 | 62.7 |
| ZSseg [35] | ECCV’22 | CLIP ViT-B/16 | ResNet-101 | COCO-Stuff | ✗ | 7.0 | - | 20.5 | 47.7 | 88.4 | - |
| OpenSeg [10] | ECCV’22 | ALIGN | ResNet-101 | COCO Panoptic | ✓ | 4.4 | 7.9 | 17.5 | 40.1 | - | 63.8 |
| OVSeg [21] | CVPR’23 | CLIP ViT-B/16 | ResNet-101c | COCO-Stuff | ✓ | 7.1 | 11.0 | 24.8 | 53.3 | 92.6 | - |
| ZegCLIP [40] | CVPR’23 | CLIP ViT-B/16 | - | COCO-Stuff-156 | ✗ | - | - | - | 41.2 | 93.6 | - |
| SAN [36] | CVPR’23 | CLIP ViT-B/16 | - | COCO-Stuff | ✗ | 10.1 | 12.6 | 27.5 | 53.8 | 94.0 | - |
| DeOP [11] | ICCV’23 | CLIP ViT-B/16 | ResNet-101c | COCO-Stuff-156 | ✗ | 7.1 | 9.4 | 22.9 | 48.8 | 91.7 | - |
| SCAN [22] | CVPR’24 | CLIP ViT-B/16 | Swin-B | COCO-Stuff | ✗ | 10.8 | 13.2 | 30.8 | 58.4 | 97.0 | - |
| EBSeg [30] | CVPR’24 | CLIP ViT-B/16 | SAM ViT-B | COCO-Stuff | ✗ | 11.1 | 17.3 | 30.0 | 56.7 | 94.6 | - |
| SED [33] | CVPR’24 | ConvNeXt-B | - | COCO-Stuff | ✗ | 11.4 | 18.6 | 31.6 | 57.3 | 94.4 | - |
| CAT-Seg [5] | CVPR’24 | CLIP ViT-B/16 | - | COCO-Stuff | ✗ | 12.0 | 19.0 | 31.8 | 57.5 | 94.6 | 77.3 |
| ESC-Net (ours) | - | CLIP ViT-B/16 | - | COCO-Stuff | ✗ | 13.3 (+1.3) | 21.1 (+2.1) | 35.6 (+3.8) | 59.0 (+0.6) | 97.3 (+0.3) | 80.1 (+2.8) |
| LSeg [20] | ICLR’22 | CLIP ViT-B/32 | ViT-L/16 | PASCAL VOC-15 | ✗ | - | - | - | - | 52.3 | - |
| OpenSeg [10] | ECCV’22 | ALIGN | EfficientNet-B7 | COCO Panoptic | ✓ | 8.1 | 11.5 | 26.4 | 44.8 | - | 70.2 |
| OVSeg [21] | CVPR’23 | CLIP ViT-L/14 | Swin-B | COCO-Stuff | ✓ | 9.0 | 12.4 | 29.6 | 55.7 | 94.5 | - |
| SAN [36] | CVPR’23 | CLIP ViT-L/14 | - | COCO-Stuff | ✗ | 12.4 | 15.7 | 32.1 | 57.7 | 94.6 | - |
| ODISE [34] | CVPR’23 | CLIP ViT-L/14 | Stable Diffusion | COCO-Stuff | ✗ | 11.1 | 14.5 | 29.9 | 57.3 | - | - |
| FC-CLIP [37] | NeurIPS’23 | ConvNeXt-L | - | COCO Panoptic | ✗ | 11.2 | 12.7 | 26.6 | 42.4 | 89.5 | - |
| MAFT [15] | NeurIPS’23 | CLIP ViT-L/14 | - | COCO-Stuff | ✗ | 12.7 | 16.2 | 33.0 | 59.0 | 92.1 | - |
| USE [31] | CVPR’24 | CLIP ViT-L/14 | DINOv2, SAM | COCO-Stuff | ✓ | 13.4 | 15.0 | 37.1 | 58.0 | - | - |
| SCAN [22] | CVPR’24 | CLIP ViT-L/14 | Swin-B | COCO-Stuff | ✗ | 14.0 | 16.7 | 33.5 | 59.3 | 97.0 | - |
| EBSeg [30] | CVPR’24 | CLIP ViT-L/14 | SAM ViT-B | COCO-Stuff | ✗ | 13.7 | 21.0 | 32.8 | 60.2 | 97.2 | - |
| SED [33] | CVPR’24 | ConvNeXt-L | - | COCO-Stuff | ✗ | 13.9 | 22.6 | 35.2 | 60.6 | 96.1 | - |
| CAT-Seg [5] | CVPR’24 | CLIP ViT-L/14 | - | COCO-Stuff | ✗ | 16.0 | 23.8 | 37.9 | 63.3 | 97.0 | 82.5 |
| MAFT+ [17] | ECCV’24 | ConvNeXt-L | - | COCO-Stuff | ✗ | 15.1 | 21.6 | 36.1 | 59.4 | 96.5 | - |
| ESC-Net (ours) | - | CLIP ViT-L/14 | - | COCO-Stuff | ✗ | 18.1 (+2.1) | 27.0 (+3.2) | 41.8 (+3.9) | 65.6 (+2.3) | 98.3 (+1.1) | 86.3 (+3.8) |
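
The parenthesized values in the ESC-Net rows appear to be gains over the best prior score in each column, not gains over any single baseline. The sketch below checks that reading for the first five benchmark columns of the CLIP ViT-B/16 group. The scores are copied from the table; the function and variable names are illustrative assumptions, and only the four CVPR'24 baselines are included because they hold every column maximum in that group.

```python
# Benchmark columns, in table order (first five score columns only).
BENCHMARKS = ["A-847", "PC-459", "A-150", "PC-59", "PAS-20"]

# Scores copied from the CLIP ViT-B/16 group of the table.
PRIOR = {
    "SCAN": [10.8, 13.2, 30.8, 58.4, 97.0],
    "EBSeg": [11.1, 17.3, 30.0, 56.7, 94.6],
    "SED": [11.4, 18.6, 31.6, 57.3, 94.4],
    "CAT-Seg": [12.0, 19.0, 31.8, 57.5, 94.6],
}
ESC_NET = [13.3, 21.1, 35.6, 59.0, 97.3]


def gains_over_best_prior(ours, prior):
    """Return, per column, our score minus the best prior score (1 decimal)."""
    gains = []
    for i, score in enumerate(ours):
        best = max(row[i] for row in prior.values())
        gains.append(round(score - best, 1))
    return gains


GAINS = gains_over_best_prior(ESC_NET, PRIOR)
for name, gain in zip(BENCHMARKS, GAINS):
    print(f"{name}: +{gain}")
```

Under this reading the reference method changes by column: CAT-Seg is the best prior on A-847, PC-459, and A-150, while SCAN is the best prior on PC-59 and PAS-20.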