
Published on November 21

CLIPer: Hierarchically Improving Spatial Representation of CLIP for Open-Vocabulary Semantic Segmentation

cs.CV

Released Date: November 21, 2024

Authors: Lin Sun¹, Jiale Cao¹, Jin Xie², Xiaoheng Jiang³, Yanwei Pang⁴

Affiliations: ¹Tianjin University; ²Chongqing University; ³Zhengzhou University; ⁴Tianjin University, Shanghai Artificial Intelligence Laboratory

arXiv: http://arxiv.org/abs/2411.13836v1


Method	Encoder	VOC mAP	VOC F1	VOC P	VOC R	Context mAP	Context F1	Context P	Context R	Object mAP	Object F1	Object P	Object R
CLIP [25]	ViT-L/14	90.2	75.3	79.7	71.3	59.8	54.2	55.1	53.2	66.3	53.7	57.9	50.0
MaskCLIP [51]	ViT-L/14	87.3	63.2	56.3	72.1	58.6	48.9	48.5	49.3	67.4	48.2	47.6	52.4
SCLIP [34]	ViT-L/14	92.7	74.5	81.0	69.1	63.2	57.7	57.6	57.7	71.4	55.4	64.5	48.6
ClearCLIP [16]	ViT-L/14	92.1	74.0	80.5	68.4	63.0	57.4	52.3	63.6	70.4	54.3	61.7	48.5
ProxyCLIP [15]	ViT-L/14	94.0	75.5	86.6	67.0	64.6	57.3	52.6	62.8	73.4	57.4	65.0	51.3
CLIPer (Ours)	ViT-L/14	94.6	86.0	86.7	85.3	68.9	63.3	63.4	64.3	77.9	62.3	69.7	56.3
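
The excerpt does not define the P, R, and F1 columns. Assuming P and R denote precision and recall (in percent) and F1 is their harmonic mean, the relationship between the columns can be checked with a minimal sketch like the one below. The helper name `f1_score` and the choice of rows are illustrative and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale for both inputs)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# VOC values copied from the table above: (P, R, reported F1).
# Assumption: P = precision, R = recall, both in percent.
voc_rows = {
    "CLIP": (79.7, 71.3, 75.3),
    "MaskCLIP": (56.3, 72.1, 63.2),
    "CLIPer": (86.7, 85.3, 86.0),
}

for method, (p, r, reported) in voc_rows.items():
    print(f"{method}: computed F1 = {f1_score(p, r):.1f}, reported F1 = {reported}")
```

For these three rows the computed value matches the reported F1 to one decimal place. Not every row in the table reproduces exactly under this assumption, which could happen if the scores were averaged per class or rounded from higher-precision values.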