notesum.ai

Published on November 26

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

cs.CV

Release Date: November 26, 2024

Authors: Cong Wei¹, Yujie Zhong², Haoxian Tan², Yong Liu¹, Zheng Zhao², Jie Hu², Yujiu Yang¹

Affiliations: ¹Tsinghua Shenzhen International Graduate School, Tsinghua University; ²Meituan Inc.

arXiv: http://arxiv.org/abs/2411.17606v1

Method	Backbone	ReVOS-Reasoning $\mathcal{J}$	ReVOS-Reasoning $\mathcal{F}$	ReVOS-Reasoning $\mathcal{J\&F}$	ReVOS-Referring $\mathcal{J}$	ReVOS-Referring $\mathcal{F}$	ReVOS-Referring $\mathcal{J\&F}$	ReVOS-Overall $\mathcal{J}$	ReVOS-Overall $\mathcal{F}$	ReVOS-Overall $\mathcal{J\&F}$	ReasonSeg gIoU	ReasonSeg cIoU
LMPM [12]	Swin-T	13.3	24.3	18.8	29.0	39.1	34.1	21.2	31.7	26.4	-	-
ReferFormer [46]	Video-Swin-B	21.3	25.6	23.4	31.2	34.3	32.7	26.2	29.9	28.1	-	-
LISA-7B [21] $\ddagger$	ViT-H	33.8	38.4	36.1	44.3	47.1	45.7	39.1	42.7	40.9	52.9	54.0
LaSagnA-7B [44] $\ddagger$	ViT-H	-	-	-	-	-	-	-	-	-	48.8	47.2
SAM4MLLM-7B [6] $\ddagger$	EfficientViT-SAM-XL1	-	-	-	-	-	-	-	-	-	46.7	48.1
TrackGPT-13B [62] $\ddagger$	ViT-H	38.1	42.9	40.5	48.3	50.6	49.5	43.2	46.8	45.0	-	-
VISA-7B [51] $\ddagger$	ViT-H	36.7	41.7	39.2	51.1	54.7	52.9	43.9	48.2	46.1	52.7	57.8
VISA-13B [51] $\ddagger$	ViT-H	38.3	43.5	40.9	52.3	55.8	54.1	45.3	49.7	47.5	-	-
HyperSeg-3B	Swin-B	50.2	55.8	53.0	56.0	60.9	58.5	53.1	58.4	55.7	59.2	56.7
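
The $\mathcal{J\&F}$ columns appear to follow the usual video object segmentation convention of averaging region similarity $\mathcal{J}$ and contour accuracy $\mathcal{F}$. The sketch below checks that assumption against three rows copied from the table. The names `rows`, `j_and_f`, and `max_deviation` are illustrative and do not come from the paper.

```python
# Each entry: method -> [(J, F, reported J&F)] for the ReVOS
# Reasoning, Referring, and Overall splits, copied from the table above.
rows = {
    "LMPM": [(13.3, 24.3, 18.8), (29.0, 39.1, 34.1), (21.2, 31.7, 26.4)],
    "VISA-13B": [(38.3, 43.5, 40.9), (52.3, 55.8, 54.1), (45.3, 49.7, 47.5)],
    "HyperSeg-3B": [(50.2, 55.8, 53.0), (56.0, 60.9, 58.5), (53.1, 58.4, 55.7)],
}


def j_and_f(j, f):
    """Assumed definition: arithmetic mean of J and F."""
    return (j + f) / 2


def max_deviation(table):
    """Largest gap between the recomputed mean and the reported J&F."""
    return max(
        abs(j_and_f(j, f) - reported)
        for triples in table.values()
        for j, f, reported in triples
    )


if __name__ == "__main__":
    # Reported values are rounded to one decimal, so small gaps are expected.
    print(f"max deviation: {max_deviation(rows):.3f}")
```

For these rows the recomputed means stay within rounding distance of the reported values, which is consistent with the averaging assumption but does not confirm how the authors computed the column.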