HyperSeg: Towards Universal Visual Segmentation with Large Language Model
cs.CV
Released Date: November 26, 2024
Authors: Cong Wei¹, Yujie Zhong², Haoxian Tan², Yong Liu¹, Zheng Zhao², Jie Hu², Yujiu Yang¹
Affiliations: ¹Tsinghua Shenzhen International Graduate School, Tsinghua University; ²Meituan Inc.
![[Uncaptioned image]](https://arxiv.org/html/2411.17606v1/x1.png)
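| Method | Backbone | ReVOS-Reasoning J | ReVOS-Reasoning F | ReVOS-Reasoning J&F | ReVOS-Referring J | ReVOS-Referring F | ReVOS-Referring J&F | ReVOS-Overall J | ReVOS-Overall F | ReVOS-Overall J&F | ReasonSeg gIoU | ReasonSeg cIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |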
| LMPM [12] | Swin-T | 13.3 | 24.3 | 18.8 | 29.0 | 39.1 | 34.1 | 21.2 | 31.7 | 26.4 | - | - |
| ReferFormer [46] | Video-Swin-B | 21.3 | 25.6 | 23.4 | 31.2 | 34.3 | 32.7 | 26.2 | 29.9 | 28.1 | - | - |
| LISA-7B [21] | ViT-H | 33.8 | 38.4 | 36.1 | 44.3 | 47.1 | 45.7 | 39.1 | 42.7 | 40.9 | 52.9 | 54.0 |
| LaSagnA-7B [44] | ViT-H | - | - | - | - | - | - | - | - | - | 48.8 | 47.2 |
| SAM4MLLM-7B [6] | EfficientViT-SAM-XL1 | - | - | - | - | - | - | - | - | - | 46.7 | 48.1 |
| TrackGPT-13B [62] | ViT-H | 38.1 | 42.9 | 40.5 | 48.3 | 50.6 | 49.5 | 43.2 | 46.8 | 45.0 | - | - |
| VISA-7B [51] | ViT-H | 36.7 | 41.7 | 39.2 | 51.1 | 54.7 | 52.9 | 43.9 | 48.2 | 46.1 | 52.7 | 57.8 |
| VISA-13B [51] | ViT-H | 38.3 | 43.5 | 40.9 | 52.3 | 55.8 | 54.1 | 45.3 | 49.7 | 47.5 | - | - |
| HyperSeg-3B | Swin-B | 50.2 | 55.8 | 53.0 | 56.0 | 60.9 | 58.5 | 53.1 | 58.4 | 55.7 | 59.2 | 56.7 |