Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation
Categories: cs.CV, cs.AI, cs.CL
Released Date: November 28, 2024
Authors: Luca Barsellotti¹, Lorenzo Bianchi², Nicola Messina², Fabio Carrara², Marcella Cornia¹, Lorenzo Baraldi¹, Fabrizio Falchi², Rita Cucchiara¹
Aff.: ¹University of Modena and Reggio Emilia, Italy; ²ISTI-CNR, Italy

All scores are mIoU, reported separately for ViT-Base and ViT-Large backbones.

| Model | Visual Backbone | Frozen | ViT-Base VOC | ViT-Base Context | ViT-Base Stuff | ViT-Base Cityscapes | ViT-Base ADE | ViT-Base Avg | ViT-Large VOC | ViT-Large Context | ViT-Large Stuff | ViT-Large Cityscapes | ViT-Large ADE | ViT-Large Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **without Mask Refinement** | | | | | | | | | | | | | | |
| GroupViT [49] | Custom ViT | ✗ | 79.7 | 23.4 | 15.3 | 11.1 | 9.2 | 27.7 | - | - | - | - | - | - |
| ReCo [37] | CLIP | ✗ | 57.7 | 22.3 | 14.8 | 21.1 | 11.2 | 25.4 | - | - | - | - | - | - |
| TCL [9] | CLIP | ✗ | 77.5 | 30.3 | 19.6 | 23.1 | 14.9 | 33.1 | - | - | - | - | - | - |
| SILC [32] | Custom ViT | ✗ | 77.5 | 31.6 | 20.8 | 26.9 | 19.3 | 35.2 | - | - | - | - | - | - |
| MaskCLIP [58] | CLIP | ✓ | 72.1 | 25.3 | 15.1 | 11.2 | 9.0 | 26.5 | 29.4 | 12.4 | 8.8 | 11.5 | 7.2 | 13.9 |
| CLIP-DIY [46] | CLIP+DINO | ✓ | 79.7 | 19.8 | 13.3 | 11.6 | 9.9 | 26.9 | - | - | - | - | - | - |
| OVDiff [22] | CLIP+DINO+SD | ✓ | 80.9 | 32.9 | 20.3 | 23.4 | 14.9 | 34.5 | - | - | - | - | - | - |
| SCLIP [42] | CLIP | ✓ | 80.4 | 34.2 | 22.4 | 32.2 | 16.1 | 37.1 | 70.6 | 25.2 | 17.6 | 21.3 | 10.9 | 29.1 |
| CLIP-DINOiser [47] | CLIP | ✓ | 80.9 | 31.7 | 24.6 | 35.9 | 20.0 | 38.6 | - | - | - | - | - | - |
| NACLIP [19] | CLIP | ✓ | 79.7 | 35.2 | 23.3 | 35.5 | 17.4 | 38.2 | 78.7 | 32.1 | 21.4 | 31.4 | 17.3 | 36.2 |
| ProxyCLIP [23] | CLIP+DINOv2 | ✓ | 83.0 | 37.2 | 25.4 | 33.9 | 19.7 | 39.8 | 85.2 | 36.2 | 24.6 | 35.2 | 21.6 | 40.6 |
| ProxyCLIP [23] | CLIP+DINO | ✓ | 80.3 | 39.1 | 26.5 | 38.1 | 20.2 | 40.8 | 83.2 | 37.7 | 25.6 | 40.1 | 22.6 | 41.8 |
| Talk2DINO (Ours) | DINOv2 | ✓ | 85.3 | 40.5 | 27.9 | 38.2 | 21.8 | 42.7 | 86.8 | 39.3 | 27.1 | 37.0 | 21.4 | 42.3 |
| **with Mask Refinement** | | | | | | | | | | | | | | |
| GroupViT [49] | Custom ViT | ✗ | 81.5 | 23.8 | 15.4 | 11.6 | 9.4 | 28.3 | - | - | - | - | - | - |
| ReCo [37] | CLIP | ✗ | 62.4 | 24.7 | 16.3 | 22.8 | 12.4 | 27.7 | - | - | - | - | - | - |
| TCL [9] | CLIP | ✗ | 83.2 | 33.9 | 22.4 | 24.0 | 17.1 | 36.1 | - | - | - | - | - | - |
| MaskCLIP [58] | CLIP | ✓ | 74.9 | 26.4 | 16.4 | 12.6 | 9.8 | 28.0 | 32.7 | 14.6 | 10.5 | 13.2 | 8.5 | 15.9 |
| FOSSIL [2] | DINOv2 | ✓ | - | - | - | - | - | - | - | 35.8 | 24.8 | 23.2 | 18.8 | - |
| SCLIP [42] | CLIP | ✓ | 83.5 | 36.1 | 23.9 | 34.1 | 17.8 | 39.1 | 76.3 | 27.4 | 18.7 | 23.9 | 11.8 | 31.6 |
| NACLIP [19] | CLIP | ✓ | 83.0 | 38.4 | 25.7 | 38.3 | 19.1 | 40.9 | 84.5 | 36.4 | 24.6 | 37.1 | 19.6 | 40.4 |
| FreeDA [3] | DINOv2 | ✓ | 75.6 | 37.6 | 25.2 | 34.4 | 20.7 | 38.7 | 70.2 | 39.0 | 26.9 | 33.2 | 19.5 | 37.8 |
| FreeDA [3] | CLIP+DINOv2 | ✓ | 85.6 | 43.1 | 27.8 | 36.7 | 22.4 | 43.1 | 87.9 | 43.5 | 28.8 | 36.7 | 23.2 | 44.0 |
| Talk2DINO (Ours) | DINOv2 | ✓ | 87.0 | 43.5 | 30.3 | 40.8 | 23.5 | 45.0 | 88.7 | 43.2 | 30.0 | 39.3 | 23.4 | 44.9 |
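
The Avg column appears to be the unweighted arithmetic mean of the five per-dataset mIoU scores, rounded to one decimal. The page does not state this, so it is an assumption, but the Talk2DINO ViT-Base rows are consistent with it. A minimal sketch of that check follows. The helper name `average_miou` and the dictionary layout are illustrative only and do not come from the paper.

```python
# Per-dataset mIoU for Talk2DINO (ViT-Base), copied from the table above.
# Order: VOC, Context, Stuff, Cityscapes, ADE.
talk2dino_vit_base = {
    "without Mask Refinement": [85.3, 40.5, 27.9, 38.2, 21.8],
    "with Mask Refinement": [87.0, 43.5, 30.3, 40.8, 23.5],
}


def average_miou(scores):
    """Unweighted mean of per-dataset mIoU, rounded to one decimal.

    Assumption: this is how the Avg column is derived; the source
    does not say so explicitly.
    """
    return round(sum(scores) / len(scores), 1)


for setting, scores in talk2dino_vit_base.items():
    print(f"{setting}: Avg = {average_miou(scores)}")
```

Because the mean is unweighted, each benchmark counts equally. A large gain on VOC, where scores are high, moves the average as much as the same absolute gain on ADE, where scores are low.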