notesum.ai
Published on December 5
Towards Real-Time Open-Vocabulary Video Instance Segmentation
cs.CV
Released Date: December 5, 2024
Authors: Bin Yan¹, Martin Sundermeyer, David Joseph Tan, Huchuan Lu, Federico Tombari
Aff.: ¹Dalian University of Technology

| Method | BURST [3] | | | | | | LV-VIS [38] | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | ALL | | Common | | Uncommon | | | | | |
| STCN Tracker [3] | 5.5 | 0.9 | 17.5 | 0.7 | 2.5 | 0.6 | - | - | - | - |
| Box Tracker [3] | 8.2 | 1.4 | 27.0 | 3.0 | 3.6 | 0.9 | - | - | - | - |
| Detic [58]-SORT [4] | - | - | - | - | - | - | 12.8 | 21.1 | 6.6 | 6.7 |
| Detic [58]-XMem [7] | - | - | - | - | - | - | 16.3 | 24.1 | 10.6 | 13.4 |
| OV2Seg [38] | - | 3.7 | - | - | - | - | 14.2 | 17.2 | 11.9 | 20.1 |
| GLEE-Lite [40] | 22.6 | 12.6 | 36.4 | 18.9 | 19.1 | 11.0 | 19.6 | 22.1 | 17.7 | 1.3 |
| TROY-VIS | 23.9 | 12.4 | 42.3 | 19.6 | 19.3 | 10.7 | 20.9 | 23.4 | 19.1 | 20.9 |