Beyond Boxes: Mask-Guided Spatio-Temporal Feature Aggregation for Video Object Detection
cs.CV
Released Date: December 6, 2024
Authors: Khurram Azeem Hashmi¹, Talha Uddin Sheikh¹, Didier Stricker¹, Muhammad Zeshan Afzal¹
Affiliation: ¹DFKI - German Research Center for Artificial Intelligence

| Method | Source | Backbone | mAP (%) | Time (ms) |
|---|---|---|---|---|
| SELSA [65] | ICCV2019 | X101 | 83.1 | 153.8 |
| RDN [16] | ICCV2019 | R101 | 81.8 | 162.6 |
| MEGA [5] | CVPR2020 | R101 | 82.9 | 230.4 |
| TROIA [22] | AAAI2021 | X101 | 84.3 | 285.7 |
| MAMBA [55] | AAAI2021 | R101 | 84.6 | 110.3(T) |
| QueryProp [28] | AAAI2022 | R101 | 82.3 | 30.8(T) |
| SparseVOD [27] | BMVC2022 | R101 | 81.9 | 142.4 |
| FAQ [9] | CVPR2023 | R50 | 81.7 | 163.2 |
| Liu et al. [43] | ICCV2023 | R101 | 87.2 | 39.6(T) |
| STPN [56] | ICCV2023 | SwinT | 85.0 | 45.7 |
| TransVODLite [71] | TPAMI2022 | SwinT | 83.7 | 42.1 |
| YOLOV-S [54] | AAAI2023 | MCSP | 77.3 | 11.3 |
| YOLOV-L [54] | AAAI2023 | MCSP | 83.6 | 16.3 |
| YOLOV-X [54] | AAAI2023 | MCSP | 85.0 | 22.7 |
| FAIM-S | Ours | MCSP | 78.2 (+0.9) | 11.6 |
| FAIM-L | Ours | MCSP | 84.3 (+0.7) | 16.5 |
| FAIM-X | Ours | MCSP | 85.6 (+0.6) | 22.7 |
| With Post-processing | | | | |
| YOLOV-S [54] | AAAI2023 | MCSP | 80.1 | 11.3 + 6.9 |
| YOLOV-L [54] | AAAI2023 | MCSP | 86.2 | 16.3 + 6.9 |
| YOLOV-X [54] | AAAI2023 | MCSP | 87.2 | 22.7 + 6.1 |
| FAIM-S | Ours | MCSP | 80.6 (+0.5) | 11.6 + 6.9 |
| FAIM-L | Ours | MCSP | 87.0 (+0.8) | 16.5 + 6.9 |
| FAIM-X | Ours | MCSP | 87.9 (+0.7) | 22.7 + 6.9 |
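
The gains listed next to the FAIM results appear to be differences against the YOLOV model of the same size. The sketch below recomputes them from the rows without post-processing, along with the added latency and an approximate frame rate. It is an illustration only: the dictionary layout and the helper names (`map_gain`, `fps`) are hypothetical, and the frame-rate conversion assumes the Time column reports per-frame inference time.

```python
# Values copied from the table rows without post-processing:
# (mAP in %, time in ms) for each model size.
yolov = {"S": (77.3, 11.3), "L": (83.6, 16.3), "X": (85.0, 22.7)}
faim = {"S": (78.2, 11.6), "L": (84.3, 16.5), "X": (85.6, 22.7)}


def map_gain(baseline_map, ours_map):
    """Return the mAP difference in percentage points, rounded to one decimal."""
    return round(ours_map - baseline_map, 1)


def fps(time_ms):
    """Return frames per second, assuming time_ms is a per-frame latency."""
    return 1000.0 / time_ms


# Accuracy gain and latency overhead of FAIM over YOLOV at each size.
gains = {size: map_gain(yolov[size][0], faim[size][0]) for size in yolov}
overhead_ms = {size: round(faim[size][1] - yolov[size][1], 1) for size in yolov}

for size in ("S", "L", "X"):
    print(
        f"FAIM-{size}: +{gains[size]} mAP, "
        f"+{overhead_ms[size]} ms, ~{fps(faim[size][1]):.0f} FPS"
    )
```

Read this way, each FAIM variant adds at most 0.3 ms over its YOLOV counterpart while improving mAP by 0.6 to 0.9 points.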