Revisiting the Integration of Convolution and Attention for Vision Backbone
cs.CV
cs.AI
Released Date: November 21, 2024
Authors: Lei Zhu¹, Xinjiang Wang², Wayne Zhang², Rynson W. H. Lau¹
Affiliations: ¹City University of Hong Kong; ²SenseTime Research

| Method | Venue | FLOPs | #Param. | Top-1 (%) |
|---|---|---|---|---|
| Standard supervised training | | | | |
| Swin-T [35] | [ICCV’21] | 4.5G | 29M | 81.3 |
| PVT-S [51] | [ICCV’21] | 3.8G | 25M | 79.8 |
| CSWin-T [14] | [CVPR’22] | 4.5G | 23M | 82.7 |
| CMT-S [18] | [CVPR’22] | 4.0G | 25M | 83.5 |
| RegionViT-S [3] | [ICLR’22] | 5.7G | 31M | 83.3 |
| CrossFormer-S [53] | [ICLR’23] | 5.3G | 31M | 82.5 |
| MaxViT-T [47] | [ECCV’22] | 5.6G | 31M | 83.6 |
| MOAT-0 [59] | [ICLR’23] | 5.7G | 28M | 83.3 |
| NAT-T [20] | [CVPR’23] | 4.3G | 28M | 83.2 |
| InternImage-T [52] | [CVPR’23] | 5G | 30M | 83.5 |
| Flatten-T [19] | [ICCV’23] | 4.3G | 21M | 83.1 |
| SG-Former-S [43] | [ICCV’23] | 4.8G | 23M | 83.2 |
| GLNet-4G (ours) | | 4.5G | 27M | 83.7 |
| Swin-S [35] | [ICCV’21] | 8.7G | 50M | 83.0 |
| CSWin-S [14] | [CVPR’22] | 6.9G | 35M | 83.6 |
| RegionViT-M [3] | [ICLR’22] | 7.9G | 42M | 83.4 |
| MaxViT-S [47] | [ECCV’22] | 11.7G | 69M | 84.4 |
| MOAT-1 [59] | [ICLR’23] | 9.1G | 42M | 84.2 |
| NAT-S [20] | [CVPR’23] | 7.8G | 51M | 83.7 |
| InternImage-S [52] | [CVPR’23] | 8G | 50M | 84.2 |
| BiFormer-B [70] | [CVPR’23] | 9.8G | 57M | 84.3 |
| Flatten-S [19] | [ICCV’23] | 6.9G | 35M | 83.8 |
| SG-Former-M [43] | [ICCV’23] | 7.5G | 39M | 84.1 |
| SMT-B [34] | [ICCV’23] | 7.7G | 32M | 84.3 |
| GLNet-9G (ours) | | 9.7G | 61M | 84.5 |