notesum.ai
Published on November 25
Interpreting Object-level Foundation Models via Visual Precision Search
cs.CV
Released Date: November 25, 2024
Authors: Ruoyu Chen¹, Siyuan Liang², Jingzhi Li¹, Shiming Liu³, Maosen Li⁴, Zheng Huang⁵, Hua Zhang¹, Xiaochun Cao⁶
Aff.: ¹Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China; ²School of Computing, National University of Singapore, 119077, Singapore; ³RAMS Lab, Huawei Technologies Co., Ltd.; ⁴IAS BU, Huawei Technologies Co., Ltd.; ⁵College of Computer, National University of Defense Technology; ⁶School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

Faithfulness metrics: Ins., Del., Ins. (class), Del. (class), Ins. (IoU), Del. (IoU), Ave. high. score. Location metrics: Point Game, Energy PG. (↑) means higher is better; (↓) means lower is better.

| Dataset | Method | Ins. (↑) | Del. (↓) | Ins. (class) (↑) | Del. (class) (↓) | Ins. (IoU) (↑) | Del. (IoU) (↓) | Ave. high. score (↑) | Point Game (↑) | Energy PG (↑) |
|---|---|---|---|---|---|---|---|---|---|---|
| MS COCO [24] (Detection task) | Grad-CAM [38] | 0.2436 | 0.1526 | 0.3064 | 0.2006 | 0.6229 | 0.5324 | 0.5904 | 0.1746 | 0.1463 |
| | SSGrad-CAM++ [50] | 0.2107 | 0.1778 | 0.2639 | 0.2314 | 0.5981 | 0.5511 | 0.5886 | 0.1905 | 0.1293 |
| | D-RISE [33] | 0.4412 | 0.0402 | 0.5081 | 0.0886 | 0.8396 | 0.3642 | 0.6215 | 0.9497 | 0.1850 |
| | D-HSIC [31] | 0.3776 | 0.0439 | 0.4382 | 0.0903 | 0.8301 | 0.3301 | 0.5862 | 0.7328 | 0.1861 |
| | ODAM [54] | 0.3103 | 0.0519 | 0.3655 | 0.0894 | 0.7869 | 0.3984 | 0.5865 | 0.5431 | 0.2034 |
| | Ours | 0.5459 | 0.0375 | 0.6204 | 0.0882 | 0.8581 | 0.3300 | 0.6873 | 0.9894 | 0.2046 |
| RefCOCO [17] (REC task) | Grad-CAM [38] | 0.3749 | 0.4237 | 0.4658 | 0.5194 | 0.7516 | 0.7685 | 0.7481 | 0.2380 | 0.2171 |
| | SSGrad-CAM++ [50] | 0.4113 | 0.3925 | 0.5008 | 0.4851 | 0.7700 | 0.7588 | 0.7561 | 0.2820 | 0.2262 |
| | D-RISE [33] | 0.6178 | 0.1605 | 0.7033 | 0.3396 | 0.8606 | 0.5164 | 0.8471 | 0.9400 | 0.2870 |
| | D-HSIC [31] | 0.5491 | 0.1846 | 0.6295 | 0.3509 | 0.8504 | 0.5120 | 0.7739 | 0.7900 | 0.3190 |
| | ODAM [54] | 0.4778 | 0.2718 | 0.5620 | 0.3757 | 0.8217 | 0.6641 | 0.7425 | 0.6320 | 0.3529 |
| | Ours | 0.7419 | 0.1250 | 0.8080 | 0.2457 | 0.9050 | 0.5103 | 0.8842 | 0.9460 | 0.3566 |
| LVIS V1 (rare) [12] (Zero-shot det. task) | Grad-CAM [38] | 0.1253 | 0.1294 | 0.1801 | 0.1814 | 0.5657 | 0.5910 | 0.3549 | 0.1151 | 0.0941 |
| | SSGrad-CAM++ [50] | 0.1253 | 0.1254 | 0.1765 | 0.1775 | 0.5800 | 0.5691 | 0.3504 | 0.1091 | 0.0931 |
| | D-RISE [33] | 0.2808 | 0.0289 | 0.3348 | 0.0835 | 0.8303 | 0.3174 | 0.4289 | 0.9697 | 0.1462 |
| | D-HSIC [31] | 0.2417 | 0.0353 | 0.2912 | 0.0928 | 0.8187 | 0.3550 | 0.4044 | 0.8303 | 0.1730 |
| | ODAM [54] | 0.2009 | 0.0410 | 0.2478 | 0.0844 | 0.7760 | 0.4082 | 0.3694 | 0.6061 | 0.2050 |
| | Ours | 0.3695 | 0.0277 | 0.4275 | 0.0799 | 0.8479 | 0.3242 | 0.4969 | 0.9758 | 0.1785 |
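
To make the gaps in the table easier to read, the relative improvement of "Ours" over the strongest baseline on the first Ins. column (D-RISE on all three datasets) can be computed directly from the reported numbers. The snippet below is a minimal sketch of that arithmetic only: the values are copied from the table, while the dictionary layout and the helper name `relative_gain` are illustrative assumptions, not part of the paper's code.

```python
# Values copied from the first "Ins." column of the table above:
# (Ours, D-RISE). D-RISE is the highest-scoring baseline on this column
# for all three datasets.
insertion = {
    "MS COCO": (0.5459, 0.4412),
    "RefCOCO": (0.7419, 0.6178),
    "LVIS V1 (rare)": (0.3695, 0.2808),
}


def relative_gain(ours: float, baseline: float) -> float:
    """Relative improvement of `ours` over `baseline`, in percent.

    Hypothetical helper written for this illustration.
    """
    return (ours - baseline) / baseline * 100.0


# Relative gain per dataset, rounded to one decimal place.
gains = {
    name: round(relative_gain(ours, baseline), 1)
    for name, (ours, baseline) in insertion.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain}% Ins. relative to D-RISE")
```

Under these assumptions the relative gains come out to roughly 23.7% on MS COCO, 20.1% on RefCOCO, and 31.6% on LVIS V1 (rare). The same helper applies to any other higher-is-better column; for Del. columns, where lower is better, the sign of the result would need to be read the other way round.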