notesum.ai
CoA: Chain-of-Action for Generative Semantic Labels
Published on November 26
cs.CV
Release Date: November 26, 2024
Authors: Meng Wei¹, Zhongnian Li¹, Peng Ying¹, Xinzheng Xu¹
Affiliation: ¹School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, China

| Setting | Method | Split-0 Mclip | Split-0 Mram | Split-1 Mclip | Split-1 Mram | Split-2 Mclip | Split-2 Mram | Split-3 Mclip | Split-3 Mram | Avg Mclip | Avg Mram |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VQA | BLIP-2 [24] | 46.00 | 68.83 | 32.39 | 69.10 | 41.35 | 66.40 | 25.00 | 70.72 | 36.19 | 68.76 |
| VQA | InstructBLIP [9] | 36.45 | 72.14 | 33.97 | 72.93 | 42.57 | 69.12 | 21.14 | 68.34 | 33.53 | 70.63 |
| VQA | LLaVA [29] | 55.22 | 66.50 | 53.16 | 62.05 | 54.99 | 66.66 | 55.87 | 66.90 | 54.81 | 65.53 |
| VQA | MiniGPT-4 [58] | 45.87 | 45.64 | 52.17 | 45.32 | 54.73 | 44.76 | 49.16 | 48.48 | 50.48 | 46.05 |
| Caption | BLIP-2 [24] | 64.45 | 69.87 | 55.27 | 70.03 | 66.58 | 72.57 | 36.74 | 69.12 | 55.76 | 70.40 |
| Caption | InstructBLIP [9] | 69.37 | 69.12 | 69.69 | 70.54 | 73.44 | 72.08 | 71.04 | 74.05 | 70.89 | 71.48 |
| Caption | LLaVA [29] | 58.88 | 60.16 | 66.06 | 65.28 | 63.62 | 67.93 | 67.21 | 65.57 | 63.94 | 64.74 |
| Caption | MiniGPT-4 [58] | 46.76 | 40.34 | 52.98 | 37.32 | 52.09 | 39.62 | 45.30 | 43.63 | 49.28 | 40.23 |
|  | CoA (Ours) | 81.23 | 75.18 | 79.83 | 78.54 | 84.51 | 79.69 | 84.23 | 76.54 | 82.45 | 77.49 |

(a) Comparison of results on VOC.
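The Avg column appears to be the unweighted mean of the four splits. The Python sketch below, assuming exactly that, recomputes each average from the per-split values transcribed in table (a) and lists any row whose reported Avg differs from the recomputed mean by more than 0.01 (the combined rounding slack when both splits and averages are reported to two decimals). The names `VOC_ROWS`, `TOL`, and `check_averages` are illustrative and do not come from the paper.

```python
from statistics import mean

# Per-split scores from table (a), VOC:
# (setting, method) -> (Mclip per split, Mram per split, (reported Avg Mclip, reported Avg Mram))
VOC_ROWS = {
    ("VQA", "BLIP-2"): ([46.00, 32.39, 41.35, 25.00], [68.83, 69.10, 66.40, 70.72], (36.19, 68.76)),
    ("VQA", "InstructBLIP"): ([36.45, 33.97, 42.57, 21.14], [72.14, 72.93, 69.12, 68.34], (33.53, 70.63)),
    ("VQA", "LLaVA"): ([55.22, 53.16, 54.99, 55.87], [66.50, 62.05, 66.66, 66.90], (54.81, 65.53)),
    ("VQA", "MiniGPT-4"): ([45.87, 52.17, 54.73, 49.16], [45.64, 45.32, 44.76, 48.48], (50.48, 46.05)),
    ("Caption", "BLIP-2"): ([64.45, 55.27, 66.58, 36.74], [69.87, 70.03, 72.57, 69.12], (55.76, 70.40)),
    ("Caption", "InstructBLIP"): ([69.37, 69.69, 73.44, 71.04], [69.12, 70.54, 72.08, 74.05], (70.89, 71.48)),
    ("Caption", "LLaVA"): ([58.88, 66.06, 63.62, 67.21], [60.16, 65.28, 67.93, 65.57], (63.94, 64.74)),
    ("Caption", "MiniGPT-4"): ([46.76, 52.98, 52.09, 45.30], [40.34, 37.32, 39.62, 43.63], (49.28, 40.23)),
    ("", "CoA (Ours)"): ([81.23, 79.83, 84.51, 84.23], [75.18, 78.54, 79.69, 76.54], (82.45, 77.49)),
}

# Assumption: Avg is the unweighted mean of Split-0..Split-3. Splits and Avg are
# both reported to two decimals, so allow up to 0.01 of rounding slack.
TOL = 0.01


def check_averages(rows):
    """Return (row key, metric, recomputed mean, reported Avg) for each Avg off by more than TOL."""
    mismatches = []
    for key, (clip, ram, (avg_clip, avg_ram)) in rows.items():
        for metric, splits, reported in (("Mclip", clip, avg_clip), ("Mram", ram, avg_ram)):
            recomputed = mean(splits)
            if abs(recomputed - reported) > TOL:
                mismatches.append((key, metric, recomputed, reported))
    return mismatches


for (setting, method), metric, recomputed, reported in check_averages(VOC_ROWS):
    print(f"{setting}/{method} {metric}: recomputed {recomputed:.4f}, reported {reported:.2f}")
```

On the values as transcribed here, every row is consistent except InstructBLIP (Caption) on Mram, whose four splits average to 71.4475 against a reported 71.48; this may be a typo in the source or in the transcription.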

| Setting | Method | Split-0 Mclip | Split-0 Mram | Split-1 Mclip | Split-1 Mram | Split-2 Mclip | Split-2 Mram | Split-3 Mclip | Split-3 Mram | Avg Mclip | Avg Mram |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VQA | BLIP-2 [24] | 34.19 | 68.23 | 25.40 | 62.67 | 42.03 | 72.39 | 37.61 | 70.25 | 34.81 | 68.39 |
| VQA | InstructBLIP [9] | 34.14 | 70.14 | 26.35 | 59.98 | 43.87 | 72.47 | 31.36 | 70.89 | 33.93 | 68.37 |
| VQA | LLaVA [29] | 54.16 | 65.85 | 47.28 | 65.09 | 70.14 | 69.04 | 55.87 | 72.32 | 56.86 | 68.08 |
| VQA | MiniGPT-4 [58] | 38.52 | 50.48 | 40.88 | 55.98 | 44.44 | 55.25 | 57.16 | 60.04 | 45.25 | 55.44 |
| Caption | BLIP-2 [24] | 60.17 | 62.35 | 41.00 | 65.65 | 62.79 | 64.76 | 58.70 | 72.66 | 55.67 | 66.36 |
| Caption | InstructBLIP [9] | 64.73 | 75.20 | 61.08 | 72.78 | 61.91 | 77.44 | 63.00 | 71.71 | 62.68 | 74.28 |
| Caption | LLaVA [29] | 67.67 | 63.29 | 67.63 | 66.84 | 75.42 | 75.96 | 69.61 | 71.70 | 70.08 | 69.45 |
| Caption | MiniGPT-4 [58] | 44.28 | 45.26 | 39.60 | 49.69 | 47.66 | 51.60 | 58.99 | 51.52 | 47.63 | 49.52 |
|  | CoA (Ours) | 77.70 | 82.04 | 78.60 | 80.37 | 84.20 | 83.02 | 82.90 | 76.51 | 80.85 | 80.49 |

(b) Comparison of results on COCO.
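To quantify the gap on COCO, the following Python sketch takes the Avg columns of table (b) and reports CoA's margin over the strongest of the eight VQA and Caption baselines for each metric. It uses only the reported averages; `COCO_AVG`, `COA_AVG`, and `margin_over_best` are illustrative names, not from the paper.

```python
# Avg columns of table (b), COCO: (setting, method) -> (Avg Mclip, Avg Mram).
COCO_AVG = {
    ("VQA", "BLIP-2"): (34.81, 68.39),
    ("VQA", "InstructBLIP"): (33.93, 68.37),
    ("VQA", "LLaVA"): (56.86, 68.08),
    ("VQA", "MiniGPT-4"): (45.25, 55.44),
    ("Caption", "BLIP-2"): (55.67, 66.36),
    ("Caption", "InstructBLIP"): (62.68, 74.28),
    ("Caption", "LLaVA"): (70.08, 69.45),
    ("Caption", "MiniGPT-4"): (47.63, 49.52),
}
# CoA (Ours) on COCO: (Avg Mclip, Avg Mram).
COA_AVG = (80.85, 80.49)


def margin_over_best(baselines, ours, idx):
    """Return (best baseline key, its score, margin of `ours` over it) for metric column idx."""
    # max() keeps the first entry on ties.
    best_key = max(baselines, key=lambda k: baselines[k][idx])
    best = baselines[best_key][idx]
    return best_key, best, round(ours[idx] - best, 2)


for idx, metric in enumerate(("Mclip", "Mram")):
    key, best, gap = margin_over_best(COCO_AVG, COA_AVG, idx)
    print(f"{metric}: best baseline {key[1]} ({key[0]}) = {best:.2f}, CoA margin = +{gap:.2f}")
```

Run as-is, it reports a margin of +10.77 on Mclip over LLaVA (Caption, 70.08) and +6.21 on Mram over InstructBLIP (Caption, 74.28).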