Published on November 26
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Categories: cs.CV, cs.AI, cs.CL, cs.HC
Released Date: November 26, 2024
Authors: Kevin Qinghong Lin1, Linjie Li2, Difei Gao1, Zhengyuan Yang2, Shiwei Wu1, Zechen Bai1, Weixian Lei, Lijuan Wang2, Mike Zheng Shou1
Affiliations: 1 Show Lab, National University of Singapore; 2 Microsoft

| Method | Size | #Train | Mobile Text | Mobile Icon | Desktop Text | Desktop Icon | Web Text | Web Icon | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2-VL-2B [41] | 2B | – | 24.2 | 10.0 | 1.4 | 9.3 | 8.7 | 2.4 | 9.3 |
| Fuyu [5] | 8B | – | 41.0 | 1.3 | 33.0 | 3.6 | 33.9 | 4.4 | 19.5 |
| CogAgent [17] | 18B | 400K | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | 47.4 |
| SeeClick [11] | 9.6B | 364K | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | 53.4 |
| OmniParser [31] | * | – | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | 73.0 |
| UGround [15] | 7B | 1.3M | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | 73.3 |
| ShowUI-G | 2B | 119K | 91.6 | 69.0 | 81.8 | 59.0 | 83.0 | 65.5 | 74.9 |
| ShowUI | 2B | 256K | 92.3 | 75.5 | 76.3 | 61.1 | 81.7 | 63.6 | 75.1 |
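
The Avg. column appears to be the unweighted mean of the six per-split scores (Mobile, Desktop, and Web, each with Text and Icon). The table does not state this; it is an assumption inferred from the numbers. The minimal sketch below recomputes the mean for a few rows copied from the table. The helper names `mean_score` and `max_gap` are illustrative only.

```python
# Per-split scores copied from the table above, in column order:
# Mobile Text, Mobile Icon, Desktop Text, Desktop Icon, Web Text, Web Icon.
# The second tuple element is the Avg. value reported in the table.
rows = {
    "SeeClick": ([78.0, 52.0, 72.2, 30.0, 55.7, 32.5], 53.4),
    "UGround": ([82.8, 60.3, 82.5, 63.6, 80.4, 70.4], 73.3),
    "ShowUI-G": ([91.6, 69.0, 81.8, 59.0, 83.0, 65.5], 74.9),
    "ShowUI": ([92.3, 75.5, 76.3, 61.1, 81.7, 63.6], 75.1),
}


def mean_score(scores):
    """Unweighted mean over the six splits (assumed definition of Avg.)."""
    return sum(scores) / len(scores)


def max_gap(table):
    """Largest absolute difference between recomputed and reported averages."""
    return max(abs(mean_score(s) - reported) for s, reported in table.values())


for name, (scores, reported) in rows.items():
    print(f"{name}: recomputed {mean_score(scores):.2f}, reported {reported}")
```

For these rows the recomputed means stay within 0.1 of the reported values. The small residual differences may come from the per-split scores themselves being rounded to one decimal.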