AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents
cs.AI
Released Date: October 31, 2024
Authors: Yifan Xu¹, Xiao Liu¹, Xueqiao Sun¹, Siyi Cheng², Hao Yu¹, Hanyu Lai¹, Shudan Zhang¹, Dan Zhang¹, Jie Tang¹, Yuxiao Dong¹
Affiliations: ¹Tsinghua University; ²Peking University

| Mode | Model | SR | Sub-SR | RRR | ROR |
| --- | --- | --- | --- | --- | --- |
| XML | GPT-4o | 25.36 | 30.56 | 107.45 | 86.56 |
| XML | GPT-4-1106-Preview | 31.16 | 38.21 | 66.34 | 86.24 |
| XML | Gemini-1.5-Pro | 18.84 | 22.40 | 57.72 | 83.99 |
| XML | Gemini-1.0 | 8.70 | 10.75 | 51.80 | 71.08 |
| XML | GLM4-PLUS | 27.54 | 32.08 | 92.35 | 83.41 |
| XML | LLaMA3.1-8B-Instruct | 2.17 | 3.62 | - | 52.77 |
| XML | Qwen2-7B-Instruct | 4.35 | 4.95 | - | 67.26 |
| XML | GLM4-9B-Chat | 7.25 | 9.06 | 54.43 | 58.34 |
| XML+SFT | LLaMA3.1-8B-ft | 23.91 | 30.31 | 75.58 | 92.46 |
| XML+SFT | Qwen2-7B-ft | 19.57 | 24.40 | 77.31 | 92.48 |
| XML+SFT | GLM4-9B-ft | 21.01 | 26.45 | 74.81 | 93.25 |
| SoM | GPT-4o | 31.16 | 35.02 | 87.32 | 85.36 |
| SoM | GPT-4-Vision-Preview | 26.09 | 29.53 | 99.22 | 78.79 |
| SoM | Gemini-1.5-Pro | 16.67 | 18.48 | 105.95 | 91.52 |
| SoM | Gemini-1.0 | 10.87 | 12.56 | 72.52 | 76.70 |
| SoM | Claude-3.5-Sonnet | 28.99 | 32.66 | 113.41 | 81.16 |
| SoM | Claude-3-Opus | 13.04 | 15.10 | 81.41 | 83.89 |
| SoM | CogVLM2 | 0.72 | 0.72 | - | 17.97 |
| SoM | LLaMA3.2-11B-Vision-Instruct | 1.45 | 1.45 | - | 50.76 |
| SoM | Qwen2-VL-7B-Instruct | 3.62 | 4.59 | - | 84.81 |
| SoM+SFT | CogVLM2-ft | 11.59 | 16.06 | 57.37 | 85.58 |
| SoM+SFT | LLaMA3.2-11B-Vision-ft | 10.14 | 12.98 | 61.67 | 87.85 |
| SoM+SFT | Qwen2-VL-7B-Instruct-ft | 18.12 | 22.64 | 65.23 | 88.29 |
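
The table lists each open-source model both before and after fine-tuning (the `+SFT` rows), so the effect of fine-tuning on SR can be read off by subtracting one row from the other. The following is a minimal sketch of that arithmetic, not code from the paper. The SR values are copied from the table. The pairing of each base model with its `-ft` counterpart is an assumption inferred from the model names, and the shortened dictionary keys and the helper names `sr_gains` and `mean_sr` are hypothetical.

```python
# SR values copied from the table: (base SR, fine-tuned SR).
# ASSUMPTION: each "-ft" row is the fine-tuned version of the
# similarly named base model in the same input mode.
SR = {
    "XML": {
        "LLaMA3.1-8B": (2.17, 23.91),
        "Qwen2-7B": (4.35, 19.57),
        "GLM4-9B": (7.25, 21.01),
    },
    "SoM": {
        "CogVLM2": (0.72, 11.59),
        "LLaMA3.2-11B-Vision": (1.45, 10.14),
        "Qwen2-VL-7B": (3.62, 18.12),
    },
}


def sr_gains(pairs):
    """Return {model: fine-tuned SR minus base SR}, rounded to 2 decimals."""
    return {model: round(ft - base, 2) for model, (base, ft) in pairs.items()}


def mean_sr(pairs):
    """Return (mean base SR, mean fine-tuned SR), rounded to 2 decimals."""
    bases = [base for base, _ in pairs.values()]
    tuned = [ft for _, ft in pairs.values()]
    return (round(sum(bases) / len(bases), 2),
            round(sum(tuned) / len(tuned), 2))


if __name__ == "__main__":
    for mode, pairs in SR.items():
        base_avg, ft_avg = mean_sr(pairs)
        print(f"{mode}: mean SR {base_avg:.2f} -> {ft_avg:.2f}")
        for model, gain in sr_gains(pairs).items():
            print(f"  {model}: +{gain:.2f}")
```

Under that pairing, the mean SR rises from 4.59 to 21.50 in XML mode and from 1.93 to 13.28 in SoM mode. Even so, none of the fine-tuned models reaches the best closed model's SR of 31.16 in its mode.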