Published on October 31

AndroidLab: Training and Systematic Benchmarking of Android Autonomous Agents

cs.AI

Release Date: October 31, 2024

Authors: Yifan Xu¹, Xiao Liu¹, Xueqiao Sun¹, Siyi Cheng², Hao Yu¹, Hanyu Lai¹, Shudan Zhang¹, Dan Zhang¹, Jie Tang¹, Yuxiao Dong¹

Affiliations: ¹Tsinghua University; ²Peking University

arXiv: http://arxiv.org/abs/2410.24024v1

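SR: success rate (%); Sub-SR: sub-goal success rate (%); RRR: reversed redundancy ratio; ROR: reasonable operation ratio. A dash means no value is reported. A blank Mode cell continues the mode of the row above.

Mode	Model	SR	Sub-SR	RRR	ROR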
XML	GPT-4o	25.36	30.56	107.45	86.56
	GPT-4-1106-Preview	31.16	38.21	66.34	86.24
	Gemini-1.5-Pro	18.84	22.40	57.72	83.99
	Gemini-1.0	8.70	10.75	51.80	71.08
	GLM4-PLUS	27.54	32.08	92.35	83.41
	LLaMA3.1-8B-Instruct	2.17	3.62	-	52.77
	Qwen2-7B-Instruct	4.35	4.95	-	67.26
	GLM4-9B-Chat	7.25	9.06	54.43	58.34
XML+SFT	LLaMA3.1-8B-ft	23.91	30.31	75.58	92.46
	Qwen2-7B-ft	19.57	24.40	77.31	92.48
	GLM4-9B-ft	21.01	26.45	74.81	93.25
SoM	GPT-4o	31.16	35.02	87.32	85.36
	GPT-4-Vision-Preview	26.09	29.53	99.22	78.79
	Gemini-1.5-Pro	16.67	18.48	105.95	91.52
	Gemini-1.0	10.87	12.56	72.52	76.70
	Claude-3.5-Sonnet	28.99	32.66	113.41	81.16
	Claude-3-Opus	13.04	15.10	81.41	83.89
	CogVLM2	0.72	0.72	-	17.97
	LLaMA3.2-11B-Vision-Instruct	1.45	1.45	-	50.76
	Qwen2-VL-7B-Instruct	3.62	4.59	-	84.81
SoM+SFT	CogVLM2-ft	11.59	16.06	57.37	85.58
	LLaMA3.2-11B-Vision-ft	10.14	12.98	61.67	87.85
	Qwen2-VL-7B-Instruct-ft	18.12	22.64	65.23	88.29
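
The effect of fine-tuning can be read off the table by comparing each "-ft" row with its base model. The sketch below does that arithmetic for the SR column only. It assumes that each fine-tuned row corresponds to the like-named open-source model in the same input mode (for example, LLaMA3.1-8B-ft to LLaMA3.1-8B-Instruct); the table itself does not state this pairing. The names `SR_PAIRS`, `sr_gains`, and `mean_sr` are hypothetical and used only for illustration.

```python
# SR values (percent) copied from the table: (base model, fine-tuned model).
# Assumption: each "-ft" row is paired with the like-named base row.
SR_PAIRS = {
    "XML": {
        "LLaMA3.1-8B": (2.17, 23.91),
        "Qwen2-7B": (4.35, 19.57),
        "GLM4-9B": (7.25, 21.01),
    },
    "SoM": {
        "CogVLM2": (0.72, 11.59),
        "LLaMA3.2-11B-Vision": (1.45, 10.14),
        "Qwen2-VL-7B": (3.62, 18.12),
    },
}


def sr_gains(pairs):
    """Return {mode: {model: absolute SR gain in percentage points}}."""
    return {
        mode: {
            name: round(after - before, 2)
            for name, (before, after) in models.items()
        }
        for mode, models in pairs.items()
    }


def mean_sr(pairs, index):
    """Mean SR per mode; index 0 selects base rows, index 1 fine-tuned rows."""
    return {
        mode: round(sum(v[index] for v in models.values()) / len(models), 2)
        for mode, models in pairs.items()
    }


if __name__ == "__main__":
    for mode, gains in sr_gains(SR_PAIRS).items():
        for name, gain in gains.items():
            print(f"{mode:4} {name:22} +{gain:.2f} points")
    print("mean base SR:", mean_sr(SR_PAIRS, 0))
    print("mean fine-tuned SR:", mean_sr(SR_PAIRS, 1))
```

Under that pairing, the mean SR of the three open-source models rises from 4.59 to 21.50 in XML mode and from 1.93 to 13.28 in SoM mode. The comparison covers SR only; Sub-SR, RRR, and ROR would need the same treatment, with the unreported RRR entries left out.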