CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Subjects: cs.RO, cs.AI, cs.CL, cs.CV, cs.LG
Release Date: November 29, 2024
Authors: Qixiu Li¹, Yaobo Liang², Zeyu Wang¹, Lin Luo², Xi Chen², Mozheng Liao³, Fangyun Wei², Yu Deng², Sicheng Xu², Yizhong Zhang², Xiaofan Wang⁴, Bei Liu², Jianlong Fu², Jianmin Bao², Dong Chen², Yuanchun Shi¹, Jiaolong Yang², Baining Guo²
Affiliations: ¹Tsinghua University; ²Microsoft Research Asia; ³USTC; ⁴Institute of Microelectronics, CAS
![Uncaptioned teaser image](https://arxiv.org/html/2411.19650v1/x1.png)
Results on Google Robot tasks in the SIMPLER simulator (success rates, %):

| Evaluation | Method | Pick Coke Can | Move Near | Open/Close Drawer | Open Top Drawer and Place Apple | Average |
|---|---|---|---|---|---|---|
| SIMPLER (Visual Matching) | RT-1 [7] | 85.7 | 44.2 | 73.0 | 6.5 | 52.4 |
| | RT-1-X [48] | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 |
| | RT-2-X [48] | 78.7 | 77.9 | 25.0 | 3.7 | 46.3 |
| | Octo-Base [62] | 17.0 | 4.2 | 22.7 | 0.0 | 11.0 |
| | OpenVLA [30] | 18.0 | 56.3 | 63.0 | 0.0 | 34.3 |
| | Ours | 91.3 | 85.0 | 71.8 | 50.9 | 74.8 |
| SIMPLER (Variant Aggregation) | RT-1 [7] | 89.8 | 50.0 | 32.3 | 2.6 | 43.7 |
| | RT-1-X [48] | 49.0 | 32.3 | 29.4 | 10.1 | 30.2 |
| | RT-2-X [48] | 82.3 | 79.2 | 35.3 | 20.6 | 54.4 |
| | Octo-Base [62] | 0.6 | 3.1 | 1.1 | 0.0 | 1.2 |
| | OpenVLA [30] | 60.8 | 67.7 | 28.8 | 0.0 | 39.3 |
| | Ours | 89.6 | 80.8 | 28.3 | 46.6 | 61.3 |
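
As a quick sanity check on the table, the Average column is consistent with the unweighted mean of the four per-task success rates. The short Python snippet below reproduces the "Ours" rows; it is an illustration of that arithmetic, not code from the paper.

```python
def average_success(rates):
    """Unweighted mean of per-task success rates, in percent."""
    return sum(rates) / len(rates)

# Task order: Pick Coke Can, Move Near, Open/Close Drawer,
# Open Top Drawer and Place Apple.
ours_visual_matching = [91.3, 85.0, 71.8, 50.9]
ours_variant_aggregation = [89.6, 80.8, 28.3, 46.6]

print(f"{average_success(ours_visual_matching):.1f}")      # 74.8, matches the table
print(f"{average_success(ours_variant_aggregation):.1f}")  # 61.3, matches the table
```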