CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation
Subjects: cs.RO, cs.AI, cs.CL, cs.CV, cs.LG
Release Date: November 29, 2024
Authors: Qixiu Li¹, Yaobo Liang², Zeyu Wang¹, Lin Luo², Xi Chen², Mozheng Liao³, Fangyun Wei², Yu Deng², Sicheng Xu², Yizhong Zhang², Xiaofan Wang⁴, Bei Liu², Jianlong Fu², Jianmin Bao², Dong Chen², Yuanchun Shi¹, Jiaolong Yang², Baining Guo²
Affiliations: ¹Tsinghua University; ²Microsoft Research Asia; ³USTC; ⁴Institute of Microelectronics, CAS
![Uncaptioned teaser image](https://arxiv.org/html/2411.19650v1/x1.png)
Results on Google Robot tasks in the SIMPLER simulator (success rates, %):

| Evaluation | Method | Pick Coke Can | Move Near | Open/Close Drawer | Open Top Drawer and Place Apple | Average |
|---|---|---|---|---|---|---|
| SIMPLER (Visual Matching) | RT-1 [7] | 85.7 | 44.2 | 73.0 | 6.5 | 52.4 |
| | RT-1-X [48] | 56.7 | 31.7 | 59.7 | 21.3 | 42.4 |
| | RT-2-X [48] | 78.7 | 77.9 | 25.0 | 3.7 | 46.3 |
| | Octo-Base [62] | 17.0 | 4.2 | 22.7 | 0.0 | 11.0 |
| | OpenVLA [30] | 18.0 | 56.3 | 63.0 | 0.0 | 34.3 |
| | Ours | 91.3 | 85.0 | 71.8 | 50.9 | 74.8 |
| SIMPLER (Variant Aggregation) | RT-1 [7] | 89.8 | 50.0 | 32.3 | 2.6 | 43.7 |
| | RT-1-X [48] | 49.0 | 32.3 | 29.4 | 10.1 | 30.2 |
| | RT-2-X [48] | 82.3 | 79.2 | 35.3 | 20.6 | 54.4 |
| | Octo-Base [62] | 0.6 | 3.1 | 1.1 | 0.0 | 1.2 |
| | OpenVLA [30] | 60.8 | 67.7 | 28.8 | 0.0 | 39.3 |
| | Ours | 89.6 | 80.8 | 28.3 | 46.6 | 61.3 |
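
As a quick sanity check on the table, the Average column is consistent with the unweighted mean of the four per-task success rates. The short Python snippet below reproduces the "Ours" rows; it is an illustration of that arithmetic, not code from the paper.

```python
def average_success(rates):
    """Unweighted mean of per-task success rates, in percent."""
    return sum(rates) / len(rates)

# Task order: Pick Coke Can, Move Near, Open/Close Drawer,
# Open Top Drawer and Place Apple.
ours_visual_matching = [91.3, 85.0, 71.8, 50.9]
ours_variant_aggregation = [89.6, 80.8, 28.3, 46.6]

print(f"{average_success(ours_visual_matching):.1f}")      # 74.8, matches the table
print(f"{average_success(ours_variant_aggregation):.1f}")  # 61.3, matches the table
```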