DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution
cs.RO
cs.AI
cs.LG
Released Date: November 4, 2024
Authors: Yang Yue¹, Yulin Wang¹, Bingyi Kang², Yizeng Han¹, Shenzhi Wang¹, Shiji Song¹, Jiashi Feng², Gao Huang¹
Affiliations: ¹Department of Automation, BNRist, Tsinghua University; ²ByteDance
![[Uncaptioned image]](https://arxiv.org/html/2411.02359v1/extracted/5969415/images/deer.png)
Avg. successful len (LLM GFLOPs) per setting:

| Method | Input | Data | Foundation model | D→D | ABCD→D | ABC→D |
|---|---|---|---|---|---|---|
| GR-1 [69] (ICLR’24) | | LANG | | - | 4.21 | 3.06 |
| HULC [13] (RA-L’22) | RGB | ALL | ✗ | 2.64 | 3.06 | 0.67 |
| RT-1 [15] (RSS’23) | RGB | LANG | ✗ | - | 2.45 | 0.9 |
| SPIL [70] (ICML’24) | RGB | ALL | ✗ | 2.67 | - | 1.71 |
| SuSIE [71] (ICLR’24) | RGB | ALL | InstructPix2Pix [72] | - | - | 2.69 |
| RoboFlamingo (ICLR’24) | RGB | LANG | OpenFlamingo 3B | 2.46 (31.2) | 4.08 (31.2) | 2.47 (31.2) |
| RoboFlamingo++ | RGB | LANG | OpenFlamingo 3B | 2.71 (31.2) | 4.07 (31.2) | 2.59 (31.2) |
| DeeR (ours) | RGB | LANG | OpenFlamingo 3B | 2.83 (8.6) | 4.13 (10.0) | 2.82 (12.5) |
| DeeR w. online (ours) | RGB | LANG | OpenFlamingo 3B | 2.92 (8.5) | 4.13 (9.7) | 2.90 (9.5) |
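
To make the compute savings in the table easier to read, the following minimal sketch works out how many times fewer LLM GFLOPs the DeeR (ours) row uses than the RoboFlamingo++ row, along with the change in average successful length. The numbers are copied from those two rows. The dictionary layout, the `compare` helper, and the ASCII setting keys are illustrative assumptions, not part of the paper.

```python
# Values copied from the table: (avg. successful len, LLM GFLOPs) per setting.
# The setting keys and this data layout are assumptions made for illustration.
roboflamingo_pp = {"D->D": (2.71, 31.2), "ABCD->D": (4.07, 31.2), "ABC->D": (2.59, 31.2)}
deer = {"D->D": (2.83, 8.6), "ABCD->D": (4.13, 10.0), "ABC->D": (2.82, 12.5)}


def compare(baseline, method):
    """Return {setting: (GFLOPs reduction factor, change in avg. successful len)}."""
    result = {}
    for setting, (base_len, base_gflops) in baseline.items():
        method_len, method_gflops = method[setting]
        factor = round(base_gflops / method_gflops, 2)  # how many times fewer GFLOPs
        delta = round(method_len - base_len, 2)         # positive means longer task chains
        result[setting] = (factor, delta)
    return result


summary = compare(roboflamingo_pp, deer)
for setting, (factor, delta) in summary.items():
    print(f"{setting}: {factor}x fewer LLM GFLOPs, {delta:+.2f} avg. successful len")
```

Swapping in the values from the "DeeR w. online (ours)" row gives the same comparison for that variant.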