Published at December 5
Liquid: Language Models are Scalable Multi-modal Generators
cs.CV
Released Date: December 5, 2024
Authors: Junfeng Wu¹, Yi Jiang², Chuofan Ma³, Yuliang Liu¹, Hengshuang Zhao³, Zehuan Yuan², Song Bai², Xiang Bai¹
Aff.: ¹Huazhong University of Science and Technology; ²Bytedance Inc; ³University of Hong Kong

| Type | Method | LLM | Visual Token | Res. | VQAv2 | GQA | TextVQA | POPE | MME |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Und. Only | LLaVA-1.5 [38] | Vicuna-1.5-7B | Continuous | 336 | 78.5∗ | 62.0∗ | 58.2 | 85.9 | 1510.7 |
| Und. Only | VILA [33] | LLaMA-2-7B | Continuous | 336 | 79.9∗ | 62.3∗ | 64.4 | 85.5 | 1533.0 |
| Und. Only | InstructBLIP [11] | Vicuna-7B | Continuous | 224 | – | 49.2 | 50.1 | – | – |
| Und. Only | IDEFICS-9B [26] | LLaMA-7B | Continuous | 224 | 50.9 | 38.4 | 25.9 | – | – |
| Und. & Gen. | Unified-IO 2 [40] | 6.8B from scratch | Continuous | 384 | 79.4∗ | – | – | 87.7 | – |
| Und. & Gen. | Emu [58] | LLaMA-13B | Continuous | 224 | 52.0 | – | – | – | – |
| Und. & Gen. | LaVIT [25] | LLaMA-7B | Continuous | 224 | 66.0 | 46.8 | – | – | – |
| Und. & Gen. | DreamLLM [13] | Vicuna-7B | Continuous | 224 | 72.9∗ | – | 41.8 | – | – |
| Und. & Gen. | CM3Leon-7B [72] | 7B from scratch | Discrete | 256 | 47.6 | – | – | – | – |
| Und. & Gen. | LWM [39] | LLaMA-2-7B | Discrete | 256 | 55.8 | 44.8 | 18.8 | 75.2 | – |
| Und. & Gen. | Show-o [69] | Phi-1.5-1.3B | Discrete | 256 | 59.3∗ | 48.7∗ | – | 73.8 | 948.4 |
| Und. & Gen. | VILA-U [39] | LLaMA-2-7B | Discrete | 256 | 75.3∗ | 58.3∗ | 48.3 | 83.9 | 1336.2 |
| Und. & Gen. | Chameleon [59] | 34B from scratch | Discrete | 512 | 69.6 | – | – | – | – |
| Und. & Gen. | Ours | Gemma-7B | Discrete | 512 | 68.0∗ | 56.1∗ | 40.4 | 81.1 | 1107.2 |
| Und. & Gen. | Ours | Gemma-7B | Discrete | 512 | 71.3∗ | 58.4∗ | 42.4 | 81.1 | 1119.3 |