Interleaved-Modal Chain-of-Thought
Categories: cs.CV, cs.AI, cs.LG
Release Date: November 29, 2024
Authors: Jun Gao¹, Yongqi Li², Ziqiang Cao¹, Wenjie Li²
Affiliations: ¹School of Computer Science and Technology, Soochow University; ²Department of Computer Science, The Hong Kong Polytechnic University

| Backbone | Method | 0-shot M3CoT (ACC.) | 0-shot ScienceQA (ACC.) | 0-shot LLaVA-W (ROUGE-L) | 1-shot M3CoT (ACC.) | 1-shot ScienceQA (ACC.) | 1-shot LLaVA-W (ROUGE-L) |
|---|---|---|---|---|---|---|---|
| Chameleon-7B | No-CoT | 26.7 | 45.0 | 13.1 | 19.8 | 40.1 | 23.9 |
| Chameleon-7B | Multimodal CoT [33] | 28.9 | 40.8 | 20.4 | 29.5 | 47.9 | 20.6 |
| Chameleon-7B | CCoT [19] | 29.4 | 42.6 | 22.1 | 30.1 | 47.4 | 24.5 |
| Chameleon-7B | DDCoT [34] | 28.6 | 43.2 | 20.2 | 29.8 | 48.3 | 23.1 |
| Chameleon-7B | SCAFFOLD [12] | 29.6 | 39.5 | 21.7 | 31.1 | 46.8 | 24.7 |
| Chameleon-7B | ICoT (Ours) | 29.8 | 45.8 | 25.2 | 33.6 | 48.7 | 27.6 |
| Chameleon-7B | % Improve | 0.6% | 1.8% | 14.0% | 8.0% | 0.8% | 11.7% |
| Qwen2-VL-7B | No-CoT | 39.7 | 43.0 | 32.7 | 38.7 | 44.9 | 33.5 |
| Qwen2-VL-7B | Multimodal CoT [33] | 37.5 | 50.9 | 30.7 | 40.3 | 49.1 | 31.4 |
| Qwen2-VL-7B | CCoT [19] | 38.5 | 51.0 | 29.4 | 39.1 | 52.5 | 33.9 |
| Qwen2-VL-7B | DDCoT [34] | 39.1 | 52.6 | 31.2 | 38.9 | 53.7 | 32.8 |
| Qwen2-VL-7B | SCAFFOLD [12] | 39.5 | 51.2 | 31.8 | 40.2 | 52.8 | 33.1 |
| Qwen2-VL-7B | ICoT (Ours) | 40.2 | 53.2 | 34.2 | 41.2 | 54.4 | 35.7 |
| Qwen2-VL-7B | % Improve | 1.3% | 1.1% | 4.6% | 2.2% | 1.3% | 5.3% |
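
The table does not say how the % Improve rows are computed. The numbers appear consistent with the relative gain of ICoT over the strongest non-ICoT method in each column, but that is an inference from the values and not something the source confirms. The sketch below checks this reading against the Chameleon-7B rows; the function name `relative_gain` and the data layout are illustrative choices, not part of the paper.

```python
# Scores copied from the Chameleon-7B rows of the table, in column order:
# 0-shot M3CoT, ScienceQA, LLaVA-W, then 1-shot M3CoT, ScienceQA, LLaVA-W.
baselines = {
    "No-CoT":         [26.7, 45.0, 13.1, 19.8, 40.1, 23.9],
    "Multimodal CoT": [28.9, 40.8, 20.4, 29.5, 47.9, 20.6],
    "CCoT":           [29.4, 42.6, 22.1, 30.1, 47.4, 24.5],
    "DDCoT":          [28.6, 43.2, 20.2, 29.8, 48.3, 23.1],
    "SCAFFOLD":       [29.6, 39.5, 21.7, 31.1, 46.8, 24.7],
}
icot = [29.8, 45.8, 25.2, 33.6, 48.7, 27.6]


def relative_gain(ours, others):
    """Percent gain of `ours` over the best competing score in each column.

    Assumption: "% Improve" means (ours - best_baseline) / best_baseline.
    """
    best = [max(column) for column in zip(*others.values())]
    return [round(100 * (o - b) / b, 1) for o, b in zip(ours, best)]


gains = relative_gain(icot, baselines)
print(gains)
```

Under this assumption, five of the six computed values match the Chameleon-7B % Improve row exactly. The first column computes to about 0.68%, which rounds to 0.7% where the table reports 0.6%, so the source may have truncated that entry.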