MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
cs.CL
cs.CV
Released Date: December 6, 2024
Authors: Jarvis Guo¹, Tuney Zheng¹, Yuelin Bai¹, Bo Li², Yubo Wang³, King Zhu¹, Yizhi Li⁴, Graham Neubig⁵, Wenhu Chen³, Xiang Yue⁵
Affiliations: ¹M-A-P; ²Nanyang Technological University; ³University of Waterloo; ⁴The University of Manchester; ⁵Carnegie Mellon University
![[Uncaptioned image]](https://arxiv.org/html/2412.05237v1/x1.png)
| | Stage-1 | Stage-2 | Stage-3 |
| --- | --- | --- | --- |
| Resolution | 384 | 384 × {1×1, …} | 384 × {1×1, …} |
| #Tokens | 729 | Max 729 × 5 | Max 729 × 5 |
| Dataset | LCS | Single Image | Single, Multi-Image & Video |
| #Samples | 558K | 10M | 2M |
| Vision Tower | siglip-so400m-patch14-384 | siglip-so400m-patch14-384 | siglip-so400m-patch14-384 |
| LLM Backbone | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct | Qwen2.5-7B-Instruct |
| Trainable Model Parameters | Projector: 20.0M | Full Model: 8.0B | Full Model: 8.0B |
| Batch Size | 512 | 256 | 256 |
| Model Max Length | 8192 | 8192 | 16384 |
| Learning Rate (vision tower) | $1 \times 10^{-3}$ | $2 \times 10^{-6}$ | $2 \times 10^{-6}$ |
| Learning Rate (projector and LLM) | $1 \times 10^{-3}$ | $1 \times 10^{-5}$ | $1 \times 10^{-5}$ |
| Epoch | 1 | 1 | 1 |
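
The figures in the table above can be cross-checked with a little arithmetic. The sketch below is illustrative and not taken from the paper's code: the `StageConfig` class and the helper names are hypothetical. It reads the token cap as 729 × 5 and treats the sample counts as the rounded values shown. It derives the 729 tokens per 384-pixel view from the patch size of 14 in the vision tower name, and estimates the optimizer steps per stage, assuming the batch size is global and each stage runs for exactly one epoch.

```python
from dataclasses import dataclass
from math import ceil

PATCH_SIZE = 14   # read off the vision tower name: siglip-so400m-patch14-384
BASE_RES = 384    # base input resolution from the table


@dataclass(frozen=True)
class StageConfig:
    """Hypothetical container for one column of the training table."""
    name: str
    num_samples: int            # rounded figures as printed in the table
    batch_size: int
    max_length: int
    max_token_multiplier: int   # 1 for Stage-1, 5 for "Max 729 x 5"


def tokens_per_view(res: int = BASE_RES, patch: int = PATCH_SIZE) -> int:
    """Visual tokens for one square view: (res // patch) squared."""
    side = res // patch         # 384 // 14 = 27 patches per side
    return side * side


def max_visual_tokens(cfg: StageConfig) -> int:
    """Upper bound on visual tokens per sample for a stage."""
    return tokens_per_view() * cfg.max_token_multiplier


def optimizer_steps(cfg: StageConfig) -> int:
    """Approximate steps for one epoch, assuming a global batch size."""
    return ceil(cfg.num_samples / cfg.batch_size)


STAGES = [
    StageConfig("Stage-1", 558_000, 512, 8192, 1),
    StageConfig("Stage-2", 10_000_000, 256, 8192, 5),
    StageConfig("Stage-3", 2_000_000, 256, 16384, 5),
]

if __name__ == "__main__":
    for cfg in STAGES:
        print(cfg.name, max_visual_tokens(cfg), optimizer_steps(cfg))
```

Under these assumptions the visual token budget stays below the model max length in every stage, and Stage-2 accounts for most of the optimizer steps. The step counts are only estimates because the sample counts in the table are rounded.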