cs.CL

cs.CV

Released Date: December 6, 2024

Authors: Jarvis Guo¹, Tuney Zheng¹, Yuelin Bai¹, Bo Li², Yubo Wang³, King Zhu¹, Yizhi Li⁴, Graham Neubig⁵, Wenhu Chen³, Xiang Yue⁵

Affiliations: ¹M-A-P; ²Nanyang Technological University; ³University of Waterloo; ⁴The University of Manchester; ⁵Carnegie Mellon University

	Stage-1	Stage-2	Stage-3
Resolution	384	$384 \times \{1 \times 1, \ldots\}$	$384 \times \{1 \times 1, \ldots\}$
#Tokens	729	Max $729 \times 5$	Max $729 \times 5$
Dataset	LCS	Single Image	Single, Multi-Image & Video
#Samples	558K	10M	2M
Vision Tower	siglip-so400m-patch14-384
LLM Backbone	Qwen2.5-7B-Instruct
Trainable Model Parameters	Projector: 20.0M	Full Model: 8.0B	Full Model: 8.0B
Batch Size	512	256	256
Model Max Length	8192	8192	16384
Learning Rate: $\psi_{vision}$	$1 \times 10^{-3}$	$2 \times 10^{-6}$	$2 \times 10^{-6}$
Learning Rate: $\{\theta_{proj},\Phi_{LLM}\}$	$1 \times 10^{-3}$	$1 \times 10^{-5}$	$1 \times 10^{-5}$
Epoch	1	1	1
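
The table above can be restated as a small configuration sketch, which also makes two derived quantities explicit: the visual token budget per stage and the approximate number of optimizer steps. This is an illustrative sketch, not the authors' training code. The names `StageConfig`, `token_multiplier`, `max_visual_tokens`, and `optimizer_steps` are hypothetical. It assumes the sample counts are the rounded figures from the table (558K, 10M, 2M), that the batch size is the global batch size, and that the 729 base tokens come from a 27 $\times$ 27 patch grid (384 // 14 = 27), as implied by the vision tower name.

```python
from dataclasses import dataclass
import math

# Both values are read off the vision tower name: siglip-so400m-patch14-384.
PATCH_SIZE = 14
BASE_RESOLUTION = 384


@dataclass(frozen=True)
class StageConfig:
    """One column of the training table (hypothetical container)."""
    name: str
    dataset: str
    num_samples: int        # rounded figure from the table
    batch_size: int         # assumed to be the global batch size
    max_length: int
    token_multiplier: int   # 1 for Stage-1, 5 for "Max 729 x 5"
    lr_vision: float
    lr_proj_llm: float
    epochs: int = 1

    def max_visual_tokens(self) -> int:
        # 384 // 14 = 27 patches per side, so 27 * 27 = 729 base tokens.
        per_side = BASE_RESOLUTION // PATCH_SIZE
        return per_side * per_side * self.token_multiplier

    def optimizer_steps(self) -> int:
        # Approximate: ignores dropped or filtered samples.
        return math.ceil(self.num_samples / self.batch_size) * self.epochs


STAGES = [
    # Stage-1 trains only the projector; the vision learning rate is listed
    # in the table but applies to no trainable vision parameters there.
    StageConfig("Stage-1", "LCS", 558_000, 512, 8192, 1, 1e-3, 1e-3),
    StageConfig("Stage-2", "Single Image", 10_000_000, 256, 8192, 5, 2e-6, 1e-5),
    StageConfig("Stage-3", "Single, Multi-Image & Video", 2_000_000, 256, 16384, 5, 2e-6, 1e-5),
]

for stage in STAGES:
    print(stage.name, stage.max_visual_tokens(), stage.optimizer_steps())
```

Under these assumptions the visual token budget is 729 for Stage-1 and at most 3645 for Stages 2 and 3, and one epoch corresponds to roughly 1,090, 39,063, and 7,813 optimizer steps respectively. The step counts are only as precise as the rounded sample counts in the table.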