notesum.ai

Published on November 27

cs.CV

Released Date: November 27, 2024

Authors: Xiao An¹, Jiaxing Sun¹, Zihan Gui¹, Wei He¹

Aff.: ¹The State Key Lab. LIESMARS, Wuhan University

Model	Backbone	ILC	SII	CID	AttR	AssR	CSR
General-domain Large Vision-Language Models
Qwen2-VL-7B[41]		0.830	0.682	0.703	0.243	0.233	0.910
InternVL2-8B[4]		0.806	0.625	0.666	0.480	0.278	0.822
LLaVA-1.6-7B[23]		0.746	0.543	0.630	0.230	0.155	0.718
Llama3.2-11B[9]		0.749	0.524	0.614	0.290	0.158	0.808
GLM-4V-9B[10]		0.788	0.572	0.655	0.187	0.010	0.846
DeepSeek-VL-7B[26]		0.796	0.638	0.628	0.270	0.060	0.860
MiniCPM-V-2.5[46]		0.785	0.603	0.653	0.247	0.208	0.882
Phi3-Vision[1]		0.750	0.547	0.584	0.280	0.075	0.714
Remote Sensing Large Vision-Language Models
GeoChat[15]		0.726	0.508	0.251	0.327	0.138	0.696
LHRS-Bot[29]		0.708	0.317	0.181	0.267	0.230	0.574
LHRS-Bot-nova[29]		0.768	0.526	0.262	0.327	0.143	0.578
RemoteCLIP[22]	RN50	0.717	0.328	0.504	0.307	0.225	0.416
RemoteCLIP[22]	ViT-B	0.753	0.348	0.338	0.287	0.185	0.500
RemoteCLIP[22]	ViT-L	0.709	0.355	0.514	0.293	0.140	0.736
GeoRSCLIP[49]	ViT-B	0.757	0.311	0.541	0.327	0.243	0.828
GeoRSCLIP[49]	ViT-B_RET-2	0.772	0.255	0.447	0.337	0.188	0.766
GeoRSCLIP[49]	ViT-L	0.749	0.299	0.571	0.327	0.230	0.904
GeoRSCLIP[49]	ViT-L-336	0.762	0.333	0.598	0.347	0.170	0.906
GeoRSCLIP[49]	ViT-H	0.763	0.331	0.404	0.370	0.285	0.928