
Published on October 22

Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data

cs.LG

cs.AI

Release Date: October 22, 2024

Authors: Xinyi Ling¹, Bo Peng¹, Hanwen Du¹, Zhihui Zhu¹, Xia Ning²

Aff.: ¹Department of Computer Science and Engineering, The Ohio State University; ²Department of Biomedical Informatics, The Ohio State University

arXiv: https://arxiv.org/abs/2410.17337v1

Model	IND							OOD
	AP	CC	PRP	PSI	MPC	SA	SR	AP	CC	PRP	SA	SR
	F1	R@1	M-F1	F1	Acc	M-F1	R@1	F1	R@1	M-F1	M-F1	R@1
ft-FashionCLIP	0.759	0.863	0.497	0.201	0.605	0.323	0.145	0.600	0.903	0.453	0.376	0.087
ft-Llama-2-13B	0.866	0.969	0.468	0.235	0.700	0.628	0.184	0.831	0.959	0.523	0.595	0.285
ft-Mistral-7B-v0.3	0.876	0.971	0.533	0.312	0.725	0.617	0.218	0.847	0.965	0.530	0.659	0.312
ft-Llama-3.2-3B	0.866	0.951	0.493	0.270	0.699	0.565	0.191	0.838	0.962	0.511	0.614	0.305
eCeLLM-L	0.872	0.870	0.519	0.178	0.706	0.613	0.188	0.860	0.916	0.531	0.584	0.304
eCeLLM-M	0.864	0.890	0.492	0.131	0.719	0.632	0.182	0.841	0.942	0.564	0.624	0.302
ft-LLaVA-NExT-Interleave	0.791	0.964	0.568	0.340	0.721	0.561	0.053	0.579	0.043	0.334	0.206	0.000
SoTA Task-specific Model	0.868	0.671	0.531	0.316	0.702	0.495	0.163	0.849	0.658	0.447	0.510	0.210
CASLIE-L	0.868	0.969	0.473	0.268	0.706	0.651	0.190	0.840	0.968	0.531	0.607	0.297
CASLIE-M	0.891	0.979	0.566	0.398	0.731	0.656	0.223	0.855	0.977	0.585	0.625	0.330
CASLIE-S	0.871	0.963	0.504	0.336	0.707	0.601	0.196	0.857	0.959	0.580	0.647	0.297
Improvement over best baseline (%; avg: 2.9)	1.7	0.8	-0.4	17.1	0.8	3.8	2.3	-0.3	1.2	3.7	-1.8	5.8
Average improvement over baselines (%; avg: 50.3)	5.7	11.0	10.8	76.6	5.2	23.8	61.6	12.2	280.5	23.2	37.6	54.9
Captions used (%; avg: 45.0)	62.1	62.3	50.5	74.5	72.2	56.8	30.3	68.2	62.6	43.2	56.4	30.4
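
The table does not say how its two improvement rows are computed. The sketch below is one reading that reproduces the IND AP entries (1.7 and 5.7). It assumes each entry is a relative improvement in percent, taken from the best CASLIE variant (the last three model rows) against either the best baseline or every baseline in turn. The function name `relative_improvement` and the dictionary layout are illustrative, not taken from the paper.

```python
# IND AP (F1) scores copied from the table above.
baselines = {
    "ft-FashionCLIP": 0.759,
    "ft-Llama-2-13B": 0.866,
    "ft-Mistral-7B-v0.3": 0.876,
    "ft-Llama-3.2-3B": 0.866,
    "eCeLLM-L": 0.872,
    "eCeLLM-M": 0.864,
    "ft-LLaVA-NExT-Interleave": 0.791,
    "SoTA Task-specific Model": 0.868,
}
caslie = {"CASLIE-L": 0.868, "CASLIE-M": 0.891, "CASLIE-S": 0.871}


def relative_improvement(ours: float, theirs: float) -> float:
    """Relative improvement of `ours` over `theirs`, in percent (assumed definition)."""
    return 100.0 * (ours - theirs) / theirs


best_ours = max(caslie.values())
best_baseline = max(baselines.values())

# Assumed reading of the "improvement over best" row.
imprv_over_best = relative_improvement(best_ours, best_baseline)

# Assumed reading of the "average improvement" row: mean over all baselines.
average_imprv = sum(
    relative_improvement(best_ours, score) for score in baselines.values()
) / len(baselines)

print(round(imprv_over_best, 1), round(average_imprv, 1))  # 1.7 5.7
```

Under this reading, a single very weak baseline can dominate the average, which would explain why the OOD CC average (280.5) is far larger than the improvement over the best baseline in the same column (1.2).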