Captions Speak Louder than Images (CASLIE): Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data
cs.LG
cs.AI
Release Date: October 22, 2024
Authors: Xinyi Ling¹, Bo Peng¹, Hanwen Du¹, Zhihui Zhu¹, Xia Ning²
Affiliations: ¹Department of Computer Science and Engineering, The Ohio State University; ²Department of Biomedical Informatics, The Ohio State University

| Model | IND F1 | IND R@1 | IND M-F1 | IND F1 | IND Acc | IND M-F1 | IND R@1 | OOD F1 | OOD R@1 | OOD M-F1 | OOD M-F1 | OOD R@1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ft-FashionCLIP | 0.759 | 0.863 | 0.497 | 0.201 | 0.605 | 0.323 | 0.145 | 0.600 | 0.903 | 0.453 | 0.376 | 0.087 |
| ft-Llama-2-13B | 0.866 | 0.969 | 0.468 | 0.235 | 0.700 | 0.628 | 0.184 | 0.831 | 0.959 | 0.523 | 0.595 | 0.285 |
| ft-Mistral-7B-v0.3 | 0.876 | 0.971 | 0.533 | 0.312 | 0.725 | 0.617 | 0.218 | 0.847 | 0.965 | 0.530 | 0.659 | 0.312 |
| ft-Llama-3.2-3B | 0.866 | 0.951 | 0.493 | 0.270 | 0.699 | 0.565 | 0.191 | 0.838 | 0.962 | 0.511 | 0.614 | 0.305 |
| eCeLLM-L | 0.872 | 0.870 | 0.519 | 0.178 | 0.706 | 0.613 | 0.188 | 0.860 | 0.916 | 0.531 | 0.584 | 0.304 |
| eCeLLM-M | 0.864 | 0.890 | 0.492 | 0.131 | 0.719 | 0.632 | 0.182 | 0.841 | 0.942 | 0.564 | 0.624 | 0.302 |
| ft-LLaVA-NExT-Interleave | 0.791 | 0.964 | 0.568 | 0.340 | 0.721 | 0.561 | 0.053 | 0.579 | 0.043 | 0.334 | 0.206 | 0.000 |
| SoTA Task-specific Model | 0.868 | 0.671 | 0.531 | 0.316 | 0.702 | 0.495 | 0.163 | 0.849 | 0.658 | 0.447 | 0.510 | 0.210 |
| | 0.868 | 0.969 | 0.473 | 0.268 | 0.706 | 0.651 | 0.190 | 0.840 | 0.968 | 0.531 | 0.607 | 0.297 |
| | 0.891 | 0.979 | 0.566 | 0.398 | 0.731 | 0.656 | 0.223 | 0.855 | 0.977 | 0.585 | 0.625 | 0.330 |
| | 0.871 | 0.963 | 0.504 | 0.336 | 0.707 | 0.601 | 0.196 | 0.857 | 0.959 | 0.580 | 0.647 | 0.297 |
| Improvement over best (%; avg: 2.9) | 1.7 | 0.8 | -0.4 | 17.1 | 0.8 | 3.8 | 2.3 | -0.3 | 1.2 | 3.7 | -1.8 | 5.8 |
| Average improvement (%; avg: 50.3) | 5.7 | 11.0 | 10.8 | 76.6 | 5.2 | 23.8 | 61.6 | 12.2 | 280.5 | 23.2 | 37.6 | 54.9 |
| Caption used (%; avg: 45.0) | 62.1 | 62.3 | 50.5 | 74.5 | 72.2 | 56.8 | 30.3 | 68.2 | 62.6 | 43.2 | 56.4 | 30.4 |
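
The two improvement rows of the table (avg: 2.9 and avg: 50.3) are not defined on this page. The sketch below shows one reading that appears to reproduce the first metric column. It rests on assumptions: the three rows without a model name are treated as the proposed models, the best of them is compared against the best baseline (first row) or against every baseline in turn (second row), and baselines scoring exactly 0 are skipped. The function names are hypothetical and only serve the illustration.

```python
def improvement_over_best(proposed, baselines):
    """Percent gain of the best proposed score over the best baseline score."""
    best_proposed = max(proposed)
    best_baseline = max(baselines)
    return (best_proposed - best_baseline) / best_baseline * 100.0


def average_improvement(proposed, baselines):
    """Mean percent gain of the best proposed score over each baseline.

    Assumption: baselines scoring exactly 0 are skipped to avoid
    dividing by zero.
    """
    best_proposed = max(proposed)
    gains = [(best_proposed - b) / b * 100.0 for b in baselines if b > 0]
    return sum(gains) / len(gains)


# First metric column (F1), copied from the table.
baselines_f1 = [0.759, 0.866, 0.876, 0.866, 0.872, 0.864, 0.791, 0.868]
unnamed_f1 = [0.868, 0.891, 0.871]  # the three rows without a model name

print(round(improvement_over_best(unnamed_f1, baselines_f1), 1))  # 1.7
print(round(average_improvement(unnamed_f1, baselines_f1), 1))    # 5.7
```

Both printed values match the first column of the corresponding rows. That agreement is consistent with the reading above but does not confirm it as the definition used to build the table.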