MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs
Categories: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Released Date: November 4, 2024
Authors: Sheng-Chieh Lin¹, Chankyu Lee¹, Mohammad Shoeybi¹, Jimmy Lin², Bryan Catanzaro¹, Wei Ping¹
Affiliations: ¹NVIDIA; ²University of Waterloo

| Task | Dataset | CLIPSF | LLaVa-E | LLaVa-P | NV-Embed-v1 | CLIPSF | LLaVa-P | NV-Embed-v1 | MM-Embed |
|---|---|---|---|---|---|---|---|---|---|
| 1. | VisualNews | 43.8 | 33.2 | 34.2 | 32.1 | 42.7 | 39.7 | 41.1 | 41.0 |
| | MSCOCO | 72.0 | 69.3 | 70.8 | 64.6 | 69.2 | 73.8 | 72.7 | 71.3 |
| | Fashion200K | 16.4 | 13.5 | 13.3 | 10.4 | 19.7 | 17.4 | 18.6 | 17.1 |
| 2. | WebQA | 83.2 | 88.6 | 88.8 | 92.1 | 88.2 | 93.6 | 95.6 | 95.9 |
| 3. | EDIS | 46.5 | 55.9 | 56.6 | 55.1 | 54.2 | 68.8 | 69.8 | 68.8 |
| | WebQA | 76.0 | 80.3 | 81.6 | 81.3 | 80.1 | 84.9 | 84.8 | 85.0 |
| 4. | VisualNews | 39.5 | 32.4 | 33.3 | 30.4 | 40.6 | 39.4 | 41.4 | 41.3 |
| | MSCOCO | 91.0 | 91.8 | 92.2 | 90.3 | 88.5 | 89.5 | 88.9 | 90.1 |
| | Fashion200K | 17.2 | 13.9 | 14.7 | 13.2 | 20.0 | 17.5 | 19.9 | 18.4 |
| 5. | NIGHTS | 31.6 | 31.8 | 30.7 | 30.4 | 31.9 | 31.8 | 31.1 | 32.4 |
| 6. | OVEN | 40.4 | 37.9 | 39.1 | 36.3 | 40.9 | 42.9 | 42.6 | 42.1 |
| | InfoSeek | 26.1 | 31.0 | 32.9 | 33.3 | 27.6 | 37.2 | 35.8 | 42.3 |
| 7. | FashionIQ | 24.2 | 27.4 | 27.0 | 26.0 | 21.7 | 25.8 | 26.6 | 25.7 |
| | CIRR | 43.2 | 48.1 | 45.4 | 45.3 | 38.3 | 49.5 | 50.8 | 50.0 |
| 8. | OVEN | 60.9 | 61.6 | 62.6 | 61.7 | 61.6 | 63.9 | 63.5 | 64.1 |
| | InfoSeek | 45.9 | 50.3 | 50.0 | 53.4 | 47.1 | 54.4 | 53.5 | 57.7 |
| M-BEIR Avg. | All | 47.4 | 47.9 | 48.3 | 47.2 | 48.3 | 51.9 | 52.3 | 52.7 |
| | Single-modal Qry | 51.7 | 51.0 | 51.6 | 50.0 | 53.5 | 55.6 | 56.4 | 56.1 |
| | Multi-modal Qry | 40.1 | 42.7 | 42.8 | 42.7 | 39.5 | 45.6 | 45.5 | 47.0 |
| MTEB Text Retrieval Avg. | | - | - | - | - | - | 46.4 | 49.7 | 60.3 |
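
The M-BEIR average rows appear to be unweighted means over the 16 per-dataset scores. The sketch below checks this for the MM-Embed scores (the final score in each dataset row). It assumes that tasks 1 to 5 are the single-modal-query tasks and tasks 6 to 8 the multimodal-query tasks; the table does not state that split, but the arithmetic is consistent with it. The dictionary layout and the `mean_score` helper are illustrative names, not part of the paper.

```python
# MM-Embed scores copied from the table, keyed by (task number, dataset).
mm_embed = {
    (1, "VisualNews"): 41.0,
    (1, "MSCOCO"): 71.3,
    (1, "Fashion200K"): 17.1,
    (2, "WebQA"): 95.9,
    (3, "EDIS"): 68.8,
    (3, "WebQA"): 85.0,
    (4, "VisualNews"): 41.3,
    (4, "MSCOCO"): 90.1,
    (4, "Fashion200K"): 18.4,
    (5, "NIGHTS"): 32.4,
    (6, "OVEN"): 42.1,
    (6, "InfoSeek"): 42.3,
    (7, "FashionIQ"): 25.7,
    (7, "CIRR"): 50.0,
    (8, "OVEN"): 64.1,
    (8, "InfoSeek"): 57.7,
}


def mean_score(scores, tasks):
    """Unweighted mean over the datasets whose task number is in `tasks`."""
    picked = [s for (task, _), s in scores.items() if task in tasks]
    return round(sum(picked) / len(picked), 1)


# Assumption: tasks 1-5 use single-modal queries, tasks 6-8 multimodal queries.
avg_all = mean_score(mm_embed, range(1, 9))
avg_single = mean_score(mm_embed, range(1, 6))
avg_multi = mean_score(mm_embed, range(6, 9))

print(avg_all, avg_single, avg_multi)  # 52.7 56.1 47.0
```

The three printed values match the MM-Embed entries in the "All", "Single-modal Qry", and "Multi-modal Qry" rows. Each dataset counts equally regardless of its size, so the ten single-modal-query datasets weigh more heavily in the overall average than the six multimodal-query ones.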