notesum.ai
Published at November 13
Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval
cs.CV
cs.AI
cs.IR
cs.MM
Released Date: November 13, 2024
Authors: Yeong-Joon Ju1, Ho-Joong Kim1, Seong-Whan Lee1
Aff.: 1Department of Artificial Intelligence, Korea University

| Model | Dataset | KB | MRR@5 | P@ | R@5 | R@10 | R@20 | R@50 | R@100 |
|---|---|---|---|---|---|---|---|---|---|
| ColBERTv2 | OK-VQA | Wiki-11M | 36.00 | 24.07 | 52.20 | 63.54 | 73.80 | 83.31 | 88.27 |
| FLMR*+WiT | OK-VQA | Wiki-11M | 32.56 | 21.20 | 50.61 | 62.58 | 73.40 | 84.94 | 90.17 |
| ReViz+VL-ICT | OK-VQA | Wiki-11M | 42.97 | 31.95 | 61.24 | 70.00 | 79.65 | 87.32 | 90.95 |
| PreFLMR+ViD2R | OK-VQA | Wiki-11M | 50.08 | 35.96 | 69.12 | 78.36 | 86.01 | 92.15 | 95.12 |
| Ret-XKnow+ViD2R | OK-VQA | Wiki-11M | 51.10 | 36.97 | 70.83 | 80.94 | 88.59 | 94.09 | 96.35 |
| CLIP† | OK-VQA | GS-112K | 19.08 | 11.13 | 34.54 | 50.48 | 65.08 | 80.62 | 88.11 |
| ColBERTv2 | OK-VQA | GS-112K | 52.46 | 37.53 | 69.60 | 79.57 | 86.58 | 93.10 | 96.51 |
| FLMR*+WiT | OK-VQA | GS-112K | 38.15 | 24.62 | 57.25 | 69.42 | 79.43 | 88.62 | 93.14 |
| ReViz+VL-ICT† | OK-VQA | GS-112K | 45.77 | 33.18 | 64.05 | 75.39 | 84.21 | 91.64 | 94.59 |
| Ret-XKnow+ViD2R | OK-VQA | GS-112K | 59.88 | 44.93 | 78.10 | 86.50 | 92.27 | 96.43 | 98.08 |
| CLIP† | ReMuQ | 199K | 0.34 | - | 0.78 | 1.36 | 2.41 | 7.34 | 47.88 |
| ReViz+VL-ICT† | ReMuQ | 199K | 23.61 | - | 39.43 | 46.77 | 53.56 | 63.70 | 71.13 |
| PreFLMR+ViD2R | ReMuQ | 199K | 54.44 | 52.37 | 57.66 | 58.94 | 59.63 | 60.54 | 60.85 |
| Ret-XKnow+ViD2R | ReMuQ | 199K | 80.88 | 78.11 | 85.20 | 87.48 | 89.14 | 90.77 | 91.63 |
| ColBERTv2 | A-OKVQA | Rationale | 58.32 | 49.52 | 72.58 | 79.83 | 85.07 | 90.92 | 94.93 |
| FLMR+WiT | A-OKVQA | Rationale | 48.43 | 38.95 | 63.93 | 73.45 | 81.75 | 91.79 | 96.77 |
| Ret-XKnow+ViD2R | A-OKVQA | Rationale | 68.13 | 58.95 | 82.53 | 88.82 | 93.19 | 97.38 | 98.52 |
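
For readers unfamiliar with the metric columns in the table above (MRR@5, P@, R@K), the following is a minimal sketch of how such ranking metrics are commonly computed from per-query relevance judgments. It is not the authors' evaluation code. The function names, the toy `runs` data, and the hit-style definition of recall (a query counts as a success if any of its top-K passages is relevant) are assumptions for illustration; this page does not state the relevance criterion behind the reported numbers.

```python
from typing import Callable, Sequence

# Each query is represented by a ranked list of booleans:
# relevance[i] is True if the passage at rank i+1 is judged relevant.
Ranking = Sequence[bool]


def mrr_at_k(relevance: Ranking, k: int) -> float:
    """Reciprocal rank of the first relevant passage in the top k, else 0."""
    for rank, is_relevant in enumerate(relevance[:k], start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def precision_at_k(relevance: Ranking, k: int) -> float:
    """Fraction of the top k passages that are relevant."""
    return sum(relevance[:k]) / k


def recall_at_k(relevance: Ranking, k: int) -> float:
    """Hit-style recall (an assumption): 1.0 if any top-k passage is relevant."""
    return 1.0 if any(relevance[:k]) else 0.0


def mean_percent(metric: Callable[[Ranking, int], float],
                 runs: Sequence[Ranking], k: int) -> float:
    """Average a per-query metric over all queries, scaled to 0-100."""
    return 100.0 * sum(metric(r, k) for r in runs) / len(runs)


# Hypothetical toy data: three queries, five ranked passages each.
runs = [
    [False, True, False, False, True],    # first hit at rank 2
    [True, False, False, False, False],   # first hit at rank 1
    [False, False, False, False, False],  # no relevant passage retrieved
]

if __name__ == "__main__":
    print("MRR@5:", round(mean_percent(mrr_at_k, runs, 5), 2))
    print("P@5:  ", round(mean_percent(precision_at_k, runs, 5), 2))
    print("R@5:  ", round(mean_percent(recall_at_k, runs, 5), 2))
```

Under this hit-style definition, R@K can only stay flat or rise as K grows, which matches the left-to-right pattern in every row of the table.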