Published at December 5
Discriminative Fine-tuning of LVLMs
cs.CV
cs.AI
Released Date: December 5, 2024
Authors: Yassine Ouali¹, Adrian Bulat², Alexandros Xenos³, Anestis Zaganidis¹, Ioannis Maniadis Metaxas¹, Georgios Tzimiropoulos³, Brais Martinez¹
Affiliations: ¹Samsung AI Cambridge; ²Technical University of Iasi; ³Queen Mary University of London

| Method | Params (B) | Replace: Object | Replace: Attribute | Replace: Relation | Swap: Object | Swap: Attribute | Add: Object | Add: Attribute |
|---|---|---|---|---|---|---|---|---|
| Human | – | 100 | 99 | 97 | 99 | 100 | 99 | 99 |
| CLIP (ViT-B) [36] | 0.15 | 90.9 | 80.1 | 69.2 | 61.4 | 64.0 | 77.2 | 68.8 |
| CLIP (ViT-L) [36] | 0.43 | 94.1 | 79.2 | 65.2 | 60.2 | 62.3 | 78.3 | 71.5 |
| BLIP (ViT-L) [28] | 0.23 | 96.5 | 81.7 | 69.1 | 66.6 | 76.8 | 92.0 | 85.1 |
| BLIP2 (ViT-L) [29] | 1.17 | 97.6 | 81.7 | 77.8 | 62.1 | 65.5 | 92.4 | 87.4 |
| OpenCLIP (ViT-G/14) [37] | 1.37 | 95.8 | 85.0 | 72.4 | 63.0 | 71.2 | 91.5 | 82.1 |
| OpenCLIP (ViT-BigG/14) [37] | 2.54 | 96.6 | 87.9 | 74.9 | 62.5 | 75.2 | 92.2 | 84.5 |
| EVA-02-CLIP (ViT-E/14+) [38] | 5.04 | 97.1 | 88.5 | 74.2 | 67.3 | 74.1 | 91.8 | 83.9 |
| EVA-CLIP (8B) [39] | 8.22 | 96.4 | 86.6 | 74.8 | 66.1 | 74.6 | 91.3 | 82.0 |
| EVA-CLIP (18B) [39] | 18.3 | 97.5 | 88.8 | 76.1 | 65.3 | 76.0 | 92.4 | 85.0 |
| NegCLIP [47] | 0.15 | 92.7 | 85.9 | 76.5 | 75.2 | 75.4 | 88.8 | 82.8 |
| LLaVA-1.5-7B [34] | 7.06 | 88.0 | 81.6 | 76.1 | 60.9 | 58.8 | 67.0 | 62.4 |
| E5-V (LLaVA-Next-8B) [23] | 8.36 | 96.7 | 89.5 | 85.3 | 75.0 | 70.1 | 89.2 | 83.5 |
| E5-V (LLaVA-1.5-7B) [23] | 7.06 | 95.8 | 86.6 | 81.6 | 62.9 | 64.0 | 93.5 | 88.0 |
| VladVA (Ours) (LLaVA-1.5-7B) | 7.06 | 98.1 | 92.1 | 86.8 | 79.0 | 82.9 | 95.2 | 95.8 |