Published on November 8
Aligned Vector Quantization for Edge-Cloud Collaborative Vision-Language Models
cs.CV
cs.AI
Released Date: November 8, 2024
Authors: Xiao Liu, Lijun Zhang, Deepak Ganesan, Hui Guan
Affiliation (all authors): Manning College of Information and Computer Sciences, University of Massachusetts, Amherst, Massachusetts, USA

| Method | VQAv2 | GQA | VizWiz | TextVQA | POPE (rand) | POPE (pop) | POPE (adv) | MMBench (en) | MMBench (cn) | LLaVA-Wild | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Ori | 79.13 | 62.98 | <u>47.78</u> | 58.21 | 87.71 | <u>86.72</u> | <u>84.72</u> | <u>66.67</u> | 58.93 | 61.1 | 30.9 |
| LLaVA<sup>1+</sup> | 79.31 | <u>63.53</u> | 47.38 | 56.24 | <u>87.84</u> | 86.39 | 84.65 | 65.64 | 57.56 | <u>61.6</u> | 31.9 |
| LLaVA-JPEG-10 | 72.59 | 59.54 | 47.75 | 50.58 | 81.02 | 79.78 | 77.71 | 64.17 | 54.46 | 54.3 | 26.9 |
| LLaVA-JPEG-10<sup>1+</sup> | 73.35 | 59.99 | 43.29 | 49.94 | 85.11 | 83.14 | 80.23 | 63.05 | 54.20 | 51.5 | 28.5 |
| LLaVA-JPEG-90 | 79.00 | 62.44 | 48.37 | 57.89 | 87.53 | 86.42 | 84.57 | 66.32 | <u>58.59</u> | 60.9 | <u>31.1</u> |
| LLaVA-JPEG-90<sup>1+</sup> | <u>79.46</u> | 63.06 | 43.61 | 56.56 | 87.81 | 86.54 | 84.50 | 67.09 | 57.30 | 61.0 | 30.7 |
| LLaVA-AlignedVQ | 79.98 | 63.70 | 47.25 | <u>58.06</u> | 88.49 | 86.79 | 85.16 | 65.37 | 56.70 | 62.7 | 30.7 |
| Δ vs. LLaVA-Ori | +0.85 | +0.72 | -0.53 | -0.15 | +0.78 | +0.07 | +0.44 | -1.30 | -2.23 | +1.6 | -0.2 |
| Δ vs. LLaVA<sup>1+</sup> | +0.67 | +0.17 | -0.13 | +1.82 | +0.65 | +0.40 | +0.51 | -0.27 | -0.86 | +1.1 | -1.2 |
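
The signed values in the last two rows match the LLaVA-AlignedVQ score minus the score of the named baseline in the same column. The sketch below reproduces that arithmetic for three columns, with scores copied from the table. The dictionary layout and the `delta_row` helper are illustrative assumptions, not code from the paper.

```python
# Scores copied from the table (three columns only, for brevity).
benchmarks = ["VQAv2", "GQA", "TextVQA"]
scores = {
    "LLaVA-Ori": [79.13, 62.98, 58.21],
    "LLaVA1+": [79.31, 63.53, 56.24],
    "LLaVA-AlignedVQ": [79.98, 63.70, 58.06],
}


def delta_row(method, baseline, table=scores):
    """Hypothetical helper: per-benchmark difference (method minus baseline)."""
    return [round(m - b, 2) for m, b in zip(table[method], table[baseline])]


for baseline in ("LLaVA-Ori", "LLaVA1+"):
    row = delta_row("LLaVA-AlignedVQ", baseline)
    # Print each difference with an explicit sign, as in the table.
    print(baseline, " ".join(f"{d:+.2f}" for d in row))
```

A positive value means LLaVA-AlignedVQ scores higher than the baseline on that benchmark, and a negative value means it scores lower.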