Published on December 5
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
cs.CV
cs.AI
Released Date: December 5, 2024
Authors: Jiuhai Chen¹, Jianwei Yang², Haiping Wu², Dianqi Li², Jianfeng Gao², Tianyi Zhou¹, Bin Xiao²
Aff.: ¹University of Maryland; ²Microsoft Research

General Benchmarks

| Model | # Vis tok. | VQAv2 | GQA | MMBench (EN) | MMBench (CN) | VizWiz | POPE | MM-Vet | MME-P | MME-C | Seed-image | HallusionBench | LLaVA-bench | MMStar |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vila 3B | - | 80.4 | 61.5 | 63.4 | 52.7 | 53.5 | 86.9 | 35.4 | 1442.4 | - | 67.9 | - | - | 40.3 |
| Phi 3.5-Vision | - | - | 63.5 | 75.5 | 64.2 | 58.2 | 82.2 | 46.5 | 1473.4 | 412.1 | 69.9 | 53.3 | 68.8 | 49.0 |
| Florence-VL 3B (ours) | 576 | 82.1 | 61.8 | 71.6 | 60.8 | 59.1 | 88.3 | 51.0 | 1498.7 | 403.9 | 70.6 | 58.1 | 71.1 | 44.9 |
| LLaVA-NeXT 8B | 2880 | - | 65.4 | 72.2 | - | 57.7 | 86.6 | 41.7 | 1595.1 | 379.3 | 72.7 | 47.7 | 76.8 | - |
| Vila 8B | - | 80.9 | 61.7 | 72.3 | 66.2 | 58.7 | 84.4 | 38.3 | 1577.0 | - | 71.4 | - | - | - |
| Mini-Gemini-HD 8B | 2880 | - | 64.5 | 72.7 | - | - | - | - | 1606.0 | - | 73.2 | - | - | - |
| Cambrian 8B | 576 | - | 64.6 | 75.9 | 67.9 | - | 87.4 | 48.0 | 1547.1 | - | 74.7 | 48.7 | 71.0 | 50.0 |
| Florence-VL 8B (ours) | 576 | 84.7 | 64.4 | 76.2 | 69.5 | 59.1 | 89.9 | 56.3 | 1560.0 | 381.1 | 74.9 | 57.3 | 74.2 | 50.0 |