TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models
cs.CV
cs.AI
Released Date: November 7, 2024
Authors: Jonathan Fhima¹, Elad Ben Avraham², Oren Nuriel², Yair Kittenplon², Roy Ganz², Aviad Aberdam², Ron Litman²
Affiliations: ¹Technion, Israel; ²AWS AI Labs

| Type | Method | LLM | TextVQA [50] (Scene-Text, VQAScore) | STVQA [9] (Scene-Text, ANLS) | TextCaps [49] (Scene-Text, CIDEr) | DocVQA [43] (Document, ANLS) | InfoVQA [44] (Document, ANLS) | DUDE [31] (0-shot, ANLS) | Average Scene-Text | Average Document |
|---|---|---|---|---|---|---|---|---|---|---|
| Specialist | UniTNT [24] | - | 55.4 | 66.0 | 109.0 | - | - | - | - | - |
| Specialist | DocFormer v2 [6] | T5 large [48] | 64.0 | 71.8 | - | 87.8 | 48.8 | - | - | - |
| Specialist | GIT2 [52] | - | - | 75.8 | 145.0 | - | - | - | - | - |
| Specialist | PALI 17B [18] | mT5-XXL [54] | 71.8 | 79.9 | 160.4 | - | - | - | - | - |
| Specialist | PALI-X 55B [16] | - | 80.8 | 84.5 | 163.7 | 86.8 | 54.8 | - | - | - |
| Specialist | PALI 3 [17] | UL2 [51] | 78.3 | 85.7 | 164.3 | 88.6 | 62.4 | - | - | - |
| Generalist | InstructBlip [20] | Flan-T5-XL [19] | 64.0 | 63.9 | 139.9 | 77.2 | 43.6 | 36.3 | 89.3 | 60.4 |
| Generalist | + TAP-VL | Flan-T5-XL [19] | 67.3 | 66.3 | 145.5 | 85.5 | 51.6 | 40.9 | 93.0 | 68.6 |
| | Δ | | +3.3 | +2.4 | +5.6 | +8.3 | +8.0 | +4.6 | +3.7 | +8.2 |
| Generalist | InstructBlip [20] | Flan-T5-XXL [19] | 66.8 | 65.0 | 143.1 | 81.1 | 49.4 | 40.0 | 91.6 | 65.3 |
| Generalist | + TAP-VL | Flan-T5-XXL [19] | 69.5 | 67.1 | 146.9 | 85.9 | 54.2 | 42.9 | 94.5 | 70.1 |
| | Δ | | +2.7 | +2.1 | +3.8 | +4.8 | +4.8 | +2.9 | +2.9 | +4.8 |
| Generalist | LLaVA-1.6 [39] | Mistral-7B [30] | 70.6 | 71.4 | 130.0 | 81.2 | 46.7 | 37.9 | 90.6 | 64.0 |
| Generalist | + TAP-VL | Mistral-7B [30] | 72.8 | 72.8 | 142.2 | 86.7 | 54.3 | 41.4 | 95.9 | 70.5 |
| | Δ | | +2.2 | +1.4 | +12.2 | +5.5 | +7.6 | +3.5 | +5.3 | +6.5 |
| Generalist | Qwen-VL [8] | Qwen-7B [7] | 75.6 | 77.6 | 141.1 | 80.5 | 40.4 | 34.0 | 98.1 | 60.5 |
| Generalist | + TAP-VL | Qwen-7B [7] | 76.4 | 78.8 | 141.3 | 85.1 | 47.8 | 36.7 | 98.8 | 66.5 |
| | Δ | | +0.8 | +1.2 | +0.2 | +4.6 | +7.4 | +2.7 | +0.7 | +6.0 |
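
The table does not say how its two Average columns are derived. The reported figures look consistent with a plain mean over the three scene-text benchmarks (TextVQA, STVQA, TextCaps) and a mean over DocVQA and InfoVQA, with the 0-shot DUDE score left out. The sketch below checks that reading against the InstructBlip (Flan-T5-XL) rows. The grouping, the half-up rounding, and the helper names `mean_score`, `summarize`, and `gains` are assumptions made for illustration, not something the source states.

```python
from decimal import Decimal, ROUND_HALF_UP

# Scores copied from the InstructBlip (Flan-T5-XL) rows of the table.
baseline = {"TextVQA": "64.0", "STVQA": "63.9", "TextCaps": "139.9",
            "DocVQA": "77.2", "InfoVQA": "43.6", "DUDE": "36.3"}
with_tap_vl = {"TextVQA": "67.3", "STVQA": "66.3", "TextCaps": "145.5",
               "DocVQA": "85.5", "InfoVQA": "51.6", "DUDE": "40.9"}

# Assumed grouping: DUDE (0-shot) is excluded from the Document average.
SCENE_TEXT = ("TextVQA", "STVQA", "TextCaps")
DOCUMENT = ("DocVQA", "InfoVQA")


def mean_score(scores, keys):
    """Mean of the selected benchmarks, rounded half-up to one decimal."""
    total = sum(Decimal(scores[k]) for k in keys)
    return (total / len(keys)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


def summarize(scores):
    """Recompute the two Average columns for one table row."""
    return {"Scene-Text": mean_score(scores, SCENE_TEXT),
            "Document": mean_score(scores, DOCUMENT)}


def gains(base, improved):
    """Per-benchmark difference between a '+ TAP-VL' row and its baseline."""
    return {k: Decimal(improved[k]) - Decimal(base[k]) for k in base}


if __name__ == "__main__":
    print(summarize(baseline))
    print(summarize(with_tap_vl))
    print(gains(baseline, with_tap_vl))
```

For these two rows the recomputed averages match the table (89.3 and 60.4 for the baseline, 93.0 and 68.6 with TAP-VL), and the per-benchmark gains match the reported differences. Recomputing from the rounded table entries will not always agree exactly: the LLaVA-1.6 baseline gives a scene-text mean of 90.7 under this scheme, while the table reports 90.6.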