Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
Categories: cs.CV, cs.AI, cs.LG
Release Date: December 4, 2024
Authors: Mahtab Bigverdi¹, Zelun Luo², Cheng-Yu Hsieh¹, Ethan Shen¹, Dongping Chen¹, Linda G. Shapiro¹, Ranjay Krishna¹
Affiliations: ¹University of Washington; ²Google Research

| Model | Training: Direct Labeling Data | Training: Bounding Box Data | Training: CoT Data | CV-Bench Counting | SEED-Bench Counting | BLINK Counting |
| --- | --- | --- | --- | --- | --- | --- |
| LLaVA One Vision | ✗ | ✗ | ✗ | 34.4 | 31.7 | 35.8 |
| LLaVA 1.5 13B | ✗ | ✗ | ✗ | 40.9 | 52.2 | 35.0 |
| Fine-tuned LLaVA | ✓ | ✗ | ✗ | 44.7 | 46.3 | 0.2 |
| LLaVA-Aurora (Ours) | ✓ | ✓ | ✓ | 56.0 | 54.6 | 45.8 |
| GPT-4o | ✗ | ✗ | ✗ | 70.18 | 64.6 | 47.5 |
| GPT-4 Turbo | ✗ | ✗ | ✗ | 61.3 | 64.8 | 57.5 |
| GPT-4 Turbo + Tool | ✗ | ✗ | ✗ | 48.6 | 29.9 | 26.7 |