Published on December 9
From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding
cs.CV
cs.AI
cs.LG
Released Date: December 9, 2024
Authors: Yixiong Fang¹, Ziran Yang¹, Zhaorun Chen², Zhuokai Zhao², Jiawei Zhou¹
Affiliations: ¹Stony Brook University; ²University of Chicago

| Model | Method | CHAIR_S | CHAIR_I | THRONE | | | |
|---|---|---|---|---|---|---|---|
| LLaVA-1.5 | Greedy | 42.20±2.86 | 12.83±0.36 | 0.795±0.006 | 0.784±0.009 | 0.772±0.015 | 0.847±0.010 |
| LLaVA-1.5 | Beam Search | 46.33±1.10 | 13.9±0.60 | 0.790±0.007 | 0.772±0.004 | 0.759±0.003 | 0.862±0.009 |
| LLaVA-1.5 | OPERA | 41.47±0.92 | 12.37±0.72 | 0.802±0.003 | 0.791±0.004 | 0.782±0.009 | 0.854±0.011 |
| LLaVA-1.5 | VCD | 49.20±0.88 | 14.87±0.47 | 0.786±0.012 | 0.771±0.017 | 0.759±0.020 | 0.854±0.015 |
| LLaVA-1.5 | Dropout Decoding | 39.80±2.3 | 11.73±0.25 | 0.804±0.002 | 0.796±0.006 | 0.790±0.009 | 0.851±0.005 |
| LLaVA-1.5 | Dropout Decoding (w/o prelim) | 39.73±2.15 | 12.20±0.70 | 0.799±0.002 | 0.794±0.004 | 0.791±0.007 | 0.843±0.005 |
| InstructBLIP | Greedy | 27.87±1.32 | 7.90±0.63 | 0.809±0.001 | 0.826±0.003 | 0.832±0.006 | 0.803±0.007 |
| InstructBLIP | Beam Search | 25.87±2.77 | 6.93±0.569 | 0.809±0.002 | 0.827±0.006 | 0.836±0.005 | 0.807±0.015 |
| InstructBLIP | OPERA | 28.07±1.75 | 8.23±0.53 | 0.805±0.004 | 0.824±0.003 | 0.830±0.004 | 0.798±0.008 |
| InstructBLIP | VCD | 39.33±2.70 | 19.10±0.30 | 0.737±0.008 | 0.746±0.012 | 0.751±0.020 | 0.757±0.007 |
| InstructBLIP | Dropout Decoding | 24.53±1.26 | 6.63±0.65 | 0.814±0.008 | 0.833±0.004 | 0.838±0.002 | 0.808±0.016 |
| InstructBLIP | Dropout Decoding (w/o prelim) | 26.2±2.40 | 7.10±0.854 | 0.807±0.008 | 0.823±0.006 | 0.827±0.010 | 0.804±0.010 |
| LLaVA-NEXT | Greedy | 28.80±2.12 | 8.10±0.92 | 0.815±0.012 | 0.832±0.009 | 0.830±0.007 | 0.799±0.008 |
| LLaVA-NEXT | Beam Search | 28.06±1.30 | 7.10±0.20 | 0.816±0.007 | 0.834±0.006 | 0.834±0.004 | 0.801±0.002 |
| LLaVA-NEXT | OPERA | 29.06±1.89 | 8.06±1.07 | 0.814±0.011 | 0.832±0.011 | 0.831±0.006 | 0.799±0.007 |
| LLaVA-NEXT | VCD | 33.19±0.52 | 8.10±0.91 | 0.818±0.004 | 0.822±0.003 | 0.808±0.005 | 0.822±0.003 |
| LLaVA-NEXT | Dropout Decoding | 26.26±2.4 | 7.39±0.69 | 0.821±0.010 | 0.840±0.009 | 0.842±0.002 | 0.805±0.010 |
| LLaVA-NEXT | Dropout Decoding (w/o prelim) | 27.0±1.80 | 7.53±0.643 | 0.814±0.009 | 0.835±0.007 | 0.837±0.003 | 0.793±0.008 |