notesum.ai

Published on December 9

From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding

cs.CV

cs.AI

cs.LG

Released Date: December 9, 2024

Authors: Yixiong Fang¹, Ziran Yang¹, Zhaorun Chen², Zhuokai Zhao², Jiawei Zhou¹

Aff.: ¹Stony Brook University; ²University of Chicago

arXiv: http://arxiv.org/pdf/2412.06474v1


Model	Method	CHAIR: CHAIR_S $\downarrow$	CHAIR: CHAIR_I $\downarrow$	THRONE: $F^{1}_{\text{all}}$ $\uparrow$	THRONE: $F^{0.5}_{\text{all}}$ $\uparrow$	THRONE: $P_{\text{all}}$ $\uparrow$	THRONE: $R_{\text{all}}$ $\uparrow$
LLaVA-1.5	Greedy	42.20 ± 2.86	12.83 ± 0.36	0.795 ± 0.006	0.784 ± 0.009	0.772 ± 0.015	0.847 ± 0.010
	Beam Search	46.33 ± 1.10	13.9 ± 0.60	0.790 ± 0.007	0.772 ± 0.004	0.759 ± 0.003	0.862 ± 0.009
	OPERA	41.47 ± 0.92	12.37 ± 0.72	0.802 ± 0.003	0.791 ± 0.004	0.782 ± 0.009	0.854 ± 0.011
	VCD	49.20 ± 0.88	14.87 ± 0.47	0.786 ± 0.012	0.771 ± 0.017	0.759 ± 0.020	0.854 ± 0.015
	Dropout Decoding	39.80 ± 2.3	11.73 ± 0.25	0.804 ± 0.002	0.796 ± 0.006	0.790 ± 0.009	0.851 ± 0.005
	Dropout Decoding (w/o prelim)	39.73 ± 2.15	12.20 ± 0.70	0.799 ± 0.002	0.794 ± 0.004	0.791 ± 0.007	0.843 ± 0.005
InstructBLIP	Greedy	27.87 ± 1.32	7.90 ± 0.63	0.809 ± 0.001	0.826 ± 0.003	0.832 ± 0.006	0.803 ± 0.007
	Beam Search	25.87 ± 2.77	6.93 ± 0.569	0.809 ± 0.002	0.827 ± 0.006	0.836 ± 0.005	0.807 ± 0.015
	OPERA	28.07 ± 1.75	8.23 ± 0.53	0.805 ± 0.004	0.824 ± 0.003	0.830 ± 0.004	0.798 ± 0.008
	VCD	39.33 ± 2.70	19.10 ± 0.30	0.737 ± 0.008	0.746 ± 0.012	0.751 ± 0.020	0.757 ± 0.007
	Dropout Decoding	24.53 ± 1.26	6.63 ± 0.65	0.814 ± 0.008	0.833 ± 0.004	0.838 ± 0.002	0.808 ± 0.016
	Dropout Decoding (w/o prelim)	26.2 ± 2.40	7.10 ± 0.854	0.807 ± 0.008	0.823 ± 0.006	0.827 ± 0.010	0.804 ± 0.010
LLaVA-NEXT	Greedy	28.80 ± 2.12	8.10 ± 0.92	0.815 ± 0.012	0.832 ± 0.009	0.830 ± 0.007	0.799 ± 0.008
	Beam Search	28.06 ± 1.30	7.10 ± 0.20	0.816 ± 0.007	0.834 ± 0.006	0.834 ± 0.004	0.801 ± 0.002
	OPERA	29.06 ± 1.89	8.06 ± 1.07	0.814 ± 0.011	0.832 ± 0.011	0.831 ± 0.006	0.799 ± 0.007
	VCD	33.19 ± 0.52	8.10 ± 0.91	0.818 ± 0.004	0.822 ± 0.003	0.808 ± 0.005	0.822 ± 0.003
	Dropout Decoding	26.26 ± 2.4	7.39 ± 0.69	0.821 ± 0.010	0.840 ± 0.009	0.842 ± 0.002	0.805 ± 0.010
	Dropout Decoding (w/o prelim)	27.0 ± 1.80	7.53 ± 0.643	0.814 ± 0.009	0.835 ± 0.007	0.837 ± 0.003	0.793 ± 0.008
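
For readers unfamiliar with the CHAIR columns, the sketch below illustrates how the two scores are commonly defined: CHAIR_S is the share of captions that mention at least one object absent from the image, and CHAIR_I is the share of all mentioned objects that are absent. This is a minimal illustration, not the evaluation code behind the table. The function name `chair_scores`, the input layout, and the toy data are assumptions made for this example, and the synonym mapping onto a fixed object vocabulary that a real evaluation would need is omitted. Scores are scaled to percentages, which appears to match the table.

```python
def chair_scores(captions):
    """Return (CHAIR_S, CHAIR_I) as percentages.

    `captions` is a list of (mentioned_objects, ground_truth_objects)
    pairs, one pair per generated caption. This input layout is an
    assumption for illustration only.
    """
    hallucinated_captions = 0   # captions with >= 1 hallucinated object
    hallucinated_mentions = 0   # object mentions not present in the image
    total_mentions = 0          # all object mentions across captions

    for mentioned, ground_truth in captions:
        truth = set(ground_truth)
        wrong = [obj for obj in mentioned if obj not in truth]
        hallucinated_mentions += len(wrong)
        total_mentions += len(mentioned)
        if wrong:
            hallucinated_captions += 1

    chair_s = 100.0 * hallucinated_captions / len(captions) if captions else 0.0
    chair_i = 100.0 * hallucinated_mentions / total_mentions if total_mentions else 0.0
    return chair_s, chair_i


# Toy data (hypothetical): four captions, two of which hallucinate one object each.
toy = [
    (["dog", "frisbee"], ["dog", "frisbee", "tree"]),
    (["cat", "sofa", "remote"], ["cat", "sofa"]),
    (["person", "car"], ["person", "car"]),
    (["bench", "bird"], ["bench"]),
]
chair_s, chair_i = chair_scores(toy)
print(f"CHAIR_S = {chair_s:.2f}, CHAIR_I = {chair_i:.2f}")
```

Under these definitions lower is better for both scores, consistent with the down arrows in the table header.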