Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
Categories: cs.CV, cs.AI
Released Date: November 15, 2024
Authors: Haojie Zheng (1), Tianyang Xu (2), Hanchi Sun (3), Shu Pu (4), Ruoxi Chen (3), Lichao Sun (3)
Affiliations: (1) University of Pennsylvania; (2) Columbia University; (3) Lehigh University; (4) Independent Researcher

| Models | Method | MMVP | HallusionBench | POPE (adversarial) | MathVista | SEED-Bench (single) | Average | MME Perception | MME Cognition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baselines | Human | 0.957 | 0.986 | 0.995 | 0.603 | 0.967 | 0.901 | - | - |
| Baselines | Random choice | 0.250 | 0.500 | 0.500 | 0.179 | 0.250 | 0.336 | - | - |
| GPT-4o mini | Origin | 0.446 | 0.574 | 0.786 | 0.526 | 0.636 | 0.590 | 1097.23 | 407.14 |
| GPT-4o mini | zero-shot CoT | 0.443 (-2.85%) | 0.611 (+6.46%) | 0.773 (-1.65%) | 0.520 (-1.14%) | 0.660 (+3.77%) | 0.601 (+1.86%) | 1069.81 (-2.50%) | 417.14 (+2.46%) |
| GPT-4o mini | VIC | 0.520 (+16.59%) | 0.639 (+11.37%) | 0.793 (+0.89%) | 0.536 (+1.90%) | 0.696 (+9.43%) | 0.637 (+7.96%) | 1105.27 (+0.73%) | 505.00 (+24.04%) |
| Gemini 1.5 Flash | Origin | 0.527 | 0.560 | 0.769 | 0.479 | 0.658 | 0.599 | 1077.36 | 358.92 |
| Gemini 1.5 Flash | zero-shot CoT | 0.513 (-2.54%) | 0.614 (+9.69%) | 0.741 (-3.64%) | 0.520 (+8.56%) | 0.672 (+2.13%) | 0.612 (+2.17%) | 1105.9 (+2.65%) | 500.71 (+39.50%) |
| Gemini 1.5 Flash | VIC | 0.553 (+5.05%) | 0.638 (+13.93%) | 0.780 (+1.43%) | 0.516 (+7.72%) | 0.713 (+8.36%) | 0.640 (+6.84%) | 1118.54 (+3.82%) | 508.21 (+41.59%) |
| GPT-4o | Origin | 0.673 | 0.626 | 0.811 | 0.597 | 0.657 | 0.673 | 1174.39 | 522.85 |
| GPT-4o | zero-shot CoT | 0.687 (+1.99%) | 0.673 (+7.49%) | 0.793 (-2.22%) | 0.622 (+4.19%) | 0.739 (+12.48%) | 0.701 (+4.16%) | 1166.12 (-0.70%) | 537.14 (+2.73%) |
| GPT-4o | VIC | 0.747 (+10.90%) | 0.692 (+10.52%) | 0.827 (+1.27%) | 0.620 (+3.85%) | 0.751 (+14.31%) | 0.727 (+8.08%) | 1238.69 (+5.48%) | 557.85 (+6.69%) |
| Gemini 1.5 Pro | Origin | 0.420 | 0.617 | 0.779 | 0.568 | 0.678 | 0.612 | 1166.82 | 462.14 |
| Gemini 1.5 Pro | zero-shot CoT | 0.500 (+19.05%) | 0.611 (-0.97%) | 0.793 (+1.67%) | 0.563 (-0.88%) | 0.691 (+1.92%) | 0.632 (+3.20%) | 1099.33 (-5.78%) | 521.42 (+12.83%) |
| Gemini 1.5 Pro | VIC | 0.553 (+31.74%) | 0.664 (+7.62%) | 0.803 (+3.08%) | 0.558 (-1.76%) | 0.713 (+5.16%) | 0.658 (+7.55%) | 1147.23 (-1.68%) | 561.42 (+21.48%) |
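
The percentages in parentheses appear to give each score's change relative to the same model's Origin row. The page does not state the formula, so the sketch below assumes the usual relative change, (score - baseline) / baseline. The function name `relative_change` is illustrative and not taken from the paper. Most cells reproduce under this assumption; a few differ in the second decimal, presumably because the reported percentages were computed from unrounded scores.

```python
def relative_change(baseline: float, score: float) -> float:
    """Signed percentage change of `score` relative to `baseline` (assumed formula)."""
    return (score - baseline) / baseline * 100.0


# Values copied from the table above.
# GPT-4o mini, MME Cognition: Origin 407.14 vs. VIC 505.00.
print(f"{relative_change(407.14, 505.00):+.2f}%")  # +24.04%

# Gemini 1.5 Pro, HallusionBench: Origin 0.617 vs. zero-shot CoT 0.611.
print(f"{relative_change(0.617, 0.611):+.2f}%")  # -0.97%
```

A positive result means the prompting method scored above the unprompted Origin baseline; a negative result means it scored below.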