Is Cognition consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
cs.AI
Released Date: November 12, 2024
Authors: Zirui Shao¹, Chuwei Luo², Zhaoqing Zhu², Hangdi Xing¹, Zhi Yu¹, Qi Zheng², Jiajun Bu¹
Aff.: ¹Zhejiang University; ²Alibaba Group

| Model | DocVQA C | DocVQA P | DocVQA C&P | DeepForm C | DeepForm P | DeepForm C&P | KLC C | KLC P | KLC C&P |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | 56.53 | 51.66 | 20.82 | 52.11 | 57.79 | 5.240 | 50.99 | 49.01 | 37.87 |
| Qwen-VL-Chat (Ours) | 98.90 | 97.36 | 56.05 | 99.27 | 95.20 | 37.12 | 99.51 | 98.76 | 70.55 |
| InternVL2-2b | 53.69 | 47.65 | 13.92 | 45.71 | 45.71 | 1.456 | 59.98 | 62.13 | 18.48 |
| InternVL2-2b (Ours) | 99.52 | 95.07 | 41.07 | 100.0 | 96.22 | 44.54 | 99.92 | 97.11 | 76.40 |
| InternVL2-8b | 47.50 | 79.48 | 20.47 | 52.84 | 79.04 | 3.202 | 85.48 | 81.35 | 30.53 |
| InternVL2-8b (Ours) | 99.90 | 95.39 | 52.21 | 100.0 | 96.94 | 44.69 | 100.0 | 98.52 | 81.68 |

| Model | FUNSD C | FUNSD P | FUNSD C&P | ChartQA C | ChartQA P | ChartQA C&P | WTQ C | WTQ P | WTQ C&P |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL-Chat | 55.46 | 53.27 | 7.264 | 58.37 | 51.67 | 21.64 | 56.83 | 54.68 | 8.672 |
| Qwen-VL-Chat (Ours) | 95.93 | 97.58 | 45.04 | 98.88 | 98.92 | 72.68 | 97.97 | 96.88 | 32.59 |
| InternVL2-2b | 46.47 | 59.32 | 7.506 | 48.84 | 46.89 | 9.107 | 58.00 | 66.46 | 10.30 |
| InternVL2-2b (Ours) | 98.93 | 95.40 | 26.63 | 99.04 | 98.38 | 78.81 | 98.57 | 95.60 | 39.97 |
| InternVL2-8b | 91.44 | 81.60 | 11.14 | 50.28 | 47.25 | 9.558 | 95.76 | 77.03 | 9.214 |
| InternVL2-8b (Ours) | 99.79 | 94.43 | 37.29 | 99.56 | 99.46 | 83.86 | 99.59 | 97.56 | 59.35 |