M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
cs.CL
cs.AI
Release Date: November 6, 2024
Authors: Chuhan Li¹, Ziyao Shangguan¹, Yilun Zhao¹, Deyuan Li¹, Yixin Liu¹, Arman Cohan²
Affiliations: ¹Yale University; ²Allen Institute for AI
| Model | Modality: Table | Modality: Figure | Reasoning Type: COM | Reasoning Type: DE | Reasoning Type: LOC | Reasoning Type: VU | All |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Expert Performance | 0.678 | 0.765 | 0.751 | 0.872 | 0.711 | 0.732 | 0.796 |
| Random | 0.134 | 0.106 | 0.134 | 0.130 | 0.110 | 0.111 | 0.126 |
| Simple Baselines | | | | | | | |
| text-embedding-3-large | 0.321 | 0.239 | 0.267 | 0.323 | 0.384 | 0.218 | 0.297 |
| text-embedding-3-small | 0.223 | 0.205 | 0.221 | 0.223 | 0.267 | 0.138 | 0.217 |
| text-embedding-ada-002 | 0.185 | 0.168 | 0.200 | 0.171 | 0.224 | 0.096 | 0.180 |
| Contriever | 0.165 | 0.229 | 0.196 | 0.144 | 0.274 | 0.142 | 0.184 |
| BM25 | 0.138 | 0.098 | 0.118 | 0.128 | 0.160 | 0.110 | 0.127 |
| Open-Source Large Multi-modal Models (LMMs) | | | | | | | |
| InternVL-Chat-V1.1 | 0.168 | 0.084 | 0.136 | 0.153 | 0.170 | 0.109 | 0.144 |
| Yi-VL-34B | 0.105 | 0.057 | 0.101 | 0.088 | 0.080 | 0.086 | 0.091 |
| Qwen-VL-Plus | 0.065 | 0.131 | 0.077 | 0.053 | 0.148 | 0.136 | 0.089 |
| LLaVA-1.6 | 0.079 | 0.000 | 0.088 | 0.044 | 0.052 | 0.000 | 0.056 |
| DeepSeek-VL | 0.075 | 0.087 | 0.064 | 0.081 | 0.109 | 0.070 | 0.079 |
| Proprietary Large Multi-modal Models (LMMs) | | | | | | | |
| GPT-4o | 0.520 | 0.454 | 0.443 | 0.565 | 0.570 | 0.418 | 0.500 |
| GPT-4V(ision) | 0.440 | 0.309 | 0.383 | 0.407 | 0.523 | 0.288 | 0.400 |
| Claude-3-Sonnet | 0.385 | 0.369 | 0.357 | 0.363 | 0.395 | 0.422 | 0.374 |
| Claude-3-Opus | 0.256 | 0.343 | 0.320 | 0.362 | 0.301 | 0.204 | 0.316 |
| Gemini-Pro-Vision-1.0 | 0.217 | 0.188 | 0.196 | 0.160 | 0.284 | 0.195 | 0.197 |
| Claude-3-Haiku | 0.189 | 0.188 | 0.194 | 0.201 | 0.130 | 0.208 | 0.188 |