VisAidMath: Benchmarking Visual-Aided Mathematical Reasoning
cs.CV
cs.AI
cs.CL
cs.LG
Released Date: October 30, 2024
Authors: Jingkun Ma¹, Runzhe Zhan¹, Derek F. Wong¹, Yang Li¹, Di Sun², Hou Pong Chan³, Lidia S. Chao¹
Affiliations: ¹NLP2CT Lab, Department of Computer and Information Science, University of Macau; ²Department of Mathematics, University of Macau; ³DAMO Academy, Alibaba Group

| Model | ALL | PLG | SDG | AYG | CAL | AXL | RTC | THC | PLG | SDG | FUG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Heuristics Baselines | | | | | | | | | | | |
| Random Answer | 24.42 | 21.54 | 34.31 | 21.45 | 20.07 | 24.44 | 20.87 | 35.16 | 10.53 | 32.89 | 21.50 |
| Frequent Answer | 40.83 | 28.92 | 50.65 | 40.36 | 44.22 | 32.79 | 47.25 | 74.73 | 20.00 | 47.73 | 44.53 |
| Large Language Models (LLMs): Text-Only Input | | | | | | | | | | | |
| Llama2-7B | 26.83 | 21.85 | 34.64 | 30.55 | 20.75 | 26.68 | 25.23 | 39.56 | 11.58 | 30.26 | 26.49 |
| Mistral-7b-Instruct-v0.2 | 27.42 | 27.38 | 30.72 | 27.64 | 23.81 | 27.57 | 28.21 | 28.57 | 11.58 | 27.63 | 26.87 |
| GPT3.5 | 37.58 | 32.31 | 42.16 | 37.45 | 38.78 | 37.56 | 38.30 | 40.66 | 13.68 | 42.11 | 38.20 |
| GPT4 | 51.92 | 41.54 | 52.29 | 50.91 | 63.95 | 45.75 | 54.59 | 60.44 | 23.16 | 53.29 | 61.23 |
| Large Multimodal Models (LMMs): Text-Only Input | | | | | | | | | | | |
| LLaVA-Next-Mistral-7B | 23.08 | 21.23 | 22.55 | 25.45 | 23.47 | 22.21 | 23.62 | 25.27 | 8.42 | 26.32 | 25.34 |
| InternLM-XComposer2-VL | 33.17 | 24.62 | 44.12 | 32.36 | 31.97 | 30.40 | 33.03 | 46.15 | 10.53 | 41.45 | 34.17 |
| Qwen-VL-Plus | 34.75 | 30.15 | 43.46 | 33.82 | 31.63 | 34.43 | 34.63 | 48.35 | 21.05 | 44.74 | 32.63 |
| Gemini-Pro-Vision | 38.42 | 31.08 | 48.37 | 31.27 | 42.86 | 34.72 | 37.84 | 49.45 | 18.95 | 51.97 | 39.54 |
| Claude-3-Sonnet | 38.58 | 31.38 | 43.46 | 39.27 | 40.82 | 36.66 | 40.14 | 46.15 | 14.74 | 43.42 | 42.23 |
| GPT4V | 47.00 | 35.08 | 47.06 | 50.55 | 56.80 | 41.43 | 50.69 | 48.35 | 15.79 | 47.37 | 55.66 |
| Large Multimodal Models (LMMs): Multimodal Input | | | | | | | | | | | |
| LLaVA-Next-Mistral-7B | 24.58 | 22.77 | 24.18 | 27.64 | 24.15 | 23.55 | 24.54 | 29.67 | 9.47 | 25.00 | 25.91 |
| InternLM-XComposer2-VL | 29.00 | 21.54 | 32.68 | 31.64 | 30.95 | 26.97 | 30.73 | 37.36 | 10.53 | 35.53 | 32.05 |
| Qwen-VL-Plus | 32.00 | 28.62 | 35.95 | 33.45 | 30.27 | 32.34 | 33.49 | 32.97 | 21.05 | 42.11 | 32.05 |
| Gemini-Pro-Vision | 38.33 | 28.92 | 48.69 | 32.73 | 43.20 | 33.68 | 38.07 | 50.55 | 14.74 | 53.95 | 39.73 |
| Claude-3-Sonnet | 37.08 | 27.69 | 41.50 | 39.27 | 40.82 | 33.38 | 40.60 | 46.15 | 14.74 | 41.45 | 42.42 |
| GPT4V | 45.33 | 34.46 | 42.16 | 49.45 | 56.80 | 39.64 | 50.00 | 41.76 | 13.68 | 46.71 | 55.28 |
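The table lists each LMM twice, once with text-only input and once with multimodal input, so the two settings can be compared directly. The sketch below is an illustration and not part of the paper: it copies the ALL-column accuracies from the table and computes the per-model difference. The names `text_only`, `multimodal`, and `input_deltas` are hypothetical, chosen only for this example.

```python
# ALL-column accuracy (%) copied from the table above.
# Dictionary and function names are illustrative, not from the paper.
text_only = {
    "LLaVA-Next-Mistral-7B": 23.08,
    "InternLM-XComposer2-VL": 33.17,
    "Qwen-VL-Plus": 34.75,
    "Gemini-Pro-Vision": 38.42,
    "Claude-3-Sonnet": 38.58,
    "GPT4V": 47.00,
}
multimodal = {
    "LLaVA-Next-Mistral-7B": 24.58,
    "InternLM-XComposer2-VL": 29.00,
    "Qwen-VL-Plus": 32.00,
    "Gemini-Pro-Vision": 38.33,
    "Claude-3-Sonnet": 37.08,
    "GPT4V": 45.33,
}


def input_deltas(text_scores, multimodal_scores):
    """Return {model: multimodal - text_only}, rounded to two decimals."""
    return {
        model: round(multimodal_scores[model] - text_scores[model], 2)
        for model in text_scores
    }


deltas = input_deltas(text_only, multimodal)

# Print models from largest drop to largest gain.
for model, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"{model:<24} {delta:+.2f}")
```

On these numbers, only LLaVA-Next-Mistral-7B scores higher with multimodal input than with text-only input on the ALL column. The other five models score the same or lower.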