UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Categories: cs.CL, cs.AI
Released Date: November 11, 2024
Authors: Bo Yang¹, Qingping Yang¹, Runtao Liu²
Affiliations: ¹South China University of Technology; ²Hong Kong University of Science and Technology

| Model | Pass@1 PoT (%) | Pass@1 RCoT (%) (Δ vs. PoT) | Pass@5 PoT (%) | Pass@5 RCoT (%) (Δ vs. PoT) | Avg. Run Time PoT (s) | Avg. Run Time RCoT (s) | Efficiency |
|---|---|---|---|---|---|---|---|
| closed-source models | | | | | | | |
| GPT-4o | 25.53 | 26.93 (+1.40) | 32.67 | 35.90 (+3.23) | 6.98 | 6.23 | 12.04% |
| Claude-3.5-Sonnet | 18.58 | 19.11 (+0.53) | 27.83 | 31.34 (+3.51) | 6.44 | 5.32 | 21.05% |
| Gemini-1.5-Pro | 14.13 | 12.35 (-1.78) | 25.17 | 24.69 (-0.48) | 5.71 | 5.76 | 0.87% |
| GPT-3.5-Turbo | 11.68 | 6.82 (-4.86) | 17.09 | 13.30 (-3.79) | 5.42 | 5.06 | 7.11% |
| open-source models | | | | | | | |
| Qwen2.5-72B | 23.48 | 22.17 (-1.31) | 31.05 | 33.33 (+2.28) | 5.88 | 4.31 | 36.42% |
| DeepSeek-V2.5-236B | 20.95 | 21.63 (+0.68) | 30.10 | 31.72 (+1.62) | 6.64 | 5.44 | 22.06% |
| Qwen2.5-Math-72B | 19.72 | 20.53 (+0.81) | 26.69 | 28.11 (+1.42) | 5.04 | 3.81 | 24.40% |
| LLaMA-3.1-405B | 15.76 | 16.09 (+0.33) | 25.26 | 27.35 (+2.09) | 5.73 | 5.12 | 11.91% |
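
The parenthesized values next to the RCoT scores correspond to the gap between the RCoT and PoT pass rates. The Efficiency column is not defined in this excerpt. For the three rows used below, its figures are reproduced, up to sign, by (PoT - RCoT) / RCoT applied to the average run times. That formula is an inference from the numbers, not the authors' stated definition, and may not hold for every row. A minimal Python sketch of both calculations, with hypothetical helper names `pass_delta` and `runtime_reduction`:

```python
# Rows copied from the table: (Pass@1 PoT, Pass@1 RCoT, run time PoT, run time RCoT).
ROWS = {
    "GPT-4o": (25.53, 26.93, 6.98, 6.23),
    "Claude-3.5-Sonnet": (18.58, 19.11, 6.44, 5.32),
    "Gemini-1.5-Pro": (14.13, 12.35, 5.71, 5.76),
}


def pass_delta(pot: float, rcot: float) -> float:
    """Change in pass rate, in percentage points, when moving from PoT to RCoT."""
    return round(rcot - pot, 2)


def runtime_reduction(pot_s: float, rcot_s: float) -> float:
    """ASSUMED formula (inferred from the table, not from the paper):
    run-time difference relative to the RCoT run time, in percent."""
    return round((pot_s - rcot_s) / rcot_s * 100, 2)


for model, (p1_pot, p1_rcot, t_pot, t_rcot) in ROWS.items():
    delta = pass_delta(p1_pot, p1_rcot)
    reduction = runtime_reduction(t_pot, t_rcot)
    print(f"{model}: Pass@1 delta {delta:+.2f} pts, run-time change {reduction:+.2f}%")
```

A negative delta, as for Gemini-1.5-Pro, means RCoT scored lower than PoT on that metric. A negative run-time change means the RCoT solutions ran slower on average.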