PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models
cs.CL
Release Date: December 9, 2024
Authors: Qian Zhang¹, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Mei Chen, Hui Li, Yanhao Wang
Aff.: ¹State Key Laboratory of Public Big Data, Guizhou University

| Domain | Model | ToF | MC | PA | ES | CA | Total | Total (Scaled) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Medical LLM | BianQue-7B | 0 | 0 | 0 | 3394 | 150 | 3544 | 25.32 |
| Medical LLM | QiZhenGPT-13B | 0 | 0 | 0 | 4141.3 | 170 | 4311.3 | 30.80 |
| Medical LLM | PULSE-7B | 139 | 1021 | 95 | 4009 | 213 | 5477 | 39.13 |
| Medical LLM | PULSE-20B | 151 | 1695.5 | 145 | 3977.4 | 248 | 6216.9 | 44.42 |
| Open-Source LLM | Baichuan2-7B | 156.5 | 1641.5 | 108 | 4373.3 | 284.5 | 6563.8 | 46.90 |
| Open-Source LLM | Baichuan2-13B | 158 | 1771.5 | 132 | 4814.7 | 312 | 7188.2 | 51.36 |
| Open-Source LLM | ChatGLM3-6B | 146.5 | 1312.5 | 120 | 4199.7 | 231 | 6009.7 | 42.94 |
| Open-Source LLM | InternLM2-7B | 153.5 | 1782 | 188 | 5257.4 | 351 | 7731.9 | 55.25 |
| Open-Source LLM | InternLM2-20B | 140 | 1839.5 | 213 | 5345.3 | 384 | 7921.8 | 56.60 |
| Open-Source LLM | LLaMa3-8B | 140 | 1322.5 | 142 | 3565 | 148 | 5317.5 | 37.99 |
| Open-Source LLM | LLaMa3-70B | 196 | 2398.5 | 404 | 4640.7 | 287 | 7926.2 | 56.63 |
| Open-Source LLM | Qwen1.5-7B | 158.5 | 1392.5 | 143 | 3957.2 | 252 | 5903.2 | 42.18 |
| Open-Source LLM | Qwen1.5-14B | 191 | 2190 | 376 | 4335.8 | 326 | 7418.8 | 53.01 |
| Open-Source LLM | Qwen-72B | 203.5 | 3349.5 | 480 | 4964 | 425 | 9422 | 67.32 |
| Open-Source LLM | Mixtral-8x7B (47B) | 148 | 1387 | 181 | 3834.5 | 178 | 5728.5 | 40.93 |
| Open-Source LLM | Mixtral-8x22B (141B) | 177.5 | 1674 | 335 | 4362.9 | 250 | 6799.4 | 48.58 |
| Commercial LLM | GLM-4 | 209.5 | 3059.5 | 458 | 5901.8 | 443 | 10071.8 | 71.96 |
| Commercial LLM | ERNIE-3.5-8K-0329 | 230 | 2949 | 652 | 5671.2 | 514 | 10016.2 | 71.57 |
| Commercial LLM | Qwen-MAX | 227.5 | 3280.5 | 609 | 6036.5 | 515 | 10668.5 | 76.23 |
| Commercial LLM | GPT3.5-turbo | 164 | 1644 | 324 | 4563.4 | 300.5 | 6995.9 | 49.99 |
| | Full Score | 300 | 4351.5 | 849 | 7825 | 670 | 13995.5 | 100 |
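
The last two numeric columns can be reproduced from the five per-type scores. The sketch below is inferred from the table's arithmetic rather than taken from the paper: it assumes the second-to-last column is the plain sum of the ToF, MC, PA, ES, and CA scores, and the last column is that sum as a percentage of the full score (13995.5), rounded to two decimals. The function names `total_score` and `scaled_score` are hypothetical helpers introduced here for illustration.

```python
# Full marks per question type, copied from the "Full Score" row of the table.
FULL_SCORE = {"ToF": 300, "MC": 4351.5, "PA": 849, "ES": 7825, "CA": 670}


def total_score(scores):
    """Sum the five per-type scores (assumed definition of the total column)."""
    return sum(scores[k] for k in FULL_SCORE)


def scaled_score(scores):
    """Total as a percentage of the full score, rounded to two decimals (assumed)."""
    return round(100 * total_score(scores) / total_score(FULL_SCORE), 2)


# Per-type scores copied from the Qwen-MAX row.
qwen_max = {"ToF": 227.5, "MC": 3280.5, "PA": 609, "ES": 6036.5, "CA": 515}

print(total_score(qwen_max))   # 10668.5
print(scaled_score(qwen_max))  # 76.23
```

Both printed values match the Qwen-MAX row, and the same check holds for the other rows, which suggests the scaled score is a simple percentage of the maximum attainable total rather than an average of per-type percentages.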