Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation
Categories: cs.CL, cs.AI
Release Date: November 22, 2024
Authors: Xunyu Zhu¹, Jian Li², Can Ma¹, Weiping Wang¹
Affiliations: ¹Institute of Information Engineering, Chinese Academy of Sciences, and School of Cyber Security, University of Chinese Academy of Sciences; ²School of Artificial Intelligence, Beijing Normal University

Accuracy (%) on four mathematical reasoning benchmarks; AVG is the unweighted mean over the benchmarks each model reports ("-" marks missing results).

| Models | #Params | GSM8K | ASDiv | SVAMP | MultiArith | AVG |
|---|---|---|---|---|---|---|
| Proprietary Large Language Models | ||||||
| GPT-4 [19] | - | 92.0 | 91.3 | 93.1 | - | 92.13 |
| ChatGPT | - | 80.8 | 87.3 | 83.0 | - | 83.7 |
| Claude-2 | - | 85.2 | - | - | - | 85.2 |
| PaLM-2 [20] | 540B | 80.7 | - | - | - | 80.7 |
| Open-Source Large Language Models | ||||||
| Llama-2 [21] | 7B | 13.3 | 50.7 | 38.0 | - | 34.0 |
| CodeLLaMA [22] | 7B | 34.0 | 61.4 | 59.0 | - | 51.46 |
| Platypus-2 [23] | 7B | 14.4 | 47.9 | 36.7 | - | 33.0 |
| WizardMath [24] | 7B | 54.9 | 59.1 | 57.3 | - | 57.1 |
| TORA [25] | 7B | 68.8 | 73.9 | 68.2 | - | 70.3 |
| Fine-tuned Small Language Models | ||||||
| Ho et al. [2] | 0.3B | 3.11 | - | - | - | 3.11 |
| Fu et al. [3] | 0.76B | 20.2 | 23.8 | 20.4 | 38.5 | 25.72 |
| Shridhar et al. [4] | 0.77B | 17.89 | - | 18.14 | - | 18.01 |
| Zhu et al. [5] | 0.77B | 39.2 | 51.2 | 48.2 | 79.2 | 54.45 |
| Zhu et al. [6] | 0.77B | 42.45 | 52.81 | 49.59 | 85.5 | 57.58 |
| Our fine-tuned Small Language Models | ||||||
| FlanT5-Small | 0.06B | 2.1 | 2.8 | 2.1 | 4.0 | 2.75 |
| (+) FDD | 0.06B | 29.87 | 52.86 | 43.4 | 77.16 | 50.82 |
| FlanT5-Base | 0.25B | 3.0 | 4.2 | 3.8 | 7.0 | 4.5 |
| (+) FDD | 0.25B | 40.25 | 58.44 | 54.3 | 87.83 | 60.20 |
| FlanT5-Large | 0.76B | 6.9 | 10.1 | 6.8 | 13.0 | 9.2 |
| (+) FDD | 0.76B | 49.43 | 64.88 | 61.9 | 94.0 | 67.55 |
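
For reference, the AVG column is the unweighted mean over the benchmarks a model actually reports, with missing entries excluded. Below is a minimal Python sketch of that computation (not from the paper; the two score tuples are copied from the table above as examples):

```python
# Verify the AVG column: unweighted mean over the reported benchmark
# scores, skipping entries the table marks with "-" (here: None).

scores = {
    # model: (GSM8K, ASDiv, SVAMP, MultiArith)
    "GPT-4":              (92.0, 91.3, 93.1, None),
    "FlanT5-Large + FDD": (49.43, 64.88, 61.9, 94.0),
}

def avg(vals):
    """Mean over the non-missing benchmark scores."""
    present = [v for v in vals if v is not None]
    return sum(present) / len(present)

for model, vals in scores.items():
    print(f"{model}: AVG = {avg(vals):.2f}")
# GPT-4: AVG = 92.13               (matches the table)
# FlanT5-Large + FDD: AVG = 67.55  (matches the table)
```

Note that because models report different benchmark subsets, the AVG values are not directly comparable across rows (e.g., Claude-2's 85.2 averages a single benchmark, while the FDD rows average all four).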