Preference Optimization for Reasoning with Pseudo Feedback
cs.CL
Released Date: November 25, 2024
Authors: Fangkai Jiao¹, Geyang Guo², Xingxing Zhang³, Nancy F. Chen¹, Shafiq Joty⁴, Furu Wei³
Affiliations: ¹Nanyang Technological University and I2R, A*STAR; ²Georgia Institute of Technology; ³Microsoft Research; ⁴Salesforce Research

| Model | MATH | GSM8K | College Math |
| --- | --- | --- | --- |
| GPT-4o-2024-0512 | 78.7 | 95.8 | 46.7 |
| GPT-4-Turbo-2024-0409 | 72.8 | 94.8 | 44.2 |
| GPT-4-Turbo-1106-preview† | 64.3 | — | — |
| GPT-4-0613 | 55.0 | 93.5 | 39.0 |
| NuminaMath-72B-CoT (Beeching et al., 2024) | 67.1 | 91.7 | 39.8 |
| Llama-3.1-8B-Instruct (Dubey et al., 2024) | 47.5 | 84.5 | 27.5 |
| Llama-3.1-70B-Instruct (Dubey et al., 2024) | 68.1 | 95.5 | 41.8 |
| Llama-3.1-8B-base (Dubey et al., 2024) | 20.3 (4-shot) | 56.7 (8-shot) | 20.1 (4-shot) |
| w/ SFT | 53.8 | 85.1 | 34.6 |
| w/ PFPO-LLM Iter. 0 | 55.0 | 86.6 | 35.8 |
| w/ PFPO-Self Iter. 1 | 55.9 | 87.6 | 36.6 |
| w/ PFPO-Self Iter. 2 | 56.6 | 88.9 | 37.0 |
| w/ PFPO-Self Iter. 3 | 57.0 | 88.8 | 36.7 |
| w/ PFPO-Self Iter. 4 | 57.4 | 89.1 | 37.6 |
| w/ PFPO-Self Iter. 5 | 57.8 | 89.6 | 38.0 |
| Mathstral-7B-v0.1 (Mistral AI Team, 2024b) | 58.3 | 85.6 | 34.3 |
| w/ SFT | 61.4 | 87.3 | 38.4 |
| w/ PFPO-LLM Iter. 0 | 66.7 | 90.0 | 41.3 |
| w/ PFPO-Self Iter. 1 | 67.8 | 90.8 | 42.0 |
| w/ PFPO-Self Iter. 2 | 68.6 | 90.3 | 42.2 |
| w/ PFPO-Self Iter. 3 | 68.2 | 90.4 | 42.3 |
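To make the size of the improvements easier to read, the following minimal sketch computes the point gains of selected PFPO rows over the corresponding SFT row. The scores are copied from the table above; the names `RESULTS`, `BENCHMARKS`, and `gains_over_sft` are illustrative assumptions and do not come from the paper.

```python
# Scores copied from the table above, in the order (MATH, GSM8K, College Math).
# Only the SFT row, PFPO-LLM Iter. 0, and the last PFPO-Self iteration are included.
BENCHMARKS = ("MATH", "GSM8K", "College Math")

RESULTS = {
    "Llama-3.1-8B-base": {
        "w/ SFT": (53.8, 85.1, 34.6),
        "w/ PFPO-LLM Iter. 0": (55.0, 86.6, 35.8),
        "w/ PFPO-Self Iter. 5": (57.8, 89.6, 38.0),
    },
    "Mathstral-7B-v0.1": {
        "w/ SFT": (61.4, 87.3, 38.4),
        "w/ PFPO-LLM Iter. 0": (66.7, 90.0, 41.3),
        "w/ PFPO-Self Iter. 3": (68.2, 90.4, 42.3),
    },
}


def gains_over_sft(rows):
    """Return per-benchmark point gains of each non-SFT row over the SFT row."""
    sft = rows["w/ SFT"]
    return {
        name: tuple(round(score - base, 1) for score, base in zip(scores, sft))
        for name, scores in rows.items()
        if name != "w/ SFT"
    }


if __name__ == "__main__":
    for model, rows in RESULTS.items():
        print(model)
        for name, deltas in gains_over_sft(rows).items():
            formatted = ", ".join(
                f"{bench} {delta:+.1f}" for bench, delta in zip(BENCHMARKS, deltas)
            )
            print(f"  {name}: {formatted}")
```

Under these numbers, the last reported PFPO-Self iteration is 4.0 points above SFT on MATH for Llama-3.1-8B-base and 6.8 points above SFT on MATH for Mathstral-7B-v0.1.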