Large Language Models Can Self-Improve in Long-context Reasoning
cs.CL
cs.AI
Release Date: November 12, 2024
Authors: Siheng Li¹, Cheng Yang¹, Zesen Cheng², Lemao Liu³, Mo Yu³, Yujiu Yang⁴, Wai Lam¹
Affiliations: ¹The Chinese University of Hong Kong; ²Peking University; ³Tencent; ⁴Tsinghua University

| Model | Qasper | MultiFieldQA-En | HotpotQA | MuSiQue | 2WikiMQA | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen-2.5-7B-Instruct (Yang et al., 2024a) | 21.0 | 28.0 | 70.5 | 48.0 | 77.5 | 49.0 |
| + SeaLong | 26.0 | 29.3 | 72.5 | 51.5 | 79.5 | 51.8 |
| Qwen-2.5-14B-Instruct (Yang et al., 2024a) | 21.0 | 32.0 | 73.0 | 52.0 | 83.0 | 52.2 |
| + SeaLong | 24.0 | 30.0 | 75.0 | 57.0 | 87.5 | 54.7 |
| Llama-3.1-8B-Instruct (Dubey et al., 2024) | 29.0 | 29.3 | 64.0 | 49.5 | 82.0 | 50.8 |
| + SeaLong | 32.5 | 31.3 | 68.0 | 58.5 | 84.5 | 55.0 |
| Qwen-2.5-32B-Instruct (Yang et al., 2024a) | 24.5 | 26.0 | 72.0 | 55.0 | 88.0 | 53.1 |
| Qwen-2.5-72B-Instruct (Yang et al., 2024a) | 27.0 | 28.7 | 74.5 | 58.5 | 89.0 | 55.5 |
| Llama-3.1-70B-Instruct (Dubey et al., 2024) | 30.0 | 33.3 | 74.0 | 68.5 | 85.5 | 58.3 |
| GPT-4o (Hurst et al., 2024) | 21.5 | 28.0 | 74.5 | 64.0 | 84.0 | 54.4 |
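
The Avg. column appears to be the unweighted mean of the five task scores, rounded to one decimal place. The source does not state this, but the reported numbers are consistent with it. The minimal sketch below recomputes the averages for the three base models and their SeaLong variants and derives the average gain. The names `ROWS`, `task_average`, and `sealong_gain` are illustrative, not from the paper.

```python
from statistics import mean

# Scores transcribed from the table above, in column order:
# Qasper, MultiFieldQA-En, HotpotQA, MuSiQue, 2WikiMQA; then the reported Avg.
ROWS = {
    "Qwen-2.5-7B-Instruct": ([21.0, 28.0, 70.5, 48.0, 77.5], 49.0),
    "Qwen-2.5-7B-Instruct + SeaLong": ([26.0, 29.3, 72.5, 51.5, 79.5], 51.8),
    "Qwen-2.5-14B-Instruct": ([21.0, 32.0, 73.0, 52.0, 83.0], 52.2),
    "Qwen-2.5-14B-Instruct + SeaLong": ([24.0, 30.0, 75.0, 57.0, 87.5], 54.7),
    "Llama-3.1-8B-Instruct": ([29.0, 29.3, 64.0, 49.5, 82.0], 50.8),
    "Llama-3.1-8B-Instruct + SeaLong": ([32.5, 31.3, 68.0, 58.5, 84.5], 55.0),
}


def task_average(scores):
    """Unweighted mean over tasks, rounded to one decimal (assumed convention)."""
    return round(mean(scores), 1)


def sealong_gain(base):
    """Average-score difference between a base model and its SeaLong variant."""
    tuned = f"{base} + SeaLong"
    return round(mean(ROWS[tuned][0]) - mean(ROWS[base][0]), 1)


if __name__ == "__main__":
    for name, (scores, reported) in ROWS.items():
        print(f"{name}: computed {task_average(scores)}, reported {reported}")
    for base in (n for n in ROWS if "SeaLong" not in n):
        print(f"{base}: SeaLong gain {sealong_gain(base):+.1f}")
```

Under this assumption the recomputed averages match the reported ones, and the SeaLong gains come out to 2.8, 2.5, and 4.2 points for Qwen-2.5-7B-Instruct, Qwen-2.5-14B-Instruct, and Llama-3.1-8B-Instruct respectively.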