notesum.ai
Published on November 11
Explore the Reasoning Capability of LLMs in the Chess Testbed
cs.CL
cs.AI
Released Date: November 11, 2024
Authors: Shu Wang¹, Lei Ji², Renxi Wang³, Wenxiao Zhao¹, Haokun Liu⁴, Yifan Hou⁵, Ying Nian Wu¹
Affiliations: ¹UCLA; ²Microsoft Research; ³MBZUAI; ⁴University of Toronto; ⁵Peking University

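Column key: N = no explanation, S = strategy, T = tactic, ST = strategy and tactic. Each is reported under zero-shot and few-shot learning.

| Model | Zero-shot N | Zero-shot S | Zero-shot T | Zero-shot ST | Few-shot N | Few-shot S | Few-shot T | Few-shot ST |
|---|---|---|---|---|---|---|---|---|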
| gpt-4 | 53.1 | 54.6 | 60.0 | 60.0 | 54.7 | 58.9 | 57.7 | 68.1 |
| gpt-4o | 46.4 | 52.8 | 54.8 | 60.1 | 48.5 | 54.3 | 52.7 | 63.1 |
| o1-mini | 51.5 | 58.8 | 64.1 | 69.2 | 50.4 | 58.3 | 62.0 | 65.9 |
| o1-preview | 56.4 | 65.4 | 77.2 | 76.6 | 59.0 | 65.4 | 76.2 | 78.6 |
| claude-3.5-sonnet | 49.6 | 54.9 | 56.9 | 54.9 | 51.9 | 63.7 | 59.9 | 66.1 |
| claude-3-opus | 48.3 | 54.5 | 53.7 | 57.3 | 51.0 | 55.8 | 53.2 | 60.2 |
| gemini-1.5-pro | 50.6 | 48.8 | 54.2 | 52.6 | 50.5 | 50.1 | 52.7 | 50.4 |
| gemini-1.5-flash | 46.1 | 50.8 | 54.2 | 52.9 | 49.7 | 48.2 | 53.8 | 55.6 |
| Ours-no-explanation | 63.5 | – | – | – | 64.7 | – | – | – |
| Ours-strategy | – | 89.7 | – | – | – | 89.8 | – | – |
| Ours-tactic | – | – | 94.6 | – | – | – | 94.5 | – |
| Ours-strategy&tactic | – | – | – | 95.2 | – | – | – | 95.3 |
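The sketch below reads the size of the explanation effect directly from the table. It copies the baseline rows and computes how many points each model gains when given strategy and tactic explanations (ST) instead of none (N). It also finds the top baseline in any column. The dictionary layout and the helper names `st_gain` and `best_model` are hypothetical and come from this illustration, not from the paper. The "Ours" rows are left out because each reports only one column.

```python
# Scores copied from the results table (baseline LLMs only).
# Each tuple is (N, S, T, ST): no explanation, strategy, tactic, strategy & tactic.
SCORES = {
    "gpt-4":             {"zero": (53.1, 54.6, 60.0, 60.0), "few": (54.7, 58.9, 57.7, 68.1)},
    "gpt-4o":            {"zero": (46.4, 52.8, 54.8, 60.1), "few": (48.5, 54.3, 52.7, 63.1)},
    "o1-mini":           {"zero": (51.5, 58.8, 64.1, 69.2), "few": (50.4, 58.3, 62.0, 65.9)},
    "o1-preview":        {"zero": (56.4, 65.4, 77.2, 76.6), "few": (59.0, 65.4, 76.2, 78.6)},
    "claude-3.5-sonnet": {"zero": (49.6, 54.9, 56.9, 54.9), "few": (51.9, 63.7, 59.9, 66.1)},
    "claude-3-opus":     {"zero": (48.3, 54.5, 53.7, 57.3), "few": (51.0, 55.8, 53.2, 60.2)},
    "gemini-1.5-pro":    {"zero": (50.6, 48.8, 54.2, 52.6), "few": (50.5, 50.1, 52.7, 50.4)},
    "gemini-1.5-flash":  {"zero": (46.1, 50.8, 54.2, 52.9), "few": (49.7, 48.2, 53.8, 55.6)},
}


def st_gain(model: str, setting: str) -> float:
    """Points gained by adding strategy & tactic explanations (ST minus N).

    `setting` is "zero" or "few"; the result is rounded to one decimal
    to match the precision of the table.
    """
    n, _s, _t, st = SCORES[model][setting]
    return round(st - n, 1)


def best_model(setting: str, column: int) -> str:
    """Baseline model with the highest score in one column (0=N, 1=S, 2=T, 3=ST)."""
    return max(SCORES, key=lambda m: SCORES[m][setting][column])


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model:18s} zero-shot ST-N: {st_gain(model, 'zero'):+5.1f}  "
              f"few-shot ST-N: {st_gain(model, 'few'):+5.1f}")
    print("best zero-shot ST baseline:", best_model("zero", 3))
```

For example, `st_gain("o1-preview", "zero")` gives 20.2 points. `st_gain("gemini-1.5-pro", "few")` gives a slightly negative value, because that model scores marginally lower with both explanation types than with none.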