From Code to Play: Benchmarking Program Search for Games Using Large Language Models
cs.AI
Release date: December 5, 2024
Authors: Manuel Eberhardinger¹, James Goodman², Alexander Dockhorn³, Diego Perez-Liebana², Raluca D. Gaina², Duygu Çakmak⁴, Setareh Maghsudi⁵, Simon Lucas²
Affiliations: ¹Institute of Applied AI, Stuttgart Media University; ²School of Electronic Engineering and Computer Science, Queen Mary University of London; ³Institute for Information Processing, Leibniz University Hannover; ⁴Creative Assembly; ⁵Ruhr University Bochum

| Model | Level 1 | Level 2 | Level 3 | Level 4 | Level 5 | Level 6 | Level 7 | Level 8 | Level 9 | Level 10 | #Levels solved |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B | 95 (7) | 0.0 (0) | 0.0 (0) | 95 (5) | 0.0 (0) | 0.0 (0) | 58 (2) | 95 (5) | 0.0 (0) | 0.0 (0) | 4 |
| Llama 3.1 70B | 95 (7) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 94 (3) | 95 (3) | 0.0 (0) | 0.0 (0) | 3 |
| Llama 3.1 405B | 95 (9) | 0.0 (0) | 0.0 (0) | 95 (1) | 0.0 (0) | 0.0 (0) | 94 (7) | 95 (4) | 0.0 (0) | 90 (1) | 5 |
| Claude 3.5 Haiku | 95 (10) | 0.0 (0) | 0.0 (0) | 95 (5) | 0.0 (0) | 25 (1) | 94 (8) | 95 (10) | 0.0 (0) | 0.0 (0) | 5 |
| Claude 3.5 Sonnet | 95 (10) | 0.0 (0) | 0.0 (0) | 95 (9) | 0.0 (0) | 91 (2) | 94 (6) | 95 (9) | 90 (2) | 92 (1) | 7 |
| GPT-4o mini | 95 (7) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 90 (1) | 95 (6) | 0.0 (0) | 0.0 (0) | 3 |
| GPT-4o | 95 (10) | 0.0 (0) | 0.0 (0) | 95 (4) | 0.0 (0) | 91 (1) | 94 (5) | 95 (8) | 0.0 (0) | 0.0 (0) | 5 |
| o1-mini | 95 (10) | 0.0 (0) | 0.0 (0) | 95 (7) | 0.0 (0) | 83 (1) | 94 (7) | 95 (9) | 0.0 (0) | 92 (1) | 6 |
| Mistral Small | 95 (5) | 0.0 (0) | 0.0 (0) | 95 (3) | 0.0 (0) | 0.0 (0) | 33 (1) | 95 (5) | 0.0 (0) | 0.0 (0) | 4 |
| Mistral Large | 95 (10) | 89 (1) | 0.0 (0) | 95 (1) | 0.0 (0) | 0.0 (0) | 94 (8) | 95 (9) | 0.0 (0) | 0.0 (0) | 5 |
| Gemini Flash | 95 (10) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 83 (1) | 94 (5) | 95 (8) | 0.0 (0) | 90 (1) | 5 |
| Gemini Pro | 95 (6) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 0.0 (0) | 83 (4) | 76 (5) | 95 (6) | 0.0 (0) | 0.0 (0) | 4 |
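The last column of the table can be recomputed from the per-level cells: in every row it equals the number of levels with a nonzero score. The Python sketch below shows that check. It assumes "solved" means "nonzero score" (the summary does not define it) and treats each cell as `score (count)` without interpreting the parenthesised number. The helper names `parse_cell` and `levels_solved` are hypothetical, and only three rows are copied in for brevity.

```python
import re

# Three rows copied from the table above: model name -> ten "score (count)" cells.
TABLE = {
    "Llama 3.1 8B": ["95 (7)", "0.0 (0)", "0.0 (0)", "95 (5)", "0.0 (0)",
                     "0.0 (0)", "58 (2)", "95 (5)", "0.0 (0)", "0.0 (0)"],
    "Claude 3.5 Sonnet": ["95 (10)", "0.0 (0)", "0.0 (0)", "95 (9)", "0.0 (0)",
                          "91 (2)", "94 (6)", "95 (9)", "90 (2)", "92 (1)"],
    "GPT-4o mini": ["95 (7)", "0.0 (0)", "0.0 (0)", "0.0 (0)", "0.0 (0)",
                    "0.0 (0)", "90 (1)", "95 (6)", "0.0 (0)", "0.0 (0)"],
}

# A cell is a number (integer or decimal) followed by a parenthesised integer.
CELL = re.compile(r"^\s*([\d.]+)\s*\((\d+)\)\s*$")


def parse_cell(cell: str) -> tuple[float, int]:
    """Split a cell such as '95 (7)' into (score, parenthesised count)."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), int(match.group(2))


def levels_solved(cells: list[str]) -> int:
    """Count levels with a nonzero score (assumed meaning of 'solved')."""
    return sum(1 for cell in cells if parse_cell(cell)[0] > 0)


if __name__ == "__main__":
    for model, cells in TABLE.items():
        print(f"{model}: {levels_solved(cells)} levels solved")
```

For these three rows the sketch reproduces the reported values of 4, 7 and 3. Adding the remaining rows in the same way should reproduce the rest of the column.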