BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
cs.AI
Released Date: November 20, 2024
Authors: Davide Paglieri¹, Bartłomiej Cupiał², Samuel Coward³, Ulyana Piterbarg⁴, Maciej Wolczyk², Akbir Khan¹, Eduardo Pignatelli¹, Łukasz Kuciński, Lerrel Pinto⁴, Rob Fergus⁴, Jakob Nicolaus Foerster³, Jack Parker-Holder¹, Tim Rocktäschel
Affiliations: ¹AI Centre, University College London; ²IDEAS NCBR; ³University of Oxford; ⁴New York University

| Model | Average Progress (%) |
| --- | --- |
| gpt-4o | 32.34 ± 1.49 |
| claude-3.5-sonnet | 29.98 ± 1.98 |
| llama-3.1-70b-it | 27.88 ± 1.43 |
| llama-3.2-90B-it | 23.66 ± 1.09 |
| gemini-1.5-pro | 21.00 ± 1.18 |
| gpt-4o-mini | 17.36 ± 1.35 |
| llama-3.1-8b-it | 14.14 ± 1.51 |
| llama-3.2-11B-it | 13.54 ± 1.05 |
| gemini-1.5-flash | 9.73 ± 0.77 |
| llama-3.2-3B-it | 8.47 ± 1.12 |
| llama-3.2-1B-it | 6.32 ± 1.00 |