GameArena: Evaluating LLM Reasoning through Live Computer Games
cs.AI
cs.CL
Release Date: December 9, 2024
Authors: Lanxiang Hu¹, Qiyu Li¹, Anze Xie¹, Nan Jiang¹, Ion Stoica, Haojian Jin¹, Hao Zhang¹
Aff.: ¹University of California, San Diego

| Model | Akinator: Avg. Win Rate | Akinator: Avg. # Rounds | Taboo: Avg. Win Rate | Taboo: Avg. # Rounds | Bluffing: Avg. Win Rate | Bluffing: Avg. # Rounds |
|---|---|---|---|---|---|---|
| claude-3-5-sonnet-20240620 | 0.55 ± 0.11 | 16.61 ± 1.75 | 0.61 ± 0.18 | 3.36 ± 0.88 | 0.67 ± 0.13 | 6.00 ± 0.00 |
| gpt-4o-2024-08-06 | 0.49 ± 0.13 | 16.36 ± 0.86 | 0.67 ± 0.11 | 3.19 ± 0.34 | 0.58 ± 0.13 | 5.92 ± 0.18 |
| gemini-1.5-pro | 0.51 ± 0.17 | 16.57 ± 1.49 | 0.61 ± 0.04 | 3.74 ± 0.45 | 0.60 ± 0.18 | 5.96 ± 0.10 |
| llama-3.1-405b | 0.44 ± 0.04 | 17.15 ± 0.66 | 0.62 ± 0.18 | 3.08 ± 0.18 | 0.44 ± 0.22 | 5.90 ± 0.27 |
| mistral-large-latest | 0.02 ± 0.04 | 19.99 ± 0.02 | 0.66 ± 0.13 | 3.43 ± 0.57 | 0.0 ± 0.00 | 6.00 ± 0.00 |