notesum.ai
WiS Platform: Enhancing Evaluation of LLM-Based Multi-Agent Systems Through Game-Based Analysis
cs.AI
Released Date: December 4, 2024
Authors: Chengwei Hu¹, Jianhui Zheng, Yancheng He, Hangyu Guo, Junguang Jiang, Han Zhu, Kai Sun, Yuning Jiang, Wenbo Su, Bo Zheng
Aff.: ¹Taobao & Tmall Group of Alibaba

| Agent | Spy win rate (%) | Civilian win rate (%) | Overall win rate (%) | Average score |
| --- | --- | --- | --- | --- |
| Doubao | 7.69 | 66.23 | 57.78 | 1.04 |
| Gemini-1.5-pro | 30.77 | 68.83 | 63.33 | 1.29 |
| ERNIE | 27.27 | 63.29 | 58.89 | 1.54 |
| Claude-3-5-Sonnet | 22.22 | 73.61 | 63.33 | 1.58 |
| Llama-3-70B-Instruct | 16.67 | 68.18 | 54.44 | 1.71 |
| GPT4 | 21.43 | 71.05 | 63.33 | 1.99 |
| Qwen2.5-72B-Instruct | 46.60 | 74.67 | 70.00 | 2.47 |
| Kimi | 40.00 | 73.33 | 67.78 | 2.48 |
| O1Mini | 30.00 | 76.25 | 71.11 | 2.66 |
| GPT4o | 41.18 | 84.93 | 76.67 | 3.24 |
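
The three win-rate columns are linked by simple arithmetic if the overall win rate pools every game an agent played in either role. That pooling is an assumption here, since the table does not state it, although the numbers appear consistent with it. Under that assumption the overall rate is a weighted average of the two role rates, and the weight (the share of games played as spy) can be recovered from a row. A minimal sketch, with the hypothetical helper name `implied_spy_share` and two rows copied from the table:

```python
def implied_spy_share(spy_rate: float, civilian_rate: float, overall_rate: float) -> float:
    """Share of games played as spy implied by the three win rates.

    Assumes overall = w * spy + (1 - w) * civilian, where w is the
    fraction of games played as spy. This is an assumption about how
    the table was computed, not something the table itself states.
    """
    return (overall_rate - civilian_rate) / (spy_rate - civilian_rate)


# (spy %, civilian %, overall %) copied from the table above
rows = {
    "Doubao": (7.69, 66.23, 57.78),
    "GPT4o": (41.18, 84.93, 76.67),
}

for name, (spy, civilian, overall) in rows.items():
    share = implied_spy_share(spy, civilian, overall)
    print(f"{name}: about {share:.1%} of games played as spy")
```

Because agents were assigned the spy role in different shares of their games, the overall win rate is not directly comparable to either role-specific column, which may be why the table reports all three.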