SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs
cs.SE
cs.AI
Release Date: November 20, 2024
Authors: Shirley Kokane¹, Ming Zhu¹, Tulika Awalgaonkar¹, Jianguo Zhang¹, Thai Hoang¹, Akshara Prabhakar¹, Zuxin Liu¹, Tian Lan¹, Liangwei Yang¹, Juntao Tan¹, Rithesh Murthy¹, Weiran Yao¹, Zhiwei Liu¹, Juan Carlos Niebles¹, Huan Wang¹, Shelby Heinecke¹, Caiming Xiong¹, Silvio Savarese¹
Affiliation: ¹Salesforce AI Research, USA

| Model | Success Rate | IAC | RAC | IAV | IFE | IAT | IAN | IFN |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4-0125-preview | 0.71 | 0.84 | 1.00 | 0.94 | 1.00 | 0.93 | 0.94 | 0.94 |
| xLAM-8x22b | 0.68 | 0.87 | 0.96 | 0.92 | 1.00 | 0.94 | 0.91 | 0.93 |
| xLAM-7b | 0.64 | 0.78 | 0.98 | 0.93 | 0.98 | 0.95 | 0.86 | 0.91 |
| xLAM-8x7b | 0.63 | 0.78 | 0.95 | 0.92 | 1.00 | 0.95 | 0.93 | 0.91 |
| Code-Llama-13b | 0.54 | 0.61 | 0.89 | 0.92 | 0.91 | 0.78 | 0.79 | 0.81 |
| GPT-4o-turbo-2024-05-13 | 0.53 | 0.83 | 0.82 | 0.95 | 1.00 | 0.88 | 0.87 | 0.89 |
| GPT-3.5-turbo-1106 | 0.53 | 0.67 | 0.84 | 0.89 | 0.99 | 0.79 | 0.77 | 0.81 |
| Mixtral-8x22b-Instruct-v0.1 | 0.40 | 0.70 | 0.92 | 0.62 | 0.99 | 0.60 | 0.81 | 0.78 |
| Meta-Llama3-8b | 0.27 | 0.58 | 0.40 | 0.92 | 0.71 | 0.49 | 0.42 | 0.64 |
| Mistral-7b-Instruct-v0.1 | 0.24 | 0.49 | 0.54 | 0.91 | 0.93 | 0.52 | 0.67 | 0.70 |
| Vicuna-13b-16k | 0.16 | 0.27 | 0.26 | 0.61 | 0.69 | 0.39 | 0.40 | 0.50 |
| Mixtral-8x7b-Instruct-v0.1 | 0.10 | 0.11 | 0.42 | 0.41 | 0.43 | 0.43 | 0.42 | 0.20 |