Published on November 21
FunctionChat-Bench: Comprehensive Evaluation of Language Models' Generative Capabilities in Korean Tool-use Dialogs
cs.CL
cs.AI
Released Date: November 21, 2024
Authors: Shinbok Lee¹, Gaeun Seo¹, Daniel Lee¹, Byeongil Ko¹, Sunghee Jung¹, Myeongcheol Shin¹
Aff.: ¹Kakao Corp., Sungnam, South Korea

| Model | Tool Call | Answer Completion | Slot Question | Relevance Detection | Macro AVG | Micro AVG |
|---|---|---|---|---|---|---|
| gpt-4o | 0.94 | 0.97 | 0.86 | 0.91 | 0.92 | 0.94 |
| gpt-4-turbo | 0.96 | 0.99 | 0.92 | 0.96 | 0.96 | 0.96 |
| gpt-3.5-turbo | 0.97 | 0.92 | 0.58 | 0.61 | 0.77 | 0.84 |
| gemini-1.5-pro | 0.70 | 0.87 | 0.83 | 0.97 | 0.84 | 0.82 |
| gemini-1.5-flash | 0.66 | 0.94 | 0.89 | 0.74 | 0.81 | 0.81 |
| gemini-1.0-pro | 0.69 | 0.85 | 0.67 | 0.61 | 0.71 | 0.73 |
| functionary-medium | 0.56 | 0.94 | 0.69 | 0.65 | 0.71 | 0.73 |
| solar-1-mini-chat | 0.63 | 0.77 | 0.08 | 0.13 | 0.40 | 0.53 |
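
The macro AVG column appears consistent with an unweighted mean of the four per-category scores (Tool Call, Answer Completion, Slot Question, Relevance Detection). The following is a minimal sketch that checks this for a few rows, assuming that definition; the names `SCORES`, `REPORTED_MACRO`, `macro_average`, and `matches_reported` are illustrative and not from the paper. The micro AVG is not recomputed, since it would presumably need per-category item counts that the table does not give.

```python
from statistics import mean

# Per-category scores copied from the table, in column order:
# Tool Call, Answer Completion, Slot Question, Relevance Detection.
SCORES = {
    "gpt-4o": (0.94, 0.97, 0.86, 0.91),
    "gpt-4-turbo": (0.96, 0.99, 0.92, 0.96),
    "gemini-1.5-pro": (0.70, 0.87, 0.83, 0.97),
    "solar-1-mini-chat": (0.63, 0.77, 0.08, 0.13),
}

# Macro AVG values as reported in the table.
REPORTED_MACRO = {
    "gpt-4o": 0.92,
    "gpt-4-turbo": 0.96,
    "gemini-1.5-pro": 0.84,
    "solar-1-mini-chat": 0.40,
}


def macro_average(category_scores):
    """Unweighted mean over categories (assumed meaning of 'macro AVG')."""
    return mean(category_scores)


def matches_reported(model, tolerance=0.006):
    """True if the recomputed macro average agrees with the table.

    The tolerance allows for the table's rounding to two decimals.
    """
    return abs(macro_average(SCORES[model]) - REPORTED_MACRO[model]) < tolerance


if __name__ == "__main__":
    for model in SCORES:
        print(f"{model}: {macro_average(SCORES[model]):.4f} "
              f"(reported {REPORTED_MACRO[model]:.2f}, "
              f"match={matches_reported(model)})")
```

Because the inputs are already rounded to two decimals, the recomputed value can differ from the reported one by up to about 0.005, which is why the comparison uses a tolerance rather than exact equality.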