RAG-ConfusionQA: A Benchmark for Evaluating LLMs on Confusing Questions
Subjects: cs.AI, cs.CL, cs.CR
Release Date: October 18, 2024
Authors: Zhiyuan Peng¹, Jinming Nian¹, Alexandre Evfimievski², Yi Fang¹
Affiliations: ¹Santa Clara University; ²Not specified

| Model | business | entertainment | food | music | news | politics | science | sport | tech | travel | Avg | Std Dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-3.5 | 18.90 | 12.39 | 13.55 | 15.48 | 15.15 | 17.13 | 10.81 | 17.12 | 10.01 | | 15.08 | 3.21 |
| Llama 3.2 3B | 68.34 | 66.11 | 64.89 | 67.90 | 67.50 | 66.94 | 57.98 | 67.07 | 60.06 | | 66.31 | 4.70 |
| Mistral 7B v0.3 | 71.27 | 70.37 | 65.81 | 71.60 | 71.62 | 69.26 | 60.35 | 69.97 | 64.29 | | 69.35 | 4.77 |
| Llama 3.1 70B | 74.48 | 72.55 | 70.33 | 75.31 | 71.33 | 71.39 | 65.60 | 73.17 | 70.38 | | 72.79 | 4.34 |
| Llama 3.1 8B | 76.18 | 74.03 | 70.23 | 76.07 | 73.15 | 72.87 | 64.37 | 73.47 | 68.11 | | 72.99 | 4.45 |
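
The Avg and Std Dev columns summarize each model's score and its spread across the ten domains. A minimal sketch of that arithmetic in Python, assuming Std Dev is the population standard deviation over the per-domain scores (the sample form, `statistics.stdev`, would give slightly larger values); the numbers below are illustrative, not the table's actual rows:

```python
from statistics import mean, pstdev

# Hypothetical per-domain scores for one model, in the table's column
# order (business .. travel). Illustrative values, not paper data.
domain_scores = [70.2, 68.5, 66.1, 71.0, 69.4, 68.0, 61.3, 69.9, 63.8, 74.5]

avg = mean(domain_scores)    # "Avg" column: mean over the ten domains
std = pstdev(domain_scores)  # "Std Dev" column, assuming the population form
print(f"Avg = {avg:.2f}, Std Dev = {std:.2f}")
```

A larger Std Dev at a similar Avg indicates less consistent performance across domains; in the rows above, the science and tech scores sit well below each model's average, which is what drives the reported deviations.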