Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
cs.IR
cs.LG
Released Date: November 29, 2024
Authors: Rafael Teixeira de Lima¹, Shubham Gupta¹, Cesar Berrospi², Lokesh Mishra², Michele Dolfi², Peter Staar², Panagiotis Vagenas²
Affiliations: ¹IBM Research Paris-Saclay, 2 Rue d'Arsonval, Orsay, France; ²IBM Research Zurich, Säumerstrasse 4, Rüschlikon, Switzerland

| Dataset | Label | Dense | Lexical | Best recall | Best strategy |
| --- | --- | --- | --- | --- | --- |
| HotpotQA | Inclusive | 0.906 | 0.904 | 0.942 | 0.10 |
| | reasoning | 0.890 | 0.878 | 0.924 (-0.076) | 0.10 |
| | fact_single | 0.891 | 0.897 | 0.930 | 0.10 |
| | summary | 1.000 | 1.000 | 1.000 | 0.50 |
| MS MARCO | Inclusive | 0.752 | 0.719 | 0.804 | 0.10 |
| | reasoning | 0.708 | 0.706 | 0.784 (-0.051) | 0.20 |
| | fact_single | 0.790 | 0.770 | 0.835 | 0.05 |
| | summary | 0.777 | 0.696 | 0.820 | 0.05 |
| NaturalQ | Inclusive | 0.686 | 0.464 | 0.686 (-0.033) | 0.00 |
| | reasoning | 0.690 | 0.434 | 0.690 | 0.00 |
| | fact_single | 0.705 | 0.493 | 0.705 | 0.00 |
| | summary | 0.719 | 0.436 | 0.719 | 0.00 |
| NewsQA | Inclusive | 0.249 | 0.494 | 0.500 | 0.50 |
| | reasoning | 0.194 | 0.379 | 0.379 (-0.161) | 1.00 |
| | fact_single | 0.262 | 0.533 | 0.540 | 0.50 |
| | summary | 0.294 | 0.433 | 0.465 | 0.20 |
| PubMedQA | Inclusive | 0.949 | 0.895 | 0.935 (-0.052) | 0.05 |
| | reasoning | 0.947 | 0.885 | 0.947 | 0.00 |
| | fact_single | 0.987 | 0.952 | 0.987 | 0.00 |
| | summary | 0.985 | 0.959 | 0.985 | 0.00 |
| SQuAD2 | Inclusive | 0.776 | 0.831 | 0.871 | 0.10 |
| | reasoning | 0.757 | 0.671 | 0.789 (-0.104) | 0.10 |
| | fact_single | 0.818 | 0.852 | 0.893 | 0.10 |
| | summary | 0.834 | 0.751 | 0.834 | 0.00 |
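
The parenthesized values in the Best recall column are not explained in this excerpt. One reading that fits every dataset in the table is that each value is the lowest best recall within a dataset minus the highest, attached to the row holding the lowest value. The sketch below checks that reading against the numbers transcribed from the table. The names `best_recall`, `recall_gap`, and `gaps` are illustrative and do not come from the paper.

```python
# Best-recall values transcribed from the table, keyed by dataset, then by label.
best_recall = {
    "HotpotQA": {"Inclusive": 0.942, "reasoning": 0.924, "fact_single": 0.930, "summary": 1.000},
    "MS MARCO": {"Inclusive": 0.804, "reasoning": 0.784, "fact_single": 0.835, "summary": 0.820},
    "NaturalQ": {"Inclusive": 0.686, "reasoning": 0.690, "fact_single": 0.705, "summary": 0.719},
    "NewsQA": {"Inclusive": 0.500, "reasoning": 0.379, "fact_single": 0.540, "summary": 0.465},
    "PubMedQA": {"Inclusive": 0.935, "reasoning": 0.947, "fact_single": 0.987, "summary": 0.985},
    "SQuAD2": {"Inclusive": 0.871, "reasoning": 0.789, "fact_single": 0.893, "summary": 0.834},
}


def recall_gap(rows):
    """Return (label with the lowest best recall, lowest minus highest).

    Assumption: this is how the parenthesized table values were derived.
    The result is rounded to three decimals to match the table's precision.
    """
    worst_label = min(rows, key=rows.get)
    gap = rows[worst_label] - max(rows.values())
    return worst_label, round(gap, 3)


gaps = {dataset: recall_gap(rows) for dataset, rows in best_recall.items()}

for dataset, (label, gap) in gaps.items():
    print(f"{dataset:9s} {label:12s} {gap:+.3f}")
```

Under this reading, the computed gap and the label it lands on match the table for all six datasets, for example `reasoning` with -0.076 on HotpotQA and `Inclusive` with -0.033 on NaturalQ.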