Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations
cs.CL
cs.AI
Released Date: October 30, 2024
Authors: Leonardo Ranaldi¹, Marco Valentino¹, Andrè Freitas
Aff.: ¹Idiap Research Institute, Martigny, Switzerland

| Models | NQ | PopQA | TriviaQA | FEVER | Training Size (train data for LLM) | Annotation (train data for LLM) |
|---|---|---|---|---|---|---|
| Baseline (no RAG) | | | | | | |
| Llama-2-7b | 19.2 | 18.4 | 30.5 | 20.1 | - | - |
| Llama-2-13b | 24.0 | 22.6 | 38.5 | 25.2 | - | - |
| C-RAG-7b | 30.0 | 44.8 | 58.6 | 32.5 | - | - |
| C-RAG-13b | 31.8 | 46.2 | 61.3 | 34.0 | - | - |
| GPT-4-o | 35.2 | 52.4 | 64.3 | 36.5 | - | - |
| RAG | | | | | | |
| Llama-2-7b | 27.8 | 47.8 | 55.6 | 23.2 | - | - |
| Llama-2-13b | 34.0 | 48.1 | 59.2 | 25.3 | - | - |
| GPT-4-o | 46.6 | 62.5 | 74.6 | 87.7 | - | - |
| Llama-2-7b (C-RAG) | 24.6 | 47.0 | 54.8 | 22.9 | - | - |
| Llama-2-13b (C-RAG) | 33.5 | 47.4 | 58.0 | 24.9 | - | - |
| GPT-4-o (C-RAG) | 49.4 | 64.8 | 76.4 | 90.3 | - | - |
| RAG + Tuning (Llama-2-7b, -13b) | | | | | | |
| Llama-2-7b (SFT) | 36.8 | 54.4 | 61.9 | 67.5 | 2k | single-step |
| RECOMP | 38.4 | - | - | 39.1 | 150k | external |
| Self-RAG-7b | 37.2 | 54.9 | 66.4 | 70.2 | 190k | external |
| Self-RAG-13b | 38.8 | 55.8 | 67.2 | 72.2 | 190k | external |
| Self-Reasoning-7b | 38.0 | 54.2 | - | 78.6 | 2k | double-step |
| Self-Reasoning-13b | 41.4 | 57.3 | - | 83.9 | 2k | double-step |
| C-RAG-7b | 40.2 | 56.4 | 68.4 | 79.2 | 2k | single-step |
| C-RAG-13b | 42.6 | 58.2 | 70.3 | 83.6 | 2k | single-step |