notesum.ai

Published on October 30

Eliciting Critical Reasoning in Retrieval-Augmented Language Models via Contrastive Explanations

cs.CL

cs.AI

Released Date: October 30, 2024

Authors: Leonardo Ranaldi¹, Marco Valentino¹, André Freitas

Aff.: ¹Idiap Research Institute, Martigny, Switzerland

arXiv: http://arxiv.org/abs/2410.22874v1

Models	NQ	PopQA	TriviaQA	FEVER	Train Data Size	Train Data Annotation
Baseline (no RAG)
Llama-2-7b	19.2	18.4	30.5	20.1	-	-
Llama-2-13b	24.0	22.6	38.5	25.2	-	-
C-RAG-7b	30.0	44.8	58.6	32.5	-	-
C-RAG-13b	31.8	46.2	61.3	34.0	-	-
GPT-4-o	35.2	52.4	64.3	36.5	-	-
RAG
Llama-2-7b	27.8	47.8	55.6	23.2	-	-
Llama-2-13b	34.0	48.1	59.2	25.3	-	-
GPT-4-o	46.6	62.5	74.6	87.7	-	-
Llama-2-7b (C-RAG)	24.6	47.0	54.8	22.9	-	-
Llama-2-13b (C-RAG)	33.5	47.4	58.0	24.9	-	-
GPT-4-o (C-RAG)	49.4	64.8	76.4	90.3	-	-
RAG + Tuning (Llama-2-7b, -13b)
Llama-2-7b (SFT)	36.8	54.4	61.9	67.5	2k	single-step
RECOMP	38.4	-	-	39.1	150k	external
Self-RAG-7b	37.2	54.9	66.4	70.2	190k	external
Self-RAG-13b	38.8	55.8	67.2	72.2	190k	external
Self-Reasoning-7b	38.0	54.2	-	78.6	2k	double-step
Self-Reasoning-13b	41.4	57.3	-	83.9	2k	double-step
C-RAG-7b	40.2	56.4	68.4	79.2	2k	single-step
C-RAG-13b	42.6	58.2	70.3	83.6	2k	single-step