SimRAG: Self-Improving Retrieval-Augmented Generation for Adapting Large Language Models to Specialized Domains
Categories: cs.LG, cs.AI
Release Date: October 23, 2024
Authors: Ran Xu (1), Hui Liu (2), Sreyashi Nag (2), Zhenwei Dai (2), Yaochen Xie (2), Xianfeng Tang (2), Chen Luo (2), Yang Li (2), Joyce C. Ho (1), Carl Yang (1), Qi He (2)
Affiliations: (1) Emory University; (2) Amazon

| Model | PubMedQA (ACC) | BioASQ (ACC) | MedQA (ACC) | MedMCQA (ACC) | MMLU-med (ACC) | LiveQA (Rouge-L∗ / MAUVE) | MedicationQA (Rouge-L∗ / MAUVE) | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *Proprietary LLMs, for reference only* | | | | | | | | |
| GPT-3.5 (OpenAI, 2022) | 67.40 | 90.29 | 66.61 | 58.04 | 75.48 | 42.3 / 62.5 | 36.3 / 46.0 | 62.35 |
| GPT-4 (OpenAI, 2023) | 70.60 | 92.56 | 82.80 | 66.65 | 87.24 | 44.0 / 65.9 | 41.5 / 59.2 | 69.34 |
| *Medical LLMs* | | | | | | | | |
| PMC-Llama 13B (Wu et al., 2024) | 56.00 | 65.21 | 42.58 | 48.29 | 52.53 | 35.7 / 60.6 | 36.4 / 38.3 | 48.10 |
| MEDITRON 70B (Chen et al., 2023) | 56.40 | 76.86 | 49.57 | 52.67 | 65.38 | — | — | — |
| AdaptLLM-v2 8B (Cheng et al., 2024) | 45.00 | 78.80 | 43.13 | 42.74 | 51.24 | 30.2 / 48.0 | 39.2 / 51.4 | 47.19 |
| BioMistral 7B (Labrak et al., 2024) | 59.20 | 82.69 | 32.52 | 32.20 | 47.47 | 43.1 / 63.2 | 39.6 / 51.9 | 48.11 |
| MedLlama3 8B (John Snow Labs, 2024) | 74.20 | 83.50 | 61.43 | 61.18 | 77.13 | 27.9 / 45.2 | 29.8 / 35.0 | 59.31 |
| *Retrieval-Augmented LLMs* | | | | | | | | |
| Self-RAG 13B† (Asai et al., 2024) | 71.20 | 73.70 | 48.60 | 44.00 | 53.90 | 35.6 / 54.1 | 39.3 / 46.4 | 52.33 |
| ChatQA1.5 8B (Liu et al., 2024) | 66.40 | 82.69 | 42.36 | 46.97 | 61.40 | 39.3 / 65.5 | 39.9 / 48.9 | 54.15 |
| ChatQA1.5 70B (Liu et al., 2024) | 74.80 | 83.17 | 68.89 | 62.54 | 80.51 | 40.1 / 66.3 | 40.8 / 50.2 | 64.40 |
| *‡Backbone: Llama3-8B-Instruct* | | | | | | | | |
| Llama3-8B-it (Meta-AI, 2024) | 64.60 | 88.51 | 55.30 | 58.91 | 69.79 | 34.1 / 54.1 | 37.2 / 45.6 | 58.34 |
| RAFT 8B† (Zhang et al., 2024c) | 73.40 | 88.67 | 54.28 | 60.15 | 70.25 | 36.2 / 55.6 | 38.9 / 56.4 | 60.26 |
| EvidenceRAG 8B† (Schimanski et al., 2024) | 75.00 | 90.61 | 57.74 | 61.13 | 72.27 | 36.6 / 57.8 | 34.6 / 53.6 | 61.14 |
| SimRAG 8B | 80.00 | 91.75 | 62.92 | 67.51 | 75.57 | 44.4 / 66.6 | 40.1 / 57.4 | 66.04 |
| SimRAG 8B, w/o Stage II | 78.00 | 90.45 | 60.56 | 65.22 | 74.56 | 42.8 / 62.9 | 38.5 / 55.6 | 64.30 |
| *‡Backbone: Gemma2-27B-Instruct* | | | | | | | | |
| Gemma2-27B-it (Team et al., 2024) | 56.20 | 89.32 | 59.70 | 57.30 | 75.67 | 37.4 / 52.8 | 40.2 / 57.0 | 59.40 |
| RAFT 27B† (Zhang et al., 2024c) | 67.20 | 91.70 | 62.22 | 61.56 | 78.97 | 39.4 / 62.2 | 40.2 / 48.2 | 63.04 |
| EvidenceRAG 27B† (Schimanski et al., 2024) | 63.00 | 90.61 | 62.14 | 61.80 | 79.43 | 34.5 / 58.6 | 34.5 / 44.6 | 60.85 |
| SimRAG 27B | 73.60 | 92.07 | 63.63 | 64.16 | 81.63 | 39.9 / 66.8 | 41.2 / 62.1 | 65.17 |
| SimRAG 27B, w/o Stage II | 66.00 | 91.59 | 62.45 | 58.67 | 79.61 | 37.2 / 61.6 | 40.8 / 58.6 | 62.33 |
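The Avg. column is consistent with an unweighted mean over the five accuracy scores and the two Rouge-L scores, with the MAUVE values excluded. The section does not state this rule explicitly, so the sketch below is an assumption, checked against the reported numbers (e.g. the SimRAG 8B row reproduces the listed 66.04):

```python
# Minimal sketch (assumption, not from the paper): reproduce the Avg. column as
# the mean of the five ACC values and the two Rouge-L values, ignoring MAUVE.

def row_average(accuracies, rouge_l_scores):
    """Average the five ACC values and the two Rouge-L values (MAUVE excluded)."""
    scores = list(accuracies) + list(rouge_l_scores)
    return sum(scores) / len(scores)

# SimRAG 8B row from the table above.
acc = [80.00, 91.75, 62.92, 67.51, 75.57]   # PubMedQA, BioASQ, MedQA, MedMCQA, MMLU-med
rouge_l = [44.4, 40.1]                      # LiveQA, MedicationQA
print(round(row_average(acc, rouge_l), 2))  # -> 66.04, matching the reported Avg.
```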