notesum.ai
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
Categories: cs.CV, cs.AI, cs.CL
Released Date: November 7, 2024
Authors: Jaemin Cho¹, Debanjan Mahata², Ozan Irsoy², Yujie He², Mohit Bansal¹
Affiliations: ¹UNC Chapel Hill; ²Bloomberg

| Method | # Pages | Evidence: Image | Evidence: Table | Evidence: Text | Hops: Single-hop | Hops: Multi-hop | Overall EM | Overall F1 |
|---|---|---|---|---|---|---|---|---|
| Text RAG (w/ ColBERT v2) | | | | | | | | |
| Llama 3.1 8B | 1 | 8.3 | 15.7 | 29.6 | 25.3 | 12.3 | 15.4 | 20.0 |
| Llama 3.1 8B | 2 | 7.7 | 16.8 | 31.7 | 27.4 | 12.1 | 15.8 | 21.2 |
| Llama 3.1 8B | 4 | 7.8 | 21.0 | 34.1 | 29.4 | 15.2 | 17.8 | 23.7 |
| M3DocRAG (w/ ColPali) | | | | | | | | |
| Qwen2-VL 7B (Ours) | 1 | 25.1 | 27.8 | 39.6 | 37.2 | 25.0 | 27.9 | 32.3 |
| Qwen2-VL 7B (Ours) | 2 | 26.8 | 30.4 | 42.1 | 41.0 | 25.2 | 29.9 | 34.6 |
| Qwen2-VL 7B (Ours) | 4 | 24.7 | 30.4 | 41.2 | 43.2 | 26.6 | 31.4 | 36.5 |