FactTest: Factuality Testing in Large Language Models with Statistical Guarantees
cs.CL
cs.AI
stat.ML
Released Date: November 4, 2024
Authors: Fan Nie¹, Xiaotian Hou², Shuhang Lin², James Zou¹, Huaxiu Yao³, Linjun Zhang²
Affiliations: ¹Stanford University; ²Rutgers University; ³UNC-Chapel Hill

| Dataset | Model | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| ParaRel | OpenLLaMA-3B | 0.0455 | 0.0467 | 0.0513 | 0.0479 | 0.0520 | 0.0486 | 0.0342 |
| | OpenLLaMA-7B | 0.0225 | 0.0093 | 0.0145 | 0.0393 | 0.0394 | 0.0435 | 0.0400 |
| | OpenLLaMA-13B | 0.0192 | 0.0087 | 0.0302 | 0.0341 | 0.0477 | 0.0337 | 0.0331 |
| HotpotQA | OpenLLaMA-3B | 0.0242 | 0.0247 | 0.0272 | 0.0289 | 0.0319 | 0.0297 | 0.0309 |
| | OpenLLaMA-7B | 0.0273 | 0.0298 | 0.0295 | 0.0344 | 0.0298 | 0.0308 | 0.0266 |
| | LLaMA-13B | 0.0200 | 0.0226 | 0.0367 | 0.0278 | 0.0300 | 0.0286 | 0.0353 |
| WiCE | OpenLLaMA-3B | 0.0325 | 0.0089 | 0.0207 | 0.0175 | 0.0029 | 0.0118 | – |
| | OpenLLaMA-7B | 0.0694 | 0.0579 | 0.0617 | 0.0077 | 0.0 | 0.0039 | – |
| | LLaMA-13B | 0.0266 | 0.0290 | 0.0363 | 0.0 | 0.0072 | 0.0024 | – |
| FEVER | OpenLLaMA-3B | 0.0164 | 0.0005 | 0.0217 | 0.0570 | 0.0471 | 0.0496 | – |
| | LLaMA-7B | 0.0598 | 0.0081 | 0.0329 | 0.0392 | 0.0495 | 0.0495 | – |
| | LLaMA-13B | 0.0172 | 0.0383 | 0.0293 | 0.0459 | 0.0518 | 0.0552 | – |