notesum.ai

Published on November 4

FactTest: Factuality Testing in Large Language Models with Statistical Guarantees

cs.CL, cs.AI, stat.ML

Released Date: November 4, 2024

Authors: Fan Nie¹, Xiaotian Hou², Shuhang Lin², James Zou¹, Huaxiu Yao³, Linjun Zhang²

Affiliations: ¹Stanford University; ²Rutgers University; ³UNC-Chapel Hill

Arxiv: http://arxiv.org/abs/2411.02603v1


Dataset	Model	$\text{FTest-ve}_{5}$	$\text{FTest-ve}_{10}$	$\text{FTest-ve}_{15}$	$\text{FTest-se}_{5}$	$\text{FTest-se}_{10}$	$\text{FTest-se}_{15}$	$\text{FTest-kle}_{15}$
ParaRel	OpenLLaMA-3B	0.0455	0.0467	0.0513	0.0479	0.0520	0.0486	0.0342
	OpenLLaMA-7B	0.0225	0.0093	0.0145	0.0393	0.0394	0.0435	0.0400
	OpenLLaMA-13B	0.0192	0.0087	0.0302	0.0341	0.0477	0.0337	0.0331
HotpotQA	OpenLLaMA-3B	0.0242	0.0247	0.0272	0.0289	0.0319	0.0297	0.0309
	OpenLLaMA-7B	0.0273	0.0298	0.0295	0.0344	0.0298	0.0308	0.0266
	LLaMA-13B	0.0200	0.0226	0.0367	0.0278	0.0300	0.0286	0.0353
WiCE	OpenLLaMA-3B	0.0325	0.0089	0.0207	0.0175	0.0029	0.0118	–
	OpenLLaMA-7B	0.0694	0.0579	0.0617	0.0077	0.0	0.0039	–
	LLaMA-13B	0.0266	0.0290	0.0363	0.0	0.0072	0.0024	–
FEVER	OpenLLaMA-3B	0.0164	0.0005	0.0217	0.0570	0.0471	0.0496	–
	LLaMA-7B	0.0598	0.0081	0.0329	0.0392	0.0495	0.0495	–
	LLaMA-13B	0.0172	0.0383	0.0293	0.0459	0.0518	0.0552	–