Addressing Uncertainty in LLMs to Enhance Reliability in Generative AI
cs.AI
Released Date: November 4, 2024
Authors: Ramneet Kaur¹, Colin Samplawski¹, Adam D. Cobb¹, Anirban Roy¹, Brian Matejek¹, Manoj Acharya¹, Daniel Elenius¹, Alexander M. Berenbeim², John A. Pavlik², Nathaniel D. Bastian², Susmit Jha¹
Affiliations: ¹Neuro-symbolic Computing and Intelligence, SRI, Menlo Park, USA; ²Army Cyber Institute, United States Military Academy, West Point, NY, USA

| Model | Eval. | CoQA: Model Acc. | CoQA: Sem. Ent. (Unnorm/Norm) | CoQA: EigV | CoQA: Ours (Unnorm/Norm) | TriviaQA: Model Acc. | TriviaQA: Sem. Ent. (Unnorm/Norm) | TriviaQA: EigV | TriviaQA: Ours (Unnorm/Norm) |
|---|---|---|---|---|---|---|---|---|---|
| Llama-13b | GPT-4 | 73.22 | 58.97/56.90 | 54.63 | 58.42/55.32 | 67.03 | 40.09/40.27 | 39.42 | 39.92/39.38 |
| Mistral-7b | GPT-4 | 73.38 | 63.06/62.02 | 59.83 | 62.77/61.41 | 60.68 | 35.57/35.04 | 33.19 | 35.13/33.29 |
| Mean | GPT-4 | 73.30 | 61.02/59.46 | 57.23 | 60.60/58.37 | 63.86 | 37.83/37.66 | 36.31 | 37.53/36.34 |
| Llama-13b | RougeL | 72.75 | 56.75/55.53 | 53.78 | 55.78/52.65 | 64.60 | 39.12/39.39 | 38.93 | 38.81/38.35 |
| Mistral-7b | RougeL | 44.74 | 27.62/29.65 | 27.12 | 27.37/28.26 | 42.33 | 17.15/19.56 | 17.06 | 16.95/18.11 |
| Mean | RougeL | 58.75 | 42.19/42.59 | 40.45 | 41.58/40.46 | 53.47 | 28.14/29.48 | 28.00 | 27.88/28.23 |
| Llama-13b | Deberta | 63.74 | 46.07/46.91 | 42.04 | 45.07/43.56 | 63.33 | 37.23/37.94 | 36.70 | 36.88/36.84 |
| Mistral-7b | Deberta | 11.23 | 3.84/5.70 | 4.13 | 3.82/5.00 | 33.92 | 11.00/13.54 | 11.35 | 10.89/12.45 |
| Mean | Deberta | 37.49 | 24.96/26.31 | 23.09 | 24.45/24.28 | 48.63 | 24.12/25.74 | 24.03 | 23.89/24.65 |