Graph-based Confidence Calibration for Large Language Models
Categories: cs.CL, cs.AI, cs.IR, cs.LG
Released Date: November 3, 2024
Authors: Yukun Li (1), Sijia Wang (2), Lifu Huang (3), Li-Ping Liu (1)
Affiliations: (1) Tufts University; (2) Virginia Tech; (3) University of California, Davis

| Method | TriviaQA Brier | TriviaQA AUROC | TriviaQA ECE | CoQA Brier | CoQA AUROC | CoQA ECE |
|---|---|---|---|---|---|---|
| GraphSpectral (GS) | 0.22 | 0.80 | 0.056 | 0.19 | 0.76 | 0.091 |
| GS + Iso | 0.16 | 0.80 | 0.072 | 0.16 | 0.76 | 0.054 |
| GS + Platt | 0.16 | 0.79 | 0.048 | 0.16 | 0.74 | 0.041 |
| SelfCheckGPT | 0.33 | 0.65 | 0.187 | 0.21 | 0.63 | 0.178 |
| Seq. likelihood | 0.54 | 0.53 | 0.22 | 0.38 | 0.47 | 0.17 |
| Platt | 0.28 | 0.59 | 0.05 | 0.26 | 0.47 | 0.09 |
| Verbalized Qual | 0.32 | 0.62 | 0.14 | 0.30 | 0.68 | 0.16 |
| Verbalized % | 0.25 | 0.67 | 0.033 | 0.42 | 0.66 | 0.21 |
| APRICOT | 0.14 | 0.76 | 0.074 | 0.17 | 0.78 | 0.132 |
| APRICOT + Iso | 0.18 | 0.76 | 0.073 | 0.17 | 0.78 | 0.097 |
| APRICOT + Platt | 0.17 | 0.76 | 0.039 | 0.17 | 0.78 | 0.069 |
| Ours | 0.14 | 0.84 | 0.025 | 0.12 | 0.77 | 0.016 |
| Ours (Multi prompts) | 0.14 | 0.84 | 0.025 | 0.12 | 0.78 | 0.015 |
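
For readers unfamiliar with the three columns, the sketch below shows one common way to compute them from per-answer confidence scores and 0/1 correctness labels. Lower is better for Brier score and ECE; higher is better for AUROC. This is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, and the choice of 10 equal-width bins for ECE is an assumption that may differ from the setup behind the table.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Mean squared gap between confidence and the 0/1 correctness label."""
    return sum((c - y) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed binning scheme).

    Each bin contributes |accuracy - mean confidence|, weighted by the
    fraction of samples that fall into it.
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def auroc(conf: Sequence[float], correct: Sequence[int]) -> float:
    """Chance that a correct answer outranks an incorrect one (ties count 0.5).

    Requires at least one correct and one incorrect answer.
    """
    pos = [c for c, y in zip(conf, correct) if y == 1]
    neg = [c for c, y in zip(conf, correct) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data, made up for illustration only.
confidences = [0.9, 0.8, 0.6, 0.3]
labels = [1, 1, 0, 0]
print(brier_score(confidences, labels))
print(expected_calibration_error(confidences, labels))
print(auroc(confidences, labels))
```

The pairwise AUROC above is quadratic in the number of answers, which is fine for a small illustration; a rank-based implementation would be preferable on full benchmark outputs.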