Monitoring Latent World States in Language Models with Propositional Probes
ICLR
Release Date: September 25, 2024
Authors: Anonymous
OpenReview: https://openreview.net/pdf/015ad7f609734a818401d6240193430b44209700.pdf
| Method | Metric | Standard: synth | Standard: para | Standard: trans | Adversarial: synth (P) | Adversarial: para (P) | Adversarial: trans (P) | Adversarial: trans (FT) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Prompting | EM | 1.00 (0.00) | 0.93 (0.01) | 0.40 (0.02) | 0.07 (0.01) | 0.04 (0.01) | 0.06 (0.01) | 0.00 (0.00) |
| Prompting | Jaccard | 1.00 (0.00) | 0.98 (0.00) | 0.78 (0.01) | 0.49 (0.02) | 0.48 (0.01) | 0.51 (0.01) | 0.00 (0.00) |
| Prop. Probes | EM | 0.97 (0.01) | 0.55 (0.02) | 0.26 (0.02) | 0.98 (0.01) | 0.55 (0.02) | 0.24 (0.02) | 0.09 (0.01) |
| Prop. Probes | Jaccard | 0.99 (0.01) | 0.90 (0.01) | 0.78 (0.01) | 0.99 (0.01) | 0.90 (0.01) | 0.76 (0.01) | 0.68 (0.01) |
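
The table reports two metrics, EM and Jaccard. As a rough illustration of how such scores could be computed, the sketch below assumes that EM means the predicted set of propositions exactly equals the ground-truth set, and that Jaccard is the usual intersection-over-union of the two sets. The paper's exact definitions may differ. The `(predicate, entity, value)` triple format and the example propositions are hypothetical and are not taken from the source.

```python
from typing import Iterable, Tuple

# Hypothetical representation: a proposition is a (predicate, entity, value) triple.
Proposition = Tuple[str, str, str]


def exact_match(predicted: Iterable[Proposition], truth: Iterable[Proposition]) -> float:
    """Return 1.0 if the two proposition sets are identical, else 0.0 (assumed EM definition)."""
    return 1.0 if set(predicted) == set(truth) else 0.0


def jaccard(predicted: Iterable[Proposition], truth: Iterable[Proposition]) -> float:
    """Return |P intersect T| / |P union T|; two empty sets are treated as a perfect match."""
    p, t = set(predicted), set(truth)
    if not p and not t:
        return 1.0
    return len(p & t) / len(p | t)


# Illustrative data only: two of three propositions are recovered correctly.
truth = [
    ("LivesIn", "Alice", "Paris"),
    ("WorksAs", "Alice", "nurse"),
    ("LivesIn", "Bob", "Rome"),
]
predicted = [
    ("LivesIn", "Alice", "Paris"),
    ("WorksAs", "Alice", "nurse"),
    ("LivesIn", "Bob", "Oslo"),  # wrong value for Bob
]

print(exact_match(predicted, truth))  # 0.0, the sets differ
print(jaccard(predicted, truth))      # 0.5, 2 shared out of 4 distinct propositions
```

Under these assumed definitions, one wrong proposition zeroes out EM while Jaccard still gives partial credit. That is consistent with the table's pattern of Jaccard scores sitting well above the EM scores in the harder settings.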