Published at November 13
Can sparse autoencoders be used to decompose and interpret steering vectors?
cs.LG
cs.AI
cs.CL
Released Date: November 13, 2024
Authors: Harry Mayne¹, Yushi Yang¹, Adam Mahdi¹
Aff.: ¹University of Oxford

| Corrigibility steering vector: feature | Corrigibility steering vector: activation | Zero vector: feature | Zero vector: activation |
| --- | --- | --- | --- |
| 4888 | 95.04 | 4888 | 89.06 |
| 15603 | 36.34 | 15603 | 35.94 |
| 12695 | 22.64 | 7589 | 19.80 |
| 7589 | 18.89 | 15471 | 11.84 |
| 2350 | 11.35 | 2350 | 10.74 |
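
As a reading aid, the two halves of the table can be compared directly. The following is a minimal sketch that only re-uses the feature indices and activations listed above. The variable names are illustrative assumptions and do not come from the paper or its code.

```python
# Top-5 SAE features and activations, copied from the table above.
# Dictionary names are hypothetical, chosen for this illustration only.
corrigibility = {4888: 95.04, 15603: 36.34, 12695: 22.64, 7589: 18.89, 2350: 11.35}
zero_vector = {4888: 89.06, 15603: 35.94, 7589: 19.80, 15471: 11.84, 2350: 10.74}

# Features that appear in the top five of both decompositions.
shared = sorted(set(corrigibility) & set(zero_vector))

# Features that appear in only one of the two top-five lists.
only_corrigibility = sorted(set(corrigibility) - set(zero_vector))
only_zero = sorted(set(zero_vector) - set(corrigibility))

# Activation difference (steering vector minus zero vector) per shared feature.
differences = {f: round(corrigibility[f] - zero_vector[f], 2) for f in shared}

print("shared features:", shared)
print("only in corrigibility list:", only_corrigibility)
print("only in zero-vector list:", only_zero)
print("activation differences:", differences)
```

On the values listed, four of the five top features are shared between the two columns, and each list contains one feature the other does not (12695 and 15471).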