notesum.ai

Published on November 13

Can sparse autoencoders be used to decompose and interpret steering vectors?

cs.LG

cs.AI

cs.CL

Released Date: November 13, 2024

Authors: Harry Mayne¹, Yushi Yang¹, Adam Mahdi¹

Aff.: ¹University of Oxford

Arxiv: http://arxiv.org/abs/2411.08790v1

Corrigibility steering vector		Zero vector
Feature	Activation	Feature	Activation
4888	95.04	4888	89.06
15603	36.34	15603	35.94
12695	22.64	7589	19.80
7589	18.89	15471	11.84
2350	11.35	2350	10.74
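
The two columns of the table can be compared directly. The following is a minimal sketch, not the authors' code: it copies the ten feature/activation pairs from the table into two dictionaries (the variable names are illustrative assumptions) and reports which feature indices appear in both top-5 lists and how much their activations differ.

```python
# Top-5 SAE features and activations, copied verbatim from the table above.
# The dictionary names are illustrative; they do not come from the paper.
corrigibility = {4888: 95.04, 15603: 36.34, 12695: 22.64, 7589: 18.89, 2350: 11.35}
zero_vector = {4888: 89.06, 15603: 35.94, 7589: 19.80, 15471: 11.84, 2350: 10.74}

# Feature indices present in both top-5 lists, and those unique to each.
shared = sorted(set(corrigibility) & set(zero_vector))
only_corrigibility = sorted(set(corrigibility) - set(zero_vector))
only_zero = sorted(set(zero_vector) - set(corrigibility))

print("shared features:", shared)
print("only in corrigibility top 5:", only_corrigibility)
print("only in zero-vector top 5:", only_zero)

# Signed activation gap (corrigibility minus zero vector) for each shared feature.
for feature in shared:
    gap = corrigibility[feature] - zero_vector[feature]
    print(f"feature {feature}: activation difference {gap:+.2f}")
```

Four of the five feature indices coincide between the two lists; only 12695 (corrigibility) and 15471 (zero vector) differ.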