Authors: Lauro Langosco, Neel Alex, William Baker, David John Quarel, Herbie Bradley
Abstract: Mechanistic interpretability aims to open the black box of neural networks. Previous work has demonstrated that the mechanisms implemented by small neural networks can be fully reverse-engineered. Since these efforts rely on human labor that does not scale to models with billions of parameters, there is growing interest in automating interpretability methods. We propose to use meta-models, neural networks that take another network's parameters as input, to scale interpretability efforts. To this end, we present a scalable meta-model architecture and successfully apply it to a variety of problems, including mapping neural network parameters to human-legible code and detecting backdoors in networks. Our results aim to provide a proof of concept for automating mechanistic interpretability methods.
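To make the core idea concrete, below is a minimal sketch of a meta-model in the abstract's sense: a network whose input is another network's parameters and whose output is a judgment about that network (e.g. backdoored vs. clean). This is an illustration under stated assumptions, not the paper's architecture; all class names, chunk sizes, and layer counts (MetaModel, chunk_size, d_model, etc.) are hypothetical choices for the example.

```python
# Minimal meta-model sketch (illustrative only, not the authors' architecture):
# flatten a target network's parameters, chunk them into tokens, and run a
# small transformer encoder over the chunks to classify the target network.
import torch
import torch.nn as nn

def flatten_params(net: nn.Module) -> torch.Tensor:
    """Concatenate all parameters of the target network into one 1-D tensor."""
    return torch.cat([p.detach().flatten() for p in net.parameters()])

class MetaModel(nn.Module):
    """Maps a flattened parameter vector to class logits about the target net."""
    def __init__(self, chunk_size: int = 256, d_model: int = 128,
                 num_layers: int = 2, num_classes: int = 2):
        super().__init__()
        self.chunk_size = chunk_size
        self.embed = nn.Linear(chunk_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, flat_params: torch.Tensor) -> torch.Tensor:
        # Pad so the parameter vector splits evenly into fixed-size chunks.
        n = flat_params.numel()
        pad = (-n) % self.chunk_size
        x = torch.nn.functional.pad(flat_params, (0, pad))
        tokens = x.view(1, -1, self.chunk_size)   # (batch=1, seq, chunk_size)
        h = self.encoder(self.embed(tokens))      # (1, seq, d_model)
        return self.head(h.mean(dim=1))           # (1, num_classes)

# Usage: classify a small target MLP from its weights alone.
target = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
meta = MetaModel()
logits = meta(flatten_params(target))
print(logits.shape)  # torch.Size([1, 2])
```

In practice the meta-model would be trained on a dataset of many target networks with known labels (for instance, networks with and without implanted backdoors); the sketch only shows the input/output interface that makes parameters themselves the data.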