achinth

[PART 1] circuit discovery: what the hell is it?

This is part one of a series of posts about learning biased circuits in vision/language models. I want to submit a paper to ICML at the end of this.

Like most people, I don't know anything about this entire sub-field. It's exhausting, humbling, and most of all exciting to learn something new about how we can understand generative models.

Thanks to Perplexity, I get to learn what this subfield is about. Let's get started.

what is circuit discovery and how did we get here? #

Circuit discovery is a technique from mechanistic interpretability that aims to identify and analyze the internals of generative models. The abstraction this technique interfaces with is the 'circuit': a pathway through which a transformer processes information, typically a small subgraph of components (attention heads and MLP layers) that implements a specific behaviour.

contextual decomposition #

One method of circuit discovery in transformers is 'contextual decomposition (CD) for transformers'. This works by:

- splitting each activation into a 'relevant' part (the contribution of the tokens or components under study) and an 'irrelevant' part (everything else);
- propagating that two-part decomposition through the model's linear layers, attention, and nonlinearities, so the split is preserved at every layer;
- scoring components by how much their 'relevant' contribution explains the model's output, and keeping the high-scoring ones as the circuit.
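As a rough sketch of the decomposition idea: each activation is carried forward as two parts, `beta` (relevant) and `gamma` (irrelevant), whose sum always equals the ordinary forward pass. The bias-splitting and ReLU-splitting conventions below are illustrative assumptions following the original contextual decomposition formulation, not the exact CD-for-transformers recipe.

```python
import numpy as np

def cd_linear(beta, gamma, W, b):
    """Propagate a (relevant, irrelevant) decomposition through a linear layer.

    Linear maps preserve the split exactly; here the bias is divided
    evenly between the two parts (one common convention, an assumption).
    """
    return beta @ W + b / 2, gamma @ W + b / 2

def cd_relu(beta, gamma):
    """Split a ReLU's output between the two parts.

    The interaction term (full output minus each part alone) is shared
    equally, so beta_out + gamma_out still equals ReLU(beta + gamma).
    """
    full = np.maximum(beta + gamma, 0.0)
    beta_alone = np.maximum(beta, 0.0)
    gamma_alone = np.maximum(gamma, 0.0)
    beta_out = 0.5 * (beta_alone + (full - gamma_alone))
    return beta_out, full - beta_out

# Example: treat feature 0 as the "relevant" input.
x = np.array([1.0, -2.0, 0.5])
mask = np.array([1.0, 0.0, 0.0])
beta, gamma = x * mask, x * (1 - mask)
```

The invariant to check after every layer is `beta + gamma == full activation`; the per-component score is then how much `beta` alone accounts for the output.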

advantages and disadvantages of contextual decomposition #

CD has been evaluated on standard circuit-evaluation tasks such as indirect object identification and greater-than comparisons, where it shows a high degree of faithfulness to the original model's behaviour, replicating its performance with fewer nodes than competing approaches. It also requires no manually crafted examples or additional training, making it applicable across various transformer architectures.

sparse autoencoders #

Sparse autoencoders (SAEs) are a specialized type of autoencoder that learns efficient representations of data by enforcing sparsity on the encoded representation. This works by:

- encoding activations into a (usually overcomplete) latent space;
- penalizing dense latent codes, e.g. with an L1 penalty or a top-k constraint, so only a few units fire for any given input;
- decoding back to reconstruct the original activations, so each sparse latent unit tends to align with a single interpretable feature.
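A minimal sketch of the idea, assuming the common ReLU-encoder + L1-penalty setup (the class name, sizes, and coefficient are illustrative, and training/optimization is omitted):

```python
import numpy as np

class SparseAutoencoder:
    """Tiny SAE: overcomplete linear encoder/decoder with an L1
    sparsity penalty on the latent code."""

    def __init__(self, d_in, d_hidden, l1_coeff=1e-3, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0.0, 0.1, size=(d_in, d_hidden))
        self.b_enc = np.zeros(d_hidden)
        self.W_dec = rng.normal(0.0, 0.1, size=(d_hidden, d_in))
        self.b_dec = np.zeros(d_in)
        self.l1_coeff = l1_coeff

    def encode(self, x):
        # ReLU keeps activations non-negative; together with the L1
        # penalty this drives most latent units to exactly zero.
        return np.maximum(x @ self.W_enc + self.b_enc, 0.0)

    def decode(self, z):
        return z @ self.W_dec + self.b_dec

    def loss(self, x):
        z = self.encode(x)
        x_hat = self.decode(z)
        recon = np.mean((x - x_hat) ** 2)   # reconstruction error
        sparsity = np.mean(np.abs(z))       # L1 sparsity term
        return recon + self.l1_coeff * sparsity
```

In interpretability work, `x` would be a batch of residual-stream activations from the transformer, and `d_hidden` is chosen much larger than `d_in` so individual latent units can specialize into interpretable features.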

tags: Circuit-Discovery, Mechanistic Interpretability