You are a Hittite spy. You report directly to the great king Šuppiluliuma I, who rules from his magnificent capital at Ḫattuša. Recently, the Hittites have been on good terms with their neighbours the Hurrians. Still, you can never be too sure about even the best of allies, and therefore the king has sent you to infiltrate the Hurrian bureaucracy so that you can pre-empt any conspiracies brewing within.1

You are so good at your job that you have totally compromised all incoming information channels into the Hurrian government apparatus—you can mess with their weather reports, their intelligence dossiers, their tax records, etc. However, there is one big problem preventing you from completely compromising this adversary: all of their internal communication is encrypted, and your cryptographers have made no progress at cracking the code. Furthermore, no one inside the government can be bribed to reveal how the encryption works, since it is so complex that no single person (let alone a random civil servant) understands it.2 Basically, you can only mess with what’s coming in (all information channels, including internal ones) and observe what goes out (governmental decisions made by the top brass).

Knowing who is in charge of what will be useful for identifying high-value targets or figuring out what minimal set of information channels to manipulate to steer the government into doing what you want. So how can you figure out how the government is structured?

This is the interpretability problem.

Here are some of the approaches the surprisingly sophisticated Hittite geopolitics experts have suggested over the years for cracking this problem:

  1. Probing: Access past internal encrypted communications and the resulting government decisions, and train machine learning models to guess what the government will do based on the communications alone. You’re just a spy though—you don’t understand machine learning and you don’t see how this tells you anything useful about who’s in charge of what. (Breaking out of character: this doesn’t establish causality between communications and decisions—you only learn that some information could be read off an activation, not whether the model actually uses it. And interpreting a model with another model is… an interesting choice. I feel like we can do better.)3
  2. Interpret attention patterns: While you can’t tell what the content of each encrypted message is, you do know who is sending it to whom and what information channels are being accessed. So you can establish who is whose superior, and what input information some decisions may be based on. But you are missing the actual meaningful part of the messages: their content! (Attention patterns are seductively easy to visualise, but again they don’t establish causality between parts of a model. The attention pattern doesn’t tell us how information is used or manipulated, only where it goes.)4
  3. Zero ablation: Erase some set of internal messages and see what happens. This might not tell you much though since it may just cause confusion in the governmental ranks, probably making some superiors angry at their subordinates for not transmitting important information rather than meaningfully affecting governmental decisions. (Zero ablation may throw the model off-distribution in an important way, since you have no idea what a zero vector as an internal activation actually means—e.g. maybe “no information” is encoded as a non-zero vector.)5
  4. Mean ablation: Take all encrypted communications recorded over a particular channel in the past. In this analogy we can’t really take the mean over activations, but let’s say we produce a really boring and average communication (e.g. the mode). This means no important (i.e. anomalous, high-surprisal) information gets transferred over this channel. So e.g. you may be able to figure out who in the government handles disaster response if you replace their “oh no!” message with an everyday “all clear” message. (This might again be weirdly off-distribution. In a real model, you might be taking the mean of a non-normally distributed activation and producing a vector the model has never seen before. See the code sketch after this list for what these ablations look like in practice.)
  5. Path patching or interchange intervention: Feed wrong information to only some people in the government or only along some chains of command. See how the response changes—you can causally establish what people are responsible for delivering certain kinds of information up the chain of command.6
  6. Resampling ablation: Pick a random encrypted message from the past. Substitute it in place of the one being transmitted today. Do this a bunch of times with different random messages and see what happens—e.g. maybe suddenly the government is unable to feed its population, in which case you know you messed with someone responsible for agriculture. (Easier to automate than the above but hard to figure out how to set up your experiments.)7

Okay now I’ll break out of character.
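To make the ablation-flavoured interventions above (zero, mean, and resampling ablation) concrete, here is a minimal sketch of what they look like on an actual model. It assumes a HuggingFace GPT-2 and a PyTorch forward hook on one MLP block’s output; the layer index, the prompts, and the little “reference batch” of boring past messages are placeholder choices for illustration, not anything canonical.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6  # arbitrary choice of which block's MLP output to intervene on
mlp = model.transformer.h[LAYER].mlp

# 1. Cache MLP activations on a reference prompt (the "boring past messages").
cached = []
handle = mlp.register_forward_hook(lambda mod, inp, out: cached.append(out.detach()))
with torch.no_grad():
    model(**tokenizer("all clear, nothing to report, all quiet everywhere", return_tensors="pt"))
handle.remove()
ref_acts = cached[0].reshape(-1, cached[0].shape[-1])  # [num_reference_tokens, hidden]

def ablation_hook(mode):
    """Forward hook that overwrites the MLP output according to `mode`."""
    def hook(module, inputs, output):
        if mode == "zero":      # zero ablation: erase the message entirely
            return torch.zeros_like(output)
        if mode == "mean":      # mean ablation: one mean vector over all reference tokens
            return ref_acts.mean(dim=0).expand_as(output)
        if mode == "resample":  # resampling ablation: a random past message
            idx = torch.randint(0, ref_acts.shape[0], (1,)).item()
            return ref_acts[idx].expand_as(output)
        return output
    return hook

# 2. Run the same prompt under each intervention and compare next-token predictions.
prompt = tokenizer("The harvest this year was", return_tensors="pt")
for mode in ["zero", "mean", "resample"]:
    handle = mlp.register_forward_hook(ablation_hook(mode))
    with torch.no_grad():
        logits = model(**prompt).logits[0, -1]
    handle.remove()
    print(mode, "->", tokenizer.decode(logits.argmax().item()))
```

A real experiment would restrict the intervention to particular positions, heads, or layers and measure something more targeted than the argmax token, but mechanically it is just these few lines of hooking.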

What is the best method for figuring out how a neural network computes something? It’s barely been a couple of years and so many different approaches have proliferated. I haven’t even listed all the possible approaches, because some of them are pretty hard to analogise—e.g., iterative nullspace projections attempt to remove concepts from the activation space by (iteratively) projecting activations onto the nullspace of linear probes trained to predict that concept (see Ravfogel et al., 2020). There’s also a bunch of work on fact-editing in transformers that is relevant to interpretability. And I don’t even mention explainability methods (which sometimes aren’t conceptually different from probing).
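Since nullspace projection is the hardest of these to analogise, here is a minimal numpy/sklearn sketch of its core operation, assuming a binary concept and logistic-regression probes; the actual INLP procedure in Ravfogel et al. (2020) is more careful about how successive probe directions are combined, so treat this as a cartoon rather than a faithful reimplementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def nullspace_projection(W):
    """Projection matrix onto the nullspace of W (each row of W is a probe direction)."""
    W = np.atleast_2d(W)
    rowspace_proj = W.T @ np.linalg.pinv(W @ W.T) @ W   # projects onto the span of the probe directions
    return np.eye(W.shape[1]) - rowspace_proj           # ...so this removes that span

def remove_concept(X, y, n_iters=5):
    """Iteratively train a probe for concept y on activations X, then project X
    onto the probe's nullspace, so the concept becomes linearly unrecoverable."""
    X_cur = X.copy()
    for _ in range(n_iters):
        probe = LogisticRegression(max_iter=1000).fit(X_cur, y)
        X_cur = X_cur @ nullspace_projection(probe.coef_).T
    return X_cur

def heldout_probe_acc(X, y):
    """Held-out accuracy of a fresh linear probe: how decodable is the concept?"""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Toy usage: 64-d "activations" where the concept lives (noisily) along dimension 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)
print(heldout_probe_acc(X, y), heldout_probe_acc(remove_concept(X, y), y))  # high vs. roughly chance
```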

Broadly, I think methods that establish causality are the most promising and well-founded. The argument for causality stems from the deficiency in probing that I think is best articulated in Geiger et al. (2023):

From an information-theoretic point of view, we can observe that using arbitrarily powerful probes is equivalent to measuring the mutual information between the concept and the neural representation (Hewitt and Liang, 2019; Pimentel et al., 2020). If we restrict the class of probing models based on their complexity, we can measure how usable the information is (Xu et al., 2020; Hewitt et al., 2021). Regardless of what probe models are used, successfully probing a neural representation does not guarantee that the representation plays a causal role in model behavior (Ravichander et al., 2020; Elazar et al., 2020; Geiger et al., 2020, 2021).

What we want to know, when interpreting how a model works, is how information is used to produce the output. What probing tells us is whether that information could be used to produce a desired output. To illustrate the point above about probes measuring mutual information: imagine your transformer model internally one-hot encodes your vocabulary. You want to probe whether the model knows if a word is a noun or not. With an MLP probe on the embeddings (with sufficient layers), you would think yes—because an NN with nonlinearities can approximate any function, and any function on your vocabulary can be computed from one-hot vector inputs! Your probe would basically just have to learn a lookup table, never mind whether or how the model actually uses that information. This is formalised in the data processing inequality and was first pointed out in the NLP literature by Pimentel et al. (2020).
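Here is that thought experiment as a few lines of toy code, with sklearn’s MLPClassifier standing in as the probe (my choice of probe, not anything from the cited papers): the “representations” are literal one-hot vectors and the “noun” labels are completely random, yet the probe decodes them essentially perfectly, because all it has to do is memorise a lookup table.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy "model" whose representation of each word is just a one-hot vector.
vocab_size = 200
rng = np.random.default_rng(0)
embeddings = np.eye(vocab_size)                 # one "activation" per word
is_noun = rng.integers(0, 2, size=vocab_size)   # arbitrary labels with no structure at all

# A probe with enough capacity happily "decodes" noun-ness from these representations...
probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=5000, random_state=0)
probe.fit(embeddings, is_noun)
print(probe.score(embeddings, is_noun))         # roughly 1.0: a memorised lookup table

# ...which tells us nothing about whether any downstream computation uses noun-ness,
# because the labels were random and the "model" never used them in the first place.
```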

We don’t actually care about whether information could be extracted from a representation. We care about whether it actually is, by the model we’re studying. Causal abstraction methods, which intervene on parts of the model using counterfactual inputs so that we can modify specific pieces of information, can tell us this.
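As a companion to the ablation sketch above, here is roughly what the simplest such intervention (a single activation patch, the smallest unit of both path patching and interchange interventions) looks like, again on a HuggingFace GPT-2 with placeholder choices of layer, position, and prompts: run the model on a counterfactual “source” prompt, cache one component’s activation, splice it into a run on the “base” prompt, and check whether the output moves.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER, POSITION = 8, -1               # which block and token position to patch (arbitrary)
block = model.transformer.h[LAYER]

base = tokenizer("The capital of France is", return_tensors="pt")
source = tokenizer("The capital of Poland is", return_tensors="pt")   # counterfactual input

# 1. Cache the block's residual-stream output at the chosen position on the source run.
cache = {}
def cache_hook(module, inputs, output):
    cache["act"] = output[0][:, POSITION].detach()   # GPT-2 blocks return a tuple
handle = block.register_forward_hook(cache_hook)
with torch.no_grad():
    model(**source)
handle.remove()

# 2. Splice that activation into the base run at the same site (the interchange).
def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POSITION] = cache["act"]
    return (hidden,) + output[1:]
handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched_logits = model(**base).logits[0, -1]
handle.remove()

with torch.no_grad():
    clean_logits = model(**base).logits[0, -1]

# If the patch moves the prediction from " Paris" toward " Warsaw", information about
# the country is carried through this site, causally, at this layer and position.
print("clean:  ", tokenizer.decode(clean_logits.argmax().item()))
print("patched:", tokenizer.decode(patched_logits.argmax().item()))
```

Path patching refines this by letting the patched activation flow only along particular downstream paths, and interchange interventions in the causal abstraction framework do it systematically against a hypothesised high-level causal model; but the single-site swap is the basic move in both.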


  1. I think NLP should employ more whacky analogies (see e.g. Leslie Lamport’s work on distributed systems). I am sorry in advance if this makes no sense! ↩︎

  2. I know it’s not a perfect analogy so you can poke holes in it if you like. Or just assume every government worker is completely loyal so you can’t just manipulate people to get what you want. ↩︎

  3. I am a real scholar so I will try to cite some papers. But there’s way too much probing literature to sift through. Start with ctrl-F “probing” on Rogers et al. (2021)↩︎

  4. Clark et al. (2019) ↩︎

  5. From here on out, we are in the Wild West of mechanistic interpretability, where much interesting work is very recent and confined to blogposts that can be kind of hard to understand, rather than arXiv papers (which tbh are often still hard to understand). The earliest use of zero ablation is probably nostalgebraist (2020), who zero-ablated internal model layers to get a sort of early peek into what activations may mean in terms of the vocabulary. You see the method often used by LessWrong-ers. Also, note that zero ablation should be pretty useless on models trained with dropout (since dropout applies zero ablation during training to increase model robustness), as pointed out by Neel Nanda↩︎

  6. Path patching comes from Wang et al. (2022), which was the first work to find a circuit accomplishing a specific task in GPT-2 Small. The method is expounded upon in Goldowsky-Dill et al. (2023) [which I am an author on]. Interchange interventions come from this line of work: Geiger et al. (2023), Geiger et al. (2021). The basic idea behind the two lines of work is the same, but I have yet to understand the latter series well enough to be confident that the implementations are the same too. ↩︎

  7. Resampling ablations are used in Goldowsky-Dill et al. (2023). A promising line of work is automated circuit discovery, as implemented by Conmy et al. (2023)↩︎