The Hittite problem
You are a Hittite spy. You report directly to the great king Šuppiluliuma I, who rules from his magnificent capital at Ḫattuša. Recently, the Hittites have been on good terms with their neighbours the Hurrians. Still, you can never be too sure about even the best of allies, and therefore the king has sent you to infiltrate the Hurrian bureaucracy so that you can pre-empt any conspiracies brewing within. I think NLP should employ more whacky analogies (see e.g. Leslie Lamport’s work on distributed systems). I am sorry in advance if this makes no sense!
You are so good at your job that you have totally compromised all incoming information channels into the Hurrian government apparatus—you can mess with their weather reports, their intelligence dossiers, their tax records, etc. However, there is one big problem preventing you from completely compromising this adversary: all of their internal communication is encrypted and your cryptographers have made no progress in cracking the code. Furthermore, no one inside the government can be bribed to reveal how the encryption works, since the encryption method is so complex that no single person (let alone a random civil servant) can understand how it works. I know it’s not a perfect analogy so you can poke holes in it if you like. Or just assume every government worker is completely loyal so you can’t just manipulate people to get what you want. Basically, you can only mess with what’s coming in (all information channels, including internal ones) and observe what goes out (governmental decisions made by the top brass).
Knowing who is in charge of what will be useful for identifying high-value targets or figuring out what minimal set of information channels to manipulate to steer the government into doing what you want. So how can you figure out how the government is structured?
This is the interpretability problem.
Here are some of the approaches the surprisingly sophisticated Hittite geopolitics experts have suggested over the years for cracking this problem:
Okay, now I’ll break out of character.
What is the best method for figuring out how a neural network computes something? It’s barely been a couple of years and already so many different approaches have proliferated. I haven’t even listed all the possible approaches because some of them are pretty hard to analogise—e.g., iterative nullspace projection attempts to remove concepts from the activation space by (iteratively) projecting activations onto the nullspace of linear probes trained to predict that concept (see Ravfogel et al., 2020). There’s a bunch of work on fact-editing in transformers that is relevant to interpretability as well. I also don’t mention explainability methods (which sometimes aren’t conceptually different from probing).
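For concreteness, here is a minimal sketch of the iterative nullspace projection idea (not Ravfogel et al.’s actual implementation; the probe class, data shapes, and iteration count are arbitrary choices of mine): train a linear probe for the concept, project the activations onto the probe’s nullspace, and repeat until the concept is no longer linearly decodable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_nullspace_projection(X, y, n_iters=10):
    """Toy INLP sketch: X is (n_samples, d) activations, y is a concept label.
    Repeatedly train a linear probe for y and project X onto its nullspace."""
    d = X.shape[1]
    P = np.eye(d)          # running projection applied to the activations
    X_proj = X.copy()
    for _ in range(n_iters):
        probe = LogisticRegression(max_iter=1000).fit(X_proj, y)
        W = probe.coef_    # probe direction(s), shape (n_classes or 1, d)
        # Projection onto the nullspace of W: I - W^T (W W^T)^-1 W
        P_null = np.eye(d) - W.T @ np.linalg.pinv(W @ W.T) @ W
        P = P_null @ P
        X_proj = X @ P.T   # the concept gets progressively harder to probe
    return X_proj, P
```

Each round removes the linear subspace the current probe relies on, so after enough rounds a linear probe for the concept is pushed toward chance accuracy (at some cost to whatever else lives in those directions).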
Broadly, I think methods that establish causality are the most promising and well-founded. The argument for causality stems from a deficiency of probing that I think is best articulated in Geiger et al. (2023):
From an information-theoretic point of view, we can observe that using arbitrarily powerful probes is equivalent to measuring the mutual information between the concept and the neural representation (Hewitt and Liang, 2019; Pimentel et al., 2020). If we restrict the class of probing models based on their complexity, we can measure how usable the information is (Xu et al., 2020; Hewitt et al., 2021). Regardless of what probe models are used, successfully probing a neural representation does not guarantee that the representation plays a causal role in model behavior (Ravichander et al., 2020; Elazar et al., 2020; Geiger et al., 2020, 2021).
What we want to know, when interpreting how a model works, is how information is used to produce the output. What probing tells us is whether that information could be used to produce a desired output. To illustrate the point above about probes measuring mutual information: imagine your transformer model internally one-hot encodes your vocabulary. You want to probe whether the model knows if a word is a noun or not. With an MLP probe on the embeddings (with sufficient layers), you would think yes—because an NN with nonlinearities can approximate any function, and any function on your vocabulary can be encoded on one-hot vector inputs! Your probe would basically just have to learn a lookup table, never mind whether or how the model actually uses that information. This is formalised in the data processing inequality and was first pointed out in the NLP literature by Pimentel et al. (2020).
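To make the lookup-table point concrete, here is a toy version of that thought experiment (the vocabulary size, labels, and probe architecture are all made up): an MLP probe trained on one-hot word vectors reaches near-perfect accuracy on an arbitrary word-level property, because it only has to memorise one label per word.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
vocab_size = 500

# One-hot "embeddings": each word gets its own axis, so every word-level
# property (noun-ness included) is trivially recoverable from them.
embeddings = np.eye(vocab_size)
is_noun = rng.integers(0, 2, size=vocab_size)   # arbitrary per-word labels

probe = MLPClassifier(hidden_layer_sizes=(128,), solver="lbfgs",
                      max_iter=2000).fit(embeddings, is_noun)
print("probe accuracy:", probe.score(embeddings, is_noun))
# ~1.0: the probe just memorised a per-word lookup table, which tells us
# nothing about whether the model ever uses noun-ness downstream.
```

Note that the labels here are random, so there is no real "noun-ness" for the model to compute with at all; the high probing accuracy reflects the probe’s capacity, not the model’s computation.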
We don’t actually care about whether information could be extracted from a representation. We care about whether it is used by the model we’re studying. Causal abstraction methods, which feed counterfactual inputs into parts of the model so that we can modify specific pieces of information, can tell us this.
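As a cartoon of what such a counterfactual test looks like (a generic activation-patching sketch on a toy PyTorch model, not any specific paper’s setup): run the model on a counterfactual "source" input, cache the activation at some internal site, re-run on the "base" input with that activation swapped in, and check whether the output changes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a network; we intervene on the post-ReLU activation.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
site = model[1]

base = torch.randn(1, 8)     # the input whose output we care about
source = torch.randn(1, 8)   # counterfactual input carrying different information

# 1. Cache the activation at the site on the source run.
cache = {}
handle = site.register_forward_hook(lambda mod, inp, out: cache.update(act=out))
model(source)
handle.remove()

# 2. Re-run on the base input with the cached activation patched in
#    (returning a value from a forward hook replaces the module's output).
handle = site.register_forward_hook(lambda mod, inp, out: cache["act"])
patched_out = model(base)
handle.remove()

print("base output:   ", model(base))
print("patched output:", patched_out)
# If the output changes, whatever information this site carries is actually
# used downstream, which is the causal claim a probe alone cannot establish.
```

In real interpretability work the site would be a specific attention head, MLP layer, or subspace inside a transformer, and the base and source inputs would be constructed to differ only in the concept of interest, so that any change in the output isolates that concept’s causal role.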