Some intuitions about transformers
Unless you have been living under a rock for the last five years, you have definitely (if possibly unknowingly) interacted with a machine learning model that uses the transformer architecture. I have spent a couple of months poking at small transformer models like GPT-2 and the 19-million-parameter version of Pythia, and yet after a week of working at an interpretability startup I realised that I don't actually have a great understanding of how a transformer works...