Aryaman Arora » Recruiting
I'm Aryaman, a third-year Ph.D. student at Stanford University, and a part-time researcher at Transluce. I am looking for students who want to work on interpretability for language models! I'm most interested in coming up with new applications of interpretability to the entire LM training stack, towards the goal of improving models via better understanding.
You might be a good fit for this if you:
- are excited about interpretability (regardless of prior experience in the area, or in research in general!)
- have at least a little bit of background in:
  - PyTorch / other deep learning frameworks / software development
  - math for deep learning (linear algebra / probability / calculus)
  - linguistics
Logistics. We will meet at least once a week to discuss the project. I prefer meeting in person, but am willing to advise remote students as well. I will be responsible for getting you compute for the project. I won't be able to compensate you, but (for Stanford students) I will see whether an RAship or course credit is possible if you do well.
Expected outcome. A publication in a top-tier AI/NLP conference (e.g. NeurIPS / ACL / ICML / ICLR).
Working style. I currently prefer students with high agency who can work on low-level implementation independently.
In the past, I've done more collaborative work with students, but that usually requires far more time from me than I have this quarter; I'd instead like to take on many students!
How to apply
You have two tasks:
- Create a small Jupyter notebook (ideally fewer than 10 code cells) that reproduces a single experiment or result from an existing interpretability paper. For inspiration, check out my pyvene tutorial replicating causal tracing from ROME, Zhengxuan Wu's quiz on memorisation subspaces, or the nnsight mini-replications of papers. If you don't have GPUs, feel free to use a small language model like gpt2 for this. Feel free to use LM assistance to write the code; I'm more interested in what you picked to investigate, and negative results are okay. (A rough sketch of the kind of thing I mean is below, after this list.)
- Submit your notebook along with some info about yourself here.
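To make the expectation concrete, here is a minimal sketch of the flavour of experiment I mean: activation patching on gpt2 with plain PyTorch hooks. The prompts, the choice to patch the full residual stream at the last token only, and the P(" Paris") metric are my illustrative assumptions, not a required recipe; libraries like pyvene or nnsight wrap this kind of intervention much more cleanly.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Clean vs. corrupted prompt; we ask how much of the clean prediction each
# layer's residual stream carries. (Prompts are just an example.)
clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
paris = tok(" Paris").input_ids[0]

def block_output(output):
    # GPT-2 blocks return a tuple (hidden_states, ...) in most versions.
    return output[0] if isinstance(output, tuple) else output

# 1. Cache the clean run's residual stream at the output of every block.
clean_acts = {}
def make_save_hook(i):
    def hook(module, args, output):
        clean_acts[i] = block_output(output).detach().clone()
    return hook

handles = [model.transformer.h[i].register_forward_hook(make_save_hook(i))
           for i in range(model.config.n_layer)]
with torch.no_grad():
    model(**clean)
for h in handles:
    h.remove()

# 2. Re-run the corrupted prompt, splicing the clean activation back in at
#    one layer (last token position only), and read off P(" Paris").
def patched_prob(layer):
    def hook(module, args, output):
        hs = block_output(output).clone()
        hs[:, -1, :] = clean_acts[layer][:, -1, :]
        return (hs,) + output[1:] if isinstance(output, tuple) else hs
    handle = model.transformer.h[layer].register_forward_hook(hook)
    with torch.no_grad():
        logits = model(**corrupt).logits[0, -1]
    handle.remove()
    return torch.softmax(logits, dim=-1)[paris].item()

for layer in range(model.config.n_layer):
    print(f"layer {layer:2d}: P(' Paris') = {patched_prob(layer):.4f}")
```

A notebook at roughly this level of scope, with a sentence or two on what you expected and what you found, is exactly what I'm looking for.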
Project ideas
This is a sample of immediate ideas I have in mind, but I'm open to novel ideas or even just exploring a general direction for a bit to find interesting problems.
- I recently did some work on how different architectures perform associative recall (AR). We can (a) come up with more variants of AR to study, informed by formal language theory; or (b) analyse the feature geometry of AR across architectures, particularly to see whether non-linear representations are involved in some way. (For a concrete sense of the task, see the toy sketch at the end of this page.)
- Thinking Machines released a nice blogpost analysing LoRA. Let's repeat their experiments with ReFT, particularly on thinking tasks, and come up with ideas to improve long-form coherence when intervening on representations.
- Can we train interpretability agents with reinforcement learning? What primitives can we provide to them that will surface insights about models (SAE features seem to be the wrong choice, see Anthropic's automated auditing)? I have more ideas on this which I can share if you're interested.
- How can we improve distributed alignment search using the notions contributed by this paper?
- Anything around pretraining data filtering or data/model diffing with interpretability methods in the loop; I have some ideas on this which I can share if you're interested.
- Engineering: I maintain several interpretability libraries which are actively being improved and have ~hundreds of users. If you have a background in software development, we can work on improving these libraries! All of my research depends on this software; we will likely come up with research ideas along the way. I strongly agree with Omar Khattab's belief that projects, not papers, are the unit of impact in AI research. If we make our research software efficient and easy to use, and develop good abstractions for repeated tasks in our research, we will make it much easier to run experiments and free ourselves to focus on ideation rather than implementation. And other people will adopt our software too!
- pyvene: a general-purpose interpretability library for PyTorch models; I'm looking to add support for circuit analysis/interventions, as well as visualisation and easier-to-use interfaces. All the other libraries below depend on pyvene.
- tinylang: a research library for comparing architectures on toy tasks; let's come up with more tasks to study and add support for analysing representations/feature geometry.
- pyreft: parameter-efficient fine-tuning library; would like to push ReFT upstream to PEFT and generally clean up the codebase.
- axbench: benchmark for steering LLMs with concepts; make it easier to test novel methods.
- causalgym: benchmark for causal interpretability methods on linguistic tasks; needs cleanup, and potentially new linguistics experiments based on this; also a good testbed for DAS alternatives.
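For the associative-recall idea above, here is a hypothetical toy version of the task, just to make it concrete: the model sees key-value pairs followed by a query key and must output the matching value. The token format, vocabulary sizes, and the make_ar_example helper are all made up for illustration.

```python
import random

def make_ar_example(num_pairs=8, num_keys=32, num_values=32, seed=None):
    """Return (sequence, answer): sequence = k1 v1 k2 v2 ... kq, and the
    answer is the value that was paired with the queried key kq."""
    rng = random.Random(seed)
    keys = rng.sample(range(num_keys), num_pairs)        # distinct keys
    values = [rng.randrange(num_values) for _ in keys]    # arbitrary values
    query = rng.choice(keys)
    answer = values[keys.index(query)]
    sequence = [tok for k, v in zip(keys, values) for tok in (f"k{k}", f"v{v}")]
    sequence.append(f"k{query}")
    return sequence, f"v{answer}"

seq, ans = make_ar_example(seed=0)
print(" ".join(seq), "->", ans)
```

Variants of this setup (multiple queries, distractors, non-unique keys, formal-language-flavoured structure) are the kind of thing I'd like to explore, and then compare how different architectures represent and solve them.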